## 1. Abstract

### 1.1. Virtual reality, augmented reality, robotics, and autonomous driving have recently attracted much attention from both the academic and industrial communities, and image-based camera localization is a key task in all of them. However, there has not yet been a complete review of image-based camera localization.

### 1.2. In this paper, a new and complete classification of image-based camera localization approaches is provided and the related techniques are introduced. Trends for future development are also discussed.

## 2. Background

### 2.1. Virtual reality, augmented reality, robotics, autonomous driving, and related fields, in which image-based camera localization is a key task, have recently attracted much attention from both the academic and industrial communities.

### 2.2. This study considers two-dimensional (2D) cameras. The tool typically used for outdoor localization is GPS, which cannot be used indoors. Many indoor localization tools exist, including Lidar, Ultra Wide Band (UWB), and Wireless Fidelity (WiFi); among these, using cameras for localization is the most flexible and low-cost approach.

### 2.3. The image features of points, lines, conics, spheres, and angles are used in image-based camera localization; of these, points are most widely used. This study focuses on points.

### 2.4. Image-based camera localization is a broad topic. We attempt to cover related works and give a complete classification of image-based camera localization approaches. However, it is not possible to cover all related works in this paper because of length constraints. Moreover, for such an extensive topic, space limits prevent an in-depth critique of each cited paper.

### 2.5. This study is unique in that it is the first to map the whole field of image-based camera localization and to provide a complete classification of the topic.

## 3. Overview

### 3.1. Image-based camera localization computes camera poses in a world coordinate system from images or videos captured by the cameras. Based on whether the environment is known beforehand, image-based camera localization can be classified into two categories: localization in a known environment and localization in an unknown environment.
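
As a brief illustration of what "camera pose" means here, the standard pinhole projection model can be written as follows. The notation is an assumption introduced for this sketch and is not taken from the text: $\mathbf{X}$ is a homogeneous 3D point in the world coordinate system, $\mathbf{x}$ its homogeneous image point, $\mathbf{K}$ the intrinsic matrix, $s$ an unknown projective scale, and $(\mathbf{R}, \mathbf{t})$ the rotation and translation that make up the pose.

$$
s\,\mathbf{x} = \mathbf{K}\,[\,\mathbf{R} \mid \mathbf{t}\,]\,\mathbf{X},
\qquad
\mathbf{C} = -\mathbf{R}^{\top}\mathbf{t}
$$

Under this convention, localization amounts to recovering $\mathbf{R}$ and $\mathbf{t}$ (equivalently, the camera center $\mathbf{C}$ and orientation) from image measurements $\mathbf{x}$.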

### 3.2. The approach with unknown environments can be divided into methods with online and real-time environment mapping and those without online and real-time environment mapping. The former is the commonly known Simultaneous Localization and Mapping (SLAM) and the latter is an intermediate procedure of the commonly known structure from motion (SFM).

### 3.3. According to how the map is generated, SLAM is divided into four categories: geometric metric SLAM, learning SLAM, topological SLAM, and marker SLAM. Learning SLAM is a recent research direction.

## 4. Reviews on image-based camera localization

### 4.1. Camera pose determination from known 3D space points is called the perspective-n-point problem, or PnP problem. When n = 1, 2, the PnP problem has no unique solution because it is under-constrained. When n ≥ 6, PnP problems are linear. When n = 3, 4, 5, the original equations of PnP problems are usually nonlinear.
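
To make the linear case concrete, the following is a minimal sketch of a direct linear transform (DLT) pose estimate for n ≥ 6 points. It is one common way to set up the linear system, not necessarily the formulation used in the works reviewed here. The function name `dlt_pose`, the use of NumPy, and the assumptions of a known intrinsic matrix `K`, noise-free correspondences, and non-coplanar points are all choices made for this illustration.

```python
import numpy as np


def dlt_pose(points_3d, points_2d, K):
    """Estimate rotation R and translation t from n >= 6 3D-2D matches.

    Illustrative DLT sketch: assumes a known intrinsic matrix K and
    non-coplanar 3D points. Not robust to outliers.
    """
    X = np.asarray(points_3d, dtype=float)
    x = np.asarray(points_2d, dtype=float)
    n = X.shape[0]
    if n < 6:
        raise ValueError("DLT needs at least 6 correspondences")

    # Convert pixels to normalized image coordinates (removes K).
    x_h = np.hstack([x, np.ones((n, 1))])
    x_n = (np.linalg.inv(K) @ x_h.T).T
    x_n = x_n / x_n[:, 2:3]
    X_h = np.hstack([X, np.ones((n, 1))])

    # Each match gives two linear equations in the 12 entries of P = [R | t].
    A = np.zeros((2 * n, 12))
    for i in range(n):
        u, v = x_n[i, 0], x_n[i, 1]
        A[2 * i, 0:4] = X_h[i]
        A[2 * i, 8:12] = -u * X_h[i]
        A[2 * i + 1, 4:8] = X_h[i]
        A[2 * i + 1, 8:12] = -v * X_h[i]

    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # P is only known up to scale and sign; pick the sign with det > 0.
    if np.linalg.det(P[:, :3]) < 0:
        P = -P

    # Project the 3x3 block onto the nearest rotation and recover the scale.
    U, S, Vt_m = np.linalg.svd(P[:, :3])
    R = U @ Vt_m
    t = P[:, 3] / S.mean()
    return R, t
```

With noisy data, a linear estimate of this kind is usually treated as an initial value and refined by nonlinear minimization of the reprojection error.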

### 4.2. The methods to solve PnP problems with n = 3, 4, 5 focus on two aspects. One aspect studies the solution numbers or multisolution geometric configuration of the nonlinear problems. The other aspect studies eliminations or other solving methods for camera poses.

4.2.1. The methods that focus on the first aspect are as follows. Grunert [5] and Finsterwalder and Scheufele [6] pointed out that P3P has up to four solutions and P4P has a unique solution. Fischler and Bolles [7] studied P3P in the context of RANSAC for PnP and found that four solutions of P3P are attainable. Wolfe et al. [8] showed that P3P mostly has two solutions; they determined the two solutions and provided geometric explanations for how P3P can have two, three, or four solutions.
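
The bound of four solutions can be seen from the classical distance formulation of P3P, sketched here with notation that is assumed for illustration only: $d_i$ is the unknown distance from the camera center to space point $P_i$, $d_{ij}$ is the known distance between $P_i$ and $P_j$, and $\theta_{ij}$ is the known angle between the viewing rays of $P_i$ and $P_j$. The law of cosines gives

$$
d_i^2 + d_j^2 - 2\,d_i\,d_j\cos\theta_{ij} = d_{ij}^2,
\qquad (i,j) \in \{(1,2),\,(1,3),\,(2,3)\}.
$$

These are three quadratic equations in three unknowns, so there are at most $2^3 = 8$ isolated solutions. Because the system is unchanged under $(d_1, d_2, d_3) \mapsto (-d_1, -d_2, -d_3)$, the solutions come in sign-flipped pairs, which leaves at most four with positive distances.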

4.2.2. The methods that focus on the second aspect of PnP problems with n = 3, 4, 5 are as follows. Horaud et al. [13] described an elimination method for the P4P problem that yields a quartic equation in a single unknown. Haralick et al. [14] reviewed six methods for the P3P problem, namely [5, 6, 7, 15, 16, 17]. Dementhon and Davis [18] presented a solution of the P3P problem using a lookup table based on quasi-perspective imaging. Quan and Lan [19] linearly solved the P4P and P5P problems.

### 4.3. When n ≥ 6, PnP problems are linear, and studies on them focus on two aspects. One aspect studies efficient optimization of camera poses from a small number of points. The other aspect studies fast camera localization from large data.

### 4.4. Learning SLAM is a new topic gaining attention due to the development of deep learning. Studies on pure topological SLAM are decreasing. Marker SLAM is more accurate and stable. In the following, we introduce geometric metric SLAM, learning SLAM, topological SLAM, and marker SLAM.

4.4.1. Geometric metric SLAM

4.4.1.1. Geometric metric SLAM computes 3D maps with accurate mathematical equations. Based on the different sensors used, geometric metric SLAM is divided into monocular SLAM, multiocular SLAM, and multi-kind sensors SLAM. Based on the different techniques used, geometric metric SLAM is divided into filter-based SLAM and keyframe-based SLAM.

4.4.2. Learning SLAM

4.4.2.1. Learning SLAM is a new topic that has recently gained attention due to the development of deep learning. We treat it as a separate category, distinct from geometric metric SLAM and topological SLAM. Learning SLAM can obtain the camera pose and a 3D map but needs a prior dataset to train the network. Its performance depends heavily on the dataset used, and it has low generalization ability.

4.4.3. Topological SLAM

4.4.3.1. Topological SLAM does not need accurate computation of 3D maps and represents the environment by connectivity or topology. Kuipers and Byun [130] used a hierarchical description of the spatial environment, in which a topological network description mediates between a control level and a metrical level; distinctive places and paths are defined by their properties at the control level and serve as the nodes and arcs of the topological model.
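
As a rough illustration of representing an environment by connectivity rather than metric geometry, the sketch below stores places as nodes and paths as arcs, and plans a route with breadth-first search. The class `TopologicalMap`, its method names, and the place names in the usage note are hypothetical and are not taken from [130] or any other cited system.

```python
from collections import deque


class TopologicalMap:
    """Environment stored as places (nodes) joined by traversable paths (arcs)."""

    def __init__(self):
        # Place name -> list of directly connected places.
        self.neighbors = {}

    def add_place(self, place):
        self.neighbors.setdefault(place, [])

    def add_path(self, a, b):
        """Record an undirected path between two places."""
        self.add_place(a)
        self.add_place(b)
        if b not in self.neighbors[a]:
            self.neighbors[a].append(b)
            self.neighbors[b].append(a)

    def route(self, start, goal):
        """Breadth-first search: fewest-hop sequence of places, or None."""
        if start not in self.neighbors or goal not in self.neighbors:
            return None
        parent = {start: None}
        queue = deque([start])
        while queue:
            place = queue.popleft()
            if place == goal:
                # Walk back through the parents to rebuild the route.
                path = []
                while place is not None:
                    path.append(place)
                    place = parent[place]
                return path[::-1]
            for nxt in self.neighbors[place]:
                if nxt not in parent:
                    parent[nxt] = place
                    queue.append(nxt)
        return None
```

For example, after `add_path("hall", "kitchen")`, `add_path("hall", "corridor")`, and `add_path("corridor", "office")`, calling `route("kitchen", "office")` returns the sequence of places to traverse without using any metric coordinates.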

4.4.4. Marker SLAM

4.4.4.1. In 1991, Gatrell et al. designed a concentric circular marker, which was modified with additional color and scale information in [144]. Kato and Billinghurst presented the first augmented reality system based on fiducial markers known as the ARToolkit, where the marker used is a black enclosed rectangle with simple graphics or text. Naimark and Foxlin [146] developed a more general marker generation method, which encodes a bar code into a black circular region to produce more markers.