[SLAM] ORB-SLAM: a Versatile and Accurate Monocular SLAM System

(This post is still being updated.)

https://arxiv.org/pdf/1502.00956v2.pdf

IEEE Transactions on Robotics, 2015.

This post draws on the ORB-SLAM paper and on the following review video uploaded to YouTube:

https://www.youtube.com/watch?v=HvF_7T88CYo

ORB-SLAM Terminology

Keyframes : An image stored within the system that contains informational cues for localization and tracking

In other words, a keyframe is a frame that marks a distinctive location during map building and holds many meaningful features.

Map points : A point in 3D space that is associated with 1 or more keyframes

In other words, a map point is a feature found in the frames, mapped into 3D space.

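Representative ORB descriptor : Among the ORB descriptors associated with a map point in the keyframes that observe it, the one whose Hamming distance to all the other associated descriptors is minimal.

Viewing distance : The maximum and minimum distances at which each map point can be observed.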

Covisibility Graph : A graph with keyframes as nodes, in which an edge between two keyframes exists if they share at least 15 common map points.

It represents how map points are shared between keyframes.

Node : keyframe / Edge : shared map points

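Essential Graph : A subgraph of the covisibility graph that contains all the nodes but keeps only the covisibility edges between keyframes sharing at least 100 common map points, together with the spanning tree and loop closure edges (see Section 3-D).

It is a subgraph that stores only the core information of the covisibility graph above.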

Abstract

ORB-SLAM is a feature-based SLAM system that runs in real time with a monocular camera.

It works in small and large, indoor and outdoor environments, and the authors report that it outperforms other state-of-the-art monocular SLAM approaches.

1. Introduction

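Bundle Adjustment (BA) is known to provide accurate camera localization as well as a sparse geometrical reconstruction.

For a long time, however, this approach was considered unsuitable for real-time applications.

The goal of Visual SLAM is to estimate the camera trajectory while reconstructing the environment.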

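Today, a SLAM algorithm that produces accurate results at a reasonable computational cost needs the following:

1. Corresponding observations of scene features (map points) among the selected frames (keyframes).

2. Keyframe selection that avoids unnecessary redundancy, since complexity grows with the number of keyframes.

3. A strong network configuration of keyframes and points, to produce accurate results.

4. An initial estimate of the keyframe poses and point locations for the non-linear optimization.

5. A local map on which optimization is focused, to achieve scalability.

6. The ability to perform fast global optimization in order to close loops in real time.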

The main contributions of ORB-SLAM are as follows.

- The same features are used for all tasks: tracking, mapping, relocalization and loop closing.

This makes the system more efficient, simple and reliable.

The features used are ORB features, which allow real-time performance without a GPU and offer good invariance to changes in viewpoint and illumination.

- Real-time loop closing based on optimizing a pose graph called the Essential Graph.

And so on...

2. Related work

(To be updated later)

3. System Overview

A. Feature Choice

One of the main design ideas is that the same features used for mapping and tracking are also used for place recognition (frame-rate relocalization and loop detection).

This makes the system efficient and removes the need to interpolate depth from nearby SLAM features.

In addition, general place recognition capabilities require rotation invariance,

so this paper adopts ORB features.

ORB features are extremely fast to compute and match, and they also have good invariance to viewpoint.

The following paper has already shown that ORB performs well in place recognition.

Paper link : http://webdiis.unizar.es/~jdtardos/papers/2014_IEEE_ICRA_Mur_Tardos.pdf

B. Three Threads : Tracking, Local Mapping and Loop Closing

In the proposed system, three threads (Tracking, Local Mapping, Loop Closing) run in parallel, as shown in the system overview figure of the paper.

In short:

The Tracking thread localizes the camera within the map, using the existing map information and ORB features to correct the current position estimate.

The Local Mapping thread adds the information of each keyframe to the map, turning 2D feature points into 3D map points.

The Loop Closing thread handles the drift that accumulates as the camera keeps moving and collecting data: it uses the map built by SLAM to correct the current position.

-Tracking

In general, tracking means following the change in position of specific features across video frames.

In this paper, the tracking thread localizes the camera with every frame and decides when to insert a new keyframe.

First, an initial feature matching against the previous frame is performed, and the pose is optimized using motion-only BA.

If tracking is lost (e.g. due to occlusions or abrupt movements), the place recognition module performs a global relocalization.

Once there is an initial estimate of the camera pose and feature matches, a local visible map is retrieved using the covisibility graph of keyframes.

Then matches with the local map points are searched by reprojection, and the camera pose is optimized again with all the matches.

Finally, the tracking thread decides whether a new keyframe should be inserted.

-Local Mapping

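Local mapping processes new keyframes and performs local BA to obtain an optimal reconstruction in the surroundings of the camera pose.

New correspondences for the unmatched ORB features in the new keyframe are searched in the connected keyframes of the covisibility graph, in order to triangulate new points.

Some time after creation, based on information gathered during tracking, an exigent point culling policy is applied so that only high quality points are retained.

Local mapping is also in charge of culling redundant keyframes.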

-Loop closing

Loop closing searches for loops with every new keyframe.

If a loop is detected, a similarity transformation that captures the drift accumulated along the loop is computed.

C. Map Points, KeyFrames and their Selection

- Map Point pi

3D position Xw,i in the world coordinate system.

The viewing direction ni

A representative ORB descriptor Di

The maximum d_max and minimum d_min distances at which the point can be observed

-Each keyframe Ki

The camera pose T_iw, which is a rigid body transformation that transforms points from the world to the camera coordinate system.

The camera intrinsics, including focal length and principal point

All the ORB features extracted in the frame
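As a rough illustration of the representative descriptor Di, the sketch below picks the descriptor with the smallest total Hamming distance to the others. Using the sum of distances as the criterion is an assumption made here for simplicity, and the function names are hypothetical; descriptors are assumed to be stored as uint8 arrays, as OpenCV does for ORB.

```python
import numpy as np


def hamming(a, b):
    """Hamming distance between two binary descriptors stored as uint8 arrays."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


def representative_descriptor(descriptors):
    """Return the descriptor whose summed distance to all the others is smallest.

    The sum is an assumed criterion for this sketch.
    """
    n = len(descriptors)
    dist = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = hamming(descriptors[i], descriptors[j])
    return descriptors[int(dist.sum(axis=1).argmin())]
```

A map point would recompute this descriptor whenever a new keyframe observation of the point is added or removed.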

Map points and keyframes are created with a generous policy, while an exigent culling mechanism later detects redundant keyframes and wrongly matched or untrackable map points.

This allows a flexible map expansion while the robot explores.

-> It keeps tracking robust even under hard conditions (rotations, fast movements).

In addition, maps built this way were found to contain very few outliers compared to PTAM.

D. Covisibility Graph and Essential Graph

Covisibility information between keyframes is very useful in several tasks, and it is represented as an undirected weighted graph.

NODE : keyframe

EDGE : created when two keyframes observe the same map points (at least 15); the weight θ of the edge is the number of common map points.

To correct a loop, a pose graph optimization is performed that distributes the loop closing error along the graph.

To avoid including all the edges of the covisibility graph, the paper proposes the Essential Graph.

The Essential Graph retains all the nodes (keyframes) but fewer edges, still preserving a strong network that yields accurate results.

The system incrementally builds a spanning tree from the initial keyframe, which provides a connected subgraph of the covisibility graph with the minimal number of edges.

When a new keyframe is inserted, it is added to the tree linked to the keyframe that shares the most point observations with it.

The Essential Graph contains the spanning tree, the subset of covisibility graph edges with high covisibility, and the loop closure edges, resulting in a strong network of cameras.
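The graph definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `observations` layout (keyframe id mapped to the set of map point ids it sees) are assumptions, while the thresholds 15 and 100 are the ones quoted in the terminology section.

```python
from itertools import combinations


def edge(a, b):
    """Undirected edge stored with the smaller keyframe id first."""
    return (a, b) if a <= b else (b, a)


def covisibility_edges(observations, min_shared=15):
    """Map each keyframe pair sharing >= min_shared points to its weight."""
    edges = {}
    for a, b in combinations(sorted(observations), 2):
        weight = len(observations[a] & observations[b])
        if weight >= min_shared:
            edges[edge(a, b)] = weight
    return edges


def spanning_tree_parent(new_kf, observations):
    """Existing keyframe that shares the most map points with new_kf.

    Assumes at least one other keyframe already exists.
    """
    others = [k for k in observations if k != new_kf]
    return max(others, key=lambda k: len(observations[new_kf] & observations[k]))


def essential_edges(covis, tree_edges, loop_edges, min_shared=100):
    """Spanning tree + high-covisibility edges + loop closure edges."""
    strong = {e for e, w in covis.items() if w >= min_shared}
    tree = {edge(a, b) for a, b in tree_edges}
    loops = {edge(a, b) for a, b in loop_edges}
    return strong | tree | loops
```

Weak covisibility edges that are not part of the spanning tree are the ones dropped, which is what keeps the pose graph optimization cheap.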

E. Bags of Words Place Recognition

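To perform loop detection and relocalization, the system embeds a bags of words place recognition module based on DBoW2.

The vocabulary is created offline from ORB descriptors extracted from a large set of images.

If the images are general enough, the same vocabulary can be used in different environments with good performance.

The system incrementally builds a database containing an inverted index, which stores, for each visual word in the vocabulary, the keyframes in which it has been seen; the database grows as keyframes are added.

The database is also updated when a keyframe is deleted by the culling procedure.

Because keyframes overlap visually, a query does not return a single keyframe with a high score. The original DBoW2 handles this by adding up the scores of images that are close in time, which has the limitation of not including keyframes that view the same place but were inserted at a different time.

Instead, ORB-SLAM groups the keyframes that are connected in the covisibility graph.

The database then returns all keyframe matches whose score is at least 75% of the best score.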
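The 75% rule can be written as a small filter. This is only a sketch: `filter_candidates` and the `scores` dictionary (keyframe id mapped to its similarity score against the query) are hypothetical names, and the covisibility grouping step is left out.

```python
def filter_candidates(scores, ratio=0.75):
    """Keep keyframes scoring at least `ratio` times the best score.

    Returns (keyframe_id, score) pairs sorted from best to worst.
    """
    if not scores:
        return []
    best = max(scores.values())
    keep = [(kf, s) for kf, s in scores.items() if s >= ratio * best]
    return sorted(keep, key=lambda item: item[1], reverse=True)
```

Returning several candidates instead of only the best one lets the later geometric check decide which match is actually consistent.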

4. Automatic Map Initialization

The goal of map initialization is to compute the relative pose between two frames in order to triangulate an initial set of map points.

The paper proposes computing two geometrical models in parallel:

- homography assuming a planar scene

- fundamental matrix assuming a non-planar scene

Algorithm overview

1) Find initial correspondences

Extract ORB features in the current frame Fc.

Then search for matches xc ↔ xr in the reference frame Fr.

If not enough matches are found, reset the reference frame.

2) Parallel computation of the two models

Compute a homography Hcr and a fundamental matrix Fcr.

At each iteration, compute a score SM for each model M (H for the homography, F for the fundamental matrix).

3) Model selection

If the scene is planar, nearly planar, or there is low parallax, it can be explained by a homography.

A fundamental matrix can also be found in that case, but the problem is not well constrained, and any attempt to recover the motion from the fundamental matrix would yield wrong results.

4) Motion and Structure from Motion recovery

5) Bundle adjustment

Finally, a full BA is performed to refine the initial reconstruction.
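For the model selection in step 3, the notes above do not give the actual rule. The paper uses a heuristic on the ratio of the two scores, R_H = S_H / (S_H + S_F), and selects the homography when R_H > 0.45. A minimal sketch follows; the function name and the returned strings are hypothetical.

```python
def select_model(score_h, score_f, threshold=0.45):
    """Choose between homography and fundamental matrix from their scores.

    score_h, score_f: scores S_H and S_F of the two models.
    """
    total = score_h + score_f
    if total <= 0:
        raise ValueError("both model scores are zero; initialization failed")
    ratio_h = score_h / total
    return "homography" if ratio_h > threshold else "fundamental"
```

The threshold below 0.5 biases the choice towards the homography, which is the safer model in planar and low parallax cases.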

5. Tracking

The tracking thread runs on every frame coming from the camera.

Camera pose optimization consists of motion-only BA.

A. ORB Extraction

In this paper, keypoints are extracted with the FAST corner detector.

The extracted ORB descriptors are used in all feature matching, whereas PTAM searches by patch correlation.

B. Initial Pose Estimation from Previous Frame

If tracking was successful for the last frame, a constant velocity motion model (which assumes the camera moves at constant velocity) is used to predict the camera pose, and a guided search is performed for the map points observed in the last frame.

If not enough matches are found, a wider search is performed for the map points around their positions in the last frame.
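The constant velocity prediction can be sketched with homogeneous transforms. This is an illustration under the assumption that poses are stored as 4x4 world-to-camera matrices; `predict_pose` is a hypothetical name.

```python
import numpy as np


def predict_pose(T_prev, T_last):
    """Predict the next world-to-camera pose from the last two poses.

    The motion between the two most recent frames is assumed to repeat.
    """
    velocity = T_last @ np.linalg.inv(T_prev)  # relative motion prev -> last
    return velocity @ T_last                   # apply the same motion again
```

The predicted pose only has to be good enough to narrow the search windows for map point matches; it is refined afterwards by the pose optimization.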

C. Initial Pose Estimation via Global Relocalization

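If tracking is lost, the frame is converted into a bag of words and the recognition database is queried for candidate keyframes.

For each candidate keyframe, the correspondences between ORB features and map points are computed.

RANSAC iterations are then performed for each keyframe, trying to find a camera pose with the PnP algorithm.

If a camera pose with enough inliers is found, the pose is optimized and a further search is performed for matches with the map points of the candidate keyframe.

Finally, the camera pose is optimized once more, and tracking continues if the pose is supported by enough inliers.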

D. Track Local Map

To find more map point correspondences, the local map is projected into the frame and the pose is optimized.

Once there is an estimate of the camera pose and an initial set of feature matches, the map can be projected into the frame and more map point correspondences can be searched.

The local map contains the set of keyframes K1 that share map points with the current frame, and the set K2 of neighbors of K1 in the covisibility graph.

The local map also has a reference keyframe Kref, which belongs to K1 and shares the most map points with the current frame.

Each map point seen in K1 and K2 is searched in the current frame as follows.

(To be updated later)

E. New Keyframe Decision

The last step is to decide whether the current frame becomes a new keyframe.

All four of the following conditions must be met to insert a new keyframe:

1. More than 20 frames must have passed since the last global relocalization (the paper states that this ensures a good relocalization).

2. Local mapping is idle, or more than 20 frames have passed since the last keyframe insertion.

3. The current frame tracks at least 50 points.

4. The current frame tracks less than 90% of the points of the reference keyframe (presumably because a frame too similar to the reference keyframe adds little new information).
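The four conditions translate directly into a boolean check. A minimal sketch, with hypothetical argument names:

```python
def need_new_keyframe(frames_since_reloc, frames_since_keyframe,
                      local_mapping_idle, tracked_points, ref_tracked_points):
    """Return True when all four keyframe insertion conditions hold."""
    c1 = frames_since_reloc > 20                            # condition 1
    c2 = local_mapping_idle or frames_since_keyframe > 20   # condition 2
    c3 = tracked_points >= 50                               # condition 3
    c4 = tracked_points < 0.9 * ref_tracked_points          # condition 4
    return c1 and c2 and c3 and c4
```

For example, a frame tracking 120 points against a reference keyframe with 200 points passes condition 4 because 120 is below 0.9 * 200 = 180.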

6. Local Mapping

Local mapping performs the following steps for every new keyframe Ki.

A. Keyframe Insertion

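The bags of words representation of the new keyframe created by the tracking thread is computed, which helps with the data association for triangulating new points.

At this point, the map points of the new keyframe are compared with those of the original video frame.

First, the covisibility graph is updated: a new node is added for Ki, and the edges resulting from the map points shared with other keyframes are updated.

Then the spanning tree is updated by linking Ki to the keyframe with the most points in common.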

B. Recent Map Points Culling

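Map points created after tracking are compared against the newly inserted keyframes so that bad points can be removed.

That is, map points must pass a restrictive test to ensure that they are trackable and not wrongly triangulated.

There are two conditions.

(The conditions will be covered later.)

Once a map point has passed this test, it can only be removed if at some point it is observed from fewer than three keyframes. This can happen when keyframes are culled or when local BA discards outlier observations.

In short, map points created after tracking are checked against the map points of the newly inserted keyframe, and bad points are removed.

As a result, the map contains very few outliers.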

C. New Map Point Creation

New map points are created by triangulating ORB features from the connected keyframes Kc in the covisibility graph.

That is, the covisibility graph updated above is used to find the keyframes connected to the current keyframe.

The matched point pairs between each of those keyframes and the current keyframe are then triangulated.

The triangulated points are added as new map points, and each of them is projected only into the keyframes connected to the current keyframe in the covisibility graph.

Finally, the covisibility graph is updated again.

For each unmatched ORB feature in Ki, a match is searched among the unmatched points of the other keyframes.
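The triangulation step in this section can be illustrated with the standard linear (DLT) method. This is a generic textbook sketch rather than the authors' code: `triangulate` is a hypothetical name, P1 and P2 are assumed to be 3x4 projection matrices, and the acceptance checks the paper applies to new points (such as positive depth, parallax and reprojection error) are omitted.

```python
import numpy as np


def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one point from two views.

    P1, P2: 3x4 projection matrices. x1, x2: (u, v) image coordinates.
    Returns the 3D point in world coordinates.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With noisy matches the result is only a least-squares estimate, which is why the subsequent local BA is needed.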

D. Local Bundle Adjustment

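This step refines the keyframe poses and map point positions produced by the steps above.

Local BA optimizes the currently processed keyframe Ki, all the keyframes Kc connected to it in the covisibility graph, and all the map points seen by those keyframes.

The keyframes covisible with the current keyframe are found, and map points marking the same place across those frames are identified and linked.

Map points that cannot be linked are treated as outliers, and observations marked as outliers are discarded in the middle and at the end of the optimization.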

E. Local Keyframe Culling

To maintain a compact reconstruction, local mapping detects and deletes redundant keyframes.

A keyframe is considered redundant, and is discarded, if 90% of its map points have been seen in at least three other keyframes.
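The redundancy test can be sketched as follows. The names are hypothetical, `point_observers` is assumed to map each map point id to the set of keyframe ids observing it, and any additional conditions in the paper (for example on the scale at which a point is observed) are left out.

```python
def is_redundant(kf_id, kf_points, point_observers, min_observers=3, ratio=0.9):
    """Check whether a keyframe is redundant.

    kf_points: set of map point ids seen by keyframe kf_id.
    point_observers: dict map point id -> set of observing keyframe ids.
    """
    if not kf_points:
        return False
    covered = sum(
        1 for p in kf_points
        if len(point_observers[p] - {kf_id}) >= min_observers
    )
    return covered >= ratio * len(kf_points)
```

Culling such keyframes bounds the cost of BA when the camera stays in the same area for a long time.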

(To be updated later)

7. Loop Closing

The loop closing thread takes Ki, the last keyframe processed by local mapping, and tries to detect and close loops.

Place recognition is used to check whether the current place has been visited before.

Place recognition uses the bag of words approach.

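1. Visual vocabulary

Features are extracted from many images and their descriptors are collected. The descriptors are then clustered, and the cluster centers form the vocabulary.

2. Recognition database

When a keyframe is created, its image features are represented using the visual vocabulary and the resulting data is stored.

If the place is judged to have been visited before, the current map is merged with the previously built map.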

A, B : loop detection

C, D : loop correction

A. Loop Candidates Detection

First, the similarity between the bag of words vector of Ki and all its neighbors in the covisibility graph is computed, and the lowest score Smin is retained.

Then the recognition database is queried, and all keyframes whose score is lower than Smin are discarded.
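The two steps above can be sketched as follows. The score is an L1 similarity between normalized vectors in the style of DBoW2, which is an assumption about the scoring used; the function names and the dense-vector representation of bag of words vectors are also hypothetical simplifications.

```python
import numpy as np


def bow_score(v1, v2):
    """L1 similarity between two bag of words vectors, in [0, 1]."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    v1 = v1 / np.abs(v1).sum()
    v2 = v2 / np.abs(v2).sum()
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()


def loop_candidates(query, neighbors, database):
    """Keep database keyframes scoring at least as high as the worst neighbor.

    query: BoW vector of Ki. neighbors: BoW vectors of its covisible keyframes.
    database: dict keyframe id -> BoW vector.
    """
    s_min = min(bow_score(query, n) for n in neighbors)
    return [kf for kf, vec in database.items() if bow_score(query, vec) >= s_min]
```

Using the neighbors to set Smin makes the threshold adapt to the scene instead of relying on a fixed constant.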

After these steps, the places are merged.

(To be added later)

B. Compute the Similarity Transformation

To close a loop, the similarity transformation between the current keyframe and the loop keyframe must be computed; it tells us the error accumulated along the loop.

First, correspondences are computed between the ORB features associated with map points in the current keyframe and in the loop candidate keyframes.

This gives 3D-to-3D correspondences for each loop candidate, so RANSAC iterations are performed for each candidate.
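To make the similarity transformation concrete, the sketch below estimates scale, rotation and translation from 3D-to-3D correspondences in closed form. The paper uses the method of Horn inside RANSAC; this example swaps in the SVD-based Umeyama solution and skips RANSAC, so it assumes outlier-free correspondences. `estimate_sim3` is a hypothetical name.

```python
import numpy as np


def estimate_sim3(src, dst):
    """Estimate s, R, t such that dst_i ~ s * R @ src_i + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    # Guard against a reflection so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```

The scale factor is what distinguishes this from a rigid alignment: in monocular SLAM the scale can drift along the loop, so it has to be estimated together with rotation and translation.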