Making Structure-from-Motion (COLMAP) more robust to symmetries and duplicated structures

Overview

SfM disambiguation with COLMAP

About

Structure-from-Motion generally fails when the scene exhibits symmetries and duplicated structures. In this repository, we implement several state-of-the-art algorithms that aim at addressing this problem, we integrate them into COLMAP, and we extensively analyze their performance. We hope that this effort can ease further research on this problem.

We focus on filtering out incorrect matches between images prior to SfM. The filtering process is done by reimplementing the ideas from Distinguishing the indistinguishable: Exploring structural ambiguities via geodesic context by Yan et al. (CVPR 2017) and Global Structure-from-Motion by Similarity Averaging by Cui et al. (ICCV 2015). We also include the experiment results from Improving Structure from Motion with Reliable Resectioning by Kataria et al. (3DV 2020) based on their implementation. We refer to these three papers as Yan's method, Cui's method, and Kataria's method respectively. This repository uses COLMAP and hloc for feature extraction and matching, and COLMAP for geometric verification and sparse reconstruction.

TLDR: No method consistently works well over all datasets with a single set of hyperparameters. Tuning the parameters for large scenes is difficult and time-consuming for all three methods.

Drop an email to Lixin Xue if you are interested in this problem and want to chat!

teaser
Results on the Alexander Nevsky Cathedral Dataset

Conclusions

Based on our experiments, we have the following observations:

  • Duplicate structures in images often lead to an excessive number of image matches and non-converging bundle adjustment. These will lengthen the reconstruction time significantly. Removing the wrong matches or initializing the poses correctly can significantly speed up the reconstruction process.
  • With a correct initial pair of images and a perfect next view selection, colmap might still output a reconstruction with many wrongly registered images. Even Kataria's method to initialize poses based on reliable matches was still insufficient to disambiguate some datasets. Therefore, a less noisy pose graph is necessary for a correct reconstruction.
  • Both Yan's method and Cui's method need some scene-specific parameter tuning. Kataria's method has better generalization ability, though still fails on several datasets even though we tune the parameters for it. No method consistently works well over all datasets with a single set of parameters. Tuning the parameters for certain scenes (especially large scale ones) is difficult and time-consuming for all three methods.

Installation

# python 3.7 is required for the 'capture_output' keyword in `subprocess.run`
# which is only used in the notebooks
conda create -n sfm python=3.7 -y

# install [colmap](https://colmap.github.io/install.html).
# install [hloc](https://github.com/cvg/Hierarchical-Localization#installation)

# library for the plot of match graphs
sudo apt-get install graphviz graphviz-dev
pip install pygraphviz
conda install -c anaconda networkx -y
# for interactive control of parameters in the notebooks
conda install -c conda-forge jupyterlab ipywidgets -y
# can skip this if using colmap for visualization of 3D models
conda install -c open3d-admin -c conda-forge open3d -y

# install this library in development mode for further modifications
python install -e .

We also provide the datasets we use via google drive. Please download it and then unzip it under the folder datasets to have the following layout:

|---datasets
    |---heinly2014
        |---...
    |---yan2017
        |---...

Pipelines

We provide two jupyter notebooks as examples for the complete pipelines of Yan's method and Cui's method. These two methods share a similar pipeline by first computing a score for each image pair and then removing wrong matches based on the score. After that, the filtered matches are passed to the incremental reconstruction stage. In Yan's method, they use raw matches to compute tracks and use the percentage of shared unique tracks between two images as the score for an image pair. While in Cui's method, they do a local reconstruction for every image and use the missing correspondence idea to create a score for every image pair.

pipeline yan pipeline cui
Similar Pipelines for Yan's method and Cui's method

1. Correspondence

First, we can extract features from images using colmap or hloc. We provide the following features with their corresponding keywords in the parentheses:

  1. colmap SIFT with default parameters (sift_default)
  2. colmap sparser SIFT features with the first octave set to be 0 (sift_sparse)
  3. SuperPoint (superpoint)
  4. D2-Net (d2net)
  5. R2D2 (r2d2)
  6. DISK (disk, still a pull request in hloc)

For SIFT features extracted by colmap, we use the exhaustive nearest neighbor matching provided by colmap.

For learned features extracted by hloc, we use the exhaustive nearest neighbor matching provided by hloc. Specifically for the SuperPoint feature, we can also use the trained SuperGlue model for matching.

Then, we use colmap matches_importer to perform geometric verification (compute two-view geometries from the matches) with different RANSAC parameters (check colmap_matching_options in options/matching_options.py).

2. Disambiguation

Next, we can use Yan's method or Cui's method to compute the scores for all the matches. After that, we can choose to use a threshold filter, a top k filter, or a percentile filter to remove suspicious matches. We create a new database with the filtered matches and recompute two-view geometries.

We choose to pre-filter matches rather than post-process the reconstructed model as done in the paper Correcting for Duplicate Scene Structure in Sparse 3D Reconstruction by Heinly et al based on the observations stated in the section Conclusions.

3. Reconstruction

Lastly, we use colmap mapper to reconstruct the whole scene incrementally. Depending on the dataset, you can choose to fix the intrinsics from EXIF or not (check options/mapper_options.py).

Yan's Method

Distinguishing the Indistinguishable: Exploring Structural Ambiguities via Geodesic Context. CVPR 2017.

By Qingan Yan, Long Yang, Ling Zhang, Chunxia Xiao.

Basic Steps

The key idea is that geodesic neighbors capturing the same instance of duplicate structures usually share more matches than images of different instances of duplicate structures. With this in mind, they:

  1. generate tracks from raw matches;
  2. select the iconic set of images to summarize the whole scene by maximizing an objective function that favors completeness (#observed tracks) and penalizes repetitiveness (#tracks occurring in more than one image);
  3. split the tracks covered by the images in the iconic set into two parts: those appearing only in one image of the iconic set are defined as unique tracks, while the other tracks appearing more than once in the images of the iconic set are defined as confusing tracks.
  4. define a score for match_ij: len(unique_ij) / max(len(unique_i), len(unique_j)), i.e. the percentage of common unique tracks shared by the two images

Differences between the Original Implementation and the Paper

Our implementation follows the original implementation shared by the author. However, there are some differences between the author's code and the paper:

  1. In the actual implementation, the non-iconic images are also used as the backbone of the path network. Therefore, the match graph is not "a bipartite graph with nodes respectively are iconic images and non-iconic images". The construction of the iconic set is only for the construction of the unique tracks and confusing tracks.
  2. In the original paper, the match will be kept as long as two images share enough unique tracks. However, the percentage of unique tracks is also used to filter out matches in the implementation.
  3. The $\alpha$ in formula (3) is set to 0 in the code while it is required to be larger than 0 in the paper. As a consequence, the stopping criterion for the construction of the iconic set is different: in the paper the iconic set will be fixed once adding any one of the images will not increase the objective function (3). Instead, the author provides a coverage threshold to stop the expansion of the iconic set once the objective is larger than this threshold. This change is necessary since the objective function will be monotonically increasing with $\alpha = 0$.

Parameters Tuning

The original implementation has two parameters (coverage_thres and score_thres) to tune. Here is the comment from the author:

The coverage controls how many iconic images will be selected. As for small-scale indoor scenes, a large value between 0.7 and 0.9 is recommended; otherwise, for large-scale unstructured datasets, the value around 0.6 would be enough.

Parameter score_thres defines whether an image pair is acceptable. As for small-scale scenes, similarly, a large threshold (around 0.3) is recommended; otherwise, for large-scale outdoor scenes, score_thres between 0.04 and 0.1 would be a good choice.

For instance, for the Alexander Nevsky Cathedral dataset, we use coverage = 0.6 and score_thres = 0.1 to achieve well-registered 3D point clouds.

In our implementation, we expose another 4 parameters to tune:

  • track_degree: the minimal length of a track to be taken into account. Increasing it will discard more short tracks.
  • alpha: the $\alpha$ in formula (3) of the paper. Increasing it will require the images in the iconic set to be more distinctive.
  • minimal_views: the minimal number of shared tracks for a match to be valid. Increasing it means fewer matches will be valid.
  • ds: the data structure used to store the list of matches. You can leave it unchanged (default largearray) as the default value is a good tradeoff between speed and memory. For large datasets with thousands of images like berliner_dom (1618 images), it is necessary to use the smallarray data structure or limit the maximum number of keypoints in an image. In this case, it would be extremely slow (more than 8 hours for berliner_dom) due to a large number of images and the inefficient data structure.

Cui's Method

Global Structure-from-Motion by Similarity Averaging. ICCV 2015.

By Zhaopeng Cui and Ping Tan.

Basic Steps

In this paper, the authors utilize the missing correspondences cue to remove wrong matches.

  1. For each image and its direct neighbors, compute a two-view reconstruction by triangulation with the baseline set to 1.
  2. Merge these two-view reconstructions into one local reconstruction by solving a linear system based on the scale consistency in the stellar graph.
  3. Project the merged 3D points to each neighbor and calculate its missing correspondence score.

For more details, please check Section 4 of the paper.

Missing Correspondence Score

In the paper, there are two types of projected 3D points in the field of view of the image: the observed keypoints (matched features) in this image and the ones observed in other images but not in this image (missing features).

The author takes the bounding boxes of matched features and calculates the percentage of missing points in the bounding boxes as the missing correspondence score. For example, these two images below display the projected 3D points from image 0 into image 1 and image 16, respectively.

projected points from image 0 to image 1 projected points from image 0 to image 17

However, this formulation is not stable since the bounding box is extremely sensitive to outliers: one outlier might enlarge the box to the entire image frame. In addition, the bounding box doesn't take into account the perspective transformation between images, where a square in one image would not be a square in another image.

Therefore, we propose a more fine-grained score by using a small square box around every keypoints instead of a big bounding box around all the keypoints. Below is the visualization of the new masks of the images shown above. With this modification, the score is more reliable and distinctive for the disambiguation.

projected points from image 0 to image 1 projected points from image 0 to image 17

We further try to take the distance between matched features and missing features into account by using a Gaussian mask around the keypoints:

projected points from image 0 to image 1 projected points from image 0 to image 17

This scoring method requires a smaller threshold since the score is typical around ~0.1. We did not experiment much with this scoring method, so we are not sure if it is better than the previous version.

Parameters

  • score_version: we provide three score formulations as stated above. Version 1 is the method using a bounding box around keypoints presented in the paper. Version 2 is using a uniform square around every keypoint. Version 3 is using a Gaussian square around every keypoint.
  • min_num_valid_depths: minimal number of correct reconstructed depths for a 3D point to be valid (The depth consistency check in the paper).
  • max_num_neighbors: maximal number of neighbors used for the reconstruction. It is used to avoid the inefficiency caused by dense clusters of images.
  • square_radius: the size of the square around each keypoint for the score versions 2 and 3.
  • parallel: whether to use multiple threads for the computation of the scores. The statistics of the runtime is not accurate in parallel mode.
  • plot: whether to plot the masks and the image with projected 3D points. For large datasets, it is advisable not to plot as plotting will have a significant overhead.

Visualizations

Fine-tuned Parameters

Now we show some results of the implemented methods. The gif on the upper left displays the images in each dataset. The gif on the upper right is the reconstructions from the colmap with default parameters. The bottom left and the bottom right gifs display the reconstructions after disambiguation with Yan's method and Cui's method, respectively.

One thing worth noticing is that the parameters used for different datasets are different: we kind of cheat by tuning the parameters based on the pose graphs.

Books

Books Dataset Books COLMAP

Books Yan's Method Books Cui's Method

Cereal

Cereal Dataset Cereal COLMAP

Cereal Yan's Method Cereal Cui's Method

Cup

Cup Dataset Cup COLMAP

Cup Yan's Method Cup Cui's Method

Desk

Desk Dataset Desk COLMAP

Desk Yan's Method Desk Cui's Method

(the image on the most left is misregistered with colmap, while it is corrected with either one of the two methods)

Oats

Oats Dataset Oats COLMAP

Oats Yan's Method Oats Cui's Method

(both methods failed as the ground truth should be something like a sequence instead of two sequences in parallel)

Street

Street Dataset Street COLMAP

Street Yan's Method Street Cui's Method

Temple of Heaven

ToH Dataset ToH COLMAP

ToH Yan's Method ToH Cui's Method

Alexander Nevsky Cathedral

Alexander Nevsky Cathedral Dataset Alexander Nevsky Cathedral COLMAP

Alexander Nevsky Cathedral Yan's Method Alexander Nevsky Cathedral Cui's Method

Same Parameters

To investigate to what extent a set of parameters would be applicable for all datasets, we apply the parameters tuned for the Alexander Nevsky Cathedral dataset on other Internet collections of images.

Arc de Triomphe

Berliner Dom

Big Ben

Brandenburg Gate

(With a proper choice of the threshold, we can disambiguate the model into several parts.)

Church of Savior on the Spilled Blood

(With a proper choice of the threshold, we can disambiguate the model into several parts.)

Radcliffe Camera

(The correct reconstruction is split into two parts due to the lack of transitional camera views)

Kataria's Method

Here we would like to also display the results from the paper Improving Structure from Motion with Reliable Resectioning by Rajbir Kataria, Joseph DeGol, Derek Hoiem. For more details, please refer to the repository provided by the authors. We refer to this method as Kataria's method hereafter.

Based on the observation that longer tracks are more likely to contain wrong matches, the authors propose to use a track-length-adjusted number of matches as the criterion for the next view selection. More importantly, the initial pose of the image to be registered will rely only on 3D points from reliable images instead of all triangulated points. This is important as our experiments show that a correct registration order does not necessarily lead to a correct reconstruction. This method only contains two parameters to be set: the track length discount factor λ and the reliable image threshold τ. More significantly, the same set of parameters could work on many different scenes, greatly reducing the burden of tuning parameters for the above mentioned two methods.

For a fair comparison, we investigate the changed files in the original repository and integrate them with the current version of colmap with small modifications. We run the exhaustive_matcher instead of vocab_tree_matcher as done in previous methods. Since the parameters provided by the author are tuned for OpenSfm, we also tried to tune the parameters (λ changed from 0.5 to 0.3, τ changed from 2.0 to 1.3) for colmap on the cup and the oats dataset. The results are shown below:

Cup

(The reconstruction on the left is with the parameters provided by the authors, while the one on the right is with the parameters tuned by us)

Oats

(The reconstruction on the left is with the parameters provided by the authors, while the one on the right is with the parameters tuned by us. Note that we did not find a set of suitable parameters for Yan's or Cui's method to disambiguate this scene)

Results on Large Scale Datasets

However, when we use these two sets of parameters on the large scale Internet datasets provided by Heinly et al, both sets of the parameters give us similar reconstructions and they are somewhat inferior to what we can get from Yan's or Cui's method:

Alexander Nevsky Cathedral

(In the left reconstruction, some of the misregistered cameras should be placed in the blue circle to create a correct reconstruction like the one on the right)

Big Ben

(Note the suspicious wall in the blue circle in the left reconstruction, which should be an empty street as in the right reconstruction)

Radcliffe Camera

(This set of parameters for Kataria's method cannot distinguish the two sides of Radcliffe Camera, while Yan's method and Cui's method work)

Reproduction

For the reproduction of the above results for Kataria's method, we put the changed/added files in the reliable_resectioning folder. You can merge all the files in this directory with colmap's source code and then compile it. We also provide a bash script example for generating sparse reconstruction with the newly compiled colmap.

Codebase Walkthrough

|---datasets
    |---heinly2014
        |---...
    |---yan2017
        |---...
|---disambiguation
    |---geodesic_consistency        # code for Yan's method
    |---mmissing_correspondences    # code for Cui's method
    |---options     # parameters for features/matching/mapper
    |---utils       # some helper functions
    |---calculate_geodesic_consistency_scores.py        # interface for calculating match scores based on Yan's method
    |---calculate_missing_correspondences_scores.py     # interface for calculating match scores based on Cui's method
    |---extract_match_features.py                       # interface for extract and match features
|---reliable_resectioning
    |---src         # modified colmap source files for Kataria's method
|---results
    |---${dataset_name}
        |---${feature_type}_${matching_type}_${geometric_verification_type}
            |---plots_${parameters}             # plots for missing correspondences in Cui's method
            |---sparse                          # reconstruction using colmap without disambiguation
            |---sparse_yan_${parameters}        # reconstruction using Yan's method
            |---sparse_cui_${parameters}        # reconstruction using Cui's method
            |---db_yan_${parameters}.db         # storing matches filtered with Yan's method
            |---db_cui_${parameters}.db         # storing matches filtered with Cui's method
            |---${dataset_name}.db              # storing unfiltered matches
            |---scores_yan_${parameters}.npy    # socres for matches using Yan's method
            |---scores_cui_${parameters}.npy    # scores for matches using Cui's method
            |---...
|---notebooks
    |---$steet_${method_name}.ipynb   # example for running the codebase and tuning the parameters on the street dataset.
|---scripts
    |---disambiguate_yan.py     # example of using Yan's method for scores
    |---disambiguate_cui.py     # example of using Cui's method for scores
    |---filter_matches.py       # example of filtering matches based on scores
    |---match_features.py       # example of extracting and matching features

Datasets

We mainly use the datasets from Yan's repo and Heinly's website, where they include some datasets from Roberts et al. and Jiang et al.. We packed the cleaned-up version of these datasets (with images renamed and features removed) into a zip file for downloads.

To experiment with other datasets, you can place new datasets under yan2017 or heinly2014 with the following structure:

|---datasets
    |---heinly2014
        |---${your_dataset_name}
            |---images
                |--- *.[jpg/png/...]

then you can use your ${your_dataset_name} as argument dataset_name to run the code on new datasets.

Literature Review

Here are some relevant papers and their summaries:

  • Pose Graph: nodes are images, edges are epipolar geometries

  • Visibility Graph: nodes are images and tracks(3D points), edges are visibility

  • Some other papers

    • [Roberts CVPR 2011]: missing correspondences + timestamp cues
    • [Cohen CVPR 2012]: detect symmetries and enforce them as constraints in optimization
    • [Ceylan TOG 2013]: user-specified pattern to detect repeated patterns for facade
    • [Heinly 3DV 2014]: similar idea but faster compared to [Heinly ECCV 2014]
    • [Kataria 3DV 2020]: use track length to adjust match score for next view selection; only use reliable 3D points for image registration
  • Common techniques: Minimum Spanning Tree

Acknowledgement

This work was done by Lixin Xue, a Master's student at ETH Zurich, under the supervision of Paul‑Edouard Sarlin and Mihai Dusmanu. We would like to thank Qingan Yan for sharing the original implementation and datasets, Jared Heinly for sharing the datasets, and Rajbir Kataria for helping out with the setup of his work.

Owner
Computer Vision and Geometry Lab
Computer Vision and Geometry Lab
Code for CVPR2019 Towards Natural and Accurate Future Motion Prediction of Humans and Animals

Motion prediction with Hierarchical Motion Recurrent Network Introduction This work concerns motion prediction of articulate objects such as human, fi

Shuang Wu 85 Dec 11, 2022
PyTorch source code for Distilling Knowledge by Mimicking Features

LSHFM.detection This is the PyTorch source code for Distilling Knowledge by Mimicking Features. And this project contains code for object detection wi

Guo-Hua Wang 4 Dec 17, 2022
Implementation of the paper "Fine-Tuning Transformers: Vocabulary Transfer"

Transformer-vocabulary-transfer Implementation of the paper "Fine-Tuning Transfo

LEYA 13 Nov 30, 2022
A BaSiC Tool for Background and Shading Correction of Optical Microscopy Images

BaSiC Matlab code accompanying A BaSiC Tool for Background and Shading Correction of Optical Microscopy Images by Tingying Peng, Kurt Thorn, Timm Schr

Marr Lab 34 Dec 18, 2022
🔥🔥High-Performance Face Recognition Library on PaddlePaddle & PyTorch🔥🔥

face.evoLVe: High-Performance Face Recognition Library based on PaddlePaddle & PyTorch Evolve to be more comprehensive, effective and efficient for fa

Zhao Jian 3.1k Jan 04, 2023
Equivariant layers for RC-complement symmetry in DNA sequence data

Equi-RC Equivariant layers for RC-complement symmetry in DNA sequence data This is a repository that implements the layers as described in "Reverse-Co

7 May 19, 2022
Predicting Price of house by considering ,house age, Distance from public transport

House-Price-Prediction Predicting Price of house by considering ,house age, Distance from public transport, No of convenient stores around house etc..

Musab Jaleel 1 Jan 08, 2022
Awesome Monocular 3D detection

Awesome Monocular 3D detection Paper list of 3D detetction, keep updating! Contents Paper List 2022 2021 2020 2019 2018 2017 2016 KITTI Results Paper

Zhikang Zou 184 Jan 04, 2023
A Lightweight Experiment & Resource Monitoring Tool 📺

Lightweight Experiment & Resource Monitoring 📺 "Did I already run this experiment before? How many resources are currently available on my cluster?"

170 Dec 28, 2022
Tool which allow you to detect and translate text.

Text detection and recognition This repository contains tool which allow to detect region with text and translate it one by one. Description Two pretr

Damian Panek 176 Nov 28, 2022
CAR-API: Cityscapes Attributes Recognition API

CAR-API: Cityscapes Attributes Recognition API This is the official api to download and fetch attributes annotations for Cityscapes Dataset. Content I

Kareem Metwaly 5 Dec 22, 2022
This repo is customed for VisDrone.

Object Detection for VisDrone(无人机航拍图像目标检测) My environment 1、Windows10 (Linux available) 2、tensorflow = 1.12.0 3、python3.6 (anaconda) 4、cv2 5、ensemble

53 Jul 17, 2022
Airbus Ship Detection Challenge

Airbus Ship Detection Challenge This is an open solution to the Airbus Ship Detection Challenge. Our goals We are building entirely open solution to t

minerva.ml 55 Nov 29, 2022
This is an easy python software which allows to sort images with faces by gender and after by age.

Gender-age Classifier This is an easy python software which allows to sort images with faces by gender and after by age. Usage First install Deepface

Claudio Ciccarone 6 Sep 17, 2022
Official pytorch code for SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal

SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal This is the official pytorch code for SSAT: A Symmetric Semantic-

ForeverPupil 57 Dec 13, 2022
The implementation for paper Joint t-SNE for Comparable Projections of Multiple High-Dimensional Datasets.

Joint t-sne This is the implementation for paper Joint t-SNE for Comparable Projections of Multiple High-Dimensional Datasets. abstract: We present Jo

IDEAS Lab 7 Dec 18, 2022
DROPO: Sim-to-Real Transfer with Offline Domain Randomization

DROPO: Sim-to-Real Transfer with Offline Domain Randomization Gabriele Tiboni, Karol Arndt, Ville Kyrki. This repository contains the code for the pap

Gabriele Tiboni 8 Dec 19, 2022
UFPR-ADMR-v2 Dataset

UFPR-ADMR-v2 Dataset The UFPR-ADMRv2 dataset contains 5,000 dial meter images obtained on-site by employees of the Energy Company of Paraná (Copel), w

Gabriel Salomon 8 Sep 29, 2022
Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase

Ranger-Deep-Learning-Optimizer Ranger - a synergistic optimizer combining RAdam (Rectified Adam) and LookAhead, and now GC (gradient centralization) i

Less Wright 1.1k Dec 21, 2022