Code for the SIGGRAPH 2021 paper "Consistent Depth of Moving Objects in Video".

Overview

Consistent Depth of Moving Objects in Video

teaser

This repository contains training code for the SIGGRAPH 2021 paper "Consistent Depth of Moving Objects in Video".

This is not an officially supported Google product.

Installing Dependencies

We provide both conda and pip installations for dependencies.

  • To install with conda, run
conda create --name dynamic-video-depth --file ./dependencies/conda_packages.txt
  • To install with pip, run
pip install -r ./dependencies/requirements.txt

Training

We provide two preprocessed video tracks from the DAVIS dataset. To download the pre-trained single-image depth prediction checkpoints, as well as the example data, run:

bash ./scripts/download_data_and_depth_ckpt.sh

This script will automatically download and unzip the checkpoints and data. If you would like to download manually

To train using the example data, run:

bash ./experiments/davis/train_sequence.sh 0 --track_id dog

The first argument indicates the GPU id for training, and --track_id indicates the name of the track. ('dog' and 'train' are provided.)

After training, the results should look like:

Video Our Depth Single Image Depth

Dataset Preparation:

To help with generating custom datasets for training, We provide examples of preparing the dataset from DAVIS, and two sequences from ShutterStock, which are showcased in our paper.

The general work flow for preprocessing the dataset is:

  1. Calibrate the scale of camera translation, transform the camera matrices into camera-to-world convention, and save as individual files.

  2. Calculate flow between pairs of frames, as well as occlusion estimates.

  3. Pack flow and per-frame data into training batches.

To be more specific, example codes are provided in .scripts/preprocess

We provide the triangulation results here and here. You can download them in a single script by running:

bash ./scripts/download_triangulation_files.sh

Davis data preparation

  1. Download the DAVIS dataset here, and unzip it under ./datafiles.

  2. Run python ./scripts/preprocess/davis/generate_frame_midas.py. This requires trimesh to be installed (pip install trimesh should do the trick). This script projects the triangulated 3D points to calibrate camera translation scales.

  3. Run python ./scripts/preprocess/davis/generate_flows.py to generate optical flows between pairs of images. This stage requires RAFT, which is included as a submodule in this repo.

  4. Run python ./scripts/preprocess/davis/generate_sequence_midas.py to pack camera calibrations and images into training batches.

ShutterStock Videos

  1. Download the ShutterStock videos here and here.

  2. Cast the videos as images, put them under ./datafiles/shutterstock/images, and rename them to match the file names in ./datafiles/shutterstock/triangulation. Note that not all frames are triangulated; time stamp of valid frames are recorded in the triangulation file name.

  3. Run python ./scripts/preprocess/shutterstock/generate_frame_midas.py to pack per-frame data.

  4. Run python ./scripts/preprocess/shutterstock/generate_flows.py to generate optical flows between pairs of images.

  5. Run python ./scripts/preprocess/shutterstock/generate_sequence_midas.py to pack flows and per-frame data into training batches.

  6. Example training script is located at ./experiments/shutterstock/train_sequence.sh

Comments
  • question about the Pre-processing

    question about the Pre-processing

    Can you provide the code for preprocessing part? I wonder for dynamic video, how to get accurate camera pose and K? I see you use DAVIS for example, I want to know how to deal with other videos in this dataset.

    opened by Robertwyq 11
  • Parameter finetuning vs Output finetuning

    Parameter finetuning vs Output finetuning

    It seems that running gradient descent for the depth prediction network makes up the majority of the runtime of this method. The current MiDaS implementation (v3?) contains 1.3 GB of parameters, most of which are for the DPT-Large (https://github.com/isl-org/DPT) backbone.

    In your research, did you experiment with performance differences between 'parameter finetuning' and just simple 'output finetuning' for the depth predictions (like as discussed in the GLNet paper (https://arxiv.org/pdf/1907.05820.pdf))?

    I would also be curious about whether as a middle ground, maybe just finetuning the 'head' of the MiDaS network would be sufficient, and leave the much larger set of backbone parameters locked.

    Thanks!

    opened by carsonswope 0
  • How to get the triangulation files for customized videos?

    How to get the triangulation files for customized videos?

    Thanks for sharing this great work!

    I was wondering how to obtain the triangulation files when using my own videos. For example, the dog.intrinsics.txt, dog.matrices.txt, and the dog.obj.

    Are they calculated from colmap? Could you please provide some instructions to get them?

    opened by Cogito2012 0
  • Question about the colmap parameter setting and image resize need to convert the camera pose

    Question about the colmap parameter setting and image resize need to convert the camera pose

    This is very useful work, thanks. I use colmap automatic_reconstructor --camera_model FULL_OPENCV to process the dog training set in DAVIS to get the camera pose, then replacing ./datafiles/DAVIS/triangulation/, other training codes have not changed, but the depth result of each frame has become much worse. How to set the specific parameters of colmap preprocessing? In addition, the image is resized to a small image during training, does the camera pose information obtained by colmap need to be transformed according to resize?

    opened by mayunchao1994 2
  • Question about triangulation results file

    Question about triangulation results file

    This is a great project, Thanks for your work. I have download triangulation results from your link, but i only found dog.intrinsics.txt and train.intrinsics.txt, In DAVIS-2017-trainval-Full-Resolution.zip file, There are 90 files in it, I was wondering if you could share all the triangulation files about Davis and ShutterStock dataset, Thanks very much.

    opened by aiforworlds 0
  • Can not reproduce training result

    Can not reproduce training result

    As it has been mentioned in issue #9 "DAVIS datafiles uncomplete": "datafiles.tar in provided "Google Drive" download link consists only triangulation data. There are no "JPEGImages/1080p" and "Annotation//1080p" folders that "python ./scripts/preprocess/davis/generate_frame_midas.py" refers to." So, I manually downloaded missing data from https://data.vision.ee.ethz.ch/csergi/share/davis/DAVIS-2017-Unsupervised-trainval-Full-Resolution.zip After that the structure as follow:

    ├── datafiles
        ├── DAVIS
            ├── Annotations  --- missing in supplied download links, downloaded manually from DAVIS datasets 
                ├── 1080p
                    ├── dog
                    ├── train
            ├── JPEGImages  --- missing in supplied download links, downloaded manually from DAVIS datasets 
                ├── 1080p
                    ├── dog
                    ├── train
            ├── triangulation -- data from supplied link
    

    Only after that I could successfully performed all steps of suggested in "Davis data preparation":

    1. Run python ./scripts/preprocess/davis/generate_frame_midas.py.
    2. Run python ./scripts/preprocess/davis/generate_flows.py
    3. Run python ./scripts/preprocess/davis/generate_sequence_midas.py

    However still couldn't reproduce the presented result, running: bash ./experiments/davis/train_sequence.sh 0 --track_id dog

    Output & Stacktrace:

    
    D:\dynamic-video-depth-main>bash ./experiments/davis/train_sequence.sh 0 --track_id dog
    python train.py --net scene_flow_motion_field --dataset davis_sequence --track_id train --log_time --epoch_batches 2000 --epoch 20 --lr 1e-6 --html_logger --vali_batches 150 --batch_size 1 --optim adam --vis_batches_vali 4 --vis_every_vali 1 --vis_every_train 1 --vis_batches_train 5 --vis_at_start --tensorboard --gpu 0 --save_net 1 --workers 4 --one_way --loss_type l1 --l1_mul 0 --acc_mul 1 --disp_mul 1 --warm_sf 5 --scene_lr_mul 1000 --repeat 1 --flow_mul 1 --sf_mag_div 100 --time_dependent --gaps 1,2,4,6,8 --midas --use_disp --logdir './checkpoints/davis/sequence/' --suffix 'track_{track_id}_{loss_type}_wreg_{warm_reg}_acc_{acc_mul}_disp_{disp_mul}_flowmul_{flow_mul}_time_{time_dependent}_CNN_{use_cnn}_gap_{gaps}_Midas_{midas}_ud_{use_disp}' --test_template './experiments/davis/test_cmd.txt' --force_overwrite --track_id dog
      File "train.py", line 106
        str_warning, f'ignoring the gpu set up in opt: {opt.gpu}. Will use all gpus in each node.')
                                                                                                 ^
    SyntaxError: invalid syntax
    

    Noticed that there is no folder named ".checkpoints"

    Similar issue has been mentioned in issue #8 "SyntaxError: invalid syntax"

    Specs: Windows 10 Anaconda: conda 4.11.0 Python 3.7.10 GPU 12Gb Quadro M6000 All specified dependencies including RAFT are installed

    opened by makemota 0
  • DAVIS datafiles uncomplete?

    DAVIS datafiles uncomplete?

    "datafiles.tar" in provided "Google Drive" download link consists only triangulation data. There are no "JPEGImages/1080p" and "Annotation//1080p" folders that "python ./scripts/preprocess/davis/generate_frame_midas.py" refers to:

    ---
    data_list_root = "./datafiles/DAVIS/JPEGImages/1080p"
    camera_path = "./datafiles/DAVIS/triangulation"
    mask_path = './datafiles/DAVIS/Annotations/1080p'
    ---
    
    opened by semel1 1
Releases(sig2021_code_release)
Owner
Google
Google ❤️ Open Source
Google
Code for A Volumetric Transformer for Accurate 3D Tumor Segmentation

VT-UNet This repo contains the supported pytorch code and configuration files to reproduce 3D medical image segmentaion results of VT-UNet. Environmen

Himashi Amanda Peiris 114 Dec 20, 2022
QAHOI: Query-Based Anchors for Human-Object Interaction Detection (paper)

QAHOI QAHOI: Query-Based Anchors for Human-Object Interaction Detection (paper) Requirements PyTorch = 1.5.1 torchvision = 0.6.1 pip install -r requ

38 Dec 29, 2022
[CVPR 2022] PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision (Oral)

PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision Kehong Gong*, Bingbing Li*, Jianfeng Zhang*, Ta

256 Dec 28, 2022
OpenDelta - An Open-Source Framework for Paramter Efficient Tuning.

OpenDelta is a toolkit for parameter efficient methods (we dub it as delta tuning), by which users could flexibly assign (or add) a small amount parameters to update while keeping the most paramters

THUNLP 386 Dec 26, 2022
EGNN - Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch

EGNN - Pytorch Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch. May be eventually used for Alphafold2 replication. This

Phil Wang 259 Jan 04, 2023
Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds (CVPR 2022, Oral)

Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds (CVPR 2022, Oral) This is the official implementat

Yifan Zhang 259 Dec 25, 2022
GemNet model in PyTorch, as proposed in "GemNet: Universal Directional Graph Neural Networks for Molecules" (NeurIPS 2021)

GemNet: Universal Directional Graph Neural Networks for Molecules Reference implementation in PyTorch of the geometric message passing neural network

Data Analytics and Machine Learning Group 124 Dec 30, 2022
Neighbor2Seq: Deep Learning on Massive Graphs by Transforming Neighbors to Sequences

Neighbor2Seq: Deep Learning on Massive Graphs by Transforming Neighbors to Sequences This repository is an official PyTorch implementation of Neighbor

DIVE Lab, Texas A&M University 8 Jun 12, 2022
Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT

CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT CheXbert is an accurate, automated dee

Stanford Machine Learning Group 51 Dec 08, 2022
[CVPR 2020] GAN Compression: Efficient Architectures for Interactive Conditional GANs

GAN Compression project | paper | videos | slides [NEW!] GAN Compression is accepted by T-PAMI! We released our T-PAMI version in the arXiv v4! [NEW!]

MIT HAN Lab 1k Jan 07, 2023
Official implementation of UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation

UTNet (Accepted at MICCAI 2021) Official implementation of UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation Introduction Transf

110 Jan 01, 2023
Some pre-commit hooks for OpenMMLab projects

pre-commit-hooks Some pre-commit hooks for OpenMMLab projects. Using pre-commit-hooks with pre-commit Add this to your .pre-commit-config.yaml - rep

OpenMMLab 16 Nov 29, 2022
Repository for open research on optimizers.

Open Optimizers Repository for open research on optimizers. This is a test in sharing research/exploration as it happens. If you use anything from thi

Ariel Ekgren 6 Jun 24, 2022
traiNNer is an open source image and video restoration (super-resolution, denoising, deblurring and others) and image to image translation toolbox based on PyTorch.

traiNNer traiNNer is an open source image and video restoration (super-resolution, denoising, deblurring and others) and image to image translation to

202 Jan 04, 2023
This is the official Pytorch-version code of FlatGCN (Flattened Graph Convolutional Networks for Recommendation).

FlatGCN This is the official Pytorch-version code of FlatGCN (Flattened Graph Convolutional Networks for Recommendation, submitted to ICASSP2022). Req

Dreamer 2 Aug 09, 2022
IEEE Winter Conference on Applications of Computer Vision 2022 Accepted

SSKT(Accepted WACV2022) Concept map Dataset Image dataset CIFAR10 (torchvision) CIFAR100 (torchvision) STL10 (torchvision) Pascal VOC (torchvision) Im

1 Nov 17, 2022
Megaverse is a new 3D simulation platform for reinforcement learning and embodied AI research

Megaverse Megaverse is a new 3D simulation platform for reinforcement learning and embodied AI research. The efficient design of the engine enables ph

Aleksei Petrenko 191 Dec 23, 2022
YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

Introduction Yolov5-face is a real-time,high accuracy face detection. Performance Single Scale Inference on VGA resolution(max side is equal to 640 an

DeepCam Shenzhen 1.4k Jan 07, 2023
JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation

JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation This the repository for this paper. Find extensions of this w

Zhuoyuan Mao 14 Oct 26, 2022
Event queue (Equeue) dialect is an MLIR Dialect that models concurrent devices in terms of control and structure.

Event Queue Dialect Event queue (Equeue) dialect is an MLIR Dialect that models concurrent devices in terms of control and structure. Motivation The m

Cornell Capra 23 Dec 08, 2022