In this project we investigate the performance of the SetCon model on realistic video footage. Therefore, we implemented the model in PyTorch and tested the model on two example videos.

Overview

Contrastive Learning of Object Representations

Supervisor:

Institutions:

Project Description

Contrastive Learning is an unsupervised method for learning similarities or differences in a dataset, without the need of labels. The main idea is to provide the machine with similar (so called positive samples) and with very different data (negative or corrupted samples). The task of the machine then is to leverage this information and to pull the positive examples in the embedded space together, while pushing the negative examples further apart. Next to being unsupervised, another major advantage is that the loss is applied on the latent space rather than being pixel-base. This saves computation and memory, because there is no need for a decoder and also delivers more accurate results.

eval_3_obj

In this work, we will investigate the SetCon model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. [1] (Paper) The SetCon model has been published in November 2020 by the Google Brain Team and introduces an attention-based object extraction in combination with contrastive learning. It incorporates a novel slot-attention module [2](Paper), which is an iterative attention mechanism to map the feature maps from the CNN-Encoder to a predefined number of object slots and has been inspired by the transformer models from the NLP world.

We investigate the utility of this architecture when used together with realistic video footage. Therefore, we implemented the SetCon with PyTorch according to its description and build upon it to meet our requirements. We then created two different datasets, in which we film given objects from different angles and distances, similar to Pirk [3] (Github, Paper). However, they relied on a faster-RCNN for the object detection, whereas the goal of the SetCon is to extract the objects solely by leveraging the contrastive loss and the slot attention module. By training a decoder on top of the learned representations, we found that in many cases the model can successfully extract objects from a scene.

This repository contains our PyTorch-implementation of the SetCon-Model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. Implementation is based on the description in the article. Note, this is not the official implementation. If you have questions, feel free to reach out to me.

Results

For our work, we have taken two videos, a Three-Object video and a Seven-Object video. In these videos we interacted with the given objects and moved them to different places and constantly changed the view perspective. Both are 30mins long, such that each contains about 54.000 frames.

eval_3_obj
Figure 1: An example of the object extraction on the test set of the Three-Object dataset.

We trained the contrastive pretext model (SetCon) on the first 80% and then evaluated the learned representations on the remaining 20%. Therefore, we trained a decoder, similar to the evaluation within the SetCon paper and looked into the specialisation of each slot. Figures 1 and 2 display two evaluation examples, from the test-set of the Three-Object Dataset and the Seven-Object Dataset. Bot figures start with the ground truth for three timestamps. During evaluation only the ground truth at t will be used to obtain the reconstructed object slots as well as their alpha masks. The Seven-Object video is itended to be more complex and one can perceive in figure 2 that the model struggles more than on the Three-Obejct dataset to route the objects to slots. On the Three-Object dataset, we achieved 0.0043 ± 0.0029 MSE and on the Seven-Object dataset 0.0154 ± 0.0043 MSE.

eval_7_obj
Figure 2: An example of the object extraction on the test set of the Seven-Object dataset.

How to use

For our work, we have taken two videos, a Three-Object video and Seven-Object video. Both datasets are saved as frames and are then encoded in a h5-files. To use a different dataset, we further provide a python routine process frames.py, which converts frames to h5 files.

For the contrastive pretext-task, the training can be started by:

python3 train_pretext.py --end 300000 --num-slots 7
        --name pretext_model_1 --batch-size 512
        --hidden-dim=1024 --learning-rate 1e-5
        --feature-dim 512 --data-path ’path/to/h5file’

Further arguments, like the size of the encoder or for an augmentation pipeline, use the flag -h for help. Afterwards, we froze the weights from the encoder and the slot-attention-module and trained a downstream decoder on top of it. The following command will train the decoder upon the checkpoint file from the pretext task:

python3 train_decoder.py --end 250000 --num-slots 7
        --name downstream_model_1 --batch-size 64
        --hidden-dim=1024 --feature-dim 512
        --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --learning-rate 1e-5

For MSE evaluation on the test-set, use both checkpoints, from the pretext- model for the encoder- and slot-attention-weights and from the downstream- model for the decoder-weights and run:

python3 eval.py --num-slots 7 --name evaluation_1
        --batch-size 64 --hidden-dim=1024
        --feature-dim 512 --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --decoder-path "path/to/decoder.pth.tar"

Implementation Adjustments

Instead of many small sequences of artificially created frames, we need to deal with a long video-sequence. Therefore, each element in our batch mirrors a single frame at a given time t, not a sequence. For this single frame at time t, we load its two predecessors, which are then used to predict the frame at t, and thereby create a positive example. Further, we found, that the infoNCE-loss to be numerically unstable in our case, hence we opted for the almost identical but more stable NT-Xent in our implementation.

References

[1] Löwe, Sindy et al. (2020). Learning object-centric video models by contrasting sets. Google Brain team.

[2] Locatello, Francesco et al. Object-centric learning with slot attention.

[3] Pirk, Sören et al. (2019). Online object representations with contrastive learning. Google Brain team.

Owner
Dirk Neuhäuser
Dirk Neuhäuser
The hippynn python package - a modular library for atomistic machine learning with pytorch.

The hippynn python package - a modular library for atomistic machine learning with pytorch. We aim to provide a powerful library for the training of a

Los Alamos National Laboratory 37 Dec 29, 2022
[ICCV 2021] Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain

Amplitude-Phase Recombination (ICCV'21) Official PyTorch implementation of "Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neur

Guangyao Chen 53 Oct 05, 2022
Implementation of the famous Image Manipulation\Forgery Detector "ManTraNet" in Pytorch

Who has never met a forged picture on the web ? No one ! Everyday we are constantly facing fake pictures touched up in Photoshop but it is not always

Rony Abecidan 77 Dec 16, 2022
A scikit-learn-compatible module for estimating prediction intervals.

MAPIE - Model Agnostic Prediction Interval Estimator MAPIE allows you to easily estimate prediction intervals (or prediction sets) using your favourit

588 Jan 04, 2023
Implementation of Invariant Point Attention, used for coordinate refinement in the structure module of Alphafold2, as a standalone Pytorch module

Invariant Point Attention - Pytorch Implementation of Invariant Point Attention as a standalone module, which was used in the structure module of Alph

Phil Wang 113 Jan 05, 2023
Dataset VSD4K includes 6 popular categories: game, sport, dance, vlog, interview and city.

CaFM-pytorch ICCV ACCEPT Introduction of dataset VSD4K Our dataset VSD4K includes 6 popular categories: game, sport, dance, vlog, interview and city.

96 Jul 05, 2022
The PyTorch implementation for paper "Neural Texture Extraction and Distribution for Controllable Person Image Synthesis" (CVPR2022 Oral)

ArXiv | Get Start Neural-Texture-Extraction-Distribution The PyTorch implementation for our paper "Neural Texture Extraction and Distribution for Cont

Ren Yurui 111 Dec 10, 2022
Implementation of ICCV21 paper: PnP-DETR: Towards Efficient Visual Analysis with Transformers

Implementation of ICCV 2021 paper: PnP-DETR: Towards Efficient Visual Analysis with Transformers arxiv This repository is based on detr Recently, DETR

twang 113 Dec 27, 2022
Keras Realtime Multi-Person Pose Estimation - Keras version of Realtime Multi-Person Pose Estimation project

This repository has become incompatible with the latest and recommended version of Tensorflow 2.0 Instead of refactoring this code painfully, I create

M Faber 769 Dec 08, 2022
PyTorch code for our paper "Gated Multiple Feedback Network for Image Super-Resolution" (BMVC2019)

Gated Multiple Feedback Network for Image Super-Resolution This repository contains the PyTorch implementation for the proposed GMFN [arXiv]. The fram

Qilei Li 66 Nov 03, 2022
Get started learning C# with C# notebooks powered by .NET Interactive and VS Code.

.NET Interactive Notebooks for C# Welcome to the home of .NET interactive notebooks for C#! How to Install Download the .NET Coding Pack for VS Code f

.NET Platform 425 Dec 25, 2022
atmaCup #11 の Public 4th / Pricvate 5th Solution のリポジトリです。

#11 atmaCup 2021-07-09 ~ 2020-07-21 に行われた #11 [初心者歓迎! / 画像編] atmaCup のリポジトリです。結果は Public 4th / Private 5th でした。 フレームワークは PyTorch で、実装は pytorch-image-m

Tawara 12 Apr 07, 2022
Code to produce syntactic representations that can be used to study syntax processing in the human brain

Can fMRI reveal the representation of syntactic structure in the brain? The code base for our paper on understanding syntactic representations in the

Aniketh Janardhan Reddy 4 Dec 18, 2022
Differentiable scientific computing library

xitorch: differentiable scientific computing library xitorch is a PyTorch-based library of differentiable functions and functionals that can be widely

98 Dec 26, 2022
Hcpy - Interface with Home Connect appliances in Python

Interface with Home Connect appliances in Python This is a very, very beta inter

Trammell Hudson 116 Dec 27, 2022
A "gym" style toolkit for building lightweight Neural Architecture Search systems

A "gym" style toolkit for building lightweight Neural Architecture Search systems

Jack Turner 12 Nov 05, 2022
Doods2 - API for detecting objects in images and video streams using Tensorflow

DOODS2 - Return of DOODS Dedicated Open Object Detection Service - Yes, it's a b

Zach 101 Jan 04, 2023
Chinese Advertisement Board Identification(Pytorch)

Chinese-Advertisement-Board-Identification. We use YoloV5 to extract the ROI of the location of the chinese word. Next, we sort the bounding box and recognize every chinese words which we extracted.

Li-Wei Hsiao 12 Jul 21, 2022
ICCV2021 - A New Journey from SDRTV to HDRTV.

ICCV2021 - A New Journey from SDRTV to HDRTV.

XyChen 82 Dec 27, 2022
Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning Reference Abeßer, J. & Müller, M. Towards Audio Domain Adapt

Jakob Abeßer 2 Jul 06, 2022