This is the official released code for our paper, The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos

Last update: Oct 08, 2022

Overview

The-Emergence-of-Objectness

This is the official code release for our paper, The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos, which was accepted to NeurIPS 2021. Code will be available soon.

Code

To be released.

Abstract

Humans can easily segment moving objects without knowing what they are. That objectness could emerge from continuous visual observations motivates us to model grouping and movement concurrently from unlabeled videos. Our premise is that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis which can be checked from the data itself without any external supervision.

Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images. It then binds them in a conjoint representation called segment flow that pools flow offsets over each region and provides a gross characterization of moving regions for the entire scene. By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively.
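As a rough illustration of the pooling step described above, the NumPy sketch below averages per-pixel flow offsets inside each soft region mask and paints the per-region result back onto the image grid. It is not the released implementation (the code is not yet available). The function name `segment_flow`, the array shapes, and the choice of plain mask-weighted averaging are all assumptions made for illustration.

```python
import numpy as np

def segment_flow(masks, flow, eps=1e-8):
    """Hypothetical helper: pool per-pixel flow offsets over soft regions.

    masks: (K, H, W) soft region masks (assumed to sum to 1 at each pixel).
    flow:  (2, H, W) per-pixel offsets, flow[0] = dx, flow[1] = dy.
    Returns (dense, region_flow): a (2, H, W) piecewise-constant flow field
    and the (K, 2) mean offset of each region.
    """
    k = masks.shape[0]
    weights = masks.reshape(k, -1)   # (K, H*W)
    offsets = flow.reshape(2, -1)    # (2, H*W)
    # Mask-weighted mean offset per region.
    region_flow = (weights @ offsets.T) / (weights.sum(axis=1, keepdims=True) + eps)
    # Every pixel receives the offsets of the regions it belongs to.
    dense = np.einsum("kc,khw->chw", region_flow, masks)
    return dense, region_flow

# Toy example: two hard regions covering the left and right halves of a 2x4 image.
masks = np.zeros((2, 2, 4))
masks[0, :, :2] = 1.0
masks[1, :, 2:] = 1.0
flow = np.zeros((2, 2, 4))
flow[0] = [[1.0, 3.0, 0.0, 0.0],
           [3.0, 1.0, 0.0, 0.0]]   # noisy horizontal motion on the left only
dense, region_flow = segment_flow(masks, flow)
```

In the toy example the noisy offsets in the left region are averaged to a single motion vector for that region, while the right region stays static. This is the "gross characterization of moving regions" that the text refers to.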

Our model demonstrates the surprising emergence of objectness in the appearance pathway, surpassing prior works on zero-shot object segmentation from an image, moving object segmentation from a video with unsupervised test-time adaptation, and semantic image segmentation by supervised fine-tuning. Our work is the first truly end-to-end approach to zero-shot object segmentation from videos. It not only develops generic objectness for segmentation and tracking, but also outperforms prevalent image-based contrastive learning methods without augmentation engineering.

Approach

We learn a single-image segmentation network and a dual-frame motion network with an unsupervised image reconstruction loss. We sample two frames, $i$ and $j$, from a video. Frame $i$ goes through the segmentation network, which outputs a set of masks, whereas frames $i$ and $j$ go through the motion network, which outputs a feature map. The feature map is pooled within each mask, and a flow is predicted for each segment. All the segments and their flows are combined into a segment flow representation from frame $i$ → $j$, which is used to warp frame $i$ into frame $j$. The warped image is compared against frame $j$ to train the two networks.
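The warp-and-compare step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `warp_backward` and `reconstruction_loss` are hypothetical, the warp is a simple backward lookup with nearest-neighbor sampling, and the error is a mean squared difference. A trainable version would need differentiable (for example bilinear) sampling, and the loss actually used may differ.

```python
import numpy as np

def warp_backward(frame, flow):
    """Hypothetical helper: warp `frame` with a dense flow field.

    frame: (H, W) or (H, W, C) image.
    flow:  (2, H, W) offsets, flow[0] = dx, flow[1] = dy.
    Assumption: each output pixel p reads the input at p - flow(p),
    rounded to the nearest pixel and clamped to the image border.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

def reconstruction_loss(frame_i, frame_j, flow):
    """Mean squared error between the warped frame i and frame j."""
    return float(np.mean((warp_backward(frame_i, flow) - frame_j) ** 2))

# Toy example: a single bright pixel moves one step to the right.
frame_i = np.zeros((4, 4))
frame_i[1, 1] = 1.0
frame_j = np.zeros((4, 4))
frame_j[1, 2] = 1.0

good_flow = np.zeros((2, 4, 4))
good_flow[0] = 1.0                 # dx = +1 everywhere, dy = 0
no_flow = np.zeros((2, 4, 4))

loss_good = reconstruction_loss(frame_i, frame_j, good_flow)
loss_none = reconstruction_loss(frame_i, frame_j, no_flow)
```

In the toy example the correct flow reconstructs frame $j$ exactly, while a zero flow leaves a residual error. Minimizing that error is the signal that trains both networks without labels.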

Zero-Shot Saliency Detection

Qualitative salient object detection results. We directly transfer our pretrained segmentation network to novel images on the DUTS dataset without any fine-tuning. Surprisingly, we find that the model, pretrained on videos to segment moving objects, generalizes to detecting stationary objects in a static image, e.g. the statue, the plate, the bench and the tree in the last column.

Zero-Shot Video Object Segmentation

Qualitative results on SegTrackv2

Qualitative results on DAVIS 2016

Qualitative results on FBMS59
