Fuzzy Overclustering (FOC)

Overview

In real-world datasets, annotators must agree with one another before a label can be treated as ground truth. In many applications, however, such consistent annotations cannot be obtained because of issues like intra- and interobserver variability. We call these inconsistent labels fuzzy. Our method, Fuzzy Overclustering, overclusters the data and can therefore handle these fuzzy labels better than out-of-the-box semi-supervised methods.

More details are given in the accepted full paper at https://doi.org/10.3390/s21196661 or in the preprint at https://arxiv.org/abs/2012.01768

The main idea is illustrated below. The graphic and caption are taken from the original work.

main idea of paper

Illustration of fuzzy data and overclustering: The grey dots represent unlabeled data and the colored dots labeled data from different classes. The dashed lines represent decision boundaries. For certain data, a clear separation of the different classes with one decision boundary is possible and both classes contain the same amount of data (top). For fuzzy data, determining a decision boundary is difficult because of intermediate datapoints between the classes (middle). Annotators often cannot consistently sort these fuzzy datapoints into one class. If you overcluster the data, you get smaller but more consistent substructures in the fuzzy data (bottom). The images illustrate possible examples for certain data (cat & dog) and fuzzy plankton data (trichodesmium puff and tuft). The center plankton image was considered to be trichodesmium puff or tuft by around half of the annotators each. The left and right plankton images were consistently annotated as their respective classes.

Installation

We advise using Docker for the experiments. We recommend a Python 3 container with TensorFlow 1.14 preinstalled. Additionally, the following commands need to be executed:

apt-get update
apt-get install -y libsm6 libxext6 libxrender-dev libgl1-mesa-glx

After this, ensure that the requirements from requirements.txt are installed. The most important packages are keras, scipy and opencv.

Usage

The parameters and their descriptions are given in arguments.yaml. Most of the parameters can be left at their default values. The dataset, batch size and epoch-related parameters are especially important.

As a rule of thumb the following should be applied:

  • overcluster_k = 5-6 * the number of classes
  • batch_size = repetition * overcluster_k * 2-3
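The rule of thumb above can be turned into a small helper. This is only a sketch: the function name is hypothetical and not part of this repository, and it assumes that "repetition" refers to the value passed via --sample_repetition (taken as 1 when the flag is not given).

```python
def suggest_foc_parameters(num_classes, sample_repetition=1):
    """Return (low, high) ranges that follow the README rule of thumb.

    Hypothetical helper, not part of the FOC code base.
    Assumption: `sample_repetition` is the "repetition" factor in the rule.
    """
    # overcluster_k = 5-6 * the number of classes
    k_low, k_high = 5 * num_classes, 6 * num_classes
    # batch_size = repetition * overcluster_k * 2-3
    batch_low = sample_repetition * k_low * 2
    batch_high = sample_repetition * k_high * 3
    return {
        "overcluster_k": (k_low, k_high),
        "batch_size": (batch_low, batch_high),
    }


if __name__ == "__main__":
    # A 10-class dataset such as stl10, without sample repetition
    print(suggest_foc_parameters(num_classes=10))
    # The same dataset with a sample repetition of 3
    print(suggest_foc_parameters(num_classes=10, sample_repetition=3))
```

For 10 classes, the values used in the example commands of this README (overcluster_k 60 with batch_size 130, or batch_size 390 with a sample repetition of 3) fall inside the suggested ranges.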

You need to define three directories for the execution with docker:

  • DATASET_ROOT: this folder contains a folder named after the dataset, which in turn contains a train and a val folder. It also needs an unlabeled folder if the parameter unlabeled_data is used. Each of these folders contains one subfolder per class.
  • LOG_ROOT: all experimental results are stored inside its subdirectory logs, organized by the given IDs and a time stamp.
  • SRC_ROOT: the root of this project's source code.
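Before starting a container, it can help to check that the dataset folder matches the layout described above. The following is a minimal sketch, not part of this repository: the function name is hypothetical, and it assumes the split folders are literally named train, val and unlabeled, and that the unlabeled folder also holds class subfolders.

```python
from pathlib import Path


def check_dataset_layout(dataset_root, dataset_name, use_unlabeled=False):
    """Return a list of layout problems; an empty list means none were found.

    Hypothetical helper, not part of the FOC code base.
    """
    problems = []
    base = Path(dataset_root) / dataset_name
    splits = ["train", "val"]
    if use_unlabeled:
        # Only required when the parameter unlabeled_data is used
        splits.append("unlabeled")
    for split in splits:
        split_dir = base / split
        if not split_dir.is_dir():
            problems.append(f"missing folder: {split_dir}")
            continue
        # Each split folder is expected to contain one subfolder per class
        class_dirs = [p for p in split_dir.iterdir() if p.is_dir()]
        if not class_dirs:
            problems.append(f"no class subfolders in: {split_dir}")
    return problems


if __name__ == "__main__":
    # Example paths are placeholders; adapt them to your own DATASET_ROOT
    for problem in check_dataset_layout("/data-ssd", "stl10", use_unlabeled=True):
        print(problem)
```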

DOCKER_IMAGE is the image described in the Installation section above.

You can visualize the results with tensorboard --logdir . from inside the log_dir

Example Usages

# test container
docker run -it --rm -v $DATASET_ROOT:/data-ssd -v $LOG_ROOT:/data1 -v $SRC_ROOT:/src -w="/src" $DOCKER_IMAGE bash

# test pipeline running
docker run -it --rm -v $DATASET_ROOT:/data-ssd -v $LOG_ROOT:/data1 -v $SRC_ROOT:/src -w="/src" $DOCKER_IMAGE python main.py --IDs foc experiment_name not_use_mi --dataset [email protected] --unlabeled_data [email protected] --frozen_batch_size 130 --batch_size 130 --overcluster_k 60 --num_gpus 1 --normal_epoch 2 --frozen_epoch 1

# training FOC-Light
docker run -it --rm -v $DATASET_ROOT:/data-ssd -v $LOG_ROOT:/data1 -v $SRC_ROOT:/home -w="/home" $DOCKER_IMAGE python main.py --experiment_identifiers foc experiment_name not_use_mi --dataset stl10 --frozen_batch_size 130 --batch_size 130 --overcluster_k 60 --num_gpus 1

# training FOC (no warmup)
# needs multiple GPUs or very large ones (change num gpu to 1 in this case)
docker run -it --rm -v $DATASET_ROOT:/data-ssd -v $LOG_ROOT:/data1 -v $SRC_ROOT:/home -w="/home" $DOCKER_IMAGE python main.py --experiment_identifiers foc experiment_name not_use_mi --dataset stl10 --frozen_batch_size 390 --batch_size 390 --overcluster_k 60 --num_gpus 3 --lambda_m 1 --sample_repetition 3