Semi-supervised Learning for Sentiment Analysis

Overview

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining

Code, models, and datasets for "Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining".

The repository implements the paper's teacher-student pipeline: pretrain an in-domain LM on 3.4M IMDB movie reviews, fine-tune a teacher classifier on the labeled IMDB data, use the teacher to pseudo-label the unlabeled reviews, select the most confident pseudo-labels, and train a student model on the combined data. Notation used throughout: D is the labeled IMDB training set, U is the 3.4M unlabeled in-domain reviews, U' is a 3M random sample of U, D' is the 1M highest-scoring pseudo-labeled reviews selected from U', and "Large U" denotes the open-domain corpus behind the public RoBERTa checkpoints.

Download Models and Dataset

Datasets and models can be downloaded from the following list.

  • Download the 3.4M IMDB movie reviews. Save the data at [REVIEWS_PATH]. You can download the dataset HERE.
  • Download the vanilla RoBERTa-large model released by HuggingFace. Save the model at [VANILLA_ROBERTA_LARGE_PATH]. You can download the model HERE.
  • Download the in-domain pretrained models used in the paper and save them at [PRETRAIN_MODELS]. We provide the following three models. You can download them HERE.
    • init-roberta-base: RoBERTa-base model (U) trained from scratch on the 3.4M movie reviews.
    • semi-roberta-base: RoBERTa-base model (Large U + U) trained on the 3.4M movie reviews, starting from the open-domain pretrained RoBERTa-base model.
    • semi-roberta-large: RoBERTa-large model (Large U + U) trained on the 3.4M movie reviews, starting from the open-domain pretrained RoBERTa-large model.
  • Download the 1M (D' + D) training dataset for the student model and save the data at [STUDENT_DATA_PATH]. You can download it HERE.
    • student_data_base: student training data generated by the roberta-base teacher model
    • student_data_large: student training data generated by the roberta-large teacher model
  • Download the IMDB dataset from Andrew Maas' paper. Save the data at [IMDB_DATA_PATH]. The training data and test data are saved in two separate files; each line in a file corresponds to one IMDB sample (see the reader sketch after this list). You can download it HERE.
  • Download shannon_preprocssor.whl to install the binarization tool. Save the .whl file at [SHANNON_PREPROCESS_WHL_PATH]. You can download it HERE.
  • Download the teacher and student models that we trained. Save them at [CHECKPOINTS]. You can download them HERE.
    • roberta-base: teacher and student model checkpoints for roberta-base
    • roberta-large: teacher and student model checkpoints for roberta-large
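
For reference, a minimal reader for the one-sample-per-line IMDB files might look like the sketch below. The tab-separated label/text layout and the file name train.txt are assumptions, not part of this release; check the downloaded files for the actual field order.

from pathlib import Path

def read_imdb_split(path):
    # One IMDB sample per line; assumed layout: "<label>\t<text>".
    samples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        label, text = line.split("\t", maxsplit=1)
        samples.append((int(label), text))
    return samples

train = read_imdb_split("[IMDB_DATA_PATH]/train.txt")  # hypothetical file name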

Installation

pip install -r requirements.txt
pip install [SHANNON_PREPROCESS_WHL_PATH]

Quick Tour

Train the RoBERTa-large teacher model

Use the RoBERTa model we pretrained on the 3.4M reviews to train the teacher model.
Our teacher model reaches an accuracy of 96.2% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

Train the RoBERTa-large student model

Use the RoBERTa model we pretrained on the 3.4M reviews to train the student model.
Our student model reaches an accuracy of 96.8% on the test set. With the settings below (batch size 10 per GPU, 8 GPUs, and gradient accumulation of 50), the effective batch size is 10 × 8 × 50 = 4,000.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

Evaluate the student model on the test set

Load the student model checkpoint and evaluate it on the test set to reproduce our result.

cd sstc/tasks/semi-roberta
python evaluate.py \
--checkpoint_path [CHECKPOINTS]/roberta-large/train_student_checkpoint/***.ckpt \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--batch_size=10 \
--gpus=0,
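
If you only want to score individual reviews rather than reproduce the test-set number, a minimal sketch using the HuggingFace API is shown below. It assumes the checkpoint has first been converted to a HuggingFace-format sequence classifier (the directory name and the label order are hypothetical); evaluate.py above remains the authoritative evaluation path.

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

model_dir = "[CHECKPOINTS]/roberta-large-hf"  # hypothetical converted checkpoint
tokenizer = RobertaTokenizer.from_pretrained(model_dir)
model = RobertaForSequenceClassification.from_pretrained(model_dir).eval()

inputs = tokenizer("A quietly devastating film.", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # class probabilities; label order assumed to be [negative, positive]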

Reproduce paper results step by step

1. Train the in-domain LM based on RoBERTa

1.1 Binarize the 3.4M reviews

Modify the shell script according to your paths. The resulting binarized data will be saved in [REVIEWS_PATH]/bin.

cd sstc/tasks/roberta_lm
bash binarize.sh

1.2 Train RoBERTa-large (or RoBERTa-base, as you wish) on the 3.4M reviews

cd sstc/tasks/roberta_lm
python trainer.py \
--roberta_path [VANILLA_ROBERTA_LARGE_PATH] \
--data_dir [REVIEWS_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [PRETRAIN_ROBERTA_CK_PATH] \
--val_check_interval 0.1 \
--precision 16 \
--batch_size 10 \
--distributed_backend=ddp \
--accumulate_grad_batches=50 \
--adam_epsilon 1e-6 \
--weight_decay 0.01 \
--warmup_steps 10000 \
--workers 8 \
--lr 2e-5

Training checkpoints will be saved in [PRETRAIN_ROBERTA_CK_PATH]. Find the best checkpoint and convert it to the HuggingFace bin format; the relevant code can be found in sstc/tasks/roberta_lm/trainer.py. Save the converted bin model at [PRETRAIN_MODELS]/semi-roberta-large, or just download the model we trained.
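
As a rough guide, the conversion can be done along the lines of the sketch below. The checkpoint layout (a "state_dict" entry whose keys carry a "model." prefix) is an assumption about what the Lightning trainer saves; adapt it to the actual keys in your checkpoint, and prefer the conversion code in trainer.py itself.

import torch
from transformers import RobertaForMaskedLM

ckpt = torch.load("[PRETRAIN_ROBERTA_CK_PATH]/best.ckpt", map_location="cpu")  # hypothetical file name
state_dict = {k[len("model."):]: v
              for k, v in ckpt["state_dict"].items()
              if k.startswith("model.")}  # drop the assumed LightningModule prefix

model = RobertaForMaskedLM.from_pretrained("[VANILLA_ROBERTA_LARGE_PATH]")
model.load_state_dict(state_dict, strict=False)
model.save_pretrained("[PRETRAIN_MODELS]/semi-roberta-large")  # writes pytorch_model.bin + config.json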

2. Train the teacher model

2.1 Binarize the IMDB dataset

cd sstc/tasks/semi_roberta/scripts
bash binarize_imdb.sh

Run the above script to binarize the IMDB data, or just use the files we already binarized in [IMDB_DATA_PATH]/bin.

2.2 Train the teacher model

cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

After training, the teacher model checkpoint will be saved in [ROOT_SAVE_PATH]/train_teacher_checkpoint. The teacher model we trained reaches an accuracy of 96.2% on the test set. The download link for the teacher checkpoint can be found in the Quick Tour section.

3. Label the unlabeled in-domain data U

3.1 Label the 3.4M reviews

Use the teacher model trained in the previous step to label the 3.4M reviews. Note that [ROOT_SAVE_PATH] should be the same as in the previous step. The labeled data will be saved in [ROOT_SAVE_PATH]/predictions.

cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_ROBERTA_PATH] \
--reviews_data_path [REVIEWS_PATH]/bin \
--best_teacher_checkpoint_path [CHECKPOINTS]/roberta-large/train_teacher_checkpoint/***.ckpt \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] 

3.2 Select the top-K data points

First, we randomly sample 3M reviews from the 3.4M as U'. Then we select the 1M reviews from U' with the highest teacher scores as D'. Finally, we concatenate the IMDB training data (D) and D' to form the training data for the student model; a sketch of this selection logic follows the command below. The student training data will be saved in [ROOT_SAVE_PATH]/student_data/train.txt, or you can use the data we provide in [STUDENT_DATA_PATH]/student_data_large.

cd sstc/tasks/semi_roberta
python data_selector.py \
--imdb_data_path [IMDB_DATA_PATH] \
--save_path [ROOT_SAVE_PATH] 
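
For intuition, here is a minimal sketch of the selection logic described above. data_selector.py is the authoritative implementation; the example record layout (a "score" field holding the teacher's confidence) is an assumption.

import random

def select_student_data(predictions, imdb_train, sample_size=3_000_000, top_k=1_000_000):
    # U': randomly sample 3M of the 3.4M pseudo-labeled reviews.
    u_prime = random.sample(predictions, sample_size)
    # D': keep the 1M samples the teacher scored most confidently.
    u_prime.sort(key=lambda example: example["score"], reverse=True)
    d_prime = u_prime[:top_k]
    # Student training data: the labeled IMDB train set D plus D'.
    return imdb_train + d_prime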

4. Train the student model

4.1 Binarize the dataset

You can use the same script as in step 2.1 to binarize the student training data in [ROOT_SAVE_PATH]/student_data/train.txt.

4.2 Train the student model

You can use the training data we provide in [STUDENT_DATA_PATH]/student_data_large/bin, or your own training data in [ROOT_SAVE_PATH]/student_data/bin; make sure student_data_path is set accordingly.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

After training, the student model checkpoint will be saved in [ROOT_SAVE_PATH]/train_student_checkpoint. The student model we trained reaches an accuracy of 96.6% on the test set. The download link for the student checkpoint can be found in the Quick Tour section.
