Semi-supervised Learning for Sentiment Analysis

Overview

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining

Code, models and Datasets for《Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining》.

Download Models and Dataset

Datasets and Models are found in the follwing list.

  • Download 3.4M IMDB movie reviews. Save the data at [REVIEWS_PATH]. You can download the dataset HERE.
  • Download the vanilla RoBERTa-large model released by HuggingFace. Save the model at [VANILLA_ROBERTA_LARGE_PATH]. You can download the model HERE.
  • Download in-domain pretrained models in the paper and save the model at [PRETRAIN_MODELS]. We provide three following models. You can download HERE.
    • init-roberta-base: RoBERTa-base model(U) trained over 3.4M movie reviews from scratch.
    • semi-roberta-base: RoBERTa-base model(Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model RoBERTa-base model.
    • semi-roberta-large: RoBERTa-large model(Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model RoBERTa-large model.
  • Download the 1M (D` + D) training dataset for the student model, save the data at [STUDENT_DATA_PATH]. You can download it HERE.
    • student_data_base: student training data generated by roberta-base teacher model
    • student_data_large: student training data generated by roberta-large teacher model
  • Download the IMDB dataset from Andrew Maas' paper. Save the data at [IMDB_DATA_PATH]. For IMDB, The training data and test data are saved in two separate files, each line in the file corresponds to one IMDB sample. You can download HERE.
  • Download shannon_preprocssor.whl to install a binarize tool. Save the .whl file at [SHANNON_PREPROCESS_WHL_PATH]. You can download HERE
  • Download the teacher model and student model that we trained. Save them at [CHECKPOINTS]. You can download HERE
    • roberta-base: teacher and student model checkpoint for roberta-base
    • roberta-large: teacher and student model checkpoint for roberta-large

Installation

pip install -r requirements.txt
pip install [SHANNON_PREPROCESS_WHL_PATH]

Quick Tour

train the roberta-large teacher model

Use the roberta model we pretrained over 3.4M reviews data to train teacher model.
Our teacher model had an accuracy rate of 96.2% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_teacher \
roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

train the roberta-large student model

Use the roberta model we pretrained over 3.4M reviews data to train student model.
Our student model had an accuracy rate of 96.8% on the test set.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

evaluate the student model on the test set

Load student model checkpoint to evaluate over test set to reproduce our result.

cd sstc/tasks/semi-roberta
python evaluate.py \
--checkpoint_path [CHECKPOINTS]/roberta-large/train_student_checkpoint/***.ckpt \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--batch_size=10 \
--gpus=0,

Reproduce paper results step by step

1.Train in-domain LM based on RoBERTa

1.1 binarize 3.4M reviews data

You should modify the shell according to your paths. The result binarize data will be saved in [REVIEWS_PATH]/bin

cd sstc/tasks/roberta_lm
bash binarize.sh

1.2 train RoBERTa-large (or small, as you wish) over 3.4M reviews data

cd sstc/tasks/roberta_lm
python trainer.py \
--roberta_path [VANILLA_ROBERTA_LARGE_PATH] \
--data_dir [REVIEWS_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [PRETRAIN_ROBERTA_CK_PATH] \
--val_check_interval 0.1 \
--precision 16 \
--batch_size 10 \
--distributed_backend=ddp \
--accumulate_grad_batches=50 \
--adam_epsilon 1e-6 \
--weight_decay 0.01 \
--warmup_steps 10000 \
--workers 8 \
--lr 2e-5

Training checkpoints will be saved in [PRETRAIN_ROBERTA_CK_PATH], find the best checkpoint and convert it to HuggingFace bin format, The relevant code can be found in sstc/tasks/roberta_lm/trainer.py. Save the pretrain bin model at [PRETRAIN_MODELS]\semi-roberta-large, or you can just download the model we trained.

2.train the teacher model

2.1 binarize IMDB dataset.

cd sstc/tasks/semi_roberta/scripts
bash binarize_imdb.sh

You can run the above code to binarize IMDB data, or you can just use the file we binarized in [IMDB_DATA_PATH]\bin

2.2 train the teacher model

cd sstc/tasks/semi_roberta
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] \
--precision 16 \
--batch_size 10 \
--min_epochs 10 \
--patience 3 \
--lr 3e-5  

After training, teacher model checkpoint will be save in [ROOT_SAVE_PATH]/train_teacher_checkpoint. The teacher model we trained had an accuracy rate of 96.2% on the test set. The download link of teacher model checkpoint can be found in quick tour part.

3.label the unlabeled in-domain data U

3.1 label 3.4M data

Use the teacher model that you trained in previous step to label 3.4M reviews data, notice that [ROOT_SAVE_PATH] should be the same as previous setting. The labeled data will be save in [ROOT_SAVE_PATH]\predictions.

cd sstc/tasks/roberta_lm
python trainer.py \
--mode train_teacher \
--roberta_path [PRETRAIN_ROBERTA_PATH] \
--reviews_data_path [REVIEWS_PATH]/bin \
--best_teacher_checkpoint_path [CHECKPOINTS]/roberta-large/train_teacher_checkpoint/***.ckpt \
--gpus=0,1,2,3 \
--save_path [ROOT_SAVE_PATH] 

3.2 select the top-K data points

Firstly, we random sample 3M data from 3.4M reviews data as U', then we select 1M data from U' with the highest score as D', finally, we concat the IMDB train data(D) and D' as train data for student model. The student train data will be saved in [ROOT_SAVE_PATH]\student_data\train.txt, or you can use the data we provide in [STUDENT_DATA_PATH]/student_data_large

cd sstc/tasks/roberta_lm
python data_selector.py \
--imdb_data_path [IMDB_DATA_PATH] \
--save_path [ROOT_SAVE_PATH] 

4.train the student model

4.1 binarize the dataset

You can use the same script in 3.1 to binarize student train data in [ROOT_SAVE_PATH]\student_data\train.txt

4.1 train the student model

use can use the training data we provide in [STUDENT_DATA_PATH]/student_data_large/bin or use your own training data in [ROOT_SAVE_PATH]\student_data\bin, make sure you set the right student_data_path.

cd sstc/tasks/semi-roberta
python trainer.py \
--mode train_student \
--roberta_path [PRETRAIN_MODELS]\semi-roberta-large \
--imdb_data_path [IMDB_DATA_PATH]/bin \
--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \
--save_path [ROOT_SAVE_PATH] \
--batch_size=10 \
--precision 16 \
--lr=2e-5 \
--warmup_steps 40000 \
--gpus=0,1,2,3,4,5,6,7 \
--accumulate_grad_batches=50

After training, student model checkpoint will be save in [ROOT_SAVE_PATH]/train_student_checkpoint. The student model we trained had an accuracy rate of 96.6% on the test set. The download link of student model checkpoint can be found in Quick tour part.

Learning Multiresolution Matrix Factorization and its Wavelet Networks on Graphs

Project Learning Multiresolution Matrix Factorization and its Wavelet Networks on Graphs, https://arxiv.org/pdf/2111.01940.pdf. Authors Truong Son Hy

5 Jun 28, 2022
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR)

Ilya Kostrikov 3k Dec 31, 2022
Training deep models using anime, illustration images.

animeface deep models for anime images. Datasets anime-face-dataset Anime faces collected from Getchu.com. Based on Mckinsey666's dataset. 63.6K image

Tomoya Sawada 61 Dec 25, 2022
High-quality implementations of standard and SOTA methods on a variety of tasks.

Uncertainty Baselines The goal of Uncertainty Baselines is to provide a template for researchers to build on. The baselines can be a starting point fo

Google 1.1k Dec 30, 2022
the code of the paper: Recurrent Multi-view Alignment Network for Unsupervised Surface Registration (CVPR 2021)

RMA-Net This repo is the implementation of the paper: Recurrent Multi-view Alignment Network for Unsupervised Surface Registration (CVPR 2021). Paper

Wanquan Feng 205 Nov 09, 2022
A python-image-classification web application project, written in Python and served through the Flask Microframework

A python-image-classification web application project, written in Python and served through the Flask Microframework. This Project implements the VGG16 covolutional neural network, through Keras and

Gerald Maduabuchi 19 Dec 12, 2022
SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning

SPCL SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning Update on 2021/11/25: ArXiv Ver

Binhui Xie (谢斌辉) 11 Oct 29, 2022
Code for our ALiBi method for transformer language models.

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation This repository contains the code and models for our paper Tra

Ofir Press 211 Dec 31, 2022
Implementation for our ICCV 2021 paper: Dual-Camera Super-Resolution with Aligned Attention Modules

DCSR: Dual Camera Super-Resolution Implementation for our ICCV 2021 oral paper: Dual-Camera Super-Resolution with Aligned Attention Modules paper | pr

Tengfei Wang 110 Dec 20, 2022
Checking fibonacci - Generating the Fibonacci sequence is a classic recursive problem

Fibonaaci Series Generating the Fibonacci sequence is a classic recursive proble

Moureen Caroline O 1 Feb 15, 2022
Unsupervised captioning - Code for Unsupervised Image Captioning

Unsupervised Image Captioning by Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo Introduction Most image captioning models are trained using paired image-se

Yang Feng 207 Dec 24, 2022
Defending graph neural networks against adversarial attacks (NeurIPS 2020)

GNNGuard: Defending Graph Neural Networks against Adversarial Attacks Authors: Xiang Zhang ( Zitnik Lab @ Harvard 44 Dec 07, 2022

Python wrapper of LSODA (solving ODEs) which can be called from within numba functions.

numbalsoda numbalsoda is a python wrapper to the LSODA method in ODEPACK, which is for solving ordinary differential equation initial value problems.

Nick Wogan 52 Jan 09, 2023
(NeurIPS 2020) Wasserstein Distances for Stereo Disparity Estimation

Wasserstein Distances for Stereo Disparity Estimation Accepted in NeurIPS 2020 as Spotlight. [Project Page] Wasserstein Distances for Stereo Disparity

Divyansh Garg 92 Dec 12, 2022
[CVPR 2021] Official PyTorch Implementation for "Iterative Filter Adaptive Network for Single Image Defocus Deblurring"

IFAN: Iterative Filter Adaptive Network for Single Image Defocus Deblurring Checkout for the demo (GUI/Google Colab)! The GUI version might occasional

Junyong Lee 173 Dec 30, 2022
Pretrained models for Jax/Haiku; MobileNet, ResNet, VGG, Xception.

Pre-trained image classification models for Jax/Haiku Jax/Haiku Applications are deep learning models that are made available alongside pre-trained we

Alper Baris CELIK 14 Dec 20, 2022
PIXIE: Collaborative Regression of Expressive Bodies

PIXIE: Collaborative Regression of Expressive Bodies [Project Page] This is the official Pytorch implementation of PIXIE. PIXIE reconstructs an expres

Yao Feng 331 Jan 04, 2023
Repo for our ICML21 paper Unsupervised Learning of Visual 3D Keypoints for Control

Unsupervised Learning of Visual 3D Keypoints for Control [Project Website] [Paper] Boyuan Chen1, Pieter Abbeel1, Deepak Pathak2 1UC Berkeley 2Carnegie

Boyuan Chen 34 Jul 22, 2022
ColossalAI-Benchmark - Performance benchmarking with ColossalAI

Benchmark for Tuning Accuracy and Efficiency Overview The benchmark includes our

HPC-AI Tech 31 Oct 07, 2022
A tensorflow implementation of an HMM layer

tensorflow_hmm Tensorflow and numpy implementations of the HMM viterbi and forward/backward algorithms. See Keras example for an example of how to use

Zach Dwiel 283 Oct 19, 2022