SAS: Self-Augmentation Strategy for Language Model Pre-training

This repository contains the official PyTorch implementation of the paper "SAS: Self-Augmentation Strategy for Language Model Pre-training", based on Hugging Face transformers version 4.3.0.

Currently, only SAS without the disentangled attention mechanism is released. The repository will be updated.

File structure

train.py: The pre-training script.
run_glue.py: The fine-tuning script.
models
- modeling_sas.py: The main SAS algorithm.
- trainer_sas.py: Inherited from Hugging Face transformers and modified mainly for data processing.
utils: All utilities.
- data_collator_sas.py: The details of the self-augmentations.
The remaining files are supporting code.

How to

Download and Install

Clone this repository.
Download the wiki-corpus dataset and store it in the data folder. Currently, we only provide a trial dataset with 1 million sentences. The full dataset can be pre-processed following BERT; details are to be released.

(Optional) Create a conda environment from the provided environment.yml.
- You can also install the packages manually:
  - Python==3.9, pytorch==1.10.0, transformers==4.3.0, etc.
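
For a manual install, a quick check that the installed versions match the pins above can save a confusing failure later. The following is a minimal sketch, not part of this repository: the helper name `find_mismatches` is hypothetical, and it assumes the packages were installed under their usual distribution names `torch` and `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version
import sys

# Pins copied from the list above. The distribution names are an assumption
# about how the packages were installed.
EXPECTED = {"torch": "1.10.0", "transformers": "4.3.0"}
EXPECTED_PYTHON = (3, 9)


def find_mismatches(expected):
    """Return {package: (expected, installed)} for every package whose version differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Compare only the public version so local tags such as "+cu113" are ignored.
        public = installed.split("+")[0] if installed else None
        if public != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    if sys.version_info[:2] != EXPECTED_PYTHON:
        print(f"Python {sys.version_info[0]}.{sys.version_info[1]} found, 3.9 expected")
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

The script prints nothing when every pin matches; a missing package is reported with `None` as the installed version.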

    # Clone package
    git clone git@github.com:fei960922/SAS-Self-Augmentation-Strategy.git
    cd SAS-Self-Augmentation-Strategy

    # Establish the environment.
    conda env create -f environment.yml 
    conda activate cssl

    # Download the trial dataset (store it in the data folder)
    wget http://www.stat.ucla.edu/~yifeixu/sas/wiki_corpus_1M.npy
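
Before starting a long pre-training run, it can help to confirm that the download is intact and to see how the array is laid out. The sketch below is an illustration only: the internal format of `wiki_corpus_1M.npy` is not documented here, the helper name `describe_corpus` is hypothetical, and `allow_pickle=True` is an assumption that only matters if the file stores Python objects.

```python
import numpy as np


def describe_corpus(path):
    """Load a .npy file and summarize its layout without assuming its contents."""
    # allow_pickle=True is an assumption: it is only required if the file stores
    # Python objects (for example variable-length sentences) rather than a
    # numeric array. Only enable it for files from a source you trust.
    corpus = np.load(path, allow_pickle=True)
    return {
        "shape": corpus.shape,
        "dtype": str(corpus.dtype),
        "num_items": corpus.shape[0] if corpus.ndim > 0 else 1,
    }


# Example call, assuming the file was stored in the data folder:
# print(describe_corpus("data/wiki_corpus_1M.npy"))
```

If the reported shape or dtype looks unexpected, re-download the file before debugging the training script.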

Train from scratch

    # Run with the default settings
    bash script/pretrain.sh

    # Run with custom settings
    python train.py

    # Start from a checkpoint
    python train.py --start_from_checkpoint 1 --pretrain_path {PATH_TO_CHECKPOINT}

Calculate GLUE scores

    # Running this script downloads the GLUE dataset automatically.
    bash finetune.sh MNLI 0 sas-base output_dir 5e-5 32 4 42
    bash finetune.sh MNLI 0 sas-small output_dir 1e-4 32 4 42
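
To fine-tune on several tasks in a row, the command lines can be generated rather than typed by hand. This is a rough sketch under stated assumptions: the helper `build_finetune_commands` is hypothetical, the arguments after the task name are copied verbatim from the sas-base example because their individual meanings are not documented here, and any task name other than MNLI is an assumption about what `finetune.sh` accepts.

```python
import shlex

# Arguments after the task name, copied verbatim from the sas-base example above.
# Their individual meanings are not documented here, so they are passed through unchanged.
BASE_ARGS = ["0", "sas-base", "output_dir", "5e-5", "32", "4", "42"]


def build_finetune_commands(tasks, trailing_args=BASE_ARGS):
    """Return one `bash finetune.sh ...` argument list per task."""
    return [["bash", "finetune.sh", task, *trailing_args] for task in tasks]


if __name__ == "__main__":
    # Task names other than MNLI are assumptions about what finetune.sh accepts.
    for cmd in build_finetune_commands(["MNLI", "QNLI", "SST-2"]):
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before launching anything. Note that every generated command reuses the same `output_dir`, so adjust the trailing arguments per task if the script writes results there.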
