PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI

Overview

data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI (FAIR)

Data2Vec is the first high-performance self-supervised algorithm that learns the same way in multiple modalities, including speech, vision and text. Most machines learn exclusively from labeled data. However, through self-supervised learning, machines are able to learn about the world just by observing it and then figuring out the structure of images, speech or text. This is a more scalable and efficient approach for machines to tackle new complex tasks, such as understanding text for more spoken languages.

In summary, the method is as follows:

  1. The encoder extracts features from the masked inputs. These features are outputs of every transformer/linear layer.
  2. The teacher which is an EMA instance of the encoder (in eval model), extracts features from the unmasked inputs.
  3. Optional normalizations are applied to the layers/outputs of the teacher.
  4. Encoder outputs are regressed by a projection block/layer.
  5. The loss is calculated from encoder outputs and teacher outputs.

You can read the paper for more detail.

Implementation

Data2Vec is already implemented in fairseq in which for all modalities there is a seperate implementation (text, vision, audio). According to the paper:

Our primary is to design a single learning mechanism for different modalities. Despite the unified learning regime, we still use modality-specific features extractors and masking strategies. This makes sense given the vastly different nature of the input data.

This implementation differs in the fact that a single Data2Vec model is provided powered by a custom encoder (implemented using PyTorch + HuggingFace Transformers) and tries to unify the whole concept in a single module. The key concept is that there must be modality-specific feature extractions and masking strategies.

  • Masking: For each modality, the Dataset instance must return the masked source, the target and the mask tensor.

  • Feature Extraction: Features are the outputs from the transformer/attention layers. So the forward method must return outputs from all Encoder blocks of the transformer model. HuggingFace Transformers/Fairseq models return transformer layers outputs separately out of the box.

This implementation uses HuggingFace Transformers models as encoders for Data2Vec which you can inspect in the encoder.py files for each modality. Although, you can provide your own encoder model. Just make sure that your encoder must be Transformer-based according to the paper and outputs from every encoder layer must be provided.

Note: This implementation's goal is to provide the necessary building blocks of Data2Vec so anyone can adapt it to their own use case with ease, so in order to make it easy to get hands on, some functionalities like mixed precision, distributed training, etc are not included to keep it as clean & simple as possible. If you only need to train a standard large scale Data2Vec model use the official repo.

Train

First things first, install the requirements:

pip install -r requirements.txt

NLP

Train a Language Model based on RoBERTa (HuggingFace) on WikiText103

Configure the related properties in text/configs/roberta-pretraining.yaml and run:

python train.py --config text/configs/roberta-pretraining.yaml 

Vision

Run a Masked Image modeling training based on BEiT (HuggingFace)

Pass the path to the image dataset in the config file at vision/configs/beit-pretraining.yaml under dataset > path > train/test and modify other properties as you desire and run the following:

python train.py --config vision/configs/beit-pretraining.yaml 

Speech

Audio pretraining based on Wav2Vec2 (HuggingFace) on timit dataset. If you want to use other datasets like librispeech provide it in audio/dataset.py (some minor changes to the timit class would do the job because both are loaded from HuggingFace datasets)

Configure other properties as you desire and run the following:

python train.py --config audio/configs/wav2vec2-pretraining.yaml 

Pre-trained Weights

The models are available on HuggingFace Hub and you can use them like below:

RoBERTa

Data2Vec model trained with RoBERTa as the encoder (data2vec-roberta-base)

from transformers import AutoModel, AutoConfig
from transformers import RobertaModel

checkpoint = 'arxyzan/data2vec-roberta-base'

# Option 1: load using AutoModel
data2vec_roberta = AutoModel.from_pretrained(checkpoint)

# Option 2: load directly by RobertaModel
data2vec_roberta = RobertaModel.from_pretrained(checkpoint)

BEiT

Data2Vec model trained with BEiT as the encoder (data2vec-beit-base)

from transformers import AutoModel, AutoConfig
from transformers import BeitModel

checkpoint = 'arxyzan/data2vec-beit-base'

# Option 1: load using AutoModel
data2vec_beit = AutoModel.from_pretrained(checkpoint)

# Option 2: load directly by BeitModel
data2vec_beit = BeitModel.from_pretrained(checkpoint)

Wav2Vec2

Data2Vec model trained with Wav2Vec2 as the encoder (data2vec-wav2vec2-base)

from transformers import AutoModel, AutoConfig
from transformers import Wav2Vec2Model

checkpoint = 'arxyzan/data2vec-wav2vec2-base'

# Option 1: load using AutoModel
data2vec_wav2vec2 = AutoModel.from_pretrained(checkpoint)

# Option 2: load directly by Wav2Vec2Model
data2vec_wav2vec2 = Wav2Vec2Model.from_pretrained(checkpoint)

Note: The above models' weights were carefully ported from the original checkpoints in the fairseq version.

Fine-tuning

  1. Fine-tune using the checkpoints mentioned above:
# Text classification using Roberta model from HuggingFace
from transformers import RobertaModel, RobertaForSequenceClassification

checkpoint = 'arxyzan/data2vec-roberta-base'
# this is exactly a roberta model but trained with data2vec
data2vec_roberta = RobertaModel.from_pretrained(checkpoint)
text_classifier = RobertaForSequenceClassification(data2vec_roberta.config)
# assign `data2vec-roberta` weights to the roberta block of the classifier
text_classifier.roberta = data2vec_roberta
...
  1. In case you trained a model using this codebase, you can fine-tune it by taking out the encoder's state dict from the checkpoint which gives you a HuggingFace model and you can fine-tune it for any downstream task as you'd normally do for HuggingFace models.
# load a checkpoint for finetuning
from transformers import RobertaModel, RobertaConfig
roberta = RobertaModel(RobertaConfig())
checkpoint = torch.load('path/to/data2vec.pt')
roberta_state_dict = checkpoint['encoder']
# load roberta weights from the encoder part of the data2vec model
encoder = roberta.load_state_dict(roberta_state_dict)

# Now fine-tune a regular HuggingFace RoBERTa model
...

Contributions

Any contribution regarding training, development and issues are welcome!

Owner
Aryan Shekarlaban
Deep Learning Developer & Researcher
Aryan Shekarlaban
Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification"

PTR Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification" If you use the code, please cite the following paper: @art

THUNLP 118 Dec 30, 2022
A framework for cleaning Chinese dialog data

A framework for cleaning Chinese dialog data

Yida 136 Dec 20, 2022
SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering.

SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering. Contents Inst

0 Oct 21, 2021
DVC-NLP-Simple-usecase

dvc-NLP-simple-usecase DVC NLP project Reference repository: official reference repo DVC STUDIO MY View Bag of Words- Krish Naik TF-IDF- Krish Naik ST

SUNNY BHAVEEN CHANDRA 2 Oct 02, 2022
Official PyTorch Implementation of paper "NeLF: Neural Light-transport Field for Single Portrait View Synthesis and Relighting", EGSR 2021.

NeLF: Neural Light-transport Field for Single Portrait View Synthesis and Relighting Official PyTorch Implementation of paper "NeLF: Neural Light-tran

Ken Lin 38 Dec 26, 2022
A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

AI2 1.3k Jan 03, 2023
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 07, 2022
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
scikit-learn wrappers for Python fastText.

skift scikit-learn wrappers for Python fastText. from skift import FirstColFtClassifier df = pandas.DataFrame([['woof', 0], ['meow', 1]], colu

Shay Palachy 233 Sep 09, 2022
Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

Dat Quoc Nguyen 152 Sep 02, 2022
simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Quickly train T5 models in just 3 lines of code + ONNX support simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quic

Shivanand Roy 220 Dec 30, 2022
Seonghwan Kim 24 Sep 11, 2022
This is a project of data parallel that running on NLP tasks.

This is a project of data parallel that running on NLP tasks.

2 Dec 12, 2021
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Moment-DETR QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries Jie Lei, Tamara L. Berg, Mohit Bansal For dataset de

Jie Lei 雷杰 133 Dec 22, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Antlr Project 13.6k Jan 05, 2023
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

RoBERTaABSA This repo contains the code for NAACL 2021 paper titled Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoB

106 Nov 28, 2022
Estimation of the CEFR complexity score of a given word, sentence or text.

NLP-Swedish … allows to estimate CEFR (Common European Framework of References) complexity score of a given word, sentence or text. CEFR scores come f

3 Apr 30, 2022
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets What is LASSL • How to Use What is LASSL LASSL은 LAnguage Semi-Super

LASSL: LAnguage Self-Supervised Learning 116 Dec 27, 2022