Code for our paper "SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge" (EMNLP 2020)

Overview

SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge

Introduction

SentiLARE is a sentiment-aware pre-trained language model enhanced with linguistic knowledge. See our paper for more details. This project is a PyTorch implementation of our work.

Dependencies

  • Python 3
  • NumPy
  • Scikit-learn
  • PyTorch >= 1.3.0
  • PyTorch-Transformers (Huggingface) 1.2.0
  • TensorboardX
  • Sentence Transformers 0.2.6 (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
  • NLTK (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
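
If it helps, a typical environment can be set up with pip as sketched below. The version pins mirror the list above, but compatible versions may differ on your machine, so treat this as a sketch rather than a tested install recipe:

pip install numpy scikit-learn tensorboardX
pip install "torch>=1.3.0"
pip install pytorch-transformers==1.2.0
# Optional, for linguistic knowledge acquisition during pre-training and fine-tuning:
pip install sentence-transformers==0.2.6 nltk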

Quick Start for Fine-tuning

Datasets of Downstream Tasks

Our experiments cover sentence-level sentiment classification (e.g., SST / MR / IMDB / Yelp-2 / Yelp-5) and aspect-level sentiment analysis (e.g., Lap14 / Res14 / Res16). You can download the pre-processed datasets (Google Drive / Tsinghua Cloud) for the downstream tasks. A detailed description of the data formats is attached to the datasets.

Fine-tuning

To quickly run the fine-tuning experiments, you can directly download the checkpoint (Google Drive / Tsinghua Cloud) of our pre-trained model. An example of fine-tuning SentiLARE on SST follows:

cd finetune
CUDA_VISIBLE_DEVICES=0,1,2 python run_sent_sentilr_roberta.py \
          --data_dir data/sent/sst \
          --model_type roberta \
          --model_name_or_path pretrain_model/ \
          --task_name sst \
          --do_train \
          --do_eval \
          --max_seq_length 256 \
          --per_gpu_train_batch_size 4 \
          --learning_rate 2e-5 \
          --num_train_epochs 3 \
          --output_dir sent_finetune/sst \
          --logging_steps 100 \
          --save_steps 100 \
          --warmup_steps 100 \
          --eval_all_checkpoints \
          --overwrite_output_dir

Note that data_dir is set to the directory of the pre-processed SST dataset, and model_name_or_path is set to the directory of the pre-trained model checkpoint. output_dir is the directory where the fine-tuning checkpoints are saved. Refer to the fine-tuning code for descriptions of the other hyper-parameters.

More details about fine-tuning SentiLARE on other datasets can be found in finetune/README.MD.

POS Tagging and Polarity Acquisition for Downstream Tasks

During pre-processing, we tokenize the original datasets with NLTK, tag the sentences with the Stanford Log-Linear Part-of-Speech Tagger, and obtain the sentiment polarity with Sentence-BERT.
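
The snippet below is only a rough, hypothetical illustration of these two steps, not the released pre-processing script: it substitutes NLTK's built-in tagger for the Stanford Log-Linear Tagger, and the polarity heuristic (comparing each sentence against two reference sentences) is an assumption made for demonstration purposes.

import nltk
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# NLTK resources for tokenization and (stand-in) POS tagging.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def tag_sentence(sentence):
    # Tokenize with NLTK and attach a part-of-speech tag to each token.
    # The actual pipeline uses the Stanford Log-Linear POS Tagger instead.
    return nltk.pos_tag(nltk.word_tokenize(sentence))

# Sentence-BERT yields sentence embeddings that can be compared for polarity.
model = SentenceTransformer("bert-base-nli-mean-tokens")

def polarity_score(sentence, pos_ref="This is great.", neg_ref="This is terrible."):
    # Hypothetical heuristic: similarity to a positive reference minus
    # similarity to a negative one; > 0 leans positive, < 0 leans negative.
    emb, pos_emb, neg_emb = model.encode([sentence, pos_ref, neg_ref])
    pos_sim = cosine_similarity([emb], [pos_emb])[0, 0]
    neg_sim = cosine_similarity([emb], [neg_emb])[0, 0]
    return pos_sim - neg_sim

print(tag_sentence("The movie was surprisingly good."))
print(polarity_score("The movie was surprisingly good."))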

Pre-training

If you want to conduct pre-training yourself instead of directly using the checkpoint we provide, this section will help you pre-process the pre-training dataset and run the pre-training scripts.

Dataset

We use the Yelp Dataset Challenge 2019 as our pre-training dataset. According to the Terms of Use of the Yelp dataset, you need to download the Yelp dataset on your own.

POS Tagging and Polarity Acquisition for Pre-training Dataset

Similar to fine-tuning, we also conduct part-of-speech tagging and sentiment polarity acquisition on the pre-training dataset. Note that since the pre-training dataset is quite large, pre-processing may take a long time: Sentence-BERT needs to compute representation vectors for all the sentences in the pre-training dataset.
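
Because encoding every sentence dominates this pre-processing time, it helps to batch the Sentence-BERT calls and run them on a GPU. A minimal sketch, assuming a hypothetical one-sentence-per-line corpus file:

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical input file: one Yelp sentence per line (the path is an assumption).
with open("yelp_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

model = SentenceTransformer("bert-base-nli-mean-tokens")

# encode() batches internally; a larger batch size speeds up GPU inference
# at the cost of memory.
embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)
np.save("yelp_sentence_embeddings.npy", np.asarray(embeddings))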

Pre-training

Refer to pretrain/README.MD for more implementation details about pre-training.

Citation

@inproceedings{ke-etal-2020-sentilare,
    title = "{S}enti{LARE}: Sentiment-Aware Language Representation Learning with Linguistic Knowledge",
    author = "Ke, Pei  and Ji, Haozhe  and Liu, Siyang  and Zhu, Xiaoyan  and Huang, Minlie",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "6975--6988",
}

Please cite our paper if the paper and the code are helpful to you.

Thanks

Many thanks to the GitHub repositories of Transformers and BERT-PT. Part of our code is modified based on theirs.

Code for paper "Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs"

This is the codebase for the paper: Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs Directory Structur

Peter Hase 19 Aug 21, 2022
Useful materials and tutorials for 110-1 NTU DBME5028 (Application of Deep Learning in Medical Imaging)

Useful materials and tutorials for 110-1 NTU DBME5028 (Application of Deep Learning in Medical Imaging)

7 Jun 22, 2022
Joint Unsupervised Learning (JULE) of Deep Representations and Image Clusters.

Joint Unsupervised Learning (JULE) of Deep Representations and Image Clusters. Overview This project is a Torch implementation for our CVPR 2016 paper

Jianwei Yang 278 Dec 25, 2022
Everything you need to know about NumPy( Creating Arrays, Indexing, Math,Statistics,Reshaping).

Everything you need to know about NumPy( Creating Arrays, Indexing, Math,Statistics,Reshaping).

1 Feb 14, 2022
Python scripts for performing stereo depth estimation using the HITNET Tensorflow model.

HITNET-Stereo-Depth-estimation Python scripts for performing stereo depth estimation using the HITNET Tensorflow model from Google Research. Stereo de

Ibai Gorordo 76 Jan 02, 2023
Pytorch implementation of SimSiam Architecture

SimSiam-pytorch A simple pytorch implementation of Exploring Simple Siamese Representation Learning which is developed by Facebook AI Research (FAIR)

Saeed Shurrab 1 Oct 20, 2021
Code for technical report "An Improved Baseline for Sentence-level Relation Extraction".

RE_improved_baseline Code for technical report "An Improved Baseline for Sentence-level Relation Extraction". Requirements torch = 1.8.1 transformers

Wenxuan Zhou 74 Nov 29, 2022
This tutorial repository is to introduce the functionality of KGTK to first-time users

Welcome to the KGTK notebook tutorial The goal of this tutorial repository is to introduce the functionality of KGTK to first-time users. The Knowledg

USC ISI I2 58 Dec 21, 2022
Free like Freedom

This is all very much a work in progress! More to come! ( We're working on it though! Stay tuned!) Installation Open an Anaconda Prompt (in Windows, o

2.3k Jan 04, 2023
Bootstrapped Representation Learning on Graphs

Bootstrapped Representation Learning on Graphs This is the PyTorch implementation of BGRL Bootstrapped Representation Learning on Graphs The main scri

NerDS Lab :: Neural Data Science Lab 55 Jan 07, 2023
A best practice for tensorflow project template architecture.

A best practice for tensorflow project template architecture.

Mahmoud Gamal Salem 3.6k Dec 22, 2022
AI Face Mesh: This is a simple face mesh detection program based on Artificial intelligence.

AI Face Mesh: This is a simple face mesh detection program based on Artificial Intelligence which made with Python. It's able to detect 468 different

Md. Rakibul Islam 1 Jan 13, 2022
render sprites into your desktop environment as shaped windows using GTK

spritegtk render static or animated sprites into your desktop environment as dynamic shaped windows using GTK requires pycairo and PYGobject: pip inst

hermit 20 Oct 27, 2022
Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Kai Zhang 1.2k Dec 29, 2022
Training data extraction on GPT-2

Training data extraction from GPT-2 This repository contains code for extracting training data from GPT-2, following the approach outlined in the foll

Florian Tramer 62 Dec 07, 2022
ChebLieNet, a spectral graph neural network turned equivariant by Riemannian geometry on Lie groups.

ChebLieNet: Invariant spectral graph NNs turned equivariant by Riemannian geometry on Lie groups Hugo Aguettaz, Erik J. Bekkers, Michaƫl Defferrard We

haguettaz 12 Dec 10, 2022
Locally Most Powerful Bayesian Test for Out-of-Distribution Detection using Deep Generative Models

LMPBT Supplementary code for the Paper entitled ``Locally Most Powerful Bayesian Test for Out-of-Distribution Detection using Deep Generative Models"

1 Sep 29, 2022
Implementation of UNet on the Joey ML framework

Independent Research Project - Code Joey can be cloned from here https://github.com/devitocodes/joey/. Devito and other dependencies such as PyTorch a

Navjot Kukreja 1 Oct 21, 2021
A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains (IJCV submission)

wsss-analysis The code of: A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains, arXiv pre-print 2019 paper.

Lyndon Chan 48 Dec 18, 2022
DrQ-v2: Improved Data-Augmented Reinforcement Learning

DrQ-v2: Improved Data-Augmented RL Agent Method DrQ-v2 is a model-free off-policy algorithm for image-based continuous control. DrQ-v2 builds on DrQ,

Facebook Research 234 Jan 01, 2023