PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Last update: Apr 29, 2022

Overview

Feature_CRF_AE

Feature_CRF_AE provides an implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging":

@inproceedings{zhou-etal-2022-Bridging,
  title     = {Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging},
  author    = {Zhou, Houquan and Li, Yang and Li, Zhenghua and Zhang, Min},
  booktitle = {Findings of ACL},
  year      = {2022},
  url       = {?},
  pages     = {?--?}
}

Please contact Jacob_Zhou \at outlook.com if you have any questions.

Contents
Installation
Performance
Usage

Installation

Feature_CRF_AE can be installed from source:

$ git clone https://github.com/Jacob-Zhou/FeatureCRFAE && cd FeatureCRFAE
$ bash scripts/setup.sh

The following requirements are installed by scripts/setup.sh:

python: 3.7
allennlp: 1.2.2
pytorch: 1.6.0
transformers: 3.5.1
h5py: 3.1.0
matplotlib: 3.3.1
nltk: 3.5
numpy: 1.19.1
overrides: 3.1.0
scikit_learn: 1.0.2
seaborn: 0.11.0
tqdm: 4.49.0
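
After running scripts/setup.sh, it can be useful to confirm that the active environment matches the pinned versions. The following is a minimal sketch, not part of the repository: the function name find_mismatches is hypothetical, and the mapping of "pytorch" to the pip name "torch" and "scikit_learn" to "scikit-learn" is an assumption about how the packages are distributed.

```python
"""Compare installed package versions with the versions pinned in this README."""
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7: fall back to setuptools' pkg_resources
    import pkg_resources

    PackageNotFoundError = pkg_resources.DistributionNotFound

    def version(name):
        return pkg_resources.get_distribution(name).version

# Versions copied from the requirements list above. Keys are pip distribution
# names (assumption: "pytorch" is "torch", "scikit_learn" is "scikit-learn").
PINNED = {
    "allennlp": "1.2.2",
    "torch": "1.6.0",
    "transformers": "3.5.1",
    "h5py": "3.1.0",
    "matplotlib": "3.3.1",
    "nltk": "3.5",
    "numpy": "1.19.1",
    "overrides": "3.1.0",
    "scikit-learn": "1.0.2",
    "seaborn": "0.11.0",
    "tqdm": "4.49.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {name: (expected, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in sorted(find_mismatches(PINNED).items()):
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-specific PyTorch wheel) is reported as a mismatch even when the base version is correct.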

For WSJ data, we use the ELMo representations of elmo_2x4096_512_2048cnn_2xhighway_5.5B from AllenNLP. For UD data, we use the ELMo representations released by HIT-SCIR.

The corresponding data and ELMo models can be downloaded as follows:

# 1) UD data and ELMo models:
$ bash scripts/prepare_data.sh
# 2) UD data, ELMo models as well as WSJ data 
#    [please replace ~/treebank3/parsed/mrg/wsj/ with your path to LDC99T42]
$ bash scripts/prepare_data.sh ~/treebank3/parsed/mrg/wsj/

Performance

WSJ-All

Seed	M-1	1-1	VM
0	84.29	70.03	78.43
1	82.34	64.42	77.27
2	84.68	62.78	77.83
3	82.55	65.00	77.35
4	82.20	66.69	77.33
Avg.	83.21	65.78	77.64
Std.	1.18	2.75	0.49
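
The Avg. and Std. rows can be recomputed from the per-seed scores. The sketch below uses only the Python standard library and the WSJ-All numbers from the table above; the helper name summarize is hypothetical. The reported Std. values are consistent with the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
from statistics import mean, stdev

# Per-seed WSJ-All scores copied from the table above (seeds 0 to 4).
WSJ_ALL = {
    "M-1": [84.29, 82.34, 84.68, 82.55, 82.20],
    "1-1": [70.03, 64.42, 62.78, 65.00, 66.69],
    "VM": [78.43, 77.27, 77.83, 77.35, 77.33],
}


def summarize(scores):
    """Return (mean, sample standard deviation), both rounded to two decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)


if __name__ == "__main__":
    for metric, scores in WSJ_ALL.items():
        avg, std = summarize(scores)
        print(f"{metric}\t{avg:.2f}\t{std:.2f}")
```

Replacing the lists with the WSJ-Test scores summarizes that table in the same way.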

WSJ-Test

Seed	M-1	1-1	VM
0	81.99	64.84	76.86
1	82.52	61.46	76.13
2	82.33	61.15	75.13
3	78.11	58.80	72.94
4	82.05	61.68	76.21
Avg.	81.40	61.59	75.45
Std.	1.85	2.15	1.54
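
The column headers are not defined in this README. Assuming M-1 denotes many-to-one accuracy as it is commonly used in unsupervised POS tagging (each induced tag is mapped to the gold tag it co-occurs with most often, and accuracy is computed after that mapping), the metric can be sketched as follows. This is an illustration with a hypothetical function name and toy data, not the repository's evaluation code; 1-1 (one-to-one accuracy) and VM (V-measure) are not covered here.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(pred, gold):
    """Map each induced tag to its most frequent gold tag, then score accuracy."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # For every induced tag, count how often it co-occurs with each gold tag.
    gold_counts = defaultdict(Counter)
    for p, g in zip(pred, gold):
        gold_counts[p][g] += 1
    # Under the many-to-one mapping, the tokens counted as correct for an
    # induced tag are exactly those carrying its most frequent gold tag.
    correct = sum(c.most_common(1)[0][1] for c in gold_counts.values())
    return correct / len(gold)


if __name__ == "__main__":
    # Toy data: induced tag 0 is split between DT and NN, so one token is wrong.
    pred = [0, 0, 1, 1, 2, 2]
    gold = ["DT", "NN", "NN", "NN", "VB", "VB"]
    print(f"M-1 = {many_to_one_accuracy(pred, gold):.4f}")
```

Because several induced tags may map to the same gold tag, this score can only stay the same or increase as the number of induced tags grows, which is one reason it is usually reported alongside 1-1 and VM.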

Usage

We give some examples in scripts/examples.sh. Before running the code, activate the virtual environment with:

$ . scripts/set_environment.sh

Training

To train a model from scratch, we recommend the command-line interface, which is more flexible and customizable. Here are some training examples:

$ python -u -m tagger.cmds.crf_ae train \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --train data/wsj/total.conll \
    --evaluate data/wsj/total.conll \
    --path save/crf_ae_wsj

$ python -u -m tagger.cmds.crf_ae train \
    --conf configs/crf_ae.ini \
    --ud-mode \
    --ud-feature \
    --ignore-capitalized \
    --language-specific-strip \
    --feat-min-freq 14 \
    --language de \
    --encoder elmo \
    --plm elmo_models/de \
    --train data/ud/de/total.conll \
    --evaluate data/ud/de/total.conll \
    --path save/crf_ae_de
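
When training on several UD languages, the German command above can be generated from a template instead of being edited by hand. The sketch below only builds the argument list. Every flag is copied from the German example, but the helper name build_ud_train_cmd is hypothetical, and both the per-language directory layout (elmo_models/<lang>, data/ud/<lang>/total.conll) and the reuse of --feat-min-freq 14 for other languages are assumptions that this README does not confirm.

```python
import shlex


def build_ud_train_cmd(lang, feat_min_freq=14):
    """Return the argv list for training on one UD language.

    Flags are copied from the German example; the per-language paths and the
    default feat_min_freq for languages other than "de" are assumptions.
    """
    return [
        "python", "-u", "-m", "tagger.cmds.crf_ae", "train",
        "--conf", "configs/crf_ae.ini",
        "--ud-mode",
        "--ud-feature",
        "--ignore-capitalized",
        "--language-specific-strip",
        "--feat-min-freq", str(feat_min_freq),
        "--language", lang,
        "--encoder", "elmo",
        "--plm", f"elmo_models/{lang}",
        "--train", f"data/ud/{lang}/total.conll",
        "--evaluate", f"data/ud/{lang}/total.conll",
        "--path", f"save/crf_ae_{lang}",
    ]


if __name__ == "__main__":
    # Print the command rather than running it. To launch training, pass the
    # list to subprocess.run(cmd, check=True) from the repository root.
    cmd = build_ud_train_cmd("de")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list instead of a single string avoids shell quoting problems when the command is handed to subprocess.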

For more instructions on training, please type python -m tagger.cmds.[crf_ae|feature_hmm] train -h.

Alternatively, we provide equivalent command entry points registered in setup.py: crf-ae and feature-hmm.

$ crf-ae train \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --train data/wsj/total.conll \
    --evaluate data/wsj/total.conll \
    --path save/crf_ae

Evaluation

$ python -u -m tagger.cmds.crf_ae evaluate \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --data data/wsj/total.conll \
    --path save/crf_ae

Predict

$ python -u -m tagger.cmds.crf_ae predict \
    --conf configs/crf_ae.ini \
    --encoder elmo \
    --plm elmo_models/allennlp/elmo_2x4096_512_2048cnn_2xhighway_5.5B \
    --data data/wsj/total.conll \
    --path save/crf_ae \
    --pred save/crf_ae/pred.conll
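
The predictions written to save/crf_ae/pred.conll can then be inspected from Python. The reader below is a minimal sketch under stated assumptions: it expects a CoNLL-style layout with one token per line, tab-separated columns, and a blank line between sentences, and it skips lines starting with "#". The function name read_conll is hypothetical, and because this README does not say which column holds the predicted tag, the function returns all columns unchanged.

```python
def read_conll(lines):
    """Group tab-separated token rows into sentences split at blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):  # skip CoNLL-U style comment lines
            current.append(line.split("\t"))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences
```

The function accepts any iterable of lines, so an open file handle works directly, for example read_conll(open("save/crf_ae/pred.conll", encoding="utf-8")).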

Owner

Jacob Zhou