A single model that parses Universal Dependencies across 75 languages.

UDify

MIT License

UDify is a single model that jointly predicts Universal Dependencies annotations (UPOS, UFeats, Lemmas, Deps) and accepts any of 75 supported languages as input (trained on UD v2.3 with 124 treebanks). This repository accompanies the paper "75 Languages, 1 Model: Parsing Universal Dependencies Universally" and provides tools to train a multilingual model capable of parsing any Universal Dependencies treebank with high accuracy. The project also supports training and evaluation for the SIGMORPHON 2019 Shared Task #2, where the model placed 1st in morphology tagging (paper can be found here).

Integration with SpaCy is supported by Camphr.

[Figure: UDify model architecture]

The project is built using AllenNLP and PyTorch.

Getting Started

Install the Python packages listed in requirements.txt. UDify depends on AllenNLP and PyTorch. On Windows, use WSL. Optionally, install TensorFlow to get TensorBoard, which gives a rich visualization of model performance on each UD task.

pip install -r ./requirements.txt

Download the UD corpus by running the script

bash ./scripts/download_ud_data.sh

or alternatively download the data from universaldependencies.org and extract into data/ud-treebanks-v2.3/, then run scripts/concat_ud_data.sh to generate the multilingual UD dataset.
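Not every UD treebank ships all three splits, so it can help to check the extracted directory before generating the multilingual dataset. The following is a minimal sketch, not part of the repository: it assumes the UD_<Language>-<Name>/<code>-ud-<split>.conllu layout seen in the example paths in this README, and the function name find_treebank_splits is hypothetical. The demo runs against a throwaway directory rather than the real data.

```python
from pathlib import Path
import tempfile

SPLITS = ("train", "dev", "test")


def find_treebank_splits(ud_root):
    """Map each UD_* treebank directory name to the split files it contains."""
    found = {}
    for treebank_dir in sorted(Path(ud_root).glob("UD_*")):
        if not treebank_dir.is_dir():
            continue
        splits = {}
        for split in SPLITS:
            # Assumed naming scheme: <code>-ud-<split>.conllu
            matches = sorted(treebank_dir.glob(f"*-ud-{split}.conllu"))
            if matches:
                splits[split] = matches[0]
        found[treebank_dir.name] = splits
    return found


# Demo on a temporary directory holding one treebank with no test split.
with tempfile.TemporaryDirectory() as tmp:
    ewt = Path(tmp) / "UD_English-EWT"
    ewt.mkdir()
    for split in ("train", "dev"):
        (ewt / f"en_ewt-ud-{split}.conllu").write_text("", encoding="utf-8")

    report = find_treebank_splits(tmp)
    demo_missing = [s for s in SPLITS if s not in report["UD_English-EWT"]]
    print("missing splits:", demo_missing)
```

To inspect the real data, call find_treebank_splits("data/ud-treebanks-v2.3") and list the treebanks whose entry lacks a split you need.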

Training the Model

Before training, make sure the dataset is downloaded and extracted into the data directory and the multilingual dataset is generated with scripts/concat_ud_data.sh. To train the multilingual model (fine-tune BERT on UD), run the command

python train.py --config config/ud/multilingual/udify_bert_finetune_multilingual.json --name multilingual

which will begin loading the dataset and model before training the network. The model metrics, vocab, and weights will be saved under logs/multilingual. Note that this process is highly memory intensive: it requires 16+ GB of RAM and 12+ GB of GPU memory (requirements are halved if fp16 is enabled in AllenNLP, but this requires custom changes to the library). Training all 80 epochs may take 20 or more days, depending on your GPU.

Training on Other Datasets

An example config is provided for fine-tuning on English EWT alone. Run:

python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

To train on your own dataset, copy config/ud/multilingual/udify_bert_finetune_multilingual.json and modify the following JSON parameters:

  • train_data_path, validation_data_path, and test_data_path to the paths of the dataset's conllu files. Any of these can optionally be null.
  • directory_path to data/vocab/<dataset_name>/vocabulary.
  • warmup_steps and start_step to be equal to the number of steps in the first epoch. A good initial value is in the range 100-1000. Alternatively, run the training script first to see the number of steps to the right of the progress bar.
  • If using just one treebank, optionally add xpos to the tasks list.
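The parameter changes above can be sketched in Python. This is an illustration under stated assumptions, not the repository's tooling: the helper override_keys is hypothetical, the toy dictionary stands in for the real config (its nesting and all paths in it are made up), and the helper replaces a key wherever it already appears so that it does not need to know where each parameter lives.

```python
import json


def override_keys(node, overrides):
    """Return a copy of `node` in which every existing key named in
    `overrides` is replaced by the given value, at any nesting depth."""
    if isinstance(node, dict):
        return {
            key: overrides[key] if key in overrides else override_keys(value, overrides)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [override_keys(item, overrides) for item in node]
    return node


# Toy stand-in for the copied config; structure and values are hypothetical.
base = {
    "train_data_path": "old/train.conllu",
    "validation_data_path": "old/dev.conllu",
    "test_data_path": "old/test.conllu",
    "vocabulary": {"directory_path": "old/vocabulary"},
    "trainer": {"scheduler": {"warmup_steps": 8000, "start_step": 8000}},
}

custom = override_keys(base, {
    "train_data_path": "data/my_treebank/train.conllu",  # hypothetical path
    "validation_data_path": "data/my_treebank/dev.conllu",  # hypothetical path
    "test_data_path": None,  # written to JSON as null
    "directory_path": "data/vocab/my_treebank/vocabulary",  # hypothetical path
    "warmup_steps": 500,
    "start_step": 500,
})

print(json.dumps(custom, indent=2))
```

With the real file, one would load the copied config with json.load (assuming it parses as plain JSON), apply the overrides, and write the result to a new file to pass to --config.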

Viewing Model Performance

One can view how well the models are performing by running TensorBoard

tensorboard --logdir logs

This should show the currently trained model as well as any other previously trained models. The model will be stored in a folder specified by the --name parameter as well as a date stamp, e.g., logs/multilingual/2019.07.03_11.08.51.
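Since each run lands in a date-stamped folder, a script can pick out the most recent one by parsing the folder names. A minimal sketch, assuming the stamp always follows the pattern of the example above (YYYY.MM.DD_HH.MM.SS); the function name latest_run is hypothetical, and the demo uses a temporary directory instead of a real logs folder.

```python
from datetime import datetime
from pathlib import Path
import tempfile

# Format inferred from the example stamp 2019.07.03_11.08.51 (an assumption).
STAMP_FORMAT = "%Y.%m.%d_%H.%M.%S"


def latest_run(name_dir):
    """Return the sub-directory with the newest date stamp, or None."""
    runs = []
    for child in Path(name_dir).iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, STAMP_FORMAT)
        except ValueError:
            continue  # not a date-stamped run folder
        runs.append((stamp, child))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]


# Demo: two stamped runs plus an unrelated folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2019.07.03_11.08.51", "2019.07.10_09.00.00", "notes"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_run(tmp).name
    print(latest_name)
```

For a real run this would be called as latest_run("logs/multilingual").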

Pretrained Models

Pretrained models can be found here. These can be used for predicting conllu annotations or for fine-tuning. The link contains the following:

  • udify-model.tar.gz - The full UDify model archive that can be used for prediction with predict.py. Note that this model has been trained for extra epochs, and may differ slightly from the model shown in the original research paper.
  • udify-bert.tar.gz - The extracted BERT weights from the UDify model, in huggingface transformers (pytorch-pretrained-bert) format.

Predicting Universal Dependencies from a Trained Model

To predict UD annotations, one can supply the path to the trained model and an input conllu-formatted file:

python predict.py <archive> <input.conllu> <output.conllu> [--eval_file results.json]

For instance, predicting the dev set of English EWT with the trained model saved under logs/model.tar.gz and UD treebanks at data/ud-treebanks-v2.3 can be done with

python predict.py logs/model.tar.gz data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu logs/pred.conllu --eval_file logs/pred.json

and will save the output predictions to logs/pred.conllu and evaluation to logs/pred.json.
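For a quick look at a prediction file without going through the evaluation output, the predicted and gold conllu files can be compared column by column. The sketch below is not the official UD evaluation and is not how predict.py computes its scores: it assumes both files contain exactly the same tokens in the same order, skips multiword-token ranges and empty nodes, and reports simple token-level UPOS accuracy, UAS, and LAS. The helper names and the toy sentence are made up for illustration.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token range or empty node
        current.append({"form": cols[1], "upos": cols[3],
                        "head": cols[6], "deprel": cols[7]})
    if current:
        sentences.append(current)
    return sentences


def compare(gold_text, pred_text):
    """Token-level UPOS accuracy, UAS, and LAS for identically tokenized files."""
    gold = [tok for sent in read_conllu(gold_text) for tok in sent]
    pred = [tok for sent in read_conllu(pred_text) for tok in sent]
    if not gold or len(gold) != len(pred):
        raise ValueError("files must contain the same, non-zero number of tokens")
    pairs = list(zip(gold, pred))
    n = len(pairs)
    return {
        "upos": sum(g["upos"] == p["upos"] for g, p in pairs) / n,
        "uas": sum(g["head"] == p["head"] for g, p in pairs) / n,
        "las": sum(g["head"] == p["head"] and g["deprel"] == p["deprel"]
                   for g, p in pairs) / n,
    }


def row(idx, form, upos, head, deprel):
    """Build one 10-column CoNLL-U token line with unused columns set to '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Toy example: the prediction mislabels one relation but keeps every head.
gold_text = "\n".join([
    "# text = The dog barks",
    row(1, "The", "DET", 2, "det"),
    row(2, "dog", "NOUN", 3, "nsubj"),
    row(3, "barks", "VERB", 0, "root"),
    "",
])
pred_text = gold_text.replace("nsubj", "obj")
scores = compare(gold_text, pred_text)
print(scores)
```

On real files, read each with open(path, encoding="utf-8").read() and pass the two strings to compare.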

Configuration Options

  1. One can specify the type of device to run on. For a single GPU, use the flag --device 0, or --device -1 for CPU.
  2. To skip waiting for the dataset to be fully loaded into memory, use the flag --lazy. Note that the dataset won't be shuffled.
  3. Resume an existing training run with --resume <archive_dir>.
  4. Specify a config file with --config <file>.

SIGMORPHON 2019 Shared Task

A modification to the basic UDify model is available for parsing morphology in the SIGMORPHON 2019 Shared Task #2. The following paper describes the model in more detail: "Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning".

Training is similar to UD: run download_sigmorphon_data.sh, then use the configuration file under config/sigmorphon/multilingual, e.g.,

python train.py --config config/sigmorphon/multilingual/udify_bert_sigmorphon_multilingual.json --name sigmorphon

FAQ

  1. When fine-tuning, my scores/metrics show poor performance.

It should take about 10 epochs to start seeing good scores coming from all the metrics, and 80 epochs to be competitive with UDPipe Future.

One caveat is that if you use a subset of treebanks for fine-tuning instead of all 124 UD v2.3 treebanks, you must modify the configuration file. Make sure to tune the learning rate scheduler to the number of training steps. Copy the udify_bert_finetune_multilingual.json config and modify the "warmup_steps" and "start_step" values. A good initial choice would be to set both to be equal to the number of training batches of one epoch (run the training script first to see the batches remaining, to the right of the progress bar).
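The number of batches in one epoch can also be estimated before training by counting sentences in the training file. A rough sketch, assuming one training step per batch and a batch size measured in sentences; the batch size of 32 below is a hypothetical value and should be replaced by whatever your config uses. If the data iterator batches differently, the count shown next to the progress bar is the number to trust.

```python
import math


def count_sentences(conllu_text):
    """Count sentences in CoNLL-U text (blocks of token lines split by blank lines)."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False
        elif not line.startswith("#") and not in_sentence:
            count += 1  # first token line of a new sentence
            in_sentence = True
    return count


def steps_per_epoch(num_sentences, batch_size):
    """Number of batches needed to cover every sentence once."""
    return math.ceil(num_sentences / batch_size)


# Toy input with three sentences (token lines abbreviated to two columns).
toy = "# sent 1\n1\tHello\n\n# sent 2\n1\tGood\n2\tmorning\n\n1\tBye\n"
toy_sentences = count_sentences(toy)
print(toy_sentences, steps_per_epoch(toy_sentences, batch_size=2))

# Hypothetical figures: 10,000 training sentences at batch size 32.
estimate = steps_per_epoch(10_000, batch_size=32)
print(estimate)
```

The resulting number is a starting point for both "warmup_steps" and "start_step".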

Have a question not listed here? Open a GitHub Issue.

Citing This Research

If you use UDify for your research, please cite this work as:

@inproceedings{kondratyuk-straka-2019-75,
    title = {75 Languages, 1 Model: Parsing Universal Dependencies Universally},
    author = {Kondratyuk, Dan and Straka, Milan},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
    year = {2019},
    address = {Hong Kong, China},
    publisher = {Association for Computational Linguistics},
    url = {https://www.aclweb.org/anthology/D19-1279},
    pages = {2779--2795}
}
Owner
Dan Kondratyuk