AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

Overview

LITMUS Predictor

LITMUS Predictor provides support for simulating performance in ~100 languages given training observations of the desired task-model. Each training observation specifies the finetuning-datasize + test-performance in different languages.

Further, the tool provides support for constructing a data-collection strategy to maximize performance in desired targets subject to different constraints.

Installation

pip install -U pip
pip install -r requirements.txt

Usage

litmus/litmus_mixing.py contains the implementation of the LITMUS Predictor which can be trained on observations of different task-model trainings.

usage: LITMUS Tool [-h] [--scores_file SCORES_FILE]
                   [--train_format {json,csv}] [--save_state SAVE_STATE]
                   [--load_state LOAD_STATE]
                   [--precomputed_features PRECOMPUTED_FEATURES]
                   [--pivot_features {none,all,data_only}] [--use_all_langs]
                   [--common_scaling] [--training_algorithm {xgboost,mlp}]
                   [--error_method {LOO,LOTO,split,kfold,manual_split}]
                   [--data_sizes DATA_SIZES] [--mode MODE [MODE ...]]
                   [--output_dir OUTPUT_DIR]
                   [--heatmap_targets HEATMAP_TARGETS]
                   [--suggestions_budget SUGGESTIONS_BUDGET]
                   [--suggestions_langbudget SUGGESTIONS_LANGBUDGET]
                   [--suggestions_targets SUGGESTIONS_TARGETS]
                   [--suggestions_weights SUGGESTIONS_WEIGHTS]
                   [--suggestions_pivots SUGGESTIONS_PIVOTS]
                   [--suggestions_augmentable SUGGESTIONS_AUGMENTABLE]
                   [--suggestions_grid {exponential,linear}]
                   [--suggestions_objective {avg,min}]
                   [--suggestions_minperf SUGGESTIONS_MINPERF]
                   [--suggestions_minlangperf SUGGESTIONS_MINLANGPERF]
                   [--suggestions_verbose]
                   {mbert,xlmr}

positional arguments:
  {mbert,xlmr}          name of model to use

optional arguments:
  -h, --help            show this help message and exit
  --scores_file SCORES_FILE
                        path of json file containing scores to train on
  --train_format {json,csv}
                        Format of the training data
  --save_state SAVE_STATE
                        Save state of training of model to pickle file
  --load_state LOAD_STATE
                        Load trained model from pickle file
  --precomputed_features PRECOMPUTED_FEATURES
                        Path to precomputed-features file.
  --pivot_features {none,all,data_only}
                        What features based on pivot langs to use
  --use_all_langs       Add features based on all langs the tool supports
                        (Needed for transfer)
  --common_scaling      Common min max scaling params that are pvt
                        dependent(data size, type overlap, distance)
  --training_algorithm {xgboost,mlp}
                        which regressor to use
  --error_method {LOO,LOTO,split,kfold,manual_split}

  --data_sizes DATA_SIZES
                        Pivot data-size configs (semi-colon separated configs,
                        each config itself being comma-separated key-value
                        pairs)

  --mode MODE [MODE ...]
                        Output modes (comma-separated). Choose from following:
                        {heatmap, suggestions}.
  --output_dir OUTPUT_DIR
                        Overrride output directory
  --heatmap_targets HEATMAP_TARGETS
                        Targets for heatmap. Overrides suggestions_targets
                        (which is used by deafult)

  --suggestions_budget SUGGESTIONS_BUDGET
                        Budget for finding suggestions of which languages to
                        add data for (0 to disable)
  --suggestions_langbudget SUGGESTIONS_LANGBUDGET
                        Language-specific budget for finding suggestions
                        (overrrides suggestions_budget for these langs, comma-
                        separated list of key:value pairs)
  --suggestions_targets SUGGESTIONS_TARGETS
                        Targets being considered (comma-separated)
  --suggestions_weights SUGGESTIONS_WEIGHTS
                        Target weights for avg perf objective (comma-separated
                        list of key:value pairs, default wt=1)
  --suggestions_pivots SUGGESTIONS_PIVOTS
                        Index of desired row in data_sizes
  --suggestions_augmentable SUGGESTIONS_AUGMENTABLE
                        Set of augmentable languages (comma-separated)
  --suggestions_grid {exponential,linear}
                        Search space grid to use for suggestions
  --suggestions_objective {avg,min}
                        Objective function to be used for finding suggestions
  --suggestions_minperf SUGGESTIONS_MINPERF
                        Minimum acceptable average performance across tgts
  --suggestions_minlangperf SUGGESTIONS_MINLANGPERF
                        Minimum acceptable performance for given tgts (comma-
                        separated list of key:value pairs)
  --suggestions_verbose
                        Verbose logging of search

Examples

From shell

python3 litmus_mixing.py xlmr --scores_file training_observations.json --common_scaling --error_method split --mode heatmap --data_sizes "en:1000,hi:1000;en:1000,ar:1000" --use_all_langs --heatmap_targets en,fr,de,hi,ar,ru

From external scripts

from litmus import litmus_mixing

data_file = "" # Location of train data file
args = litmus_mixing.parse_args([
    "xlmr", data_file,
    "--common_scaling",
    "--error_method", "kfold",
    "--training_algorithm", "xgboost"
])
res = litmus_mixing.litmus_main(args)

WebApp

frontend/ contains the code for hosting the tool as a webapp using Azure Functions. frontend/WebUx implements the client-side as a static website which interacts with a Azure Functions backend which internally runs the litmus/litmus_mixing.py script.

Instructions to self-host

  1. Create an Azure Functions resource on Azure.
  2. Install Azure CLI and Functions Core Tools
  3. cd into the frontend/ directory and deploy to azure functions using func azure functionapp publish .

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.

English|简体中文 ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,该框架将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。ERNIE在累积 40 余个典型 NLP 任务取得 SOTA 效果,并在 G

5.4k Jan 03, 2023
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 01, 2022
Implementation of legal QA system based on SentenceKoBART

LegalQA using SentenceKoBART Implementation of legal QA system based on SentenceKoBART How to train SentenceKoBART Based on Neural Search Engine Jina

Heewon Jeon(gogamza) 75 Dec 27, 2022
DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time

DensePhrases provides answers to your natural language questions from the entire Wikipedia in real-time. While it efficiently searches the answers out of 60 billion phrases in Wikipedia, it is also v

Jinhyuk Lee 543 Jan 08, 2023
A python package for deep multilingual punctuation prediction.

This python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022
Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

Yu Zhang 50 Nov 08, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.5k Jan 04, 2023
Higher quality textures for the Metal Gear Solid series.

Metal Gear Solid: HD Textures Higher quality textures for the Metal Gear Solid series. The goal is to maximize the quality of assets that the engine w

Samantha 6 Dec 06, 2022
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

Tao Lei 14 Dec 12, 2022
A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

Alok singh 9 Feb 15, 2022
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

Stella Douka 14 Nov 02, 2022
English loanwords in the world's languages

Wiktionary as CLDF Content cldf1 and cldf2 contain cldf-conform data sets with a total of 2 377 756 entries about the vocabulary of all 1403 languages

Viktor Martinović 3 Jan 14, 2022
Application to help find best train itinerary, uses speech to text, has a spam filter to segregate invalid inputs, NLP and Pathfinding algos.

T-IAI-901-MSC2022 - GROUP 18 Gestion de projet Notre travail a été organisé et réparti dans un Trello. https://trello.com/b/X3s2fpPJ/ia-projet Install

1 Feb 05, 2022
Correctly generate plurals, ordinals, indefinite articles; convert numbers to words

NAME inflect.py - Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words. SYNOPSIS import inflect p = in

Jason R. Coombs 762 Dec 29, 2022
Twitter-Sentiment-Analysis - Analysis of twitter posts' positive and negative score.

Twitter-Sentiment-Analysis The hands-on project is in Python 3 Programming class offered by University of Michigan via Coursera. The task is to build

Eszter Pai 1 Jan 03, 2022
Stack based programming language that compiles to x86_64 assembly or can alternatively be interpreted in Python

lang lang is a simple stack based programming language written in Python. It can

Christoffer Aakre 1 May 30, 2022