AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

Overview

LITMUS Predictor

LITMUS Predictor provides support for simulating performance in ~100 languages given training observations of the desired task-model. Each training observation specifies the finetuning-datasize + test-performance in different languages.

Further, the tool provides support for constructing a data-collection strategy to maximize performance in desired targets subject to different constraints.

Installation

pip install -U pip
pip install -r requirements.txt

Usage

litmus/litmus_mixing.py contains the implementation of the LITMUS Predictor which can be trained on observations of different task-model trainings.

usage: LITMUS Tool [-h] [--scores_file SCORES_FILE]
                   [--train_format {json,csv}] [--save_state SAVE_STATE]
                   [--load_state LOAD_STATE]
                   [--precomputed_features PRECOMPUTED_FEATURES]
                   [--pivot_features {none,all,data_only}] [--use_all_langs]
                   [--common_scaling] [--training_algorithm {xgboost,mlp}]
                   [--error_method {LOO,LOTO,split,kfold,manual_split}]
                   [--data_sizes DATA_SIZES] [--mode MODE [MODE ...]]
                   [--output_dir OUTPUT_DIR]
                   [--heatmap_targets HEATMAP_TARGETS]
                   [--suggestions_budget SUGGESTIONS_BUDGET]
                   [--suggestions_langbudget SUGGESTIONS_LANGBUDGET]
                   [--suggestions_targets SUGGESTIONS_TARGETS]
                   [--suggestions_weights SUGGESTIONS_WEIGHTS]
                   [--suggestions_pivots SUGGESTIONS_PIVOTS]
                   [--suggestions_augmentable SUGGESTIONS_AUGMENTABLE]
                   [--suggestions_grid {exponential,linear}]
                   [--suggestions_objective {avg,min}]
                   [--suggestions_minperf SUGGESTIONS_MINPERF]
                   [--suggestions_minlangperf SUGGESTIONS_MINLANGPERF]
                   [--suggestions_verbose]
                   {mbert,xlmr}

positional arguments:
  {mbert,xlmr}          name of model to use

optional arguments:
  -h, --help            show this help message and exit
  --scores_file SCORES_FILE
                        path of json file containing scores to train on
  --train_format {json,csv}
                        Format of the training data
  --save_state SAVE_STATE
                        Save state of training of model to pickle file
  --load_state LOAD_STATE
                        Load trained model from pickle file
  --precomputed_features PRECOMPUTED_FEATURES
                        Path to precomputed-features file.
  --pivot_features {none,all,data_only}
                        What features based on pivot langs to use
  --use_all_langs       Add features based on all langs the tool supports
                        (Needed for transfer)
  --common_scaling      Common min max scaling params that are pvt
                        dependent(data size, type overlap, distance)
  --training_algorithm {xgboost,mlp}
                        which regressor to use
  --error_method {LOO,LOTO,split,kfold,manual_split}

  --data_sizes DATA_SIZES
                        Pivot data-size configs (semi-colon separated configs,
                        each config itself being comma-separated key-value
                        pairs)

  --mode MODE [MODE ...]
                        Output modes (comma-separated). Choose from following:
                        {heatmap, suggestions}.
  --output_dir OUTPUT_DIR
                        Overrride output directory
  --heatmap_targets HEATMAP_TARGETS
                        Targets for heatmap. Overrides suggestions_targets
                        (which is used by deafult)

  --suggestions_budget SUGGESTIONS_BUDGET
                        Budget for finding suggestions of which languages to
                        add data for (0 to disable)
  --suggestions_langbudget SUGGESTIONS_LANGBUDGET
                        Language-specific budget for finding suggestions
                        (overrrides suggestions_budget for these langs, comma-
                        separated list of key:value pairs)
  --suggestions_targets SUGGESTIONS_TARGETS
                        Targets being considered (comma-separated)
  --suggestions_weights SUGGESTIONS_WEIGHTS
                        Target weights for avg perf objective (comma-separated
                        list of key:value pairs, default wt=1)
  --suggestions_pivots SUGGESTIONS_PIVOTS
                        Index of desired row in data_sizes
  --suggestions_augmentable SUGGESTIONS_AUGMENTABLE
                        Set of augmentable languages (comma-separated)
  --suggestions_grid {exponential,linear}
                        Search space grid to use for suggestions
  --suggestions_objective {avg,min}
                        Objective function to be used for finding suggestions
  --suggestions_minperf SUGGESTIONS_MINPERF
                        Minimum acceptable average performance across tgts
  --suggestions_minlangperf SUGGESTIONS_MINLANGPERF
                        Minimum acceptable performance for given tgts (comma-
                        separated list of key:value pairs)
  --suggestions_verbose
                        Verbose logging of search

Examples

From shell

python3 litmus_mixing.py xlmr --scores_file training_observations.json --common_scaling --error_method split --mode heatmap --data_sizes "en:1000,hi:1000;en:1000,ar:1000" --use_all_langs --heatmap_targets en,fr,de,hi,ar,ru

From external scripts

from litmus import litmus_mixing

data_file = "" # Location of train data file
args = litmus_mixing.parse_args([
    "xlmr", data_file,
    "--common_scaling",
    "--error_method", "kfold",
    "--training_algorithm", "xgboost"
])
res = litmus_mixing.litmus_main(args)

WebApp

frontend/ contains the code for hosting the tool as a webapp using Azure Functions. frontend/WebUx implements the client-side as a static website which interacts with a Azure Functions backend which internally runs the litmus/litmus_mixing.py script.

Instructions to self-host

  1. Create an Azure Functions resource on Azure.
  2. Install Azure CLI and Functions Core Tools
  3. cd into the frontend/ directory and deploy to azure functions using func azure functionapp publish .

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Simple bots or Simbots is a library designed to create simple bots using the power of python. This library utilises Intent, Entity, Relation and Context model to create bots .

Simple bots or Simbots is a library designed to create simple chat bots using the power of python. This library utilises Intent, Entity, Relation and

14 Dec 15, 2021
Chinese Grammatical Error Diagnosis

nlp-CGED Chinese Grammatical Error Diagnosis 中文语法纠错研究 基于序列标注的方法 所需环境 Python==3.6 tensorflow==1.14.0 keras==2.3.1 bert4keras==0.10.6 笔者使用了开源的bert4keras

12 Nov 25, 2022
Using context-free grammar formalism to parse English sentences to determine their structure to help computer to better understand the meaning of the sentence.

Sentance Parser Executing the Program Make sure Python 3.6+ is installed. Install requirements $ pip install requirements.txt Run the program:

Vaibhaw 12 Sep 28, 2022
Mycroft Core, the Mycroft Artificial Intelligence platform.

Mycroft Mycroft is a hackable open source voice assistant. Table of Contents Getting Started Running Mycroft Using Mycroft Home Device and Account Man

Mycroft 6.1k Jan 09, 2023
LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation Tasks | Datasets | LongLM | Baselines | Paper Introduction LOT is a ben

46 Dec 28, 2022
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classifi

186 Dec 24, 2022
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.

blurr A library that integrates huggingface transformers with version 2 of the fastai framework Install You can now pip install blurr via pip install

ohmeow 253 Dec 31, 2022
💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
a CTF web challenge about making screenshots

screenshotter (web) A CTF web challenge about making screenshots. It is inspired by a bug found in real life. The challenge was created by @LiveOverfl

219 Jan 02, 2023
Python SDK for working with Voicegain Speech-to-Text

Voicegain Speech-to-Text Python SDK Python SDK for the Voicegain Speech-to-Text API. This API allows for large vocabulary speech-to-text transcription

Voicegain 3 Dec 14, 2022
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
Blender addon - Scrub timeline from viewport with a shortcut

Viewport scrub timeline Move in the timeline directly in viewport and snap to nearest keyframe Note : This standalone feature will be added in the nat

Samuel Bernou 40 Nov 07, 2022
Python utility library for compositing PDF documents with reportlab.

pdfdoc-py Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the s

Michael Gale 1 Jan 06, 2022
PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".

LXMERT: Learning Cross-Modality Encoder Representations from Transformers Our servers break again :(. I have updated the links so that they should wor

Hao Tan 838 Dec 19, 2022
This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

3 Dec 20, 2022
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022