Crosslingual Segmental Language Model

This repository contains the code from "Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages" (2021; C.M. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral). The code here is a modified version of the repository from the original MSLM paper. The mslm package can be used to train and run Segmental Language Models (SLMs).

In this repository, we additionally make available our preparation of the AmericasNLP 2021 multilingual dataset (see Data/AmericasNLP) and the target K'iche' data (Data/GlobalClassroom).

Paper Results

The results from the accompanying paper can be found in the Output directory. *.csv files contain statistics from the training run, *.out files contain the model output for the entire corpus, and *.score files contain the segmentation scores of the model output.
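To get a quick inventory of the result files, something like the following minimal sketch may help. It assumes only that the Output directory holds files with the three suffixes named above; it makes no assumption about what is inside them. The helper name group_result_files is hypothetical and not part of the mslm package.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes described in this README; files with any other suffix are ignored.
RESULT_SUFFIXES = (".csv", ".out", ".score")


def group_result_files(output_dir):
    """Map each known suffix to the sorted list of matching files under output_dir.

    Hypothetical helper for illustration; not part of the mslm package.
    """
    grouped = defaultdict(list)
    # rglob("*") walks subdirectories too, in case results are nested.
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file() and path.suffix in RESULT_SUFFIXES:
            grouped[path.suffix].append(path)
    return dict(grouped)


if __name__ == "__main__":
    # Run from the repository root so that "Output" resolves correctly.
    for suffix, paths in group_result_files("Output").items():
        print(f"{suffix}: {len(paths)} file(s)")
```

Grouping by suffix keeps training statistics, model output, and segmentation scores separate without relying on any particular file naming scheme.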

The results from the October 2021 pre-print (which we will refer to as Experiment Set A) are reproducible on commit 2b89575. We will consider this the official commit of the October 2021 pre-print.

Usage

The top-level scripts for training and experimentation can be found in RunScripts. Almost all functionality is run through the __main__.py script in the mslm package, which can either train or evaluate/use a model. The PyTorch modules for building SLMs can be found in mslm.segmental_lm, modules for the span-masking Transformer are in mslm.segmental_transformer, and modules for sequence lattice-based computations are in mslm.lattice. The main script takes in a configuration object to set most parameters for model training and use (see mslm.mslm_config). For information on the arguments to the main script:

python -m mslm --help

Environment setup

pip install -r requirements.txt

This code requires Python >= 3.6

Training

./RunScripts/run_mslm.sh

python -m mslm --input_file <INPUT_FILE> \
    --model_path <MODEL_PATH> \
    --mode train \
    --config_file <CONFIG_FILE> \
    --dev_file <DEV_FILE> \
    [--preexisting]
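When launching several training runs, it can be convenient to assemble the command above programmatically. The following is a minimal sketch that uses only the flags shown in the training command. The helper names build_train_command and run_training are hypothetical and not part of the mslm package, and the values passed in are whatever paths apply to your setup.

```python
import subprocess
import sys


def build_train_command(input_file, model_path, config_file, dev_file,
                        preexisting=False):
    """Return the argument list for a training run.

    Hypothetical helper for illustration; it only mirrors the flags
    shown in the training command of this README.
    """
    cmd = [
        sys.executable, "-m", "mslm",
        "--input_file", str(input_file),
        "--model_path", str(model_path),
        "--mode", "train",
        "--config_file", str(config_file),
        "--dev_file", str(dev_file),
    ]
    if preexisting:
        # Optional flag, shown in brackets in the command above.
        cmd.append("--preexisting")
    return cmd


def run_training(**kwargs):
    """Launch training; raises CalledProcessError if the run exits non-zero.

    Assumes the current working directory is the repository root so that
    the mslm package is importable.
    """
    subprocess.run(build_train_command(**kwargs), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.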

Evaluation

./RunScripts/eval_mslm.sh

Evaluation additionally requires a text file containing all of the words from the training set.
