Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Overview

Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

This repository contains code and data for evaluating model performance in crosslinguistic low-resource settings, using morphological segmentation as the test case. For more information, we refer to the paper Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation, to appear in Transactions of the Association for Computational Linguistics.

Arxiv version here

@misc{liu2022datadriven,
      title={Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation}, 
      author={Zoey Liu and Emily Prud'hommeaux},
      year={2022},
      eprint={2201.01845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

Install the following:

(1) Python 3

(2) Morfessor

(3) CRFsuite

(4) OpenNMT

Code

The code directory contains the code applied to conduct the experiments.

Collect initial data

Create a resource folder. This folder is supposed to hold the initial data for each language invited to participate in the experiments. The experiments were performed at different stages, therefore the initial data of different languages have different subdirectories within resource (please excuse this).

The data for three Mexican languages came from this paper.

(1) download the data from the public repository

(2) for each language, combine all the data from the training, development, and test set; this applies to both the *src files and the *tgt files.

(3) rename the combined data file as, e.g., Yorem Nokki: mayo_src, mayo_tgt, Nahuatl: nahuatl_src, nahuatl_tgt.

(4) put the data files within resource

The data for Persian came from here.

(1) download the data from the public repository

(2) combine the training, development, and test set to one data file

(3) rename the combined data file as persian

(4) put the single data file within resource

The data for German, Zulu and Indonesian came from this paper.

(1) download the data from the public repository

(2) put the downloaded supplement folder within resource

The data for English, Russian, Turkish and Finnish came from this repo.

(1) download the git repo

(2) put the downloaded NeuralMorphemeSegmentation folder within resource

Summary of (alternative) Language codes and data directories for running experiments

Yorem Nokki: mayo resources/

Nahuatl: nahuatl resources/

Wixarika: wixarika resources/

English: english/eng resources/NeuralMorphemeSegmentation/morphochal10data/

German: german/ger resources/supplement/seg/ger

Persian: persian resources/

Russian: russian/ru resources/NeuralMorphemeSegmentation/data/

Turkish: turkish/tur resources/NeuralMorphemeSegmentation/morphochal10data/

Finnish: finnish/fin resources/NeuralMorphemeSegmentation/morphochal10data/

Zulu: zulu/zul resources/supplement/seg/zul

Indonesian: indonesian/ind resources/supplement/seg/ind

Basic running of the code

Create experiments folder and subfolders for each language; e.g., Zulu

mkdir experiments

mkdir zulu

Generate data (an example)

with replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r with --k 500

without replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r without --k 500

Training models: Morfessor

Train morfessor models

python3 code/morfessor/morfessor.py --input experiments/zulu/500/with/ --lang zul

python3 code/morfessor/morfessor.py --input experiments/zulu/500/without/ --lang zul

Generate evaluation scrips for morfessor model results

python3 code/morf_shell.py --input experiments/zulu/500/ --lang zul

Evaluate morfessor model results

bash zulu_500_morf_eval.sh

Training models: CRF

Generate CRF shell script

e.g., generating 3-CRF shell script

python3 code/crf_order.py --input experiments/zulu/500/ --lang zul --r with --order 3

Training models: Seq2seq

Generate configuration .yaml files

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r with

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r without

Generate pbs file (containing also the code to train Seq2seq model)

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r with

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r without

Gather training results for a given language

Again take Zulu as an example. Make sure that given a data set size (e.g, 500) and a sampling method (e.g., with replacement), there are three subfolders in the folder experiments/zulu/500/with:

(1) morfessor for all *eval* files from Morfessor;

(2) higher_orders for all *eval* files from k-CRF;

(3) seq2seq for all *eval* files from Seq2seq

Then run:

python3 code/gather.py --input experiments/zulu/ --lang zul --short zulu.txt --full zulu_full.txt --long zulu_details.txt

Testing

Testing the best CRF

e.g., 4-CRFs trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_crf.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --order 4 --r with --k 50

Testing the best Seq2seq

e.g., trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_seq2seq.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --r with --k 50

Do the same for every language

Generating alternative splits

Gather features of data sets, as well as generate heuristic/adversarial data splits

python3 code/heuristics.py --input experiments/zulu/ --lang zul --output yayyy/ --split A --generate

Gather features of new unseen test sets

python3 code/new_test_heuristics.py --input experiments/zulu/ --output yayyy/ --lang zul

Yayyy: Full Results

Get them here

Running analyses and making plots

See code/plot.R for analysis and making fun plots

Owner
Zoey Liu
language, computation, music, food
Zoey Liu
A higher performance pytorch implementation of DeepLab V3 Plus(DeepLab v3+)

A Higher Performance Pytorch Implementation of DeepLab V3 Plus Introduction This repo is an (re-)implementation of Encoder-Decoder with Atrous Separab

linhua 326 Nov 22, 2022
A pytorch implementation of Pytorch-Sketch-RNN

Pytorch-Sketch-RNN A pytorch implementation of https://arxiv.org/abs/1704.03477 In order to draw other things than cats, you will find more drawing da

Alexis David Jacq 172 Dec 12, 2022
"NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search".

NAS-Bench-301 This repository containts code for the paper: "NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search". The

AutoML-Freiburg-Hannover 57 Nov 30, 2022
Trading Gym is an open source project for the development of reinforcement learning algorithms in the context of trading.

Trading Gym Trading Gym is an open-source project for the development of reinforcement learning algorithms in the context of trading. It is currently

Dimitry Foures 535 Nov 15, 2022
Face and Body Tracking for VRM 3D models on the web.

Kalidoface 3D - Face and Full-Body tracking for Vtubing on the web! A sequal to Kalidoface which supports Live2D avatars, Kalidoface 3D is a web app t

Rich 257 Jan 02, 2023
An Ensemble of CNN (Python 3.5.1 Tensorflow 1.3 numpy 1.13)

An Ensemble of CNN (Python 3.5.1 Tensorflow 1.3 numpy 1.13)

0 May 06, 2022
Unsupervised clustering of high content screen samples

Microscopium Unsupervised clustering and dataset exploration for high content screens. See microscopium in action Public dataset BBBC021 from the Broa

60 Dec 05, 2022
Sign Language Translation with Transformers (COLING'2020, ECCV'20 SLRTP Workshop)

transformer-slt This repository gathers data and code supporting the experiments in the paper Better Sign Language Translation with STMC-Transformer.

Kayo Yin 107 Dec 27, 2022
Adversarial Autoencoders

Adversarial Autoencoders (with Pytorch) Dependencies argparse time torch torchvision numpy itertools matplotlib Create Datasets python create_datasets

Felipe Ducau 188 Jan 01, 2023
Learning Time-Critical Responses for Interactive Character Control

Learning Time-Critical Responses for Interactive Character Control Abstract This code implements the paper Learning Time-Critical Responses for Intera

Movement Research Lab 227 Dec 31, 2022
Progressive Coordinate Transforms for Monocular 3D Object Detection

Progressive Coordinate Transforms for Monocular 3D Object Detection This repository is the official implementation of PCT. Introduction In this paper,

58 Nov 06, 2022
Deep metric learning methods implemented in Chainer

Deep Metric Learning Implementation of several methods for deep metric learning in Chainer v4.2.0. Proxy-NCA: No Fuss Distance Metric Learning using P

ronekko 156 Nov 28, 2022
CenterFace(size of 7.3MB) is a practical anchor-free face detection and alignment method for edge devices.

CenterFace Introduce CenterFace(size of 7.3MB) is a practical anchor-free face detection and alignment method for edge devices. Recent Update 2019.09.

StarClouds 1.2k Dec 21, 2022
Rafael Project- Classifying rockets to different types using data science algorithms.

Rocket-Classify Rafael Project- Classifying rockets to different types using data science algorithms. In this project we received data base with data

Hadassah Engel 5 Sep 18, 2021
A PyTorch implementation of EfficientNet and EfficientNetV2 (coming soon!)

EfficientNet PyTorch Quickstart Install with pip install efficientnet_pytorch and load a pretrained EfficientNet with: from efficientnet_pytorch impor

Luke Melas-Kyriazi 7.2k Jan 06, 2023
A Comparative Review of Recent Kinect-Based Action Recognition Algorithms (TIP2020, Matlab codes)

A Comparative Review of Recent Kinect-Based Action Recognition Algorithms This repo contains: the HDG implementation (Matlab codes) for 'Analysis and

Lei Wang 5 Oct 22, 2022
Author's PyTorch implementation of Randomized Ensembled Double Q-Learning (REDQ) algorithm.

REDQ source code Author's PyTorch implementation of Randomized Ensembled Double Q-Learning (REDQ) algorithm. Paper link: https://arxiv.org/abs/2101.05

109 Dec 16, 2022
CBKH: The Cornell Biomedical Knowledge Hub

Cornell Biomedical Knowledge Hub (CBKH) CBKG integrates data from 18 publicly available biomedical databases. The current version of CBKG contains a t

44 Dec 21, 2022
AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [WIP] Unofficial Pytorch implementation of AdaSpeech 2. Requirements : All code written i

Rishikesh (ऋषिकेश) 63 Dec 28, 2022
This is a project based on ConvNets used to identify whether a road is clean or dirty. We have used MobileNet as our base architecture and the weights are based on imagenet.

PROJECT TITLE: CLEAN/DIRTY ROAD DETECTION USING TRANSFER LEARNING Description: This is a project based on ConvNets used to identify whether a road is

Faizal Karim 3 Nov 06, 2022