Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Overview

Orange Chicken: Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

This repository contains code and data for evaluating model performance in crosslinguistic low-resource settings, using morphological segmentation as the test case. For more information, we refer to the paper Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation, to appear in Transactions of the Association for Computational Linguistics.

Arxiv version here

@misc{liu2022datadriven,
      title={Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation}, 
      author={Zoey Liu and Emily Prud'hommeaux},
      year={2022},
      eprint={2201.01845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

Install the following:

(1) Python 3

(2) Morfessor

(3) CRFsuite

(4) OpenNMT

Code

The code directory contains the code applied to conduct the experiments.

Collect initial data

Create a resource folder. This folder is supposed to hold the initial data for each language invited to participate in the experiments. The experiments were performed at different stages, therefore the initial data of different languages have different subdirectories within resource (please excuse this).

The data for three Mexican languages came from this paper.

(1) download the data from the public repository

(2) for each language, combine all the data from the training, development, and test set; this applies to both the *src files and the *tgt files.

(3) rename the combined data file as, e.g., Yorem Nokki: mayo_src, mayo_tgt, Nahuatl: nahuatl_src, nahuatl_tgt.

(4) put the data files within resource

The data for Persian came from here.

(1) download the data from the public repository

(2) combine the training, development, and test set to one data file

(3) rename the combined data file as persian

(4) put the single data file within resource

The data for German, Zulu and Indonesian came from this paper.

(1) download the data from the public repository

(2) put the downloaded supplement folder within resource

The data for English, Russian, Turkish and Finnish came from this repo.

(1) download the git repo

(2) put the downloaded NeuralMorphemeSegmentation folder within resource

Summary of (alternative) Language codes and data directories for running experiments

Yorem Nokki: mayo resources/

Nahuatl: nahuatl resources/

Wixarika: wixarika resources/

English: english/eng resources/NeuralMorphemeSegmentation/morphochal10data/

German: german/ger resources/supplement/seg/ger

Persian: persian resources/

Russian: russian/ru resources/NeuralMorphemeSegmentation/data/

Turkish: turkish/tur resources/NeuralMorphemeSegmentation/morphochal10data/

Finnish: finnish/fin resources/NeuralMorphemeSegmentation/morphochal10data/

Zulu: zulu/zul resources/supplement/seg/zul

Indonesian: indonesian/ind resources/supplement/seg/ind

Basic running of the code

Create experiments folder and subfolders for each language; e.g., Zulu

mkdir experiments

mkdir zulu

Generate data (an example)

with replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r with --k 500

without replacement, data size = 500

python3 code/segmentation_data.py --input resources/supplement/seg/zul/ --output experiments/zulu/ --lang zul --r without --k 500

Training models: Morfessor

Train morfessor models

python3 code/morfessor/morfessor.py --input experiments/zulu/500/with/ --lang zul

python3 code/morfessor/morfessor.py --input experiments/zulu/500/without/ --lang zul

Generate evaluation scrips for morfessor model results

python3 code/morf_shell.py --input experiments/zulu/500/ --lang zul

Evaluate morfessor model results

bash zulu_500_morf_eval.sh

Training models: CRF

Generate CRF shell script

e.g., generating 3-CRF shell script

python3 code/crf_order.py --input experiments/zulu/500/ --lang zul --r with --order 3

Training models: Seq2seq

Generate configuration .yaml files

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r with

python3 code/yaml.py --input experiments/zulu/500/ --lang zul --r without

Generate pbs file (containing also the code to train Seq2seq model)

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r with

python3 code/sirius.py --input experiments/zulu/500/ --lang zul --r without

Gather training results for a given language

Again take Zulu as an example. Make sure that given a data set size (e.g, 500) and a sampling method (e.g., with replacement), there are three subfolders in the folder experiments/zulu/500/with:

(1) morfessor for all *eval* files from Morfessor;

(2) higher_orders for all *eval* files from k-CRF;

(3) seq2seq for all *eval* files from Seq2seq

Then run:

python3 code/gather.py --input experiments/zulu/ --lang zul --short zulu.txt --full zulu_full.txt --long zulu_details.txt

Testing

Testing the best CRF

e.g., 4-CRFs trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_crf.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --order 4 --r with --k 50

Testing the best Seq2seq

e.g., trained from data sets sampled with replacement, for test sets of size 50

python3 code/testing_seq2seq.py --input experiments/zulu/500/ --data resources/supplement/seg/zul/ --lang zul --n 100 --r with --k 50

Do the same for every language

Generating alternative splits

Gather features of data sets, as well as generate heuristic/adversarial data splits

python3 code/heuristics.py --input experiments/zulu/ --lang zul --output yayyy/ --split A --generate

Gather features of new unseen test sets

python3 code/new_test_heuristics.py --input experiments/zulu/ --output yayyy/ --lang zul

Yayyy: Full Results

Get them here

Running analyses and making plots

See code/plot.R for analysis and making fun plots

Owner
Zoey Liu
language, computation, music, food
Zoey Liu
Learning Confidence for Out-of-Distribution Detection in Neural Networks

Learning Confidence Estimates for Neural Networks This repository contains the code for the paper Learning Confidence for Out-of-Distribution Detectio

235 Jan 05, 2023
OMAMO: orthology-based model organism selection

OMAMO: orthology-based model organism selection OMAMO is a tool that suggests the best model organism to study a biological process based on orthologo

Dessimoz Lab 5 Apr 22, 2022
A Python training and inference implementation of Yolov5 helmet detection in Jetson Xavier nx and Jetson nano

yolov5-helmet-detection-python A Python implementation of Yolov5 to detect head or helmet in the wild in Jetson Xavier nx and Jetson nano. In Jetson X

12 Dec 05, 2022
NAACL'2021: Factual Probing Is [MASK]: Learning vs. Learning to Recall

OptiPrompt This is the PyTorch implementation of the paper Factual Probing Is [MASK]: Learning vs. Learning to Recall. We propose OptiPrompt, a simple

Princeton Natural Language Processing 150 Dec 20, 2022
The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation

PointNav-VO The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation Project Page | Paper Table of Contents Setup

Xiaoming Zhao 41 Dec 15, 2022
Setup freqtrade/freqUI on Heroku

UNMAINTAINED - REPO MOVED TO https://github.com/p-zombie/freqtrade Creating the app git clone https://github.com/joaorafaelm/freqtrade.git && cd freqt

João 51 Aug 29, 2022
Code for Max-Margin Contrastive Learning - AAAI 2022

Max-Margin Contrastive Learning This is a pytorch implementation for the paper Max-Margin Contrastive Learning accepted to AAAI 2022. This repository

Anshul Shah 12 Oct 22, 2022
Extremely simple and fast extreme multi-class and multi-label classifiers.

napkinXC napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification, that focus of implementing various m

Marek Wydmuch 43 Nov 14, 2022
Must-read Papers on Physics-Informed Neural Networks.

PINNpapers Contributed by IDRL lab. Introduction Physics-Informed Neural Network (PINN) has achieved great success in scientific computing since 2017.

IDRL 330 Jan 07, 2023
Synthesizing and manipulating 2048x1024 images with conditional GANs

pix2pixHD Project | Youtube | Paper Pytorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic image-to-image translatio

NVIDIA Corporation 6k Dec 27, 2022
Algorithmic trading using machine learning.

Algorithmic Trading This machine learning algorithm was built using Python 3 and scikit-learn with a Decision Tree Classifier. The program gathers sto

Sourav Biswas 101 Nov 10, 2022
AI drive app that can help user become beautiful.

爱美丽 Beauty 简体中文 Features Beauty is an AI drive app that can help user become beautiful. it contain those functions: face score cheek face beauty repor

Starved Midnight 1 Jan 30, 2022
K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce (EMNLP Founding 2021)

Introduction K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce. Installation PyTor

Xu Song 21 Nov 16, 2022
Unit-Convertor - Unit Convertor Built With Python

Python Unit Converter This project can convert Weigth,length and ... units for y

Mahdis Esmaeelian 1 May 31, 2022
OBG-FCN - implementation of 'Object Boundary Guided Semantic Segmentation'

OBG-FCN This repository is to reproduce the implementation of 'Object Boundary Guided Semantic Segmentation' in http://arxiv.org/abs/1603.09742 Object

Jiu XU 3 Mar 11, 2019
Automated image registration. Registrationimation was too much of a mouthful.

alignimation Automated image registration. Registrationimation was too much of a mouthful. This repo contains the code used for my blog post Alignimat

Ethan Rosenthal 9 Oct 13, 2022
Dynamica causal Bayesian optimisation

Dynamic Causal Bayesian Optimization This is a Python implementation of Dynamic Causal Bayesian Optimization as presented at NeurIPS 2021. Abstract Th

nd308 18 Nov 22, 2022
Code for the paper "Can Active Learning Preemptively Mitigate Fairness Issues?" presented at RAI 2021.

Can Active Learning Preemptively Mitigate Fairness Issues? Code for the paper "Can Active Learning Preemptively Mitigate Fairness Issues?" presented a

ElementAI 7 Aug 12, 2022
Segmentation for medical image.

EfficientSegmentation Introduction EfficientSegmentation is an open source, PyTorch-based segmentation framework for 3D medical image. Features A whol

68 Nov 28, 2022
Code for the head detector (HeadHunter) proposed in our CVPR 2021 paper Tracking Pedestrian Heads in Dense Crowd.

Head Detector Code for the head detector (HeadHunter) proposed in our CVPR 2021 paper Tracking Pedestrian Heads in Dense Crowd. The head_detection mod

Ramana Sundararaman 76 Dec 06, 2022