MultiLexNorm 2021 competition system from ÚFAL

Last update: Jun 28, 2022

Overview

ÚFAL at MultiLexNorm 2021:
Improving Multilingual Lexical Normalization by Fine-tuning ByT5

David Samuel & Milan Straka

Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics

Paper (TODO)
Interactive demo on Google Colab
HuggingFace models (TODO)

This is the official repository for the winning entry to the W-NUT 2021: Multilingual Lexical Normalization (MultiLexNorm) shared task, which evaluates lexical-normalization systems on 12 social media datasets in 11 languages.

Our system is based on ByT5, which we first pre-train on synthetic data and then fine-tune on authentic normalization data. It achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. In addition to these source files, we also release the fine-tuned models on HuggingFace (TODO) and an interactive demo on Google Colab.
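To illustrate why a byte-level model such as ByT5 suits noisy text, the sketch below mimics its input encoding: the model reads raw UTF-8 bytes rather than subword tokens, so unusual spellings never fall out of vocabulary. The offset of 3 and the end-of-sequence id of 1 are assumptions based on the convention of the HuggingFace ByT5 tokenizer; the function names are hypothetical, and this is not the preprocessing code of this repository.

```python
# Minimal sketch of ByT5-style byte-level encoding (illustrative only).
# Assumption: ids 0, 1, 2 are reserved for <pad>, </s>, <unk>, so every
# UTF-8 byte value is shifted by 3. Function names are hypothetical.

SPECIAL_TOKEN_OFFSET = 3  # assumed number of reserved special ids
EOS_ID = 1                # assumed end-of-sequence id


def encode_bytes(text: str) -> list[int]:
    """Map text to shifted UTF-8 byte ids and append the EOS id."""
    return [byte + SPECIAL_TOKEN_OFFSET for byte in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping ids reserved for special tokens."""
    raw = bytes(i - SPECIAL_TOKEN_OFFSET for i in ids if i >= SPECIAL_TOKEN_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = encode_bytes("u r gr8")
    print(ids)                # one id per byte, plus the trailing EOS id
    print(decode_bytes(ids))  # round-trips back to the original string
```

Because every character maps to one or more byte ids, a non-ASCII letter such as "Ú" simply becomes two ids instead of an unknown token.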

How to run

🐾 Clone repository and install the Python requirements

git clone https://github.com/ufal/multilexnorm2021.git
cd multilexnorm2021

pip3 install -r requirements.txt

🐾 Initialize

Run the initialization script to download the official MultiLexNorm data together with a dump of English Wikipedia. We recommend using Wikipedia dumps as a source of clean multilingual data, but other data sources should also work.

./initialize.sh
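Once the data is downloaded, it can help to inspect it programmatically. The reader below is a minimal sketch that assumes the shared-task file layout of one token per line, with the raw word and its normalization separated by a tab and sentences separated by blank lines. The function name and the sample text are hypothetical, so check the downloaded files before relying on it.

```python
import io
from typing import Iterable, List, Tuple

# One sentence is a list of (raw_word, normalized_word) pairs.
Sentence = List[Tuple[str, str]]


def read_normalization_pairs(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated (raw, normalized) lines into sentences.

    Assumed layout: "raw<TAB>normalized" per line, blank line between
    sentences. A missing or empty second column yields an empty string.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        raw, _, normalized = line.partition("\t")
        current.append((raw, normalized))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Hypothetical sample in the assumed format.
    sample = io.StringIO("u\tyou\nr\tare\ngr8\tgreat\n\nsee\tsee\nu\tyou\n")
    for sentence in read_normalization_pairs(sample):
        print(sentence)
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `read_normalization_pairs(open(path, encoding="utf-8"))`.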

🐾 Train

To train a model for English lexical normalization, run the following command. Other configurations are located in the config folder.

python3 train.py --config config/en.yaml
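To train with several configurations in turn, the command above can be generated for each file in the config folder. The helper below is a minimal sketch: it assumes that all configurations use the `.yaml` extension like `config/en.yaml`, and the function name is hypothetical. It only builds the command lists and does not start any training itself.

```python
from pathlib import Path
from typing import List


def training_commands(config_dir: str = "config") -> List[List[str]]:
    """Build one `train.py` command per YAML file found in config_dir.

    Assumption: every configuration file ends in ".yaml"; other files
    in the folder are ignored.
    """
    commands = []
    for config_path in sorted(Path(config_dir).glob("*.yaml")):
        commands.append(["python3", "train.py", "--config", config_path.as_posix()])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    for command in training_commands():
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` to launch the runs one after another.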

Please cite the following publication

@inproceedings{wnut-ufal,
  title = "{ÚFAL} at {MultiLexNorm} 2021: Improving Multilingual Lexical Normalization by Fine-tuning {ByT5}",
  author = "Samuel, David and Straka, Milan",
  booktitle = "Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021)",
  year = "2021",
  publisher = "Association for Computational Linguistics",
  address = "Punta Cana, Dominican Republic"
}

Releases

v1.0.0 (Dec 5, 2021)

Owner

ÚFAL
