A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g. Europarl) to train translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.
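To make the idea of "close translation matches" concrete, here is a toy sketch in Python. It is not Yalign's algorithm: Yalign relies on a trained model for each language pair, whereas this sketch assumes a tiny hand-written word lexicon and scores sentence pairs by word overlap. The names `overlap_score` and `extract_pairs`, the lexicon and the threshold are all hypothetical and chosen only for illustration.

```python
def overlap_score(source, target, lexicon):
    """Fraction of source words whose lexicon translation appears in target."""
    source_words = source.lower().split()
    target_words = set(target.lower().split())
    if not source_words:
        return 0.0
    hits = sum(1 for word in source_words if lexicon.get(word) in target_words)
    return hits / len(source_words)


def extract_pairs(source_sentences, target_sentences, lexicon, threshold=0.5):
    """Pair each source sentence with its best-scoring target, if good enough."""
    if not target_sentences:
        return []
    pairs = []
    for source in source_sentences:
        best = max(target_sentences,
                   key=lambda target: overlap_score(source, target, lexicon))
        # Sentences with no close match are dropped rather than forced into a pair.
        if overlap_score(source, best, lexicon) >= threshold:
            pairs.append((source, best))
    return pairs


# Hypothetical toy data: two loosely related documents, not true translations.
lexicon = {"the": "el", "cat": "gato", "sleeps": "duerme"}
english = ["the cat sleeps", "stock prices fell today"]
spanish = ["hoy llueve mucho", "el gato duerme"]

pairs = extract_pairs(english, spanish, lexicon)
print(pairs)
```

Sentences with no counterpart in the other document are simply left out. That is the main difference between aligning comparable corpora and aligning strict translations.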

Installation

Yalign requires scikit-learn, so install that first.

After that, you can install Yalign from PyPI with pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz

Now use the yalign-align script with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
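The Spanish URL in this command is percent-encoded because the article title contains a non-ASCII character. As a small sketch, the Python below builds the same command line from plain article titles. The helper names `wikipedia_url` and `build_align_command` are hypothetical and not part of Yalign, and the sketch only constructs the command without running it.

```python
from urllib.parse import quote


def wikipedia_url(lang, title):
    """Return the article URL, percent-encoding non-ASCII characters as UTF-8."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def build_align_command(model_dir, url_a, url_b):
    """Return the argument list for the yalign-align invocation shown above."""
    return ["yalign-align", model_dir, url_a, url_b]


command = build_align_command(
    "en-es",  # directory created by unpacking en-es.tar.gz
    wikipedia_url("en", "Antiparticle"),
    wikipedia_url("es", "Antipartícula"),
)
print(" ".join(command))
```

Once Yalign is installed and the model is unpacked, the list can be passed to `subprocess.run(command, check=True)` from the standard library to run the alignment.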

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany

