Lingtrain Aligner: an ML-powered library for accurate text alignment.

Last update: Dec 14, 2022

Overview

Lingtrain Aligner

An ML-powered library for accurate alignment of texts in different languages.

Purpose

The main purpose of this alignment tool is to build parallel corpora from two or more raw texts in different languages. The texts should contain the same information (i.e., one text should be a translation of the other). E.g., it can be Drei Kameraden by Remarque in German and Three Comrades, its translation into English.

Process

There are plenty of obstacles during the alignment process:

The translator may have translated several sentences as one.
The translator may have translated one sentence as several.
The text may contain service marks:
- Page numbers
- Chapters and other section headings
- Author and title information
- Notes

Service marks can be handled manually (the tool helps to detect them), but translation conflicts need more careful handling.
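
As a rough illustration of how service marks such as page numbers and chapter headings might be flagged before alignment, here is a minimal sketch. The regular expressions and the function name find_service_marks are hypothetical heuristics chosen for this example; they are not the patterns the tool itself uses.

```python
import re

# Hypothetical heuristics for illustration only; not the tool's own rules.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
CHAPTER_HEADING = re.compile(
    r"^\s*(chapter|kapitel)\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE
)


def find_service_marks(lines):
    """Return (line_index, kind) pairs for lines that look like service marks."""
    marks = []
    for index, line in enumerate(lines):
        if PAGE_NUMBER.match(line):
            marks.append((index, "page_number"))
        elif CHAPTER_HEADING.match(line):
            marks.append((index, "heading"))
    return marks


sample = [
    "Chapter 1",
    "This is the first sentence of the book.",
    "12",
    "This is the second sentence of the book.",
]
print(find_service_marks(sample))
```

Flagged lines can then be reviewed and removed or tagged by hand, which matches the manual handling described above.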

Lingtrain Aligner does almost all of the alignment work for you. It matches sentence pairs automatically using multilingual machine learning models, then searches for alignment conflicts and resolves them. As output you get the parallel corpus either as two separate plain text files or as a merged corpus in the widely used TMX format.
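
To give an idea of what the merged TMX output looks like, the sketch below writes aligned sentence pairs as a minimal TMX 1.4 document using only the Python standard library. The function build_tmx and the header values are assumptions made for this example; the file the tool actually exports may carry different header attributes and extra tags.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, lang_from, lang_to):
    """Build a minimal TMX document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(
        tmx,
        "header",
        {
            "creationtool": "example-script",  # hypothetical tool name
            "creationtoolversion": "0.0",
            "segtype": "sentence",
            "o-tmf": "plain-text",
            "adminlang": "en",
            "srclang": lang_from,
            "datatype": "plaintext",
        },
    )
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        unit = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((lang_from, source), (lang_to, target)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


pairs = [("Guten Morgen.", "Good morning."), ("Wie geht es dir?", "How are you?")]
tmx_text = build_tmx(pairs, "de", "en")
print(tmx_text)
```

Each tu element holds one aligned pair, with one tuv per language, so the file can be loaded by translation memory tools.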

Supported languages and models

The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that are used to calculate the distance between sentences. The list of supported languages depends on the selected backend model.

distiluse-base-multilingual-cased-v2
- more reliable and fast
- moderate weight size: 500 MB
- supports 50+ languages
- full list of supported languages can be found in this paper
LaBSE (Language-agnostic BERT Sentence Embedding)
- can be used for rare languages
- fairly heavy weights: 1.8 GB
- supports 100+ languages
- full list of supported languages can be found here
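
To make the idea of a distance between sentence embeddings concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are toy values invented for the example; real embeddings from either model have hundreds of dimensions, and cosine similarity is assumed here as the measure, since the text above only says that a distance is calculated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real sentence embeddings.
embedding_de = [0.9, 0.1, 0.3]
embedding_en_match = [0.8, 0.2, 0.3]  # pretend translation of the German sentence
embedding_en_other = [0.1, 0.9, 0.2]  # pretend unrelated sentence

score_match = cosine_similarity(embedding_de, embedding_en_match)
score_other = cosine_similarity(embedding_de, embedding_en_other)
print(score_match, score_other)
```

A sentence and its translation should score higher than an unrelated pair, which is what lets sentence pairs be matched across languages.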

Profit

Parallel corpora can themselves be used as a resource for machine translation models or for linguistic research.
My personal goal in this project is to help people build parallel translated books for foreign language learning.

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

160 Dec 23, 2022

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

19 Oct 28, 2022

Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

20 Jan 9, 2023

Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

49 Dec 30, 2022

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

1 Oct 5, 2021

Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

42 Dec 31, 2022

Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

46 Dec 15, 2022

A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 2, 2023

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

10 Oct 13, 2022

Comments

File Already Exists

I run docker pull lingtrain/aligner:v4, upload a text file, and...

After a warning like this, nothing happens. Moreover, it pops up for any text file.

opened by puffofsmoke 1
Fix XML creation:
prevent parent tag duplication for (langs, author, title)

add tags for tmx export

use 'direction' for splitting paragraphs

do not use bs4 (generates incorrect xml), change to lxml
opened by BorisNA 0
An error when I use "splitter.split_by_sentences_wrapper", please help check the error

When I call "splitted_from = splitter.split_by_sentences_wrapper(text1_prepared, lang_from)", it returns a list.

But I see that there is a conflict when inserting into SQLite. The specific error:

File "ling_test.py", line 36, in
  aligner.fill_db(db_path, splitted_from, splitted_to)
File "lingtrain_aligner/aligner.py", line 498, in fill_db
  db.executemany("insert into languages(key, val) values(?,?)", [("from", lang_from), ("to", lang_to)])
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.

opened by Amen-bang 5
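
The error in the issue above is what sqlite3 reports when a bound parameter has a type it cannot store. The sketch below reproduces that class of error under the assumption that a list of sentences ended up where a language code string was expected; the issue itself does not confirm that this was the cause. The table mirrors the insert statement from the traceback, and the sample rows are invented.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("create table languages(key text, val text)")

# Assumed faulty input: a list of sentences bound where a language code belongs.
bad_rows = [("from", ["First sentence.", "Second sentence."]), ("to", "en")]
try:
    connection.executemany("insert into languages(key, val) values(?,?)", bad_rows)
    failed = False
except sqlite3.Error as error:
    # The exact exception class and message depend on the Python version.
    failed = True
    print(type(error).__name__, error)

# Plain strings bind without problems.
good_rows = [("from", "de"), ("to", "en")]
connection.executemany("insert into languages(key, val) values(?,?)", good_rows)
stored = connection.execute(
    "select key, val from languages order by key"
).fetchall()
print(stored)
```

If this is the cause, checking the order and types of the arguments passed to fill_db would be the first thing to try.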
Add text splitting into small parts
The current version ignores the H1-H5 headers added by the user. But when a book is translated, the text of chapter 1 is translated as the text of chapter 1 in the other language. You can use this fact to split a big text into small parts.

Next idea: try to split a big text into small blocks automatically. Select a few sentences from the original text (for example, 10 sentences) and, in a loop, try to find the matching block in the translated text.

You can use the following pseudocode:

    left_block = original_sentences[100:110]
    scores = {}
    for i in range(50, 150):
        right_candidate = translated_sentences[i:i + 10]
        # higher similarity means a better match
        scores[i] = sum(cosine_similarity(left_block, right_candidate))
    right_start = index_with_max_value(scores)
    left_text_split_index = 100
    right_text_split_index = right_start
opened by AigizK 0
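
The block search proposed in the issue above can be sketched in runnable form as follows. This is one possible reading of the idea, not code from the library: find_matching_block is a hypothetical helper, the two-dimensional vectors are toy stand-ins for sentence embeddings, and the score compares sentences position by position, which assumes a roughly one-to-one translation inside the block.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_matching_block(left_block, right_vectors, search_start, search_end):
    """Return the start index of the window in right_vectors that best matches left_block."""
    size = len(left_block)
    best_index, best_score = None, -math.inf
    # Do not let the window run past the end of the translated text.
    last_start = min(search_end, len(right_vectors) - size + 1)
    for start in range(search_start, last_start):
        window = right_vectors[start:start + size]
        score = sum(cosine_similarity(l, r) for l, r in zip(left_block, window))
        if score > best_score:
            best_index, best_score = start, score
    return best_index


# Toy embeddings; real ones would come from a sentence embedding model.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.2]]
translated = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9], [1.0, 0.9], [0.2, 1.0]]

left_block = original[0:3]
right_start = find_matching_block(left_block, translated, 0, len(translated))
print(right_start)
```

The start of left_block and the returned index then give a pair of split points, one in each text.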

Releases(0.1.0)

0.1.0(Apr 21, 2021)

The initial release. Already works. Does not have requirements yet.
Source code(tar.gz)
Source code(zip)

Owner

Sergei Averkiev

Software Engineer. Eager to learn languages and machine learning approaches. Live in Moscow.

GitHub Repository

Deep Learning for Natural Language Processing - Lectures 2021

This repository contains slides for the course "20-00-0947: Deep Learning for Natural Language Processing" (Technical University of Darmstadt, Summer term 2021).

0 Feb 21, 2022

Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

3.1k Dec 29, 2022

A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

286 Jan 02, 2023

File-based TF-IDF: Calculates keywords in a document, using a word corpus.

File-based TF-IDF Calculates keywords in a document, using a word corpus. Why? Because I found myself with hundreds of plain text files, with no way t

1 Feb 11, 2022

Translation to python of Chris Sims' optimization function

pycsminwel This is a local minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

1 Mar 21, 2022

Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023

The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021

Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

1 Feb 17, 2022

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

5 Dec 15, 2022

Material for GW4SHM workshop, 16/03/2022.

GW4SHM Workshop Wednesday, 16th March 2022 (13:00 – 15:15 GMT): Presented by: Dr. Rhodri Nelson, Imperial College London Project website: https://www.

1 Mar 16, 2022

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

114 Dec 29, 2022

A Flask Sentiment Analysis API, with visual implementation

The Sentiment Analysis API was created using the Python Flask module. It allows users to pass a text or sentence through the (?text) argument, then view the sentiment analysis of that sentence. It can

10 Jul 17, 2022

Diaformer: Automatic Diagnosis via Symptoms Sequence Generation

Diaformer Diaformer: Automatic Diagnosis via Symptoms Sequence Generation (AAAI 2022) Diaformer is an efficient model for automatic diagnosis via symp

20 Dec 13, 2022

Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

46 Dec 14, 2022

multi-label，classifier，text classification，多标签文本分类，文本分类，BERT，ALBERT，multi-label-classification，seq2seq，attention，beam search

30 Dec 12, 2022

Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and other libraries.

VirtualAssistant Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and other libraries. Third Party Libraries us

1 Nov 27, 2021
