Lingtrain Aligner — ML powered library for the accurate texts alignment.

Overview

Lingtrain Aligner

ML powered library for the accurate texts alignment in different languages.

Cover

Purpose

Main purpose of this alignment tool is to build parallel corpora using two or more raw texts in different languages. Texts should contain the same information (i.e., one text should be a translated analog oh the other text). E.g., it can be the Drei Kameraden by Remarque in German and the Three Comrades — it's translation into English.

Process

There are plenty of obstacles during the alignment process:

  • The translator could translate several sentences as one.
  • The translator could translate one sentence as many.
  • There are some service marks in the text
    • Page numbers
    • Chapters and other section headings
    • Author and title information
    • Notes

While service marks can be handled manually (the tool helps to detect them), the translation conflicts should be handled more carefully.

Lingtrain Aligner tool will do almost all alignment work for you. It matches the sentence pairs automatically using the multilingual machine learning models. Then it searches for the alignment conflicts and resolves them. As output you will have the parallel corpora either as two distinct plain text files or as the merged corpora in widely used TMX format.

Supported languages and models

Automated alignment process relies on the sentence embeddings models. Embeddings are multidimensional vectors of a special kind which are used to calculate a distance between the sentences. Supported languages list depend on the selected backend model.

  • distiluse-base-multilingual-cased-v2
    • more reliable and fast
    • moderate weights size — 500MB
    • supports 50+ languages
    • full list of supported languages can be found in this paper
  • LaBSE (Language-agnostic BERT Sentence Embedding)
    • can be used for rare languages
    • pretty heavy weights — 1.8GB
    • supports 100+ languages
    • full list of supported languages can be found here

Profit

  • Parallel corpora by itself can used as the resource for machine translation models or for linguistic researches.
  • My personal goal of this project is to help people building parallel translated books for the foreign language learning.
You might also like...
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

Biterm Topic Model (BTM): modeling topics in short texts
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Text Classification in Turkish Texts with Bert
Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

Code for our paper
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Mask-Align: Self-Supervised Neural Word Alignment This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment. @inproceed

A pytorch implementation of the ACL2019 paper
A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

RE2 This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflo

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Comments
  • File Already Exists

    File Already Exists

    Делаю docker pull lingtrain/aligner:v4 Загружаю текстовый файл и...

    image

    После вот такого предупреждения ничего не происходит Причём оно вылазит на любой текстовый файл

    opened by puffofsmoke 1
  • Fix XML creation:

    Fix XML creation:

    • prevent parent tag duplication for (langs, author, title)
    • add tags for tmx export
    • use 'direction' for splitting paragraphs
    • do not use bs4 (generates incorrect xml), change to lxml
    opened by BorisNA 0
  • A error when I use “splitter.split_by_sentences_wrapper”,please help check the error

    A error when I use “splitter.split_by_sentences_wrapper”,please help check the error

    when I use “splitted_from = splitter.split_by_sentences_wrapper(text1_prepared, lang_from)” return list,

    But I see that there will be a conflict when insert sqlite ,specific error:

    File "ling_test.py", line 36, in aligner.fill_db(db_path, splitted_from, splitted_to) File "lingtrain_aligner/aligner.py", line 498, in fill_db db.executemany("insert into languages(key, val) values(?,?)", [("from", lang_from), ("to", lang_to)]) sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.

    opened by Amen-bang 5
  • Add text splitting into small parts

    Add text splitting into small parts

    The current version ignores the H1-H5 headers that were added by user. But when book was translate text from chapter 1 will be translate as a chapter 1 text into another language. You can use this fact and split a big text to small parts.

    Next idea - try split a big text to small blocks automatically: Select a few sentences from original text(for example 10 sentences) and using loop try to find translate block in the thanslated text.

    You can use the next psedocode:

    left_array = original_sentences[100:110]
    sum=[]
    for i=50;i<150 do:
       right_array_candidate=translated_sentences[i:i+10]
       sum[i]=sum(cosunuse_distance(left_array,right_array_candidate))
    
    rigth_array=get_index_with_max_value(sum)
    
    left_text_split_index=left_array[0]
    rigth_text_split_index=rigth_array[0]
    
    opened by AigizK 0
Releases(0.1.0)
Owner
Sergei Averkiev
Software Engineer. Eager to learn languages and machine learning approaches. Live in Moscow.
Sergei Averkiev
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

325 Jan 05, 2023
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023
Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 Billion Parameters) on a single 16 GB VRAM V100 Google Cloud instance with Huggingfa

289 Jan 06, 2023
Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

Accurately generate all possible forms of an English word Word forms can accurately generate all possible forms of an English word. It can conjugate v

Dibya Chakravorty 570 Dec 31, 2022
BERT score for text generation

BERTScore Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020). News: Features to appear in

Tianyi 1k Jan 08, 2023
Code examples for my Write Better Python Code series on YouTube.

Write Better Python Code This repository contains the code examples used in my Write Better Python Code series published on YouTube: https:/

858 Dec 29, 2022
Dense Passage Retriever - is a set of tools and models for open domain Q&A task.

Dense Passage Retrieval Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the

Meta Research 1.1k Jan 07, 2023
Simple, hackable offline speech to text - using the VOSK-API.

Simple, hackable offline speech to text - using the VOSK-API.

Campbell Barton 844 Jan 07, 2023
Calibre recipe to convert latest issue of Analyse & Kritik into an ebook

Calibre Recipe für "Analyse & Kritik" Dies ist ein "Recipe" für die Konvertierung der aktuellen Ausgabe der Zeitung Analyse & Kritik in ein Ebook. Es

Henning 3 Jan 04, 2022
Text Analysis & Topic Extraction on Android App user reviews

AndroidApp_TextAnalysis Hi, there! This is code archive for Text Analysis and Topic Extraction from user_reviews of Android App. Dataset Source : http

Fitrie Ratnasari 1 Feb 14, 2022
text to speech toolkit. 好用的中文语音合成工具箱,包含语音编码器、语音合成器、声码器和可视化模块。

ttskit Text To Speech Toolkit: 语音合成工具箱。 安装 pip install -U ttskit 注意 可能需另外安装的依赖包:torch,版本要求torch=1.6.0,=1.7.1,根据自己的实际环境安装合适cuda或cpu版本的torch。 ttskit的

KDD 483 Jan 04, 2023
Graphical user interface for Argos Translate

Argos Translate GUI Website | GitHub | PyPI Graphical user interface for Argos Translate. Install pip3 install argostranslategui

Argos Open Tech 16 Dec 07, 2022
BERN2: an advanced neural biomedical namedentity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by

DMIS Laboratory - Korea University 99 Jan 06, 2023
AI-Broad-casting - AI Broad casting with python

Basic Code 1. Use The Code Configuration Environment conda create -n code_base p

Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Weitang Liu 1.6k Jan 03, 2023
Journalism AI – Quotes extraction for modular journalism

Quote extraction for modular journalism (JournalismAI collab 2021)

Journalism AI collab 2021 207 Dec 25, 2022