Blazing fast language detection using fastText model

Last update: Dec 20, 2022

Overview

Luga

Blazing fast language detection using fastText's language models

Luga is the Swahili word for language. fastText provides blazing fast language detection, but downloading and loading its models is a bit awkward, and its API is not pretty. This is why luga was born.
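To illustrate the point about the raw API: fastText's `predict` returns labels prefixed with `__label__` together with a parallel sequence of probabilities, so callers usually unpack the result by hand. Below is a minimal sketch of that unpacking, assuming a prediction shaped like fastText's output. The `Detection` type and the `tidy` helper are hypothetical names used for illustration only; they are not luga's actual API.

```python
from typing import NamedTuple, Sequence, Tuple

LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


class Detection(NamedTuple):
    """Hypothetical tidy result type (not luga's actual return type)."""

    name: str
    score: float


def tidy(prediction: Tuple[Sequence[str], Sequence[float]]) -> Detection:
    """Turn a raw (labels, probabilities) pair into a readable result."""
    labels, scores = prediction
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return Detection(name=label, score=float(scores[0]))


# Hand-written stand-in for the shape of a raw fastText prediction.
raw = (("__label__en",), (0.98,))
print(tidy(raw))
```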

Installation

python -m pip install -U luga

Usage:

Note: First usage downloads the model for you. This is done only once.

from luga import language

print(language("the world has ended yesterday"))
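The "download once" behaviour mentioned in the note can be pictured as a simple cache check. The sketch below is an assumption about the general pattern, not luga's actual implementation; `ensure_model` and `fake_fetch` are hypothetical names, and the real download is replaced by a stand-in so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_model(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return `path`, calling `fetch(path)` only if the file is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)  # in a real tool this would download the model file
    return path


calls: List[Path] = []


def fake_fetch(target: Path) -> None:
    # Stand-in for a real download: records the call and writes placeholder bytes.
    calls.append(target)
    target.write_bytes(b"model")


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "cache" / "model.bin"
    ensure_model(model_path, fake_fetch)  # first call fetches the file
    ensure_model(model_path, fake_fetch)  # second call reuses it
    print(len(calls))  # prints 1
```

Passing the fetch step in as a callable keeps the cache logic testable without network access.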

Coming soon ...

TODO:

refactor artifacts.py
auto checkers with pre-commit | invoke
write more tests
write github actions
create a smart data checker (a fast List[str] check; decide what to do with non-string values)
make it faster with Cython
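One way the data checker from the TODO list could look, purely as a sketch: it separates usable strings from everything else and reports the positions of the rejected items. The name `split_valid` and the choice to reject blank strings are assumptions made for this example, not decisions taken by the project.

```python
from typing import Iterable, List, Tuple


def split_valid(texts: Iterable[object]) -> Tuple[List[str], List[int]]:
    """Separate usable strings from everything else.

    Returns the non-blank strings and the positions of rejected items
    (None, non-strings, and blank strings).
    """
    valid: List[str] = []
    rejected: List[int] = []
    for index, item in enumerate(texts):
        if isinstance(item, str) and item.strip():
            valid.append(item)
        else:
            rejected.append(index)
    return valid, rejected


valid, rejected = split_valid(["hello world", None, 42, "  ", "hej verden"])
print(valid)     # ['hello world', 'hej verden']
print(rejected)  # [1, 2, 3]
```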


Comments

fix: Fix invalid pytest dependency version
poetry does not want to accept flake8 as a valid version. Fixes issue #13

fix: Fix invalid pytest dependency version

fix: Use fasttext-wheel instead of fasttext
opened by saevarb 1
Installation fails with recent poetry due to `fasttext` issues

Hey!

Trying to install fasttext with a recent poetry version fails, as explained in this issue: https://github.com/python-poetry/poetry/issues/6113. This is because fasttext does some really funky things and tries to run a global pip during install. So this means that building luga or using any package that depends on it doesn't work. :/

This means that columbus doesn't build either, since it depends on luga. However, as is outlined in the issue there is a solution: using fasttext-wheel.

I pulled down luga and columbus and updated luga to use fasttext-wheel instead, and managed to get it to install, which also allowed me to build a new version of columbus using the new luga build.

opened by saevarb 1

SSL WRONG_VERSION_NUMBER

Solution from httpx

```python
import httpx
import ssl

ssl_context = httpx.create_ssl_context()
ssl_context.options ^= ssl.OP_NO_TLSv1  # Enable TLS 1.0 back
resp = httpx.get(..., verify=ssl_context)
```

opened by Proteusiq 0
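A note on the `^=` in the snippet above: XOR toggles the `OP_NO_TLSv1` bit, so it re-enables TLS 1.0 only when that bit is currently set. The sketch below works on plain integers, without touching a real SSL context, to show the difference between toggling and clearing; `toggle` and `clear` are illustrative helper names.

```python
import ssl

FLAG = int(ssl.OP_NO_TLSv1)  # the bit that disables TLS 1.0 when set


def toggle(options: int) -> int:
    """XOR flips the bit: clears it if set, sets it if clear."""
    return options ^ FLAG


def clear(options: int) -> int:
    """AND-NOT always clears the bit, whatever its current state."""
    return options & ~FLAG


already_set = FLAG
not_set = 0

print((toggle(already_set) & FLAG) == 0)  # True: bit cleared, TLS 1.0 allowed
print((toggle(not_set) & FLAG) == 0)      # False: bit now set, TLS 1.0 disabled
print((clear(already_set) & FLAG) == 0)   # True
print((clear(not_set) & FLAG) == 0)       # True
```

If the starting state of the options is not known, the `&= ~flag` form is the more defensive choice.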

Return array for compatibility with pandas

This fails since pandas expects an array and luga returns a list

texts.loc[languages(texts["texts"].to_list(), only_language=True) == "da"]

But this works

texts.loc[np.array(languages(texts["texts"].to_list(), only_language=True)) == "da"]

opened by nthomsencph 0
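The reason behind the failure reported above is that comparing a Python list to a string compares the whole list at once and yields a single `False`, whereas an array comparison is element-wise. A minimal sketch in plain Python, using a hand-written list as a stand-in for what luga returns; `language_mask` is a hypothetical helper name.

```python
from typing import List


def language_mask(detected: List[str], wanted: str) -> List[bool]:
    """Element-wise comparison, which a plain `list == str` does not do."""
    return [lang == wanted for lang in detected]


detected = ["da", "en", "da"]  # stand-in for the list luga returns

print(detected == "da")               # False: a list never equals a string
print(language_mask(detected, "da"))  # [True, False, True]
```

A boolean list of the same length as the frame can be passed to `.loc` directly, so a mask built this way avoids the NumPy conversion.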

Releases (v0.2.7)

v0.2.7 (Dec 18, 2022): luga-0.2.7-py3-none-any.whl (5.55 KB), luga-0.2.7.tar.gz (5.34 KB)
v0.2.6 (Sep 28, 2022): luga-0.2.6-py3-none-any.whl (5.51 KB), luga-0.2.6.tar.gz (5.32 KB)
v0.2.5 (Apr 19, 2022): luga-0.2.5-py3-none-any.whl (5.50 KB), luga-0.2.5.tar.gz (5.39 KB)
v0.2.4 (Dec 23, 2021): luga-0.2.4-py3-none-any.whl (4.60 KB), luga-0.2.4.tar.gz (4.52 KB)
v0.2.3 (Dec 22, 2021): luga-0.2.3-py3-none-any.whl (4.56 KB), luga-0.2.3.tar.gz (4.46 KB)
v0.2.2 (Dec 3, 2021): luga-0.2.2-py3-none-any.whl (4.42 KB), luga-0.2.2.tar.gz (4.28 KB)
v0.2.1 (Nov 26, 2021): luga-0.2.1-py3-none-any.whl (4.07 KB), luga-0.2.1.tar.gz (3.95 KB)
v0.2.0 (Nov 26, 2021): luga-0.2.0-py3-none-any.whl (4.07 KB), luga-0.2.0.tar.gz (3.95 KB)
v0.1.8 (Nov 20, 2021): luga-0.1.8-py3-none-any.whl (3.88 KB), luga-0.1.8.tar.gz (3.76 KB)
v0.1.7 (Nov 17, 2021): luga-0.1.7-py3-none-any.whl (3.81 KB), luga-0.1.7.tar.gz (3.66 KB)

Owner

Prayson Wilfred Daniel

🍺 Data Scientist | 🍺 Automating Data Mining & Analysis With Python

