Persian Lexicon

This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.

GetWords.py can read these files and return words as a list of strings.
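
As a rough illustration of what reading one of these files involves, here is a minimal standalone sketch. It does not reproduce the actual interface of GetWords.py; the function name `load_words` is hypothetical, and the sketch assumes the files are UTF-8 text with one word per line.

```python
from pathlib import Path


def load_words(path):
    """Return the words in a text file as a list of strings.

    Assumption (not confirmed by the repo): the file is UTF-8 encoded
    and holds one word per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Under those assumptions, `load_words("data/persian-words.txt")` would return the main lexicon as a list of strings.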

Cleanup details

Main Lexicon

The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
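
The filter described above could look roughly like the sketch below. This is not the repo's actual code: the names `is_allowed` and `filter_words` are hypothetical, and the sketch assumes that "Arabic numerals" covers every Unicode digit, including Arabic-Indic (٠-٩) and Persian (۰-۹) digits.

```python
def is_allowed(word):
    """Return True if the word contains no ASCII characters and no digits."""
    # str.isascii() catches Latin letters, ASCII punctuation, and 0-9.
    # str.isdigit() additionally catches Arabic-Indic and Persian digits
    # (an assumption about what "Arabic numerals" means here).
    return not any(ch.isascii() or ch.isdigit() for ch in word)


def filter_words(words):
    """Keep unique, non-empty words that pass is_allowed, in first-seen order."""
    return list(dict.fromkeys(w for w in words if w and is_allowed(w)))
```

Rejecting a word on any single offending character, rather than stripping that character, matches the "filter out words that contain" wording and avoids creating truncated pseudo-words.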

Fixed length Lexicons

More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:

After applying these filters, we ended up with the following number of words per file:

2 letter words: 310 unique words
3 letter words: 2378 unique words
4 letter words: 7059 unique words
5 letter words: 10043 unique words
6 letter words: 9541 unique words
7 letter words: 7350 unique words
8 letter words: 4681 unique words
9 letter words: 2529 unique words
10 letter words: 1250 unique words
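
For illustration, splitting a word list into fixed-length groups could be sketched as follows. The function name `group_by_length` is hypothetical, and the sketch assumes that word length means the number of Unicode code points, which may differ from how the repo counts letters (for example, if a word contains a zero-width non-joiner). The extra character filtering for fixed-length files is not included.

```python
from collections import defaultdict


def group_by_length(words, min_len=2, max_len=10):
    """Map each length in [min_len, max_len] to its sorted unique words."""
    groups = defaultdict(set)
    for word in words:
        n = len(word)  # number of Unicode code points (an assumption)
        if min_len <= n <= max_len:
            groups[n].add(word)
    # Lengths with no words are simply absent from the result.
    return {n: sorted(ws) for n, ws in sorted(groups.items())}
```

Counts like the ones listed above could then be produced with `len(group)` for each length in the returned dictionary.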

Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Owner

Saman Vaisipour
