A simple implementation of N-gram language model.

Last update: Nov 24, 2021

Related tags

Text Data & NLP n-gram

Overview

About

A simple implementation of N-gram language model.

Requirements

numpy

Data preparation

Corpus

Training data for the N-gram model, a text file like this:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

Text lines will be split into tokens by a delimiter when training. By default, no delimiter given, text lines will be split into characters.

Tokens

The dictionary for the model, a text file, each line of which is a token. Every token is unique in the file.

光
衰
戒
颅
阖

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The testing output will like:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.

A simple implementation of N-gram language model.

Related tags

Overview

About

Requirements

Data preparation

Corpus

Tokens

Training

Testing

Owner

A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Global Rhythm Style Transfer Without Text Transcriptions

Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Need: Image Search With Python

A library for finding knowledge neurons in pretrained transformer models.

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

A demo of chinese asr

Honor's thesis project analyzing whether the GPT-2 model can more effectively generate free-verse or structured poetry.

A versatile token stream for handwritten parsers.

An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

Translation to python of Chris Sims' optimization function

Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP