Chinese segmentation library

Last update: Jun 28, 2022

Overview

What is loso?

loso is a Chinese word segmentation system written in Python. It was developed by Victor Lin for Plurk Inc.

Copyright & License

Setup loso

To install loso, clone the repository and run the following commands:

cd loso
python setup.py develop

You also need a running Redis database, which stores the lexicon database. Finally, copy the configuration template and edit the copy:

cp default.yaml myconf.yaml
vim myconf.yaml
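
Before going further, it can help to confirm that the Redis server is reachable. The sketch below is not part of loso. It only checks that something accepts TCP connections on the given host and port, which does not prove the listener is Redis. The defaults assume Redis on localhost at its standard port 6379; use whatever host and port your myconf.yaml specifies.

```python
import socket


def is_port_open(host="localhost", port=6379, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper, not part of loso. 6379 is the default Redis port;
    the host and port loso actually uses come from its configuration file.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling is_port_open() with no arguments checks localhost:6379.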

To use your configuration, set the LOSO_CONFIG_FILE environment variable to the path of your configuration file. For example:

LOSO_CONFIG_FILE=myconf.yaml python setup.py serve
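
As a minimal sketch of the pattern, a program can pick up its configuration path from an environment variable like this. The function name resolve_config_path and the fallback to default.yaml are assumptions made for illustration; the README does not say what loso does when LOSO_CONFIG_FILE is unset.

```python
import os


def resolve_config_path(environ=None, default="default.yaml"):
    """Return the configuration path named by LOSO_CONFIG_FILE.

    Hypothetical helper, not loso's own code. Falling back to
    "default.yaml" when the variable is unset is an assumption.
    """
    if environ is None:
        environ = os.environ
    return environ.get("LOSO_CONFIG_FILE", default)
```

Passing a plain dict as environ makes the lookup easy to try without touching the real environment.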

Use loso

loso determines segmentation from the lexicon database, using an algorithm based on a Hidden Markov Model. The service therefore cannot be used until a lexicon database has been built.

To feed a text file to the database, run

python setup.py feed -f /home/victorlin/plurk_src/realtime_search/word_segment/sample_data/sample_tr_ch
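
The README does not describe what the feed command stores. One plausible way to build a lexicon from raw text, sketched here purely as an assumption, is to count the character n-grams that occur in it. The function name count_ngrams, the max_n limit, and the restriction to the basic CJK ideograph range are illustrative choices, not loso's behaviour.

```python
import re
from collections import Counter


def count_ngrams(text, max_n=3):
    """Count character n-grams (n = 1..max_n) inside runs of CJK ideographs.

    Hypothetical sketch of lexicon building, not loso's feed command.
    """
    counts = Counter()
    # Only runs of basic CJK ideographs are scanned, so n-grams never
    # span punctuation, whitespace, or Latin text.
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts
```

Frequent n-grams are candidates for real terms; how they are scored and stored in Redis is up to the actual implementation.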

To clear the database, run

python setup.py reset

To test term splitting interactively, run

python setup.py interact

For example:

Text: 留下鉅細靡遺的太空梭發射影片，供世人回味
....
留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味

To run the segmentation service as an XML-RPC service, run

python setup.py serve

The following simple Python 2 program shows how to use it:

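# -*- coding: utf-8 -*-
# Python 2 client: xmlrpclib was renamed xmlrpc.client in Python 3.
import xmlrpclib

# Connect to the loso XML-RPC service started by "python setup.py serve".
proxy = xmlrpclib.ServerProxy("http://localhost:5566/")

# splitTerms returns the segmented terms of the given text.
terms = proxy.splitTerms(u'留下鉅細靡遺的太空梭發射影片，供世人回味')
print ' '.join(terms)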

The output should be

留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味
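
The client above targets Python 2. For callers on Python 3, a minimal equivalent might look like the sketch below. The wrapper name split_terms is hypothetical; the splitTerms method and the http://localhost:5566/ address are taken from the example above, and the code assumes the service is already running there.

```python
import xmlrpc.client


def split_terms(text, url="http://localhost:5566/"):
    """Ask a running loso XML-RPC service to segment text.

    Hypothetical wrapper; only splitTerms and the default URL come
    from the README.
    """
    proxy = xmlrpc.client.ServerProxy(url)
    return proxy.splitTerms(text)
```

With the service running, joining the result of split_terms('留下鉅細靡遺的太空梭發射影片，供世人回味') with spaces should give the same line as above.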

Owner

Fang-Pen Lin
