a chinese segment base on crf

Related tags

Text Data & NLPgenius
Overview

Genius

Genius是一个开源的python中文分词组件,采用 CRF(Conditional Random Field)条件随机场算法。

Feature

  • 支持python2.x、python3.x以及pypy2.x。
  • 支持简单的pinyin分词
  • 支持用户自定义break
  • 支持用户自定义合并词典
  • 支持词性标注

Source Install

  • 安装git: 1) ubuntu or debian apt-get install git 2) fedora or redhat yum install git
  • 下载代码:git clone https://github.com/duanhongyi/genius.git
  • 安装代码:python setup.py install

Pypi Install

  • 执行命令:easy_install genius或者pip install genius

Algorithm

  • 采用trie树进行合并词典查找
  • 基于wapiti实现条件随机场分词
  • 可以通过genius.loader.ResourceLoader来重载默认的字典

功能 1):分词genius.seg_text方法

  • genius.seg_text函数接受5个参数,其中text是必填参数:
  • text第一个参数为需要分词的字符
  • use_break代表对分词结构进行打断处理,默认值True
  • use_combine代表是否使用字典进行词合并,默认值False
  • use_tagging代表是否进行词性标注,默认值True
  • use_pinyin_segment代表是否对拼音进行分词处理,默认值True

代码示例( 全功能分词 )

#encoding=utf-8
import genius
text = u"""昨天,我和施瓦布先生一起与部分企业家进行了交流,大家对中国经济当前、未来发展的态势、走势都十分关心。"""
seg_list = genius.seg_text(
    text,
    use_combine=True,
    use_pinyin_segment=True,
    use_tagging=True,
    use_break=True
)
print('\n'.join(['%s\t%s' % (word.text, word.tagging) for word in seg_list]))

功能 2):面向索引分词

  • genius.seg_keywords方法专门为搜索引擎索引准备,保留歧义分割,其中text是必填参数。
  • text第一个参数为需要分词的字符
  • use_break代表对分词结构进行打断处理,默认值True
  • use_tagging代表是否进行词性标注,默认值False
  • use_pinyin_segment代表是否对拼音进行分词处理,默认值False
  • 由于合并操作与此方法有意义上的冲突,此方法并不提供合并功能;并且如果采用此方法做索引时候,检索时不推荐genius.seg_text使用use_combine=True参数。

代码示例

#encoding=utf-8
import genius

seg_list = genius.seg_keywords(u'南京市长江大桥')
print('\n'.join([word.text for word in seg_list]))

功能 3):关键词提取

  • genius.extract_tag方法专门为提取tag关键字准备,其中text是必填参数。
  • text第一个参数为需要分词的字符
  • use_break代表对分词结构进行打断处理,默认值True
  • use_combine代表是否使用字典进行词合并,默认值False
  • use_pinyin_segment代表是否对拼音进行分词处理,默认值False

代码示例

#encoding=utf-8
import genius

tag_list = genius.extract_tag(u'南京市长江大桥')
print('\n'.join(tag_list))

其他说明 4):

  • 目前分词语料出自人民日报1998年1月份,所以对于新闻类文章分词较为准确。
  • CRF分词效果很大程度上依赖于训练语料的类别以及覆盖度,若解决语料问题分词和标注效果还有很大的提升空间。
Owner
duanhongyi
a simple programmer.
duanhongyi
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 07, 2022
Dual languaged (rus+eng) tool for packing and unpacking archives of Silky Engine.

SilkyArcTool English Dual languaged (rus+eng) GUI tool for packing and unpacking archives of Silky Engine. It is not the same arc as used in Ai6WIN. I

Tester 5 Sep 15, 2022
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

fastNLP 342 Jan 05, 2023
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022
Transformers implementation for Fall 2021 Clinic

Installation Download miniconda3 if not already installed You can check by running typing conda in command prompt. Use conda to create an environment

Aakash Tripathi 1 Oct 28, 2021
A simple implementation of N-gram language model.

About A simple implementation of N-gram language model. Requirements numpy Data preparation Corpus Training data for the N-gram model, a text file lik

4 Nov 24, 2021
CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조

CJK computer science terms comparison This repository contains the source code of the website. You can see the website from the following link: Englis

Hong Minhee (洪 民憙) 88 Dec 23, 2022
🏆 • 5050 most frequent words in 109 languages

🏆 Most Common Words Multilingual 5000 most frequent words in 109 languages. Uses wordfrequency.info as a source. 🔗 License source code license data

14 Nov 24, 2022
Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any language

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any

Little Endian 1 Apr 28, 2022
Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Realistic Few-Shot Relation Extraction This repository contains code to reproduce the results in the paper "Towards Realistic Few-Shot Relation Extrac

Bloomberg 8 Nov 09, 2022
This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

Common Voice Utils This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project. It aims t

Francis Tyers 40 Dec 20, 2022
A Streamlit web app that generates Rick and Morty stories using GPT2.

Rick and Morty Story Generator This project uses a pre-trained GPT2 model, which was fine-tuned on Rick and Morty transcripts, to generate new stories

₸ornike 33 Oct 13, 2022
COVID-19 Related NLP Papers

COVID-19 outbreak has become a global pandemic. NLP researchers are fighting the epidemic in their own way.

xcfeng 28 Oct 30, 2022
Snips Python library to extract meaning from text

Snips NLU Snips NLU (Natural Language Understanding) is a Python library that allows to extract structured information from sentences written in natur

Snips 3.7k Dec 30, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
Tools, wrappers, etc... for data science with a concentration on text processing

Rosetta Tools for data science with a focus on text processing. Focuses on "medium data", i.e. data too big to fit into memory but too small to necess

207 Nov 22, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

Kakao Brain 1.2k Dec 21, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022