pkuseg: a toolkit for multi-domain Chinese word segmentation

Overview

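pkuseg: a multi-domain Chinese word segmentation toolkit

pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports domain-specific word segmentation, and improves segmentation accuracy.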

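Highlights

pkuseg has the following features:

  1. Multi-domain segmentation. Unlike earlier general-purpose Chinese word segmentation tools, this toolkit also aims to provide pretrained models tailored to data from different domains. Users can freely choose a model according to the domain of the text to be segmented. Pretrained models are currently available for the news, web, medicine, and tourism domains, plus a mixed-domain model. If the domain of the text is known, load the corresponding model. If the domain cannot be determined, the general model trained on mixed-domain data is recommended. Segmentation samples for each domain are in example.txt.
  2. Higher segmentation accuracy. Given the same training and test data, pkuseg achieves higher segmentation accuracy than other segmentation toolkits.
  3. User-trained models. Users can train models on their own newly annotated data.
  4. Part-of-speech tagging.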

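Build and installation

  • Only Python 3 is currently supported.
  • For the best accuracy and speed, we strongly recommend upgrading to the latest version with pip install.
  1. Install from PyPI (model files included):

    pip3 install pkuseg

    Then use it in Python with import pkuseg.

    We recommend upgrading to the latest version for a better out-of-the-box experience: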

    pip3 install -U pkuseg
    
  2. If downloads from the official PyPI index are slow, use a mirror, for example:
    First-time installation:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    

    Upgrade:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
    
  3. If you download the source from GitHub instead of installing with pip, run the following command to build it:

    python setup.py build_ext -i
    

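    The GitHub code does not include the pretrained models, so users need to download or train models themselves. The pretrained models are listed under release. When using one, set "model_name" to the model file.

Note: installation methods 1 and 2 currently support only 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows. On other systems, use method 3 to build and install locally.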
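One way to confirm that the interpreter and the installed package meet the requirements above is to query them from Python. This is a minimal sketch using only the standard library; the helper name installed_version is ours, and importlib.metadata is assumed to be available (Python 3.8 or later).

```python
import sys
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Running Python 3:", sys.version_info.major == 3)
print("pkuseg version:", installed_version("pkuseg"))  # None means not installed
```

If the reported version is older than the current release, pip3 install -U pkuseg upgrades it.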

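Performance comparison with other segmentation toolkits

We selected representative Chinese segmentation toolkits such as jieba and THULAC for a performance comparison with pkuseg. Detailed settings are described in the experimental environment document.

Domain-specific training and test results

The results on different datasets are compared below: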

MSRA     Precision   Recall   F-score
jieba    87.01       89.88    88.42
THULAC   95.60       95.91    95.71
pkuseg   96.94       96.81    96.88

WEIBO    Precision   Recall   F-score
jieba    87.79       87.54    87.66
THULAC   93.40       92.40    92.87
pkuseg   93.78       94.65    94.21
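
The tables above report word-level precision, recall, and F-score. The scoring script is not reproduced here; as a rough sketch, assuming the common convention that a predicted word counts as correct only when both of its boundaries match the reference, the three numbers for a single sentence could be computed as follows. The helper names to_spans and prf and the sample prediction are illustrative and not part of pkuseg.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, f_score) in percent for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    # Guard every division so an empty prediction does not raise ZeroDivisionError.
    precision = 100.0 * correct / len(pred) if pred else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = ['我', '爱', '北京', '天安门']        # reference segmentation
pred = ['我', '爱', '北京', '天安', '门']    # hypothetical system output
p, r, f = prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=60.00 R=75.00 F=66.67
```

Corpus-level scores are usually obtained by summing the correct, predicted, and reference word counts over all sentences before dividing, rather than by averaging per-sentence scores.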

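Default models on different domains

Many users who try out a segmentation toolkit test it with the bundled model. To compare this "out-of-the-box" performance directly, we also evaluated each toolkit's default model on different domains. Note that this comparison only illustrates default behavior and is not necessarily fair.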

Default   MSRA    CTB8    PKU     WEIBO   All Average
jieba     81.45   79.58   81.83   83.56   81.61
THULAC    85.55   87.84   92.29   86.65   88.08
pkuseg    87.29   91.77   92.68   93.43   91.29

All Average is the mean F-score over all test sets.
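
As a quick check of how the All Average column is derived, the per-dataset F-scores from the table can be averaged directly. This is only an illustrative sketch; the dictionary layout and variable names are ours.

```python
from statistics import mean

# F-scores of the default models, copied from the table of default-model results.
default_f_scores = {
    "jieba":  {"MSRA": 81.45, "CTB8": 79.58, "PKU": 81.83, "WEIBO": 83.56},
    "THULAC": {"MSRA": 85.55, "CTB8": 87.84, "PKU": 92.29, "WEIBO": 86.65},
    "pkuseg": {"MSRA": 87.29, "CTB8": 91.77, "PKU": 92.68, "WEIBO": 93.43},
}

# Unweighted mean over the four test sets for each toolkit.
all_average = {tool: mean(scores.values()) for tool, scores in default_f_scores.items()}

for tool, avg in all_average.items():
    print(f"{tool}: {avg:.3f}")
```

The means agree with the published column to within rounding at two decimal places.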

For a more detailed comparison, see the comparison with existing toolkits.

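Usage

Code examples

The following code examples are intended for the Python interactive environment.

Example 1: segment with the default configuration (recommended when the domain of the text cannot be determined)

import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # segment the sentence
print(text)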

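Example 2: domain-specific segmentation (recommended when the domain of the text is known)

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the corresponding domain model is downloaded automatically
text = seg.cut('我爱北京天安门')              # segment the sentence
print(text)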

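Example 3: segmentation with part-of-speech tagging; the meaning of each tag is described in tags.txt

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable part-of-speech tagging
text = seg.cut('我爱北京天安门')    # segment and tag the sentence
print(text)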
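
With postag=True, the result is a list of (word, tag) pairs rather than plain strings; a sample of this output appears in a console session quoted in the Comments section. The sketch below works on such a list without calling pkuseg, and the helper name group_by_tag is ours.

```python
from collections import defaultdict

# Shape of the tagged output: a list of (word, tag) pairs.
tagged = [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]


def group_by_tag(pairs):
    """Collect words under their part-of-speech tag, preserving order."""
    by_tag = defaultdict(list)
    for word, tag in pairs:
        by_tag[tag].append(word)
    return dict(by_tag)


groups = group_by_tag(tagged)
print(groups['ns'])  # ['北京', '天安门']
```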

Example 4: segment a file

import pkuseg

# Segment input.txt and write the result to output.txt,
# using 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)

More usage examples are in the detailed code examples.

Parameters

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
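	model_name		Model path.
				"default", the default; uses our pretrained mixed-domain model (pip installations only).
				"news", uses the news-domain model.
				"web", uses the web-domain model.
				"medicine", uses the medicine-domain model.
				"tourism", uses the tourism-domain model.
				model_path, loads a model from a user-specified path.
	user_dict		User dictionary.
				"default", the default; uses the dictionary we provide.
				None, uses no dictionary.
				dict_path, uses a user-defined dictionary in addition to the default one. Pass the path of your own dictionary file, which lists one word per line (if part-of-speech tagging is enabled and the word's tag is known, write the word and its tag on the line, separated by a tab character).
	postag			Whether to perform part-of-speech tagging.
				False, the default; segments only, without part-of-speech tagging.
				True, performs part-of-speech tagging together with segmentation.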
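
The user dictionary format described above (one word per line, optionally followed by a tab and a tag) can be produced with a few lines of Python. This is a minimal sketch: the entries, the file name, the helper write_user_dict, and the choice of UTF-8 encoding are all assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical entries as (word, tag); a tag of None means the tag is unknown or unused.
entries = [("天安门", "ns"), ("北京", "ns"), ("分词工具包", None)]


def write_user_dict(path, entries):
    """Write one word per line; append a tab and the tag when a tag is given."""
    with open(path, "w", encoding="utf-8") as f:
        for word, tag in entries:
            f.write(word if tag is None else f"{word}\t{tag}")
            f.write("\n")


dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict(dict_path, entries)

# The path would then be passed as: pkuseg.pkuseg(user_dict=dict_path, postag=True)
with open(dict_path, encoding="utf-8") as f:
    print(f.read())
```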

Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
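	readFile		Input file path.
	outputFile		Output file path.
	model_name		Model path. Same as in pkuseg.pkuseg.
	user_dict		User dictionary. Same as in pkuseg.pkuseg.
	postag			Whether to enable part-of-speech tagging. Same as in pkuseg.pkuseg.
	nthread			Number of processes to use when segmenting.

Model training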

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
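	trainFile		Training file path.
	testFile		Test file path.
	savedir			Directory in which to save the trained model.
	train_iter		Number of training iterations.
	init_model		Model used for initialization. The default, None, uses the default initialization; to start from an existing model, pass its path, e.g. init_model='./models/'.

Multiprocess segmentation

When you run the code examples above from a file and use any multiprocess feature, be sure to guard the top-level statements with if __name__ == '__main__'. See the multiprocess segmentation notes for details.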
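
A script that follows this advice might be laid out as below. This is a sketch rather than the project's own example: the function names segment_file and main, the placeholder paths, and the checks that skip the run when pkuseg or the input file is missing are our additions; only pkuseg.test and its nthread parameter come from the documentation above.

```python
import importlib.util
import os


def segment_file(src, dst, nthread=4):
    """Segment `src` into `dst` with pkuseg.test using several processes."""
    import pkuseg  # imported lazily so importing this file has no side effects
    pkuseg.test(src, dst, nthread=nthread)


def main():
    src, dst = "input.txt", "output.txt"  # placeholder paths
    if importlib.util.find_spec("pkuseg") is None:
        print("pkuseg is not installed; nothing to do")
        return
    if not os.path.exists(src):
        print(f"{src} does not exist; nothing to do")
        return
    segment_file(src, dst)


# On platforms that start worker processes by launching a fresh interpreter
# (for example Windows), each worker re-imports this file. The guard keeps
# the workers from running main() again.
if __name__ == "__main__":
    main()
```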

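Pretrained models

Users who installed via pip only need to set the model_name field to the desired domain to use domain-specific segmentation; the corresponding model is downloaded automatically.

Users who downloaded the code from GitHub need to download the pretrained model themselves and set model_name to its path. Pretrained models can be downloaded from the release section. The pretrained models are:

  • news: trained on MSRA (news corpus).

  • web: trained on Weibo (web text corpus).

  • medicine: trained on medical-domain data.

  • tourism: trained on tourism-domain data.

  • mixed: a general model trained on a mixed dataset. This is the model bundled with the pip package.

Users are welcome to share domain-specific models they have trained.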

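Version history

See the version history.

License

  1. This code is released under the MIT License.
  2. Comments and suggestions on the toolkit are welcome; please email [email protected]

Citation

This toolkit is mainly based on the following paper. If you use it, please cite: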


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Other related papers

  • Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
  • Jingjing Xu and Xu Sun. Dependency-based gated recursive neural network for chinese word segmentation. ACL. 2016.
  • Jingjing Xu and Xu Sun. Transfer learning for low-resource chinese word segmentation with a novel neural network. NLPCC. 2017.

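Frequently asked questions

  1. Why release pkuseg?
  2. What techniques does pkuseg use?
  3. Multiprocess segmentation and training fail with RuntimeError and BrokenPipeError.
  4. How was pkuseg compared with other toolkits on domain-specific data?
  5. How does it perform when compared on black-box test sets?
  6. What if I do not know the domain of the text to be segmented?
  7. How should segmentation results on particular examples be interpreted?
  8. About running speed.
  9. About multiprocess speed.

Acknowledgments

We thank Professor Shiwen Yu (俞士汶) of the Institute of Computational Linguistics, Peking University, and Dr. Likun Qiu (邱立坤) for providing the training datasets!

Authors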

Ruixuan Luo (罗睿轩), Jingjing Xu(许晶晶), Xuancheng Ren(任宣丞), Yi Zhang(张艺), Bingzhen Wei(位冰镇), Xu Sun (孙栩)

Language Computing and Machine Learning Group, Peking University

Comments
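  • Isn't the performance comparison with other segmentation toolkits unfair?

    Were the jieba and THULAC models in the comparison trained on the corresponding training corpora (MSRA, CTB8)? If they had been, their results should not be this poor. An F-score of around 80% is close to what unsupervised segmentation achieves.

    Comparing pkuseg trained on in-domain data against jieba and THULAC that were not trained on the corresponding domain data is clearly unfair. The conclusion that segmentation accuracy is greatly improved cannot be drawn from such an experiment.

    In fact, segmentation results on MSRA reported in papers are generally above 97.5.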

    opened by jiesutd 31
  • A single sentence is apparently enough to settle the contest with jieba

    pkuseg:

    seg = pkuseg.pkuseg()
    print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
    ['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']

    jieba:

    print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
    ['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']

    Three words are wrong in one sentence. Where does the confidence to announce so loudly that it far surpasses jieba come from ......

    opened by mendynew 8
  • undefined symbol: PyFPE_jbuf

    ImportError: /root/anaconda3/envs/NLP/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

    ubuntu, pip install pkuseg any ideas?

    opened by LCorleone 5
  • What is the required encoding of the input file?

    C:\Python36>python
    Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg()           # load the model with the default configuration
    >>> text = seg.cut('我爱北京天安门')  # segment the sentence
    >>> print(text)
    ['我', '爱', '北京', '天安门']
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg(postag=True)  # enable part-of-speech tagging
    Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip" to C:\Users\lutao/.pkuseg\postag.zip
    100.0%
    >>> text = seg.cut('我爱北京天安门')    # segment and tag the sentence
    >>> print(text)
    [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]
    >>> import pkuseg
    >>>
    >>> # segment input.txt and write the result to output.txt
    ... # using 20 processes
    ... pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
      File "<stdin>", line 3
        pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
                                                           ^
    SyntaxError: invalid syntax
    >>> pkuseg.test('c:/user/lutao/downloads/0309a.txt', 'c:/user/lutao/downloads/0309a_output.txt', nthread=10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 520, in test
        input_file, output_file, nthread, model_name, user_dict, postag, verbose
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 444, in _test_multi_proc
        raise Exception("input_file {} does not exist.".format(input_file))
    Exception: input_file c:/user/lutao/downloads/0309a.txt does not exist.
    

    I replaced '/' with '\', and the encoding of 0309a.txt is gbk

    >>> pkuseg.test('c:\user\lutao\downloads\0309a.txt', 'c:\user\lutao\downloads\0309a_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    

    I saved 0309a.txt as 0309b.txt with utf-8 encoding,

    >>> pkuseg.test('c:\user\lutao\downloads\0309b.txt', 'c:\user\lutao\downloads\0309b_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    
    opened by l1t1 5
  • import fails on python3.6

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
    Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)

    python
    Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.

    import pkuseg
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
        import pkuseg.trainer as trainer
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
    opened by tangchun 5
  • What causes this problem?

    length = 1 : 0
    length = 2 : 2496
    length = 3 : 2642
    length = 4 : 2568
    length = 5 : 1313
    length = 6 : 633
    length = 7 : 249
    length = 8 : 133
    length = 9 : 66
    length = 10 : 16
    length = 11 : 6
    length = 12 : 1
    length = 13 : 1

    start training...

    reading training & test data...
    done! train/test data sizes: 1/1

    r: 1
    iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
    iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
    Traceback (most recent call last):
      File "test.py", line 8, in <module>
        pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/__init__.py", line 324, in train
        trainer.train(config)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
        score_list = trainer.test(testset, i)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
        testset, self.model, writer
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
        gold_tags, pred_tags, self.idx_to_chunk_tag
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
        pre = correct_chunk / res_chunk * 100
    ZeroDivisionError: division by zero

    opened by Fabyone 4
  • ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    Installed with pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

    import pkuseg
    seg = pkuseg.pkuseg()
    text = "我爱北京天安门"
    cut = seg.cut(text)
    print(cut)

    Traceback (most recent call last):
      File "E:/python/work/spider/bx/piggy.py", line 1, in <module>
        import pkuseg
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
        import pkuseg.trainer
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    opened by xhochipe 4
  • FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    I installed pkuseg. On first use it needs to download postag.zip, but the download failed, so I downloaded the file myself and placed it in the folder. It still reports FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    opened by hjing100 3
  • 0.0.25 fails to install on Binder

    0.0.22 installs without problems.

    Collecting numpy
      Downloading numpy-1.19.0-cp37-cp37m-manylinux2010_x86_64.whl (14.6 MB)
    Collecting pkuseg
      Downloading pkuseg-0.0.25.tar.gz (48.8 MB)
        ERROR: Command errored out with exit status 1:
         command: /srv/conda/envs/notebook/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-h22vfd4x
             cwd: /tmp/pip-install-5d95j8mq/pkuseg/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-5d95j8mq/pkuseg/setup.py", line 5, in <module>
            import numpy as np
        ModuleNotFoundError: No module named 'numpy'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    opened by GoooIce 3
  • Installed with pip; using a domain-specific model raises an error?

    Traceback (most recent call last):
      File "py3_cook_corpus_embedding.py", line 18, in <module>
        seg = pkuseg.pkuseg(model_name='medicine')
      File "/home/work/software/anaconda3/envs/py3myhao/lib/python3.6/site-packages/pkuseg/__init__.py", line 224, in __init__
        self.feature_extractor = FeatureExtractor.load()
      File "pkuseg/feature_extractor.pyx", line 625, in pkuseg.feature_extractor.FeatureExtractor.load
    FileNotFoundError: [Errno 2] No such file or directory: 'medicine/unigram_word.txt'

    Also, can a custom dictionary be used together with a domain-specific model?

    opened by kinghmy 3
  • Installation error with wsl2 + pyenv + python3.8.5.

    (fastApi-env) [email protected]:/mnt/c/Users/Administrator$ pip install pkuseg
    Looking in indexes: http://mirrors.aliyun.com/pypi/simple
    Collecting pkuseg
      Downloading http://mirrors.aliyun.com/pypi/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
    |████████████████████████████████| 48.8 MB 79.6 MB/s
    ERROR: Command errored out with exit status 1:
     command: /home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-99zrwcbj
         cwd: /tmp/pip-install-hjb0015_/pkuseg/
    Complete output (36 lines):
    WARNING: The wheel package is not available.
    WARNING: The repository located at mirrors.aliyun.com is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, otherwise you may silence this warning and allow it anyway with '--trusted-host mirrors.aliyun.com'.
    ERROR: Could not find a version that satisfies the requirement cython (from versions: none)
    ERROR: No matching distribution found for cython
    Traceback (most recent call last):
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 128, in fetch_build_egg
        subprocess.check_call(cmd)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 63, in <module>
        setup_package()
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 39, in setup_package
        setuptools.setup(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 162, in setup
        _install_setup_requires(attrs)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 157, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 699, in fetch_build_eggs
        resolved_dists = pkg_resources.working_set.resolve(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 779, in resolve
        dist = best[req.key] = env.best_match(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1064, in best_match
        return self.obtain(req, installer)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1076, in obtain
        return installer(requirement)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 758, in fetch_build_egg
        return fetch_build_egg(self, req)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 130, in fetch_build_egg
        raise DistutilsError(str(e)) from e
    distutils.errors.DistutilsError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.
    ----------------------------------------
    

    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

    opened by xiaxichen 2
  • cannot install in the environment of python 3.9

    Dear Sirs or Madams, the installation to the environment of python 3.9 failed. I check your repository at 'https://pypi.tuna.tsinghua.edu.cn/simple/pkuseg/'. It seems that there are no python 3.9 relevant files there. do you have any plan to support python 3.9? I also saw that in other issues, you suggested the file relevant to python 3.9. I cannot find the file. Your reply is highly appreciated. Tony

    opened by tonydeck0506 2
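  • Part-of-speech tagging works too well

    In theory, good results are a good thing, but in actual testing it also tags place names that do not exist as place names.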

    import pkuseg
    seg = pkuseg.pkuseg(postag=True)
    text = seg.cut('广场镇是河北天津衡水冲绳东京的旧地狱和亚特兰斯地吗?')
    for word, flag in text: 
        if flag == 'ns':
            print (word)
    

    The output is:

    广场镇
    河北
    天津
    衡水
    冲绳
    东京
    亚特兰斯
    
    opened by axty666 1
  • TypeError: train() got an unexpected keyword argument 'nthread'

    
    import pkuseg
    
    # The training file is 'train.txt'
    # The test file is 'test.txt'
    # Load the model in './pretrained', save the trained model to './models', and train for 10 iterations
    pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')
    
    
    opened by KangChou 1
Releases(v0.0.25)
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University