pkuseg: a toolkit for multi-domain Chinese word segmentation

Overview

pkuseg: a multi-domain Chinese word segmentation toolkit

pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports segmentation for specific domains, and improves segmentation accuracy.

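Highlights

pkuseg has the following features:

  1. Multi-domain segmentation. Unlike earlier general-purpose Chinese word segmentation tools, this toolkit also provides pre-trained models tailored to data from different domains. Users can choose a model according to the domain of the text to be segmented. We currently provide pre-trained segmentation models for the news, web, medicine, and tourism domains, as well as a mixed-domain model. If the domain of the text is known, load the corresponding model. If the domain cannot be determined, we recommend the general-purpose model trained on mixed-domain data. Segmentation examples for each domain can be found in example.txt.
  2. Higher segmentation accuracy. Compared with other segmentation toolkits, pkuseg achieves higher accuracy when the same training and test data are used.
  3. Support for user-trained models. Users can train models on entirely new annotated data.
  4. Support for part-of-speech (POS) tagging.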

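Build and installation

  • Only Python 3 is currently supported.
  • For the best accuracy and speed, we strongly recommend updating to the latest version with pip install.
  1. Install from PyPI (model files included):

    pip3 install pkuseg

    Then use it via import pkuseg.

    We recommend updating to the latest version for a better out-of-the-box experience:

    pip3 install -U pkuseg

  2. If downloads from the official PyPI index are slow, use a mirror, for example:
    First-time installation:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg

    Update:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

  3. If you download the code from GitHub instead of installing with pip, run the following command to install:

    python setup.py build_ext -i

    The GitHub code does not include pre-trained models, so you must download or train a model yourself; the pre-trained models are listed under release. When using one, set "model_name" to the model file.

Note: installation methods 1 and 2 currently support only 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows. On any other system, use method 3 to compile and install locally.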
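To confirm which version ended up installed, the package metadata can be queried from Python. The following is a minimal sketch using the standard importlib.metadata module (Python 3.8 or later); the helper name installed_version is illustrative and is not part of pkuseg.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pkuseg")
    print("pkuseg version:", found if found else "not installed")
```

If the printed version is older than expected, rerunning the update command with -U should bring it up to date.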

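Performance comparison with other segmentation toolkits

We compared pkuseg with jieba, THULAC, and other representative Chinese-developed segmentation toolkits; see the experimental setup for details.

Domain-specific training and test results

Comparison results on different datasets:

MSRA      Precision  Recall  F-score
jieba     87.01      89.88   88.42
THULAC    95.60      95.91   95.71
pkuseg    96.94      96.81   96.88

WEIBO     Precision  Recall  F-score
jieba     87.79      87.54   87.66
THULAC    93.40      92.40   92.87
pkuseg    93.78      94.65   94.21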
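
Precision, recall, and F-score for word segmentation are conventionally computed over word spans: a predicted word counts as correct only when both of its boundaries match the gold segmentation. The sketch below illustrates that standard calculation. It is not the evaluation script behind the numbers above, and the names to_spans and prf are made up for this example.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, F-score) of a predicted segmentation."""
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    # Guard against empty inputs so that no division by zero can occur.
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = ["我", "爱", "北京", "天安门"]
pred = ["我", "爱", "北京", "天安", "门"]
p, r, f = prf(gold, pred)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")
```

Here three of the five predicted words match gold spans, so precision is 3/5 and recall is 3/4.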

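Default-model performance across domains

Many users try a segmentation tool with its bundled model. To compare this "out-of-the-box" performance directly, we also evaluated each toolkit's default model on test sets from different domains. Note that this comparison only illustrates default behavior and is not necessarily fair.

Default   MSRA   CTB8   PKU    WEIBO  All Average
jieba     81.45  79.58  81.83  83.56  81.61
THULAC    85.55  87.84  92.29  86.65  88.08
pkuseg    87.29  91.77  92.68  93.43  91.29

All Average is the mean F-score over all test sets.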
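
The All Average column can be reproduced from the per-dataset columns. A small check using the values from the table above (the dictionary layout is only for illustration):

```python
# F-scores of each default model on MSRA, CTB8, PKU, and WEIBO, copied from the table.
default_f_scores = {
    "jieba": [81.45, 79.58, 81.83, 83.56],
    "THULAC": [85.55, 87.84, 92.29, 86.65],
    "pkuseg": [87.29, 91.77, 92.68, 93.43],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


for name, scores in default_f_scores.items():
    print(f"{name}: {mean(scores):.2f}")
```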

For more detailed comparisons, see the comparison with existing toolkits.

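Usage

Code examples

The following code examples are intended for the Python interactive environment.

Example 1: segment with the default configuration (recommended when the domain of the text is unknown)

import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # segment the sentence
print(text)

Example 2: domain-specific segmentation (recommended when the domain of the text is known)

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the corresponding domain model is downloaded automatically
text = seg.cut('我爱北京天安门')              # segment the sentence
print(text)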
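
One way to apply the advice in Examples 1 and 2 in code is to fall back to the default model whenever the domain is unknown. The helper below is a hypothetical convenience, not part of pkuseg; the domain names are the model_name values listed in this README.

```python
# Domain names accepted as model_name according to this README.
SUPPORTED_DOMAINS = {"news", "web", "medicine", "tourism"}


def choose_model_name(domain=None):
    """Return the domain itself if a model exists for it, otherwise "default"."""
    if domain in SUPPORTED_DOMAINS:
        return domain
    return "default"


print(choose_model_name("medicine"))
print(choose_model_name("finance"))
```

The result can then be passed on, for example as pkuseg.pkuseg(model_name=choose_model_name("medicine")).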

Example 3: segmentation with POS tagging; the meaning of each tag is described in tags.txt

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
text = seg.cut('我爱北京天安门')    # segment and tag parts of speech
print(text)
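
With postag=True, cut returns a list of (word, tag) pairs; a user report further down this page shows the output [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')] for the sentence above. The sketch below groups such pairs by tag. It works on that hard-coded sample so that it runs without pkuseg, and group_by_tag is an illustrative name.

```python
from collections import defaultdict


def group_by_tag(tagged_words):
    """Group (word, tag) pairs into a dict that maps each tag to its words."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return dict(groups)


# Sample output of seg.cut(...) with postag=True, as reported on this page.
sample = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns")]
print(group_by_tag(sample)["ns"])  # prints ['北京', '天安门']
```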

Example 4: segment a file

import pkuseg

# Segment input.txt and write the result to output.txt,
# using 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)

For more usage examples, see the detailed code examples.

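Parameters

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
    model_name      Model path.
                    "default": the default value; uses our pre-trained mixed-domain model (pip installs only).
                    "news": use the news-domain model.
                    "web": use the web-domain model.
                    "medicine": use the medicine-domain model.
                    "tourism": use the tourism-domain model.
                    model_path: load a model from a user-specified path.
    user_dict       User dictionary setting.
                    "default": the default value; uses the dictionary we provide.
                    None: use no dictionary.
                    dict_path: path to a user-defined dictionary, used in addition to the default dictionary. The format is one word per line; if POS tagging is enabled and the word's part of speech is known, write the word and its tag on the same line, separated by a tab character.
    postag          Whether to perform POS tagging.
                    False: the default value; segmentation only, no POS tagging.
                    True: POS tagging is performed along with segmentation.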
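
The user dictionary format described above (one word per line, optionally followed by a tab and a POS tag) can be produced with a few lines of standard-library Python. This is a minimal sketch: write_user_dict is a hypothetical helper, the sample entries are invented, and UTF-8 is assumed as the file encoding.

```python
import os
import tempfile


def write_user_dict(entries, path):
    """Write (word, pos) pairs to a dictionary file; pos may be None."""
    # Assumption: the dictionary file is read as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        for word, pos in entries:
            line = word if pos is None else f"{word}\t{pos}"
            f.write(line + "\n")


# Invented sample entries: one with a POS tag, one without.
entries = [("天安门", "ns"), ("分词工具", None)]
dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict(entries, dict_path)
print("dictionary written to", dict_path)
```

The resulting path can then be passed as pkuseg.pkuseg(user_dict=dict_path).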

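Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
    readFile        Input file path.
    outputFile      Output file path.
    model_name      Model path. Same as in pkuseg.pkuseg.
    user_dict       User dictionary setting. Same as in pkuseg.pkuseg.
    postag          Whether to enable POS tagging. Same as in pkuseg.pkuseg.
    nthread         Number of processes used for testing.

Model training

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
    trainFile       Training file path.
    testFile        Test file path.
    savedir         Directory in which the trained model is saved.
    train_iter      Number of training iterations.
    init_model      Model used for initialization. The default, None, means default initialization; to start from an existing model, pass its path, e.g. init_model='./models/'.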

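Multi-process segmentation

When the code examples above are run from a file and use multi-process features, be sure to protect top-level statements with if __name__ == '__main__'; see multi-process segmentation for details.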

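Pre-trained models

Users who installed via pip only need to set model_name to the desired domain to use domain-specific segmentation; the corresponding model is downloaded automatically.

Users who downloaded the code from GitHub need to download the pre-trained model themselves and set model_name to its path. Pre-trained models are available in the release section. The models are:

  • news: trained on MSRA (news corpus).

  • web: trained on Weibo (web text corpus).

  • medicine: trained on medicine-domain data.

  • tourism: trained on tourism-domain data.

  • mixed: a general-purpose model trained on a mixed dataset. This is the model bundled with the pip package.

Users are welcome to share domain-specific models they have trained.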

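Version history

See version history.

License

  1. This code is released under the MIT License.
  2. Comments and suggestions on this toolkit are welcome; please email [email protected]

Citation

This package is mainly based on the following research paper. If you use this toolkit, please cite it: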


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Other related papers

  • Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
  • Jingjing Xu and Xu Sun. Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation. ACL. 2016.
  • Jingjing Xu and Xu Sun. Transfer Learning for Low-Resource Chinese Word Segmentation with a Novel Neural Network. NLPCC. 2017.

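Frequently asked questions

  1. Why release pkuseg?
  2. What techniques does pkuseg use?
  3. Multi-process segmentation and training do not work and raise RuntimeError and BrokenPipeError.
  4. How was the comparison with other toolkits on domain-specific data carried out?
  5. How does it perform on black-box test sets?
  6. What if I do not know the domain of the corpus to be segmented?
  7. How should segmentation results on particular examples be interpreted?
  8. About running speed?
  9. About multi-process speed?

Acknowledgements

Thanks to Professor Shiwen Yu (俞士汶, Institute of Computational Linguistics, Peking University) and Dr. Likun Qiu (邱立坤) for providing the training datasets!

Authors

Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Bingzhen Wei (位冰镇), Xu Sun (孙栩)

Language Computing and Machine Learning Group, Peking University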

Comments
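  • Isn't the performance comparison with other segmentation toolkits unfair?

    Were the jieba and THULAC models in the comparison trained on the corresponding training corpora (MSRA, CTB8)? If they had been, their results should not be this poor. An F-score of around 80% is close to that of unsupervised segmentation.

    Comparing pkuseg trained on in-domain data against jieba and THULAC without in-domain training data is clearly unfair. The conclusion that segmentation accuracy is greatly improved cannot be drawn from such an experiment.

    In fact, MSRA segmentation results reported in papers generally exceed 97.5.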

    opened by jiesutd 31
  • Comparing the result on a single sentence is enough to settle it against jieba

    pkuseg:

    seg = pkuseg.pkuseg()
    print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
    ['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']

    jieba:

    print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
    ['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']

    Three words are segmented wrongly in one sentence. I don't know where the confidence comes from to announce so loudly that it far surpasses jieba ......

    opened by mendynew 8
  • undefined symbol: PyFPE_jbuf

    ImportError: /root/anaconda3/envs/NLP/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

    ubuntu, pip install pkuseg any ideas?

    opened by LCorleone 5
  • What is the required encoding of the input file?

    C:\Python36>python
    Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg()           # load the model with the default configuration
    >>> text = seg.cut('我爱北京天安门')  # segment the sentence
    >>> print(text)
    ['我', '爱', '北京', '天安门']
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
    Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip" to C:\Users\lutao/.pkuseg\
    postag.zip
    100.0%
    >>> text = seg.cut('我爱北京天安门')    # segment and tag parts of speech
    >>> print(text)
    [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]
    >>> import pkuseg
    >>>
    >>> # segment input.txt and write the result to output.txt
    ... # using 20 processes
    ... pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
      File "<stdin>", line 3
        pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
                                                           ^
    SyntaxError: invalid syntax
    >>> pkuseg.test('c:/user/lutao/downloads/0309a.txt', 'c:/user/lutao/downloads/0309a_output.txt', nthread=10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 520, in test
        input_file, output_file, nthread, model_name, user_dict, postag, verbose
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 444, in _test_multi_proc
        raise Exception("input_file {} does not exist.".format(input_file))
    Exception: input_file c:/user/lutao/downloads/0309a.txt does not exist.
    

    I replaced '/' with '\', and the encoding of 0309a.txt is GBK:

    >>> pkuseg.test('c:\user\lutao\downloads\0309a.txt', 'c:\user\lutao\downloads\0309a_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    

    I saved 0309a.txt as 0309b.txt with UTF-8 encoding:

    >>> pkuseg.test('c:\user\lutao\downloads\0309b.txt', 'c:\user\lutao\downloads\0309b_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    
    opened by l1t1 5
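
    The SyntaxError in the last two attempts is raised by Python itself, not by pkuseg: in an ordinary string literal, \u starts a Unicode escape, so 'c:\user\...' cannot be parsed. Raw strings, doubled backslashes, or forward slashes avoid this. The first attempt probably failed for a different reason: it uses c:/user/, whereas the download log shows the home directory under C:\Users\. As for the encoding, the sketch below assumes that UTF-8 input is expected (an assumption, not confirmed on this page) and re-encodes a GBK file with the standard library; to_utf8 is a hypothetical helper.

```python
import os
import tempfile

# Three equivalent ways to write a Windows path without triggering "\u" escapes.
raw_path = r"c:\users\lutao\downloads\0309a.txt"
escaped_path = "c:\\users\\lutao\\downloads\\0309a.txt"
slash_path = "c:/users/lutao/downloads/0309a.txt"


def to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file from src_encoding to UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Demonstration on a temporary GBK file, so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
gbk_file = os.path.join(tmp_dir, "sample_gbk.txt")
utf8_file = os.path.join(tmp_dir, "sample_utf8.txt")
with open(gbk_file, "w", encoding="gbk") as f:
    f.write("我爱北京天安门\n")
to_utf8(gbk_file, utf8_file)
print(os.path.exists(utf8_file))
```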
  • import fails on Python 3.6

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
    Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)

    python
    Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.

    import pkuseg
    Traceback (most recent call last):
      File "", line 1, in
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\__init__.py", line 14, in
        import pkuseg.trainer as trainer
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
    opened by tangchun 5
  • What causes this problem?

    length = 1 : 0
    length = 2 : 2496
    length = 3 : 2642
    length = 4 : 2568
    length = 5 : 1313
    length = 6 : 633
    length = 7 : 249
    length = 8 : 133
    length = 9 : 66
    length = 10 : 16
    length = 11 : 6
    length = 12 : 1
    length = 13 : 1

    start training...

    reading training & test data...
    done! train/test data sizes: 1/1

    r: 1
    iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
    iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
    Traceback (most recent call last):
      File "test.py", line 8, in
        pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/__init__.py", line 324, in train
        trainer.train(config)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
        score_list = trainer.test(testset, i)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
        testset, self.model, writer
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
        gold_tags, pred_tags, self.idx_to_chunk_tag
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
        pre = correct_chunk / res_chunk * 100
    ZeroDivisionError: division by zero

    opened by Fabyone 4
  • ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    Installed with pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

    import pkuseg

    seg = pkuseg.pkuseg()
    text = "我爱北京天安门"
    cut = seg.cut(text)
    print(cut)

    Traceback (most recent call last):
      File "E:/python/work/spider/bx/piggy.py", line 1, in
        import pkuseg
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\__init__.py", line 14, in
        import pkuseg.trainer
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    opened by xhochipe 4
  • FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    I installed pkuseg. On first use it needs to download postag.zip, but the download failed, so I downloaded the file myself and put it in the folder. However, I now get the error FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    opened by hjing100 3
  • 0.0.25 fails to install on Binder

    0.0.22 installs without problems.

    Collecting numpy
      Downloading numpy-1.19.0-cp37-cp37m-manylinux2010_x86_64.whl (14.6 MB)
    Collecting pkuseg
      Downloading pkuseg-0.0.25.tar.gz (48.8 MB)
        ERROR: Command errored out with exit status 1:
         command: /srv/conda/envs/notebook/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-h22vfd4x
             cwd: /tmp/pip-install-5d95j8mq/pkuseg/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-5d95j8mq/pkuseg/setup.py", line 5, in <module>
            import numpy as np
        ModuleNotFoundError: No module named 'numpy'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    opened by GoooIce 3
  • Error when using a domain-specific model with a pip install?

    Traceback (most recent call last):
      File "py3_cook_corpus_embedding.py", line 18, in <module>
        seg = pkuseg.pkuseg(model_name='medicine')
      File "/home/work/software/anaconda3/envs/py3myhao/lib/python3.6/site-packages/pkuseg/__init__.py", line 224, in __init__
        self.feature_extractor = FeatureExtractor.load()
      File "pkuseg/feature_extractor.pyx", line 625, in pkuseg.feature_extractor.FeatureExtractor.load
    FileNotFoundError: [Errno 2] No such file or directory: 'medicine/unigram_word.txt'

    Also, can a custom vocabulary be used together with a domain-specific model?

    opened by kinghmy 3
  • Installation error with WSL2 + pyenv + Python 3.8.5

    (fastApi-env) [email protected]:/mnt/c/Users/Administrator$ pip install pkuseg
    Looking in indexes: http://mirrors.aliyun.com/pypi/simple
    Collecting pkuseg
      Downloading http://mirrors.aliyun.com/pypi/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
         |████████████████████████████████| 48.8 MB 79.6 MB/s
    ERROR: Command errored out with exit status 1:
     command: /home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-99zrwcbj
         cwd: /tmp/pip-install-hjb0015_/pkuseg/
    Complete output (36 lines):
    WARNING: The wheel package is not available.
    WARNING: The repository located at mirrors.aliyun.com is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, otherwise you may silence this warning and allow it anyway with '--trusted-host mirrors.aliyun.com'.
    ERROR: Could not find a version that satisfies the requirement cython (from versions: none)
    ERROR: No matching distribution found for cython
    Traceback (most recent call last):
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 128, in fetch_build_egg
        subprocess.check_call(cmd)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 63, in <module>
        setup_package()
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 39, in setup_package
        setuptools.setup(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 162, in setup
        _install_setup_requires(attrs)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 157, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 699, in fetch_build_eggs
        resolved_dists = pkg_resources.working_set.resolve(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 779, in resolve
        dist = best[req.key] = env.best_match(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1064, in best_match
        return self.obtain(req, installer)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1076, in obtain
        return installer(requirement)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 758, in fetch_build_egg
        return fetch_build_egg(self, req)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 130, in fetch_build_egg
        raise DistutilsError(str(e)) from e
    distutils.errors.DistutilsError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.
    ----------------------------------------
    

    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

    opened by xiaxichen 2
  • Cannot install in a Python 3.9 environment

    Dear Sirs or Madams, installation in a Python 3.9 environment failed. I checked your repository at 'https://pypi.tuna.tsinghua.edu.cn/simple/pkuseg/', and there seem to be no Python 3.9 files there. Do you have any plan to support Python 3.9? I also saw that in other issues you pointed to a file for Python 3.9, but I cannot find it. Your reply is highly appreciated. Tony

    opened by tonydeck0506 2
  • POS tagging works too well

    In theory, good results are a good thing, but in practice the tagger also labels nonexistent place names as place names.

    import pkuseg
    seg = pkuseg.pkuseg(postag=True)
    text = seg.cut('广场镇是河北天津衡水冲绳东京的旧地狱和亚特兰斯地吗?')
    for word, flag in text:
        if flag == 'ns':
            print(word)

    Output:

    广场镇
    河北
    天津
    衡水
    冲绳
    东京
    亚特兰斯
    
    opened by axty666 1
  • TypeError: train() got an unexpected keyword argument 'nthread'

    TypeError: train() got an unexpected keyword argument 'nthread'

    
    import pkuseg

    # The training file is 'train.txt'
    # The test file is 'test.txt'
    # Load the model from './pretrained', save the trained model to './models', and train for 10 iterations
    pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')
    
    
    opened by KangChou 1
Releases(v0.0.25)
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University