spaCy plugin for Transformers , Udify, ELmo, etc.

Overview

Camphr - spaCy plugin for Transformers, Udify, Elmo, etc.

Documentation Status Gitter PyPI version test and publish

Camphr is a Natural Language Processing library that helps in seamless integration for a wide variety of techniques from state-of-the-art to conventional ones. You can use Transformers , Udify, ELmo, etc. on spaCy.

Check the documentation for more information.

(For Japanese: https://qiita.com/tamurahey/items/53a1902625ccaac1bb2f)

Features

  • A spaCy plugin - Easily integration for a wide variety of methods
  • Transformers with spaCy - Fine-tuning pretrained model with Hydra. Embedding vector
  • Udify - BERT based multitask model in 75 languages
  • Elmo - Deep contextualized word representations
  • Rule base matching with Aho-Corasick, Regex
  • (for Japanese) KNP

License

Camphr is licensed under Apache 2.0.

Comments
  • NER Problem

    NER Problem

    Hello!

    First of all I would like to thank you for the great work on lib Camphr. It's been very useful to me! Can you help me with this doubt? I used lib to train a name recognition model (ner) but when I load the model using nlp = (spacy.load ("~ / outputs // 2020-04-30 // 22-28-36 // models // 9 "), and I pass a text (doc = nlp (" I live in Brazil ")), I can't get any entity recognition (doc.ents >> ()). Could you tell me why this is happening?

    opened by gabrielluz07 9
  • Gender and number subtags generation

    Gender and number subtags generation

    I was comparing the default morpho-syntactic tags generated by camphr-udify and https://github.com/Hyperparticle/udify.

    import spacy
    import stanza
    from spacy_conll import ConllFormatter
    
    nlp=spacy.load("en_udify")
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, last=True)
    
    doc=nlp("Mother Teresa devoted her entire life to helping others") 
    print(doc._.conll_str)
    
    
    1	Mother	Mother	PROPN		_	2	compound	_	_
    2	Teresa	Teresa	PROPN		_	3	nsubj	_	_
    3	devoted	devote	VERB		_	0	root	_	_
    4	her	her	PRON		_	6	nmod:poss	_	_
    5	entire	entire	ADJ		_	6	amod	_	_
    6	life	life	NOUN		_	3	obj	_	_
    7	to	to	SCONJ		_	8	mark	_	_
    8	helping	help	VERB		_	3	advcl	_	_
    9	others	other	NOUN		_	8	obj	_	SpaceAfter=No
    
    

    Tags returned by https://github.com/Hyperparticle/udify, for the same input.

    prediction:  1  Mother  Mother  PROPN   _       Number=Sing     2       compound        _       _
    2       Teresa  Teresa  PROPN   _       Number=Sing     3       nsubj   _       _
    3       devoted devote  VERB    _       Mood=Ind|Tense=Past|VerbForm=Fin        0       root    _       _
    4       her     her     PRON    _       Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs   6       nmod:poss      _                                               _
    5       entire  entire  ADJ     _       Degree=Pos      6       amod    _       _
    6       life    life    NOUN    _       Number=Sing     3       obj     _       _
    7       to      to      SCONJ   _       _       8       mark    _       _
    8       helping help    VERB    _       VerbForm=Ger    3       advcl   _       _
    9       others  other   NOUN    _       Number=Plur     8       obj     _       _
    

    Gender and number subtags are missing in camphr-udify. Could we have those included by default please?

    thanks, Ranjita

    enhancement 
    opened by ranjita-naik 6
  • Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Hello. I report a problem that is happened when analyzing universal dependencies in Japanese text using KNP. When I use a adposition “から”, camphr returns a following wrong result (that shows the conj dependency tag on NOUN→VERB, but an expectation result is the obl dependency tag on VERB→NOUN).

    例1 例2

    (Note that "再結晶", "留去" are the words I added manually, but other VERB words that existed in the original dictionary such as "除去", "撹拌" generates similarly incorrect results.) Same problems sometimes occur when using an adposition "と".

    But using other adpositions, such as “より”, “にて”, camphr returns a correct result.

    例3 例4

    Environment:

    • Docker(python:3.7-buster)
    • spacy = 2.3.2
    • camphr = 0.6.5
    • pyknp = 0.4.5
    • Juman++ ver.1.02
    • KNP ver.4.19
    opened by undermakingbook 6
  • Python 3.8

    Python 3.8

    Camphr is currently pinned at python < 3.8, is there a specific reason for this and if so, what can we do to help?

    Edit: sorry, I just saw #19, still, what can we do to help?

    opened by Evpok 5
  • Support multi labels textcat pipe for transformers

    Support multi labels textcat pipe for transformers

    closes #9

    • Add TrfForMultiLabelSequenceClassification for multiple text classification.
      • pipe name: transformers_multilabel_sequence_classifier
    • Add docs for fine-tuning multi textcat pipe
      • https://github.com/PKSHATechnology-Research/camphr/blob/feature%2Fmulti-textcat/docs/source/notes/finetune_transformers.rst#multilabel-text-classification
    enhancement 
    opened by tamuhey 5
  • unofficial-udify, allennlp,  and transformers  conflicting dependencies

    unofficial-udify, allennlp, and transformers conflicting dependencies

    I'm trying to install udify on WSL as shown below.

    $ pip install unofficial-udify==0.3.0 [email protected]://github.com/PKSHATechnology-Research/camphr_models/releases/download/0.7.0/en_udify-0.7.tar.gz

    ERROR: Cannot install unofficial-udify and unofficial-udify==0.3.0 because these package versions have conflicting dependencies.

    The conflict is caused by: unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0 allennlp 1.3.0 depends on transformers<4.1 and >=4.0 unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0 allennlp 1.2.2 depends on transformers<3.6 and >=3.4 unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0 allennlp 1.2.1 depends on transformers<3.5 and >=3.1 unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0 allennlp 1.2.0 depends on transformers<3.5 and >=3.1 unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0 allennlp 1.1.0 depends on transformers<3.1 and >=3.0

    Is this a known issue? Could you suggest a workaroudn please?

    bug 
    opened by ranjita-naik 3
  • Missing tag information

    Missing tag information

    I noticed that the spacy tag field is empty. Is this a known issue? It looks like Udify supports some level of ufeats tagging (https://universaldependencies.org/u/feat/index.html)? I wonder if I'm supposed to b getting any of this in Spacy and I have a bug in my setup, or if it just isn't implemented yet? Would it be souced in token.tag like I'm thinking (if it does exist)?

    I also noticed that displacy doesn't render the POS info. I am wondering if that is related?

    BTW, just have to say that this is awesome.

    opened by tslater 3
  • ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    I followed the example here: https://camphr.readthedocs.io/en/latest/notes/udify.html

    I did only see the 0.7.0 model, so I went with that instead. Anyway, the German and English examples work great, but the Japanese one gives me this error:

    >>> from camphr.pipelines import load_udify
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' (/home/tyler/camphr/env/lib/python3.8/site-packages/camphr/pipelines/__init__.py)
    
    opened by tslater 3
  • doc.ents empty, doc.is_nered == False

    doc.ents empty, doc.is_nered == False

    I followed the documentation to fine-tune the bert-base-cased (en) ner model and then made a spacy doc with text "Bob Jones and Barack Obama went up the hill in Wisconsin." but the resulting doc has doc.ents = () and doc.is_nered = False.

    Am I missing something?

    Thank you!

    opened by jack-rory-staunton 3
  • Improvement for サ変 of KNP

    Improvement for サ変 of KNP

    Inside _get_child_dep(c), pos for 名詞,サ変名詞 is changed into VERB when it is followed by AUX. So now I think that _get_dep(tag[0]) should be done after _get_child_dep(c).

    opened by KoichiYasuoka 3
  • Bump transformers from 3.0.2 to 4.1.1

    Bump transformers from 3.0.2 to 4.1.1

    Bumps transformers from 3.0.2 to 4.1.1.

    Release notes

    Sourced from transformers's releases.

    Patch release: better error message & invalid trainer attribute

    This patch releases introduces:

    • A better error message when trying to instantiate a SentencePiece-based tokenizer without having SentencePiece installed. #8881
    • Fixes an incorrect attribute in the trainer. #8996

    Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization

    Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization

    Breaking changes since v3.x

    Version v4.0.0 introduces several breaking changes that were necessary.

    1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.

    The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens between the python and rust tokenizers.

    How to obtain the same behavior as v3.x in v4.x

    In version v3.x:

    from transformers import AutoTokenizer
    

    tokenizer = AutoTokenizer.from_pretrained("xxx")

    to obtain the same in version v4.x:

    from transformers import AutoTokenizer
    

    tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)

    2. SentencePiece is removed from the required dependencies

    The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

    This includes the slow versions of:

    • XLNetTokenizer
    • AlbertTokenizer
    • CamembertTokenizer
    • MBartTokenizer
    • PegasusTokenizer
    • T5Tokenizer
    • ReformerTokenizer
    • XLMRobertaTokenizer

    How to obtain the same behavior as v3.x in v4.x

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 2
  • Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bumps certifi from 2021.5.30 to 2022.12.7.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
Releases(0.7.0)
A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.

Emily's Symbol Dictionary Design This dictionary was created with the following goals in mind: Have a consistent method to type (pretty much) every sy

Emily 68 Jan 07, 2023
glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

Glow-Speak glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end. Installation git clone https://g

Rhasspy 8 Dec 25, 2022
A cross platform OCR Library based on PaddleOCR & OnnxRuntime

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

RapidOCR Team 767 Jan 09, 2023
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
AI_Assistant - This is a Python based Voice Assistant.

This is a Python based Voice Assistant. This was programmed to increase my understanding of python and also how the in-general Voice Assistants work.

1 Jan 06, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end Quick Start C

Eu-Bin KIM 94 Dec 08, 2022
多语言降噪预训练模型MBart的中文生成任务

mbart-chinese 基于mbart-large-cc25 的中文生成任务 Input source input: text + /s + lang_code target input: lang_code + text + /s Usage token_ids_mapping.jso

11 Sep 19, 2022
This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

Xuechen Li 73 Dec 28, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
pyupbit 라이브러리를 활용하여 upbit에서 비트코인을 자동매매하는 코드입니다. 조코딩 유튜브 채널에서 자세한 강의 영상을 보실 수 있습니다.

파이썬 비트코인 투자 자동화 강의 코드 by 유튜브 조코딩 채널 pyupbit 라이브러리를 활용하여 upbit 거래소에서 비트코인 자동매매를 하는 코드입니다. 파일 구성 test.py : 잔고 조회 (1강) backtest.py : 백테스팅 코드 (2강) bestK.p

조코딩 JoCoding 186 Dec 29, 2022
Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification"

PTR Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification" If you use the code, please cite the following paper: @art

THUNLP 118 Dec 30, 2022
A minimal code for fairseq vq-wav2vec model inference.

vq-wav2vec inference A minimal code for fairseq vq-wav2vec model inference. Runs without installing the fairseq toolkit and its dependencies. Usage ex

Vladimir Larin 7 Nov 15, 2022
Grover is a model for Neural Fake News -- both generation and detectio

Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks.

Rowan Zellers 856 Dec 24, 2022
Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

Yangming Li 128 Dec 29, 2022
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch

Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch. Topics: Face detection with Detectron 2, Time Series anomaly detection with LSTM Autoenc

Venelin Valkov 1.8k Dec 31, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Introduction Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization,

Sai Himal Allu 1 Apr 25, 2022
vits chinese, tts chinese, tts mandarin

vits chinese, tts chinese, tts mandarin 史上训练最简单,音质最好的语音合成系统

AmorTX 12 Dec 14, 2022
构建一个多源(公众号、RSS)、干净、个性化的阅读环境

2C 构建一个多源(公众号、RSS)、干净、个性化的阅读环境 作为一名微信公众号的重度用户,公众号一直被我设为汲取知识的地方。随着使用程度的增加,相信大家或多或少会有一个比较头疼的问题——广告问题。 假设你关注的公众号有十来个,若一个公众号两周接一次广告,理论上你会面临二十多次广告,实际上会更多,运

howie.hu 678 Dec 28, 2022