A full spaCy pipeline and models for scientific/biomedical documents.

Overview

This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data, and an entity span detection model. Separately, there are also NER models for more specific tasks.

Just looking to test out the models on your data? Check out our demo.

Installation

Installing scispacy requires two steps: installing the library and installing the models. To install the library, run:

pip install scispacy

To install a model (see our full selection of available models below), run a command like the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy. Take a look below in the "Setting up a virtual environment" section if you need some help with this. Additionally, scispacy uses modern features of Python and as such is only available for Python 3.6 or greater.

Setting up a virtual environment

Conda can be used to set up a virtual environment with the version of Python required for scispaCy. If you already have a Python 3.6 or 3.7 environment you want to use, you can skip straight to installing via pip, as described under Installation.

  1. Follow the installation instructions for Conda.

  2. Create a Conda environment called "scispacy" with Python 3.6:

    conda create -n scispacy python=3.6

  3. Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.

    source activate scispacy

Now you can install scispacy and one of the models using the steps above.

Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")

Note on upgrading

If you are upgrading scispacy, you will need to download the models again, to get the model versions compatible with the version of scispacy that you have. The link to the model that you download should contain the version number of scispacy that you have.
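One way to keep the library and the model in sync is to derive the model URL from the installed library version. The sketch below assumes that every release follows the same URL pattern as the example in the Installation section, which may not hold for all versions. `model_url` and `installed_scispacy_version` are hypothetical helpers, not part of scispacy, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: all releases follow the pattern of the example URL in the
# Installation section. This is inferred from one URL, not documented here.
RELEASE_URL = (
    "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/"
    "releases/v{version}/{model}-{version}.tar.gz"
)


def model_url(model: str, scispacy_version: str) -> str:
    """Build the download URL for `model` that matches a scispacy version."""
    return RELEASE_URL.format(model=model, version=scispacy_version)


def installed_scispacy_version():
    """Return the installed scispacy version, or None if it is not installed."""
    try:
        return version("scispacy")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_scispacy_version()
    if current is None:
        print("scispacy is not installed")
    else:
        print("pip install", model_url("en_core_sci_sm", current))
```

If the printed URL does not resolve, fall back to the links in the model table.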

Available Models

To install a model, click on the link below to download the model, and then run

pip install </path/to/download>

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address" and running

pip install CMD-V (to paste the copied URL)

| Model | Description | Install URL |
| --- | --- | --- |
| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. | Download |
| en_core_sci_md | A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. | Download |
| en_core_sci_lg | A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. | Download |
| en_core_sci_scibert | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | Download |
| en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download |
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download |
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download |

Additional Pipeline Components

AbbreviationDetector

The AbbreviationDetector is a spaCy component which implements the abbreviation detection algorithm in "A simple algorithm for identifying abbreviation definitions in biomedical text" (Schwartz & Hearst, 2003).

You can access the list of abbreviations via the doc._.abbreviations attribute. For a given abbreviation, you can access its long form (which is a spacy.tokens.Span) using span._.long_form, which will point to another span in the document.

Example Usage

import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

print("Abbreviation", "\t", "Span", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation	 Span	    Definition
>>> SBMA 		 (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA 	   	 (6, 7)     Spinal and bulbar muscular atrophy
>>> AR   		 (29, 30)   androgen receptor
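To give a feel for how a short form is paired with its definition, here is a plain-Python sketch of the right-to-left matching step described by Schwartz & Hearst. It is an illustration only, not scispaCy's implementation: it works on raw strings rather than spaCy tokens, and it leaves out candidate extraction (deciding which words before the parenthesis to consider). The function name `find_best_long_form` is chosen for this example.

```python
from typing import Optional


def find_best_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Match `short_form` against `candidate` from right to left.

    Returns the tail of `candidate`, starting at the word where the match
    begins, or None if the short form's characters cannot all be matched.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        char = short_form[s_index].lower()
        if not char.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != char) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Cut the candidate at the last space before the first matched character.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


print(find_best_long_form("AR", "tract within the androgen receptor"))
# prints: androgen receptor
print(find_best_long_form("SBMA", "Spinal and bulbar muscular atrophy"))
# prints: Spinal and bulbar muscular atrophy
```

Matching from the right keeps the long form as short as possible, which is why "tract within the" is dropped in the first call.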

EntityLinker

The EntityLinker is a spaCy component which performs linking to a knowledge base. The linker performs a string-overlap-based search (character 3-grams) on named entities, comparing them with the concepts in a knowledge base using an approximate nearest neighbours search.
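The idea of character 3-gram matching can be sketched in a few lines of plain Python. This is a simplified stand-in, not the linker's code: it uses raw 3-gram counts with cosine similarity and a brute-force scan, whereas the linker uses an approximate nearest neighbours search, and any weighting the library applies to the 3-grams is not covered here. All function names are invented for this example.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank_concepts(mention: str, concept_names: list, k: int = 30) -> list:
    """Brute-force stand-in for the approximate nearest neighbour search."""
    mention_grams = char_ngrams(mention)
    scored = [
        (name, cosine_similarity(mention_grams, char_ngrams(name)))
        for name in concept_names
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


names = ["Skeletal muscle atrophy", "AR protein, human"]
for name, score in rank_concepts("bulbar muscular atrophy", names):
    print(f"{score:.2f}  {name}")
```

A brute-force scan like this is only practical for small name lists; with millions of concepts an index-based nearest neighbour search is needed.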

Currently (v2.5.0), there are 5 supported linkers:

  • umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.
  • mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in PubMed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
  • rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
  • go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
  • hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains ~16k concepts focused on phenotypic abnormalities encountered in human disease.

You may want to play around with some of the parameters below to adapt to your use case (higher precision, higher recall etc).

  • resolve_abbreviations : bool = True, optional (default = False) Whether to resolve abbreviations identified in the Doc before performing linking. This parameter has no effect if there is no AbbreviationDetector in the spacy pipeline.
  • k : int, optional, (default = 30) The number of nearest neighbours to look up from the candidate generator per mention.
  • threshold : float, optional, (default = 0.7) The threshold that a mention candidate must reach to be added to the mention in the Doc as a mention candidate.
  • no_definition_threshold : float, optional, (default = 0.95) The threshold that an entity candidate must reach to be added to the mention in the Doc as a mention candidate if the entity candidate does not have a definition.
  • filter_for_definitions: bool, default = True Whether to filter entities that can be returned to only include those with definitions in the knowledge base.
  • max_entities_per_mention : int, optional, default = 5 The maximum number of entities which will be returned for a given mention, regardless of how many nearest neighbours are found.
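The interplay of the thresholds can be illustrated with a small plain-Python sketch. It is one reading of the parameter descriptions above, not the library's code: the `Candidate` type is a hypothetical stand-in for a nearest-neighbour result, and treating "must reach" as greater-than-or-equal is an assumption.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for one nearest-neighbour result."""

    concept_id: str
    score: float
    has_definition: bool


def select_entities(
    candidates: List[Candidate],
    threshold: float = 0.7,
    no_definition_threshold: float = 0.95,
    filter_for_definitions: bool = True,
    max_entities_per_mention: int = 5,
) -> List[Tuple[str, float]]:
    """Apply the thresholds described above to a list of candidates."""
    kept = []
    for cand in candidates:
        # Candidates without a definition face the stricter threshold
        # when filter_for_definitions is enabled.
        if (
            filter_for_definitions
            and not cand.has_definition
            and cand.score < no_definition_threshold
        ):
            continue
        # Assumption: a score equal to the threshold counts as reaching it.
        if cand.score >= threshold:
            kept.append((cand.concept_id, cand.score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_entities_per_mention]


# The concept ids below are made up for illustration.
candidates = [
    Candidate("C1", 0.90, True),
    Candidate("C2", 0.90, False),  # no definition and below 0.95: dropped
    Candidate("C3", 0.97, False),  # no definition but above 0.95: kept
    Candidate("C4", 0.50, True),   # below the general threshold: dropped
]
print(select_entities(candidates))
```

Raising threshold favours precision, while lowering it or raising max_entities_per_mention favours recall.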

This class sets the ._.kb_ents attribute on spacy Spans, which consists of a List[Tuple[str, float]] corresponding to the KB concept_id and the associated score for a list of max_entities_per_mention number of entities.

You can look up more information for a given id using the kb attribute of this class:

print(linker.kb.cui_to_entity[concept_id])

Example Usage

import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)
>>> Name: bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
	print(linker.kb.cui_to_entity[umls_ent[0]])


>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
  				gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0541794, Name: Skeletal muscle atrophy
>>> Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,
                fiber diameter, force production and fatigue resistance in response to ...
>>> TUI(s): T046
>>> Aliases: (total: 9):
         Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....

>>> CUI: C1447749, Name: AR protein, human
>>> Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.
                This protein plays a role in the modulation of steroid-dependent gene transcription.
>>> TUI(s): T116, T192
>>> Aliases (abbreviated, total: 16):
         AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...

Hearst Patterns (v0.3.0 and up)

This component implements "Automatic Acquisition of Hyponyms from Large Text Corpora" (Hearst, 1992) using the spaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of Hearst patterns, which includes higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).

This component produces a doc-level attribute on the spaCy Doc: doc._.hearst_patterns, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

  • The relation rule used to extract the hyponym (type: str)
  • The more general concept (type: spacy.Span)
  • The more specific concept (type: spacy.Span)
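As a rough sketch of how these tuples might be consumed, the helper below groups the specific concepts under each general concept. `group_hyponyms` is a hypothetical name, and plain strings stand in for the spaCy Span objects so the snippet runs without a model installed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_hyponyms(
    patterns: Iterable[Tuple[str, object, object]]
) -> Dict[str, List[str]]:
    """Map each general concept to the specific concepts extracted for it."""
    grouped = defaultdict(list)
    for _rule, general, specific in patterns:
        # str() is used so that plain strings and Span-like objects
        # are handled the same way.
        grouped[str(general)].append(str(specific))
    return dict(grouped)


# Plain strings stand in for the Span objects the component returns.
example = [("such_as", "Keystone plant species", "fig trees")]
print(group_hyponyms(example))
```

With a real pipeline, the same function could be applied to doc._.hearst_patterns.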

Usage:

import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
>>> [('such_as', Keystone plant species, fig trees)]

Citing

If you use ScispaCy in your research, please cite ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Additionally, please indicate which version and model of ScispaCy you used so that your research can be reproduced.

@inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
      King, Daniel  and
      Beltagy, Iz  and
      Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}

ScispaCy is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • pip install fails

    I've created the conda env and ran pip install scispacy. See the result:

    (scispacy) lucas-mbp:jats lfoppiano$ pip install scispacy
    Collecting scispacy
      Using cached https://files.pythonhosted.org/packages/72/55/30b30a78abafaaf34d0d8368a090cf713964d6c97c5e912fb2016efadab0/scispacy-0.2.2-py3-none-any.whl
    Collecting numpy (from scispacy)
      Downloading https://files.pythonhosted.org/packages/0f/c9/3526a357b6c35e5529158fbcfac1bb3adc8827e8809a6d254019d326d1cc/numpy-1.16.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.9MB)
         |████████████████████████████████| 13.9MB 3.5MB/s 
    Collecting joblib (from scispacy)
      Using cached https://files.pythonhosted.org/packages/cd/c1/50a758e8247561e58cb87305b1e90b171b8c767b15b12a1734001f41d356/joblib-0.13.2-py2.py3-none-any.whl
    Collecting spacy>=2.1.3 (from scispacy)
      Downloading https://files.pythonhosted.org/packages/cb/ef/cccdeb1ababb2cb04ae464098183bcd300b8f7e4979ce309669de8a56b9d/spacy-2.1.6-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (34.6MB)
         |████████████████████████████████| 34.6MB 33.6MB/s 
    Collecting conllu (from scispacy)
      Downloading https://files.pythonhosted.org/packages/ae/54/b0ae1199f3d01666821b028cd967f7c0ac527ab162af433d3da69242cea2/conllu-1.3.1-py2.py3-none-any.whl
    Collecting awscli (from scispacy)
      Using cached https://files.pythonhosted.org/packages/e6/48/8c5ac563a88239d128aa3fb67415211c19bd653fab01c7f11cecf015c343/awscli-1.16.203-py2.py3-none-any.whl
    Collecting nmslib>=1.7.3.6 (from scispacy)
      Using cached https://files.pythonhosted.org/packages/b2/4d/4d110e53ff932d7a1ed9c2f23fe8794367087c29026bf9d4b4d1e27eda09/nmslib-1.8.1.tar.gz
        ERROR: Complete output from command python setup.py egg_info:
        ERROR: Download error on https://pypi.org/simple/numpy/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
        Couldn't find index page for 'numpy' (maybe misspelled?)
        Download error on https://pypi.org/simple/: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
        No local packages or working download links found for numpy
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-install-l00jm4xn/nmslib/setup.py", line 172, in <module>
            zip_safe=False,
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
            _install_setup_requires(attrs)
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
            dist.fetch_build_eggs(dist.setup_requires)
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/dist.py", line 717, in fetch_build_eggs
            replace_conflicting=True,
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 782, in resolve
            replace_conflicting=replace_conflicting
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1065, in best_match
            return self.obtain(req, installer)
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1077, in obtain
            return installer(requirement)
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/dist.py", line 784, in fetch_build_egg
            return cmd.easy_install(req)
          File "/anaconda3/envs/scispacy/lib/python3.6/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
            raise DistutilsError(msg)
        distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('numpy')
        ----------------------------------------
    ERROR: Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/mk/scd8428n18jfgh3jdthbvpz00000gn/T/pip-install-l00jm4xn/nmslib/
    (scispacy) lucas-mbp:jats lfoppiano$ 
    

    To solve the issue I had to install numpy and nmslib:

    conda install numpy
    conda install -c akode nmslib
    

    It seems to work, but maybe it is not the proper way to solve it. Perhaps the pip script should be updated?

    opened by lfoppiano 38
  • Combine 'ner' model with 'core_sci' model

    Hi,

    I am working on a project using neuralcoref and I would like to incorporate the scispacy ner models. My hope was to use one of the ner models in combination with the core_sci tagger and dependency parser.

    NeuralCoref depends on the tagger, parser, and ner.

    So far I have tried this code:

    cust_ner = spacy.load('en_ner_craft_md')
    nlp = spacy.load('en_core_sci_md')
    nlp.remove_pipe('ner')
    nlp.add_pipe(cust_ner, name="ner", last=True)
    

    but when I pass text to the nlp object, I get the following error: TypeError: Argument 'string' has incorrect type (expected str, got spacy.tokens.doc.Doc)

    When I look at the nlp.pipeline attribute after adding the cust_ner to the pipe, I see the cust_ner added as a Language object rather than an EntityRecognizer object:

    [('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fb84976eda0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fb849516288>), ('ner', <spacy.lang.en.English object at 0x7fb853725668>)]
    

    Before I start hacking away and writing terrible code, I thought I would reach out to see if you had any suggestions in how to accomplish what I am after?

    Thanks in advance and for all that you folks do!

    opened by masonedmison 26
  • No module named 'scispacy.custom_sentence_segmenter'; 'scispacy' is not a package

    I am getting the following error:

    Traceback (most recent call last):
      File "scispacy.py", line 2, in <module>
        import scispacy
      File "/Users/shai26/office/spacy/scispacy/scispacy.py", line 5, in <module>
        nlp = spacy.load("en_core_sci_sm")
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/__init__.py", line 21, in load
        return util.load_model(name, **overrides)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/util.py", line 114, in load_model
        return load_model_from_package(name, **overrides)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/spacy/util.py", line 134, in load_model_from_package
        cls = importlib.import_module(name)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/en_core_sci_sm/__init__.py", line 7, in <module>
        from scispacy.custom_sentence_segmenter import combined_rule_sentence_segmenter
    ModuleNotFoundError: No module named 'scispacy.custom_sentence_segmenter'; 'scispacy' is not a package

    opened by sakibshaik 19
  •  [E167] Unknown morphological feature: 'ConjType'

    When I run nlp(doc) I get this error:

    [E167] Unknown morphological feature: 'ConjType' (9141427322507498425). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date: python -m spacy validate

    Some of the docs work while some don't.

    opened by fireholder 15
  • kb_ents gives no results from custom KB

    Following this discussion #383, where I got my custom KB to work.

    I tried to test the code and for some reason it is not giving me anything. Here is the code I tested it with:

    linker = CandidateGenerator(name="myCustom")
    text = "TR Max Velocity: 2.3 m/s"
    doc = nlp(text)
    spacy.displacy.render(doc, style = "ent", jupyter = True)
    
    entity = doc.ents[2]
    print("Name: ", entity)
    
    for umls_ent in entity._.kb_ents:
        print(umls_ent)
        print(linker.kb.cui_to_entity[umls_ent[0]])
        print("----------------------")
    

    This would give:

    Name:  m/s
    

    There was no ---------------------- line, which means it did not even enter the for loop.

    I was wondering why this is the case.

    If this helps, this is the jsonl file that I ran this script (https://github.com/allenai/scispacy/blob/master/scripts/create_linker.py) with:

    ...
    {"concept_id": "U0013", "aliases": ["m/s"], "types": ["UN1T5"], "canonical_name": "m/s"}
    ...
    
    opened by farrandi 14
  • aws s3 downloading

    I am currently trying to train using my own corpus following the project.yml file. I try to download several files:

    aws s3 cp s3://ai2-s2-scispacy/data/ud_ontonotes.tar.gz assets/ud_ontonotes.tar.gz
    tar -xzvf assets/ud_ontonotes.tar.gz -C assets/
    rm assets/ud_ontonotes.tar.gz
    #############################################################
    aws s3 cp s3://ai2-s2-scispacy/data/med_mentions.tar.gz assets/med_mentions.tar.gz
    tar -xzvf assets/med_mentions.tar.gz -C assets/
    rm assets/med_mentions.tar.gz
    #############################################################
    aws s3 cp s3://ai2-s2-scispacy/data/ner/ assets --recursive --exclude '' --include '.tsv'

    But it fails with "fatal error: Unable to locate credentials". Does anyone know how to solve this problem? Thanks!

    opened by CharlesQ9 13
  • Warning about incompatible spaCy models.

    I get the following error when trying to load en_core_sci_sm:

    UserWarning: [W031] Model 'en_core_sci_sm' (0.2.4) requires spaCy v2.2 and is incompatible with the current spaCy version (2.3.0). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
      warnings.warn(warn_msg)
    

    Steps to reproduce: Create clean Conda environment and activate

    conda create --name scispacy python=3.8
    conda activate scispacy
    

    Install scispacy and install the latest en_core_sci_sm model.

    pip install scispacy
    pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
    

    Attempt import

    (scispacy) $ python -c "import spacy; nlp=spacy.load('en_core_sci_sm')"
    /home/davidw/miniconda3/envs/scispacy/lib/python3.8/site-packages/spacy/util.py:271: UserWarning: [W031] Model 'en_core_sci_sm' (0.2.4) requires spaCy v2.2 and is incompatible with the current spaCy version (2.3.0). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
      warnings.warn(warn_msg)
    

    Is this warning important or can I ignore it?

    Thanks,

    Dave

    opened by dwadden 11
  • DeprecationWarning from `spacy_legacy`

    Hi there, I recently upgraded to spacy 3 and scispacy 0.4, but I am now getting a warning whenever I use the small scispacy model (I have not tried any other model).

    I am getting a DeprecationWarning on a fresh install in python 3.8 with the latest version of scispacy and en_core_sci_sm.

    Steps to reproduce:

    pip install scispacy
    pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz

    import spacy
    nlp = spacy.load("en_core_sci_sm")
    
    import warnings
    warnings.filterwarnings("error")
    nlp("Hello World")
    

    Any input to the nlp model triggers the same warning:

    /opt/miniconda3/envs/clean/lib/python3.8/site-packages/spacy_legacy/layers/staticvectors_v1.py in forward(model, docs, is_train)
         43     )
         44     try:
    ---> 45         vectors_data = model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True)
         46     except ValueError:
         47         raise RuntimeError(Errors.E896)
    
    DeprecationWarning: Out of bound index found. 
    This was previously ignored when the indexing result contained no elements. 
    In the future the index error will be raised. 
    This error occurs either due to an empty slice, or if an array has zero elements even before indexing.
    (Use `warnings.simplefilter('error')` to turn this DeprecationWarning into an error and get more details on the invalid index.)
    

    Any ideas as to how to resolve this without manually ignoring the warning?

    bug 
    opened by gautierdag 10
  • Span is not serializable in abbreviations - figure out a better workaround

    import spacy
    
    from scispacy.abbreviation import AbbreviationDetector
    
    nlp = spacy.load("en_core_sci_sm")
    
    # Add the abbreviation pipe to the spacy pipeline.
    nlp.add_pipe("abbreviation_detector")
    
    test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]
    
    print("Abbreviation", "\t", "Definition")
    for doc in nlp.pipe(test, n_process=4):
        for abrv in doc._.abbreviations:
            print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
    

    Running that code leads to the output below. The error message doesn't make a lot of sense; it could be because there are more processes than entries. Removing n_process solves the problem.

    Abbreviation     Definition
    Abbreviation     Definition
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      File "C:\Python38\lib\multiprocessing\spawn.py", line 125, in _main
        prepare(preparation_data)
      File "C:\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "C:\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
        main_content = runpy.run_path(main_path,
      File "C:\Python38\lib\runpy.py", line 265, in run_path
        return _run_module_code(code, init_globals, run_name,
      File "C:\Python38\lib\runpy.py", line 97, in _run_module_code
        _run_code(code, mod_globals, init_globals,
      File "C:\Python38\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\Users\alexd\Dropbox (UFL)\UFII_COVID19_RESEARCH_TOPICS\cord19\text_parsing_pipeline\test.py", line 13, in <module>
        for doc in nlp.pipe(test, n_process=4):
      File "C:\Python38\lib\site-packages\spacy\language.py", line 1475, in pipe
        for doc in docs:
      File "C:\Python38\lib\site-packages\spacy\language.py", line 1511, in _multiprocessing_pipe
        proc.start()
      File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
        self._popen = self._Popen(self)
      File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Python38\lib\multiprocessing\context.py", line 327, in _Popen
        return Popen(process_obj)
      File "C:\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
        prep_data = spawn.get_preparation_data(process_obj._name)
      File "C:\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
        _check_not_importing_main()
      File "C:\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
        raise RuntimeError('''
    RuntimeError:
            An attempt has been made to start a new process before the
            current process has finished its bootstrapping phase.
    
            This probably means that you are not using fork to start your
            child processes and you have forgotten to use the proper idiom
            in the main module:
    
                if __name__ == '__main__':
                    freeze_support()
                    ...
    
            The "freeze_support()" line can be omitted if the program
            is not going to be frozen to produce an executable.
    

    This is the error message from my main piece of code with more data. It sort of makes more sense. I think it has something to do with how the multiprocess pipe collects the results of the workers. The error pops up after a while, so it's definitely running.

    Process Process-1:
    Traceback (most recent call last):
      File "C:\Python38\lib\multiprocessing\process.py", line 315, in _bootstrap
        self.run()
      File "C:\Python38\lib\multiprocessing\process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in _apply_pipes
        sender.send([doc.to_bytes() for doc in docs])
      File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in <listcomp>
        sender.send([doc.to_bytes() for doc in docs])
      File "spacy\tokens\doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
      File "spacy\tokens\doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
      File "C:\Python38\lib\site-packages\spacy\util.py", line 1134, in to_dict
        serialized[key] = getter()
      File "spacy\tokens\doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
      File "C:\Python38\lib\site-packages\srsly\_msgpack_api.py", line 14, in msgpack_dumps
        return msgpack.dumps(data, use_bin_type=True)
      File "C:\Python38\lib\site-packages\srsly\msgpack\__init__.py", line 55, in packb
        return Packer(**kwargs).pack(o)
      File "srsly\msgpack\_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
      File "srsly\msgpack\_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
      File "srsly\msgpack\_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
      File "srsly\msgpack\_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
      File "srsly\msgpack\_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
    TypeError: can not serialize 'spacy.tokens.span.Span' object
    

    Running spacy 3.0, the latest version, and on Windows 10.

    bug help wanted 
    opened by f0lie 10
  • How to visualize named entities in custom colors

    spaCy has an option that allows custom colors for named entity visualization. I'm trying to use the same option in scispacy for its named entities. I created a list of entities and a list of randomly generated colors and put them in an options dictionary like the following:

    options = {"ents": entities, "colors": colors}

    Here entities is a list of the named entity labels in the scispacy NER models and colors is a list of the same size. But passing these options to either displacy.serve or displacy.render (for Jupyter) does not work. I'm using the options as follows:

    displacy.serve(doc, style="ent", options=options)

    I wonder whether the color option only works for spaCy's predefined named entities, or whether there's something wrong with the way I'm using the option.
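
    One possible cause, offered as a sketch rather than a confirmed diagnosis: displaCy's documented "colors" option is a dictionary mapping entity labels to color values, not a list parallel to "ents". The snippet below builds such a dictionary in plain Python. The helper names and the example labels are assumptions made for illustration; substitute the labels of the model actually in use.

```python
import random


def random_hex_color(rng):
    """Return a random color as a '#rrggbb' hex string."""
    return "#{:06x}".format(rng.randrange(0x1000000))


def build_displacy_options(labels, seed=0):
    """Build an options dict with one color per entity label.

    Labels are upper-cased because displaCy looks colors up by the
    upper-cased label (an assumption based on its documented behavior).
    """
    rng = random.Random(seed)
    upper_labels = [label.upper() for label in labels]
    colors = {label: random_hex_color(rng) for label in upper_labels}
    return {"ents": upper_labels, "colors": colors}


# Example labels only; replace them with the labels of your own model.
options = build_displacy_options(["GGP", "SO", "TAXON", "CHEBI", "GO", "CL"])

# With spaCy installed and a processed `doc`, the call would then be:
# displacy.serve(doc, style="ent", options=options)
```

    The key difference from the original attempt is that options["colors"] is keyed by label, so each label carries its own color regardless of list order.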

    opened by phosseini 10
  • What does Doc.tensor contain for non-transformer models?

    Hi, we are processing large amounts of text and need to serialize Doc objects efficiently. We are using the sci_md model, and it appears that when converting a Doc to bytes, the majority of the space is taken by the Doc.tensor data. What does that data represent exactly? Is it static, and/or do I have to include it in each serialized Doc object?

    opened by ldorigo 9
  • UserWarning: [W036] The component 'matcher' does not have any patterns defined.

    Hello,

    Happy Holidays!

    I used the last sentence from the example in your README.md file:

    doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
               inherited motor neuron disease caused by the expansion \
               of a polyglutamine tract within the androgen receptor (AR). \
               SBMA can be caused by this easily.")
    

    Here's my code:

    import spacy
    import scispacy
    
    from scispacy.linking import EntityLinker
    
    nlp = spacy.load("en_ner_craft_md")
    nlp.add_pipe("abbreviation_detector")
    nlp.add_pipe(
        "scispacy_linker",
        config={"resolve_abbreviations": True, "linker_name": "mesh"},
    )
    doc = nlp("SBMA can be caused by this easily.")  # from the scispacy example
    
    

    I get the following error:

    ../site-packages/scispacy/abbreviation.py:230: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
      global_matches = self.global_matcher(doc)
    

    Any guidance would be greatly appreciated!

    scispacy                  0.5.1  
    spacy                     3.4.4  
    
    opened by hrshdhgd 1
  • "Mesh" and "Hpo" linkers give the same result

    Hi, I'm trying to annotate data using scispaCy. Loading the "mesh" and "hpo" linkers gives exactly the same results no matter what the input is. For example: (screenshots image-1, image-2, image-3)

    I tried many texts and both linkers produced the same results.

    opened by almogmor 6
  • Incompatibility error when installing en_core_sci_sm

    I ran `pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz`
    and got this error:
    
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    scispacy 0.4.0 requires spacy<3.1.0,>=3.0.0, but you have spacy 3.4.4 which is incompatible.
    en-core-web-sm 3.0.0 requires spacy<3.1.0,>=3.0.0, but you have spacy 3.4.4 which is incompatible.
    docanalysis 0.2.0 requires spacy==3.0.7, but you have spacy 3.4.4 which is incompatible.
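
    The messages above suggest that an older scispacy (0.4.0) and other packages pinned to spaCy 3.0.x are still installed alongside spaCy 3.4.4. As a minimal sketch, the snippet below uses only the standard library to list which versions are installed before deciding what to upgrade or remove. The helper name is hypothetical and the package names are simply those from the error output.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names copied from the pip error message above.
    names = ["spacy", "scispacy", "en-core-web-sm", "docanalysis"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

    If the output shows scispacy 0.4.0 next to a 0.5.1 model, the library and the model were likely built for different spaCy versions.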
    
    opened by EmanuelFaria 1
  • entity recognition doesn't recognize locations

    Hi, thank you for this wonderful library! I'm trying to use 'en_core_sci_lg' for a simple entity recognition task. I'm not sure whether I'm missing something in the setup or whether it's a bug, and would appreciate the help. This is the output of an example from the spaCy documentation.

    When trying this:

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    

    the result is -

    Apple 0 5 ORG
    U.K. 27 31 GPE
    $1 billion 44 54 MONEY
    

    But when trying the same code with en_core_sci_lg:

    import spacy
    
    nlp = spacy.load('en_core_sci_lg')
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    

    the result is -

    Apple 0 5 ENTITY
    U.K. 27 31 ENTITY
    startup 32 39 ENTITY
    

    I'm working on Google Colab and installed the following:

    ! pip install spacy
    ! pip install scispacy
    ! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz

    Thank you!

    opened by maayansharon10 1
  • Parsed Entity linked incorrectly to UMLS concept

    Hi,

    I'm parsing text from clinicaltrials.gov (Trial ID NCT04837209) using scispaCy with the language model 'en_core_sci_md', and I'm seeing 'Dostarlimab' linked to UMLS concept C1621793, which is a bird (a Starling).

    It looks like this is the result of fuzzy matching, since both words have a substring ('starli') in common, as evidenced by the low match probability (0.5594).

    However, the biologic drug Dostarlimab is in the latest UMLS release (2022AB) as the concept C5242455. Is scispaCy linking to an older version of UMLS?

    Thanks, Ron

    opened by rxk2rxk 2
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
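
    The kind of check described above can be sketched roughly as follows. This is an illustration of the general technique, not the patch submitted in the pull request; the function names are hypothetical, and the check covers member names only (it does not inspect symlink targets).

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members, refusing archives that would escape `path`."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Unsafe path in tar archive: {member.name}")
    tar.extractall(path)
```

    A caller would open the archive with tarfile.open(...) and pass the resulting object to safe_extractall instead of calling extractall() directly. Recent Python versions also offer extraction filters in the tarfile module, which may be preferable where available.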

    opened by TrellixVulnTeam 0
Releases(v0.5.1)
  • v0.5.1(Sep 7, 2022)

  • v0.5.0(Mar 10, 2022)

  • v0.4.0(Feb 12, 2021)

    This release of scispacy is compatible with spaCy 3. It also includes a new model 🥳, en_core_sci_scibert, which uses scibert base uncased to do parsing and POS tagging (but not NER yet; this will come in a later release).

  • v0.3.0(Oct 16, 2020)

    New Features

    Hearst Patterns

    This component implements Automatic Acquisition of Hyponyms from Large Text Corpora using the spaCy Matcher component.

    Passing extended=True to the HyponymDetector will use the extended set of Hearst patterns, which includes higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).

    This component produces a doc level attribute on the spacy doc: doc._.hearst_patterns, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

    • The relation rule used to extract the hyponym (type: str)
    • The more general concept (type: spacy.Span)
    • The more specific concept (type: spacy.Span)

    Usage:

    import spacy
    from scispacy.hyponym_detector import HyponymDetector
    
    nlp = spacy.load("en_core_sci_sm")
    hyponym_pipe = HyponymDetector(nlp, extended=True)
    nlp.add_pipe(hyponym_pipe, last=True)
    
    doc = nlp("Keystone plant species such as fig trees are good for the soil.")
    
    print(doc._.hearst_patterns)
    >>> [('such_as', Keystone plant species, fig trees)]
    

    Ontonotes Mixin: Clear Format > UD

    Thanks to Yoav Goldberg for this fix! Yoav noticed that the dependency labels for the OntoNotes data use a different format than the converted GENIA trees. Yoav wrote some scripts to convert between them, including normalising some syntactic phenomena that were being treated inconsistently between the two corpora.

    Bug Fixes

    • #252 - removed duplicated aliases in the entity linkers, reducing the size of the UMLS linker by ~10%
    • #249 - fix the path to the rxnorm linker

  • v0.2.5(Jul 8, 2020)

    New Features 🥇

    New Models

    • Models compatible with Spacy 2.3.0 🥳

    Entity Linkers

    #246, #233

    • Updated the UMLS KB to use the 2020AA release, categories 0,1,2,9.

    • umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.

    • mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in PubMed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.

    • rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It comprises several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

    • go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.

    • hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

    Bug Fixes 🐛

    #217 - Fixes a bug in the Abbreviation detector

    API Changes

    • Entity Linkers now modify the Span._.kb_ents rather than the Span._.umls_ents to reflect the fact that we now have more than one entity linker. Span._.umls_ents will be deprecated in v1.0.
  • v0.2.4(Oct 23, 2019)

    Retrains the models to be compatible with spacy 2.2.1 and rewrites the optional sentence splitting pipe to use pysbd. This pipe is experimental at this point and may be rough around the edges.

  • v0.2.2(Jun 3, 2019)

  • v0.2.0(Apr 3, 2019)
