A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Overview

Crosslingual Coreference

Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also proved to be poorly annotated. Crosslingual Coreference, therefore, uses the assumption a trained model with English data and cross-lingual embeddings should work for languages with similar sentence structures.

Current Release Version pypi Version PyPi downloads Code style: black

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are two models available "spanbert", "info_xlm", "xlm_roberta", "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

  • The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for english texts.

Chunking/batching to resolve memory OOM errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

More Examples

Comments
  • Which language model is using for minilm

    Which language model is using for minilm

    I am using the following code snippet for coreference resolution

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
    

    While checking the below source code,

    "minilm": {
            "url": (
                "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
            ),
            "f1_score_ontonotes": 74,
            "file_extension": ".tar.gz",
        },
    

    it seems that the language model using here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Is this the same one that I can see in https://huggingface.co/models like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main or any other huggingface model?

    opened by pradeepdev-1995 7
  • Error when using coref as a spaCy pipeline

    Error when using coref as a spaCy pipeline

    Hi all, while trying to run a spacy test

    import spacy
    import crosslingual_coreference
    
    text = """
        Do not forget about Momofuku Ando!
        He created instant noodles in Osaka.
        At that location, Nissin was founded.
        Many students survived by eating these noodles, but they don't even know him."""
    
    # use any model that has internal spacy embeddings
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})
    
    doc = nlp(text)
    
    print(doc._.coref_clusters)
    print(doc._.resolved_text)
    

    I encountered the following issue:

    [nltk_data] Downloading package omw-1.4 to
    [nltk_data]     /home/user/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    Traceback (most recent call last):
      File "/home/user/test_coref/test.py", line 12, in <module>
        nlp.add_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
        return SpacyPredictor(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
        super().__init__(language, device, model_name, chunk_size, chunk_overlap)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
        self.set_coref_model()
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
        self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
        load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
        dataset_reader, validation_dataset_reader = _load_dataset_readers(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
        dataset_reader = DatasetReader.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
        kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
        constructed_arg = pop_and_construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
        return construct_arg(class_name, name, popped_params, annotation, default, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
        value_dict[key] = construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
        result = annotation.from_params(params=popped_params, **subextras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
        return constructor_to_call(**kwargs)  # type: ignore
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
        self._matched_indexer = PretrainedTransformerIndexer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
        self._allennlp_tokenizer = PretrainedTransformerTokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
        self.tokenizer = cached_transformers.get_tokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
        tokenizer = transformers.AutoTokenizer.from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
        return cls._from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
        super().__init__(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: EOF while parsing a list at line 1 column 4920583
    

    Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

    (.venv) [email protected]$ pip freeze
    aiohttp==3.8.1
    aiosignal==1.2.0
    allennlp==2.9.3
    allennlp-models==2.9.3
    async-timeout==4.0.2
    attrs==21.4.0
    base58==2.1.1
    blis==0.7.7
    boto3==1.23.5
    botocore==1.26.5
    cached-path==1.1.2
    cachetools==5.1.0
    catalogue==2.0.7
    certifi==2022.5.18.1
    charset-normalizer==2.0.12
    click==8.0.4
    conllu==4.4.1
    crosslingual-coreference==0.2.4
    cymem==2.0.6
    datasets==2.2.1
    dill==0.3.5.1
    docker-pycreds==0.4.0
    en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
    en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
    fairscale==0.4.6
    filelock==3.6.0
    frozenlist==1.3.0
    fsspec==2022.5.0
    ftfy==6.1.1
    gitdb==4.0.9
    GitPython==3.1.27
    google-api-core==2.8.0
    google-auth==2.6.6
    google-cloud-core==2.3.0
    google-cloud-storage==2.3.0
    google-crc32c==1.3.0
    google-resumable-media==2.3.3
    googleapis-common-protos==1.56.1
    h5py==3.6.0
    huggingface-hub==0.5.1
    idna==3.3
    iniconfig==1.1.1
    Jinja2==3.1.2
    jmespath==1.0.0
    joblib==1.1.0
    jsonnet==0.18.0
    langcodes==3.3.0
    lmdb==1.3.0
    MarkupSafe==2.1.1
    more-itertools==8.13.0
    multidict==6.0.2
    multiprocess==0.70.12.2
    murmurhash==1.0.7
    nltk==3.7
    numpy==1.22.4
    packaging==21.3
    pandas==1.4.2
    pathtools==0.1.2
    pathy==0.6.1
    Pillow==9.1.1
    pluggy==1.0.0
    preshed==3.0.6
    promise==2.3
    protobuf==3.20.1
    psutil==5.9.1
    py==1.11.0
    py-rouge==1.1
    pyarrow==8.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydantic==1.8.2
    pyparsing==3.0.9
    pytest==7.1.2
    python-dateutil==2.8.2
    pytz==2022.1
    PyYAML==6.0
    regex==2022.4.24
    requests==2.27.1
    responses==0.18.0
    rsa==4.8
    s3transfer==0.5.2
    sacremoses==0.0.53
    scikit-learn==1.1.1
    scipy==1.6.1
    sentence-transformers==2.2.0
    sentencepiece==0.1.96
    sentry-sdk==1.5.12
    setproctitle==1.2.3
    shortuuid==1.0.9
    six==1.16.0
    smart-open==5.2.1
    smmap==5.0.0
    spacy==3.2.4
    spacy-alignments==0.8.5
    spacy-legacy==3.0.9
    spacy-loggers==1.0.2
    spacy-sentence-bert==0.1.2
    spacy-transformers==1.1.5
    srsly==2.4.3
    tensorboardX==2.5
    termcolor==1.1.0
    thinc==8.0.16
    threadpoolctl==3.1.0
    tokenizers==0.12.1
    tomli==2.0.1
    torch==1.10.2
    torchaudio==0.10.2
    torchvision==0.11.3
    tqdm==4.64.0
    transformers==4.17.0
    typer==0.4.1
    typing-extensions==4.2.0
    urllib3==1.26.9
    wandb==0.12.16
    wasabi==0.9.1
    wcwidth==0.2.5
    word2number==1.1
    xxhash==3.0.0
    yarl==1.7.2
    

    Do you have any recommendations? Is there an installation step missing?

    Thanks in advance!

    opened by alexander-belikov 4
  • Comparatively high initial prediction time for first predict() hit

    Comparatively high initial prediction time for first predict() hit

    I am using minilm model with language 'en_core_web_sm'. While comparing the prediction time, i.e., predictor.predict(text), the prediction time for first hit is always a bit high than the following hits. Suppose after creating a predictor object, I call predict as follows:

    predictor.predict(text) ---> first call predictor.predict(text) ---> second call predictor.predict(text) ---> third call

    Time taken for the first call is comparatively a bit higher(.2 sec) than the next prediction calls(.05 sec). Could you please help me understand why this initial hit takes a bit high prediction time?

    opened by nemeer 2
  • Why does this package need to install google cloud auth, storage, api etc?

    Why does this package need to install google cloud auth, storage, api etc?

    Hi,

    after installing the library I saw google-api-core-2.10.1 google-auth-2.12.0 google-cloud-core-2.3.2 google-cloud-storage-1.44.0 have been installed as well. In fact these packages can be found in the poetry.lock file.

    Is there a reason (I don't get) why this library needs these packages?

    Thanks

    opened by GiacomoCherry 1
  • HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Python 3.8.13 Spacy - 3.1.0 en_core_web_sm-3.1.0 crosslingual_coreference - 0.2.8

    requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

    opened by jscoder1009 1
  • Retrieving cluster heads without replacing corefs

    Retrieving cluster heads without replacing corefs

    I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads to get the cluster heads without getting the reconstituted text. It could be a separate function that also acts as input into replace_corefs potentially.

    opened by MikeMikeMikeMike 1
  • [Errno 101] Network is unreachable

    [Errno 101] Network is unreachable

    Hello, when I try to run the code below

    predictor = Predictor(
        language="en_core_web_sm", device=1, model_name="info_xlm"
    )
    

    I get the following error:

    ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

    Is this url still valid and what should I use instead?

    opened by ttranslit 1
  • spaCy issues and suggestions

    spaCy issues and suggestions

    @martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

    • batching
    • empty cluster issue (resolved)
    • additional model pro-drop languages
    bug enhancement 
    opened by davidberenstein1957 0
  • feat: look into ONNX enhanched transformer embeddings

    feat: look into ONNX enhanched transformer embeddings

    Creating embeddings roughly takes 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py hold the logic for creating these embeddings. Make sure we can call them in a faster way.

    enhancement 
    opened by davidberenstein1957 3
Releases(0.2.9)
Owner
Pandora Intelligence
Pandora Intelligence is an independent intelligence company, specialized in security risks.
Pandora Intelligence
Ceaser-Cipher - The Caesar Cipher technique is one of the earliest and simplest method of encryption technique

Ceaser-Cipher The Caesar Cipher technique is one of the earliest and simplest me

Lateefah Ajadi 2 May 12, 2022
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Chenhe Dong 28 Nov 10, 2022
Auto-researching tool generating word documents.

About ResearchTE automates researching by generating document with answers to given questions. Supports getting results from: Google DuckDuckGo (with

1 Feb 14, 2022
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
Treemap visualisation of Maya scene files

Ever wondered which nodes are responsible for that 600 mb+ Maya scene file? Features Fast, resizable UI Parsing at 50 mb/sec Dependency-free, single-f

Marcus Ottosson 76 Nov 12, 2022
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
This program do translate english words to portuguese

Python-Dictionary This program is used to translate english words to portuguese. Web-Scraping This program use BeautifulSoap to make web scraping, so

João Assalim 1 Oct 10, 2022
基于Transformer的单模型、多尺度的VAE模型

UniVAE 基于Transformer的单模型、多尺度的VAE模型 介绍 https://kexue.fm/archives/8475 依赖 需要大于0.10.6版本的bert4keras(当前还没有推到pypi上,可以直接从GitHub上clone最新版)。 引用 @misc{univae,

苏剑林(Jianlin Su) 49 Aug 24, 2022
Collection of scripts to pinpoint obfuscated code

Obfuscation Detection (v1.0) Author: Tim Blazytko Automatically detect control-flow flattening and other state machines Description: Scripts and binar

Tim Blazytko 230 Nov 26, 2022
Adversarial Examples for Extreme Multilabel Text Classification

Adversarial Examples for Extreme Multilabel Text Classification The code is adapted from the source codes of BERT-ATTACK [1], APLC_XLNet [2], and Atte

1 May 14, 2022
Pangu-Alpha for Transformers

Pangu-Alpha for Transformers Usage Download MindSpore FP32 weights for GPU from here to data/Pangu-alpha_2.6B.ckpt Activate MindSpore environment and

One 5 Oct 01, 2022
This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

About spellchecker.py Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levensht

Raihan Ahmed 1 Dec 11, 2021
Search for documents in a domain through Google. The objective is to extract metadata

MetaFinder - Metadata search through Google _____ __ ___________ .__ .___ / \

Josué Encinar 85 Dec 16, 2022
Speech Recognition Database Management with python

Speech Recognition Database Management The main aim of this project is to recogn

Abhishek Kumar Jha 2 Feb 02, 2022
Final Project Bootcamp Zero

The Quest (Pygame) Descripción Este es el repositorio de código The-Quest para el proyecto final Bootcamp Zero de KeepCoding. El juego consiste en la

Seven-z01 1 Mar 02, 2022
SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors

SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors [Paper] [Project Website] Pytorch implementation for SAVI2I. We

Qi Mao 44 Dec 30, 2022