A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Overview

Crosslingual Coreference

Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also proved to be poorly annotated. Crosslingual Coreference, therefore, uses the assumption a trained model with English data and cross-lingual embeddings should work for languages with similar sentence structures.

Current Release Version pypi Version PyPi downloads Code style: black

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are two models available "spanbert", "info_xlm", "xlm_roberta", "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

  • The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for english texts.

Chunking/batching to resolve memory OOM errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

More Examples

Comments
  • Which language model is using for minilm

    Which language model is using for minilm

    I am using the following code snippet for coreference resolution

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
    

    While checking the below source code,

    "minilm": {
            "url": (
                "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
            ),
            "f1_score_ontonotes": 74,
            "file_extension": ".tar.gz",
        },
    

    it seems that the language model using here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Is this the same one that I can see in https://huggingface.co/models like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main or any other huggingface model?

    opened by pradeepdev-1995 7
  • Error when using coref as a spaCy pipeline

    Error when using coref as a spaCy pipeline

    Hi all, while trying to run a spacy test

    import spacy
    import crosslingual_coreference
    
    text = """
        Do not forget about Momofuku Ando!
        He created instant noodles in Osaka.
        At that location, Nissin was founded.
        Many students survived by eating these noodles, but they don't even know him."""
    
    # use any model that has internal spacy embeddings
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})
    
    doc = nlp(text)
    
    print(doc._.coref_clusters)
    print(doc._.resolved_text)
    

    I encountered the following issue:

    [nltk_data] Downloading package omw-1.4 to
    [nltk_data]     /home/user/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    Traceback (most recent call last):
      File "/home/user/test_coref/test.py", line 12, in <module>
        nlp.add_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
        return SpacyPredictor(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
        super().__init__(language, device, model_name, chunk_size, chunk_overlap)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
        self.set_coref_model()
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
        self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
        load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
        dataset_reader, validation_dataset_reader = _load_dataset_readers(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
        dataset_reader = DatasetReader.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
        kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
        constructed_arg = pop_and_construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
        return construct_arg(class_name, name, popped_params, annotation, default, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
        value_dict[key] = construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
        result = annotation.from_params(params=popped_params, **subextras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
        return constructor_to_call(**kwargs)  # type: ignore
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
        self._matched_indexer = PretrainedTransformerIndexer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
        self._allennlp_tokenizer = PretrainedTransformerTokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
        self.tokenizer = cached_transformers.get_tokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
        tokenizer = transformers.AutoTokenizer.from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
        return cls._from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
        super().__init__(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: EOF while parsing a list at line 1 column 4920583
    

    Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

    (.venv) [email protected]$ pip freeze
    aiohttp==3.8.1
    aiosignal==1.2.0
    allennlp==2.9.3
    allennlp-models==2.9.3
    async-timeout==4.0.2
    attrs==21.4.0
    base58==2.1.1
    blis==0.7.7
    boto3==1.23.5
    botocore==1.26.5
    cached-path==1.1.2
    cachetools==5.1.0
    catalogue==2.0.7
    certifi==2022.5.18.1
    charset-normalizer==2.0.12
    click==8.0.4
    conllu==4.4.1
    crosslingual-coreference==0.2.4
    cymem==2.0.6
    datasets==2.2.1
    dill==0.3.5.1
    docker-pycreds==0.4.0
    en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
    en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
    fairscale==0.4.6
    filelock==3.6.0
    frozenlist==1.3.0
    fsspec==2022.5.0
    ftfy==6.1.1
    gitdb==4.0.9
    GitPython==3.1.27
    google-api-core==2.8.0
    google-auth==2.6.6
    google-cloud-core==2.3.0
    google-cloud-storage==2.3.0
    google-crc32c==1.3.0
    google-resumable-media==2.3.3
    googleapis-common-protos==1.56.1
    h5py==3.6.0
    huggingface-hub==0.5.1
    idna==3.3
    iniconfig==1.1.1
    Jinja2==3.1.2
    jmespath==1.0.0
    joblib==1.1.0
    jsonnet==0.18.0
    langcodes==3.3.0
    lmdb==1.3.0
    MarkupSafe==2.1.1
    more-itertools==8.13.0
    multidict==6.0.2
    multiprocess==0.70.12.2
    murmurhash==1.0.7
    nltk==3.7
    numpy==1.22.4
    packaging==21.3
    pandas==1.4.2
    pathtools==0.1.2
    pathy==0.6.1
    Pillow==9.1.1
    pluggy==1.0.0
    preshed==3.0.6
    promise==2.3
    protobuf==3.20.1
    psutil==5.9.1
    py==1.11.0
    py-rouge==1.1
    pyarrow==8.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydantic==1.8.2
    pyparsing==3.0.9
    pytest==7.1.2
    python-dateutil==2.8.2
    pytz==2022.1
    PyYAML==6.0
    regex==2022.4.24
    requests==2.27.1
    responses==0.18.0
    rsa==4.8
    s3transfer==0.5.2
    sacremoses==0.0.53
    scikit-learn==1.1.1
    scipy==1.6.1
    sentence-transformers==2.2.0
    sentencepiece==0.1.96
    sentry-sdk==1.5.12
    setproctitle==1.2.3
    shortuuid==1.0.9
    six==1.16.0
    smart-open==5.2.1
    smmap==5.0.0
    spacy==3.2.4
    spacy-alignments==0.8.5
    spacy-legacy==3.0.9
    spacy-loggers==1.0.2
    spacy-sentence-bert==0.1.2
    spacy-transformers==1.1.5
    srsly==2.4.3
    tensorboardX==2.5
    termcolor==1.1.0
    thinc==8.0.16
    threadpoolctl==3.1.0
    tokenizers==0.12.1
    tomli==2.0.1
    torch==1.10.2
    torchaudio==0.10.2
    torchvision==0.11.3
    tqdm==4.64.0
    transformers==4.17.0
    typer==0.4.1
    typing-extensions==4.2.0
    urllib3==1.26.9
    wandb==0.12.16
    wasabi==0.9.1
    wcwidth==0.2.5
    word2number==1.1
    xxhash==3.0.0
    yarl==1.7.2
    

    Do you have any recommendations? Is there an installation step missing?

    Thanks in advance!

    opened by alexander-belikov 4
  • Comparatively high initial prediction time for first predict() hit

    Comparatively high initial prediction time for first predict() hit

    I am using minilm model with language 'en_core_web_sm'. While comparing the prediction time, i.e., predictor.predict(text), the prediction time for first hit is always a bit high than the following hits. Suppose after creating a predictor object, I call predict as follows:

    predictor.predict(text) ---> first call predictor.predict(text) ---> second call predictor.predict(text) ---> third call

    Time taken for the first call is comparatively a bit higher(.2 sec) than the next prediction calls(.05 sec). Could you please help me understand why this initial hit takes a bit high prediction time?

    opened by nemeer 2
  • Why does this package need to install google cloud auth, storage, api etc?

    Why does this package need to install google cloud auth, storage, api etc?

    Hi,

    after installing the library I saw google-api-core-2.10.1 google-auth-2.12.0 google-cloud-core-2.3.2 google-cloud-storage-1.44.0 have been installed as well. In fact these packages can be found in the poetry.lock file.

    Is there a reason (I don't get) why this library needs these packages?

    Thanks

    opened by GiacomoCherry 1
  • HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Python 3.8.13 Spacy - 3.1.0 en_core_web_sm-3.1.0 crosslingual_coreference - 0.2.8

    requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

    opened by jscoder1009 1
  • Retrieving cluster heads without replacing corefs

    Retrieving cluster heads without replacing corefs

    I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads to get the cluster heads without getting the reconstituted text. It could be a separate function that also acts as input into replace_corefs potentially.

    opened by MikeMikeMikeMike 1
  • [Errno 101] Network is unreachable

    [Errno 101] Network is unreachable

    Hello, when I try to run the code below

    predictor = Predictor(
        language="en_core_web_sm", device=1, model_name="info_xlm"
    )
    

    I get the following error:

    ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

    Is this url still valid and what should I use instead?

    opened by ttranslit 1
  • spaCy issues and suggestions

    spaCy issues and suggestions

    @martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

    • batching
    • empty cluster issue (resolved)
    • additional model pro-drop languages
    bug enhancement 
    opened by davidberenstein1957 0
  • feat: look into ONNX enhanched transformer embeddings

    feat: look into ONNX enhanched transformer embeddings

    Creating embeddings roughly takes 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py hold the logic for creating these embeddings. Make sure we can call them in a faster way.

    enhancement 
    opened by davidberenstein1957 3
Releases(0.2.9)
Owner
Pandora Intelligence
Pandora Intelligence is an independent intelligence company, specialized in security risks.
Pandora Intelligence
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

The implementation of paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. CLIP4Clip is a video-text retrieval model based

ArrowLuo 456 Jan 06, 2023
aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

tanreinama 13 Aug 11, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

NVIDIA Corporation 3.5k Dec 30, 2022
💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

Explosion 24.9k Jan 02, 2023
A multi-voice TTS system trained with an emphasis on quality

TorToiSe Tortoise is a text-to-speech program built with the following priorities: Strong multi-voice capabilities. Highly realistic prosody and inton

James Betker 2.1k Jan 01, 2023
Chatbot for the Chatango messaging platform

BroiestBot The baddest bot in the game right now. Uses the ch.py framework for joining Chantango rooms and responding to user messages. Commands If a

Todd Birchard 3 Jan 17, 2022
pyupbit 라이브러리를 활용하여 upbit에서 비트코인을 자동매매하는 코드입니다. 조코딩 유튜브 채널에서 자세한 강의 영상을 보실 수 있습니다.

파이썬 비트코인 투자 자동화 강의 코드 by 유튜브 조코딩 채널 pyupbit 라이브러리를 활용하여 upbit 거래소에서 비트코인 자동매매를 하는 코드입니다. 파일 구성 test.py : 잔고 조회 (1강) backtest.py : 백테스팅 코드 (2강) bestK.p

조코딩 JoCoding 186 Dec 29, 2022
⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x using fastT5.

Reduce T5 model size by 3X and increase the inference speed up to 5X. Install Usage Details Functionalities Benchmarks Onnx model Quantized onnx model

Kiran R 399 Jan 05, 2023
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Practical Natural Language Processing Tools for Humans is build on the top of Senna Natural Language Processing (NLP)

Practical Natural Language Processing Tools for Humans is build on the top of Senna Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), name entity recognition (

jawahar 20 Apr 30, 2022
PyTranslator é simultaneamente um editor e tradutor de texto com diversos recursos e interface feito com coração e 100% em Python

PyTranslator O Que é e para que serve o PyTranslator? PyTranslator é simultaneamente um editor e tradutor de texto em com interface gráfica que usa a

Elizeu Barbosa Abreu 1 May 12, 2022
CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT

Microsoft 1k Jan 03, 2023
Words-per-minute - A terminal app written in python utilizing the curses module that tests the user's ability to type

words-per-minute A terminal app written in python utilizing the curses module th

Tanim Islam 1 Jan 14, 2022
A linter to manage all your python exceptions and try/except blocks (limited only for those who like dinosaurs).

Manage your exceptions in Python like a PRO Currently in BETA. Inspired by this blog post. I shared the building process of this tool here. “For those

Guilherme Latrova 353 Dec 31, 2022
Natural Language Processing library built with AllenNLP 🌲🌱

Custom Natural Language Processing with big and small models 🌲🌱

Recognai 65 Sep 13, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
Label data using HuggingFace's transformers and automatically get a prediction service

Label Studio for Hugging Face's Transformers Website • Docs • Twitter • Join Slack Community Transfer learning for NLP models by annotating your textu

Heartex 135 Dec 29, 2022
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

UNITER: UNiversal Image-TExt Representation Learning This is the official repository of UNITER (ECCV 2020). This repository currently supports finetun

Yen-Chun Chen 680 Dec 24, 2022
This simple Python program calculates a love score based on your and your crush's full names in English

This simple Python program calculates a love score based on your and your crush's full names in English. There is no logic or reason in the calculation behind the love score. The calculation could ha

p.katekomol 1 Jan 24, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022