Official Stanford NLP Python Library for Many Human Languages

Overview

Stanza: A Python NLP Library for Many Human Languages

The Stanford NLP Group's official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. For detailed information please visit our official website.

🔥  A new collection of biomedical and clinical English model packages is now available, offering a seamless experience for syntactic analysis and named entity recognition (NER) on biomedical literature text and clinical notes. For more information, check out our Biomedical models documentation page.

References

If you use this library in your research, please cite our ACL 2020 Stanza system demo paper:

@inproceedings{qi2020stanza,
    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    year={2020}
}

If you use our biomedical and clinical models, please also cite our Stanza Biomedical Models description paper:

@article{zhang2020biomedical,
  title={Biomedical and Clinical English Model Packages in the Stanza Python NLP Library},
  author={Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D. and Langlotz, Curtis P.},
  journal={arXiv preprint arXiv:2007.14640},
  year={2020}
}

The PyTorch implementation of the neural pipeline in this repository is due to Peng Qi, Yuhao Zhang, and Yuhui Zhang, with help from Jason Bolton, Tim Dozat and John Bauer. Maintenance of this repo is currently led by John Bauer.

If you use the CoreNLP software through Stanza, please cite the CoreNLP software package and the respective modules as described here ("Citing Stanford CoreNLP in papers"). The CoreNLP client is mostly written by Arun Chaganty, and Jason Bolton spearheaded merging the two projects together.

Issues and Usage Q&A

To ask questions, report issues or request features 🤔 , please use the GitHub Issue Tracker. Before creating a new issue, please make sure to search for existing issues that may solve your problem, or visit the Frequently Asked Questions (FAQ) page on our website.

Contributing to Stanza

We welcome community contributions to Stanza in the form of bugfixes 🛠️ and enhancements 💡 ! If you want to contribute, please first read our contribution guideline.

Installation

pip

Stanza supports Python 3.6 or later. We recommend that you install Stanza via pip, the Python package manager. To install, simply run:

pip install stanza

This should also resolve all of Stanza's dependencies, for instance PyTorch 1.3.0 or above.

If you currently have a previous version of stanza installed, use:

pip install stanza -U

Anaconda

To install Stanza via Anaconda, use the following conda command:

conda install -c stanfordnlp stanza

Note that for now installing Stanza via Anaconda does not work for Python 3.8. For Python 3.8 please use pip installation.

From Source

Alternatively, you can install from the source of this git repository, which gives you more flexibility in developing on top of Stanza. For this option, run:

git clone https://github.com/stanfordnlp/stanza.git
cd stanza
pip install -e .

Running Stanza

Getting Started with the neural pipeline

To run your first Stanza pipeline, simply follow these steps in your Python interactive interpreter:

>>> import stanza
>>> stanza.download('en')       # This downloads the English models for the neural pipeline
>>> nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command prints the words in the first sentence of the input string (or Document, as it is represented in Stanza), together with the index of the word that governs each one in the Universal Dependencies parse of that sentence (its "head") and the dependency relation between the two. Head indices are 1-based, and 0 marks the root. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')

See our getting started guide for more details.
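
To make the head indices easier to read, the triples above can be post-processed in plain Python. The snippet below is only a sketch: `resolve_heads` is a hypothetical helper, not part of Stanza, and it assumes the triples have exactly the (word, head index, relation) shape printed above.

```python
# Triples copied from the print_dependencies() output shown above.
deps = [
    ("Barack", "4", "nsubj:pass"),
    ("Obama", "1", "flat"),
    ("was", "4", "aux:pass"),
    ("born", "0", "root"),
    ("in", "6", "case"),
    ("Hawaii", "4", "obl"),
    (".", "4", "punct"),
]


def resolve_heads(triples):
    """Replace each 1-based head index with the head word; 0 means the root."""
    words = [word for word, _, _ in triples]
    resolved = []
    for word, head, relation in triples:
        index = int(head)
        head_word = "ROOT" if index == 0 else words[index - 1]
        resolved.append((word, head_word, relation))
    return resolved


for word, head_word, relation in resolve_heads(deps):
    print(f"{word} --{relation}--> {head_word}")
```

The "ROOT" label is an illustrative choice for head index 0; any placeholder would do.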

Accessing Java Stanford CoreNLP software

Aside from the neural pipeline, this package also includes an official wrapper for accessing the Java Stanford CoreNLP software with Python code.

There are a few initial setup steps.

  • Download Stanford CoreNLP and models for the language you wish to use
  • Put the model jars in the distribution folder
  • Tell the Python code where Stanford CoreNLP is located by setting the CORENLP_HOME environment variable (e.g., in *nix): export CORENLP_HOME=/path/to/stanford-corenlp-4.1.0
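
The last step can also be done from within Python before the client is created, instead of exporting the variable in the shell. The following is a minimal sketch under that assumption: `set_corenlp_home` is a hypothetical helper, not a Stanza function, and the path is the placeholder from the step above.

```python
import os
from pathlib import Path


def set_corenlp_home(path):
    """Set CORENLP_HOME and return the jar file names found in that folder.

    An empty list means the folder is missing or holds no jars, which is a
    hint that the distribution or model jars are not where they should be.
    """
    home = Path(path).expanduser()
    os.environ["CORENLP_HOME"] = str(home)
    if not home.is_dir():
        return []
    return sorted(p.name for p in home.glob("*.jar"))


# Placeholder path; replace it with the real location of the distribution.
jars = set_corenlp_home("/path/to/stanford-corenlp-4.1.0")
print(os.environ["CORENLP_HOME"], jars)
```

Setting the variable this way only affects the current Python process and the processes it starts.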

We provide comprehensive examples in our documentation that show how one can use CoreNLP through Stanza and extract various annotations from it.

Online Colab Notebooks

To get you started, we also provide interactive Jupyter notebooks in the demo folder. You can also open these notebooks and run them interactively on Google Colab. To view all available notebooks, follow these steps:

  • Go to the Google Colab website
  • Navigate to File -> Open notebook, and choose GitHub in the pop-up menu
  • Note that you do not need to give Colab access permission to your GitHub account
  • Type stanfordnlp/stanza in the search bar, and press Enter

Trained Models for the Neural Pipeline

We currently provide models for all of the Universal Dependencies treebanks v2.5, as well as NER models for a few widely-spoken languages. You can find instructions for downloading and using these models here.

Batching To Maximize Pipeline Speed

To maximize speed, it is essential to run the pipeline on batches of documents. Running a for loop over one sentence at a time will be very slow. The best approach at this time is to concatenate documents together, with each document separated by a blank line (i.e., two line breaks \n\n). The tokenizer will recognize blank lines as sentence breaks. We are actively working on improving multi-document processing.
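
The concatenation step described above can be sketched in a few lines of plain Python. `join_documents` is a hypothetical helper introduced for illustration; the resulting string would then be passed to the pipeline in a single call.

```python
def join_documents(documents):
    """Join documents with a blank line so the tokenizer sees a break between them."""
    cleaned = [doc.strip() for doc in documents if doc.strip()]
    return "\n\n".join(cleaned)


docs = [
    "Barack Obama was born in Hawaii.",
    "He was elected president in 2008.",
]
batch = join_documents(docs)
print(batch)
# The batch is then processed in one call rather than one call per document,
# for example: doc = nlp(batch)
```

Stripping each document first avoids accidental runs of three or more line breaks when the inputs already end in a newline.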

Training your own neural pipelines

All neural modules in this library can be trained with your own data. The tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer and the dependency parser require CoNLL-U formatted data, while the NER model requires the BIOES format. Currently, we do not support model training via the Pipeline interface. Therefore, to train your own models, you need to clone this git repository and run training from the source.
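
For readers unfamiliar with the BIOES scheme mentioned above, the sketch below converts the more common BIO tags into BIOES tags (B = begin, I = inside, O = outside, E = end, S = single-token entity). It only illustrates the tagging scheme: `bio_to_bioes` is a hypothetical helper, and the exact file layout expected by the training scripts is described in the training documentation, not here.

```python
def bio_to_bioes(tags):
    """Convert a list of BIO tags (e.g. B-PER, I-PER, O) to BIOES tags."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:  # prefix is "I"
            converted.append(("I-" if continues else "E-") + label)
    return converted


print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
```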

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our training documentation.

LICENSE

Stanza is released under the Apache License, Version 2.0. See the LICENSE file for more details.

Comments
  • google.protobuf.message.DecodeError: Error parsing message

    Description

    I think this is similar to a bug in the old Python library, python-stanford-corenlp. I'm trying to copy the demo for the client here or here, but with my own texts. text2 works and text3 doesn't; the only difference between them is the very last word.

    The error I get is:

    Traceback (most recent call last):
      File "C:/gitProjects/patentmoto2/scratch4.py", line 23, in <module>
        ann = client.annotate(text)
      File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 403, in annotate
        parseFromDelimitedString(doc, r.content)
      File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\protobuf\__init__.py", line 18, in parseFromDelimitedString
        obj.ParseFromString(buf[offset+pos:offset+pos+size])
    google.protobuf.message.DecodeError: Error parsing message
    

    To Reproduce

    Steps to reproduce the behavior:

    
    from stanfordnlp.server import CoreNLPClient

    print('---')
    print('input text')
    print('')
    
    text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
    text2 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially 
equal to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and."
    text3 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially 
equal to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and his."
    
    text = text3
    print(text)
    
    
    print('---')
    print('starting up Java Stanford CoreNLP Server...')
    
    
    with CoreNLPClient(endpoint="http://localhost:9000", annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                       timeout=70000, memory='16G', threads=10, be_quiet=False) as client:
    
        ann = client.annotate(text)
    
    
        sentence = ann.sentence[0]
    
    
        print('---')
        print('constituency parse of first sentence')
        constituency_parse = sentence.parseTree
        print(constituency_parse)
    

    Expected behavior

    I expect it to finish. text=text2 succeeds, but text=text3 fails with the above error. The only difference between the texts is the last word 'his' (could really be anything I think).

    Environment:

    • OS: Windows 10
    • Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
    • CoreNLP 3.9.2
    • corenlp-protobuf==3.8.0
    • protobuf==3.10.0
    • stanfordnlp==0.2.0
    • torch==1.1.0

    Additional context

    I've also gotten a timeout error for some sentences, but it's intermittent. I'm not sure if they're related, but this is easier to reproduce.

    bug awaiting feedback 
    opened by legolego 41
  • FileNotFoundError: Could not find any treebank files which matched extern_data/ud2/ud-treebanks-v2.8/UD_English-TEST/*-ud-train.conllu

    Hi, a couple of questions that are related.

    I'm trying to train a new model for a new language, but I'm first trying the data included in the package to learn more about how Stanza handles training data.

    When I run the command

    python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST

    the following error appears:

    (nlp) [email protected] oe_lemmatizer_stanza % python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
    2022-06-27 16:45:52 INFO: Datasets program called with:
    /Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_English-TEST
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1136, in <module>
        main()
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1133, in main
        common.main(process_treebank, add_specific_args)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 134, in main
        process_treebank(treebank, paths, args)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1116, in process_treebank
        train_conllu_file = common.find_treebank_dataset_file(treebank, udbase_dir, "train", "conllu", fail=True)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 37, in find_treebank_dataset_file
        raise FileNotFoundError("Could not find any treebank files which matched {}".format(filename))
    FileNotFoundError: Could not find any treebank files which matched extern_data/ud2/ud-treebanks-v2.8/UD_English-TEST/*-ud-train.conllu

    The path I am using is the exact one that comes with the package when cloning it from GitHub. My idea is to replace the files with my own. I have looked through closed issues about similar errors, but the solutions do not apply to my problem.

    Also, I'm following the documentation for this at https://stanfordnlp.github.io/stanza/training.html#converting-ud-data, but no info is given about the train, test, and dev data. Is the script going to generate the dev and test files, or do I need to generate them myself? I'm new to this, and the language I'm trying to add is not in Universal Dependencies. I have found some datasets in .conll format, which I have converted to .conllu following the Stanza documentation.

    Any ideas?

    Thanks!

    question 
    opened by dmetola 40
  • "AnnotationException: Could not handle incoming annotation" Problem [QUESTION]

    Greetings,

    I am new to the CoreNLP environment and am trying to run the example code given in the documentation. However, I got two errors, as follows:

    First code:

    from stanza.server import CoreNLPClient
    with CoreNLPClient(
            annotators=['tokenize','ssplit','pos',"ner"],
            timeout=30000,
            memory='2G', be_quiet=True) as client:
        anno = client.annotate(text)

    2020-12-30 16:40:53 INFO: Writing properties to tmp file: corenlp_server-a15136448b834f79.props
    2020-12-30 16:40:53 INFO: Starting server with command: java -Xmx2G -cp C:\Users\fatih\stanza_corenlp* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-a15136448b834f79.props -annotators tokenize,ssplit,pos,ner -preload -outputFormat serialized

    Traceback (most recent call last):

      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 446, in _request
        r.raise_for_status()
      File "C:\Users\fatih\anaconda3\lib\site-packages\requests\models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:9000/?properties=%7B%27annotators%27%3A+%27tokenize%2Cssplit%2Cpos%2Cner%27%2C+%27outputFormat%27%3A+%27serialized%27%7D&resetDefault=false
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "<ipython-input-6-2fbdcdb77b41>", line 6, in <module>
        anno = client.annotate(text)
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 514, in annotate
        r = self._request(text.encode('utf-8'), request_properties, reset_default, **kwargs)
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 452, in _request
        raise AnnotationException(r.text)
    AnnotationException: Could not handle incoming annotation
    

    What am I doing wrong? I'm on Windows, using Anaconda and Spyder.

    question 
    opened by fatihbozdag 38
  • How can I run multiple Stanza NER models in parallel?

    I want to run multiple Stanza NER models in parallel. How can I do so? I tried torch multiprocessing, creating multiple processes with each process running one of the models, but it doesn't seem to work well.

    processes = []
    for i in range(4):  # No. of processes
        p = mp.Process(target=test, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

    question fixed on dev 
    opened by mennatallah644 33
  • Dependency parsing in StanfordCoreNLP and Stanza giving different results

    I did dependency parsing with StanfordCoreNLP using the code below:

    from stanfordcorenlp import StanfordCoreNLP
    nlp = StanfordCoreNLP('stanford-corenlp-full-2018-10-05', lang='en')
    
    sentence = 'The clothes in the dressing room are gorgeous. Can I have one?'
    tree_str = nlp.parse(sentence)
    print(tree_str)
    

    And I got the output:

    (ROOT
      (S
        (NP
          (NP (DT The) (NNS clothes))
          (PP (IN in)
            (NP (DT the) (VBG dressing) (NN room))))
        (VP (VBP are)
          (ADJP (JJ gorgeous)))
        (. .)))
    

    How can I get this same output in Stanza??

    import stanza
    from stanza.server import CoreNLPClient
    classpath='/stanford-corenlp-full-2020-04-20/*'
    client = CoreNLPClient(be_quite=False, classpath=classpath, annotators=['parse'], memory='4G', endpoint='http://localhost:8900')
    client.start()
    text = 'The clothes in the dressing room are gorgeous. Can I have one?'
    ann = client.annotate(text)
    sentence = ann.sentence[0]
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)
    
    

    In Stanza, it appears I have to split the text into the sentences that make it up. Is there something I am doing wrong?

    Please note that my objective is to extract noun phrases.

    question 
    opened by ajesujoba 31
  • PermanentlyFailedException: Timed out waiting for service to come alive. Part3

    Hi! I know this is similar to #52 and #91 but I am unable to understand how that was solved.

    When I run it on the command line (Ubuntu 16.04.6 LTS), it runs successfully, as shown below:

    java -Xmx16G -cp "/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-34d0c1fe4d724a56.props -preload tokenize,ssplit,pos,lemma,ner
    
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - using SR parser: edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP -     Threads: 5
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
    [main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.2 sec].
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.7 sec].
    [main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
    [main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    
    

    But when I run it from a Python script, it fails with the error below:

    
    import os
    os.environ["CORENLP_HOME"] = '/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05'
    
    # Import client module
    from stanza.server import CoreNLPClient
    
    
    client = CoreNLPClient(be_quite=False, classpath='"/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05/*"', annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], memory='16G', endpoint='http://localhost:9000')
    print(client)
    
    client.start()
    #import time; time.sleep(10)
    
    text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print (text)
    document = client.annotate(text)
    print ('malviya')
    print(type(document))
    

    Error:

    <stanza.server.client.CoreNLPClient object at 0x7fd296e40d68>
    Starting server with command: java -Xmx4G -cp "/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05"/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-9a4ccb63339146d0.props -preload tokenize,ssplit,pos,lemma,ner
    Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity.
    
    Traceback (most recent call last):
      File "stanza_eng.py", line 18, in <module>
        document = client.annotate(text)
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 431, in annotate
        r = self._request(text.encode('utf-8'), request_properties, **kwargs)
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 342, in _request
        self.ensure_alive()
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 161, in ensure_alive
        raise PermanentlyFailedException("Timed out waiting for service to come alive.")
    stanza.server.client.PermanentlyFailedException: Timed out waiting for service to come alive.
    
    

    Python 3.6.10
    asn1crypto==1.3.0
    certifi==2020.4.5.1
    cffi==1.14.0
    chardet==3.0.4
    cryptography==2.8
    embeddings==0.0.8
    gast==0.2.2
    idna==2.9
    numpy==1.18.2
    protobuf==3.11.3
    pycparser==2.20
    pyOpenSSL==19.1.0
    PySocks==1.7.1
    requests==2.23.0
    six==1.14.0
    stanza==1.0.0
    torch==1.4.0
    tqdm==4.44.1
    urllib3==1.25.8
    vocab==0.0.4

    I am unable to understand the issue here...

    awaiting feedback 
    opened by skmalviya 31
  • Users from China suffer from connection issue when downloading Stanza models

    Hi, there

    Could you help me trace this issue? Here is some info:

    • My network is working, with no restrictions
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import stanza
    
    if __name__ == '__main__':
        # https://github.com/stanfordnlp/stanza/blob/master/demo/Stanza_Beginners_Guide.ipynb
        # Note that you can use verbose=False to turn off all printed messages
        print("Downloading Chinese model...")
        stanza.download('zh', verbose=True)
    
        # Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
        print("Building a Chinese pipeline...")
        zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=True, use_gpu=False)
    
    C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\Scripts\python.exe C:/Users/mystic/JetBrains/PycharmProjects/BuildRoleRelationship4Novel/learn_stanza.py
    Downloading Chinese model...
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
        conn = connection.create_connection(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
        raise err
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
        sock.connect(sa)
    TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
        httplib_response = self._make_request(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
        self._validate_conn(conn)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 976, in _validate_conn
        conn.connect()
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 308, in connect
        conn = self._new_conn()
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\adapters.py", line 439, in send
        resp = conn.urlopen(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 724, in urlopen
        retries = retries.increment(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\retry.py", line 439, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/master/resources_1.0.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:/Users/mystic/JetBrains/PycharmProjects/BuildRoleRelationship4Novel/learn_stanza.py", line 9, in <module>
        stanza.download('zh', verbose=True)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 223, in download
        request_file(f'{DEFAULT_RESOURCES_URL}/resources_{__resources_version__}.json', os.path.join(dir, 'resources.json'))
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 83, in request_file
        download_file(url, path)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 66, in download_file
        r = requests.get(url, stream=True)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\sessions.py", line 530, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\sessions.py", line 643, in send
        r = adapter.send(request, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/master/resources_1.0.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    
    Process finished with exit code 1
    
    
    enhancement question 
    opened by pplmx 30
  • [QUESTION] Can I run Stanza inside Docker container?

    [QUESTION] Can I run Stanza inside Docker container?

    Can I run Stanza inside a Docker container? I created a container and installed all the dependencies, but when the interpreter reaches the call [word.lemma for sent in doc_stanza.sentences for word in sent.words], the program just freezes without errors.

    question stale 
    opened by malakhovks 29
  • MWT and Pretokenized Text for Italian

    MWT and Pretokenized Text for Italian

    Hello! I'm using Stanza for Italian and I'm trying to generate a pred file starting from a gold file. Unfortunately, if I start with pretokenized text, the new pipeline doesn't read MWT tokens, so I can't keep the files aligned. I saw a similar question (#95), but I don't think the problem has been solved. Can anyone help me?

    question 
    opened by manuelfavaro90 28
  • ValueError: substring not found

    ValueError: substring not found

    Describe the bug

    When I use the Vietnamese POS model, this problem occurs.

    To Reproduce

    Steps to reproduce the behavior:

    1. Read the sentences s.
    2. Call nlp(s).
    3. 'ValueError: substring not found' is raised.

    Environment (please complete the following information):

    • OS: CentOS
    • Python version: Python 3.6.8
    • Stanza version: 1.1.1

    Additional context

    bug fixed on dev 
    opened by pipiman 28
  • Is there an API to update existing NER models?

    Is there an API to update existing NER models?

    I have found documentation to be able to train NER models from scratch, but is there an API that'd allow one to update an existing model locally, adding both fresh text and annotations or fresh labels, onto say i2b2 or radiology?

    question fixed on dev 
    opened by snehasaisneha 24
  • Error in retraining UD Arabic-PADT data

    Error in retraining UD Arabic-PADT data

    I am trying to retrain on the Arabic-PADT data with some corrections, but I get an error while preparing the MWT data. Tokenization is trained just fine, but the MWT preparation, after starting, stops with an error.

    [email protected]:/data# python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_Arabic-PADT
    2023-01-04 13:45:32 INFO: Datasets program called with:
    /usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py UD_Arabic-PADT
    Preparing data for UD_Arabic-PADT: ar_padt, ar
    Reading from /data/UD_Arabic-PADT/ar_padt-ud-train.conllu and writing to /tmp/tmpyiumbfdm/ar_padt.train.gold.conllu
    Reading from /data/UD_Arabic-PADT/ar_padt-ud-dev.conllu and writing to /tmp/tmpyiumbfdm/ar_padt.dev.gold.conllu
    Reading from /data/UD_Arabic-PADT/ar_padt-ud-test.conllu and writing to /tmp/tmpyiumbfdm/ar_padt.test.gold.conllu
    11766 unique MWTs found in data
    2480 unique MWTs found in data
    2426 unique MWTs found in data
    Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py", line 63, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py", line 60, in main
        common.main(process_treebank)
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/common.py", line 257, in main
        process_treebank(treebank, paths, args)
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py", line 49, in process_treebank
        source_filename = prepare_tokenizer_treebank.mwt_name(tokenizer_dir, short_name, shard)
    AttributeError: module 'stanza.utils.datasets.prepare_tokenizer_treebank' has no attribute 'mwt_name'
    [email protected]:/data# python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_Arabic-PADT
    

    I am running Python 3.8 under Ubuntu 20.04 in a docker container. Stanza is installed through pip.

    Any hint?

    Thank you,

    Giuliano

    bug duplicate fixed on dev 
    opened by lancioni 2
  • 1.4.0 is buggy when it comes to some dependency parsing tasks, however, 1.3.0 works correctly

    1.4.0 is buggy when it comes to some dependency parsing tasks, however, 1.3.0 works correctly

    I am using the dependency parser and noticed 1.4.0 has bugs that do not exist in 1.3.0. Here is an example:

    If B is true and if C is false, perform D; else, perform E and perform F

    In 1.3.0, 'else' is correctly detected as a child of the 'perform' coming after it; in 1.4.0, however, it is detected as a child of the 'perform' before it.

    How can I force Stanza to load 1.3.0 instead of the latest version, so I can move forward with what I am doing now?
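One way to stay on 1.3.0 is to pin the package at install time, for example with `pip install stanza==1.3.0`. The download traceback earlier on this page shows the resources file name being built from the package's own `__resources_version__`, so a pinned install should also request the matching model list. The sketch below only checks which version is actually installed; the helper names `installed_version` and `is_pinned_to` are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_pinned_to(package: str, expected: str) -> bool:
    """True only when the package is installed at exactly the expected version."""
    return installed_version(package) == expected
```

Calling `is_pinned_to("stanza", "1.3.0")` at program start is a cheap guard against an accidental upgrade in a shared environment.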

    bug 
    opened by apsyio 2
  • CUDA device-side assert is thrown unpredictably

    CUDA device-side assert is thrown unpredictably

    Describe the bug

    I'm using Stanza to do sentence splitting and other preprocessing as part of a machine translation pipeline. At random times, my server starts to throw errors for about half of the requests. The problem vanishes after the server is restarted. The error is always the same:

    File "/var/app/current/app/translator.py", line 24, in _split_sentences
      sents = self.nlp(text).sentences
    File "/var/app/venv/lib/python3.8/site-packages/stanza/pipeline/core.py", line 386, in __call__
      return self.process(doc, processors)
    File "/var/app/venv/lib/python3.8/site-packages/stanza/pipeline/core.py", line 382, in process
      doc = process(doc)
    File "/var/app/venv/lib/python3.8/site-packages/stanza/pipeline/tokenize_processor.py", line 87, in process
      _, _, _, document = output_predictions(None, self.trainer, batches, self.vocab, None,
    File "/var/app/venv/lib/python3.8/site-packages/stanza/models/tokenization/utils.py", line 273, in output_predictions
      pred = np.argmax(trainer.predict(batch), axis=2)
    File "/var/app/venv/lib/python3.8/site-packages/stanza/models/tokenization/trainer.py", line 66, in predict
      units = units.cuda()
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    

    Most of the time there are no errors. Since the errors happen in production and at random times, I haven't been able to reproduce or debug them properly. I'm unsure how I should proceed.
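Following the hint in the error message above, here is a minimal sketch of switching on synchronous CUDA launches from Python, so that the traceback points at the call that actually failed. It assumes the variable is set before torch (and therefore Stanza) initializes CUDA; the helper name is hypothetical. Synchronous launches can slow the GPU down, so this is meant for a debugging session rather than normal production use.

```python
import os

# Set this before importing torch or stanza: the variable is read when CUDA
# starts up, so changing it afterwards has no effect on the running process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Report whether synchronous CUDA kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"
```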

    To Reproduce

    I don't know how to reproduce this, as it happens randomly.

    My code is something like this (non-relevant parts redacted):

    def __init__(self, source_lang: str, target_lang: str):
        self.nlp = stanza.Pipeline(lang=source_lang, processors="tokenize")
        # ...
    
    def _split_sentences(self, text: str):
        sents = self.nlp(text).sentences
        # other processing ...
    

    Only one stanza.Pipeline object is created by the server process.

    Expected behavior

    There should be no errors.

    Environment (please complete the following information): The server is an Amazon EC2 instance.

    • OS: Amazon Linux 2/3.3.16
    • Python version: Python 3.8 running on 64bit
    • Stanza version: 1.4.0
    bug 
    opened by fergusq 4
  • Provide a list of bracket symbols?

    Provide a list of bracket symbols?

    Could you kindly provide a list of the bracket symbols you use in the constituency module? I know they come from the Penn Treebank, but it's very hard to find a complete list. For example, most online sources don't have NML. I'm also not sure what the separate dashes in the model output denote.

    opened by ningkko 1
  • [QUESTION] Models of Old Church Slavonic and encoding

    [QUESTION] Models of Old Church Slavonic and encoding

    I am using Stanza to analyze Old Church Slavonic texts and extract lemmata and dependencies. I therefore wonder which resources (texts) were used to build the pretrained models, and how many. Is it possible to enhance lemmata manually, for example, if some changes are necessary?

    There is a problem with how to encode Old Church Slavonic words -- there is not only an alphabet to consider but also diacritic symbols. What approach do you use?

    question 
    opened by osherenko 1
Releases(v1.4.2)
  • v1.4.2(Sep 15, 2022)

    Stanza v1.4.2: Minor version bump to improve (python) dependencies

    • Pipeline cache in Multilingual is a single OrderedDict https://github.com/stanfordnlp/stanza/issues/1115#issuecomment-1239759362 https://github.com/stanfordnlp/stanza/commit/ba3f64d5f571b1dc70121551364fc89d103ca1cd

    • Don't require pytest for all installations unless needed for testing https://github.com/stanfordnlp/stanza/issues/1120 https://github.com/stanfordnlp/stanza/commit/8c1d9d80e2e12729f60f05b81e88e113fbdd3482

    • Hide SiLU and Mish imports if the installed version of torch doesn't have those nonlinearities https://github.com/stanfordnlp/stanza/issues/1120 https://github.com/stanfordnlp/stanza/commit/6a90ad4bacf923c88438da53219c48355b847ed3

    • Reorder & normalize installations in setup.py https://github.com/stanfordnlp/stanza/pull/1124

  • v1.4.1(Sep 14, 2022)

    Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

    Overview

    We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

    New NER models

    • New Polish NER model based on NKJP from Karol Saputa and ryszardtuora https://github.com/stanfordnlp/stanza/issues/1070 https://github.com/stanfordnlp/stanza/pull/1110

    • Make GermEval2014 the default German NER model, including an optional Bert version https://github.com/stanfordnlp/stanza/issues/1018 https://github.com/stanfordnlp/stanza/pull/1022

    • Japanese conversion of GSD by Megagon https://github.com/stanfordnlp/stanza/pull/1038

    • Marathi NER dataset from L3Cube. Includes a Sentiment model as well https://github.com/stanfordnlp/stanza/pull/1043

    • Thai conversion of LST20 https://github.com/stanfordnlp/stanza/commit/555fc0342decad70f36f501a7ea1e29fa0c5b317

    • Kazakh conversion of KazNERD https://github.com/stanfordnlp/stanza/pull/1091/commits/de6cd25c2e5b936bc4ad2764b7b67751d0b862d7

    Other new models

    • Sentiment conversion of Tass2020 for Spanish https://github.com/stanfordnlp/stanza/pull/1104

    • VIT constituency dataset for Italian https://github.com/stanfordnlp/stanza/pull/1091/commits/149f1440dc32d47fbabcc498cfcd316e53aca0c6 ... and many subsequent updates

    • Combined UD models for Hebrew https://github.com/stanfordnlp/stanza/issues/1109 https://github.com/stanfordnlp/stanza/commit/e4fcf003feb984f535371fb91c9e380dd187fd12

    • For UD models with a small train dataset and a larger test dataset, flip the datasets: UD_Buryat-BDT, UD_Kazakh-KTB, UD_Kurmanji-MG, UD_Ligurian-GLT, UD_Upper_Sorbian-UFAL https://github.com/stanfordnlp/stanza/issues/1030 https://github.com/stanfordnlp/stanza/commit/9618d60d63c49ec1bfff7416e3f1ad87300c7073

    • Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF https://github.com/stanfordnlp/stanza/commit/47740c6252a6717f12ef1fde875cf19fa1cd67cc

    Model improvements

    • Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost https://github.com/stanfordnlp/stanza/pull/1086

    • Pretrained charlm integrated into Sentiment. Improves English, others not so much https://github.com/stanfordnlp/stanza/pull/1025

    • LSTM, 2d maxpool as optional items in the Sentiment from the paper Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling https://github.com/stanfordnlp/stanza/pull/1098

    • First learn with AdaDelta, then with another optimizer in conparse training. Very helpful https://github.com/stanfordnlp/stanza/commit/b1d10d3bdd892c7f68d2da7f4ba68a6ae3087f52

    • Grad clipping in conparse training https://github.com/stanfordnlp/stanza/commit/365066add019096332bcba0da4a626f68b70d303

    Pipeline interface improvements

    • GPU memory savings: charlm reused between different processors in the same pipeline https://github.com/stanfordnlp/stanza/pull/1028

    • Word vectors not saved in the NER models. Saves bandwidth & disk space https://github.com/stanfordnlp/stanza/pull/1033

    • Functions to return tagsets for NER and conparse models https://github.com/stanfordnlp/stanza/issues/1066 https://github.com/stanfordnlp/stanza/pull/1073 https://github.com/stanfordnlp/stanza/commit/36b84db71f19e37b36119e2ec63f89d1e509acb0 https://github.com/stanfordnlp/stanza/commit/2db43c834bc8adbb8b096cf135f0fab8b8d886cb

    • displaCy integration with NER and dependency trees https://github.com/stanfordnlp/stanza/commit/20714137d81e5e63d2bcee420b22c4fd2a871306

    Bugfixes

    • Fix that it takes forever to tokenize a single long token (catastrophic backtracking in regex) TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook) https://github.com/stanfordnlp/stanza/pull/1056

    • Starting a new corenlp client w/o server shouldn't wait for the server to be available TY to Mariano Crosetti https://github.com/stanfordnlp/stanza/issues/1059 https://github.com/stanfordnlp/stanza/pull/1061

    • Read raw glove word vectors (they have no header information) https://github.com/stanfordnlp/stanza/pull/1074

    • Ensure that illegal languages are not chosen by the LangID model https://github.com/stanfordnlp/stanza/issues/1076 https://github.com/stanfordnlp/stanza/pull/1077

    • Fix cache in Multilingual pipeline https://github.com/stanfordnlp/stanza/issues/1115 https://github.com/stanfordnlp/stanza/commit/cdf18d8b19c92b0cfbbf987e82b0080ea7b4db32

    • Fix loading of previously unseen languages in Multilingual pipeline https://github.com/stanfordnlp/stanza/issues/1101 https://github.com/stanfordnlp/stanza/commit/e551ebe60a4d818bc5ba8880dda741cc8bd1aed7

    • Fix that conparse would occasionally train to NaN early in the training https://github.com/stanfordnlp/stanza/commit/c4d785729e42ac90f298e0ef4ab487d14fa35591

    Improved training tools

    • W&B integration for all models: can be activated with --wandb flag in the training scripts https://github.com/stanfordnlp/stanza/pull/1040

    • New webpages for building charlm, NER, and Sentiment https://stanfordnlp.github.io/stanza/new_language_charlm.html https://stanfordnlp.github.io/stanza/new_language_ner.html https://stanfordnlp.github.io/stanza/new_language_sentiment.html

    • Script to download Oscar 2019 data for charlm from HF (requires datasets module) https://github.com/stanfordnlp/stanza/pull/1014

    • Unify sentiment training into a Python script, replacing the old shell script https://github.com/stanfordnlp/stanza/pull/1021 https://github.com/stanfordnlp/stanza/pull/1023

    • Convert sentiment to use .json inputs. In particular, this helps with languages with spaces in words such as Vietnamese https://github.com/stanfordnlp/stanza/pull/1024

    • Slightly faster charlm training https://github.com/stanfordnlp/stanza/pull/1026

    • Data conversion of WikiNER generalized for retraining / add new WikiNER models https://github.com/stanfordnlp/stanza/pull/1039

    • XPOS factory now determined at start of POS training. Makes addition of new languages easier https://github.com/stanfordnlp/stanza/pull/1082

    • Checkpointing and continued training for charlm, conparse, sentiment https://github.com/stanfordnlp/stanza/pull/1090 https://github.com/stanfordnlp/stanza/commit/0e6de808eacf14cd64622415eeaeeac2d60faab2 https://github.com/stanfordnlp/stanza/commit/e5793c9dd5359f7e8f4fe82bf318a2f8fd190f54

    • Option to write the results of a NER model to a file https://github.com/stanfordnlp/stanza/pull/1108

    • Add fake dependencies to a conllu formatted dataset for better integration with evaluation tools https://github.com/stanfordnlp/stanza/commit/6544ef3fa5e4f1b7f06dbcc5521fbf9b1264197a

    • Convert an AMT NER result to Stanza .json https://github.com/stanfordnlp/stanza/commit/cfa7e496ca7c7662478e03c5565e1b2b2c026fad

    • Add a ton of language codes, including 3 letter codes for languages we generally treat as 2 letters https://github.com/stanfordnlp/stanza/commit/5a5e9187f81bd76fcd84ad713b51215b64234986 https://github.com/stanfordnlp/stanza/commit/b32a98e477e9972737ad64deea0bda8d6cebb4ec and others

  • v1.4.0(Apr 23, 2022)

    Stanza v1.4.0: Transformer integration to NER and conparse

    Overview

    As part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules. In addition, we now support several additional languages for NER and conparse.

    Pipeline interface improvements

    • Download resources.json and models into temp dirs first to avoid race conditions between multiple processors https://github.com/stanfordnlp/stanza/issues/213 https://github.com/stanfordnlp/stanza/pull/1001

    • Download models for Pipelines automatically, without needing to call stanza.download(...) https://github.com/stanfordnlp/stanza/issues/486 https://github.com/stanfordnlp/stanza/pull/943

    • Add ability to turn off downloads https://github.com/stanfordnlp/stanza/commit/68455d895986357a2c1f496e52c4e59ee0feb165

    • Add a new interface where both processors and package can be set https://github.com/stanfordnlp/stanza/issues/917 https://github.com/stanfordnlp/stanza/commit/f37042924b7665bbaf006b02dcbf8904d71931a1

    • When using pretokenized tokens, get character offsets from text if available https://github.com/stanfordnlp/stanza/issues/967 https://github.com/stanfordnlp/stanza/pull/975

    • If Bert or other transformers are used, cache the models rather than loading multiple times https://github.com/stanfordnlp/stanza/pull/980

    • Allow for disabling processors on individual runs of a pipeline https://github.com/stanfordnlp/stanza/issues/945 https://github.com/stanfordnlp/stanza/pull/947
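Two of the interface changes above can be sketched together: building a pipeline with downloads turned off, and limiting a single call to a subset of processors. This is a hedged illustration rather than the library's documented recipe. It assumes that `download_method=None` disables automatic downloads and that the pipeline call accepts a `processors` argument as a comma-separated string; the helper names are hypothetical.

```python
from typing import Any, List


def make_offline_pipeline(lang: str = "en", processors: str = "tokenize,pos") -> Any:
    """Build a pipeline from models already on disk, without downloading."""
    import stanza  # imported lazily so the helper below works without it

    # Assumption: download_method=None turns automatic downloads off.
    return stanza.Pipeline(lang=lang, processors=processors, download_method=None)


def tokenize_only(nlp: Any, text: str) -> List[List[str]]:
    """Run just the tokenizer for this one call and return tokens per sentence."""
    # Assumption: the per-call processors argument restricts which processors run.
    doc = nlp(text, processors="tokenize")
    return [[token.text for token in sentence.tokens] for sentence in doc.sentences]
```

With the models already downloaded, `tokenize_only(make_offline_pipeline(), "Some text.")` would skip the POS step for that call while leaving the pipeline object unchanged for later full runs.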

    Other general improvements

    • Add # text and # sent_id to conll output https://github.com/stanfordnlp/stanza/discussions/918 https://github.com/stanfordnlp/stanza/pull/983 https://github.com/stanfordnlp/stanza/pull/995

    • Add ner to the token conll output https://github.com/stanfordnlp/stanza/discussions/993 https://github.com/stanfordnlp/stanza/pull/996

    • Fix missing Slovak MWT model https://github.com/stanfordnlp/stanza/issues/971 https://github.com/stanfordnlp/stanza/commit/5aa19ec2e6bc610576bc12d226d6f247a21dbd75

    • Upgrades to EN, IT, and Indonesian models https://github.com/stanfordnlp/stanza/issues/1003 https://github.com/stanfordnlp/stanza/pull/1008 IT improvements with the help of @attardi and @msimi

    • Fix improper tokenization of Chinese text with leading whitespace https://github.com/stanfordnlp/stanza/issues/920 https://github.com/stanfordnlp/stanza/pull/924

    • Check if a CoreNLP model exists before downloading it (thank you @interNULL) https://github.com/stanfordnlp/stanza/pull/965

    • Convert the run_charlm script to python https://github.com/stanfordnlp/stanza/pull/942

    • Typing and lint fixes (thank you @asears) https://github.com/stanfordnlp/stanza/pull/833 https://github.com/stanfordnlp/stanza/pull/856

    • stanza-train examples now compatible with the python training scripts https://github.com/stanfordnlp/stanza/issues/896

    NER features

    • Bert integration (not by default, thank you @vythaihn) https://github.com/stanfordnlp/stanza/pull/976

    • Swedish model (thank you @EmilStenstrom) https://github.com/stanfordnlp/stanza/issues/912 https://github.com/stanfordnlp/stanza/pull/857

    • Persian model https://github.com/stanfordnlp/stanza/issues/797

    • Danish model https://github.com/stanfordnlp/stanza/pull/910/commits/3783cc494ee8c6b6d062c4d652a428a04a4ee839

    • Norwegian model (both NB and NN) https://github.com/stanfordnlp/stanza/pull/910/commits/31fa23e5239b10edca8ecea46e2114f9cc7b031d

    • Use updated Ukrainian data (thank you @gawy) https://github.com/stanfordnlp/stanza/pull/873

    • Myanmar model (thank you UCSY) https://github.com/stanfordnlp/stanza/pull/845

    • Training improvements for finetuning models https://github.com/stanfordnlp/stanza/issues/788 https://github.com/stanfordnlp/stanza/pull/791

    • Fix inconsistencies in B/S/I/E tags https://github.com/stanfordnlp/stanza/issues/928#issuecomment-1027987531 https://github.com/stanfordnlp/stanza/pull/961

    • Add an option for multiple NER models at the same time, merging the results together https://github.com/stanfordnlp/stanza/issues/928 https://github.com/stanfordnlp/stanza/pull/955

    Constituency parser

    • Dynamic oracle (improves accuracy a bit) https://github.com/stanfordnlp/stanza/pull/866

    • Missing tags now okay in the parser https://github.com/stanfordnlp/stanza/issues/862 https://github.com/stanfordnlp/stanza/commit/04dbf4f65e417a2ceb19897ab62c4cf293187c0b

    • bugfix of () not being escaped when output in a tree https://github.com/stanfordnlp/stanza/commit/eaf134ca699aca158dc6e706878037a20bc8cbd4

    • charlm integration by default https://github.com/stanfordnlp/stanza/pull/799

    • Bert integration (not the default model) (thank you @vythaihn and @hungbui0411) https://github.com/stanfordnlp/stanza/commit/05a0b04ee6dd701ca1c7c60197be62d4c13b17b6 https://github.com/stanfordnlp/stanza/commit/0bbe8d10f895560a2bf16f542d2e3586d5d45b7e

    • Preemptive bugfix for incompatible devices from @zhaochaocs https://github.com/stanfordnlp/stanza/issues/989 https://github.com/stanfordnlp/stanza/pull/1002

    • New models:

      • DA, based on Arboretum
      • IT, based on the Turin treebank
      • JA, based on ALT
      • PT, based on Cintil
      • TR, based on Starlang
      • ZH, based on CTB7

  • v1.3.0(Oct 6, 2021)

    Overview

    Stanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.

    New features

    • Langid model and multilingual pipeline: based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings." by Toftrup et al 2021 (https://github.com/stanfordnlp/stanza/commit/154b0e8e59d3276744ae0c8ea56dc226f777fba8)

    • Constituency parser: based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an en_wsj model is available, with more to come. (https://github.com/stanfordnlp/stanza/commit/90318023432d584c62986123ef414a1fa93683ca)

    • Evalb interface to CoreNLP: useful for evaluating the parser; requires CoreNLP 4.3.0 or later

    • Dictionary tokenizer feature: noticeably improved performance for ZH, VI, TH (https://github.com/stanfordnlp/stanza/pull/776)
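As a rough sketch of how the multilingual pipeline listed above might be used: it is given a list of raw texts, identifies the language of each, and returns one document per text. The import path `stanza.pipeline.multilingual.MultilingualPipeline` and the `doc.lang` attribute are written from memory and should be treated as assumptions; the helper names are hypothetical.

```python
from typing import Any, List, Tuple


def make_multilingual_pipeline() -> Any:
    """Create a pipeline that picks a language-specific pipeline per document."""
    # Assumption: this is the import path of the multilingual pipeline.
    from stanza.pipeline.multilingual import MultilingualPipeline

    return MultilingualPipeline()


def detect_languages(nlp: Any, texts: List[str]) -> List[Tuple[str, int]]:
    """Return (language code, sentence count) for each input text."""
    docs = nlp(texts)
    # Assumption: each returned document carries its detected language in .lang.
    return [(doc.lang, len(doc.sentences)) for doc in docs]
```

Keeping the pipeline as a parameter makes the helper easy to exercise with a stub, which is handy because the real pipeline loads a model for every language it encounters.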

    Bugfixes / Reliability

    • HuggingFace integration: no more git issues complaining about unavailable models! (Hopefully) (https://github.com/stanfordnlp/stanza/commit/f7af5049568f81a716106fee5403d339ca246f38)

    • Sentiment processor crashes on certain inputs (issue https://github.com/stanfordnlp/stanza/issues/804, fixed by https://github.com/stanfordnlp/stanza/commit/e232f67f3850a32a1b4f3a99e9eb4f5c5580c019)

  • v1.2.3(Aug 9, 2021)

    Overview

    In anticipation of a larger release with some new features, we make a small update to fix some existing bugs and add two more NER models.

    Bugfixes

    • Sentiment models would crash on no text (issue https://github.com/stanfordnlp/stanza/issues/769, fixed by https://github.com/stanfordnlp/stanza/pull/781/commits/47889e3043c27f9c5abd9913016929f1857de7bf)

    • Java processes as a context were not properly closed (https://github.com/stanfordnlp/stanza/pull/781/commits/a39d2ff6801a23aa73add1f710d809a9c0a793b1)

    Interface improvements

    • Downloading tokenize now downloads mwt for languages which require it (issue https://github.com/stanfordnlp/stanza/issues/774, fixed by https://github.com/stanfordnlp/stanza/pull/777, from davidrft)

    • NER model can finetune and save to/from different filenames (https://github.com/stanfordnlp/stanza/pull/781/commits/0714a0134f0af6ef486b49ce934f894536e31d43)

    • NER model now displays a confusion matrix at the end of training (https://github.com/stanfordnlp/stanza/pull/781/commits/9bbd3f712f97cb2702a0852e1c353d4d54b4b33b)

    NER models

    • Afrikaans, trained in NCHLT (https://github.com/stanfordnlp/stanza/pull/781/commits/6f1f04b6d674691cf9932d780da436063ebd3381)

    • Italian, trained on a model from FBK (https://github.com/stanfordnlp/stanza/pull/781/commits/d9a361fd7f13105b68569fddeab650ea9bd04b7f)

  • v1.2.2(Jul 15, 2021)

    Overview

    A regression in NER results occurred in 1.2.1 when fixing a bug in VI models based around spaces.

    Bugfixes

    • Fix Sentiment not loading correctly on Windows because of pickling issue (https://github.com/stanfordnlp/stanza/pull/742) (thanks to @BramVanroy)

    • Fix NER bulk process not filling out data structures as expected (https://github.com/stanfordnlp/stanza/issues/721) (https://github.com/stanfordnlp/stanza/pull/722)

    • Fix NER space issue causing a performance regression (https://github.com/stanfordnlp/stanza/issues/739) (https://github.com/stanfordnlp/stanza/pull/732)

    Interface improvements

    • Add an NER run script (https://github.com/stanfordnlp/stanza/pull/738)
  • v1.2.1(Jun 17, 2021)

    Overview

    All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations for tokenization issues and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.

    Model improvements

    • Add Bulgarian, Finnish, Hungarian, Vietnamese NER models

      • The Bulgarian model is trained on BSNLP 2019 data.
      • The Finnish model is trained on the Turku NER data.
      • The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
      • The Vietnamese model is trained on the VLSP 2018 data.
      • Furthermore, the script for preparing the lang-uk NER data has been integrated (https://github.com/stanfordnlp/stanza/commit/c1f0bee1074997d9376adaec45dc00f813d00b38)
    • Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset (https://github.com/stanfordnlp/stanza/pull/718/commits/d9e8301addc93450dc880b06cb665ad10d869242)

    • Add copy mechanism in the seq2seq model. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (https://github.com/stanfordnlp/stanza/pull/692 https://github.com/stanfordnlp/stanza/issues/684)

    • Fix Spanish POS and depparse mishandling sentences in which the leading ¿ is missing (https://github.com/stanfordnlp/stanza/pull/699 https://github.com/stanfordnlp/stanza/issues/698)

    • Fix tokenization breaking when a newline splits a Chinese token (https://github.com/stanfordnlp/stanza/pull/632 https://github.com/stanfordnlp/stanza/issues/531)

    • Fix tokenization of parentheses in Chinese (https://github.com/stanfordnlp/stanza/commit/452d842ed596bb7807e604eeb2295fd4742b7e89)

    • Fix various issues with characters not present in UD training data such as ellipses characters or unicode apostrophe (https://github.com/stanfordnlp/stanza/pull/719/commits/db0555253f0a68c76cf50209387dd2ff37794197 https://github.com/stanfordnlp/stanza/pull/719/commits/f01a1420755e3e0d9f4d7c2895e0261e581f7413 https://github.com/stanfordnlp/stanza/pull/719/commits/85898c50f14daed75b96eed9cd6e9d6f86e2d197)

    • Fix a variety of issues with Vietnamese tokenization - remove language specific model improvement which got roughly 1% F1 but caused numerous hard-to-track issues (https://github.com/stanfordnlp/stanza/pull/719/commits/3ccb132e03ce28a9061ec17d2c0ae84cc2000548)

    • Fix spaces in the Vietnamese words not being found in the embedding used for POS and depparse (https://github.com/stanfordnlp/stanza/pull/719/commits/197212269bc33b66759855a5addb99d1f465e4f4)

    • Include UD_English-GUMReddit in the GUM models (https://github.com/stanfordnlp/stanza/pull/719/commits/9e6367cb9bdd635d579fd8d389cb4d5fa121c413)

    • Add Pronouns & PUD to the mixed English models (various data improvements made this more appealing) (https://github.com/stanfordnlp/stanza/pull/719/commits/f74bef7b2ed171bf9c027ae4dfd3a10272040a46)

    Interface enhancements

    • Add ability to pass a Document to the pipeline in pretokenized mode (https://github.com/stanfordnlp/stanza/commit/f88cd8c2f84aedeaec34a11b4bc27573657a66e2 https://github.com/stanfordnlp/stanza/issues/696)

    • Track comments when reading and writing conll files (https://github.com/stanfordnlp/stanza/pull/676 originally from @danielhers in https://github.com/stanfordnlp/stanza/pull/155)

    • Add a proxy parameter for downloads to pass through to the requests module (https://github.com/stanfordnlp/stanza/pull/638)

    • Add sent_idx to tokens (https://github.com/stanfordnlp/stanza/commit/ee6135c538e24ff37d08b86f34668ccb223c49e1)
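The proxy parameter listed above is relevant to connection failures like the `raw.githubusercontent.com` timeout reported earlier on this page. A minimal sketch, assuming the keyword is named `proxies` and takes the same mapping that the requests library expects; the host, port, and helper names are placeholders chosen for illustration.

```python
from typing import Dict


def build_proxies(host: str, port: int) -> Dict[str, str]:
    """Build a requests-style proxies mapping that routes both schemes via one HTTP proxy."""
    url = f"http://{host}:{port}"
    return {"http": url, "https": url}


def download_with_proxy(lang: str, proxies: Dict[str, str]) -> None:
    """Download the default models for a language through the given proxy."""
    import stanza  # imported lazily so build_proxies can be used on its own

    # Assumption: stanza.download forwards this mapping to the requests module.
    stanza.download(lang, proxies=proxies)
```

For a proxy listening locally, the call would look like `download_with_proxy("zh", build_proxies("127.0.0.1", 8080))`.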

    Bugfixes

    • Fix Windows encoding issues when reading conll documents from @yanirmr (b40379eaf229e7ffc7580def57ee1fad46080261 https://github.com/stanfordnlp/stanza/pull/695)

    • Fix tokenization breaking when second batch is exactly eval_length (https://github.com/stanfordnlp/stanza/commit/726368644d7b1019825f915fabcfe1e4528e068e https://github.com/stanfordnlp/stanza/issues/634 https://github.com/stanfordnlp/stanza/issues/631)

    Efficiency improvements

    • Bulk process for tokenization - greatly speeds up the use case of many small docs (https://github.com/stanfordnlp/stanza/pull/719/commits/5d2d39ec822c65cb5f60d547357ad8b821683e3c)

    • Optimize MWT usage in pipeline & fix MWT bulk_process (https://github.com/stanfordnlp/stanza/pull/642 https://github.com/stanfordnlp/stanza/pull/643 https://github.com/stanfordnlp/stanza/pull/644)

    CoreNLP integration

    • Add a UD Enhancer tool which interfaces with CoreNLP's generic enhancer (https://github.com/stanfordnlp/stanza/pull/675)

    • Add an interface to CoreNLP tokensregex using stanza tokenization (https://github.com/stanfordnlp/stanza/pull/659)

  • v1.2.0(Jan 29, 2021)

    Overview

    All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.

    New features and enhancements

    • Models trained on combined datasets in English and Italian: The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, Twittiro, PosTWITA, and a custom dataset including MWT tokens.

    • NER Transfer Learning: Allows users to fine-tune all or part of the parameters of trained NER models on a new dataset for transfer learning (#351, thanks to @gawy for the contribution)

    • Multi-document support: The Stanza Pipeline now supports multi-Document input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza Document objects into the Pipeline. (https://github.com/stanfordnlp/stanza/issues/70 https://github.com/stanfordnlp/stanza/pull/577)

    • Added API links from token to sentence: It's easier to access Stanza data objects from related ones. To access the sentence object of a token or a word, simply use token.sent or word.sent. (https://github.com/stanfordnlp/stanza/issues/533 https://github.com/stanfordnlp/stanza/pull/554)

    • New external tokenizer for Thai with PyThaiNLP: Try it out with, for example, stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None). (https://github.com/stanfordnlp/stanza/pull/567)

    • Faster tokenization: We have improved how the data pipeline works internally to reduce redundant data wrangling, and significantly sped up the tokenization of long texts. If you have a really long line of text, you could see a speedup of 10x or more without changing anything. (#522)

    • Added a method for getting all the supported languages from the resources file: Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try stanza.resources.common.list_available_languages(). (https://github.com/stanfordnlp/stanza/issues/511 https://github.com/stanfordnlp/stanza/commit/fa52f8562f20ab56807b35ba204d6f9ca60b47ab)

    • Load mwt automagically if a model needs it: Multi-word token expansion is one of the most common things to miss from your Pipeline instantiation, and remembering to include it is a pain -- until now. (https://github.com/stanfordnlp/stanza/pull/516 https://github.com/stanfordnlp/stanza/issues/515 and many others)

    • Vietnamese sentiment model based on VSFC: This is now part of the default language package for Vietnamese that you get from stanza.download("vi"). Enjoy!

    • More informative errors for missing models: Stanza now throws more helpful exceptions with informative exception messages when you are missing models (https://github.com/stanfordnlp/stanza/pull/437 https://github.com/stanfordnlp/stanza/issues/430 ... https://github.com/stanfordnlp/stanza/issues/324 https://github.com/stanfordnlp/stanza/pull/438 ... https://github.com/stanfordnlp/stanza/issues/529 https://github.com/stanfordnlp/stanza/commit/953966539c955951d01e3d6b4561fab02a1f546c ... https://github.com/stanfordnlp/stanza/issues/575 https://github.com/stanfordnlp/stanza/pull/578)

    Bugfixes

    • Fixed NER documentation for German to correctly point to the GermEval 2014 model for download. (https://github.com/stanfordnlp/stanza/commit/4ee9f12be5911bb600d2f162b1684cb4686c391e https://github.com/stanfordnlp/stanza/issues/559)

    • External tokenization library integration now respects no_ssplit, so you can use external tokenizers without disturbing your preferred sentence segmentation, just as with Stanza tokenizers. (https://github.com/stanfordnlp/stanza/issues/523 https://github.com/stanfordnlp/stanza/pull/556)

    • Telugu lemmatizer and tokenizer improvements: Telugu models are now set to use the identity lemmatizer by default, and the tokenizer is retrained to separate sentence-final punctuation (https://github.com/stanfordnlp/stanza/issues/524 https://github.com/stanfordnlp/stanza/commit/ba0aec30e6e691155bc0226e4cdbb829cb3489df)

    • Spanish model would not tokenize foo,bar: Now fixed (https://github.com/stanfordnlp/stanza/issues/528 https://github.com/stanfordnlp/stanza/commit/123d5029303a04185c5574b76fbed27cb992cadd)

    • Arabic model would not tokenize asdf . Now fixed (https://github.com/stanfordnlp/stanza/issues/545 https://github.com/stanfordnlp/stanza/commit/03b7ceacf73870b2a15b46479677f4914ea48745)

    • Various tokenization models would split URLs and/or emails: Now URLs and emails are robustly handled with regexes. (https://github.com/stanfordnlp/stanza/issues/539 https://github.com/stanfordnlp/stanza/pull/588)

    • Various parser and pos models would deterministically label "punct" for the final word: Resolved via data augmentation (https://github.com/stanfordnlp/stanza/issues/471 https://github.com/stanfordnlp/stanza/issues/488 https://github.com/stanfordnlp/stanza/pull/491)

    • Norwegian tokenizers retrained to separate final punct: The fix is an upstream data fix (https://github.com/stanfordnlp/stanza/issues/305 https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal/pull/5)

    • Bugfix for conll eval: Fixes an error in converting a Python Document object to CoNLL format. (https://github.com/stanfordnlp/stanza/pull/484 https://github.com/stanfordnlp/stanza/issues/483, thanks @m0re4u)

    • Less randomness in sentiment results: Fixes prediction fluctuation in sentiment prediction. (https://github.com/stanfordnlp/stanza/issues/458 https://github.com/stanfordnlp/stanza/commit/274474c3b0e4155ab6e221146ac347ca433f81a6)

    • Bugfix which should make it easier to use in jupyter / colab: This fixes the issue where jupyter notebooks (and by extension colab) don't like it when you use sys.stderr as the stderr of popen (https://github.com/stanfordnlp/stanza/pull/434 https://github.com/stanfordnlp/stanza/issues/431)

    • Misc fixes for training, concurrency, and edge cases in basic Pipeline usage

      • Fix for mwt training (https://github.com/stanfordnlp/stanza/pull/446)
      • Fix for race condition in seq2seq models (https://github.com/stanfordnlp/stanza/pull/463 https://github.com/stanfordnlp/stanza/issues/462)
      • Fix for race condition in CRF (https://github.com/stanfordnlp/stanza/pull/566 https://github.com/stanfordnlp/stanza/issues/561)
      • Fix for empty text in pipeline (https://github.com/stanfordnlp/stanza/pull/475 https://github.com/stanfordnlp/stanza/issues/474)
      • Fix for resources not freed when downloading (https://github.com/stanfordnlp/stanza/issues/502 https://github.com/stanfordnlp/stanza/pull/503)
      • Fix for Vietnamese pipeline not working (https://github.com/stanfordnlp/stanza/issues/531 https://github.com/stanfordnlp/stanza/pull/535)
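
    The URL and email fix listed above relies on regexes to keep such strings intact during tokenization. The following is only a rough sketch of that idea: the patterns and the protected_spans helper are hypothetical illustrations and are not taken from the Stanza source, whose actual expressions may differ.

```python
import re

# Illustrative patterns only; these are assumptions, not Stanza's own regexes.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def protected_spans(text):
    """Return sorted (start, end, kind) character spans that a tokenizer
    could treat as single, unsplittable tokens."""
    spans = [(m.start(), m.end(), "url") for m in URL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "email") for m in EMAIL_RE.finditer(text)]
    return sorted(spans)


text = "Mail john.doe@example.com or see https://stanfordnlp.github.io/stanza today"
for start, end, kind in protected_spans(text):
    print(kind, text[start:end])
```

    A tokenizer could consult such spans before applying its own predictions, so that punctuation inside a URL or address is never used as a split point.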

    BREAKING CHANGES

    • Renamed stanza.models.tokenize -> stanza.models.tokenization (https://github.com/stanfordnlp/stanza/pull/452): This stops the tokenize directory from shadowing a built-in library.
  • v1.1.1(Aug 13, 2020)

    Overview

    This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the CoreNLPClient functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.

    New Features and Enhancements

    • New Sentiment Analysis Models for English, German, Chinese: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.

    • New Biomedical and Clinical English Model Packages: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipeline and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit Stanza's biomedical models page.

    • Support for Adding User Customized Processors via Python Decorators: Stanza now supports adding customized processors or processor variants (i.e., an alternative of existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via @register_processor or @register_processor_variant decorators. See Stanza website for more information and examples (see custom Processors and Processor variants). (PR https://github.com/stanfordnlp/stanza/pull/322)

    • Support for Editable Properties For Data Objects: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., Document, Sentence, Token, etc). Aside from the annotation they already support, additional annotation can be easily attached through data_object.add_property(). See our documentation for more information and examples. (PR https://github.com/stanfordnlp/stanza/pull/323)

    • Support for Automated CoreNLP Installation and CoreNLP Model Download: CoreNLP can now be easily downloaded in Stanza with stanza.install_corenlp(dir='path/to/corenlp/installation'); CoreNLP models can now be downloaded with stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation'). For more details please see the Stanza website. (PR https://github.com/stanfordnlp/stanza/pull/363)

    • Japanese Pipeline Supports SudachiPy as External Tokenizer: You can now use the SudachiPy library as the tokenizer in a Stanza Japanese pipeline. Turn this on when building a pipeline with nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'}). Note that this will require a separate installation of the SudachiPy library via pip. (PR https://github.com/stanfordnlp/stanza/pull/365)

    • New Alternative Server for Stable Download of Resource Files: Users in certain areas of the world that do not have stable access to GitHub servers can now download models from an alternative Stanford server by specifying a new resources_url argument. For example, stanza.download(lang='en', resources_url='stanford') will now download the resource file and English pipeline from Stanford servers. (Issue https://github.com/stanfordnlp/stanza/issues/331, PR https://github.com/stanfordnlp/stanza/pull/356)

    • CoreNLPClient Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server: The CoreNLPClient now supports new Enum values with better semantics for its start_server argument, giving finer-grained control over how the server is launched. These include a new option called StartServer.TRY_START that launches the CoreNLP Server if one isn't running already, but doesn't fail if one has already been launched. This option makes it easier for CoreNLPClient to be used in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend StartServer.FORCE_START and StartServer.DONT_START for better readability. (PR https://github.com/stanfordnlp/stanza/pull/302)

    • New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages: Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages for which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue https://github.com/stanfordnlp/stanza/issues/399, PR https://github.com/stanfordnlp/stanza/pull/392)

    • New Tokenizer for Thai Language: The available UD data for Thai is quite small. The authors of pythainlp helped provide us with two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue https://github.com/stanfordnlp/stanza/issues/148)

    • Support for Serialization of Document Objects: Now you can serialize and deserialize the entire document by running serialized_string = doc.to_serialized() and doc = Document.from_serialized(serialized_string). The serialized string can be decoded into Python objects by running objs = pickle.loads(serialized_string). (Issue https://github.com/stanfordnlp/stanza/issues/361, PR https://github.com/stanfordnlp/stanza/pull/366)

    • Improved Tokenization Speed: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: https://github.com/stanfordnlp/stanza/commit/546ed13563c3530b414d64b5a815c0919ab0513a, https://github.com/stanfordnlp/stanza/commit/8e2076c6a0bc8890a54d9ed6931817b1536ae33c, https://github.com/stanfordnlp/stanza/commit/7f5be823a587c6d1bec63d47cd22818c838901e7, etc.)

    • User provided Ukrainian NER model: We now have a model built from the lang-uk NER dataset, provided by a user for redistribution.
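
    The serialization feature above notes that the serialized string can be decoded with pickle.loads. As a minimal sketch of that round trip using only the standard library, the example below pickles a hypothetical stand-in for a document's annotations; the exact layout of the payload produced by doc.to_serialized() is an assumption here and may differ.

```python
import pickle

# Hypothetical stand-in for a document's annotations: a list of sentences,
# each a list of per-token dictionaries. The real payload layout may differ.
annotations = [
    [{"id": 1, "text": "Hello"}, {"id": 2, "text": "world"}],
    [{"id": 1, "text": "Bye"}],
]

serialized_string = pickle.dumps(annotations)  # a bytes object
objs = pickle.loads(serialized_string)         # back to plain Python objects

print(type(serialized_string).__name__, objs == annotations)
```

    As with any pickle data, only load serialized documents that come from a source you trust.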

    Breaking Interface Changes

    • Token.id is Tuple and Word.id is Integer: The id attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the id for a word will now return an integer to represent the word index. Previously both attributes were encoded as strings and required manual conversion for downstream processing. This change brings more convenient handling of these attributes. (Issue: https://github.com/stanfordnlp/stanza/issues/211, PR: https://github.com/stanfordnlp/stanza/pull/357)

    • Changed Default Pipeline Packages for Several Languages for Improved Robustness: Languages that have changed default packages include: Polish (default is now PDB model, from previous LFG, https://github.com/stanfordnlp/stanza/issues/220), Korean (default is now GSD, from previous Kaist, https://github.com/stanfordnlp/stanza/issues/276), Lithuanian (default is now ALKSNIS, from previous HSE, https://github.com/stanfordnlp/stanza/issues/415).

    • CoreNLP 4.1.0 is required: CoreNLPClient requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.

    • Properties Cache removed from CoreNLP client: The properties_cache has been removed from CoreNLPClient and the CoreNLPClient's annotate() method no longer has a properties_key argument. Python dictionaries with custom request properties should be directly supplied to annotate() via the properties argument.
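
    To illustrate the Token.id change listed above, here is a small sketch of the kind of manual conversion that string ids used to require. The parse_conllu_id helper is hypothetical and is not part of Stanza; it only shows how a CoNLL-U style id string maps onto the tuple form described in the note.

```python
def parse_conllu_id(raw):
    """Hypothetical helper: turn a CoNLL-U style id string into a tuple of ints.

    "3"   -> (3,)    single-word token
    "4-5" -> (4, 5)  multi-word token covering words 4 through 5
    """
    return tuple(int(part) for part in raw.split("-"))


token_ids = [parse_conllu_id(raw) for raw in ["1", "2-3", "4"]]
print(token_ids)
```

    With ids already stored as tuples and integers, code that previously performed this kind of parsing on every access can compare and index on the values directly.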

    Bugfixes and Other Improvements

    • Fixed Logging Behavior: This is mainly for fixing the issue that Stanza will override the global logging setting in Python and influence downstream logging behaviors. (Issue https://github.com/stanfordnlp/stanza/issues/278, PR https://github.com/stanfordnlp/stanza/pull/290)

    • Compatibility Fix for PyTorch v1.6.0: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues https://github.com/stanfordnlp/stanza/issues/412 https://github.com/stanfordnlp/stanza/issues/417, PR https://github.com/stanfordnlp/stanza/pull/406)

    • Improved Batching for Long Sentences in Dependency Parser: This is mainly for fixing an issue where long sentences will cause an out of GPU memory issue in the dependency parser. (Issue https://github.com/stanfordnlp/stanza/issues/387)

    • Improved neural tokenizer robustness to whitespaces: The neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters (PR https://github.com/stanfordnlp/stanza/pull/380)

    • Resolved properties issue when switching languages with requests to CoreNLP server: An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.

  • v1.0.1(Apr 27, 2020)

    Overview

    This is a maintenance release of Stanza. It features new support for jieba as Chinese tokenizer, faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and many more!

    Enhancements

    • Supporting jieba library as Chinese tokenizer. The Stanza (simplified and traditional) Chinese pipelines now support using the jieba Chinese word segmentation library as tokenizer. Turn this feature on in a pipeline with: nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'}), or by specifying argument tokenize_with_jieba=True.

    • Setting resource directory with environment variable. You can now override the default model location $HOME/stanza_resources by setting an environment variable STANZA_RESOURCES_DIR (https://github.com/stanfordnlp/stanza/issues/227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.

    • Faster lemmatizer implementation. The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (https://github.com/stanfordnlp/stanza/issues/249). Thanks to @mahdiman for identifying the original issue.

    • Improved compatibility with CoreNLP 4.0.0. The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.
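
    The STANZA_RESOURCES_DIR override described in the list above can be pictured with the following sketch. The resolve_resources_dir helper is hypothetical and only mirrors the behaviour stated in the note (use the variable when set, otherwise fall back to stanza_resources under the home directory); it is not Stanza's own lookup code.

```python
import os


def resolve_resources_dir(environ=None):
    """Hypothetical helper mirroring the lookup described in the release note."""
    environ = os.environ if environ is None else environ
    # Default location: a stanza_resources folder in the user's home directory.
    default_dir = os.path.join(os.path.expanduser("~"), "stanza_resources")
    return environ.get("STANZA_RESOURCES_DIR", default_dir)


print(resolve_resources_dir({}))
print(resolve_resources_dir({"STANZA_RESOURCES_DIR": "/data/stanza"}))
```

    In practice the variable would be set in the shell or process environment before Stanza is imported, for example on a machine where the home directory has limited space.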

    Bugfixes

    • Correct character offsets in NER outputs from pre-tokenized text. We fixed an issue where the NER outputs from pre-tokenized text may be off-by-one (https://github.com/stanfordnlp/stanza/issues/229). Thanks to @RyanElliott10 for reporting the issue.

    • Correct Vietnamese tokenization on sentences beginning with punctuation. We fixed an issue where the Vietnamese tokenizer may throw an AssertionError on sentences that begin with punctuation (https://github.com/stanfordnlp/stanza/issues/217). Thanks to @aryamccarthy for reporting this issue.

    • Correct pytorch version requirement. Stanza is now asking for pytorch>=1.3.0 to avoid a runtime error raised by pytorch (https://github.com/stanfordnlp/stanza/issues/231). Thanks to @Vodkazy for reporting this.

    Known Model Issues & Solutions

    • Default Korean Kaist tokenizer failing on punctuation. The default Korean Kaist model is reported to have issues with separating punctuation during tokenization (https://github.com/stanfordnlp/stanza/issues/276). Switching to the Korean GSD model may solve this issue.

    • Default Polish LFG POS tagger incorrectly labeling last word in sentence as PUNCT. The default Polish model trained on the LFG treebank may incorrectly tag the last word in a sentence as PUNCT (https://github.com/stanfordnlp/stanza/issues/220). This issue may be solved by switching to the Polish PDB model.

  • v1.0.0(Mar 17, 2020)

    Overview

    This is the first major release of Stanza (previously known as StanfordNLP), a software package to process many human languages. The main features of this release are

    • Multi-lingual named entity recognition support. Stanza supports named entity recognition in 8 languages (and 12 datasets): Arabic, Chinese, Dutch, English, French, German, Russian, and Spanish. The most comprehensive NER model in each language is now part of the default model download for that language, along with other models trained on the largest dataset available.
    • Accurate neural network models. Stanza features highly accurate data-driven neural network models for a wide collection of natural language processing tasks, including tokenization, sentence segmentation, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition.
    • State-of-the-art pretrained models freely available. Stanza features a few hundred pretrained models for 60+ languages, all freely available and easily downloadable from native Python code. Most of these models achieve state-of-the-art (or competitive) performance on these tasks.
    • Expanded language support. Stanza now supports more than 60 human languages, representing a wide range of language families.
    • Easy-to-use native Python interface. We've improved the usability of the interface to maximize transparency. Now intermediate processing results are more easily viewed and accessed as native Python objects.
    • Anaconda support. Stanza now officially supports installation from Anaconda. You can install Stanza through Stanford NLP Group's Anaconda channel with conda install -c stanfordnlp stanza.
    • Improved documentation. We have improved our documentation to include comprehensive coverage of the basic and advanced functionalities supported by Stanza.
    • Improved CoreNLP support in Python. We have improved the robustness and efficiency of the CoreNLPClient to access the Java CoreNLP software from Python code. It is also forward compatible with the next major release of CoreNLP.

    Enhancements and Bugfixes

    This release also contains many enhancements and bugfixes:

    • [Enhancement] Improved lemmatization support with proper conditioning on POS tags (#143). Thanks to @nljubesi for the report!
    • [Enhancement] Get the text corresponding to sentences in the document. Access it through sentence.text. (#80)
    • [Enhancement] Improved logging. Stanza now uses Python's logging for all procedural logging, which can be controlled globally either through logging_level or a verbose shortcut. See this page for more information. (#81)
    • [Enhancement] Allow the user to use the Stanza tokenizer with their own sentence split, which might be useful for applications like machine translation. Simply set tokenize_no_ssplit to True at pipeline instantiation. (#108)
    • [Enhancement] Support running the dependency parser only given tokenized, sentence segmented, and POS/morphological feature tagged data. Simply set depparse_pretagged to True at pipeline instantiation. (#141) Thanks @mrapacz for the contribution!
    • [Enhancement] Added spaCy as an option for tokenizing (and sentence segmenting) English text for efficiency. See this documentation page for a quick example.
    • [Enhancement] Add character offsets to tokens, sentences, and spans.
    • [Bugfix] Correctly decide whether to load pretrained embedding files given training flags. (#120)
    • [Bugfix] Google proto buffers reporting errors for long input when using the CoreNLPClient. (#154)
    • [Bugfix] Remove deprecation warnings from newer versions of PyTorch. (#162)

    Breaking Changes

    Note that if your code was developed on a previous version of the package, there are potentially many breaking changes in this release. The most notable changes are in the Document objects, which contain all the annotations for the raw text or document fed into the Stanza pipeline. The underlying implementation of Document and all related data objects has broken away from using the CoNLL-U format as its internal representation, for more flexibility and efficiency in accessing their attributes, although it is still compatible with CoNLL-U to maintain ease of conversion between the two. Moreover, many properties have been renamed for clarity and sometimes aliased for ease of access. Please see our documentation page about these data objects for more information.

  • v0.2.0(May 16, 2019)

    This release features major improvements on memory efficiency and speed of the neural network pipeline in stanfordnlp and various bugfixes. These features include:

    • The downloadable pretrained neural network models are now substantially smaller in size (due to the use of smaller pretrained vocabularies) with comparable performance. Notably, the default English model is now ~9x smaller in size, German ~11x, French ~6x and Chinese ~4x. As a result, the memory efficiency of the neural pipelines for most languages is substantially improved.

    • Substantial speedup of the neural lemmatizer via reduced neural sequence-to-sequence operations.

    • The neural network pipeline can now take in a Python list of strings representing pre-tokenized text. (https://github.com/stanfordnlp/stanfordnlp/issues/58)

    • A requirements checking framework is now added in the neural pipeline, ensuring the proper processors are specified for a given pipeline configuration. The pipeline will now raise an exception when a requirement is not satisfied. (https://github.com/stanfordnlp/stanfordnlp/issues/42)

    • Bugfix related to alignment between tokens and words post the multi-word expansion processor. (https://github.com/stanfordnlp/stanfordnlp/issues/71)

    • More options are added for customizing the Stanford CoreNLP server at start time, including specifying properties for the default pipeline, and setting all server options such as username/password. For more details on different options, please check out the client documentation page.

    • A CoreNLPClient instance can now be created with CoreNLP default language properties as:

    client = CoreNLPClient(properties='chinese')

    • Alternatively, a properties file can now be used during the creation of a CoreNLPClient:

    client = CoreNLPClient(properties='/path/to/corenlp.props')

    • All specified CoreNLP annotators are now preloaded by default when a CoreNLPClient instance is created. (https://github.com/stanfordnlp/stanfordnlp/issues/56)
  • v0.1.2(Feb 26, 2019)

    This is a maintenance release of stanfordnlp. This release features:

    • Allowing the tokenizer to treat the incoming document as pretokenized with space separated words in newline separated sentences. Set tokenize_pretokenized to True when building the pipeline to skip the neural tokenizer, and run all downstream components with your own tokenized text. (#24, #34)
    • Speedup in the POS/Feats tagger in evaluation (up to 2 orders of magnitude). (#18)
    • Various minor fixes and documentation improvements
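
    The pretokenized input format described in the list above (space-separated words, one sentence per line) can be read into a nested list with a few lines of standard Python. This is only a sketch of the format itself; the parse_pretokenized helper is hypothetical and is not the code the pipeline uses when tokenize_pretokenized is set.

```python
def parse_pretokenized(text):
    """Split pretokenized text into sentences (one per line) of words."""
    # Blank lines are skipped so that trailing newlines do not yield empty sentences.
    return [line.split() for line in text.splitlines() if line.strip()]


raw = "This is a sentence .\nThis is another one ."
print(parse_pretokenized(raw))
```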

    We would also like to thank the following community members for their contributions. Code improvements: @lwolfsonkin. Documentation improvements: @0xflotus. And thanks to everyone that raised issues and helped improve stanfordnlp!

  • v0.1.0(Jan 30, 2019)

    The initial release of StanfordNLP. StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group’s official Python interface to the Stanford CoreNLP software. This package is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data. The modules are built on top of PyTorch (v1.0.0).

    StanfordNLP features:

    • Native Python implementation requiring minimal effort to set up;
    • Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological features tagging and dependency parsing;
    • Pretrained neural models supporting 53 (human) languages featured in 73 treebanks;
    • A stable, officially maintained Python interface to CoreNLP.

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Antlr Project 13.6k Jan 05, 2023
Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

MITRE Cybersecurity 64 Nov 22, 2022
ACL'2021: Learning Dense Representations of Phrases at Scale

DensePhrases DensePhrases is an extractive phrase search tool based on your natural language inputs. From 5 million Wikipedia articles, it can search

Princeton Natural Language Processing 540 Dec 30, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Kundan Krishna 6 Jun 04, 2021
Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers 이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-trans

Junghyun 40 Dec 20, 2022
Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

chatbot Bu Chatbot, Konya Bilim Merkezi Yeni Ufuklar Sergisi için 2021 Yılında tasarlanmış olan bir projedir. Chatbot Python ortamında yazılmıştır. Sö

Emre Özkul 1 Feb 23, 2022
Transformer Based Korean Sentence Spacing Corrector

TKOrrector Transformer Based Korean Sentence Spacing Corrector License Summary This solution is made available under Apache 2 license. See the LICENSE

Paul Hyung Yuel Kim 3 Apr 18, 2022
WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

WikiPron WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronuncia

213 Jan 01, 2023
Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

fake-news-detector-1.0 Lists, lists and more lists... Spam filter list, quality keyword list, stoplist list, top-domains urls list, news agencies webs

Memo Sim 1 Jan 04, 2022
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021