Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Overview


Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages.

Trankit outperforms the current state-of-the-art multilingual toolkit Stanza (StanfordNLP) on many tasks over 90 Universal Dependencies v2.5 treebanks of 56 different languages, while remaining efficient in memory usage and speed, making it practical for general use.

In particular, for English, Trankit is significantly better than Stanza on sentence segmentation (+7.22%) and dependency parsing (+3.92% UAS and +4.37% LAS). For Arabic, our toolkit substantially improves sentence segmentation performance by 16.16%, while for Chinese we observe improvements of 12.31% (UAS) and 12.72% (LAS) for dependency parsing. A detailed comparison between Trankit, Stanza, and other popular NLP toolkits (e.g., spaCy, UDPipe) on other languages can be found here on our documentation page.

We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit

Technical details about Trankit are presented in the following paper. Please cite it if you use Trankit in your research.

@misc{nguyen2021trankit,
      title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing}, 
      author={Minh Nguyen and Viet Lai and Amir Pouran Ben Veyseh and Thien Huu Nguyen},
      year={2021},
      eprint={2101.03289},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Installation

Trankit can be easily installed via one of the following methods:

Using pip

pip install trankit

This command installs Trankit and all dependent packages automatically. Note that, due to this issue related to adapter-transformers, an extension of the transformers library, users may need to uninstall transformers before installing trankit to avoid potential conflicts.

From source

git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .

This clones our GitHub repo and installs Trankit in editable mode.

Usage

Trankit can process untokenized (raw) or pretokenized input, at both the sentence and document level. Currently, Trankit supports the following tasks:

  • Sentence segmentation.
  • Tokenization.
  • Multi-word token expansion.
  • Part-of-speech tagging.
  • Morphological feature tagging.
  • Dependency parsing.
  • Named entity recognition.
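Roughly, each of these tasks fills in a specific part of the JSON returned by the pipeline. The mapping below is an informal sketch, assuming the CoNLL-U-style field names used in our documentation; the exact key names (e.g. expanded for multi-word tokens) are assumptions and may differ between versions.

```python
# Informal map from task to the part of the output it fills in.
# Field names assume the CoNLL-U-style schema from the documentation;
# exact key names (e.g. 'expanded') may differ between versions.
TASK_FIELDS = {
    'sentence segmentation': 'sentences',       # list of sentence dicts
    'tokenization': 'text',                     # surface form of each token
    'multi-word token expansion': 'expanded',   # syntactic words under an MWT
    'part-of-speech tagging': 'upos / xpos',    # universal / language-specific POS
    'morphological feature tagging': 'feats',   # e.g. Number=Sing|PronType=Dem
    'dependency parsing': 'head / deprel',      # head index and relation label
    'named entity recognition': 'ner',          # BIO-style entity tag
}
```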

Initialize a pretrained pipeline

The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not re-download pretrained models if they already exist.

from trankit import Pipeline

# initialize an English pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

Perform all tasks on the input

After initializing a pretrained pipeline, it can be used to process input with all tasks, as shown below. If the input is a single sentence, the flag is_sent must be set to True.

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]

# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)

######## sentence-level processing ####### 
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']

# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)

Note that, although pretokenized inputs can always be processed, using pretokenized inputs may not be appropriate for languages that require multi-word token expansion, such as Arabic or French. Please check the column Requires MWT expansion? of this table to see whether a particular language requires multi-word token expansion.
For more detailed examples, please check out our documentation page.
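The object returned by the pipeline is a plain Python dict that can be traversed with ordinary code. The sketch below collects (word, POS) pairs from a document-level output; the nested keys ('sentences', 'tokens', 'text', 'upos') follow the schema described on our documentation page, and the hard-coded sample dict merely stands in for a real pipeline result:

```python
# A hand-made sample standing in for a real document-level pipeline output.
# The nested schema ('sentences' -> 'tokens' -> 'text'/'upos') follows the
# documentation; a real result carries more fields (lemma, head, deprel, ...).
sample_doc = {
    'text': 'This is Trankit.',
    'sentences': [
        {
            'id': 1,
            'tokens': [
                {'id': 1, 'text': 'This', 'upos': 'PRON'},
                {'id': 2, 'text': 'is', 'upos': 'AUX'},
                {'id': 3, 'text': 'Trankit', 'upos': 'PROPN'},
                {'id': 4, 'text': '.', 'upos': 'PUNCT'},
            ],
        }
    ],
}

def words_and_tags(doc):
    """Collect (word, upos) pairs from a processed document."""
    return [(tok['text'], tok['upos'])
            for sent in doc['sentences']
            for tok in sent['tokens']]

print(words_and_tags(sample_doc))
```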

Multilingual usage

To process inputs in different languages, we need to initialize a multilingual pipeline.

from trankit import Pipeline

# initialize a multilingual pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
    p.add(lang)

# tokenize an English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')

# get ner tags for an Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')

In this example, .set_active() is used to switch between languages.

Building a customized pipeline

Training customized pipelines is easy with Trankit via the class TPipeline. Below we show how to train a token and sentence splitter on custom data.

from trankit import TPipeline

tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
    }
)

tp.train()

Detailed guidelines for training and loading a customized pipeline can be found here.
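Since each TPipeline takes a plain config dict, configs for several trainable stages can be generated programmatically. A minimal sketch, where make_config is a hypothetical helper: the task names follow the documentation, and the exact set of required keys differs per task (e.g. the *_txt_fpath entries are only needed for tokenization).

```python
def make_config(task, save_dir='./saved_model', data_dir='.'):
    """Build a TPipeline training config for one stage (hypothetical helper)."""
    return {
        'task': task,  # e.g. 'tokenize', 'posdep', 'lemmatize'
        'save_dir': save_dir,
        'train_txt_fpath': f'{data_dir}/train.txt',
        'train_conllu_fpath': f'{data_dir}/train.conllu',
        'dev_txt_fpath': f'{data_dir}/dev.txt',
        'dev_conllu_fpath': f'{data_dir}/dev.conllu',
    }

# one config per stage we want to train
configs = [make_config(t) for t in ('tokenize', 'posdep', 'lemmatize')]
```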

To-do list

  • Language Identification

Acknowledgements

We use XLM-Roberta and Adapters as our shared multilingual encoder for different tasks and languages. The AdapterHub is used to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations for the MWT expander and the lemmatizer are adapted from Stanza.

Comments
  • File is not a zip file Error when loading the pretrained model


    Hi, I was trying the customized NER tutorial notebook.

    When I ran this code:

    trankit.verify_customized_pipeline(
        category='customized-ner', # pipeline category
        save_dir='./save_dir_filtered' # directory used for saving models in previous steps
    )
    

    It printed "Customized pipeline is ready to use". However, when I loaded the pipeline as instructed, it kept reporting this error message: /usr/local/lib/python3.7/dist-packages/trankit/utils/base_utils.py in download(cache_dir, language, saved_model_version, embedding_name) BadZipFile: File is not a zip file.

    Can you help me figure out what I missed, and how to fix this?

    opened by Yichen-fqyd 7
  • Difficulties in reproducing the GermEval14 NER model


    Hi again @minhvannguyen,

    I am sorry to bother you once again, but I was wondering whether you could provide a bit more information on how one might reproduce the trankit results on GermEval14, which are presented in the trankit paper.

    Based on your suggestion in #6, I tried to train a trankit-based NER model on the GermEval14 data by directly passing it to trankit.TPipeline. You can find the (very simple) code that sets up the environment, prepares the data and trains the model in the following Colab.

    In the paper, Table 3 reports the test F1 score on this dataset as 86.9, but even after running over 80 training epochs, the best dev F1 score I managed to reach was 1.74, and it does not seem like evaluation on the test set would produce vastly different results.

    Hence, my preliminary conclusion is that I must be doing something wrong. One of the first suspects would be the random seeds, but those seem to be fixed, as we can see in the snippet below: https://github.com/nlp-uoregon/trankit/blob/b7e4a3bc25d564b3b2870a2b03a5aa4fc9a38c9a/trankit/tpipeline.py#L112-L119

    I was therefore wondering whether you could immediately see what I am doing wrong here, or generally provide some pointers that could be helpful in reproducing the results listed in the paper.

    Thanks!

    opened by mrshu 6
  • Format for training custom NER classifiers


    First of all, thanks for opensourcing trankit -- it looks very interesting!

    I would be interested in training a custom NER model as described in the docs. Could you please comment a bit on what format the .bio files should be stored in?

    Thanks!

    cc @minhvannguyen

    opened by mrshu 6
  • Compatibility issue when using with newer Transformers library


    I'm running into an issue when trying to use trankit in a project that uses a newer version of the Hugging Face transformers library. Trankit depends on adapter-transformers, which cannot be used simultaneously with transformers.

    opened by CaoHoangTung 6
  • error on "from trankit import Pipeline"

    Thanks for providing this great toolkit. However, I cannot import Pipeline; I get the following error: ImportError: cannot import name '_BaseLazyModule' from 'transformers.file_utils'

    It could be because of a version conflict. When I ran "pip install trankit", I got this error at the end:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    transformers 4.2.1 requires tokenizers==0.9.4, but you have tokenizers 0.9.3 which is incompatible.
    Successfully installed tokenizers-0.9.3
    

    I really appreciate your help on this.

    good first issue 
    opened by mzolfaghari 6
  • Question on pre-tokenized input


    In my case, I need to use BERT to tokenize sentences and then use trankit on the tokenized sentences to compute dependency relations. I want to know whether trankit will suffer a performance loss with pre-tokenized sentences.

    opened by eliasyin 4
  • Problem in long sentences?


    Hi,

    we occasionally have a problem with long sentences.

    Traceback (most recent call last):
      File "test_trankit.py", line 25, in <module>
        parsed = p(parse_me)
      File "/home/jesse/.local/lib/python3.7/site-packages/trankit/pipeline.py", line 916, in __call__
        out = self._ner_doc(out)
      File "/home/jesse/.local/lib/python3.7/site-packages/trankit/pipeline.py", line 873, in _ner_doc
        word_reprs, cls_reprs = self._embedding_layers.get_tagger_inputs(batch)
      File "/home/jesse/.local/lib/python3.7/site-packages/trankit/models/base_models.py", line 68, in get_tagger_inputs
        word_lens=batch.word_lens
      File "/home/jesse/.local/lib/python3.7/site-packages/trankit/models/base_models.py", line 43, in encode_words
        idxs) * masks  # this might cause non-deterministic results during training, consider using `compute_word_reps_avg` in that case
    

    Example code:

    from trankit import Pipeline
    import json
    
    
    p = Pipeline(lang='french', gpu=False, cache_dir='./cache')
    
    ######## document-level processing ########
    
    sentences = [
    ['Bacquelaine', 'Daniel', ',', 'Battheu', 'Sabien', ',', 'Bauchau', 'Marie', ',', 'Becq', 'Sonja', ',', 'Ben', 'Hamou', 'Nawal', ',', 'Blanchart', 'Philippe', ',', 'Bogaert', 'Hendrik', ',', 'Bonte', 'Hans', ',', 'Brotcorne', 'Christian', ',', 'Burton', 'Emmanuel', ',', 'Caprasse', 'Véronique', ',', 'Ceysens', 'Patricia', ',', 'Clarinval', 'David', ',', 'Daerden', 'Frédéric', ',', 'De', 'Block', 'Maggie', ',', 'De', 'Coninck', 'Monica', ',', 'De', 'Crem', 'Pieter', ',', 'De', 'Croo', 'Alexander', ',', 'Delannois', 'Paul-Olivier', ',', 'Delizée', 'Jean-Marc', ',', 'Delpérée', 'Francis', ',', 'Demeyer', 'Willy', ',', 'Demon', 'Franky', ',', 'Deseyn', 'Roel', ',', 'Detiège', 'Maya', ',', 'Dewael', 'Patrick', ',', 'Dierick', 'Leen', ',', 'Di', 'Rupo', 'Elio', ',', 'Dispa', 'Benoît', ',', 'Ducarme', 'Denis', ',', 'Fernandez', 'Fernandez', 'Julia', ',', 'Flahaut', 'André', ',', 'Flahaux', 'Jean-Jacques', ',', 'Fonck', 'Catherine', ',', 'Foret', 'Gilles', ',', 'Frédéric', 'André', ',', 'Fremault', 'Céline', ',', 'Friart', 'Benoît', ',', 'Geens', 'Koenraad', ',', 'Geerts', 'David', ',', 'Goffin', 'Philippe', ',', 'Grovonius', 'Gwenaelle', ',', 'Heeren', 'Veerle', ',', 'Jadin', 'Kattrin', ',', 'Jiroflée', 'Karin', ',', 'Kir', 'Emir', ',', 'Kitir', 'Meryame', ',', 'Laaouej', 'Ahmed', ',', 'Lachaert', 'Egbert', ',', 'Lalieux', 'Karine', ',', 'Lanjri', 'Nahima', ',', 'Lijnen', 'Nele', ',', 'Lutgen', 'Benoît', ',', 'Mailleux', 'Caroline', ',', 'Maingain', 'Olivier', ',', 'Marghem', 'Marie-Christine', ',', 'Massin', 'Eric', ',', 'Mathot', 'Alain', ',', 'Matz', 'Vanessa', ',', 'Michel', 'Charles', ',', 'Muylle', 'Nathalie', ',', 'Onkelinx', 'Laurette', ',', 'Özen', 'Özlem', ',', 'Pehlivan', 'Fatma', ',', 'Piedboeuf', 'Benoit', ',', 'Pirlot', 'Sébastian', ',', 'Pivin', 'Philippe', ',', 'Poncelet', 'Isabelle', ',', 'Reynders', 'Didier', ',', 'Schepmans', 'Françoise', ',', 'Senesael', 'Daniel', ',', 'Smaers', 'Griet', ',', 'Somers', 'Ine', ',', 'Temmerman', 'Karin', ',', 
'Terwingen', 'Raf', ',', 'Thiébaut', 'Eric', ',', 'Thiéry', 'Damien', ',', 'Thoron', 'Stéphanie', ',', 'Top', 'Alain', ',', 'Turtelboom', 'Annemie', ',', 'Van', 'Biesen', 'Luk', ',', 'Van', 'Cauter', 'Carina', ',', 'Vande', 'Lanotte', 'Johan', ',', 'Van', 'den', 'Bergh', 'Jef', ',', 'Vandenput', 'Tim', ',', 'Van', 'der', 'Maelen', 'Dirk', ',', 'Vanheste', 'Ann', ',', 'Van', 'Mechelen', 'Dirk', ',', 'Van', 'Quickenborne', 'Vincent', ',', 'Van', 'Rompuy', 'Eric', ',', 'Vanvelthoven', 'Peter', ',', 'Vercamer', 'Stefaan', ',', 'Verherstraeten', 'Servais', ',', 'Wathelet', 'Melchior', ',', 'Winckel', 'Fabienne', ',', 'Yüksel', 'Veli'],
    
    ['HR', 'Rail', 'organise', 'des', 'actions', 'pour', 'attirer', 'un', 'maximum', 'de', 'candidats', 'vers', 'le', 'métier', 'du', 'rail.', 'À', 'ce', 'titre', ',', 'elle', 'organise', 'des', 'dizaines', 'de', 'job', 'days', ',', 'participe', 'à', 'plusieurs', 'dizaines', 'de', 'salons', 'de', "l'", 'emploi', ',', 'organise', 'énormément', 'de', 'visites', "d'", 'écoles', 'et', 'amène', 'un', 'grand', 'nombre', "d'", 'étudiants', 'à', 'visiter', 'les', 'ateliers', 'de', 'la', 'SNCB', 'et', "d'", 'Infrabel', ',', 'met', 'sur', 'pied', 'des', 'concours', ',', 'est', 'présente', 'dans', 'les', 'médias', 'sociaux', '(', 'LinkedIn', ',', 'Facebook', ',', 'etc', '.)', 'ainsi', 'que', 'dans', 'les', 'médias', 'classiques', '(', 'à', 'la', 'télévision', 'et', 'dans', 'les', 'cinémas', 'en', 'Flandre', '),', 'lance', 'des', 'actions', 'telles', 'que', 'Refer', 'a', 'friend', ',', 'a', 'lancé', 'début', '2016', ',', 'en', 'collaboration', 'avec', 'les', 'services', 'Communication', 'de', 'la', 'SNCB', 'et', "d'", 'Infrabel', ',', 'une', 'toute', 'nouvelle', 'campagne', "d'", 'image', '"', 'Hier', 'ton', 'rêve', ',', "aujourd'", 'hui', 'ton', 'job', '",', 'réactualise', 'son', 'site', 'internet', 'dédié', 'au', 'recrutement', '(', 'www.lescheminsdeferengagent.be', '),', 'a', 'développé', 'un', 'simulateur', 'de', 'train', 'et', 'de', 'train', 'technique', 'utilisé', 'lors', 'des', 'job', 'events', 'et', 'disponible', 'sur', 'le', 'site', 'internet', 'en', 'tant', "qu'", 'application', 'Android', 'et', 'IOS', ',', 'participe', 'à', 'différents', 'projets', 'de', 'formation', 'avec', 'le', 'VDAB', 'et', 'le', 'FOREM', ',', 'a', 'organisé', 'différentes', 'actions', "d'", 'été', 'dans', 'les', 'gares', 'pour', 'sensibiliser', 'le', 'public', 'aux', 'métiers', 'ferroviaires', ',', 'développe', 'des', 'actions', 'en', 'faveur', 'de', 'la', 'diversité', ',', 'a', 'lancé', 'le', 'pelliculage', 'de', 'certains', 'trains', 'en', 'faveur', 'de', 'son', 'site', 'internet', 'et', 
'de', 'son', 'recrutement', ',', 'organisera', 'début', '2017', 'le', 'train', 'de', "l'", 'emploi', '.'],
    
    ['Les', 'données', 'de', 'la', 'banque', 'transmises', 'aux', 'équipes', 'de', 'recherche', 'sont', 'le', 'numéro', 'du', 'dossier', ',', 'la', 'langue', ',', "l'", 'âge', 'du', 'patient', ',', 'le', 'sexe', 'du', 'patient', ',', 'le', 'lieu', 'du', 'décès', '(', 'à', 'domicile', ',', 'à', "l'", 'hôpital', ',', 'dans', 'une', 'maison', 'de', 'repos', 'et', 'de', 'soins', 'ou', 'autre', '),', 'la', 'base', 'de', "l'", 'euthanasie', '(', 'demande', 'actuelle', 'ou', 'déclaration', 'anticipée', '),', 'la', 'catégorie', "d'", 'affection', 'selon', 'la', 'classification', 'de', "l'", 'OMS', ',', 'le', 'code', 'ICD-10', '(', 'par', 'exemple', ',', 'tumeur', '),', 'la', 'sous-catégorie', "d'", 'affection', 'à', 'la', 'base', 'de', 'la', 'demande', "d'", 'euthanasie', ',', 'selon', 'la', 'classification', 'de', "l'", 'OMS', '(', 'par', 'exemple', ',', 'tumeur', 'maligne', 'du', 'sein', '),', "l'", 'information', 'complémentaire', '(', 'présence', 'de', 'métastases', ',', 'de', 'dépression', ',', 'de', 'cancer', '),', "l'", 'échéance', 'de', 'décès', '(', 'bref', 'ou', 'non', 'bref', '),', 'la', 'qualification', 'du', 'premier', 'médecin', 'consulté', 'dans', 'tous', 'les', 'cas', '(', 'un', 'généraliste', ',', 'un', 'spécialiste', ',', 'un', 'médecin', 'palliatif', '),', 'la', 'qualification', 'du', 'second', 'médecin', 'consulté', 'en', 'cas', 'de', 'décès', ',', 'non', 'prévu', 'à', 'brève', 'échéance', '(', 'psychiatre', 'ou', 'spécialiste', '),', "l'", 'autre', 'personne', 'ou', "l'", 'instance', 'consultée', '(', 'médecin', 'ou', 'psychologue', ',', "l'", 'équipe', 'palliative', 'ou', 'autre', '),', 'le', 'type', 'de', 'souffrance', '(', 'psychique', 'ou', 'physique', '),', 'la', 'méthode', 'et', 'les', 'produits', 'utilisés', '(', 'le', 'thiopental', 'seul', ',', 'le', 'thiopental', 'avec', 'le', 'curare', ',', 'des', 'barbituriques', 'ou', 'autres', 'médicaments', '),', 'la', 'décision', 'de', 'la', 'Commission', '(', 'ouverture', 'pour', 'remarques', ',', 'ou', 
'pour', 'plus', "d'", 'informations', 'sur', 'les', 'conditions', 'ou', 'la', 'procédure', 'suivie', '),', 'la', 'transmission', 'ou', 'non', 'à', 'la', 'justice', '.'],
    
    ['Monsieur', 'le', 'ministre', ',', 'l’', 'article', '207', ',', 'alinéa', '7', 'du', 'Code', 'des', 'impôts', 'sur', 'les', 'revenus', '(', 'CIR', ')', 'mentionne', 'qu’', 'aucune', 'de', 'ces', 'déductions', 'ou', 'compensations', 'avec', 'la', 'perte', 'de', 'la', 'période', 'imposable', 'ne', 'peut', 'être', 'opérée', 'sur', 'la', 'partie', 'du', 'résultat', 'qui', 'provient', "d'", 'avantages', 'anormaux', 'ou', 'bénévoles', 'visés', 'à', "l'", 'article', '79', ',', 'ni', 'sur', 'les', 'avantages', 'financiers', 'ou', 'de', 'toute', 'nature', 'reçus', 'visés', 'à', "l'", 'article', '53', ',', '24°', ',', 'ni', 'sur', "l'", 'assiette', 'de', 'la', 'cotisation', 'distincte', 'spéciale', 'établie', 'sur', 'les', 'dépenses', 'ou', 'les', 'avantages', 'de', 'toute', 'nature', 'non', 'justifiés', ',', 'conformément', 'à', "l'", 'article', '219', ',', 'ni', 'sur', 'la', 'partie', 'des', 'bénéfices', 'qui', 'sont', 'affectés', 'aux', 'dépenses', 'visées', 'à', "l'", 'article', '198', ',', '§', '1er', ',', '9°', ',', '9°', 'bis', 'et', '12°', ',', 'ni', 'sur', 'la', 'partie', 'des', 'bénéfices', 'provenant', 'du', 'non-respect', 'de', "l'", 'article', '194quater', ',', '§', '2', ',', 'alinéa', '4', 'et', 'de', "l'", 'application', 'de', "l'", 'article', '194quater', ',', '§', '4', ',', 'ni', 'sur', 'les', 'dividendes', 'visés', 'à', "l'", 'article', '219ter', ',', 'ni', 'sur', 'la', 'partie', 'du', 'résultat', 'qui', 'fait', "l'", 'objet', "d'", 'une', 'rectification', 'de', 'la', 'déclaration', 'visée', 'à', "l'", 'article', '346', 'ou', "d'", 'une', 'imposition', "d'", 'office', 'visée', 'à', "l'", 'article', '351', 'pour', 'laquelle', 'des', 'accroissements', "d'", 'un', 'pourcentage', 'égal', 'ou', 'supérieur', 'à', '10', '%', 'visés', 'à', "l'", 'article', '444', 'sont', 'effectivement', 'appliqués', ',', 'à', "l'", 'exception', 'dans', 'ce', 'dernier', 'cas', 'des', 'revenus', 'déductibles', 'conformément', 'à', "l'", 'article', '205', ',', '§', '2', '.'],
    ]
    
    for s in sentences:
      print(" ".join(s))
      parse_me = [s]
      parsed = p(parse_me)
    
    opened by JessedeDoes 4
  • Question: running in parallel


    Hey guys,

    Starting to use your library, which is pretty cool! Thanks a lot! However, I'm trying to process a lot of documents (~400k) and, as you can guess, it will take quite some time 😅. I'm working with pandas dataframes and I tried to use pandarallel to run things in parallel, but I didn't manage to get it to work. It seemed like it was stuck forever.

    Do you have any idea whether there's a way I could leverage parallelisation (or anything else other than GPU) to reduce computation time?

    Thanks in advance!

    opened by JulesBelveze 4
  • Torch version issue


    I was having an issue using trankit with my GPU since I had an incompatible version of pytorch (1.9.0+cu111).

    trankit currently requires torch<1.8.0,>=1.6.0.

    Is there a reason for this dependency lock or could it be expanded to include torch==1.9.0? I've built from source with 1.9.0 and everything seems to be working. I'd be happy to make a PR with the version bump.

    opened by kpister 3
  • Import error after fresh install


    I'm having some trouble installing trankit.

    Created a new venv on python 3.8

    $ pip install trankit
    $ python
    Python 3.8.6 (default, Oct 21 2020, 08:28:24)
    [Clang 11.0.0 (clang-1100.0.33.12)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from trankit import Pipeline
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "vtrankit/lib/python3.8/site-packages/trankit/__init__.py", line 1, in <module>
        from .pipeline import Pipeline
      File "vtrankit/lib/python3.8/site-packages/trankit/pipeline.py", line 2, in <module>
        from .models.base_models import Multilingual_Embedding
      File "vtrankit/lib/python3.8/site-packages/trankit/models/__init__.py", line 1, in <module>
        from .classifiers import *
      File "vtrankit/lib/python3.8/site-packages/trankit/models/classifiers.py", line 2, in <module>
        from .base_models import *
      File "vtrankit/lib/python3.8/site-packages/trankit/models/base_models.py", line 1, in <module>
        from transformers import AdapterType, XLMRobertaModel
      File "vtrankit/lib/python3.8/site-packages/transformers/__init__.py", line 672, in <module>
        from .trainer import Trainer
      File "vtrankit/lib/python3.8/site-packages/transformers/trainer.py", line 69, in <module>
        from .trainer_pt_utils import (
      File "vtrankit/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 40, in <module>
        from torch.optim.lr_scheduler import SAVE_STATE_WARNING
    ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (vtrankit/lib/python3.8/site-packages/torch/optim/lr_scheduler.py)
    
    $ pip freeze
    adapter-transformers==1.1.1
    certifi==2020.12.5
    chardet==4.0.0
    click==7.1.2
    filelock==3.0.12
    idna==2.10
    joblib==1.0.1
    numpy==1.20.1
    packaging==20.9
    protobuf==3.15.5
    pyparsing==2.4.7
    regex==2020.11.13
    requests==2.25.1
    sacremoses==0.0.43
    sentencepiece==0.1.91
    six==1.15.0
    tokenizers==0.9.3
    torch==1.8.0
    tqdm==4.58.0
    trankit==0.3.5
    typing-extensions==3.7.4.3
    urllib3==1.26.3
    

    I've been looking around; the same error happened here. Not sure what is happening, but it seems like my pytorch version is too new? The setup.py for trankit specifies torch>=1.6.1.

    opened by kpister 3
  • Feature request: langID in multilingual pipelines


    Thanks for this framework! It could be worth adding a language identification task to avoid having to call p.set_active(lang). For langID, a very robust, fast and tiny option could be FastText, or a BERT model (better integration, but computationally intensive). The multilingual pipeline would then become:

    from trankit import Pipeline
    p = Pipeline(lang='auto', gpu=True, cache_dir='./cache') # auto means language identification active
    p.tokenize('Rich was here before the scheduled time.')
    p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')
    
    opened by loretoparisi 3
  • How to get children from a particular token in trankit


    In spaCy you have a feature called token.children. Do we have something like that in trankit? E.g., in dependency parsing, we can see that the verb "shift" points to "cars", "liability" and "manufacturers". In spaCy, if I ask for shift.children, I will get cars, liability, manufacturers.

    Do we have something similar in trankit ?

    opened by HitheshSankararaman 0
  • GPU on Apple M1 chip support


    This is a feature request to add support for the Apple M1 chip, which is supported by PyTorch since v1.12.

    Currently, Trankit only seems to use CUDA:

    In [9]: from trankit import Pipeline
    
    In [10]: p = Pipeline(lang='english')
    Loading pretrained XLM-Roberta, this may take a while...
    Loading tokenizer for english
    Loading tagger for english
    Loading lemmatizer for english
    Loading NER tagger for english
    ==================================================
    Active language: english
    ==================================================
    
    In [11]: p._use_gpu
    Out[11]: False
    

    Confirming that MPS is available through PyTorch:

    In [12]: import torch
    
    In [13]: torch.has_mps
    Out[13]: True
    

    A look into pipeline.py shows that it only works on CUDA:

        def _setup_config(self, lang):
            torch.cuda.empty_cache()
            # decide whether to run on GPU or CPU
            if self._gpu and torch.cuda.is_available():
                self._use_gpu = True
    
    opened by carschno 0
  • Error in testing phase after training and creating a pipeline (posdep)


    I am struggling with one of the functionalities: running the pipeline on the test data. The pipeline is able to predict a single line, but it does not work properly on the dataset to give a CoNLL-U score for posdep. The error is as follows:

        539             head_seqs = [chuliu_edmonds_one_root(adj[:l, :l])[1:] for adj, l in
        540                          zip(predicted_dep[0], sentlens)]
    --> 541             deprel_seqs = [[self._config.itos[DEPREL][predicted_dep[1][i][j + 1][h]] for j, h in
        542                             enumerate(hs)] for
        543                            i, hs
    
    • The code link is: https://colab.research.google.com/drive/1oUdBkJDIrnHR6fhdEOeIevBoTCrG_PF2?usp=sharing

    • The relevant data files are:

    1. dev-conllu.dat.filtered: https://drive.google.com/file/d/1-FT8aRhNmy0FADRmvG2_fdf4iphTLojI/view?usp=sharing
    2. train-conllu.dat.filtered: https://drive.google.com/file/d/1-8uoFLG9WSP6X3EQq7akxn9SWlZJHvZl/view?usp=sharing
    3. test-conllu.dat.filtered: https://drive.google.com/file/d/1-EJiXmmDnxMaa2JZ_EPkqdcZoom-4kss/view?usp=sharing
    opened by Jeevesh28 0
  • Query about Java-based compatibility or ONNX model availability of the trankit sentence segmentation module


    Hi,

    It's been an amazing package, for sentence segmentation in particular. We have been trying to incorporate it as part of our development efforts, running some POCs and evaluations.

    We would like to know whether the trankit segmentation module can be used from Java, or whether any ONNX model is available for it. Not being able to use it as a Java library or an ONNX model is stopping us from evaluating it as part of our pipeline.

    It would be of great help to get any pointers on this.

    opened by janmejay03 0
  • Limit input string to 512 characters to avoid CUDA crash


    Problem

    # If
    assert len(sentence) > 512
    # then
    annotated = model_trankit(sentence, is_sent=True)
    # result in CUDA error, e.g.
    ../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [19635,0,0], thread: [112,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
    

    Cause: XLM-Roberta can only process up to 512 tokens.

    Possible fix https://github.com/nlp-uoregon/trankit/blob/1c19b9b7df3be1de91c2dd6879e0e325af5e2898/trankit/pipeline.py#L1066

    Change

    ...
    
                    ori_text = deepcopy(input)
                    tagged_sent = self._posdep_sent(input)
    ...
    

    to

    ...
    
                    ori_text = deepcopy(input)
                    ori_text = ori_text[:512]   # <<< TRIM STRING TO MAX 512
                    tagged_sent = self._posdep_sent(input)
    ...
    
    opened by ulf1 1
  • Hardware recommendations


    Hello. :) I'm running multiple versions of trankit in docker containers (each container is assigned an RTX 3090 or an RTX 2080 Ti), with 3 python instances/workers per GPU/container.

    I'm seeing performance throughput drop off beyond about 3 GPUs on a dual 2697 v3 machine (dual 16-core processors, single-thread Passmark about 2000, multi about 20k per CPU), and for a single GPU, performance is about 15% lower than on a 5950x machine (16 cores, single-thread Passmark about 3500).

    I'm still doing some tests, but it seems like trankit needs fast CPU cores (around 4-5 per GPU) to run well?

    opened by hobodrifterdavid 0
Releases(v1.1.0)
  • v1.1.0(Jun 19, 2021)

    • The issue #17 of loading customized pipelines has been fixed in this new release. Please check it out here.
    • In this new release, trankit supports converting trankit outputs in JSON format to CoNLL-U format. The conversion is done via the new function trankit2conllu, which can be used as follows:
    from trankit import Pipeline, trankit2conllu
    
    p = Pipeline('english')
    
    # document level
    json_doc = p('''Hello! This is Trankit.''')
    conllu_doc = trankit2conllu(json_doc)
    print(conllu_doc)
    #1       Hello   hello   INTJ    UH      _       0       root    _       _
    #2       !       !       PUNCT   .       _       1       punct   _       _
    #
    #1       This    this    PRON    DT      Number=Sing|PronType=Dem        3       nsubj   _       _
    #2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   3       cop     _       _
    #3       Trankit Trankit PROPN   NNP     Number=Sing     0       root    _       _
    #4       .       .       PUNCT   .       _       3       punct   _       _
    
    # sentence level
    json_sent = p('''This is Trankit.''', is_sent=True)
    conllu_sent = trankit2conllu(json_sent)
    print(conllu_sent)
    #1       This    this    PRON    DT      Number=Sing|PronType=Dem        3       nsubj   _       _
    #2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   3       cop     _       _
    #3       Trankit Trankit PROPN   NNP     Number=Sing     0       root    _       _
    #4       .       .       PUNCT   .       _       3       punct   _       _
    
    
  • v1.0.0(Mar 31, 2021)

    :boom: :boom: :boom: Trankit v1.0.0 is out:

    • 90 new pretrained transformer-based pipelines for 56 languages. The new pipelines are trained with XLM-Roberta large, which significantly boosts performance across the 90 treebanks of the Universal Dependencies v2.5 corpus. Check out the new performance here. This page shows you how to use the new pipelines.

    • Auto Mode for multilingual pipelines. In Auto Mode, the language of the input is detected automatically, so the multilingual pipelines can process input without its language being specified. Check out how to turn on Auto Mode here. Thanks to loretoparisi for suggesting this.

    • A command-line interface is now available, so users who are not familiar with the Python programming language can use Trankit easily. Check out the tutorials on this page.

    Source code(tar.gz)
    Source code(zip)
Owner
This is the official GitHub account of the Natural Language Processing Group at the University of Oregon.