A deep learning-based translation library built on Huggingface transformers

Overview

DL Translate

A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large

💻 GitHub Repository
📚 Documentation / Readthedocs
🐍 PyPI project
🧪 Colab Demo / Kaggle Demo

Quickstart

Install the library with pip:

pip install dl-translate

To translate some text:

import dl_translate as dlt

mt = dlt.TranslationModel()  # Slow when you load it for the first time

text_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
mt.translate(text_hi, source=dlt.lang.HINDI, target=dlt.lang.ENGLISH)

Above, you can see that dlt.lang contains variables representing each of the 50 available languages with auto-complete support. Alternatively, you can specify the language (e.g. "Arabic") or the language code (e.g. "fr_XX" for French):

text_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
mt.translate(text_ar, source="Arabic", target="fr_XX")

If you want to verify whether a language is available, you can check it:

print(mt.available_languages())  # All languages that you can use
print(mt.available_codes())  # Code corresponding to each language accepted
print(mt.get_lang_code_map())  # Dictionary of lang -> code
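
As a rough illustration of how a name such as "Arabic" and a code such as "fr_XX" can both be accepted, the sketch below normalizes either form against a lang -> code dictionary like the one returned by mt.get_lang_code_map(). The helper name resolve_code and the four-entry dictionary are hypothetical and only for illustration (the codes are the ones that appear in this document); this is not the library's actual implementation.

```python
def resolve_code(value, lang_code_map):
    """Return the language code for a language name or an existing code.

    `lang_code_map` maps language names to codes, e.g. {"French": "fr_XX"}.
    """
    # Case-insensitive lookup by language name.
    by_name = {name.lower(): code for name, code in lang_code_map.items()}
    if value.lower() in by_name:
        return by_name[value.lower()]
    # Otherwise accept the value only if it is already a known code.
    if value in lang_code_map.values():
        return value
    raise ValueError(f"{value!r} is neither a known language nor a known code")


# Hypothetical excerpt of a lang -> code dictionary.
sample_map = {
    "Arabic": "ar_AR",
    "English": "en_XX",
    "French": "fr_XX",
    "Hindi": "hi_IN",
}

print(resolve_code("Arabic", sample_map))  # ar_AR
print(resolve_code("fr_XX", sample_map))   # fr_XX
```

Raising an error for unknown values, rather than guessing, keeps typos such as "Frnech" from silently selecting the wrong language.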

Usage

Selecting a device

When you load the model, you can specify the device:

mt = dlt.TranslationModel(device="auto")

By default, device="auto" is used, which selects a GPU if one is available. You can also explicitly set device="cpu" or device="gpu", or pass any other string accepted by torch.device(). In general, a GPU is recommended if you want reasonable processing times.
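
To make the three options concrete, here is a minimal sketch of how such a device string could be resolved. It assumes that "gpu" is an alias for CUDA and that the cuda_available flag stands in for the result of torch.cuda.is_available(); the function name resolve_device is hypothetical, and the library's real logic may differ.

```python
def resolve_device(device, cuda_available):
    """Map a device option to the string that would be given to torch.device()."""
    choice = device.lower()
    if choice == "auto":
        # Prefer the GPU when one is usable, otherwise fall back to the CPU.
        return "cuda" if cuda_available else "cpu"
    if choice == "gpu":
        # Assumption: "gpu" is treated as an alias for CUDA in this sketch.
        return "cuda"
    # Anything else ("cpu", "cuda:1", ...) is passed through unchanged.
    return choice


print(resolve_device("auto", cuda_available=False))  # cpu
print(resolve_device("gpu", cuda_available=True))    # cuda
```

Under this assumption, device="gpu" would fail on a machine without CUDA support, whereas device="auto" would fall back to the CPU.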

Loading from a path

By default, dlt.TranslationModel will download the model from the huggingface repo and cache it. However, you are free to load from a path:

mt = dlt.TranslationModel("/path/to/your/model/directory/")

Make sure that your tokenizer is also stored in the same directory if you use this approach.

Using a different model

You can also choose another model that has a similar format, e.g.

mt = dlt.TranslationModel("facebook/mbart-large-50-one-to-many-mmt")

Note that the available languages will change if you do this, so you will not be able to leverage dlt.lang or dlt.utils.

Breaking down into sentences

Translating extremely long texts in a single call is not recommended because it takes much longer to process. Instead, you can break them down into sentences with the help of nltk. First install the library with pip install nltk, then run:

import nltk

nltk.download("punkt")

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # don't use dlt.lang.ENGLISH
" ".join(mt.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))

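If single sentences turn out to be smaller than necessary, one possible refinement (not part of the library) is to pack consecutive sentences into chunks that stay under a size budget before translating them. The sketch below uses a character budget as a simple stand-in for a token limit; the helper name pack_sentences and the max_chars parameter are hypothetical.

```python
def pack_sentences(sentences, max_chars):
    """Group consecutive sentences into chunks of at most `max_chars` characters.

    A sentence longer than `max_chars` is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for sent in sentences:
        candidate = f"{current} {sent}" if current else sent
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sents = [
    "Mr. Smith went to his favorite cafe.",
    "There, he met his friend Dr. Doe.",
]
print(pack_sentences(sents, max_chars=40))   # two chunks, one per sentence
print(pack_sentences(sents, max_chars=100))  # one chunk containing both
```

The resulting list of chunks can be passed to mt.translate in place of sents in the snippet above.
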
Batch size and verbosity when using translate

It's possible to set a batch size (i.e. the number of elements processed at once) for mt.translate and whether you want to see the progress bar or not:

...
mt = dlt.TranslationModel()
mt.translate(text, source, target, batch_size=32, verbose=True)

If you set batch_size=None, it will process the entire input at once rather than splitting it into "chunks". We recommend lowering batch_size if you do not have a lot of RAM or VRAM and run into CUDA out-of-memory errors. Set a higher value if you are using a high-end GPU and the VRAM is not fully utilized.
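
The effect of batch_size can be pictured with the following sketch, which slices a list of inputs into batches. It only illustrates the idea of chunking under the semantics described above (None means a single pass); the function name iter_batches is hypothetical and this is not the library's code.

```python
def iter_batches(items, batch_size=None):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    if batch_size is None:
        # No batch size: everything is processed in a single pass.
        yield list(items)
        return
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [f"sentence {i}" for i in range(7)]
print([len(b) for b in iter_batches(sentences, batch_size=3)])     # [3, 3, 1]
print([len(b) for b in iter_batches(sentences, batch_size=None)])  # [7]
```

Smaller batches lower peak memory use at the cost of more forward passes, which is why reducing batch_size helps with memory errors.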

dlt.utils module

An alternative to mt.available_languages() is the dlt.utils module. You can use it to find out which languages and codes are available:

print(dlt.utils.available_languages('mbart50'))  # All languages that you can use
print(dlt.utils.available_codes('mbart50'))  # Code corresponding to each language accepted
print(dlt.utils.get_lang_code_map('mbart50'))  # Dictionary of lang -> code

Advanced

The following section assumes you have knowledge of PyTorch and Huggingface Transformers.

Saving and loading

If you wish to speed up the loading of the translation model, you can use save_obj:

mt = dlt.TranslationModel()
mt.save_obj('saved_model')
# ...

Then later you can reload it with load_obj:

mt = dlt.TranslationModel.load_obj('saved_model')
# ...

Warning: Only use this if you are certain the torch module saved in saved_model/weights.pt can be correctly loaded. Huggingface, torch or other dependencies may change between the calls to save_obj and load_obj, which might break your code. It is therefore recommended to only run load_obj in the same environment/session as save_obj. Note that this method might be deprecated in the future once there's no speed benefit in loading this way.
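
One way to guard against such environment drift, sketched here with the standard library only, is to record the installed package versions when saving and compare them before loading. Nothing below is provided by dl-translate: the helper names and the environment.json file name are hypothetical, and the package list is an assumption about which dependencies matter.

```python
import json
import sys
import tempfile
from importlib import metadata
from pathlib import Path


def environment_fingerprint(packages=("torch", "transformers", "dl-translate")):
    """Collect the Python version and the installed version of each package."""
    versions = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return versions


def save_fingerprint(path, fingerprint):
    """Write the fingerprint as JSON, e.g. next to the saved model."""
    Path(path).write_text(json.dumps(fingerprint, indent=2))


def fingerprint_matches(path, fingerprint):
    """Return True if the stored fingerprint equals the given one."""
    return json.loads(Path(path).read_text()) == fingerprint


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "environment.json"
    current = environment_fingerprint()
    save_fingerprint(path, current)
    print(fingerprint_matches(path, current))  # True
```

In practice, the fingerprint could be written right after mt.save_obj('saved_model') and checked before calling load_obj, falling back to dlt.TranslationModel() when it does not match.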

Interacting with underlying model and tokenizer

When initializing the model, you can pass in arguments for the underlying BART model and tokenizer (which will respectively be passed to MBartForConditionalGeneration.from_pretrained and MBart50TokenizerFast.from_pretrained):

mt = dlt.TranslationModel(
    model_options=dict(
        state_dict=...,
        cache_dir=...,
        ...
    ),
    tokenizer_options=dict(
        tokenizer_file=...,
        eos_token=...,
        ...
    )
)

You can also access the underlying transformers model and tokenizer:

bart = mt.get_transformers_model()
tokenizer = mt.get_tokenizer()

See the huggingface docs for more information.

bart_model.generate() keyword arguments

When running mt.translate, you can also give a generation_options dictionary that is passed as keyword arguments to the underlying bart_model.generate() method:

mt.translate(
    text,
    source=dlt.lang.GERMAN,
    target=dlt.lang.SPANISH,
    generation_options=dict(num_beams=5, max_length=...)
)

Learn more in the huggingface docs.
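
If the same generation settings are reused across many calls, a small helper can merge shared defaults with per-call overrides before the result is passed as generation_options. This is only a convenience sketch: the helper name, the default of num_beams=5 and the max_length value of 128 are hypothetical choices, not library defaults.

```python
# Hypothetical project-wide defaults for generation.
DEFAULT_GENERATION_OPTIONS = {"num_beams": 5}


def build_generation_options(**overrides):
    """Combine the shared defaults with per-call overrides (overrides win)."""
    return {**DEFAULT_GENERATION_OPTIONS, **overrides}


print(build_generation_options())                # {'num_beams': 5}
print(build_generation_options(max_length=128))  # {'num_beams': 5, 'max_length': 128}
print(build_generation_options(num_beams=1))     # {'num_beams': 1}
```

The returned dictionary can then be used as mt.translate(..., generation_options=build_generation_options(max_length=128)).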

Acknowledgement

dl-translate is built on top of Huggingface's implementation of multilingual BART finetuned on many-to-many translation of over 50 languages, which is documented here. The original paper was written by Tang et al. from Facebook AI Research; you can find it here and cite it using the following:

@article{tang2020multilingual,
  title={Multilingual translation with extensible multilingual pretraining and finetuning},
  author={Tang, Yuqing and Tran, Chau and Li, Xian and Chen, Peng-Jen and Goyal, Naman and Chaudhary, Vishrav and Gu, Jiatao and Fan, Angela},
  journal={arXiv preprint arXiv:2008.00401},
  year={2020}
}

dlt is a wrapper with useful utils to save you time. With huggingface's transformers alone, translation looks like the following snippet:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."

With dlt, you can run:

import dl_translate as dlt

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

mt = dlt.TranslationModel()
translated_fr = mt.translate(article_hi, source=dlt.lang.HINDI, target=dlt.lang.FRENCH)
translated_en = mt.translate(article_ar, source=dlt.lang.ARABIC, target=dlt.lang.ENGLISH)

Notice you don't have to think about tokenizers, conditional generation, pretrained models, and regional codes; you can just tell the model what to translate!

If you are experienced with huggingface's ecosystem, then you should be familiar enough with the example above that you wouldn't need this library. However, if you've never heard of huggingface or mBART, then I hope using this library will give you enough motivation to learn more about them :)

Comments
  • module 'torch' has no attribute 'device'

    Hello @xhlulu, please find attached the part of the tutorial that I tried to execute and where I get the error. Note: I followed the PyTorch guide to install torch with the command appropriate to my system: pip3 install torch torchvision torchaudio. My torch version is 1.10.1 and my Python version is 3.8.5.

    Thank you for your help.

    opened by gitassia 9
  • Offline mode tutorial

    Hi, sorry for my bad English, I am quite a newbie. I am confused by the offline tutorial, which says: "Now, move everything in the dlt directory to your offline environment. Create a virtual environment:". Where is the "offline environment", and how do I create a "virtual environment"? I am using Windows 11 and Python 3.9.

    opened by kucingkembar 6
  • error on pyw extension

    Hi, it's me again, sorry again for my bad English. I tried this code in a .py file opened in Python IDLE (Run -> Run Module, F5) and had no problem. I then renamed the extension to .pyw and opened it like an exe (double click), and this is the result:

    Traceback (most recent call last):
      File "D:\Script\translate.pyw", line 67, in FB_Loading
        import dl_translate as dlt
      File "C:\Python\Python39\lib\site-packages\dl_translate\__init__.py", line 3, in <module>
        from ._translation_model import TranslationModel
      File "C:\Python\Python39\lib\site-packages\dl_translate\_translation_model.py", line 5, in <module>
        import transformers
      File "C:\Python\Python39\lib\site-packages\transformers\__init__.py", line 43, in <module>
        from . import dependency_versions_check
      File "C:\Python\Python39\lib\site-packages\transformers\dependency_versions_check.py", line 36, in <module>
        from .file_utils import is_tokenizers_available
      File "C:\Python\Python39\lib\site-packages\transformers\file_utils.py", line 58, in <module>
        logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
      File "C:\Python\Python39\lib\site-packages\transformers\utils\logging.py", line 119, in get_logger
        _configure_library_root_logger()
      File "C:\Python\Python39\lib\site-packages\transformers\utils\logging.py", line 82, in _configure_library_root_logger
        _default_handler.flush = sys.stderr.flush
    AttributeError: 'NoneType' object has no attribute 'flush'
    

    any guide to fix this?

    opened by kucingkembar 4
  • Add MarianNMT

    See Marian: https://huggingface.co/transformers/model_doc/marian.html See helsinki-nlp's models: https://huggingface.co/Helsinki-NLP

    We'd need

    • [ ] Add option to load the marian architecture at initialization (e.g. dlt.TranslationModel("marian"))
    • [ ] Add an option to find all of the languages (and code) available for a certain variant trained using marian, e.g. dlt.utils.available_languages("opus-en-romance")
    • [ ] An option to leverage autocomplete such as dlt.lang.opus.en_romance.ENGLISH, but the options would be limited to only what's available with the variance (i.e. romance)
    • [ ] TBD
    enhancement 
    opened by xhluca 3
  • no load to ram mode

    Hi, it's me again, and sorry about my bad English. I have a project that uses this software on Windows tablets with 4 GB of RAM. The problem is that its RAM consumption is quite high, about 2.3 GB. Is there any way to make it read data from storage (SSD or HDD) instead of holding it in RAM?

    thank you for reading, and have a nice day

    opened by kucingkembar 2
  • error: when using  torch(1.8.0+cu111)

    Traceback (most recent call last):
      File "translate_test.py", line 66, in <module>
        translate_test()
      File "translate_test.py", line 30, in translate_test
        rest = mt.predict(texts, _from = 'en',batch_size = size)
      File "/mnt/eclipse-glority/receipt/deploy/branches/dev/ms_deploy/util/translate_util.py", line 29, in predict
        rest = self.mt.translate(texts, source=_from, target=_to, batch_size = batch_size)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/dl_translate/_translation_model.py", line 197, in translate
        **encoded, **generation_options
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/transformers/generation_utils.py", line 927, in generate
        model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/transformers/generation_utils.py", line 412, in _prepare_encoder_decoder_kwargs_for_generation
        model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 780, in forward
        output_attentions=output_attentions,
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/transformers/models/m2m_100/modeling_m2m_100.py", line 388, in forward
        hidden_states = self.activation_fn(self.fc1(hidden_states))
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
        return F.linear(input, self.weight, self.bias)
      File "/home/hyj/anaconda3/envs/tf25/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

    torch                            1.8.0+cu111
    torchvision                      0.9.0+cu111

    it is ok, when

    torch 1.7.1+cu101

    how to fix ?

    opened by hongyinjie 2
  • Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

    When I use dl_translate, the following problem appears, how do I set TOKENIZERS_PARALLELISM.

    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

    opened by Kouuh 2
  • Incorporating ISO-639

    opened by xhluca 2
  • Cannot run with device = 'gpu' on Macbook M1 Pro

    I tried using the GPU on a 16-inch MacBook M1 Pro and got this error: "AssertionError: Torch not compiled with CUDA enabled"

    Please help!

    opened by htnha 1
  • how to make (slow) translation faster

    Hi, I am testing this code on a list of 5 short sentences, the average time for translation is 2 seconds/sentence. which is slow for my requirements. any hints on how to speed-up the translation ? Thanks

    import dl_translate as dlt
    import time
    from langdetect import detect
    
    french_sentence = 'oh mon dieu c mechant c pas possible jamais je reviendrai, a deconseiller. je vous recommende de visiter un autre produit apres vous pouvez voire la difference'
    arabic_sentence = '  لقد جربت عدة نسخ من هذا المنتج لكن لم استطع ان اجد فبه ما ينتج ما هذا الهراء'
    ar2 = 'المنتج الاصلى سريع الذوبان فى الماء ويذوب بشكل مثالى على عكس المكمل المغشوش ...منتج كويس انا حبيتو و بنصح فيه'
    ar3= 'امشي سيدا لفه الثانيه يسار تعدد المطالبات المتعلقة بالأراضي وما ينتج عن ذلك من تناحر يولد باستمرار نزاعات متجددة. ... ويمكن دمج ما ينتج عن ذلك من معارف في إطار برنامج عمل نيروبي' 
    nepali ='यो मृत्युदर विकासशील देशहरुमा धेरै छ'
    sent_list =[french_sentence, arabic_sentence, ar2, ar3, nepali]
    print(sent_list)
    mt = dlt.TranslationModel()  # Slow when you load it for the first time
    map_langdetect_to_translate = {'ar':'Arabic', 'en':'English', 'es':'Spanish', 'fr':'French', 'ne':'Nepali'}
    start = time.time() 
    for sent in sent_list:
    	print('-------------------------------------')
    	print('original sentence is : ',sent)
    	print('detected lang ',detect(sent))
    	mapped = map_langdetect_to_translate[detect(sent)]
    	translated = mt.translate(sent, source=mapped, target="en")
    	print('Translation is : ',translated)
    
    end = time.time()	
    tt = time.strftime("%H:%M:%S", time.gmtime(end-start))
    time_message = 'Query execution time : {}'.format( tt )
    print(time_message)
    
    opened by banyous 1
  • Generate docs with sphinx or something else

    Right now I have some docstrings but it would require some refactoring. Using readthedocs.io would be nice, we could start by looking at what numpy or pydata is using

    documentation 
    opened by xhluca 1
  • Detect source language with langdetect package

    The langdetect has worked well for me in the past for language detection problems. How would you feel about allowing users to pass 'auto' as an option for source? I could see some pros and cons:

    Pros

    • Users don't need to be able to recognize a language to translate
    • Eliminates pre-classification of languages if your dataset contains multiple languages

    Cons

    I'm a little new to open source but I would love to contribute 🙂 Of course, if you feel this doesn't fit this package's mission that's totally understandable.

    enhancement help wanted good first issue 
    opened by awalker88 5
  • Support for sentence splitting

    Right now TranslationModel.translate will translate each input string as is, which can be extremely slow for longer sequences due to the quadratic runtime of the architecture. The current recommended way is to use nltk:

    import nltk
    
    nltk.download("punkt")
    
    text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
    sents = nltk.tokenize.sent_tokenize(text, "english")  # don't use dlt.lang.ENGLISH
    " ".join(model.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))
    

    This works well but doesn't cover all possible languages. It would be interesting to train the punkt model on each of the available languages (though we'd need a very large dataset for that). Once that's done, the snippet above could become a simple argument, e.g. model.translate(..., max_length="sentence"). With some more effort, the max_length parameter could also be an integer n between 0 and 512 representing the maximum number of tokens. Moreover, rather than truncating at that length, we could break the input text into sequences of length n or less, each made up of aggregated sentences.

    enhancement help wanted 
    opened by xhluca 3
Releases(v0.2.6)
  • v0.2.6(Jul 13, 2022)

  • v0.2.2.post1(Aug 21, 2021)

  • v0.2.2(Apr 9, 2021)

    Change languages available in dlt.lang

    Changed

    • Docs: Available languages now include "Khmer" (which maps to central khmer)

    Fixed

    • dlt.lang will now have all the languages corresponding to m2m100 instead of mbart50
  • v0.2.1(Apr 8, 2021)

    Fix dlt.TranslationModel.load_obj

    Added

    • New tests for saving and loading.

    Fixed

    • dlt.TranslationModel.load_obj: Will now work without having to explicitly give the model family.
  • v0.2.0(Apr 8, 2021)

    Add m2m100 as the new default model to support 100 languages

    Added

    • dlt.lang.m2m100 module: Now has variables for over 100 languages, also auto-complete ready. Example: dlt.lang.m2m100.ENGLISH.
    • dlt.utils.available_languages, dlt.utils.available_codes: Now supports argument "m2m100"
    • Available languages for each model family
    • Script and template to generate available languages

    Changed

    • [BREAKING] dlt.TranslationModel: A new model parameter called model_family in the initialization function. Either "mbart50" or "m2m100". By default, it will be inferred based on model_or_path. Needs to be explicitly set if model_or_path is a path.
    • [BREAKING] Default model changed to m2m100
    • Docs and readme about mbart50 were reframed to take into account the new model
    • dlt.TranslationModel.translate: Improved docstring to be more general.
    • Tests pertaining to m2m100
    • scripts/generate_langs.py: Renamed, mechanism now changed to loading from json files
    • docs/index.md: Expand the "Usage" and "Advanced" sections
    • README.md: Add acknowledgement about m2m100, significantly trim "Advanced" section, make "Usage" more concise

    Fixed

    • dlt.TranslationModel.available_codes() was returning the languages instead of the codes. It will now correctly return the code.

    Removed

    • Output type hints for TranslationModel.get_transformers_model and TranslationModel.get_tokenizer
    • [BREAKING] dlt.TranslationModel.bart_model and dlt.TranslationModel.tokenizer are no longer available to be used directly. Please use dlt.TranslationModel.get_transformers_model and dlt.TranslationModel.get_tokenizer instead.
  • v0.2.0rc1(Mar 21, 2021)

    Add m2m100 as an alternative to mbart50

    m2m100 has more languages available (~110) and has also reported their absolute BLEU scores.

    Added

    • dlt.lang.m2m100 module: Now has variables for over 100 languages, also auto-complete ready. Example: dlt.lang.m2m100.ENGLISH.
    • dlt.utils.available_languages, dlt.utils.available_codes: Now supports argument "m2m100"

    Changed

    • [BREAKING] dlt.TranslationModel: A new model parameter called model_family in the initialization function. Either "mbart50" or "m2m100". By default, it will be inferred based on model_or_path. Needs to be explicitly set if model_or_path is a path.
    • dlt.TranslationModel.translate: Improved docstring to be more general.
    • Tests pertaining to m2m100
    • scripts/generate_langs.py: Renamed, mechanism now changed to loading from json files

    Fixed

    • dlt.TranslationModel.available_codes() was returning the languages instead of the codes. It will now correctly return the code.

    Removed

    • Output type hints for TranslationModel.get_transformers_model and TranslationModel.get_tokenizer
    • [BREAKING] dlt.TranslationModel.bart_model and dlt.TranslationModel.tokenizer are no longer available to be used directly. Please use dlt.TranslationModel.get_transformers_model and dlt.TranslationModel.get_tokenizer instead.
  • v0.1.0(Mar 17, 2021)
