FireFlyer Record file format, writer and reader for DL training samples.

Overview

FFRecord

FFRecord is a simple file format, developed by HFAiLab, for storing a sequence of binary records. It supports random access and reads via Linux Asynchronous Input/Output (AIO).

File Format

Storage Layout:

+-----------------------------------+---------------------------------------+
|         checksum                  |             N                         |
+-----------------------------------+---------------------------------------+
|         checksums                 |           offsets                     |
+---------------------+---------------------+--------+----------------------+
|      sample 1       |      sample 2       | ....   |      sample N        |
+---------------------+---------------------+--------+----------------------+

Fields:

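+-----------+-----------------------------+-------------------------------+
| field     | size (bytes)                | description                   |
+-----------+-----------------------------+-------------------------------+
| checksum  | 4                           | CRC32 checksum of metadata    |
| N         | 8                           | number of samples             |
| checksums | 4 * N                       | CRC32 checksum of each sample |
| offsets   | 8 * N                       | byte offset of each sample    |
| sample i  | offsets[i + 1] - offsets[i] | data of the i-th sample       |
+-----------+-----------------------------+-------------------------------+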
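
As a rough illustration of the layout above, the metadata block could be decoded with Python's struct module. This is only a sketch under several assumptions that this README does not state: little-endian byte order, signed 64-bit N and offsets, no padding between fields, and offsets measured from the start of the file. The helpers parse_header and read_sample and the toy in-memory file are hypothetical and not part of ffrecord.

```python
import io
import struct


def parse_header(f):
    """Decode checksum, N, checksums and offsets from a binary file object.

    Assumes little-endian, tightly packed fields (an assumption, not confirmed).
    """
    checksum, n = struct.unpack('<Iq', f.read(4 + 8))
    checksums = list(struct.unpack(f'<{n}I', f.read(4 * n)))
    offsets = list(struct.unpack(f'<{n}q', f.read(8 * n)))
    return checksum, n, checksums, offsets


def read_sample(f, offsets, i, file_size):
    """Read sample i; the last sample is assumed to run to the end of the file."""
    end = offsets[i + 1] if i + 1 < len(offsets) else file_size
    f.seek(offsets[i])
    return f.read(end - offsets[i])


# Build a toy in-memory file that follows the assumed layout (checksums left as 0).
payloads = [b'ab', b'cde', b'f']
n = len(payloads)
header_size = 4 + 8 + 4 * n + 8 * n
offsets, pos = [], header_size
for p in payloads:
    offsets.append(pos)
    pos += len(p)
blob = (struct.pack('<Iq', 0, n)
        + struct.pack(f'<{n}I', *([0] * n))
        + struct.pack(f'<{n}q', *offsets)
        + b''.join(payloads))

f = io.BytesIO(blob)
_, parsed_n, _, parsed_offsets = parse_header(f)
second = read_sample(f, parsed_offsets, 1, len(blob))
print(parsed_n, second)
```

Because every offset is stored up front, any sample can be located with a single seek, which is what makes random access cheap in a layout like this.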

Get Started

Requirements

Install

pip3 install ffrecord

Usage

We provide ffrecord.FileWriter and ffrecord.FileReader for writing and reading, respectively.

Write

To create a FileWriter object, specify a file name and the total number of samples. Then call FileWriter.write_one() to write one sample to the FFRecord file. It accepts bytes or bytearray as input and appends the data to the end of the opened file.

import pickle

from ffrecord import FileWriter


def serialize(sample):
    """ Serialize a sample to bytes or bytearray

    You could use anything you like to serialize the sample.
    Here we simply use pickle.dumps().
    """
    return pickle.dumps(sample)


samples = [i for i in range(100)]  # anything you would like to store
fname = 'test.ffr'
n = len(samples)  # number of samples to be written
writer = FileWriter(fname, n)

for i in range(n):
    data = serialize(samples[i])  # data should be bytes or bytearray
    writer.write_one(data)

writer.close()
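
Since write_one() only needs bytes or bytearray, the serializer is interchangeable. As one possible alternative to pickle, the sketch below packs an (image bytes, label) pair with struct. The helpers pack_sample and unpack_sample are hypothetical and not part of ffrecord, and the little-endian 4-byte label is an arbitrary choice made for this example.

```python
import struct


def pack_sample(image_bytes, label):
    """Serialize a (bytes, int) pair as a 4-byte label followed by the raw payload."""
    return struct.pack('<i', label) + image_bytes


def unpack_sample(data):
    """Inverse of pack_sample(); accepts bytes, bytearray or memoryview."""
    (label,) = struct.unpack_from('<i', data, 0)
    return bytes(data[4:]), label


record = pack_sample(b'\x89PNG...', 7)  # placeholder payload, not a real image
assert isinstance(record, bytes)
assert unpack_sample(record) == (b'\x89PNG...', 7)
```

A fixed binary layout like this avoids unpickling untrusted data, at the cost of having to define the record structure yourself.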

Read

To create a FileReader object, you only need to specify the file name. Then call FileReader.read() to read multiple samples from the FFRecord file. It accepts a list of indices as input and returns the corresponding sample data.

If check_data=True, the reader validates the checksum before returning the data.

import pickle

from ffrecord import FileReader


def deserialize(data):
    """ deserialize bytes data

    The deserialize method should be paired with the serialize method above.
    """
    return pickle.loads(data)


fname = 'test.ffr'
reader = FileReader(fname, check_data=True)
print(f'Number of samples: {reader.n}')

indices = [3, 6, 0, 10]      # indices of each sample
data = reader.read(indices)  # return a list of bytes data

for i in range(len(data)):
    sample = deserialize(data[i])
    # do what you want

reader.close()
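
Conceptually, validating a checksum means recomputing a sample's CRC32 and comparing it with the stored value. The sketch below shows that idea with zlib.crc32 from the standard library. Whether ffrecord uses the same CRC32 variant is an assumption, and verify_sample is a hypothetical helper, not an ffrecord function.

```python
import zlib


def verify_sample(data, expected_crc):
    """Return data if its CRC32 matches expected_crc, otherwise raise ValueError."""
    actual = zlib.crc32(data) & 0xFFFFFFFF
    if actual != expected_crc:
        raise ValueError(f'checksum mismatch: {actual:#010x} != {expected_crc:#010x}')
    return data


payload = b'hello ffrecord'
stored_crc = zlib.crc32(payload) & 0xFFFFFFFF  # what a writer would store

verify_sample(payload, stored_crc)  # matches, so no exception is raised
try:
    verify_sample(payload[:-1] + b'X', stored_crc)  # last byte corrupted
    corrupted_detected = False
except ValueError:
    corrupted_detected = True
```

Checking costs one extra pass over each sample, which is the usual trade-off behind making validation optional.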

Dataset and DataLoader for PyTorch

We also provide ffrecord.torch.Dataset and ffrecord.torch.DataLoader for PyTorch users to train models using FFRecord.

Unlike torch.utils.data.Dataset, which accepts a single index and returns one sample, ffrecord.torch.Dataset accepts a batch of indices and returns a batch of samples. One advantage of ffrecord.torch.Dataset is that it can read a whole batch of data at a time using Linux AIO.

We first read a batch of bytes data from the FFRecord file and then pass it to the process() function. Users need to inherit from ffrecord.torch.Dataset and define their own process() function.

Pipeline:  indices ----------------------------> bytes -------------> samples
                      reader.read(indices)               process()

For example:

import pickle

import ffrecord.torch


class CustomDataset(ffrecord.torch.Dataset):

    def __init__(self, fname, check_data=True, transform=None):
        super().__init__(fname, check_data)
        self.transform = transform

    def process(self, indices, data):
        # deserialize data
        samples = [pickle.loads(b) for b in data]

        # transform data
        if self.transform:
            samples = [self.transform(s) for s in samples]
        return samples

dataset = CustomDataset('train.ffr')
indices = [3, 4, 1, 0]
samples = dataset[indices]
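
To make the indices -> bytes -> samples pipeline concrete without an FFRecord file, the sketch below imitates it with an in-memory list. ListReader and BatchDataset are hypothetical stand-ins, not ffrecord classes. Only the shape of __getitem__ (a list of indices in, the result of process() out) follows the description above.

```python
import pickle


class ListReader:
    """Stand-in for a reader: returns the raw bytes for a list of indices."""

    def __init__(self, records):
        self.records = records
        self.n = len(records)

    def read(self, indices):
        return [self.records[i] for i in indices]


class BatchDataset:
    """Dataset whose __getitem__ takes a batch of indices, not a single index."""

    def __init__(self, reader):
        self.reader = reader

    def __len__(self):
        return self.reader.n

    def __getitem__(self, indices):
        data = self.reader.read(indices)    # indices -> bytes
        return self.process(indices, data)  # bytes -> samples

    def process(self, indices, data):
        return [pickle.loads(b) for b in data]


records = [pickle.dumps({'id': i, 'value': i * i}) for i in range(5)]
dataset = BatchDataset(ListReader(records))
batch = dataset[[3, 4, 1, 0]]
print([s['value'] for s in batch])
```

Samples come back in the order of the requested indices, so a shuffled index list yields a shuffled batch.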

ffrecord.torch.Dataset can be combined with ffrecord.torch.DataLoader, just like a Dataset and DataLoader in PyTorch.

dataset = CustomDataset('train.ffr')
loader = ffrecord.torch.DataLoader(dataset,
                                   batch_size=16,
                                   shuffle=True,
                                   num_workers=8)

for i, batch in enumerate(loader):
    pass  # train the model on this batch
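
Because the dataset takes a whole batch of indices at once, a loader essentially has to shuffle the indices and cut them into chunks of batch_size. The sketch below shows that step in isolation. iter_batches is a hypothetical helper and says nothing about how ffrecord.torch.DataLoader is actually implemented.

```python
import random


def iter_batches(n, batch_size, shuffle=False, seed=None):
    """Yield lists of at most batch_size indices covering range(n) exactly once."""
    indices = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        yield indices[start:start + batch_size]


batches = list(iter_batches(10, batch_size=4, shuffle=True, seed=0))
# Every index appears exactly once; the last batch may be smaller.
assert sorted(i for b in batches for i in b) == list(range(10))
assert [len(b) for b in batches] == [4, 4, 2]
```

Each yielded list could be passed directly to a batch-indexed dataset such as the CustomDataset above, for example dataset[batch_indices].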
Comments
  • install error

    When I installed ffrecord with python setup.py install, the build failed with the following errors:

    running install
    running bdist_egg
    running egg_info
    creating ffrecord.egg-info
    writing ffrecord.egg-info/PKG-INFO
    writing dependency_links to ffrecord.egg-info/dependency_links.txt
    writing requirements to ffrecord.egg-info/requires.txt
    writing top-level names to ffrecord.egg-info/top_level.txt
    writing manifest file 'ffrecord.egg-info/SOURCES.txt'
    reading manifest file 'ffrecord.egg-info/SOURCES.txt'
    writing manifest file 'ffrecord.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.7
    creating build/lib.linux-x86_64-3.7/ffrecord
    copying ffrecord/fileio.py -> build/lib.linux-x86_64-3.7/ffrecord
    copying ffrecord/__init__.py -> build/lib.linux-x86_64-3.7/ffrecord
    copying ffrecord/utils.py -> build/lib.linux-x86_64-3.7/ffrecord
    creating build/lib.linux-x86_64-3.7/ffrecord/torch
    copying ffrecord/torch/__init__.py -> build/lib.linux-x86_64-3.7/ffrecord/torch
    copying ffrecord/torch/dataset.py -> build/lib.linux-x86_64-3.7/ffrecord/torch
    copying ffrecord/torch/dataloader.py -> build/lib.linux-x86_64-3.7/ffrecord/torch
    running build_ext
    -- The C compiler identification is GNU 7.5.0
    -- The CXX compiler identification is GNU 7.5.0
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Check for working C compiler: /usr/bin/cc - skipped
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Found PythonInterp: /opt/conda/bin/python (found version "3.7.10") 
    -- Found PythonLibs: /opt/conda/lib/libpython3.7m.so
    -- Performing Test HAS_CPP14_FLAG
    -- Performing Test HAS_CPP14_FLAG - Success
    -- Performing Test HAS_CPP11_FLAG
    -- Performing Test HAS_CPP11_FLAG - Success
    -- Performing Test HAS_LTO_FLAG
    -- Performing Test HAS_LTO_FLAG - Success
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /root/ffrecord/build/temp.linux-x86_64-3.7
    [ 20%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/reader.cpp.o
    [ 40%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/writer.cpp.o
    [ 60%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/utils.cpp.o
    [ 80%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/bindings.cpp.o
    /root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘void ffrecord::WriterWrapper::write_one_wrapper(const pybind11::buffer&)’:
    /root/ffrecord/ffrecord/src/bindings.cpp:22:44: error: passing ‘const pybind11::buffer’ as ‘this’ argument discards qualifiers [-fpermissive]
             py::buffer_info info = buf.request();
                                                ^
    In file included from /usr/include/pybind11/cast.h:13:0,
                     from /usr/include/pybind11/attr.h:13,
                     from /usr/include/pybind11/pybind11.h:36,
                     from /root/ffrecord/ffrecord/src/bindings.cpp:1:
    /usr/include/pybind11/pytypes.h:832:17: note:   in call to ‘pybind11::buffer_info pybind11::buffer::request(bool)’
         buffer_info request(bool writable = false) {
                     ^~~~~~~
    /root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘std::vector<pybind11::array> ffrecord::ReaderWrapper::read_batch_wrapper(const std::vector<long int>&)’:
    /root/ffrecord/ffrecord/src/bindings.cpp:41:59: error: invalid conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
                 auto capsule = py::capsule(b.data, free_buffer);
                                                               ^
    In file included from /usr/include/pybind11/cast.h:13:0,
                     from /usr/include/pybind11/attr.h:13,
                     from /usr/include/pybind11/pybind11.h:36,
                     from /root/ffrecord/ffrecord/src/bindings.cpp:1:
    /usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
         explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
                  ^~~~~~~
    /root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘pybind11::array ffrecord::ReaderWrapper::read_one_wrapper(int64_t)’:
    /root/ffrecord/ffrecord/src/bindings.cpp:49:55: error: invalid conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
             auto capsule = py::capsule(b.data, free_buffer);
                                                           ^
    In file included from /usr/include/pybind11/cast.h:13:0,
                     from /usr/include/pybind11/attr.h:13,
                     from /usr/include/pybind11/pybind11.h:36,
                     from /root/ffrecord/ffrecord/src/bindings.cpp:1:
    /usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
         explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
                  ^~~~~~~
    /root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘pybind11::array_t<long int> ffrecord::ReaderWrapper::get_offsets(int)’:
    /root/ffrecord/ffrecord/src/bindings.cpp:55:58: error: invalid user-defined conversion from ‘ffrecord::ReaderWrapper::get_offsets(int)::<lambda(void*)>’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
             auto capsule = py::capsule(v.data(), [](void*) {});
                                                              ^
    /root/ffrecord/ffrecord/src/bindings.cpp:55:54: note: candidate is: ffrecord::ReaderWrapper::get_offsets(int)::<lambda(void*)>::operator void (*)(void*)() const <near match>
             auto capsule = py::capsule(v.data(), [](void*) {});
                                                          ^
    /root/ffrecord/ffrecord/src/bindings.cpp:55:54: note:   no known conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’
    In file included from /usr/include/pybind11/cast.h:13:0,
                     from /usr/include/pybind11/attr.h:13,
                     from /usr/include/pybind11/pybind11.h:36,
                     from /root/ffrecord/ffrecord/src/bindings.cpp:1:
    /usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
         explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
                  ^~~~~~~
    /root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘pybind11::array_t<unsigned int> ffrecord::ReaderWrapper::get_checksums(int)’:
    /root/ffrecord/ffrecord/src/bindings.cpp:61:58: error: invalid user-defined conversion from ‘ffrecord::ReaderWrapper::get_checksums(int)::<lambda(void*)>’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
             auto capsule = py::capsule(v.data(), [](void*) {});
                                                              ^
    /root/ffrecord/ffrecord/src/bindings.cpp:61:54: note: candidate is: ffrecord::ReaderWrapper::get_checksums(int)::<lambda(void*)>::operator void (*)(void*)() const <near match>
             auto capsule = py::capsule(v.data(), [](void*) {});
                                                          ^
    /root/ffrecord/ffrecord/src/bindings.cpp:61:54: note:   no known conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’
    In file included from /usr/include/pybind11/cast.h:13:0,
                     from /usr/include/pybind11/attr.h:13,
                     from /usr/include/pybind11/pybind11.h:36,
                     from /root/ffrecord/ffrecord/src/bindings.cpp:1:
    /usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
         explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
                  ^~~~~~~
    /root/ffrecord/ffrecord/src/bindings.cpp: At global scope:
    /root/ffrecord/ffrecord/src/bindings.cpp:67:16: error: expected constructor, destructor, or type conversion before ‘(’ token
     PYBIND11_MODULE(_ffrecord_cpp, m) {
                    ^
    CMakeFiles/_ffrecord_cpp.dir/build.make:117: recipe for target 'CMakeFiles/_ffrecord_cpp.dir/bindings.cpp.o' failed
    make[2]: *** [CMakeFiles/_ffrecord_cpp.dir/bindings.cpp.o] Error 1
    CMakeFiles/Makefile2:82: recipe for target 'CMakeFiles/_ffrecord_cpp.dir/all' failed
    make[1]: *** [CMakeFiles/_ffrecord_cpp.dir/all] Error 2
    Makefile:90: recipe for target 'all' failed
    make: *** [all] Error 2
    Traceback (most recent call last):
      File "setup.py", line 24, in <module>
        ext_modules=[cpp_module]
      File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 67, in run
        self.do_egg_install()
      File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 109, in do_egg_install
        self.run_command('bdist_egg')
      File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 164, in run
        cmd = self.call_command('install_lib', warn_dir=0)
      File "/opt/conda/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
        self.run_command(cmdname)
      File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install_lib.py", line 11, in run
        self.build()
      File "/opt/conda/lib/python3.7/distutils/command/install_lib.py", line 107, in build
        self.run_command('build_ext')
      File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/root/ffrecord/cmake_build.py", line 118, in build_extension
        ["cmake", "--build", "."] + build_args, cwd=self.build_temp
      File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2.
    
    bug install 
    opened by jimchenhub 3
  • Error of "RuntimeError: 'ns > 0' failed. Number of submitted requests: -22"

    I used the sample code from the README, but an error occurred at data = self.reader.read(indices) in the __getitem__ method of the ffrecord.torch.dataset module. The detailed error messages follow:


    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "xxxx/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
        fn(i, *args)
      File "xxxx.py", line 172, in worker
        trainer.train(args, gpu_id, rank, train_loader, model, optimizer, scheduler, train_sampler)
      File "xxxx.py", line 39, in train
        for step, batch in enumerate(loader):
      File "xxxx/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
        data = self._next_data()
      File "xxxx/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
        return self._process_data(data)
      File "xxxx/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
        data.reraise()
      File "xxxx/python3.8/site-packages/site-packages/torch/_utils.py", line 457, in reraise
        raise exception
    RuntimeError: Caught RuntimeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "xxxx/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "xxxx/python3.8/site-packages/ffrecord-1.3.2+35c6863-py3.8-linux-x86_64.egg/ffrecord/torch/dataloader.py", line 151, in fetch
        data = self.dataset[indexes]
      File "xxx.py", line 34, in __getitem__
        data = self.reader.read(indices)
    RuntimeError: 'ns > 0' failed. Number of submitted requests: -22
    Error in std::vector<ffrecord::MemBlock> ffrecord::FileReader::read_batch(const std::vector<long int>&) at xxx/ffrecord/ffrecord/src/reader.cpp line 225
    

    What might be the cause of this error?

    opened by xlxwalex 7