FireFlyer Record file format, writer and reader for DL training samples.

Last update: Jan 04, 2023

Related tags

Overview

FFRecord

The FFRecord format is a simple format for storing a sequence of binary records developed by HFAiLab, which supports random access and Linux Asynchronous Input/Output (AIO) read.

File Format

Storage Layout:

+-----------------------------------+---------------------------------------+
|         checksum                  |             N                         |
+-----------------------------------+---------------------------------------+
|         checksums                 |           offsets                     |
+---------------------+---------------------+--------+----------------------+
|      sample 1       |      sample 2       | ....   |      sample N        |
+---------------------+---------------------+--------+----------------------+

Fields:

field	size (bytes)	description
checksum	4	CRC32 checksum of metadata
N	8	number of samples
checksums	4 * N	CRC32 checksum of each sample
offsets	8 * N	byte offset of each sample
sample i	offsets[i + 1] - offsets[i]	data of the i-th sample

Get Started

Requirements

OS: Linux
Python >= 3.6
Pytorch >= 1.6
libaio 0.8.3
NumPy

Install

pip3 install ffrecord

Usage

We provide ffrecord.FileWriter and ffrecord.FileReader for reading and writing, respectively.

Write

To create a FileWriter object, you need to specify a file name and the total number of samples. And then you could call FileWriter.write_one() to write a sample to the FFRecord file. It accepts bytes or bytearray as input and appends the data to the end of the opened file.

from ffrecord import FileWriter


def serialize(sample):
    """ Serialize a sample to bytes or bytearray

    You could use anything you like to serialize the sample.
    Here we simply use pickle.dumps().
    """
    return pickle.dumps(sample)


samples = [i for i in range(100)]  # anything you would like to store
fname = 'test.ffr'
n = len(samples)  # number of samples to be written
writer = FileWriter(fname, n)

for i in range(n):
    data = serialize(samples[i])  # data should be bytes or bytearray
    writer.write_one(data)

writer.close()

Read

To create a FileReader object, you only need to specify the file name. And then you could call FileWriter.read() to read multiple samples from the FFReocrd file. It accepts a list of indices as input and outputs the corresponding samples data.

The reader would validate the checksum before returning the data if check_data = True.

from ffrecord import FileReader


def deserialize(data):
    """ deserialize bytes data

    The deserialize method should be paired with the serialize method above.
    """
    return pickle.loads(data)


fname = 'test.ffr'
reader = FileReader(fname, check_data=True)
print(f'Number of samples: {reader.n}')

indices = [3, 6, 0, 10]      # indices of each sample
data = reader.read(indices)  # return a list of bytes data

for i in range(n):
    sample = deserialize(data[i])
    # do what you want

reader.close()

Dataset and DataLoader for PyTorch

We also provide ffrecord.torch.Dataset and ffrecord.torch.DataLoader for PyTorch users to train models using FFRecord.

Different from torch.utils.data.Dataset which accepts an index as input and returns one sample, ffrecord.torch.Dataset accepts a batch of indices as input and returns a batch of samples. One advantage of ffrecord.torch.Dataset is that it could read a batch of data at a time using Linux AIO.

We first read a batch of bytes data from the FFReocrd file and then pass the bytes data to process() function. Users need to inherit from ffrecord.torch.Dataset and define their custom process() function.

Pipline:   indices ----------------------------> bytes -------------> samples
                      reader.read(indices)               process()

For example:

class CustomDataset(ffrecord.torch.Dataset):

    def __init__(self, fname, check_data=True, transform=None):
        super().__init__(fname, check_data)
        self.transform = transform

    def process(self, indices, data):
        # deserialize data
        samples = [pickle.loads(b) for b in data]

        # transform data
        if self.transform:
            samples = [self.transform(s) for s in samples]
        return samples

dataset = CustomDataset('train.ffr')
indices = [3, 4, 1, 0]
samples = dataset[indices]

ffrecord.torch.Dataset could be combined with ffrecord.torch.DataLoader just like PyTorch.

dataset = CustomDataset('train.ffr')
loader = ffrecord.torch.DataLoader(dataset,
                                   batch_size=16,
                                   shuffle=True,
                                   num_workers=8)

for i, batch in enumerate(loader):
    # training model

Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.

Word2Wave is a simple method for text-controlled GAN audio generation. You can either follow the setup instructions below and use the source code and CLI provided in this repo or you can have a play around in the Colab notebook provided. Note that, in both cases, you will need to train a WaveGAN model first

91 Dec 23, 2022

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

Text to speech (using Python) Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and co

19 Jun 30, 2022

Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

背景安装教程快速上手（一）预训练模型（二）机器翻译（三）文本分类 TenTrans 进阶 1. 多语言机器翻译 2. 跨语言预训练背景 TrenTrans是一个统一的端到端的多语言多任务预训练平台，支持多种预训练方式，以及序列生成和自然语言理解任务。安装教程 git clone git

Tencent Minority-Mandarin Translation Team

42 Dec 20, 2022

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

End-to-end neural table-text understanding models.

914 Jan 7, 2023

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

60 Sep 26, 2022

The Sudachi synonym dictionary in Solar format.

solr-sudachi-synonyms The Sudachi synonym dictionary in Solar format. Summary Run a script that checks for updates to the Sudachi dictionary every hou

3 Aug 19, 2022

Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

169 Dec 21, 2022

Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Tevatron Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized

193 Jan 4, 2023

Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

70 Dec 12, 2022

Comments

install error

When I install ffrecord with python setup.py install, it failed with the following errors:

running install
running bdist_egg
running egg_info
creating ffrecord.egg-info
writing ffrecord.egg-info/PKG-INFO
writing dependency_links to ffrecord.egg-info/dependency_links.txt
writing requirements to ffrecord.egg-info/requires.txt
writing top-level names to ffrecord.egg-info/top_level.txt
writing manifest file 'ffrecord.egg-info/SOURCES.txt'
reading manifest file 'ffrecord.egg-info/SOURCES.txt'
writing manifest file 'ffrecord.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/ffrecord
copying ffrecord/fileio.py -> build/lib.linux-x86_64-3.7/ffrecord
copying ffrecord/__init__.py -> build/lib.linux-x86_64-3.7/ffrecord
copying ffrecord/utils.py -> build/lib.linux-x86_64-3.7/ffrecord
creating build/lib.linux-x86_64-3.7/ffrecord/torch
copying ffrecord/torch/__init__.py -> build/lib.linux-x86_64-3.7/ffrecord/torch
copying ffrecord/torch/dataset.py -> build/lib.linux-x86_64-3.7/ffrecord/torch
copying ffrecord/torch/dataloader.py -> build/lib.linux-x86_64-3.7/ffrecord/torch
running build_ext
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /opt/conda/bin/python (found version "3.7.10") 
-- Found PythonLibs: /opt/conda/lib/libpython3.7m.so
-- Performing Test HAS_CPP14_FLAG
-- Performing Test HAS_CPP14_FLAG - Success
-- Performing Test HAS_CPP11_FLAG
-- Performing Test HAS_CPP11_FLAG - Success
-- Performing Test HAS_LTO_FLAG
-- Performing Test HAS_LTO_FLAG - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /root/ffrecord/build/temp.linux-x86_64-3.7
[ 20%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/reader.cpp.o
[ 40%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/writer.cpp.o
[ 60%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/utils.cpp.o
[ 80%] Building CXX object CMakeFiles/_ffrecord_cpp.dir/bindings.cpp.o
/root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘void ffrecord::WriterWrapper::write_one_wrapper(const pybind11::buffer&)’:
/root/ffrecord/ffrecord/src/bindings.cpp:22:44: error: passing ‘const pybind11::buffer’ as ‘this’ argument discards qualifiers [-fpermissive]
         py::buffer_info info = buf.request();
                                            ^
In file included from /usr/include/pybind11/cast.h:13:0,
                 from /usr/include/pybind11/attr.h:13,
                 from /usr/include/pybind11/pybind11.h:36,
                 from /root/ffrecord/ffrecord/src/bindings.cpp:1:
/usr/include/pybind11/pytypes.h:832:17: note:   in call to ‘pybind11::buffer_info pybind11::buffer::request(bool)’
     buffer_info request(bool writable = false) {
                 ^~~~~~~
/root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘std::vector<pybind11::array> ffrecord::ReaderWrapper::read_batch_wrapper(const std::vector<long int>&)’:
/root/ffrecord/ffrecord/src/bindings.cpp:41:59: error: invalid conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
             auto capsule = py::capsule(b.data, free_buffer);
                                                           ^
In file included from /usr/include/pybind11/cast.h:13:0,
                 from /usr/include/pybind11/attr.h:13,
                 from /usr/include/pybind11/pybind11.h:36,
                 from /root/ffrecord/ffrecord/src/bindings.cpp:1:
/usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
     explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
              ^~~~~~~
/root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘pybind11::array ffrecord::ReaderWrapper::read_one_wrapper(int64_t)’:
/root/ffrecord/ffrecord/src/bindings.cpp:49:55: error: invalid conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
         auto capsule = py::capsule(b.data, free_buffer);
                                                       ^
In file included from /usr/include/pybind11/cast.h:13:0,
                 from /usr/include/pybind11/attr.h:13,
                 from /usr/include/pybind11/pybind11.h:36,
                 from /root/ffrecord/ffrecord/src/bindings.cpp:1:
/usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
     explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
              ^~~~~~~
/root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘pybind11::array_t<long int> ffrecord::ReaderWrapper::get_offsets(int)’:
/root/ffrecord/ffrecord/src/bindings.cpp:55:58: error: invalid user-defined conversion from ‘ffrecord::ReaderWrapper::get_offsets(int)::<lambda(void*)>’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
         auto capsule = py::capsule(v.data(), [](void*) {});
                                                          ^
/root/ffrecord/ffrecord/src/bindings.cpp:55:54: note: candidate is: ffrecord::ReaderWrapper::get_offsets(int)::<lambda(void*)>::operator void (*)(void*)() const <near match>
         auto capsule = py::capsule(v.data(), [](void*) {});
                                                      ^
/root/ffrecord/ffrecord/src/bindings.cpp:55:54: note:   no known conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’
In file included from /usr/include/pybind11/cast.h:13:0,
                 from /usr/include/pybind11/attr.h:13,
                 from /usr/include/pybind11/pybind11.h:36,
                 from /root/ffrecord/ffrecord/src/bindings.cpp:1:
/usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
     explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
              ^~~~~~~
/root/ffrecord/ffrecord/src/bindings.cpp: In member function ‘pybind11::array_t<unsigned int> ffrecord::ReaderWrapper::get_checksums(int)’:
/root/ffrecord/ffrecord/src/bindings.cpp:61:58: error: invalid user-defined conversion from ‘ffrecord::ReaderWrapper::get_checksums(int)::<lambda(void*)>’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’ [-fpermissive]
         auto capsule = py::capsule(v.data(), [](void*) {});
                                                          ^
/root/ffrecord/ffrecord/src/bindings.cpp:61:54: note: candidate is: ffrecord::ReaderWrapper::get_checksums(int)::<lambda(void*)>::operator void (*)(void*)() const <near match>
         auto capsule = py::capsule(v.data(), [](void*) {});
                                                      ^
/root/ffrecord/ffrecord/src/bindings.cpp:61:54: note:   no known conversion from ‘void (*)(void*)’ to ‘void (*)(PyObject*) {aka void (*)(_object*)}’
In file included from /usr/include/pybind11/cast.h:13:0,
                 from /usr/include/pybind11/attr.h:13,
                 from /usr/include/pybind11/pybind11.h:36,
                 from /root/ffrecord/ffrecord/src/bindings.cpp:1:
/usr/include/pybind11/pytypes.h:734:14: note:   initializing argument 2 of ‘pybind11::capsule::capsule(const void*, void (*)(PyObject*))’
     explicit capsule(const void *value, void (*destruct)(PyObject *) = nullptr)
              ^~~~~~~
/root/ffrecord/ffrecord/src/bindings.cpp: At global scope:
/root/ffrecord/ffrecord/src/bindings.cpp:67:16: error: expected constructor, destructor, or type conversion before ‘(’ token
 PYBIND11_MODULE(_ffrecord_cpp, m) {
                ^
CMakeFiles/_ffrecord_cpp.dir/build.make:117: recipe for target 'CMakeFiles/_ffrecord_cpp.dir/bindings.cpp.o' failed
make[2]: *** [CMakeFiles/_ffrecord_cpp.dir/bindings.cpp.o] Error 1
CMakeFiles/Makefile2:82: recipe for target 'CMakeFiles/_ffrecord_cpp.dir/all' failed
make[1]: *** [CMakeFiles/_ffrecord_cpp.dir/all] Error 2
Makefile:90: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
  File "setup.py", line 24, in <module>
    ext_modules=[cpp_module]
  File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 67, in run
    self.do_egg_install()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 109, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/lib/python3.7/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/opt/conda/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/root/ffrecord/cmake_build.py", line 118, in build_extension
    ["cmake", "--build", "."] + build_args, cwd=self.build_temp
  File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2.

bug install

opened by jimchenhub 3

0' failed. Number of submitted requests: -22"">

Error of "RuntimeError: 'ns > 0' failed. Number of submitted requests: -22"

I apply the sample code from README, but an error occurred in data = self.reader.read(indices) of the __getitem__ method in ffrecord.torch.dataset module. The following are more detailed error messages:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "xxxx/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "xxxx.py", line 172, in worker
    trainer.train(args, gpu_id, rank, train_loader, model, optimizer, scheduler, train_sampler)
  File "xxxx.py", line 39, in train
    for step, batch in enumerate(loader):
  File "xxxx/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "xxxx/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "xxxx/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "xxxx/python3.8/site-packages/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "xxxx/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "xxxx/python3.8/site-packages/ffrecord-1.3.2+35c6863-py3.8-linux-x86_64.egg/ffrecord/torch/dataloader.py", line 151, in fetch
    data = self.dataset[indexes]
  File "xxx.py", line 34, in __getitem__
    data = self.reader.read(indices)
RuntimeError: 'ns > 0' failed. Number of submitted requests: -22
Error in std::vector<ffrecord::MemBlock> ffrecord::FileReader::read_batch(const std::vector<long int>&) at xxx/ffrecord/ffrecord/src/reader.cpp line 225

What might be the cause of this error?

opened by xlxwalex 7

Releases(v1.4.0)

v1.4.0(Oct 11, 2022)

v1.4.0
Source code(tar.gz)
Source code(zip)
ffrecord-1.4.0+0cebd18-cp36-cp36m-linux_x86_64.whl(100.86 KB)
ffrecord-1.4.0+0cebd18-cp37-cp37m-linux_x86_64.whl(100.91 KB)
ffrecord-1.4.0+0cebd18-cp38-cp38-linux_x86_64.whl(97.99 KB)
v1.3.2(May 19, 2022)

Download wheels from here.
Source code(tar.gz)
Source code(zip)
v1.3.0(Mar 11, 2022)

v1.3.0
Source code(tar.gz)
Source code(zip)
ffrecord-1.3.0+c5e3012-cp36-cp36m-linux_x86_64.whl(83.47 KB)
ffrecord-1.3.0+c5e3012-cp38-cp38-linux_x86_64.whl(90.73 KB)

Owner

GitHub Repository

CMeEE 数据集医学实体抽取

医学实体抽取_GlobalPointer_torch 介绍思想来自于苏神 GlobalPointer，原始版本是基于keras实现的，模型结构实现参考现有 pytorch 复现代码【感谢!】，基于torch百分百复现苏神原始效果。数据集中文医学命名实体数据集点这里申请，很简单，共包含九类医学

85 Dec 28, 2022

Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

3 Nov 22, 2022

PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit.

PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for t

922 Dec 31, 2022

Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

14 Dec 12, 2022

Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

1.2k Dec 17, 2022

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

297 Dec 29, 2022

Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

2 Oct 10, 2022

Kinky furry assitant based on GPT2

KinkyFurs-V0 Kinky furry assistant based on GPT2 How to run python3 V0.py then, open web browser and go to localhost:8080 Requirements: Flask trans

1 Jun 11, 2022

An extension for asreview implements a version of the tf-idf feature extractor that saves the matrix and the vocabulary.

Extension - matrix and vocabulary extractor for TF-IDF and Doc2Vec An extension for ASReview that adds a tf-idf extractor that saves the matrix and th

4 Jun 17, 2022

jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese.

jel: Japanese Entity Linker jel - Japanese Entity Linker - is Bi-encoder based entity linker for japanese. Usage Currently, link and question methods

10 Jan 06, 2023

AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

37 Nov 29, 2022

fastai ulmfit - Pretraining the Language Model, Fine-Tuning and training a Classifier

fast.ai ULMFiT with SentencePiece from pretraining to deployment Motivation: Why even bother with a non-BERT / Transformer language model? Short answe

26 May 27, 2022

NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

2 Mar 29, 2022

CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985 赛题描述详见：https://www.datafountain.cn/competitions/474 文件说明 data: 存放训练数据和测试数据以及预处理代码 model_bert.py: 网络模型结构定义 adv_train

40 Sep 28, 2022

The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

7 Sep 22, 2022

(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022

Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

41 Jan 03, 2023

🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023

Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

5 Oct 24, 2022

L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources.

L3Cube-MahaCorpus L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual

21 Dec 17, 2022