Train BPE with fastBPE and load the result into a Huggingface Tokenizer.

Last update: Dec 23, 2021

BPEer

Description

Huggingface's BpeTrainer consumes a lot of memory when training on a large corpus (e.g. 50,000 merges on a 20 GB corpus), and I ran into a memory error.

So I use fastBPE (implemented in C++) instead, which outputs a list of merge operations.

However, I still want to use the Huggingface Tokenizer API, so I wrote a simple converter that generates the JSON file for a Huggingface Tokenizer.
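The conversion idea can be sketched in a few lines of Python. This is not the repository's convertjs.py, only a minimal illustration under stated assumptions: each line of the fastBPE output is assumed to have the form `left right count`, the function names and the `<unk>` token are hypothetical, and the vocabulary is built only from symbols that appear in merges (characters that never take part in a merge would have to be added separately). Instead of a single JSON file, the sketch writes a `vocab.json` / `merges.txt` pair.

```python
import json


def parse_merges(lines):
    """Parse merge lines of the assumed form 'left right count'."""
    merges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        merges.append((parts[0], parts[1]))
    return merges


def build_vocab(merges, unk_token="<unk>"):
    """Assign an id to every symbol seen in a merge and to every merged symbol.

    unk_token is a hypothetical choice, not something fastBPE produces.
    """
    vocab = {unk_token: 0}
    for left, right in merges:
        for symbol in (left, right):
            vocab.setdefault(symbol, len(vocab))
    for left, right in merges:
        vocab.setdefault(left + right, len(vocab))
    return vocab


def convert(codes_path, vocab_path, merges_path):
    """Read fastBPE merges and write a vocab.json / merges.txt pair."""
    with open(codes_path, encoding="utf-8") as f:
        merges = parse_merges(f)
    vocab = build_vocab(merges)
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    with open(merges_path, "w", encoding="utf-8") as f:
        for left, right in merges:
            f.write(f"{left} {right}\n")
    return vocab, merges
```

Calling `convert("allvocab", "vocab.json", "merges.txt")` would produce two files that the `tokenizers` library should be able to read with `BPE.from_file`; whether extra options such as an end-of-word suffix are needed depends on how the merges were produced.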

Usage

Train BPE:

    cd fastBPE
    ./fast learnbpe [merges, e.g. 50000] [train.txt] > allvocab

Convert to json:

    python convertjs.py

Warning

This tokenizer does not mark the start of a word.

E.g. the BPE results for "I am" and "Iam" may be the same. Split the sentence on whitespace before tokenizing, and mark the first sub-token of each word yourself:

    tokens = []
    for word in "I am".split():
        subs = tokenizer.tokenize(word)
        # Prefix the first sub-token with a space to mark the start of the word.
        subs[0] = " " + subs[0]
        tokens.extend(subs)

This yields [" I", "am"] for "Iam" and [" I", " am"] for "I am".
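The same workaround can be wrapped in a small helper. The sketch below is only an illustration: `mark_word_starts` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a stand-in lookup table rather than a real BPE tokenizer, so that the example runs without a trained model.

```python
def mark_word_starts(sentence, tokenize, marker=" "):
    """Tokenize word by word and prefix each word's first sub-token with marker.

    tokenize is any callable that maps a single word to a list of sub-tokens.
    """
    tokens = []
    for word in sentence.split():
        subs = list(tokenize(word))
        if subs:  # guard against a tokenizer that returns nothing
            subs[0] = marker + subs[0]
        tokens.extend(subs)
    return tokens


def toy_tokenize(word):
    """Stand-in for a real BPE tokenizer (hypothetical, for illustration only)."""
    table = {"I": ["I"], "am": ["am"], "Iam": ["I", "am"]}
    return table[word]


print(mark_word_starts("I am", toy_tokenize))  # [' I', ' am']
print(mark_word_starts("Iam", toy_tokenize))   # [' I', 'am']
```

With a real tokenizer, the `tokenize` argument would be its own word-level tokenize function.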

Owner

Lizhuo

The dual personality of antinomy
