ICE Tokenizer

Token IDs in [0, 20000) are image tokens.
Token IDs in [20000, 20100) are common tokens, mainly punctuation. E.g., icetk[20000] == ' ', icetk[20003] == ' ', icetk[20006] == ','.
Token IDs in [20100, 83823) are English tokens.
Token IDs in [83823, 145653) are Chinese tokens.
Token IDs in [145653, 150000) are rare tokens. E.g., icetk[145803] == 'α'.
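
The ranges above are half-open and contiguous, so a token ID can be mapped to its category with a simple boundary lookup. The following is a minimal sketch that does not depend on icetk itself; the function name `token_category` and the category labels are hypothetical names chosen for illustration, and only the boundary values come from the list above.

```python
from bisect import bisect_right

# Lower bounds of each range plus the final upper bound, copied from the list above.
_BOUNDARIES = [0, 20000, 20100, 83823, 145653, 150000]
# Illustrative labels (not part of the icetk API), one per half-open range.
_CATEGORIES = ["image", "common", "english", "chinese", "rare"]


def token_category(token_id: int) -> str:
    """Return the label of the half-open range that contains token_id."""
    if not 0 <= token_id < _BOUNDARIES[-1]:
        raise ValueError(f"token id {token_id} is outside [0, {_BOUNDARIES[-1]})")
    # bisect_right counts the boundaries <= token_id; subtracting 1 gives the range index.
    return _CATEGORIES[bisect_right(_BOUNDARIES, token_id) - 1]


if __name__ == "__main__":
    for tid in (12738, 20006, 39316, 94874, 145803):
        print(tid, token_category(tid))
```

Such a helper can be handy for sanity-checking mixed outputs, for example to confirm that a text-only encoding contains no IDs from the image range.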

You can install the package via

pip install icetk

Tokenization

from icetk import icetk
tokens = icetk.tokenize('Hello World! I am icetk.')
# tokens == ['▁Hello', '▁World', '!', '▁I', '▁am', '▁ice', 'tk', '.']
ids = icetk.encode('Hello World! I am icetk.')
# ids == [39316, 20932, 20035, 20115, 20344, 22881, 35955, 20007]
en = icetk.decode(ids)
# en == 'Hello World! I am icetk.'  # decoding always recovers the input exactly (as long as it contains no unknown tokens)

ids = icetk.encode('你好世界！这里是 icetk。')
# ids == [20005, 94874, 84097, 20035, 94947, 22881, 35955, 83823]

ids = icetk.encode(image_path='test.jpeg', image_size=256, compress_rate=8)
# ids == tensor([[12738, 12430, 10398,  ...,  7236, 12844, 12386]], device='cuda:0')
# ids.shape == torch.Size([1, 1024])
img = icetk.decode(image_ids=ids, compress_rate=8)
# img.shape == torch.Size([1, 3, 256, 256])
from torchvision.utils import save_image
save_image(img, 'recover.jpg')
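
The shapes in the image example are consistent with a simple rule: a 256-pixel image with compress_rate=8 yields a 32 x 32 grid, i.e. 1024 tokens. The sketch below works through that arithmetic. It assumes a square image where each token covers a compress_rate x compress_rate pixel patch; this matches the example above but is not confirmed for other settings, and `image_token_count` is a hypothetical helper, not part of icetk.

```python
def image_token_count(image_size: int, compress_rate: int) -> int:
    """Number of tokens for a square image, assuming one token per
    compress_rate x compress_rate pixel patch (an assumption, see above)."""
    if image_size <= 0 or compress_rate <= 0:
        raise ValueError("image_size and compress_rate must be positive")
    if image_size % compress_rate != 0:
        raise ValueError("image_size must be divisible by compress_rate")
    side = image_size // compress_rate  # tokens along one edge of the grid
    return side * side


if __name__ == "__main__":
    # Matches ids.shape == torch.Size([1, 1024]) in the example above.
    print(image_token_count(256, 8))
```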

A unified tokenization tool for Images, Chinese and English.

Owner

THUDM
