An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Overview

BERTify

This is an easy-to-use python module that helps you to extract the BERT embeddings for a large text dataset efficiently. It is intended to be used for Bengali and English texts.

Specially, optimized for usability in limited computational setups (i.e. free colab/kaggle GPUs). Extracting embeddings for IMDB dataset (a list of 25000 texts) took less than ~28 mins. on Colab's GPU. (Haven't perform any hardcore benchmark, so take these numbers with a grain of salt).

Requirements

  • numpy
  • torch
  • tqdm
  • transformers

Quick Installation

$ pip install git+https://github.com/khalidsaifullaah/BERTify

Usage

num. of texts, 4096 -> embedding dim.) # Example 2: English Embedding Extraction en_bertify = BERTify( lang="en", last_four_layers_embedding=True ) # bn_bertify.batch_size = 96 texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."] en_embeddings = en_bertify.embedding(texts) # shape of the returned matrix in this example 3x3072 (3 -> num. of texts, 3072 -> embedding dim.) ">
from bertify import BERTify

# Example 1: Bengali Embedding Extraction
bn_bertify = BERTify(
    lang="bn",  # language of your text.
    last_four_layers_embedding=True  # to get richer embeddings.
)

# By default, `batch_size` is set to 64. Set `batch_size` higher for making things even faster but higher value than 96 may throw `CUDA out of memory` on Colab's GPU, so try at your own risk.

# bn_bertify.batch_size = 96

# A list of texts that we want the embedding for, can be one or many. (You can turn your whole dataset into a list of texts and pass it into the method for faster embedding extraction)
texts = ["বিখ্যাত হওয়ার প্রথম পদক্ষেপ", "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে", "বেশিরভাগ মানুষের পছন্দের জিনিস হচ্ছে"]

bn_embeddings = bn_bertify.embedding(texts)   # returns numpy matrix 
# shape of the returned matrix in this example 3x4096 (3 -> num. of texts, 4096 -> embedding dim.)




# Example 2: English Embedding Extraction
en_bertify = BERTify(
    lang="en",
    last_four_layers_embedding=True
)

# bn_bertify.batch_size = 96

texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."]
en_embeddings = en_bertify.embedding(texts) 
# shape of the returned matrix in this example 3x3072 (3 -> num. of texts, 3072 -> embedding dim.)

Tips

  • Try passing all your text data through the .embedding() function at once by turning it into a list of texts.
  • For faster inference, make sure you're using your colab/kaggle GPU while making the .embedding() call
  • Try increasing the batch_size to make it even faster, by default we're using 64 (to be on the safe side) which doesn't throw any CUDA out of memory but I believe we can go even further. Thanks to Alex, from his empirical findings, it seems like it can be pushed until 96. So, before making the .embedding() call, you can do bertify.batch_zie=96 to set a larger batch_zie

Definitions


class BERTify(lang: str = "en", last_four_layers_embedding: bool = False)


A module for extracting embedding from BERT model for Bengali or English text datasets. For 'en' -> English data, it uses bert-base-uncased model embeddings, for 'bn' -> Bengali data, it uses sahajBERT model embeddings.

Parameters:

lang (str, optional): language of your data. Currently supports only 'en' and 'bn'. Defaults to 'en'. last_four_layers_embedding (bool, optional): BERT paper discusses they've reached the best results by concatenating the output of the last four layers, so if this argument is set to True, your embedding vector would be (for bert-base model for example) 4*768=3072 dimensional, otherwise it'd be 768 dimensional. Defaults to False.


def BERTify.embedding(texts: List[str])


The embedding function, that takes a list of texts, feed them through the model and returns a list of embeddings.

Parameters:

texts (List[str]): A list of texts, that you want to extract embedding for (e.g. ["This movie was a total waste of time.", "Whoa! Loved this movie, totally loved all the characters"])

Returns:

np.ndarray: A numpy matrix of shape num_of_texts x embedding_dimension

License

MIT License.

Owner
Khalid Saifullah
love to learn new things.
Khalid Saifullah
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Tevatron Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized

texttron 193 Jan 04, 2023
☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Accuracy of BBC Weather forecasts for Honolulu This repository records the forecasts made by BBC Weather for the city of Honolulu, USA. Essentially, t

Max Halford 12 Oct 15, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 02, 2022
Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Tensorflow Implementation of A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Ankur Dhuriya 10 Oct 13, 2022
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
Levenshtein and Hamming distance computation

distance - Utilities for comparing sequences This package provides helpers for computing similarities between arbitrary sequences. Included metrics ar

112 Dec 22, 2022
自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器

ja-timex 自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器 概要 ja-timex は、現代日本語で書かれた自然文に含まれる時間情報表現を抽出しTIMEX3と呼ばれるアノテーション仕様に変換することで、プログラムが利用できるような形に規格化するルールベースの解析器です。

Yuki Okuda 116 Nov 09, 2022
Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022
Pytorch implementation of winner from VQA Chllange Workshop in CVPR'17

2017 VQA Challenge Winner (CVPR'17 Workshop) pytorch implementation of Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challeng

Mark Dong 166 Dec 11, 2022
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Text classification is one of the popular tasks in NLP that allows a program to classify free-text documents based on pre-defined classes.

Deep-Learning-for-Text-Document-Classification Text classification is one of the popular tasks in NLP that allows a program to classify free-text docu

Happy N. Monday 2 Mar 17, 2022
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, a Aho-Corasick impleme

Ben 57 Dec 16, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
Guide to using pre-trained large language models of source code

Large Models of Source Code I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe

Vincent Hellendoorn 947 Dec 28, 2022
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022