NLP Translation and Classification

The repository contains a method for classifying and cleaning text using NLP transformers.

Overview

The input data are web-scraped product names gathered from various e-shops. The products are either monitors or printers. Each scraped name contains the product brand and the product model name, but also unwanted noise: irrelevant information about the item. Additionally, only some records are relevant, meaning that they belong to the correct category (monitor or printer), while other records belong to unwanted categories such as accessories or TVs.

The goal is to preprocess the web-scraped data by removing noisy records and cleaning the product names. Preliminary experiments showed that classic machine learning methods, such as TF-IDF vectorization with a classifier, struggled to achieve good results. Instead, NLP transformers were employed:

  • First, DistilBERT was used to remove irrelevant records. The available data are monitors with annotated labels, where each record is classified into one of three classes: "Monitor", "TV", and "Noise".
  • Next, T5 was applied to clean product names by translating the scraped name into a clean name containing only the product brand and the product model name. For instance, for the input "monitor led aoc 24g2e 24" ips 1080 ..." the desired output is "aoc | 24g2e". The available data are monitors and printers with annotated targets.
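
The two steps above can be sketched as a small pipeline. This is only an illustration of the control flow, not the code from the notebooks: the function names, the type aliases, and the stand-in "toy" models are hypothetical, and in practice the two callables would wrap the fine-tuned DistilBERT and T5 checkpoints.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type aliases; the notebooks may wire the models differently.
Classifier = Callable[[str], str]  # scraped name -> "Monitor" | "TV" | "Noise"
Translator = Callable[[str], str]  # scraped name -> "brand | model"


def parse_clean_name(clean_name: str) -> Optional[Tuple[str, str]]:
    """Split a 'brand | model' string; return None if the format is unexpected."""
    parts = [part.strip() for part in clean_name.split("|")]
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def clean_products(
    names: Iterable[str],
    classify: Classifier,
    translate: Translator,
    keep_label: str = "Monitor",
) -> List[Tuple[str, str, str]]:
    """Drop irrelevant records, then return (scraped name, brand, model) rows."""
    results = []
    for name in names:
        if classify(name) != keep_label:
            continue  # step 1: filter out records such as "TV" and "Noise"
        parsed = parse_clean_name(translate(name))  # step 2: clean the name
        if parsed is not None:
            results.append((name, *parsed))
    return results


# Stand-in models for demonstration only (assumptions, not the real models).
def toy_classify(name: str) -> str:
    return "Monitor" if "monitor" in name else "Noise"


def toy_translate(name: str) -> str:
    tokens = name.split()
    return f"{tokens[2]} | {tokens[3]}"


rows = clean_products(
    ['monitor led aoc 24g2e 24" ips 1080', "hdmi cable 2m black"],
    toy_classify,
    toy_translate,
)
print(rows)
```

Passing the models in as plain callables keeps the filtering and parsing logic independent of any particular framework, so it can be exercised without loading a checkpoint.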

The datasets are split into training, validation and test sets without overlapping records.
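
One way to obtain non-overlapping subsets is to deduplicate the records before splitting. The sketch below is a minimal illustration under assumptions: the split fractions, the seed, and the function name are hypothetical, since the exact procedure is described only in the notebooks.

```python
import random
from typing import List, Sequence, Tuple


def split_without_overlap(
    records: Sequence[str],
    val_fraction: float = 0.1,   # assumed ratio, not taken from the repository
    test_fraction: float = 0.1,  # assumed ratio, not taken from the repository
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Deduplicate records, shuffle them, and cut three disjoint subsets."""
    unique = sorted(set(records))        # duplicates would leak across subsets
    random.Random(seed).shuffle(unique)  # seeded so the split is reproducible
    n_test = int(len(unique) * test_fraction)
    n_val = int(len(unique) * val_fraction)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test


# Made-up records, two of which are duplicated on purpose.
names = [f"product {i}" for i in range(20)] + ["product 0", "product 1"]
train, val, test = split_without_overlap(names)
print(len(train), len(val), len(test))
```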

The results and details of the training and evaluation procedures can be found in the Jupyter Notebooks; see the Content section below.

Content

The repository contains Jupyter Notebooks for training and evaluating the neural networks:

  • 01_data_exploration.ipynb - The notebook explores the datasets for sequence classification and translation. It includes visualizations of the target distributions and an overview of the available metadata.
  • 02a_classification_fine_tuning.ipynb - The notebook fine-tunes a DistilBERT classifier using training and validation sets, and saves the trained checkpoint.
  • 02b_classification_evaluation.ipynb - The notebook evaluates classification scores on the test set. It includes: a classification report with precision, recall and F1 scores; and a confusion matrix.
  • 03a_translation_fine_tuning.ipynb - The notebook fine-tunes a T5 translation network using training and validation sets, and saves the trained checkpoint.
  • 03b_translation_evaluation.ipynb - The notebook evaluates translation metrics on the test set. The metrics are: Text Accuracy (exact match of target and predicted sequences); Levenshtein Score (Levenshtein distance normalized and inverted so that 1 is the best and 0 is the worst); and Jaccard Index.
  • 04_benchmarking.ipynb - The notebook evaluates GPU memory and time needed for running inference on DistilBERT and T5 models using various values of batch size and sequence length.
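
The translation metrics listed above can be sketched in plain Python as follows. The details are assumptions rather than the notebook's exact definitions: the Levenshtein Score is normalized here by the length of the longer string, and the Jaccard Index is computed over whitespace-separated tokens. The example strings are made up.

```python
from typing import Sequence


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_score(target: str, predicted: str) -> float:
    """1 - distance / max length: 1 is a perfect match, 0 is the worst."""
    longest = max(len(target), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(target, predicted) / longest


def jaccard_index(target: str, predicted: str) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    a, b = set(target.split()), set(predicted.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def text_accuracy(targets: Sequence[str], predictions: Sequence[str]) -> float:
    """Fraction of predictions that match their target exactly."""
    if not targets:
        raise ValueError("targets must not be empty")
    matches = sum(t == p for t, p in zip(targets, predictions))
    return matches / len(targets)


# Made-up targets and predictions for illustration.
targets = ["aoc | 24g2e", "acme | p100w"]
predictions = ["aoc | 24g2e", "acme | p100a"]
print(text_accuracy(targets, predictions))
print(levenshtein_score(targets[1], predictions[1]))
print(jaccard_index(targets[1], predictions[1]))
```

Text Accuracy is the strictest of the three; the Levenshtein Score and Jaccard Index give partial credit when a prediction is close, such as a model name that differs by one character.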

Getting Started

Package Dependencies

The method was developed using Python 3.7 and the transformers 4.8 framework, with PyTorch 1.9 as the machine learning backend. Additionally, the repository requires the packages numpy, pandas, matplotlib, and datasets.

To install the required packages with PyTorch for CPU, run:

pip install -r requirements.txt

For PyTorch with GPU run:

pip install -r requirements_gpu.txt

The requirements files do not include jupyterlab or any other IDE. To install jupyterlab, run:

pip install jupyterlab

Contact

Rail Chamidullin - [email protected] - Github account
