Weakly-supervised Text Classification Based on Keyword Graph

How to run?

Download data

Our datasets follow previous work: for long texts we follow Conwea, and for short texts we follow LOTClass.
We convert all of their data into a unified json format.

  1. Download datasets from: https://drive.google.com/drive/folders/1D8E9T-vuBE-YdAd9OBy-yS4UW4AptA58?usp=sharing

    • Long text datasets (following Conwea):

      • 20Newsgroup Fine (20NF)
      • 20Newsgroup Coarse (20NC)
      • NYT Fine (NYT_25)
      • NYT Coarse (NYT_5)
    • Short text datasets (following LOTClass):

      • Agnews
      • dbpedia
      • imdb
      • amazon
  2. Unzip data into './data/processed'

Another way to obtain the data (not recommended):
You can download the long text data from Conwea and the short text data from LOTClass, then convert them into json format with our code, located at 'preprocess_data/process_long.py' ('process_short.py' for short texts). Before running it, edit the preprocessing code to set the dataset path to your download location and to change the taskname. The processed data is written to 'data/processed'. We also provide preprocessing code for X-class in 'process_x_class.py'.

Requirements

This project is based on python==3.8. The dependencies are as follows:

pytorch
DGL
yacs
visdom
transformers
scikit-learn
numpy
scipy
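
Before training, it can help to confirm that every dependency is importable. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the mapping only records the import name of each listed package (note that scikit-learn is imported as `sklearn` and pytorch as `torch`).

```python
import importlib.util

# Listed package name -> module name used in an import statement.
REQUIRED = {
    "pytorch": "torch",
    "DGL": "dgl",
    "yacs": "yacs",
    "visdom": "visdom",
    "transformers": "transformers",
    "scikit-learn": "sklearn",
    "numpy": "numpy",
    "scipy": "scipy",
}


def missing_packages(required):
    """Return the package names whose import module cannot be found."""
    return [
        name
        for name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

The check only tests that the modules can be located; it does not verify versions or GPU support.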

Train and Eval

  • We recommend starting visdom to display the results:
visdom -p 8888

Open server_ip:8888 in a browser to view the visdom panel.

  • Train:
    • First, edit 'task/pipeline.py' to specify the config file and the CUDA devices to use.
      Some configuration files are provided in the config folder.

    • Start training:

      python task/pipeline.py
      
    • Our code is written for multiple GPUs and may currently be unable to run on a single GPU.
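
How 'task/pipeline.py' selects devices is not shown here, so the following is only a sketch of one common approach: restricting the visible GPUs through the `CUDA_VISIBLE_DEVICES` environment variable. The helper name `set_visible_gpus` and the indices `[0, 1]` are assumptions for illustration.

```python
import os


def set_visible_gpus(gpu_ids):
    """Restrict CUDA to the given GPU indices, e.g. [0, 1].

    Returns the string written to CUDA_VISIBLE_DEVICES.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Hypothetical choice: expose the first two GPUs to the process.
# This must run before CUDA is initialized, so the safest place is
# before importing torch.
set_visible_gpus([0, 1])
```

Inside the process, the selected GPUs are then renumbered starting from 0.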

Run on your custom dataset.

  1. Provide your dataset files in the directory data/processed.

    • keywords.json
      Keywords for each class. Type: dict. Key: class_index. Value: list containing all keywords for this class. See the provided datasets for details.

    • unlabeled.json
      The unlabeled sentences described in our paper. Type: list. Item: list with 2 items ([sentence_i, label_i]).
      To facilitate evaluation, we follow a setting similar to Conwea's, in which the labels of the sentences are provided. The labels are used only for evaluation.

  2. Provide a config file in the directory config. You can copy one of the existing config files and change some fields, such as number_classes, classifier.type, data_dir_name, etc.

  3. Specify the config file name in pipeline.py and run the pipeline code.
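
The two files described in step 1 can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the toy keywords and sentences are invented for illustration, labels are assumed to be integer class indices that match the keys of keywords.json (json keys are always strings), and the sub-directory name `my_dataset` is hypothetical. Compare with the provided datasets before relying on it.

```python
import json
import os


def validate_dataset(keywords, unlabeled):
    """Check the structure described above; raise ValueError on mismatch."""
    for class_index, words in keywords.items():
        if not isinstance(words, list) or not words:
            raise ValueError(f"class {class_index} needs a non-empty keyword list")
    for i, item in enumerate(unlabeled):
        if not isinstance(item, list) or len(item) != 2:
            raise ValueError(f"item {i} must be [sentence, label]")
        sentence, label = item
        if not isinstance(sentence, str):
            raise ValueError(f"item {i}: sentence must be a string")
        # Assumption: labels are class indices that appear as keyword keys.
        if str(label) not in keywords:
            raise ValueError(f"item {i}: unknown label {label!r}")


def write_dataset(data_dir, keywords, unlabeled):
    """Write keywords.json and unlabeled.json into data_dir."""
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "keywords.json"), "w", encoding="utf-8") as f:
        json.dump(keywords, f, ensure_ascii=False, indent=2)
    with open(os.path.join(data_dir, "unlabeled.json"), "w", encoding="utf-8") as f:
        json.dump(unlabeled, f, ensure_ascii=False)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    keywords = {"0": ["football", "match"], "1": ["election", "senate"]}
    unlabeled = [
        ["The team won the match in extra time.", 0],
        ["The senate passed the bill on Tuesday.", 1],
    ]
    validate_dataset(keywords, unlabeled)
    write_dataset(os.path.join("data", "processed", "my_dataset"), keywords, unlabeled)
```

Validating before writing catches malformed items early, before the pipeline is started.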

Citation

Please cite the following paper if you find our code helpful! Thank you very much.

Lu Zhang, Jiandong Ding, Yi Xu, Yingyao Liu and Shuigeng Zhou. "Weakly-supervised Text Classification Based on Keyword Graph". EMNLP 2021.

Owner
Hello_World
Computer Science at Fudan University.