Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Last update: Dec 16, 2022

Overview

Code for the paper "Discovering Topics in Long-tailed Corpora with Causal Intervention", published in Findings of ACL-IJCNLP 2021.

Usage

0. Prepare environment

Requirements:

python==3.6
tensorflow-gpu==1.13.1
scipy==1.5.2
scikit-learn==0.23.2
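Before running anything, it can help to confirm that the installed packages match the pins above. The following is a minimal sketch, not part of the repository: the helper names `parse_pins`, `find_mismatches`, and `installed_versions` are hypothetical, and the script only reports mismatches without installing anything.

```python
import sys

# Package pins copied from the requirements list above.
PINNED = ["tensorflow-gpu==1.13.1", "scipy==1.5.2", "scikit-learn==0.23.2"]


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed):
    """Return {name: (wanted, found)}; found is None when the package is missing."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


def installed_versions(names):
    """Look up installed versions; None means the package was not found."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Python 3.6 and 3.7 have no importlib.metadata; fall back to setuptools.
        import pkg_resources

        PackageNotFoundError = pkg_resources.DistributionNotFound

        def version(name):
            return pkg_resources.get_distribution(name).version

    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    if sys.version_info[:2] != (3, 6):
        print("python: wanted 3.6, found %d.%d" % sys.version_info[:2])
    pins = parse_pins(PINNED)
    for name, (wanted, found) in sorted(find_mismatches(pins, installed_versions(pins)).items()):
        print("%s: wanted %s, found %s" % (name, wanted, found))
```

If the script prints nothing, the interpreter and the three packages match the listed versions.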

1. Prepare data

Download the preprocessed datasets from Google Drive and extract the files into ./data, so that each dataset sits in its own directory ./data/{dataset}.

2. Run the model

python main.py --data_dir ./data/{dataset} --output_dir ./output
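To train on every extracted dataset in turn, the command above can be generated once per subdirectory of ./data. This is a sketch under assumptions: `build_commands` is a hypothetical helper, each dataset is assumed to live in its own subdirectory, and writing to a per-dataset output directory is a choice made here, not something the repository prescribes. Only the `--data_dir` and `--output_dir` flags shown above are used.

```python
import os


def build_commands(data_root="./data", output_root="./output"):
    """Build one `python main.py ...` argument list per dataset directory."""
    if not os.path.isdir(data_root):
        return []
    commands = []
    for name in sorted(os.listdir(data_root)):
        data_dir = os.path.join(data_root, name)
        if not os.path.isdir(data_dir):
            continue  # skip stray files such as archives or notes
        commands.append([
            "python", "main.py",
            "--data_dir", data_dir,
            # Assumption: a separate output directory per dataset.
            "--output_dir", os.path.join(output_root, name),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of launching them.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to launch the runs sequentially.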

3. Evaluation

topic coherence: coherence score.

topic diversity:

python utils/TU.py --data_path {path of topic word file}
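For reference, topic diversity is commonly measured with topic uniqueness (TU): for each topic, average 1/cnt(w) over its top words, where cnt(w) is the number of topics whose top words include w, then average over topics. The sketch below illustrates that definition and is not the code in utils/TU.py. It assumes the topic word file holds one topic per line with whitespace-separated words; the function names are hypothetical.

```python
from collections import Counter


def read_topics(path):
    """Read topics from a file: one topic per line, words split on whitespace (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]


def topic_uniqueness(topics):
    """Mean over topics of the average 1 / cnt(w) of each topic's words.

    cnt(w) is the number of topics that contain word w.
    The score is 1.0 when no word is shared and 1/K when all K topics are identical.
    """
    # Count each word at most once per topic.
    counts = Counter(word for topic in topics for word in set(topic))
    per_topic = [
        sum(1.0 / counts[word] for word in topic) / len(topic)
        for topic in topics
    ]
    return sum(per_topic) / len(per_topic)


if __name__ == "__main__":
    example = [["game", "team"], ["game", "court"]]
    print(topic_uniqueness(example))  # 0.75: "game" is shared by both topics
```

Higher values mean the topics overlap less in their top words.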

Citation

If you are interested in our work, please cite it as:

@inproceedings{wu2021discovering,
    title = "Discovering Topics in Long-tailed Corpora with Causal Intervention",
    author = "Wu, Xiaobao  and
    Li, Chunping  and
    Miao, Yishu",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.15",
    doi = "10.18653/v1/2021.findings-acl.15",
    pages = "175--185",
}

Other related works

EMNLP 2020: Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder

NLPCC 2020: Learning Multilingual Topics with Neural Variational Inference

