Text-Based Ideal Points

Related tags

Deep Learningtbip
Overview

Text-Based Ideal Points

Source code for the paper: Text-Based Ideal Points by Keyon Vafa, Suresh Naidu, and David Blei (ACL 2020).

Update (June 29, 2020): We have added interactive visualizations of topics learned by our model.

Update (May 25, 2020): We have added a PyTorch implementation of the text-based ideal point model.

Update (May 11, 2020): See our Colab notebook to run the model online. Our Github code is more complete, and it can be used to reproduce all of our experiments. However, the TBIP is fastest on GPU, so if you do not have access to a GPU you can use Colab's GPUs for free.

Installation for GPU

Configure a virtual environment using Python 3.6+ (instructions here). Inside the virtual environment, use pip to install the required packages:

(venv)$ pip install -r requirements.txt

The main dependencies are Tensorflow (1.14.0) and Tensorflow Probability (0.7.0).

Installation for CPU

To run on CPU, a version of Tensorflow that does not use GPU must be installed. In requirements.txt, comment out the line that says tensorflow-gpu==1.14.0 and uncomment the line that says tensorflow==1.14.0. Note: the script will be noticeably slower on CPU.

Data

Preprocessed Senate speech data for the 114th Congress is included in data/senate-speeches-114. The original data is from [1]. Preprocessed 2020 Democratic presidential candidate tweet data is included in data/candidate-tweets-2020.

To include a customized data set, first create a repo data/{dataset_name}/clean/. The following four files must be inside this folder:

  • counts.npz: a [num_documents, num_words] sparse CSR matrix containing the word counts for each document.
  • author_indices.npy: a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_authors - 1}, indicating the author of the corresponding document in counts.npz.
  • vocabulary.txt: a [num_words]-length file where each line denotes the corresponding word in the vocabulary.
  • author_map.txt: a [num_authors]-length file where each line denotes the name of an author in the corpus.

See data/senate-speeches-114/clean for an example of what the four files look like for Senate speeches. The script setup/senate_speeches_to_bag_of_words.py contains example code for creating the four files from unprocessed data.

Learning text-based ideal points

Run tbip.py to produce ideal points. For the Senate speech data, use the command:

(venv)$ python tbip.py  --data=senate-speeches-114  --batch_size=512  --max_steps=100000

You can view Tensorboard while training to see summaries of training (including the learned ideal points and ideological topics). To run Tensorboard, use the command:

(venv)$ tensorboard  --logdir=data/senate-speeches-114/tbip-fits/  --port=6006

The command should output a link where you can view the Tensorboard results in real time. The fitted parameters will be stored in data/senate-speeches-114/tbip-fits/params. To perform the above analyses for the 2020 Democratic candidate tweets, replace senate-speeches-114 with candidate-tweets-2020.

To run custom data, we recommend training Poisson factorization before running the TBIP script for best results. If you have custom data stored in data/{dataset_name}/clean/, you can run

(venv)$ python setup/poisson_factorization.py  --data={dataset_name}

The default number of topics is 50. To use a different number of topics, e.g. 100, use the flag --num_topics=100. After Poisson factorization finishes, use the following command to run the TBIP:

(venv)$ python tbip.py  --data={dataset_name}

You can adjust the batch size, learning rate, number of topics, and number of steps by using the flags --batch_size, --learning_rate, --num_topics, and --max_steps, respectively. To run the TBIP without initializing from Poisson factorization, use the flag --pre_initialize_parameters=False. To view the results in Tensorboard, run

(venv)$ tensorboard  --logdir=data/{dataset_name}/tbip-fits/  --port=6006

Again, the learned parameters will be stored in data/{dataset_name}/tbip-fits/params.

Reproducing Paper Results

NOTE: Since the publication of our paper, we have made small changes to the code that have sped up inference. A byproduct of these changes is that the Tensorflow graph has changed, so its random seed does not produce the same results as before the changes, even though the data, model, and inference are all the same. To reproduce the exact paper results, one must git checkout to a version of our repository from before these changes:

(venv)$ git checkout 31d161e

The commands below will reproduce all of the paper results. The following data is required before running the commands:

  • Senate votes: The original raw data can be found at [2]. The paper includes experiments for Senate sessions 111-114. For each Senate session, we need three files: one for votes, one for members, and one for rollcalls. For example, for Senate session 114, we would use the files: S114_votes.csv, S114_members.csv, S114_rollcalls.csv. Make a repo data/senate-votes and store these three files in data/senate-votes/114/raw/. Repeat for Senate sessions 111-113.
  • Senate speeches: The original raw data can be found at [1]. Specifically, we use the hein-daily data for the 114th Senate session. The files needed are speeches_114.txt, descr_114.txt, and 114_SpeakerMap.txt. Make sure the relevant files are stored in data/senate-speeches-114/raw/.
  • Senator tweets: The data was provided to us by Voxgov [3].
  • Senate speech comparisons: We use a separate data set for the Senate speech comparisons because speech debates must be labeled for Wordshoal. The raw data can be found at [4]. The paper includes experiments for Senate sessions 111-113. We need the files speaker_senator_link_file.csv, speeches_Senate_111.tab, speeches_Senate_112.tab, and speeches_Senate_113.tab. These files should all be stored in data/senate-speech-comparisons/raw/.
  • Democratic presidential candidate tweets: Download the raw tweets here and store tweets.csv in the folder data/candidate-tweets-2020/raw/.

Preprocess, run vote ideal point model, and perform analysis for Senate votes

(venv)$ python setup/preprocess_senate_votes.py  --senate_session=111
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=112
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=113
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=114
(venv)$ python setup/vote_ideal_points.py  --senate_session=111
(venv)$ python setup/vote_ideal_points.py  --senate_session=112
(venv)$ python setup/vote_ideal_points.py  --senate_session=113
(venv)$ python setup/vote_ideal_points.py  --senate_session=114
(venv)$ python analysis/analyze_vote_ideal_points.py

Preprocess, run the TBIP, and perform analysis for Senate speeches for the 114th Senate

(venv)$ python setup/senate_speeches_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-speeches-114
(venv)$ python tbip.py  --data=senate-speeches-114  --counts_transformation=log  --batch_size=512  --max_steps=150000
(venv)$ python analysis/analyze_senate_speeches.py

Preprocess, run the TBIP and Wordfish, and perform analysis for tweets from senators during the 114th Senate

(venv)$ python setup/senate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-tweets-114
(venv)$ python tbip.py  --data=senate-tweets-114  --batch_size=1024  --max_steps=100000
(venv)$ python model_comparison/wordfish.py  --data=senate-tweets-114  --max_steps=50000
(venv)$ python analysis/analyze_senate_tweets.py

Preprocess and run the TBIP for Senate speech comparisons

(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=111
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=112
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=113
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=111
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=112
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=113
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=111  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=112  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=113  --batch_size=128

Run Wordfish for Senate speech comparisons

(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=111
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=112 
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=113

Run Wordshoal for Senate speech comparisons

(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=111  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=112  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=113  --batch_size=1024

Analyze results for Senate speech comparisons

(venv)$ python analysis/compare_tbip_wordfish_wordshoal.py

Preprocess, run the TBIP, and perform analysis for Democratic candidate tweets

(venv)$ python setup/candidate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=candidate-tweets-2020
(venv)$ python tbip.py  --data=candidate-tweets-2020  --batch_size=1024  --max_steps=100000
(venv)$ python analysis/analyze_candidate_tweets.py

Make figures

(venv)$ python analysis/make_figures.py

References

[1] Matthew Gentzkow, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

[2] Jeffrey B. Lewis, Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2020). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/

[3] VoxGovFEDERAL, U.S. Senators tweets from the 114th Congress. 2020. https://voxgov.com

[4] Benjamin E. Lauderdale and Alexander Herzog. Replication Data for: Measuring Political Positions from Legislative Speech. In Harvard Dataverse, 2016. https://doi.org/10.7910/DVN/RQMIV3

Owner
Keyon Vafa
Keyon Vafa
An ML & Correlation platform for transforming disparate data points of interest into usable intelligence.

SSIDprobeCollector An ML & Correlation platform for transforming disparate data points of interest into usable intelligence. At a High level the platf

Bill Reyor 1 Jan 30, 2022
Distance correlation and related E-statistics in Python

dcor dcor: distance correlation and related E-statistics in Python. E-statistics are functions of distances between statistical observations in metric

Carlos Ramos Carreño 108 Dec 27, 2022
Collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning.

Collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning Installation

Pytorch Lightning 1.6k Jan 08, 2023
Bayesian inference for Permuton-induced Chinese Restaurant Process (NeurIPS2021).

Permuton-induced Chinese Restaurant Process Note: Currently only the Matlab version is available, but a Python version will be available soon! This is

NTT Communication Science Laboratories 3 Dec 17, 2022
The spiritual successor to knockknock for PyTorch Lightning, get notified when your training ends

Who's there? The spiritual successor to knockknock for PyTorch Lightning, to get a notification when your training is complete or when it crashes duri

twsl 70 Oct 06, 2022
This is a collection of our NAS and Vision Transformer work.

AutoML - Neural Architecture Search This is a collection of our AutoML-NAS work iRPE (NEW): Rethinking and Improving Relative Position Encoding for Vi

Microsoft 832 Jan 08, 2023
Pytorch implementation for M^3L

Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification (CVPR 2021) Introduction This is the Py

Yuyang Zhao 45 Dec 26, 2022
AoT is a system for automatically generating off-target test harness by using build information.

AoT: Auto off-Target Automatically generating off-target test harness by using build information. Brought to you by the Mobile Security Team at Samsun

Samsung 10 Oct 19, 2022
Using fully convolutional networks for semantic segmentation with caffe for the cityscapes dataset

Using fully convolutional networks for semantic segmentation (Shelhamer et al.) with caffe for the cityscapes dataset How to get started Download the

Simon Guist 27 Jun 06, 2022
Download & Install mods for your favorit game with a few simple clicks

Husko's SteamWorkshop Downloader 🔴 IMPORTANT ❗ 🔴 The Tool is currently being rewritten so updates will be slow and only on the dev branch until it i

Husko 67 Nov 25, 2022
AI drive app that can help user become beautiful.

爱美丽 Beauty 简体中文 Features Beauty is an AI drive app that can help user become beautiful. it contain those functions: face score cheek face beauty repor

Starved Midnight 1 Jan 30, 2022
Code for "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR 2021

LoFTR: Detector-Free Local Feature Matching with Transformers Project Page | Paper LoFTR: Detector-Free Local Feature Matching with Transformers Jiami

ZJU3DV 1.4k Jan 04, 2023
Python3 / PyTorch implementation of the following paper: Fine-grained Semantics-aware Representation Enhancement for Self-supervisedMonocular Depth Estimation. ICCV 2021 (oral)

FSRE-Depth This is a Python3 / PyTorch implementation of FSRE-Depth, as described in the following paper: Fine-grained Semantics-aware Representation

77 Dec 28, 2022
The hippynn python package - a modular library for atomistic machine learning with pytorch.

The hippynn python package - a modular library for atomistic machine learning with pytorch. We aim to provide a powerful library for the training of a

Los Alamos National Laboratory 37 Dec 29, 2022
Python scripts performing class agnostic object localization using the Object Localization Network model in ONNX.

ONNX Object Localization Network Python scripts performing class agnostic object localization using the Object Localization Network model in ONNX. Ori

Ibai Gorordo 15 Oct 14, 2022
Universal Probability Distributions with Optimal Transport and Convex Optimization

Sylvester normalizing flows for variational inference Pytorch implementation of Sylvester normalizing flows, based on our paper: Sylvester normalizing

Rianne van den Berg 172 Dec 13, 2022
A pytorch &keras implementation and demo of Fastformer.

Fastformer Notes from the authors Pytorch/Keras implementation of Fastformer. The keras version only includes the core fastformer attention part. The

153 Dec 28, 2022
Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).

Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).

Varun Nair 37 Dec 30, 2022
Research on Event Accumulator Settings for Event-Based SLAM

Research on Event Accumulator Settings for Event-Based SLAM This is the source code for paper "Research on Event Accumulator Settings for Event-Based

Robin Shaun 26 Dec 21, 2022
Tensorflow implementation of Semi-supervised Sequence Learning (https://arxiv.org/abs/1511.01432)

Transfer Learning for Text Classification with Tensorflow Tensorflow implementation of Semi-supervised Sequence Learning(https://arxiv.org/abs/1511.01

DONGJUN LEE 82 Oct 22, 2022