Data and code accompanying the paper Politics and Virality in the Time of Twitter

Last update: Jul 02, 2022

Overview

Data and code accompanying the paper Politics and Virality in the Time of Twitter.

Specifically, this repository contains:

the code used to train our models (./code/finetune_models.py and ./code/finetune_multi_cv.py)
a Jupyter Notebook containing the major parts of our analysis (./code/analysis.ipynb)
the model that was selected and used for the sentiment analysis
the manually annotated data used for training (./data/annotation/)
the IDs of the tweets used in our analysis and control experiments (./data/main/ and ./data/control)
the names, parties and handles of the MPs that were tracked (./data/mps_list.csv)

Annotated Data (./data/annotation/)

There is one folder for each language (English, Spanish, Greek).
Each folder contains three files:
1. *_900.csv contains the 900 tweets that annotators labelled individually (300 tweets per annotator).
2. *_tiebreak_100.csv contains the initial 100 tweets that all annotators labelled. 'annotator_3' indicates the annotator who was used as a tiebreaker.
3. *_combined.csv contains all tweets labelled for the language.
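
As a rough illustration of the tiebreak scheme described above, the sketch below resolves one final label per tweet: when the first two annotators agree, their label is kept, otherwise the tiebreaker's label decides. Only the column name 'annotator_3' comes from this README. The columns 'annotator_1', 'annotator_2' and 'text', the helper name resolve_label, and the sample rows are assumptions made for illustration, so the actual CSV layout and the authors' exact procedure may differ.

```python
import csv
import io


def resolve_label(row):
    """Return the agreed label, falling back to the tiebreaker on disagreement."""
    first, second = row["annotator_1"], row["annotator_2"]  # assumed column names
    if first == second:
        return first
    return row["annotator_3"]  # tiebreaker column named in this README


# Tiny in-memory stand-in for a *_tiebreak_100.csv file (made-up rows).
sample = io.StringIO(
    "text,annotator_1,annotator_2,annotator_3\n"
    "tweet a,Positive,Positive,Negative\n"
    "tweet b,Positive,Negative,Negative\n"
)
labels = [resolve_label(row) for row in csv.DictReader(sample)]
print(labels)
```

To try this on the real data, replace the in-memory sample with an opened *_tiebreak_100.csv file and adjust the column names to match its header.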

Model

We plan to upload all the models trained for our experiments to huggingface.co. For now, only the main model used in our analysis is available, at: https://drive.google.com/file/d/1_Ngmh-uHGWEbKHFpKmQ1DhVf6LtDTglx/view?usp=sharing

The model, 'xlm-roberta-sentiment-multilingual', is based on the implementation of 'cardiffnlp/twitter-xlm-roberta-base-sentiment' and was further fine-tuned on the annotated dataset.

Example usage

from transformers import AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model from a local copy of the downloaded model.
model = AutoModelForSequenceClassification.from_pretrained('./xlm-roberta-sentiment-multilingual/')
# The tokenizer is taken from the base model.
sentiment_analysis_task = pipeline("sentiment-analysis", model=model, tokenizer="cardiffnlp/twitter-xlm-roberta-base-sentiment")

sentiment_analysis_task('Today is a good day')
# Out: [{'label': 'Positive', 'score': 0.978614866733551}]
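
The same pipeline can also be called on a list of texts, in which case it returns one {'label', 'score'} dictionary per input. A minimal sketch of counting the predicted labels over a batch could look as follows. The helper label_counts and the stand-in fake_classifier are hypothetical names introduced here so that the snippet runs without downloading the model. They are not part of this repository.

```python
from collections import Counter


def label_counts(classifier, texts):
    """Count predicted labels for a batch of texts.

    `classifier` is any callable that maps a list of strings to a list of
    dicts with a 'label' key, as a transformers pipeline does.
    """
    predictions = classifier(list(texts))
    return Counter(prediction["label"] for prediction in predictions)


# Stand-in classifier (made-up output) so the sketch runs without the model;
# pass `sentiment_analysis_task` from the example above for real predictions.
def fake_classifier(texts):
    return [{"label": "Positive", "score": 0.9} for _ in texts]


counts = label_counts(fake_classifier, ["Today is a good day", "Hoy es un buen día"])
print(counts)
```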

Reference paper

For more details, please check the reference paper. If you use the data contained in this repository for your research, please cite the paper using the following bib entry:

@inproceedings{antypas2022politics,
  title={{Politics and Virality in the Time of Twitter: A Large-Scale Cross-Party Sentiment Analysis in Greece, Spain and United Kingdom}},
  author={Antypas, Dimosthenis and Preece, Alun and Camacho-Collados, Jose},
  booktitle={arXiv preprint arXiv:2202.00396},
  year={2022}
}

Owner

Cardiff NLP
