NLP-Project - Used an API to scrape 2,000 Reddit posts, then performed NLP analysis and built a classification model, with mixed success

Overview

Project 3: Web APIs & NLP

Problem Statement

How do r/Libertarian and r/Neoliberal differ on Biden post-inauguration?

The goal of the project is to see how these two ideologically similar subreddits perceive Biden and his term as president so far.

Success in this project isn't necessarily about developing a model that predicts accurately and consistently, but rather about conveying which issues these two ideologies care about and the overall sentiment both subreddits have regarding Biden. Because so much of this work is focused on EDA, it's hard to judge the project by the individual models created; its success will be determined primarily by the EDA, visualization, and presentation stages of the actual project. With that said, I will still use a wide variety of models to determine the predictive value of the data I gathered.

Hypothesis: I believe the two subreddits will differ significantly in the issues they discuss and their sentiment towards Biden, and that because of these differences a model can be built that accurately predicts which subreddit a post belongs to. Primarily, I will focus on the differences between these subreddits in sentiment and word usage.

Data Collection

When collecting data, I didn't have a firm problem statement in mind before I started. I knew I wanted to do something political, specifically on the Biden administration post-inauguration, but I also wanted to experiment with different subreddits along the way, which made for an interesting situation.

Going into the data collection process blind, I learned a lot about the API, such as avoiding deleted posts by excluding "[deleted]" from the selftext, and especially about using score and created_utc for gathering posts. I would say the most difficult part was finding subreddits, checking whether they had enough posts, and constructing different problem statements from the viable ones.
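The README doesn't name the API, but a common choice for this kind of Reddit scraping is the Pushshift submission-search endpoint. Below is a minimal sketch of a collection loop under that assumption; the column names and the one-second delay are illustrative choices, not the project's actual code.

```python
import time
import requests
import pandas as pd

URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_posts(subreddit, n_posts=2000, before=None):
    rows = []
    while len(rows) < n_posts:
        params = {"subreddit": subreddit, "size": 100, "before": before}
        batch = requests.get(URL, params=params).json()["data"]
        if not batch:
            break
        for post in batch:
            # Skip deleted/removed posts up front, per the lesson above.
            if post.get("selftext") in ("[deleted]", "[removed]"):
                continue
            rows.append({"title": post.get("title"),
                         "selftext": post.get("selftext", ""),
                         "score": post.get("score"),
                         "created_utc": post["created_utc"]})
        # Paginate backwards in time using the oldest timestamp seen.
        before = batch[-1]["created_utc"]
        time.sleep(1)  # be polite to the API
    return pd.DataFrame(rows[:n_posts])

posts = fetch_posts("neoliberal")
```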

In the end, I chose r/neoliberal and r/Libertarian. There might have been easier options for model creation, but I found this pairing much more interesting, especially since I already browse r/neoliberal fairly frequently and was invested in the analysis.

Data Cleaning and EDA

I performed data cleaning and EDA across two separate notebooks: my logistic regression notebook and a notebook dedicated to EDA and data cleaning. I initially had just the logistic regression notebook, but I then wanted to do further analysis on the vectorized sets, so I created a separate notebook for that, while still at times referencing the ideal vectorizer parameters I had found in the logistic regression notebook.

Truth be told, I did some cleaning in the data gathering notebook, just checking for duplicates and other oddities, and I didn't find much. A few removed posts may have snuck into my analysis, but nothing warranted editing my data gathering techniques or stopped me from using the data I had already gathered.
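A minimal sketch of those sanity checks, assuming the scraped posts live in a DataFrame with title and selftext columns as in the collection sketch above:

```python
# Count and drop exact duplicate posts.
print(posts.duplicated(subset=["title", "selftext"]).sum())
posts = posts.drop_duplicates(subset=["title", "selftext"])

# Catch any removed/deleted posts that slipped past the collection filter.
posts = posts[~posts["selftext"].isin(["[deleted]", "[removed]"])]
```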

EDA primarily consisted of finding words that stuck out using count vectorizers; luckily, that was fairly easy, as the NLP process came naturally to me. I used lemmatizers for model creation but rarely for the EDA itself, where I mostly used a basic tokenizer without any added features. The bulk of my presentation comes directly from this process combined with domain knowledge: EDA surfaced a narrative that I could fully formulate with that knowledge, which then led to the conclusions in my presentation.
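A sketch of that word-frequency EDA with scikit-learn's CountVectorizer: vectorize each subreddit's titles separately and compare the most frequent tokens. The subreddit column name is an assumption carried over from the earlier sketches, and the English stop-word list is an illustrative choice to keep filler terms out of the ranking.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_words(texts, n=15):
    cvec = CountVectorizer(stop_words="english")
    counts = cvec.fit_transform(texts)
    freqs = counts.sum(axis=0).A1  # total count per vocabulary term
    ranked = sorted(zip(cvec.get_feature_names_out(), freqs),
                    key=lambda pair: -pair[1])
    return ranked[:n]

print(top_words(posts.loc[posts["subreddit"] == "Libertarian", "title"]))
print(top_words(posts.loc[posts["subreddit"] == "neoliberal", "title"]))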

Another critical part of EDA was sentiment analysis, used to find the difference in overall tone between the two subreddits on Biden. This was especially important because it also ended up being part of my preprocessing. In my presentation, sentiment analysis served both to show the differences in tone towards Biden and to emphasize how neutral the posts themselves were, which is primarily due to many posts being the titles of politically neutral news articles or tweets.
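The README doesn't name the sentiment tool; NLTK's VADER analyzer is a common choice for social-media text, so the sketch below assumes it. VADER's compound score runs from -1 (negative) to +1 (positive), and its neu score captures the neutrality that dominated the news-style titles.

```python
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
# Expand each post's neg/neu/pos/compound scores into columns.
scores = posts["title"].apply(sia.polarity_scores).apply(pd.Series)
posts = pd.concat([posts, scores], axis=1)

print(posts.groupby("subreddit")[["compound", "neu"]].mean())
```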

Preprocessing and Modelling

Modelling and preprocessing were both tedious processes, because much of the work was very memory intensive, which meant a lot of time spent babysitting my laptop. Ultimately, though, it provided a lot of valuable information, not only about the data I was investigating but also about the models I was using. I used bagging classifiers, logistic regression models, decision trees, random forest models, and boosted models, all to very mixed success; logistic regression was the most consistent, especially on selftext-exclusive posts. I had high expectations for the random forest, decision tree, and boosted models, but none were as consistently effective as logistic regression. Due to this general underperformance, I will primarily discuss the logistic regression models from my logreg notebook, since I dedicated the most time to fine-tuning them and got more consistent performance from them than from the others.
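A hedged sketch of that most consistent setup: a count vectorizer feeding a logistic regression classifier on the selftext. The split parameters and vectorizer settings here are illustrative assumptions, not the tuned values from the logreg notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = posts["selftext"]
y = posts["subreddit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

pipe = Pipeline([
    ("cvec", CountVectorizer(stop_words="english")),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # accuracy vs. a ~0.5 two-class baseline
```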

I specifically had massive trouble predicting neoliberal posts, while Libertarian posts I generally predicted at a decent rate; my specificity was much better than my sensitivity. When judging each model's predictive ability, I looked at selftext, title-exclusive, and total text separately. This let me see what each model was good at predicting, and also what data to gather the next time I work with this API.
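One way to compute that class-level breakdown, treating r/neoliberal as the positive class so sensitivity is recall on neoliberal posts and specificity is recall on Libertarian posts. The label strings are assumptions matching the subreddit names.

```python
from sklearn.metrics import confusion_matrix

preds = pipe.predict(X_test)
# Rows are true labels, columns are predictions, in the order given.
tn, fp, fn, tp = confusion_matrix(
    y_test, preds, labels=["Libertarian", "neoliberal"]).ravel()
print("sensitivity (neoliberal recall):", tp / (tp + fn))
print("specificity (Libertarian recall):", tn / (tn + fp))
```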

My preprocessing was very meticulous, especially in experimenting with different vectorizer parameters for my logistic regression model. I adjusted parameters and added sentiment scores to try to improve the model's performance, and vectorizer parameters such as binary were heavily tweaked depending on the X variable used (selftext, title, totaltext).
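A sketch of how that parameter experimentation might look as a grid search over the vectorizer, re-run per X variable. The grid values are illustrative assumptions; only the binary toggle is named in the text above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
params = {
    "cvec__binary": [True, False],
    "cvec__max_features": [2000, 5000, None],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__stop_words": [None, "english"],
}
gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)  # repeat with title / totaltext as X
print(gs.best_params_, gs.best_score_)
```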

Conclusion

When analyzing this data, it is clear that there are four key takeaways from my modelling process and EDA stage.

  1. The overwhelming neutrality of the text (specifically the titles) can hide the true opinions of those in each subreddit.

  2. Predictive modelling is incredibly difficult on these two subreddits in particular, and potentially on other political subreddits as well.

  3. Where the subreddits most differ is in issue focus: r/Libertarian focuses more on surveillance and misinformation in the media, while r/Neoliberal is concerned with global politics, climate, and sitting senators.

  4. Both discuss taxes, COVID, the stimulus, China, and other current topics relatively often.

Sources Used

Britannica Definition of Libertarianism

Neoliberal Project

Stanford Philosophy: Libertarianism

Stanford Philosophy: Neoliberalism

Neoliberal Podcast: Defining Neoliberalism

r/Libertarian

r/neoliberal
