SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time

Overview

SentimentArcs Logo

SentimentArcs - Emotion in Text

An end-to-end pipeline of Jupyter notebooks to detect, extract, process, and analyze emotion over time in text.
Explore the docs »

Quick Video Overview · Cambridge University Press Elements Textbook by Katherine Elkins · Report a Bug or Request a Feature · More Research by Jon Chun and Katherine Elkins · References on Sentiment Analysis, AffectiveAI and Related Topics

Table of Contents

  1. Welcome
  2. Background
  3. Features
  4. Sentiment Analysis Models
  5. Notebooks and Dataflow
  6. Reference Corpora
  7. Installation
  8. Examples
  9. License
  10. Contact and Contribute
Fig 1: SentimentArcs ensembles over three dozen sentiment analysis models, from simple XAI lexicons to state-of-the-art Transformers (including models specialized for financial and social texts), shown here on Machines Like Me by Ian McEwan

Fig 2: Efficient Exploratory Data Analysis (EDA) lets the domain expert customize models, hyperparameters, and time series processing

Fig 3: Automatic peak/valley detection and text extraction around crux points

Welcome!

SentimentArcs is a novel methodology and software framework for analyzing emotion in long texts, or in sequenced collections of shorter texts, using Diachronic Sentiment Analysis. It segments any corpus into semantic units (e.g. sentences, tweets, financial posts) and applies an ensemble of over three dozen NLP sentiment analysis models, from simple lexical models to state-of-the-art Transformers. The resulting sentiment time series can be smoothed so that key features like peaks and valleys are detectable, and the text surrounding these crux points can be extracted for analysis by domain experts.
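
To make the idea concrete, here is a minimal sketch of diachronic sentiment analysis using common off-the-shelf Python tools (NLTK's sentence tokenizer and VADER, plus a pandas rolling mean). This is only an illustration of the concept, not the SentimentArcs notebooks themselves, and the input filename and smoothing window are assumptions.

```python
# Minimal sketch of diachronic sentiment analysis with off-the-shelf tools.
# This is NOT the SentimentArcs pipeline itself, just an illustration of the idea:
# segment a long text into sentences, score each, and smooth the series over time.
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("punkt", quiet=True)
nltk.download("vader_lexicon", quiet=True)

with open("my_novel.txt", encoding="utf-8") as f:   # hypothetical input file
    raw_text = f.read()

sentences = nltk.sent_tokenize(raw_text)            # semantic units: sentences

sia = SentimentIntensityAnalyzer()                  # one simple lexical model
scores = [sia.polarity_scores(s)["compound"] for s in sentences]

arc = pd.Series(scores, name="sentiment")
# Smooth with a rolling window sized relative to the text so peaks and valleys
# (candidate crux points) stand out from sentence-level noise.
window = max(len(arc) // 10, 1)
smoothed = arc.rolling(window=window, center=True, min_periods=1).mean()
smoothed.plot(title="Smoothed sentiment arc")       # requires matplotlib (preinstalled on Colab)
```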

For literary experts, features like peaks and valleys often correspond to key crux points in a narrative. For a financial analyst, they could represent regime changes or arbitrage opportunities. For a social media analyst, these swings could represent shifting public opinion on key topics, public figures, or even terrorist cell activity. SentimentArcs is built around a large ensemble of sentiment analysis models that surface interesting emotional arcs, which domain experts can use to efficiently detect subtle and complex ground truths hidden within any sequenced body of text.

(back to top)

Background

SentimentArcs is the result of many years of experience researching a wide variety of AI and machine learning techniques that assist human experts in the extremely challenging task of analyzing and generating natural language texts. This includes a focus on AffectiveAI approaches to analyzing diverse textual corpora including literature, social media, news, scripts, lyrics, speeches, poems, financial reports, and legal documents. Virtually all sequential long-form texts show detectable and measurable sentiment changes over time that reveal cohesive narrative elements. SentimentArcs helps domain experts arbitrate between competing machine learning and AI NLP models to quickly and efficiently identify, analyze, and discover latent narrative elements and emotional arcs in text.

Cambridge Elements: Digital Literary Studies

SentimentArcs is the novel software framework underlying Katherine Elkins's upcoming Cambridge Elements book. The text speaks to domain experts in Narrative Studies, Comparative Literature, and English who want to learn how to use NLP sentiment analysis in general, and SentimentArcs in particular, to analyze literature. The approach in this Cambridge Elements text is entirely generalizable to other fields. A more technical introduction to the core framework of SentimentArcs can be found in the October 2021 ArXiv paper by Jon Chun. The abstract of this paper outlines the technical focus and practical goals of SentimentArcs:

SOTA Transformer and DNN short text sentiment classifiers report over 97% accuracy on narrow domains like IMDB movie reviews. Real-world performance is significantly lower because traditional models overfit benchmarks and generalize poorly to different or more open domain texts. This paper introduces SentimentArcs, a new self-supervised time series sentiment analysis methodology that addresses the two main limitations of traditional supervised sentiment analysis: limited labeled training datasets and poor generalization. A large ensemble of diverse models provides a synthetic ground truth for self-supervised learning. Novel metrics jointly optimize an exhaustive search across every possible corpus:model combination. The joint optimization over both the corpus and model solves the generalization problem. Simple visualizations exploit the temporal structure in narratives so domain experts can quickly spot trends, identify key features, and note anomalies over hundreds of arcs and millions of data points. To our knowledge, this is the first self-supervised method for time series sentiment analysis and the largest survey directly comparing real-world model performance on long-form narratives.

Arxiv.org SentimentArcs Paper

(back to top)

Features

  • The largest ensemble of open NLP sentiment analysis models that we know of (currently over 3 dozen)
  • Efficient and flexible Human-in-the-Loop workflow to supervise, customize, and tune the entire end-to-end sentiment analysis process
  • Flexible statistical, visualization and text customizations so Domain Experts can easily identify, extract and analyze key features and surrounding text from sentiment time series.
  • Access to domain-specific baselines (Novels, Finance and Social Media) based upon carefully curated corpora
  • Novel Time Series Synthesis and Data Augmentation for NLP Sentiment Analysis Time Series
  • Novel Peak Detection Algorithms customized for NLP Sentiment Analysis Time Series
  • Easy access via free Google Colab Jupyter notebooks with access to powerful GPU accelerators
  • Minimal setup, training and support costs

(back to top)

Sentiment Analysis Models

  • Text preprocessing (cleaning, advanced sentence segmentation, custom stopword sets, etc)
  • An ensemble of over three dozen Sentiment Analysis Models representing the major families below, including the most popular sentiment analysis libraries and models from both R and Python as well as some AutoML techniques (a minimal sketch of the ensemble idea follows this list):
  • Lexical
  • Heuristics
  • Linguistic
  • GOFAI Machine Learning
  • Deep Neural Networks & AutoML
  • State of the Art Transformer Models
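
As a rough illustration of the ensemble idea, the sketch below scores the same sentences with representative models from a few of these families (VADER, TextBlob, and a default Hugging Face Transformer pipeline) and collects the results side by side. These particular models and the score mapping are assumptions for illustration, not the exact members of the SentimentArcs ensemble.

```python
# Illustrative sketch of the ensemble idea: score the same sentences with models
# from several families and compare their outputs side by side.
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer   # lexical/heuristic family
from textblob import TextBlob                            # lexical/linguistic family
from transformers import pipeline                        # Transformer family

# Assumes the NLTK vader_lexicon is already downloaded (see the earlier sketch).
sentences = ["I loved the opening chapter.", "The ending left me devastated."]

vader = SentimentIntensityAnalyzer()
hf = pipeline("sentiment-analysis")   # downloads a default general-purpose English model

def transformer_score(text):
    # Map the classifier's label and confidence onto a signed score in [-1, 1].
    result = hf(text[:512])[0]        # crude truncation of very long inputs
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

ensemble = pd.DataFrame({
    "vader": [vader.polarity_scores(s)["compound"] for s in sentences],
    "textblob": [TextBlob(s).sentiment.polarity for s in sentences],
    "transformer": [transformer_score(s) for s in sentences],
})
print(ensemble)
```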

(back to top)

Notebooks and Dataflow

Concretely, SentimentArcs consists of a series of software modules embodied as Jupyter notebooks and supporting libraries designed to run on Google's free Colab service. Notebooks are executed in sequence, reflecting the steps in the pipeline from text cleaning to sentiment time series analysis. Despite some shortcomings, Google Colab offers the lowest technical barrier for the widest range of non-technical domain experts, as well as the GPU-backed Jupyter notebooks required by the most powerful state-of-the-art models in our ensemble.

SentimentArcs is best viewed as an ordered pipeline of Google Colab Jupyter Notebooks that are run in sequence as follows:

  1. Notebook 0: Copy SentimentArcs Github repo to your Google GDrive (run once at setup or to reset)
  2. Notebook 1: Preprocessing Text
  3. Notebook 2: Sentiment Analysis Models: R Lexicon and Heuristic using SyuzhetR(4) and SentimentR(8)
  4. Notebook 3: Sentiment Analysis Models: Python Lexicon, Heuristic and ML
  5. Notebook 4: Sentiment Analysis Models: DNN and AutoML
  6. Notebook 5: Sentiment Analysis Models: Transformers(11)
  7. Notebook 6: Analysis, Visualizations, Smoothing and Crux Extraction (a minimal sketch of this final step follows the list)
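
The sketch below illustrates the kind of analysis Notebook 6 performs: detecting peaks and valleys in a smoothed sentiment series and extracting the sentences around each crux point. It assumes a `smoothed` pandas Series and a `sentences` list like those in the earlier sketch; the minimum peak separation and context window sizes are illustrative choices, not the notebooks' actual defaults.

```python
# Hedged sketch of peak/valley detection and crux text extraction.
import numpy as np
from scipy.signal import find_peaks

# `smoothed` is a pandas Series of smoothed sentiment values and `sentences` is
# the matching list of sentences (see the earlier sketch).
series = smoothed.to_numpy()
min_separation = max(len(series) // 20, 1)   # keep cruxes spread apart (illustrative)

peaks, _ = find_peaks(series, distance=min_separation)
valleys, _ = find_peaks(-series, distance=min_separation)

def crux_context(idx, half_window=3):
    """Return the sentences surrounding one crux point."""
    lo, hi = max(idx - half_window, 0), min(idx + half_window + 1, len(sentences))
    return " ".join(sentences[lo:hi])

for idx in sorted(np.concatenate([peaks, valleys])):
    kind = "peak" if idx in peaks else "valley"
    print(f"{kind} at sentence {idx}: {crux_context(idx)}")
```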

SentimentArcs Notebook DataFlow

Data flows through the project subdirectory structure in a corresponding manner (a brief sketch of this flow follows the list):

  1. text_raw: minimally prepared textfiles for the corpus
  2. text_clean: text further cleaned by SentimentArcs
  3. sentiment_raw: raw sentiment values for all texts in the corpus
  4. sentiment_clean: processed sentiment time series
  5. graphs_cruxes: extracted key features/crux points with surrounding text
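
As a hedged sketch of that flow, each stage reads from one subdirectory and writes to the next. The project path, CSV layout, and smoothing step below are assumptions for illustration; the notebooks themselves define the actual filenames and schemas.

```python
# Illustrative walk over the data-flow directories: read raw sentiment values,
# smooth them, and write the cleaned series for downstream analysis.
from pathlib import Path
import pandas as pd

root = Path("/content/drive/MyDrive/sentimentarcs_notebooks")   # hypothetical project root

for raw_csv in (root / "sentiment_raw").glob("*.csv"):
    df = pd.read_csv(raw_csv)
    numeric = df.select_dtypes(include="number")                 # one column per model (assumed)
    # Smooth every model's raw sentiment series before downstream peak detection.
    window = max(len(numeric) // 10, 1)
    smoothed = numeric.rolling(window=window, center=True, min_periods=1).mean()
    out_path = root / "sentiment_clean" / raw_csv.name
    smoothed.to_csv(out_path, index=False)
    print(f"{raw_csv.name}: {len(df)} rows -> {out_path}")
```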

(back to top)

Reference Corpora

SentimentArcs can be viewed as an end-to-end pipeline to detect, extract, preprocess, and analyze sentiment in any corpus of long-form texts. This includes both individual long-form texts and corpora compiled from individually time-sequenced smaller texts: compilations of specific authors, genres, or periods, as well as tweets, financial reports, topical news articles, speeches, etc. Initially, SentimentArcs focuses on offering users carefully curated reference corpora that provide a ground truth and baseline reference for specific genres of text, including novels, financial texts, and social media. SentimentArcs also enables users to create new corpora of customized texts for specialized sentiment analysis tasks. Currently, SentimentArcs provides reference corpora for these types of texts (with more to be added in the future):

  • Novels
  • Financial Texts
  • Social Media

For example, the reference corpus for novels consists of 25 narratives selected to create a diverse set of well-recognized novels that can serve as a benchmark for sentiment analysis of other texts. The novel corpus spans approximately 2,300 years, from Homer's Odyssey to the 2019 Machines Like Me by award-winning author Ian McEwan. Early twentieth-century modernists such as Marcel Proust and Virginia Woolf are emphasized. In sum, the corpus includes (1) the two most popular novels on Gutenberg.org (Project Gutenberg, 2021b), (2) eight of the fifteen most assigned novels at top US universities (EAB, 2021), and (3) three works that have sold over 20 million copies (Books, 2021). There are eight works by women, two by African-Americans, and five works by two LGBTQ authors. Britain leads with 15 authors, followed by 6 Americans and one each from France, Russia, North Africa, and Ancient Greece.

(back to top)

Installation

SentimentArcs relies upon Google to provide easy-to-use, ubiquitous, and free access to powerful GPU-backed Jupyter notebooks. Here are the free resources you should sign up for to use SentimentArcs:

  • Google Gmail account (for access to GDrive)
  • Activate Colab Jupyter Notebooks on your GDrive via the Google Workspace Marketplace
  • GitHub account (if you wish to report issues or comment)

Colab Jupyter notebooks offer several significant advantages, including easy access via an intuitive web browser, low or no support costs, and a powerful GPU-backed VM for free. However, Colab comes with some limitations to be aware of, including required sequential execution, latency, and a limited interface design.

To set up SentimentArcs, please follow the instructions below carefully, as each step depends upon the previous steps. A hedged sketch of the corresponding notebook cells follows the list.

  1. Log in to Google, go to your GDrive, and create a subfolder to hold your copy of the SentimentArcs project (e.g. /MyDrive/sentimentarcs_notebooks/)
  2. Be sure you have connected the Colab Notebooks app from the Google Workspace Marketplace.
  3. Navigate to your SentimentArcs project subdirectory and create/open a new Colab Notebook.
  4. In the new blank Colab Notebook, go to the top left corner and select [File]->[Open Notebook]. When a pop-up window appears, select [GitHub] from the right side of the top horizontal menu. Enter 'https://github.com/jon-chun/sentimentarcs_notebooks' on the top line after the prompt [Enter a GitHub URL or search by organization or user], click the search icon, and select 'sentiment_arcs_part1_text_preprocessing.ipynb' from the list below.
  5. Run the first code cell to 'Connect Google GDrive' and grant permission for this notebook to connect to your GDrive.
  6. Edit the input in the second code cell to point to the SentimentArcs project directory you defined in Step 1 and fill in any other information requested. Be sure to execute this code cell after entering this information.
  7. Executing the next cell copies the current SentimentArcs code from GitHub if it does not already exist in your GDrive.
(back to top)

Examples

At DHColab we use sentiment analysis to analyze and extract features from all kinds of texts, including novels, social media, news, financial filings, lyrics, speeches, research papers, and poems. SentimentArcs is a formalization of many of the best practices we have developed over the years. Each type of text (e.g. novels, social media, news, financial texts) shares common analysis techniques while also requiring methodologies tailored to its genre. For example, for novels we seek to surface latent features of narrative like plot, financial texts often reveal shifts in investor sentiment, and peaks/valleys in social media sentiment can reflect shifts in public opinion on current events, political candidates, or new products and services.

In addition to Dr. Elkins's Cambridge Elements text referenced above, here are several examples from our DHColab that demonstrate the use of sentiment analysis on various types of text.

Novels:

  1. Adapted Arcs: Sentiment Analysis and The Sorcerer's Stone by Erin Shaheen
  2. Doubles and Reflections: Sentiment Analysis and Vladimir Nabokov’s Pale Fire by Catherine Perloff

Financial Texts:

  1. Computational Approaches to Predicting Cryptocurrency Prices by Chris Pelletier

Social Media:

  1. Analyzing Covid-19 Through a Sentiment Analysis of Twitter Data by Cameron Catana

License

MIT License

Contact and Contribute

SentimentArcs arose from a multi-year collaboration between academia and industry, and across disciplines including comparative literature, econometrics, the social sciences, data analytics, and ML/AI, among others. The world is too interconnected, and the solutions to the most interesting and challenging problems are too complex, for any one domain expert to tackle alone.

As a result, we welcome collaboration and contributions that can help grow SentimentArcs into the premier NLP tool for sentiment analysis, drawing on experts from both technical and non-technical domains. Here are just a few ways you could contribute to SentimentArcs and the broader Digital Humanities and NLP community:

  1. Use SentimentArcs to analyze existing reference corpora to identify strengths/limitations of various models, optimal hyperparameters, interpretations, etc
  2. Contribute new texts (e.g. novels, financial reports, social media compilations)
  3. Compile or expand the reference corpora for Finance, Social Media, or other text genres
  4. Suggest or contribute code to add new sentiment analysis models
  5. Help with documentation, training and interpretation
  6. Bug identification/fixes
  7. Suggestions or code for new features and improved performance

(back to top)
