Predict an emoji that is associated with a text

Overview

Sentiment Analysis

Sentiment analysis in computational linguistics is a general term for techniques that quantify sentiment or mood in a text. Can you tell from a text whether the writer is happy? Angry? Disappointed? Can you put their happiness on a 1-5 scale?

Robust tools for sentiment analysis are often very desirable for companies. Imagine that a company has just launched a new product, GizmoX, and management wants to know how customers feel about it. Instead of calling or writing to each person who bought GizmoX, we could have a program search the web for message-board posts that discuss GizmoX and automatically rate the writers' attitudes toward their recent purchase; valuable information could be obtained practically for free. Because sentiment analysis is used so widely for this purpose, it is sometimes called Opinion Mining.

Of course, to be really accurate at analyzing sentiment you almost have to have a human in the loop. There are many subtleties in texts that computer algorithms still have a hard time with, such as detecting sarcasm. But for many practical purposes you don't have to be 100% accurate in your analysis for it to be useful. A sentiment analyzer that gets it right 80% of the time can still be very valuable.

Emoji Prediction

Emoji prediction is a fun variant of sentiment analysis. When texting your friends, can you tell their emotional state? Are they happy? Could you put an appropriate smiley on each text message you receive? If so, you probably understand their sentiment.

In this project, we build what's called a classifier that learns to associate emojis with sentences. Although there are many technical details, the principle behind the classifier is very simple: we start with a large number of sentences that contain emojis collected from Twitter messages. Then we look at features from those sentences (words, word pairs, etc.) and train our classifier to associate certain features with their (known) smileys. For example, if the classifier sees the word "happy" in many sentences that also have the smiley 😂 , it will learn to classify such messages as 😂 . On the other hand, the word "happy" could be preceded by "not", in which case we shouldn't rely on just single words being associated with certain smileys. For this reason, we also look at word sequences, and in this case would learn that "not happy" is more strongly associated with sadness, outweighing the "happy" part. The classifier learns to look at the totality of many word sequences found in a sentence and figures out what class of smiley would best characterize that sentence. Although the principle is simple, if we have millions of words of text with known smileys associated with the sentences, we can actually learn to do pretty well on this task.

If you don't want to actually re-create the classifier, you can skip ahead to the Error Analysis section where you'll see how well it does in predicting 7 different smileys after being "trained" on some text.

Technical: Quickstart

To use this project, you need to install Python 3, Jupyter Notebook, and some Python libraries.

Install

Install python3

If you don't have python3 on your computer, there are two options:

  • Download python3 from Anaconda, which includes Python, Jupyter Notebook, and the other libraries.
  • Download python3 from python.org

Install packages

All packages used for this project are listed in requirements.txt. To install them, you can run

$ pip3 install -r requirements.txt

Download project

To download this project repository, you can run

$ git clone https://github.com/TetsumichiUmada/text2emoji.git

Run jupyter notebook

To start Jupyter Notebook, move to the project directory with cd path_to/text2emoji, then run

$ jupyter notebook

See Running the Notebook for more details.

Project Details

The goal of this project is to predict an emoji that is associated with a text message. To accomplish this task, we train and test several supervised machine learning models on a labeled data set to predict the sentiment associated with a text message. We then represent the predicted sentiment as an emoji.

Data Sets

The data comes from the DeepEmoji/data repository. Since the file format is a pickle, we wrote a Python 2 script to convert the pickle to a txt file. The data (both pickle and txt files) and scripts are stored in the text2emoji/data directory.
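
As a rough illustration, a conversion along these lines could look like the sketch below. This is not the actual script in text2emoji/data; the file names and the assumed (label vector, text) layout of the pickle are only illustrative.

import pickle

# Assumed file names and pickle layout; the real script and data may differ.
with open("psychexp.pickle", "rb") as f:
    samples = pickle.load(f)  # e.g. [([1, 0, 0, 0, 0, 0, 0], "Passed the last exam."), ...]

with open("psychexp.txt", "w") as out:
    for label, text in samples:
        out.write("{} {}\n".format(list(label), text))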

Among the data available in the repository, we use the PsychExp dataset for this project. The file contains 7840 samples; each line holds a text message and its sentiment label, represented as a vector [joy, fear, anger, sadness, disgust, shame, guilt].

In the txt file, each line is formatted as follows:

[ 1.  0.  0.  0.  0.  0.  0.] Passed the last exam.

Since the first position of the vector is 1, the text is labeled as an instance of joy.
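
For illustration, here is a minimal sketch of how such a line can be parsed back into a label vector and a message; the helper name is made up, and the emotion order follows the vector above.

EMOTIONS = ["joy", "fear", "anger", "sadness", "disgust", "shame", "guilt"]

def parse_line(line):
    # The label vector sits between the square brackets; the message follows it.
    label_part, text = line.split("]", 1)
    label = [float(x) for x in label_part.strip("[ ").split()]
    return label, text.strip()

label, text = parse_line("[ 1.  0.  0.  0.  0.  0.  0.] Passed the last exam.")
print(EMOTIONS[label.index(1.0)], "->", text)  # joy -> Passed the last exam.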

For more information about the original data sets, please check DeepEmoji/data and text2emoji/data.

Preprocess and Features

How does a computer understand a text message and analyze its sentiment? A text message is a series of words. To be able to process text messages, we need to convert text into numerical features.

One way to convert a text into numerical features is to use n-grams. An n-gram is a sequence of n words from a given text. A 2-gram (bigram) is a sequence of two words, for instance "thank you" or "your project", and a 3-gram (trigram) is a three-word sequence such as "please work on" or "turn your homework".

For this project, we first convert all the texts to lower case. Then we create n-grams with n ranging from 1 to 4 and count how many times each n-gram appears in the text.
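
One common way to do this in Python is scikit-learn's CountVectorizer; the sketch below assumes that tool and a couple of made-up example sentences, so the exact feature set used in the project may differ.

from sklearn.feature_extraction.text import CountVectorizer

# Lowercase the text (the default) and count every 1- to 4-gram.
vectorizer = CountVectorizer(ngram_range=(1, 4), lowercase=True)
X = vectorizer.fit_transform(["I am not happy about the exam",
                              "I am happy about the exam"])

# "not happy" appears as its own bigram feature, separate from the unigram "happy".
print(sorted(vectorizer.vocabulary_)[:10])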

Models and Results

Building a machine learning model involves two main steps. The first step is to train a model. After that, we evaluate the model on a separate data set---i.e. we don't evaluate performance on the same data we learned from. For this project, we use four classifiers and train each classifier to see which one works best. To train and test the performance of each model, we split the data set into a "training set" and a "test set", in a ratio of 80% to 20%. By keeping the two sets separate, we can check that the model generalizes and will perform well in the real world.
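
A minimal sketch of such a split, assuming scikit-learn's train_test_split and a tiny placeholder data set (the project uses the full PsychExp data instead):

from sklearn.model_selection import train_test_split

# Placeholder messages and labels standing in for the parsed PsychExp data.
texts = ["Passed the last exam.", "I lost my wallet.", "He shouted at me.", "What a great day!"]
labels = ["joy", "sadness", "anger", "joy"]

# Hold out 20% of the samples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)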

We evaluate the performance of each model by calculating an accuracy score. The accuracy score is simply the proportion of classifications that were done correctly and is calculated by

$$ \text{Accuracy} = \frac{\text{number of correct classifications}}{\text{total number of classifications made}} $$

For this project, we tested the following classifiers. Their accuracy scores are summarized in the table below, and a code sketch of the comparison follows the table.

Classifier               Training Accuracy   Test Accuracy
SVC                      0.1458890           0.1410428
LinearSVC                0.9988302           0.5768717
RandomForestClassifier   0.9911430           0.4304813
DecisionTreeClassifier   0.9988302           0.4585561
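
The comparison can be reproduced along the lines of the sketch below. It uses the four scikit-learn classifiers named in the table with their default settings on a toy data set, so the hyperparameters and numbers will not match the project exactly.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the 7840 PsychExp messages.
texts = ["Passed the last exam.", "I lost my wallet.", "He shouted at me.",
         "What a great day!", "I feel so guilty about it.", "That smell is disgusting."]
labels = ["joy", "sadness", "anger", "joy", "guilt", "disgust"]

X = CountVectorizer(ngram_range=(1, 4)).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train each classifier and report accuracy on the training and test sets.
for clf in [SVC(), LinearSVC(), RandomForestClassifier(), DecisionTreeClassifier()]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__,
          accuracy_score(y_train, clf.predict(X_train)),
          accuracy_score(y_test, clf.predict(X_test)))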

Based on the accuracy scores, the SVC classifier runs but gives poor results: its accuracy is close to chance for seven classes (1/7 ≈ 0.14). The LinearSVC classifier works quite well, although we see some overfitting (the training accuracy is high while the test accuracy is significantly lower), which means the model has difficulty generalizing to examples it hasn't seen.

We observe the same overfitting for the other classifiers. In the error analysis, we therefore focus on the LinearSVC classifier, which performs the best.

Error Analysis

We analyze the classification results from the best performing (LinearSVC) model, using a confusion matrix. A confusion matrix is a table which summarizes the performance of a classification algorithm and reveals the type of misclassifications that occur. In other words, it shows the classifier's confusion between classes. The rows in the matrix represent the true labels and the columns are predicted labels. A perfect classifier would have big numbers on the main diagonal and zeroes everywhere else.
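
A confusion matrix for the seven emotions can be computed with scikit-learn as sketched below; the true and predicted labels here are made up for illustration, whereas the project uses the LinearSVC predictions on the test set.

from sklearn.metrics import confusion_matrix

EMOTIONS = ["joy", "fear", "anger", "sadness", "disgust", "shame", "guilt"]

# Illustrative labels only; in the project y_true and y_pred come from the test set.
y_true = ["joy", "joy", "anger", "guilt", "shame", "sadness"]
y_pred = ["joy", "joy", "anger", "shame", "guilt", "sadness"]

# Rows are true labels, columns are predicted labels.
print(confusion_matrix(y_true, y_pred, labels=EMOTIONS))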

It is obvious that the classifier has learned many significant patterns: the numbers along the diagonal are much higher than off the diagonal. That means true anger most often gets classified as anger, and so on.

The classifier often misclassifies text messages associated with guilt, shame, and anger, perhaps because it is hard to pinpoint specific words or word sequences that characterize these sentiments. Messages involving joy, on the other hand, are more likely to contain words such as "good", "like", and "happy", and the classifier handles such sentiments much better.

Future Work

To improve on the current results, we probably, first and foremost, need access to more training data. Adding more specific features extracted from the text may also help; for example, paying attention to the use of all caps, punctuation patterns, and similar cues would probably improve the classifier.

A statistical analysis of the features, for example a Chi-squared test to find the most informative tokens, could also provide insight. As in many other tasks, moving from a linear classifier to a deep learning (neural network) model would probably also boost the performance.
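
As one possible direction, the chi-squared analysis could be sketched with scikit-learn's SelectKBest; the data and the value of k below are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["Passed the last exam.", "I lost my wallet.", "What a great day!", "He shouted at me."]
labels = ["joy", "sadness", "joy", "anger"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# Keep the k n-grams whose counts depend most strongly on the label.
selector = SelectKBest(chi2, k=5)
selector.fit(X, labels)
print([f for f, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep])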

Example/Demo

Here are four example sentences and the emojis the classifier associates with them (a sketch of how such predictions are produced follows the list):

😂 Thank you for dinner!
😢 I don't like it
😱 My car skidded on the wet street
😢 My cat died
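
Below is a hedged sketch of how predictions like these can be produced once a model is trained; the emoji mapping and the tiny training set are illustrative, not the exact ones used in the notebook.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny stand-in training set; the project trains on the full PsychExp data instead.
texts = ["Thank you for dinner!", "I don't like it", "My cat died", "What a great day!"]
labels = ["joy", "sadness", "sadness", "joy"]

# Illustrative sentiment-to-emoji mapping.
EMOJI = {"joy": "😂", "fear": "😱", "anger": "😠", "sadness": "😢",
         "disgust": "😒", "shame": "😳", "guilt": "😓"}

vectorizer = CountVectorizer(ngram_range=(1, 4))
model = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

for message in ["Thank you for dinner!", "My cat died"]:
    sentiment = model.predict(vectorizer.transform([message]))[0]
    print(EMOJI[sentiment], message)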
