Train emoji embeddings based on emoji descriptions.

emoji2vec

This is my attempt to train, visualize and evaluate emoji embeddings as presented by Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel in their paper [1]. Most of their results are reproduced here to build an equally robust model in Keras, including the rather simple training process, which is based solely on emoji descriptions; however, instead of word2vec (as originally proposed), this version uses global vectors (GloVe) [2].

Overview

  • src/ contains the code used to process the emoji descriptions as well as training and evaluating the emoji embeddings
  • res/ contains the positive and negative samples used to train the emoji embeddings (originated here) as well as a list of emoji frequencies; it should also contain the global vectors in a directory called glove/ (for practical reasons they are not included in the repository, but downloading instructions are provided below)
  • models/ contains some pretrained emoji2vec models
  • plots/ contains some visualizations for the obtained emoji embeddings

Dependencies

The code included in this repository has been tested with Python 3.5 on an Ubuntu 16.04 machine, using Keras 2.0.8 with TensorFlow as the backend.

List of requirements

Implementation notes

Following Eisner et al. [1], training is based on 6088 descriptions of 1661 distinct emojis. Since all descriptions are valid, negative instances are randomly sampled so that there is one negative example for every positive example, the 1:1 ratio that the paper reports to produce the best results.
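
As a rough illustration of this sampling scheme (not the repository's exact code; positive_pairs and its origin are assumed for the example), the 1:1 negative sampling could look like this:

import random

def add_negative_samples(positive_pairs, seed=42):
    # positive_pairs: list of (emoji, description) tuples read from the res/ files
    random.seed(seed)
    descriptions = [desc for _, desc in positive_pairs]
    samples = []
    for emoji, desc in positive_pairs:
        samples.append((emoji, desc, 1))          # true description -> label 1
        negative = random.choice(descriptions)
        while negative == desc:                   # avoid sampling the true description
            negative = random.choice(descriptions)
        samples.append((emoji, negative, 0))      # randomly paired description -> label 0
    return samples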

There are two architectures on which emoji vectors have been trained (a rough Keras sketch of both is given after this list):

  • one based on the sum of the individual word vectors of the emoji descriptions (taken from the paper)

[emoji2vec architecture diagram]

  • the other feeds the actual pretrained word embeddings to an LSTM layer (this is my own addition, enabled by setting use_lstm=True, i.e. -l=True)

[emoji2vec_lstm architecture diagram]
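
The sketch below illustrates the idea behind both variants; it is not the exact model defined in src/, and names such as NUM_EMOJIS and MAX_SEQ_LEN are assumptions:

from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, Dropout, LSTM, Dot, Activation

NUM_EMOJIS = 1661      # distinct emojis in the training data
EMBEDDING_DIM = 300    # must match the GloVe dimension
MAX_SEQ_LEN = 10       # only relevant for the LSTM variant

def build_model(use_lstm=False, dense_units=600, dropout=0.3):
    # Trainable emoji embedding, looked up by emoji index
    emoji_input = Input(shape=(1,), dtype='int32')
    emoji_vec = Flatten()(Embedding(NUM_EMOJIS, EMBEDDING_DIM)(emoji_input))

    if use_lstm:
        # The description is fed as a padded/truncated sequence of GloVe word vectors
        desc_input = Input(shape=(MAX_SEQ_LEN, EMBEDDING_DIM))
        desc_vec = LSTM(dense_units)(desc_input)
    else:
        # The description is the pre-computed sum of its GloVe word vectors
        desc_input = Input(shape=(EMBEDDING_DIM,))
        desc_vec = Dense(dense_units, activation='relu')(desc_input)

    desc_vec = Dropout(dropout)(desc_vec)
    desc_vec = Dense(EMBEDDING_DIM)(desc_vec)     # project back to the emoji dimension

    # Probability that the description matches the emoji
    match = Activation('sigmoid')(Dot(axes=-1)([emoji_vec, desc_vec]))

    model = Model(inputs=[emoji_input, desc_input], outputs=match)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model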

Unlike the referenced paper, this version uses global vectors (GloVe), which need to be downloaded and placed in the res/glove directory. You can either download them from the official GloVe page or run these bash commands:

wget -q http://nlp.stanford.edu/data/glove.6B.zip
unzip -q -o glove.6B.zip
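
Once the extracted files are placed in res/glove/, they can be loaded into a dictionary with a few lines of standard code (a sketch; the file name assumes the 300-dimensional GloVe file):

import numpy as np

def load_glove(path="res/glove/glove.6B.300d.txt"):
    word_vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word_vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return word_vectors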

Arguments

All the hyperparameters can easily be changed through a command-line interface, as described below (a rough argparse sketch follows this list):

  • -d: embedding dimension for both the global vectors and the emoji vectors (default 300)
  • -b: batch size (default 8)
  • -e: number of epochs (default 80, but we always perform early-stopping)
  • -dr: dropout rate (default 0.3)
  • -lr: learning rate (default 0.001, but we also have a callback to reduce learning rate on plateau)
  • -u: number of hidden units in the dense layer (default 600)
  • -l: boolean flag that enables the LSTM architecture (default False)
  • -s: maximum sequence length (only needed when use_lstm=True; default 10, while the actual maximum description length is 27, so word sequences are post-truncated or post-padded accordingly)
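
Roughly, these flags map onto an argparse interface like the one sketched below (the actual definitions in emoji2vec.py may differ slightly):

import argparse

parser = argparse.ArgumentParser(description="Train emoji2vec embeddings")
parser.add_argument("-d", type=int, default=300, help="embedding dimension")
parser.add_argument("-b", type=int, default=8, help="batch size")
parser.add_argument("-e", type=int, default=80, help="number of epochs")
parser.add_argument("-dr", type=float, default=0.3, help="dropout rate")
parser.add_argument("-lr", type=float, default=0.001, help="learning rate")
parser.add_argument("-u", type=int, default=600, help="hidden units in the dense layer")
# parse values such as "-l=True" into a real boolean
parser.add_argument("-l", type=lambda v: v.lower() == "true", default=False,
                    help="use the LSTM architecture")
parser.add_argument("-s", type=int, default=10, help="maximum sequence length")
args = parser.parse_args()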

Training your own emoji2vec

To train your own emoji embeddings, run python3 emoji2vec.py and use the arguments described above to tune your hyperparameters.

Here is an example that will train 300-dimensional emoji vectors using the LSTM-based architecture with a maximum sequence length of 20, batch size of 8, 40 epochs, a dropout of 0.5, a learning rate of 0.0001 and 300 dense units:

python3 emoji2vec.py -d=300 -b=8 -e=40 -dr=0.5 -lr=0.0001 -u=300 -l=True -s=20 

The script given above will create and save several files:

  • in models/ it will save the weights of the model (.h5 format), a .txt file containing the trained embeddings and a .csv file with the x, y emoji coordinates that will be used to produce a 2D visualization of the emoji2vec vector space
  • in plots/ it will save two plots of the historical accuracy and loss reached while training as well as a 2D plot of the emoji vector space
  • it will also run an analogy task to evaluate the meaning captured by the trained emoji vectors (results are printed to standard output)

Using the pre-trained models

Pretrained emoji embeddings are available for download and use. 100- and 300-dimensional embeddings are included in this repository, but any dimension can be trained manually (you need to provide word embeddings of the same dimension, though). The complete emoji2vec weights, visualizations and embeddings (for different dimensions and for both architectures) are available for download at this link.

For the pre-trained embeddings provided in this repository (trained on the originally proposed architecture), the following hyperparameter settings were used, largely following the original authors' choices (an equivalent training command is shown after this list):

  • dim: 100 or 300
  • batch: 8
  • epochs: 80 (early stopping usually triggers around epoch 30-40)
  • dense_units: 600
  • dropout: 0.0
  • learning_rate: 0.001
  • use_lstm: False
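
For instance, the 300-dimensional variant above corresponds to a training command along these lines (the LSTM flag is simply left at its default of False):

python3 emoji2vec.py -d=300 -b=8 -e=80 -dr=0.0 -lr=0.001 -u=600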

For the LSTM-based pre-trained embeddings provided at the download link, the following hyperparameter settings were used (an equivalent training command is shown after this list):

  • dim: 50, 100, 200 or 300
  • batch: 8
  • epochs: 80 (early stopping usually triggers around epoch 40-50)
  • dense_units: 600
  • dropout: 0.3
  • learning_rate: 0.0001
  • use_lstm: True
  • seq_length: 10
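
For instance, the 300-dimensional LSTM variant above corresponds roughly to:

python3 emoji2vec.py -d=300 -b=8 -e=80 -dr=0.3 -lr=0.0001 -u=600 -l=True -s=10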

Example code showing how to use the emoji embeddings, after downloading them and setting their dimension (embedding_dim):

from utils import load_vectors

embedding_dim = 300  # dimension of the downloaded embeddings (100 or 300 for this repository)
embeddings_filename = "models/emoji_embeddings_%dd.txt" % embedding_dim
emoji2vec = load_vectors(filename=embeddings_filename)

# Get the embedding vector of length embedding_dim for the dog emoji
dog_vector = emoji2vec['🐕']
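
For example, a quick similarity check between two of the loaded vectors (assuming load_vectors returns a dictionary mapping each emoji to a numeric vector; the emojis below are just examples):

import numpy as np

def cosine_similarity(u, v):
    u, v = np.asarray(u, dtype="float32"), np.asarray(v, dtype="float32")
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(emoji2vec['🐕'], emoji2vec['🐈']))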

Visualization

A nice visualization of the emoji embeddings is obtained by using t-SNE to project them from N dimensions down to 2. For practical purposes, only a fraction of the available emojis is projected (the most frequent ones, selected according to emoji_frequencies.txt).

Here, the top 200 most popular emojis have been projected in a 2D space:

[t-SNE visualization of the top 200 emojis in the emoji2vec space]
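
A rough sketch of how such a projection can be produced with scikit-learn is shown below; the actual plotting code lives in src/, and the frequency-based emoji selection is only hinted at here:

from utils import load_vectors
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

emoji2vec = load_vectors(filename="models/emoji_embeddings_300d.txt")
emojis = list(emoji2vec.keys())[:200]   # ideally the 200 most frequent, per emoji_frequencies.txt
vectors = np.array([emoji2vec[e] for e in emojis], dtype="float32")

# Project the N-dimensional emoji vectors down to 2 dimensions
coords = TSNE(n_components=2, random_state=42).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), emoji in zip(coords, emojis):
    plt.annotate(emoji, (x, y), fontsize=9)
plt.savefig("plots/emoji_tsne_sketch.png")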

Making emoji analogies

The trained emoji embeddings are evaluated on an analogy task, in a similar manner to word embeddings. Such analogies are broadly interpreted as similarity relations between pairs of emojis: if the embeddings capture meaningful linear relationships between emojis directly in the vector space, they should also be useful for, and extendable to, other tasks [1].

According to ACL's wiki page, a proportional analogy holds between two word pairs: a-a* :: b-b* (a is to a* as b is to b*). For example, Tokyo is to Japan as Paris is to France and a king is to a man as a queen is to a woman.

Therefore, in the current analogy task, we aim to find the 5 most suitable emojis to solve a - b + c = ? by measuring the cosine distance between the trained emoji vectors.
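
A minimal sketch of this lookup (illustrative only; the actual implementation lives in src/) could rank every emoji by cosine similarity to the vector a - b + c:

import numpy as np
from utils import load_vectors

emoji2vec = load_vectors(filename="models/emoji_embeddings_300d.txt")

def analogy(a, b, c, embeddings, top_n=5):
    # Rank all emojis by cosine similarity to the analogy vector a - b + c
    target = np.asarray(embeddings[a], dtype="float32") \
           - np.asarray(embeddings[b], dtype="float32") \
           + np.asarray(embeddings[c], dtype="float32")
    target /= np.linalg.norm(target)
    scores = {}
    for emoji, vec in embeddings.items():
        if emoji in (a, b, c):
            continue                              # skip the query emojis themselves
        vec = np.asarray(vec, dtype="float32")
        scores[emoji] = float(np.dot(vec, target) / np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(analogy('👑', '🚹', '🚺', emoji2vec))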

Here are some of the analogies obtained:

👑 - 🚹 + 🚺 = [' 👸 ', ' 🇮🇱 ', ' 👬 ', ' ♋ ', ' 💊 ']

💵 - 🇺🇸 + 🇪🇺 = [' 🇦🇴 ', ' 🇸🇽 ', ' 🇮🇪 ', ' 🇭🇹 ', ' 🇰🇾 ']

🕶 - ☀ + ⛈ = [' 👞 ', ' ๐Ÿ  ', ' ๐Ÿ– ', ' 🕒 ', ' ๐ŸŽ ']

☂ - ⛈ + ☀ = [' 🌫 ', '💅🏾', ' ๐ŸŽ ', ' 📛 ', ' 🇧🇿 ']

๐Ÿ… - ๐Ÿˆ + ๐Ÿ• = [' ๐Ÿ˜ฟ ', ' ๐Ÿ ', ' ๐Ÿ‘ฉ ', ' ๐Ÿฅ ', ' ๐Ÿˆ ']

🌃 - 🌙 + 🌞 = [' 🌚 ', ' 🌗 ', ' 😘 ', '👶🏼', ' ☹ ']

😴 - 🛌 + 🏃 = [' 🌞 ', ' ๐Ÿ’ ', ' ๐ŸŒ ', ' ☣ ', ' 😚 ']

๐Ÿฃ - ๐Ÿฏ + ๐Ÿฐ = [' ๐Ÿ’ฑ ', '๐Ÿ‘๐Ÿฝ', ' ๐Ÿ‡ง๐Ÿ‡ท ', ' ๐Ÿ”Œ ', ' ๐Ÿ„ ']

💉 - 🏥 + 🏦 = ['💇🏼', ' โœ ', ' 🎢 ', ' 📲 ', ' ☪ ']

💊 - 🏥 + 🏦 = [' 📻 ', ' ๐Ÿ˜ ', ' 🚌 ', ' 🈺 ', '🇼']

😀 - 💰 + 🤑 = ['🚵🏼', ' 🇹🇲 ', ' ๐ŸŒ ', ' ๐ŸŒ ', ' 🎯 ']

License

The source code and all my pretrained models are licensed under the MIT license.

References

[1] Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. "emoji2vec: Learning Emoji Representations from their Description," in Proceedings of the 4th International Workshop on Natural Language Processing for Social Media at EMNLP 2016 (SocialNLP at EMNLP 2016), November 2016.

[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing (EMNLP 2014), October 2014.
