pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

Overview

A Transformer-based library for SocialNLP classification tasks.

Currently supports:

  • Sentiment Analysis (Spanish, English)
  • Emotion Analysis (Spanish, English)

Just do pip install pysentimiento and start using it:

from pysentimiento import SentimentAnalyzer, EmotionAnalyzer
analyzer = SentimentAnalyzer(lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})

analyzer.predict("jejeje no te creo mucho")
# returns SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})

# Emotion Analysis in English

emotion_analyzer = EmotionAnalyzer(lang="en")

emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})
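
Some predictions above are far less certain than others (for example, NEG at 0.587 for "jejeje no te creo mucho"). The following is a minimal sketch of one way to discard low-confidence predictions. It works on a plain dictionary of label probabilities; the helper name confident_label and the 0.7 threshold are hypothetical, and reading the dictionary from a probas attribute is an assumption based on the printed outputs above.

```python
def confident_label(probas, threshold=0.7, fallback=None):
    """Return the most probable label, or `fallback` if it is below `threshold`.

    `probas` is a plain dict mapping label -> probability.
    The name and the default threshold are illustrative, not part of pysentimiento.
    """
    label, score = max(probas.items(), key=lambda item: item[1])
    return label if score >= threshold else fallback


# Probabilities copied from the examples above
print(confident_label({"POS": 0.998, "NEG": 0.002, "NEU": 0.000}))
print(confident_label({"NEG": 0.587, "NEU": 0.408, "POS": 0.005}))

# Assumed usage with the library (attribute name inferred from the printed output):
# confident_label(analyzer.predict("jejeje no te creo mucho").probas)
```

The right threshold depends on the application, so it is worth tuning on a sample of your own data.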

You can also use the pretrained models directly with the transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")

model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")
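
Loading the tokenizer and model does not by itself produce a prediction. The sketch below shows one possible way to turn the model's logits into label probabilities. The function names are hypothetical, the torch and transformers imports are deferred so the pure-Python helpers work without them, and the labels are read from the model's own id2label configuration. When going this route, the text probably needs the same preprocessing the library applies (see the Preprocessing section); that is an assumption, not something this sketch does.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(logits, id2label):
    """Map each class index to its label and softmax probability."""
    return {id2label[i]: p for i, p in enumerate(softmax(logits))}


def predict(text, model_name="finiteautomata/beto-sentiment-analysis"):
    """Download the model (network access required) and classify one text."""
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_probabilities(logits, model.config.id2label)


# Example call (downloads the model on first use):
# predict("Qué gran jugador es Messi")
```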

Preprocessing

pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.

from pysentimiento.preprocessing import preprocess_tweet

# Replaces user handles and URLs with special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"

# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"

# Normalizes laughter
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"

# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"

# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'
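
To give an idea of what this kind of normalization involves, here is a simplified, regex-based sketch of two of the steps above: replacing handles and URLs, and shortening repeated characters. It is an illustration only and not the implementation of preprocess_tweet; the function name simple_preprocess and its parameters are hypothetical, and laughter, hashtag and emoji handling are left out.

```python
import re

USER_RE = re.compile(r"@\w+")          # user handles such as @someone
URL_RE = re.compile(r"https?://\S+")   # http:// and https:// links


def simple_preprocess(text, user_token="@usuario", url_token="url", shorten=2):
    """Illustrative stand-in for part of a tweet preprocessor."""
    text = USER_RE.sub(user_token, text)
    text = URL_RE.sub(url_token, text)
    # Collapse any run of more than `shorten` identical word characters
    repeated = re.compile(r"(\w)\1{" + str(shorten) + r",}")
    return repeated.sub(lambda m: m.group(1) * shorten, text)


print(simple_preprocess("@perezjotaeme debería cambiar esto http://bit.ly/sarasa"))
print(simple_preprocess("no entiendo naaaaaaaadaaaaaaaa", shorten=2))
```

For real use, prefer preprocess_tweet, which also covers the cases this sketch skips.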

Trained models so far

Check CLASSIFIERS.md for details on the reported performance of each model.

Spanish models

English models

Instructions for developers

  1. First, download the TASS 2020 data to data/tass2020 (registration is required to download the dataset)

Labels must be placed under data/tass2020/test1.1/labels

  2. Run the script to train the models

Check TRAIN_EVALUATE.md

  3. Upload the models to Hugging Face's Model Hub

Check the "Model sharing and upload" instructions in the Hugging Face docs.
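
Before training, it can help to confirm that the dataset directories from the first step are in place. This is a small sketch of such a check, assuming it is run from the repository root; the helper name missing_dirs is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directories named in the first step of the developer instructions
REQUIRED_DIRS = ("data/tass2020", "data/tass2020/test1.1/labels")


def missing_dirs(project_root="."):
    """Return the required dataset directories that do not exist yet."""
    root = Path(project_root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


# Example: report anything that still needs to be downloaded or moved
# print(missing_dirs())
```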

License

pysentimiento is an open-source library. However, please be aware that the models are trained with third-party datasets and are subject to their respective licenses, many of which allow non-commercial use only.

  1. TASS Dataset license (License for Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)
  2. SemEval 2017 Dataset license (Sentiment Analysis in English)

Citation

If you use pysentimiento in your work, please cite this paper:

@misc{perez2021pysentimiento,
      title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
      author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
      year={2021},
      eprint={2106.09462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

TODO:

  • Upload some other models
  • Train in other languages

Suggestions and bugfixes

Please use the repository issue tracker to report bugs and make suggestions (new models, other datasets, other languages, etc.).
