TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Overview

TweebankNLP

License

This repo contains the new Tweebank-NER dataset and Twitter-Stanza pipeline for state-of-the-art Tweet NLP. Tweebank-NER V1.0 is the annotated NER dataset based on Tweebank V2, the main UD treebank for English Twitter NLP tasks. The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and provide both Python and command-line interfaces for users.

Installation

# please install from the source
pip install -e .

# download glove and pre-trained models
sh download_twitter_resources.sh

Python Interface for Twitter-Stanza

import stanza

# config for the `en_tweet` pipeline (trained only on Tweebank)
config = {
          'processors': 'tokenize,lemma,pos,depparse,ner',
          'lang': 'en',
          'tokenize_pretokenized': True, # disable tokenization
          'tokenize_model_path': './saved_models/tokenize/en_tweet_tokenizer.pt',
          'lemma_model_path': './saved_models/lemma/en_tweet_lemmatizer.pt',
          "pos_model_path": './saved_models/pos/en_tweet_tagger.pt',
          "depparse_model_path": './saved_models/depparse/en_tweet_parser.pt',
          "ner_model_path": './saved_models/ner/en_tweet_nertagger.pt'
}

# Initialize the pipeline using a configuration dict
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result

Running Twitter-Stanza (Command Line Interface)

NER

We provide two pre-trained Stanza NER models:

  • en_tweenut17: trained on TB2+WNUT17
  • en_tweet: trained on TB2
source twitter-stanza/scripts/config.sh

python stanza/utils/training/run_ner.py en_tweenut17 \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.test.json

Syntactic NLP Models

We provide two pre-trained models for the following NLP tasks:

  • tweet_ewt: trained on TB2+UD-English-EWT
  • en_tweet: trained on TB2

1. Tokenization

python stanza/utils/training/run_tokenizer.py tweet_ewt \
--mode predict \
--score_test \
--txt_file data/tokenize/en_tweet.test.txt \
--label_file  data/tokenize/en_tweet-ud-test.toklabels \
--no_use_mwt 

2. Lemmatization

python stanza/utils/training/run_lemma.py tweet_ewt \
--mode predict \
--score_test \
--gold_file data/depparse/en_tweet.test.gold.conllu \
--eval_file data/depparse/en_tweet.test.in.conllu 

3. POS Tagging

python stanza/utils/training/run_pos.py tweet_ewt \
--mode predict \
--score_test \
--eval_file data/pos/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu 

4. Dependency Parsing

python stanza/utils/training/run_depparse.py tweet_ewt \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.txt \
--eval_file data/depparse/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu 

Training Twitter-Stanza

Please refer to the TRAIN_README.md for training the Twitter-Stanza neural pipeline.

References

If you use this repository in your research, please kindly cite our paper as well as the Stanza papers.

@article{jiang2022tweebank,
    title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
    author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
    publisher={arXiv},
    year={2022}
}

Acknowledgement

The Twitter-Stanza pipeline is a friendly fork from the Stanza libaray with a few modifications to adapt to tweets. The repository is fully compatible with Stanza. This research project is funded by MIT Center for Constructive Communication (CCC).

Owner
Laboratory for Social Machines
Promoting deeper learning and understanding in human networks | Publications: http://socialmachines.org/publications
Laboratory for Social Machines
A simple version of DeTR

DeTR-Lite A simple version of DeTR Before you enjoy this DeTR-Lite The purpose of this project is to allow you to learn the basic knowledge of DeTR. P

Jianhua Yang 11 Jun 13, 2022
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP

Stat4ML Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP This is the first course from our trio courses: Statistics Foundatio

Omid Safarzadeh 83 Dec 29, 2022
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
Code for paper: An Effective, Robust and Fairness-awareHate Speech Detection Framework

BiQQLSTM_HS Code and data for paper: Title: An Effective, Robust and Fairness-awareHate Speech Detection Framework. Authors: Guanyi Mou and Kyumin Lee

Guanyi Mou 2 Dec 27, 2022
auto_code_complete is a auto word-completetion program which allows you to customize it on your need

auto_code_complete v1.3 purpose and usage auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the m

RUO 2 Feb 22, 2022
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training Code and model from our AAAI 2021 paper

Amazon Web Services - Labs 83 Jan 09, 2023
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

Tao Lei 14 Dec 12, 2022
Optimal Transport Tools (OTT), A toolbox for all things Wasserstein.

Optimal Transport Tools (OTT), A toolbox for all things Wasserstein. See full documentation for detailed info on the toolbox. The goal of OTT is to pr

OTT-JAX 255 Dec 26, 2022
An IVR Chatbot which can exponentially reduce the burden of companies as well as can improve the consumer/end user experience.

IVR-Chatbot Achievements 🏆 Team Uhtred won the Maverick 2.0 Bot-a-thon 2021 organized by AbInbev India. ❓ Problem Statement As we all know that, lot

ARYAMAAN PANDEY 9 Dec 08, 2022
Every Google, Azure & IBM text to speech voice for free

TTS-Grabber Quick thing i made about a year ago to download any text with any tts voice, over 630 voices to choose from currently. It will split the i

16 Dec 07, 2022
A demo for end-to-end English and Chinese text spotting using ABCNet.

ABCNet_Chinese A demo for end-to-end English and Chinese text spotting using ABCNet. This is an old model that was trained a long ago, which serves as

Yuliang Liu 45 Oct 04, 2022
This is the Alpha of Nutte language, she is not complete yet / Essa é a Alpha da Nutte language, não está completa ainda

nutte-language This is the Alpha of Nutte language, it is not complete yet / Essa é a Alpha da Nutte language, não está completa ainda My language was

catdochrome 2 Dec 18, 2021
A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

Won Joon Yoo 335 Jan 04, 2023
:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 3.6k Jan 02, 2023
Awesome-NLP-Research (ANLP)

Awesome-NLP-Research (ANLP)

Language, Information, and Learning at Yale 72 Dec 19, 2022
Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
Control the classic General Instrument SP0256-AL2 speech chip and AY-3-8910 sound generator with a Raspberry Pi and this Python library.

GI-Pi Control the classic General Instrument SP0256-AL2 speech chip and AY-3-8910 sound generator with a Raspberry Pi and this Python library. The SP0

Nick Bild 8 Dec 15, 2021
A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.

Basic-UI-for-GPT-J-6B-with-low-vram A repository to run GPT-J-6B on low vram systems by using both ram, vram and pinned memory. There seem to be some

90 Dec 25, 2022