Learning the meanings behind words is a key element of NLP. This project concentrates on the disambiguation of preposition senses. To this end, we train a BERT transformer model and surpass the state of the art.

Overview

New State-of-the-Art in Preposition Sense Disambiguation

Supervisor:

Institutions:

Project Description

The disambiguation of words is a central part of NLP tasks. Prepositions in particular are ambiguous, which has been a problem in NLP for over a decade and still is. For example, the preposition 'in' can have a temporal (e.g. in 2021) or a spatial (e.g. in Frankfurt) meaning. A strong motivation for learning these meanings is current research that attempts to turn text into artificial scenes. A good understanding of the actual meaning of each preposition is crucial for the machine to create matching scenes.

With the introduction of transformer models in 2017 [1], attention-based models have been pushing boundaries in many NLP disciplines. In particular, BERT, a transformer model by Google pre-trained on more than 3,000M words, obtained state-of-the-art results on many NLP tasks and corpora.

The goal of this project is to use modern transformer models to tackle the problem of preposition sense disambiguation. To this end, we trained a simple BERT model on the SemEval 2007 dataset [2], a central benchmark dataset for this task. To the best of our knowledge, the best proposed model for disambiguating the meanings of prepositions on the SemEval 2007 dataset achieves an accuracy of up to 88% [3]. More recent approaches [4][5] do not surpass this mark either. Our model achieves an accuracy of 90.84%, outperforming the current state of the art.
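To make the reported figure concrete, the sketch below computes accuracy in the plain sense of "share of instances whose predicted sense matches the gold sense". It assumes this standard definition applies here; the function name and the sense labels are made up for illustration and are not taken from the SemEval 2007 data or from this repository.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of instances whose predicted sense equals the gold sense."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up sense labels for four instances of 'in' (illustration only).
gold_senses = ["temporal", "spatial", "spatial", "state"]
predicted_senses = ["temporal", "spatial", "state", "state"]

toy_accuracy = accuracy(gold_senses, predicted_senses)
print(f"{toy_accuracy:.2%}")  # three of four predictions match
```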

How to train

To meet our goals, we cleaned the SemEval 2007 dataset so that it contains only the needed information. The cleaned data is included in the repository and can be found in ./data/training-data.tsv.

Train a BERT model:
First, install the dependencies listed in requirements.txt. Afterwards, you can train the BERT model with:

python3 trainer.py --batch-size 16 --learning-rate 1e-4 --epochs 4 --data-path "./data/training_data.tsv"

The hyper-parameters in the above example are tuned and already set by default. After training, the weights and config are saved to a new folder, ./model_save/. Feel free to skip this training step and use our trained weights directly.
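As a rough illustration of what "set by default" means for the command above, the following sketch builds an argparse parser with the same four flags. It is not the actual contents of trainer.py: the function name build_parser is hypothetical, and using the data path from the example command as a default is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags of the training command."""
    parser = argparse.ArgumentParser(
        description="Train a BERT model for preposition sense disambiguation."
    )
    # Defaults taken from the example command in this README.
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--learning-rate", type=float, default=1e-4)
    parser.add_argument("--epochs", type=int, default=4)
    # Assumption: the data path from the example command is also the default.
    parser.add_argument("--data-path", type=str, default="./data/training_data.tsv")
    return parser


# No flags given: every value falls back to its default.
defaults = build_parser().parse_args([])
# Override a single flag and keep the remaining defaults.
custom = build_parser().parse_args(["--epochs", "2"])

print(defaults.batch_size, defaults.learning_rate, defaults.epochs)
print(custom.epochs, custom.batch_size)
```

With a parser like this, running the script without any flags would be equivalent to the full command shown above.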

Examples

We attach an example tagger, which can be used interactively:

python3 -i tagger.py

Surround the preposition whose meaning you would like to know with <head>...</head> and feed it to the tagger:

>>> tagger.tag("I am <head>in</head> big trouble")
Predicted Meaning: Indicating a state/condition/form, often a mental/emotional one that is being experienced 

>>> tagger.tag("I am speaking <head>in</head> portuguese.")
Predicted Meaning: Indicating the language, medium, or means of encoding (e.g., spoke in German)

>>> tagger.tag("He is swimming <head>with</head> his hands.")
Predicted Meaning: Indicating the means or material used to perform an action or acting as the complement of similar participle adjectives (e.g., crammed with, coated with, covered with)

>>> tagger.tag("She blinked <head>with</head> confusion.")
Predicted Meaning: Because of / due to (the physical/mental presence of) (e.g., boiling with anger, shining with dew)
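The <head>...</head> markup used in these examples can also be produced programmatically, which may help when tagging many sentences. The sketch below is a standalone helper and not part of tagger.py; the function names mark_preposition and split_marked are hypothetical, and matching the preposition as a whole word, case-insensitively, is an assumption about what is convenient.

```python
import re
from typing import Tuple

HEAD_PATTERN = re.compile(r"<head>(.*?)</head>")


def mark_preposition(sentence: str, preposition: str, occurrence: int = 0) -> str:
    """Wrap one whole-word occurrence of `preposition` in <head> tags."""
    pattern = r"\b" + re.escape(preposition) + r"\b"
    matches = list(re.finditer(pattern, sentence, flags=re.IGNORECASE))
    if not 0 <= occurrence < len(matches):
        raise ValueError(f"no occurrence {occurrence} of {preposition!r} in sentence")
    match = matches[occurrence]
    before = sentence[: match.start()]
    after = sentence[match.end():]
    return before + "<head>" + match.group(0) + "</head>" + after


def split_marked(marked: str) -> Tuple[str, str]:
    """Return (preposition, plain sentence) for text with exactly one <head> span."""
    found = HEAD_PATTERN.findall(marked)
    if len(found) != 1:
        raise ValueError("expected exactly one <head>...</head> span")
    return found[0], HEAD_PATTERN.sub(r"\1", marked)


marked = mark_preposition("She blinked with confusion.", "with")
print(marked)
print(split_marked(marked))
```

The resulting string can then be passed to tagger.tag(...) in the same way as the hand-written examples.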

References

[1] Vaswani, Ashish et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems. P. 5998--6008.

[2] Litkowski, Kenneth C. and Hargraves, Orin (2007). SemEval-2007 Task 06: Word-sense disambiguation of prepositions. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). P. 24--29.

[3] Litkowski, Ken (2013). Preposition disambiguation: Still a problem. CL Research, Damascus, MD.

[4] Gonen, Hila and Goldberg, Yoav (2016). Semi supervised preposition-sense disambiguation using multilingual data. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. P. 2718--2729.

[5] Gong, Hongyu and Mu, Jiaqi and Bhat, Suma and Viswanath, Pramod (2018). Preposition Sense Disambiguation and Representation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. P. 1510--1521.

Owner
Dirk Neuhäuser