PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Last update: Dec 02, 2021

PASTRIE

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.

Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

English
French
German
Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.
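
As a rough illustration of the SceneRole↝Function notation, the sketch below formats and parses construal labels. The helper names are hypothetical, and collapsing a construal to a single label when scene role and function coincide is an assumed convention rather than something stated above.

```python
from typing import Optional, Tuple

ARROW = "↝"  # separator used in the SceneRole↝Function notation


def format_construal(scene_role: str, function: Optional[str] = None) -> str:
    """Render a construal; a single label is used when role and function match (assumed convention)."""
    if function is None or function == scene_role:
        return scene_role
    return f"{scene_role}{ARROW}{function}"


def parse_construal(label: str) -> Tuple[str, str]:
    """Split a construal label back into (scene_role, function)."""
    role, sep, func = label.partition(ARROW)
    return (role, func if sep else role)


print(format_construal("SocialRel", "Gestalt"))
print(parse_construal("Locus"))
```

The labels SocialRel, Gestalt, and Locus are taken from snippets quoted in the issue threads further down this page.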

Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

.conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
.json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
.govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
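
For orientation, here is a minimal sketch of tallying construals from the .json format. It assumes the STREUSLE-style layout: a list of sentence objects whose "swes" and "smwes" entries map IDs to records with "lexlemma", "lexcat", "ss", "ss2", and "toknums". The record fields match the dictionaries printed in the error messages quoted below, but the top-level key names, the file path, and the sample sentence are assumptions made for illustration.

```python
import json
from collections import Counter


def load_sentences(path):
    """Load a .json release file (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_construals(sentences):
    """Count (scene role, function) pairs over single-word and strong multiword expressions."""
    counts = Counter()
    for sent in sentences:
        for group in ("swes", "smwes"):  # assumed key names
            for expr in sent.get(group, {}).values():
                if expr.get("ss") is not None:
                    counts[(expr["ss"], expr.get("ss2"))] += 1
    return counts


# Hand-written miniature sentence, not taken from the corpus.
example = [{
    "sent_id": "example-01",
    "swes": {
        "1": {"lexlemma": "mat", "lexcat": "N", "ss": None, "ss2": None, "toknums": [1]},
        "2": {"lexlemma": "over", "lexcat": "P", "ss": "p.Locus", "ss2": "p.Locus", "toknums": [2]},
    },
    "smwes": {},
}]

print(count_construals(example))
```

With a real file, `count_construals(load_sentences(path))` would produce the same kind of tally over the whole corpus.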

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

- Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
  - PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens, where a single orthographic word contains multiple syntactic words).
  - PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
  - In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
  - PASTRIE lacks enhanced dependencies.
- Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
  - Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
- Noun and verb expressions in PASTRIE do not have supersense labels.

Comments

Misc. annotation errors and/or conversion script bugs

There are some annotations which I'm fairly sure are incorrect and are choking the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also indicate a bug in conllulex2json.py.

vs mistagged as a noun--should be prep

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

ditto

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:

13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

Relevant span of code:

            if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                ('ADP','P'),('ADV','P'),('SCONJ','P'),
                ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                ('PART','POSS')}:
                # most often, the single-word lexcat should match its upos
                # check a list of exceptions
                mismatchOK = False
                if xpos=='TO' and lc.startswith('INF'):
                    mismatchOK = True
                elif (xpos=='TO')!=lc.startswith('INF'):
                    assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                    mismatchOK = True
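
To see why token 23 trips the assertion, the following sketch re-expresses just the quoted branch as a standalone function. It is a simplified paraphrase of the excerpt for illustration, not the script itself; the function name and return strings are made up, and any checks that follow the excerpt in the real file are not modeled.

```python
# (upos, lexcat) pairs that the excerpt accepts without further checks
ALLOWED_PAIRS = {
    ("NOUN", "N"), ("PROPN", "N"), ("VERB", "V"),
    ("ADP", "P"), ("ADV", "P"), ("SCONJ", "P"),
    ("ADP", "DISC"), ("ADV", "DISC"), ("SCONJ", "DISC"),
    ("PART", "POSS"),
}


def explain_inf_check(upos, xpos, lexcat, lexlemma):
    """Describe which branch of the quoted excerpt a single-word expression reaches."""
    if upos == lexcat or (upos, lexcat) in ALLOWED_PAIRS:
        return "ok: upos and lexcat are compatible"
    if xpos == "TO" and lexcat.startswith("INF"):
        return "ok: infinitival 'to' tagged TO"
    if (xpos == "TO") != lexcat.startswith("INF"):
        # the excerpt only tolerates this mismatch for infinitival 'for'
        if upos in ("SCONJ", "ADP") and lexlemma == "for":
            return "ok: infinitival 'for'"
        return "assertion fails: INF/TO mismatch on a token other than 'for'"
    return "falls through to checks not shown in the excerpt"


# Token 23 as annotated: ADP/IN with lexcat INF
print(explain_inf_check("ADP", "IN", "INF", "to"))
# The same token if its xpos were TO
print(explain_inf_check("PART", "TO", "INF", "to"))
```

Under this reading, the failure comes from the IN tag combined with the INF lexcat: retagging the xpos as TO would take the first exception branch instead.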

Originator as function:

(in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

lexcat DISC with ADJ:

AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

"her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:

1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

"NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

opened by lgessler 6

Prepositional supersense annotations on non-preposition targets
Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_

opened by lgessler 5

Prepositions unannotated for supersense

Token 6:

# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
# text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_

I assumed that all preps were supposed to be annotated, but perhaps not?
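
A quick way to list such cases is to scan the .conllulex rows for ADP tokens whose supersense column is empty. The sketch below assumes the 19 columns are named as in STREUSLE's CoNLL-U-Lex documentation and that "_" marks an empty field; the function name is hypothetical. Tokens inside a strong MWE are skipped because they would not carry their own supersense.

```python
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
    "SMWE", "LEXCAT", "LEXLEMMA", "SS", "SS2", "WMWE", "WCAT", "WLEMMA", "LEXTAG",
]  # assumed column names for the 19-column format


def unannotated_adpositions(conllulex_lines):
    """Return (ID, FORM) for ADP tokens with no supersense and no strong-MWE membership."""
    hits = []
    for line in conllulex_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # not a well-formed token row
        row = dict(zip(COLUMNS, fields))
        if row["UPOS"] == "ADP" and row["SS"] == "_" and row["SMWE"] == "_":
            hits.append((row["ID"], row["FORM"]))
    return hits


# Rows 5, 6, and 17 of the sentence quoted above
sample = [
    "# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02",
    "\t".join(["5", "hang", "hang", "NOUN", "NN", "_", "3", "obj"] + ["_"] * 11),
    "\t".join(["6", "of", "of", "ADP", "IN", "_", "7", "case"] + ["_"] * 11),
    "\t".join(["17", "your", "you", "PRON", "PRP$", "_", "18", "nmod:poss"]
              + ["_"] * 5 + ["Possessor", "Possessor"] + ["_"] * 4),
]

print(unannotated_adpositions(sample))
```

Whether every hit is a genuine omission is a separate question; some ADP tokens (particles, for example) may legitimately lack a supersense.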

opened by lgessler 3

Apostrophes removed in preprocessing?

Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?
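
One rough way to gauge the scale of the problem is to count tokens that look like clitics but carry no apostrophe. The sketch below is a heuristic under stated assumptions: the list of bare forms and the restriction to AUX and PART tags are guesses, and the function name is hypothetical.

```python
# Forms that would normally be written 's, 're, 've, 'll, 'd, 'm, n't (assumed list)
BARE_CLITICS = {"s", "re", "ve", "ll", "d", "m", "nt"}


def bare_clitic_tokens(tokens):
    """tokens: iterable of (form, upos); returns 1-based positions of suspicious tokens."""
    return [
        (i, form)
        for i, (form, upos) in enumerate(tokens, start=1)
        if form.lower() in BARE_CLITICS and upos in {"AUX", "PART"}
    ]


# Fragment modeled on "it s pretty fast ( and does n't eat ..." from another issue on this page
fragment = [("it", "PRON"), ("s", "AUX"), ("pretty", "ADV"), ("fast", "ADJ"),
            ("does", "AUX"), ("n't", "PART")]

print(bare_clitic_tokens(fragment))
```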

opened by nschneid 2

Dataset requested

Hi all,

I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English".

Thanks for your reply.

opened by fj-morales 2

SNACS supersense tags should start with "p."

For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.
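
The prefixing rule could be sketched as follows. The function name is hypothetical, and treating "_" and None as empty values is an assumption. The sketch only adds the prefix; it leaves special labels untouched and does not move the backtick labels into the LEXCAT column as STREUSLE does.

```python
def normalize_supersense(label):
    """Add the 'p.' prefix to a SNACS label, leaving special and empty labels as they are."""
    if label is None or label == "_":
        return label  # empty field
    if label.startswith("`") or label == "??":
        return label  # special annotation labels are not supersenses
    if label.startswith("p."):
        return label  # already prefixed
    return "p." + label


for raw in ["Locus", "p.Theme", "`i", "??", "_"]:
    print(raw, "->", normalize_supersense(raw))
```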

opened by nschneid 0

Questionable adpositional MWEs

in_male_term — from "in male terms"; should be in_term (at most)

in_the_first_place

in_my_hand — from "in my hands"; should be in_hand (at most)

for_quite_some_time — just Duration for, weak MWE?

at_all_time — from what should have been "at all times". OK?

on_a_smaller_scale — omit adjective?

withouth — typo

see_as — "seeing as" (deverbal MWE acting like a preposition)

opened by nschneid 0

Some undersegmentation of sentences

Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.
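
As a starting point, the search for consecutive spaces might look like the sketch below. It only locates candidate sentences and does not attempt the splitting or offset updates. It assumes the "# sent_id = " and "# text = " comment lines shown in the snippets on this page; the function name and the sample lines are made up.

```python
import re

MULTISPACE = re.compile(r" {2,}")  # two or more consecutive spaces


def multispace_sentences(conllulex_lines):
    """Yield (sent_id, text) for sentences whose raw text contains consecutive spaces."""
    sent_id = None
    for line in conllulex_lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        elif line.startswith("# text = "):
            text = line[len("# text = "):]
            if MULTISPACE.search(text):
                yield sent_id, text


# Invented header lines for illustration, not taken from the corpus
lines = [
    "# sent_id = example-01",
    "# text = A single sentence .",
    "# sent_id = example-02",
    "# text = > quoted remark  my reply",
]

for sid, text in multispace_sentences(lines):
    print(sid, repr(text))
```

Each hit would still need manual review, since emoticons and similar appendages also produce consecutive spaces.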

opened by nschneid 0

Releases(v2.0.1)

v2.0.1(Nov 21, 2021)
Fixes 3 erroneous sentence IDs (along with beefed up sentence ID validation in scripts). (#16)

v2.0(Oct 22, 2021)
Switch to full .conllulex format following STREUSLE

add lexcats (#3), morphological features, newdoc directives

Scripts for validation and format conversion

Clean up various annotation issues, including:

restoring apostrophes and fixing other conversion problems (#6, #9)

include pretokenized raw text (#12)

v1.0.1(Dec 14, 2020)
Added .json file format

Switched lemmatization and POS tagging from StanfordNLP 0.2.0 to Stanza 1.1.1

Corrected rare encoding issue from v1.0

v1.0(Dec 12, 2020)


Owner

NERT @ Georgetown
