This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Overview

Word-Level Coreference Resolution

This is a repository with the code to reproduce the experiments described in the paper of the same name, which was accepted to EMNLP 2021. The paper is available here.

Table of contents

  1. Preparation
  2. Training
  3. Evaluation

Preparation

The following instruction has been tested with Python 3.7 on an Ubuntu 20.04 machine.

You will need:

  • OntoNotes 5.0 corpus (download here, registration needed)
  • Python 2.7 to run conll-2012 scripts
  • Java runtime to run Stanford Parser
  • Python 3.7+ to run the model
  • Perl to run conll-2012 evaluation scripts
  • CUDA-enabled machine (48 GB to train, 4 GB to evaluate)
  1. Extract OntoNotes 5.0 arhive. In case it's in the repo's root directory:

     tar -xzvf ontonotes-release-5.0_LDC2013T19.tgz
    
  2. Switch to Python 2.7 environment (where python would run 2.7 version). This is necessary for conll scripts to run correctly. To do it with with conda:

     conda create -y --name py27 python=2.7 && conda activate py27
    
  3. Run the conll data preparation scripts (~30min):

     sh get_conll_data.sh ontonotes-release-5.0 data
    
  4. Download conll scorers and Stanford Parser:

     sh get_third_party.sh
    
  5. Prepare your environment. To do it with conda:

     conda create -y --name wl-coref python=3.7 openjdk perl
     conda activate wl-coref
     python -m pip install -r requirements.txt
    
  6. Build the corpus in jsonlines format (~20 min):

     python convert_to_jsonlines.py data/conll-2012/ --out-dir data
     python convert_to_heads.py
    

You're all set!

Training

If you have completed all the steps in the previous section, then just run:

python run.py train roberta

Use -h flag for more parameters and CUDA_VISIBLE_DEVICES environment variable to limit the cuda devices visible to the script. Refer to config.toml to modify existing model configurations or create your own.

Evaluation

Make sure that you have successfully completed all steps of the Preparation section.

  1. Download and save the pretrained model to the data directory.

     https://www.dropbox.com/s/vf7zadyksgj40zu/roberta_%28e20_2021.05.02_01.16%29_release.pt?dl=0
    
  2. Generate the conll-formatted output:

     python run.py eval roberta --data-split test
    
  3. Run the conll-2012 scripts to obtain the metrics:

     python calculate_conll.py roberta test 20
    
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Meta Research 6.4k Jan 08, 2023
BiQE: Code and dataset for the BiQE paper

BiQE: Bidirectional Query Embedding This repository includes code for BiQE and the datasets introduced in Answering Complex Queries in Knowledge Graph

Bhushan Kotnis 1 Oct 20, 2021
Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

This repository contains the resources in our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", which will be published i

Yiming Cui 463 Dec 30, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
ACL'22: Structured Pruning Learns Compact and Accurate Models

☕ CoFiPruning: Structured Pruning Learns Compact and Accurate Models This repository contains the code and pruned models for our ACL'22 paper Structur

Princeton Natural Language Processing 130 Jan 04, 2023
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
Fine-tuning scripts for evaluating transformer-based models on KLEJ benchmark.

The KLEJ Benchmark Baselines The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language und

Allegro Tech 17 Oct 18, 2022
Python wrapper for Stanford CoreNLP tools v3.4.1

Python interface to Stanford Core NLP tools v3.4.1 This is a Python wrapper for Stanford University's NLP group's Java-based CoreNLP tools. It can eit

Dustin Smith 610 Sep 07, 2022
official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Plugin 3 Jan 12, 2022
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

CrossSum This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summ

BUET CSE NLP Group 29 Nov 19, 2022
LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating

LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating (Dataset) The dataset is from Amazon Review Data (2018)

Immanuvel Prathap S 1 Jan 16, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 09, 2023
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
Let Xiao Ai speakers control third-party devices

A stupid way to extend miot/xiaoai. Demo for Panasonic Bath Bully FV-RB20VL1 逆向 Panasonic Smart China,获得控制浴霸的请求信息(HTTP 请求),详见 apps/panasonic.py; 2. 通过

bin 14 Jul 07, 2022
DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

DeepAmandine This is an artificial intelligence based on GPT-3 that you can chat with, it is very nice and makes a lot of jokes. We wish you a good ex

BuyWithCrypto 3 Apr 19, 2022
基于Transformer的单模型、多尺度的VAE模型

UniVAE 基于Transformer的单模型、多尺度的VAE模型 介绍 https://kexue.fm/archives/8475 依赖 需要大于0.10.6版本的bert4keras(当前还没有推到pypi上,可以直接从GitHub上clone最新版)。 引用 @misc{univae,

苏剑林(Jianlin Su) 49 Aug 24, 2022
This is my reading list for my PhD in AI, NLP, Deep Learning and more.

This is my reading list for my PhD in AI, NLP, Deep Learning and more.

Zhong Peixiang 156 Dec 21, 2022
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022
Fastseq 基于ONNXRUNTIME的文本生成加速框架

Fastseq 基于ONNXRUNTIME的文本生成加速框架

Jun Gao 9 Nov 09, 2021