Journalism AI – Quotes extraction for modular journalism

This repo contains the code for the Guardian and AFP contribution for the JournalismAI Festival 2021.

Further reading can be found in our blog post.

The aim of the project is to extract quotes from news articles using Named Entity Recognition, add coreferencing information and format the results for an exploratory search tool.

The contribution consists of several self-contained pieces of work, namely:

a regular expression pipeline attempting to extract quotes by matching patterns
a rule set to define different types of quotes and guide the quote annotation
custom annotation recipes for the Prodigy software enabling quick and efficient data annotation
a post-processing pipeline for extracting quotes using a trained spaCy model and adding coreferencing information
example data and data schema for displaying the extracted quote information in a search tool
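
To illustrate the first item in the list above, here is a minimal sketch of pattern-based quote extraction. It is not the code in regex_pipeline/: the `Quote` tuple, the `QUOTE_PATTERN` expression and the `extract_quotes` function are hypothetical names made up for this example, and the pattern only assumes double-quoted text optionally followed by "said Name" or "Name said".

```python
import re
from typing import List, NamedTuple, Optional


class Quote(NamedTuple):
    """Hypothetical container for one extracted quote."""
    text: str
    speaker: Optional[str]


# A capitalised name made of one or more capitalised words (assumption).
NAME = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"

# Assumed pattern: a double-quoted span (straight or curly quotes),
# optionally followed by "said <Name>" or "<Name> said".
QUOTE_PATTERN = re.compile(
    r'["“](?P<text>[^"“”]+)["”]'
    r"(?:,?\s+(?:(?:said|says)\s+(?P<after>" + NAME + r")"
    r"|(?P<before>" + NAME + r")\s+(?:said|says)))?"
)


def extract_quotes(article: str) -> List[Quote]:
    """Return every double-quoted span with its speaker, if one is matched."""
    quotes = []
    for match in QUOTE_PATTERN.finditer(article):
        # Drop a trailing comma that sits inside the quotation marks.
        text = match.group("text").strip().rstrip(",")
        speaker = match.group("after") or match.group("before")
        quotes.append(Quote(text, speaker))
    return quotes


if __name__ == "__main__":
    sample = '"We are ready," said Jane Doe. Later, "It works", Smith said.'
    for quote in extract_quotes(sample):
        print(quote)
```

A pattern like this misses indirect speech, pronoun speakers ("she said") and quotes split across sentences, which is the kind of gap the trained model and the coreferencing step are meant to address.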

Repo structure

Each folder in this repo reflects one of the pieces of work mentioned above.

regex_pipeline/ – code to run the regular expression-based quote extraction
annotation_rules/ – document with rules and definitions to guide the quote annotation step
annotation_scripts/ – custom annotation scripts for Prodigy
coreference/ – proof of concept for a rules-based coreferencing tool
schema/ – data output schema and example data

Each folder contains a separate README file with instructions to set up and run each piece of work.

Owner

Journalism AI collab 2021