Python wrapper for Stanford CoreNLP tools v3.4.1

Overview

Python interface to Stanford Core NLP tools v3.4.1

This is a Python wrapper for Stanford University's NLP group's Java-based CoreNLP tools. It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.

  • Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, named-entity recognition, and coreference resolution.
  • Runs a JSON-RPC server that wraps the Java server and outputs JSON.
  • Outputs parse trees which can be used by nltk.

It depends on pexpect, and it bundles and uses code from jsonrpc and python-progressbar.

It runs the Stanford CoreNLP jar in a separate process, communicates with the Java process through its command-line interface, and makes assumptions about the parser's output in order to convert it into a Python dict and transfer it as JSON. The wrapper will break if that output changes significantly, but it has been tested on CoreNLP tools version 3.4.1, released 2014-08-27.

Download and Usage

To use this program, download and unpack the compressed file containing Stanford's CoreNLP package. By default, corenlp.py looks for the Stanford CoreNLP folder as a subdirectory of the directory the script is run from. In other words:

sudo pip install pexpect unidecode
git clone https://github.com/dasmith/stanford-corenlp-python.git
cd stanford-corenlp-python
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip
unzip stanford-corenlp-full-2014-08-27.zip

Then launch the server:

python corenlp.py

Optionally, you can specify a host or port:

python corenlp.py -H 0.0.0.0 -p 3456

That will run a public JSON-RPC server on port 3456.
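For readers who want to see what travels over the wire, here is a minimal sketch of a JSON-RPC 2.0 call to the server's parse method, using only the standard library. The helper names build_parse_request and decode_parse_response are hypothetical and are not part of this project; the bundled jsonrpc module normally builds and sends this envelope for you. The second decoding step is an assumption based on the example client, which passes the return value of server.parse() through loads().

```python
import json


def build_parse_request(text, request_id=1):
    """Serialize a JSON-RPC 2.0 request for the `parse` method (hypothetical helper)."""
    return json.dumps({
        "jsonrpc": "2.0",      # protocol version required by the JSON-RPC 2.0 spec
        "method": "parse",     # the method the example client calls
        "params": [text],      # positional parameters
        "id": request_id,      # lets the caller match the response to the request
    })


def decode_parse_response(raw_response):
    """Unwrap a JSON-RPC 2.0 response and decode the parse result.

    Assumption: the `result` member is itself a JSON-encoded string,
    so it is decoded a second time.
    """
    envelope = json.loads(raw_response)
    if "error" in envelope:
        raise RuntimeError(envelope["error"])
    return json.loads(envelope["result"])
```

Because the request is plain JSON, a client in any language that can produce this envelope should be able to talk to the server, provided it uses the same transport as the bundled jsonrpc module.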

Assuming you are running on port 8080, the code in client.py shows an example parse:

import jsonrpc
from simplejson import loads
server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
                             jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))

result = loads(server.parse("Hello world!  It is so beautiful."))
print "Result", result

That returns a dictionary with the keys sentences and coref. The value of sentences is a list with one dictionary per sentence. Each of those contains parsetree (the phrase-structure parse), text, tuples (the dependencies), and words (each token paired with its lemma, part of speech, character offsets, and named-entity tag):

{u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
                 u'text': u'Hello world!',
                 u'tuples': [[u'dep', u'world', u'Hello'],
                             [u'root', u'ROOT', u'world']],
                 u'words': [[u'Hello',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'5',
                              u'Lemma': u'hello',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'UH'}],
                            [u'world',
                             {u'CharacterOffsetBegin': u'6',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'world',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'NN'}],
                            [u'!',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'12',
                              u'Lemma': u'!',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]},
                {u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
                 u'text': u'It is so beautiful.',
                 u'tuples': [[u'nsubj', u'beautiful', u'It'],
                             [u'cop', u'beautiful', u'is'],
                             [u'advmod', u'beautiful', u'so'],
                             [u'root', u'ROOT', u'beautiful']],
                 u'words': [[u'It',
                             {u'CharacterOffsetBegin': u'14',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'it',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'is',
                             {u'CharacterOffsetBegin': u'17',
                              u'CharacterOffsetEnd': u'19',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'so',
                             {u'CharacterOffsetBegin': u'20',
                              u'CharacterOffsetEnd': u'22',
                              u'Lemma': u'so',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'RB'}],
                            [u'beautiful',
                             {u'CharacterOffsetBegin': u'23',
                              u'CharacterOffsetEnd': u'32',
                              u'Lemma': u'beautiful',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'JJ'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'32',
                              u'CharacterOffsetEnd': u'33',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}],
u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}
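The structure above can be walked with ordinary dict and list operations. The following is a minimal sketch, not part of the library: the helper names tagged_tokens, char_span, and dependents_of are made up for illustration, and the sample data is a trimmed copy of the first sentence shown above. It assumes each tuples entry is ordered as relation, governor, dependent, which matches the example output.

```python
# Trimmed copy of the first sentence from the response shown above.
result = {
    "sentences": [
        {
            "text": "Hello world!",
            "tuples": [["dep", "world", "Hello"],
                       ["root", "ROOT", "world"]],
            "words": [
                ["Hello", {"CharacterOffsetBegin": "0", "CharacterOffsetEnd": "5",
                           "Lemma": "hello", "NamedEntityTag": "O",
                           "PartOfSpeech": "UH"}],
                ["world", {"CharacterOffsetBegin": "6", "CharacterOffsetEnd": "11",
                           "Lemma": "world", "NamedEntityTag": "O",
                           "PartOfSpeech": "NN"}],
                ["!", {"CharacterOffsetBegin": "11", "CharacterOffsetEnd": "12",
                       "Lemma": "!", "NamedEntityTag": "O",
                       "PartOfSpeech": "."}],
            ],
        }
    ]
}


def tagged_tokens(sentence):
    """Return (token, part of speech, lemma) triples for one sentence."""
    return [(token, attrs["PartOfSpeech"], attrs["Lemma"])
            for token, attrs in sentence["words"]]


def char_span(attrs):
    """Character offsets arrive as strings, so convert them to ints."""
    return int(attrs["CharacterOffsetBegin"]), int(attrs["CharacterOffsetEnd"])


def dependents_of(sentence, head):
    """Return (relation, dependent) pairs whose governor is `head`.

    Assumes each entry in `tuples` is [relation, governor, dependent].
    """
    return [(rel, dep) for rel, gov, dep in sentence["tuples"] if gov == head]


first = result["sentences"][0]
print(tagged_tokens(first))
print(dependents_of(first, "world"))
```

Note that governors and dependents are given as word strings rather than token indices, so a lookup by word can be ambiguous when a word occurs more than once in a sentence.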

To use it in a regular script (useful for debugging), load the module instead:

from corenlp import *
corenlp = StanfordCoreNLP()  # wait a few minutes...
corenlp.parse("Parse this sentence.")

The server, StanfordCoreNLP(), takes an optional argument corenlp_path which specifies the path to the jar files. The default value is StanfordCoreNLP(corenlp_path="./stanford-corenlp-full-2014-08-27/").

Coreference Resolution

The library supports coreference resolution, which means pronouns can be "dereferenced." If an entry in the coref list is [u'Hello world', 0, 1, 0, 2], the numbers mean:

  • 0 = The reference appears in the 0th sentence (i.e. "Hello world!")
  • 1 = The 2nd token, "world", is the headword of the reference
  • 0 = 'Hello world' begins at the 0th token in the sentence
  • 2 = 'Hello world' ends before the 2nd token in the sentence.
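The index layout described above can be turned back into tokens with a few lines of code. This is a sketch under stated assumptions rather than library code: resolve_mention and coref_pairs are hypothetical helpers, the sample data is reduced to the token lists from the earlier example, and the nesting of coref (a list of chains, each a list of two-element pairs of mentions) is inferred from that single example.

```python
# Minimal stand-in for a parse result: only the fields this sketch reads.
result = {
    "sentences": [
        {"words": [["Hello", {}], ["world", {}], ["!", {}]]},
        {"words": [["It", {}], ["is", {}], ["so", {}], ["beautiful", {}], [".", {}]]},
    ],
    "coref": [[[["It", 1, 0, 0, 1], ["Hello world", 0, 1, 0, 2]]]],
}


def resolve_mention(sentences, mention):
    """Map a [text, sentence, head, start, end] entry back onto its tokens."""
    text, sent_idx, head_idx, start, end = mention
    tokens = [word for word, _attrs in sentences[sent_idx]["words"]]
    return {
        "text": text,
        "sentence": sent_idx,
        "tokens": tokens[start:end],   # `end` is exclusive, as described above
        "head": tokens[head_idx],      # head index is relative to the sentence
    }


def coref_pairs(result):
    """Yield resolved pairs of mentions (assumed nesting: chains -> pairs)."""
    for chain in result.get("coref", []):
        for first, second in chain:
            yield (resolve_mention(result["sentences"], first),
                   resolve_mention(result["sentences"], second))


for first, second in coref_pairs(result):
    print(first["text"], "<->", second["text"], "(head: %s)" % second["head"])
```

Because the end index is exclusive, tokens[start:end] maps directly onto a Python slice with no off-by-one adjustment.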

Questions

Stanford CoreNLP tools require a large amount of free memory. Java 5+ uses about 50% more RAM on 64-bit machines than on 32-bit machines. Users of 32-bit machines can lower the memory requirements by changing -Xmx3g to -Xmx2g or even less. If pexpect times out while loading models, check that you have enough memory and that you can run the server on its own without your kernel killing the Java process:

java -cp stanford-corenlp-2014-08-27.jar:stanford-corenlp-3.4.1-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
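When experimenting with different heap sizes, it can help to assemble that command programmatically. The sketch below is not how corenlp.py builds its command; build_java_command is a hypothetical helper, and the jar names and class name are copied verbatim from the command above. It only constructs the argument list and does not start Java.

```python
import os

# Jar names copied from the command above; adjust if your download differs.
JARS = [
    "stanford-corenlp-2014-08-27.jar",
    "stanford-corenlp-3.4.1-models.jar",
    "xom.jar",
    "joda-time.jar",
]


def build_java_command(corenlp_path, max_heap="3g", props="default.properties"):
    """Return the java invocation as an argument list (hypothetical helper).

    `max_heap` is appended to -Xmx, e.g. "2g" on machines with less memory.
    """
    classpath = os.pathsep.join(os.path.join(corenlp_path, jar) for jar in JARS)
    return [
        "java",
        "-cp", classpath,
        "-Xmx" + max_heap,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", props,
    ]


command = build_java_command("./stanford-corenlp-full-2014-08-27", max_heap="2g")
print(" ".join(command))
```

The resulting list can be passed to subprocess.call() to check whether the JVM starts and stays alive with a given -Xmx value, independently of pexpect.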

You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available on my webpage).

License & Contributors

This is free and open source software and has benefited from the contributions and feedback of others. Like Stanford's CoreNLP tools, it is covered under the GNU General Public License v2 or later, which in short means that modifications to this program must maintain the same free and open source distribution policy.

I gratefully welcome bug fixes and new features. If you have forked this repository, please submit a pull request so others can benefit from your contributions. This project has already benefited from contributions from members of the open source community. Thank you!

Related Projects

Maintainers of the CoreNLP library at Stanford keep an updated list of wrappers and extensions. See Brendan O'Connor's stanford_corenlp_pywrapper for a different approach more suited to batch processing.
