Stanford CoreNLP provides a set of natural language analysis tools written in Java

Overview

Stanford CoreNLP

Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize and interpret dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases or word dependencies, and indicate which noun phrases refer to the same entities. It was originally developed for English, but now also provides varying levels of support for (Modern Standard) Arabic, (mainland) Chinese, French, German, and Spanish.

Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP is a set of stable and well-tested natural language processing tools, widely used by various groups in academia, industry, and government. The tools variously use rule-based, probabilistic machine learning, and deep learning components.

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.

Build Instructions

Several times a year we distribute a new version of the software, which corresponds to a stable commit.

During the time between releases, one can always use the latest, under-development version of our code.

Here are some helpful instructions to use the latest code:

Provided build

Sometimes we will provide updated jars here which have the latest version of the code.

At present, the most recent released jar matches the current released version of the code, though you can always build the very latest from the GitHub HEAD yourself.

Build with Ant

  1. Make sure you have Ant installed, details here: http://ant.apache.org/
  2. Compile the code with this command: cd CoreNLP ; ant
  3. Then run this command to build a jar with the latest version of the code: cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu
  4. This will create a new jar called stanford-corenlp.jar in the CoreNLP folder which contains the latest code
  5. The dependencies that work with the latest code are in CoreNLP/lib and CoreNLP/liblocal, so make sure to include those in your CLASSPATH.
  6. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-extra-models, and english-kbp-models jars and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
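Putting the steps above together, a CLASSPATH for the Ant build might look like this (the paths are illustrative; substitute the actual locations of your built jar and downloaded model jars):

```shell
# Built jar, bundled dependencies, and the downloaded model jars (illustrative paths)
export CLASSPATH="CoreNLP/stanford-corenlp.jar:CoreNLP/lib/*:CoreNLP/liblocal/*:/path/to/models/*"
echo "$CLASSPATH"
```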

Build with Maven

  1. Make sure you have Maven installed, details here: https://maven.apache.org/
  2. If you run this command in the CoreNLP directory: mvn package, it should run the tests and build this jar file: CoreNLP/target/stanford-corenlp-4.4.0.jar
  3. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-extra-models, and english-kbp-models and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
  4. If you want to use Stanford CoreNLP as part of a Maven project you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar. For other languages just change the language name in the command. To install stanford-corenlp-models-current.jar you will need to set -Dclassifier=models. Here is the sample command for Spanish: mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=4.4.0 -Dclassifier=models-spanish -Dpackaging=jar
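If you consume released versions from Maven Central instead of installing jars by hand, the models jar is the same artifact with the models classifier. A minimal pom.xml fragment, using the 4.4.0 version and the edu.stanford.nlp coordinates shown above, might look like:

```xml
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.4.0</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.4.0</version>
  <classifier>models</classifier>
</dependency>
```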

Models

The models jars that correspond to the latest code can be found in the table below.

Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar. These require downloading the English (extra) and English (kbp) jars. Resources for other languages require usage of the corresponding models jar.

The best way to get the models is to use git-lfs and clone them from Hugging Face Hub.

For instance, to get the French models, run the following commands:

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/stanfordnlp/corenlp-french

The jars can also be downloaded directly from the links below or from the Hugging Face Hub pages.

Language          Model Jar            Last Updated
Arabic            download (HF Hub)    4.4.0
Chinese           download (HF Hub)    4.4.0
English (extra)   download (HF Hub)    4.4.0
English (KBP)     download (HF Hub)    4.4.0
French            download (HF Hub)    4.4.0
German            download (HF Hub)    4.4.0
Hungarian         download (HF Hub)    4.4.0
Italian           download (HF Hub)    4.4.0
Spanish           download (HF Hub)    4.4.0

Thank you to Hugging Face for helping with our hosting!

Useful resources

You can find releases of Stanford CoreNLP on Maven Central.

You can find more explanation and documentation on the Stanford CoreNLP homepage.

For information about making contributions to Stanford CoreNLP, see the file CONTRIBUTING.md.

Questions about CoreNLP can either be posted on StackOverflow with the tag stanford-nlp, or on the mailing lists.

Comments
  • An Issue in importing StanfordCoreNLP library in an Android Studio project

    I am developing an Android application (I am a beginner). I want to use the Stanford CoreNLP 3.8.0 library in my app to extract the part of speech, the lemma, the parse, and so on from the user's sentences. I have tried a simple Java example in NetBeans by following this YouTube tutorial https://www.youtube.com/watch?v=9IZsBmHpK3Y, and it works perfectly. The jar files that I imported into the NetBeans project are stanford-corenlp-3.8.0.jar and stanford-corenlp-3.8.0-models.jar.

    And this is the java source code:

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    
    import java.util.List;
    import java.util.Properties;
    
    public class CoreNlpExample {
    
        public static void main(String[] args) {
    
            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            // read some text in the text variable
            String text = "What is the Weather in Bangalore right now?";
    
            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);
    
            // run all Annotators on this text
            pipeline.annotate(document);
    
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    
            for (CoreMap sentence : sentences) {
                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
    
                    System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
                }
            }
        }
    }
    

    I wanted to try the same code in Android Studio and display the result in a TextView, but I am facing a problem with adding these external libraries in my Android Studio 3.0.1 project.

    I have read on some websites that I need to reduce the size of the jar files; I did that and made sure the reduced jars still work fine in the NetBeans project. But I am still facing problems in Android Studio, and this is the error I am getting:

    java.lang.VerifyError: Rejecting class edu.stanford.nlp.pipeline.StanfordCoreNLP that attempts to sub-type erroneous class edu.stanford.nlp.pipeline.AnnotationPipeline (declaration of 'edu.stanford.nlp.pipeline.StanfordCoreNLP' appears in /data/app/com.example.fatimah.nlpapplication-bhlUJOCUwLhSbkWE7NBERA==/split_lib_dependencies_apk.apk)

    Any suggestions on how I can fix this and import Stanford library successfully?

    opened by ftoom235 52
  • Use JaFaMa for faster math, and optimize critical code paths

    These changes substantially cut down the processing time; by several hours when I process all of Wikipedia. Feel free to benchmark on your own data.

    The first commit uses JaFaMa instead of java.lang.Math; it is 2-3x faster for exp and log: http://blog.element84.com/improving-java-math-perf-with-jafama.html In some places I switched back to log1p, because the runtimes of log and log1p in JaFaMa are similar, and log1p offers better precision than log(1+x) for small values of x.
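    The log1p point is easy to check with plain java.lang.Math (no JaFaMa required): for tiny x, 1.0 + x rounds to exactly 1.0, so log(1 + x) collapses to zero while log1p(x) keeps the leading term of the series.

```java
public class Log1pDemo {
    public static void main(String[] args) {
        double x = 1e-20;
        // 1.0 + 1e-20 rounds to exactly 1.0 in double precision
        System.out.println(Math.log(1.0 + x)); // 0.0 -- the small value is lost
        // log1p computes log(1 + x) without forming 1 + x first
        System.out.println(Math.log1p(x));     // ~1e-20 -- accurate
    }
}
```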

    The other patches optimize the crucial code around the Viterbi algorithm:

    • HotSpot optimizes better if large functions with multiple loops are split into multiple methods (as they can be recompiled independently).
    • It pays off to save repeated nested array lookups (e.g. array[i][j] in a loop over j; move array_i = array[i] outside of the loop and use array_i[j] inside).
    • I also added a cache to avoid recomputing the open tags set in TTags.
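    The nested-lookup point can be sketched in isolation (hypothetical method names, not CoreNLP code): hoisting the row reference out of the inner loop removes a repeated array dereference without changing the result.

```java
public class RowHoist {
    // Baseline: a[i] is re-dereferenced on every inner iteration
    static double sumNaive(double[][] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                s += a[i][j];
            }
        }
        return s;
    }

    // Optimized: hoist the row lookup once per outer iteration
    static double sumHoisted(double[][] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double[] row = a[i];
            for (int j = 0; j < row.length; j++) {
                s += row[j];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        System.out.println(sumNaive(a));   // 10.0
        System.out.println(sumHoisted(a)); // 10.0
    }
}
```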

    All of these may appear to be trivial changes, but once you benchmark you will see how much this improves the run time.

    Processing the first 20000 articles with tokenize,ssplit,pos, doing some further processing such as my own lemmatization based on hunspell, and then loading them into a Lucene index took 08:51 minutes with the CoreNLP master branch, and only 04:38 minutes with my patches (sloppy benchmark only). I consider this a substantial speedup, because Wikipedia is 5.3 million articles, and it still needed 19 hours to build the full-text index, but it used to take almost two days...

    opened by kno10 39
  • Could the project switch to using log4j for logs?

    I see a lot of logs printed to System.out or System.err. Would it be possible to use a library like log4j http://logging.apache.org/log4j/2.x/ and use log.error, log.warning, log.info, log.debug instead? That would make it easier for users of the StanfordCoreNLP to manage which logs should be printed by choosing the log level of the project.

    enhancement 
    opened by Asimov4 33
  • Quote Annotation - AnnotationException StringIndexOutOfBoundsException

    Hello,

    I had a situation with text that had this: ""=

    It seems to throw an error when I try running the pipeline with quote annotation on this small fragment. Just wanted to verify that it was an issue.

    Thank you.

    opened by allenkim 29
  • Parsing fails on AssertionError when using OpenIE (v3.9.2)

    Happens with the following sentence, under version 3.9.2, only when adding openIE annotator:

    It was a long and stern face, but with eyes that twinkled in a kindly way.

    stack trace:

    java.lang.AssertionError
        at edu.stanford.nlp.naturalli.Util.cleanTree(Util.java:324)
        at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:463)
        at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$2(OpenIE.java:547)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
        at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:547)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:637)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:629)

    to replicate:

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            String text = "It was a long and stern face, but with eyes that twinkled in a kindly way.";
    
            CoreDocument document = new CoreDocument(text);
            pipeline.annotate(document);
    

    It works fine if openie is disabled, with other sentences, or when using https://corenlp.run/, so it looks like it's fixed in later versions, but I did not verify that locally as I can't upgrade at the moment anyway.

    advice much appreciated

    opened by manzurola 29
  • Stanford CoreNLP server not responding

    I have been trying to use the CoreNLP server through various Python packages, including Stanza. I always run into the same problem: I never hear back from the server.

    I downloaded a copy of CoreNLP from the website. I then try to start a server from the terminal and go to my localhost as described here. Based on the documentation I should see something when I go to http://localhost:9000/, but nothing loads up.

    Here are the commands I use:

    cd stanford-corenlp-full-2018-10-05/
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    

    Here is the output of running the commands above:

    Samarths-MacBook-Pro-2:stanford-corenlp-full-2018-10-05 samarthbhandari$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
    [main] INFO CoreNLP - to use shift reduce parser download English models jar from:
    [main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
    [main] INFO CoreNLP -     Threads: 8
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    

    I then go to http://localhost:9000/, but nothing loads. As I mentioned above, I have been trying to do the same thing using some of the Python packages and observed similar behavior.

    Here is a stack overflow post related to server not responding using Stanza.

    OS: macOS 10.15.4
    Python: 3.7.7
    Java: 1.8

    cantreproduce 
    opened by samarth12 25
  • [MEMORY] Possibly use float instead of double in models/weights

    double arrays are a large portion of the heap.

    There are some places with 2d double arrays with dimensions like

    345k x 16, 150k x 24, 80k x 46: CRFClassifier.weights
    100k x 1000: Classifier.saved in DependencyParser
    60k x 50: Classifier.E, .eg2E
    1000 x 2400: Classifier.W1, .wg2W1

    Most are weights of some sort, making me wonder if they could be stored in fewer than 64 bits each.

    The obvious step would be to use float[], halving the memory use of this portion.

    Another would be to encode weights in something else, for example a small integer and scale that into a float again when using the weight.

    Machine Learning models often use fp16 or even fp8 to store weights, there are java implementations of float -> short -> float (with fp16 semantics stored in a 16bit short)

    like https://android.googlesource.com/platform/frameworks/base/+/master/core/java/android/util/Half.java with https://android.googlesource.com/platform/libcore/+/master/luni/src/main/java/libcore/util/FP16.java

    or https://stackoverflow.com/questions/6162651/half-precision-floating-point-in-java

    The latter approach would need some performance testing, as each time a weight is used it would have to be converted first.


    I saw that some models serialize themselves using ObjectStreams; those would need an adapter to deserialize to double[] first and then convert it to float[].

    Like in CRFClassifier.loadClassifier
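    A minimal sketch of the float[] variant (names hypothetical, not the CRFClassifier API): deserialize as double[], then narrow once at load time, halving the memory for that array at a cost of roughly seven significant decimal digits of precision.

```java
public class WeightNarrowing {
    // Narrow a deserialized double[] weight vector to float[] once, at load time.
    // Java has no direct double[] -> float[] cast, so copy element-wise.
    static float[] narrow(double[] weights) {
        float[] out = new float[weights.length];
        for (int i = 0; i < weights.length; i++) {
            out[i] = (float) weights[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] w = {0.123456789012345, -3.141592653589793};
        float[] f = narrow(w);
        for (int i = 0; i < f.length; i++) {
            // float retains ~7 significant decimal digits of each weight
            System.out.println(w[i] + " -> " + f[i]);
        }
    }
}
```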

    opened by lambdaupb 25
  • TokenSequenceParser ignoring tail of patterns mentioned in rules

    The following function in the TokenSequenceParser class ignores the tail of patterns defined in tokensregex rules:

    private String getStringFromTokens(Token head, Token tail, boolean includeSpecial) {
        StringBuilder sb = new StringBuilder();
        for (Token p = head; p != tail; p = p.next) {
            if (includeSpecial) {
                appendSpecialTokens(sb, p.specialToken);
            }
            sb.append(p.image);
        }
        return sb.toString();
    }

    E.g.: ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}]) gets converted to ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}] while reading, and it doesn't produce the intended matches.

    opened by ankitsingh2 23
  • Exception thrown for operation attempted on unknown vertex

    CoreNLP version 4.5.0 using pos lemma depparse. I run the pipeline within Spark (Scala). I lazy initialise the CoreNLP pipeline and I broadcast the pipeline to each executor using lazy instantiation wrapped in a case object. Also I force not to split the text fragment as it is intended to be a sentence already. The objective here is to do dependency analysis on the sentence and run some semgraph rules against it. We got a case where it throws an exception like this

    Caused by: edu.stanford.nlp.semgraph.UnknownVertexException: Operation attempted on unknown vertex happens/VBZ'''' in graph -> observed/VBD (root)
      -> 24/CD (nsubj)
        -> response/NN (nmod:in)
          -> In/IN (case)
          -> CoV/NNP (nmod:to)
            -> to/IN (case)
            -> SARS/NNP (compound)
            -> ‐/SYM (dep)
            -> ‐/SYM (dep)
            -> peptides/NNS (dep)
              -> 2/CD (nummod)
      -> ,/, (punct)
      -> we/PRP (nsubj)
      -> unexpectedly/RB (advmod)
      -> associated/VBN (ccomp)
        -> that/IN (mark)
        -> sirolimus/NN (nsubj:pass)
        -> was/VBD (aux:pass)
        -> significantly/RB (advmod)
        -> release/NN (obl:with)
          -> with/IN (case)
          -> a/DT (det)
          -> proinflammatory/JJ (amod)
          -> cytokine/NN (compound)
          -> levels/NNS (nmod:including)
            -> including/VBG (case)
            -> higher/JJR (amod)
            -> α/NN (nmod:of)
              -> of/IN (case)
              -> TNF/NN (compound)
              -> ‐/SYM (dep)
              -> IL/NN (conj:and)
                -> and/CC (cc)
            -> IL/NN (nmod:of)
            -> 1β/NN (nmod)
              -> ‐/SYM (dep)
      -> ./. (punct)
    
    	at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1084)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.<init>(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:339)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.resetChildIter(SemgrexMatcher.java:80)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:363)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:457)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:574)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
    	at az.bikg.nlp.etl.common.nlp.Pattern.go$3(Pattern.scala:200)
    	at az.bikg.nlp.etl.common.nlp.Pattern.$anonfun$findCauseEffectMatches$6(Pattern.scala:268)
    	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    	at az.bikg.nlp.etl.common.nlp.Pattern.findCauseEffectMatches(Pattern.scala:266)
    	at az.bikg.nlp.etl.steps.ERs$.findRelations(ERs.scala:107)
    	at az.bikg.nlp.etl.steps.ERs$.findRelationsSpark(ERs.scala:229)
    	at az.bikg.nlp.etl.steps.ERs$.$anonfun$extractERs$1(ERs.scala:242)
    	... 28 more
    

    Am I doing anything wrong because of this exception? It didn't happen with version 4.4.0.

    opened by mkarmona 22
  • Are these latest Chinese models significantly worse than the Stanford online parser?

    I tested the latest Chinese CoreNLP 3.9.2 version, and found the results are quite horrible. Here are a few examples:

    我的朋友: always tags "我的" as one NN token.
    我的狗吃苹果: "我的狗" tagged as one NN token.
    他的狗吃苹果: "狗吃" tagged as one NN token.
    高质量就业成时代: "就业" tagged as VV.

    When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, the results for those examples are all correct. Why is that? Are the models different? Is there a bug in the new 3.9.2 model?

    opened by lingvisa 21
  • pos-tagger cannot load models from stanford-corenlp-3.5.2-models.jar

    I use Stanford CoreNLP in Java as a Maven dependency. I want to use the MaxentTagger with a model supplied in the stanford-corenlp-3.5.2-models package. The problem is that I cannot access this model through the classpath.

    My code is

    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    The file "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" exists in the jar and should be loaded through classpath, but the following exception is thrown

    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:770)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:298)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:263)
        at cz.zcu.kiv.nlp.semeval.cwi.features.POSFeature.<init>(POSFeature.java:24)
        at cz.zcu.kiv.nlp.semeval.cwi.CWIModel.train(CWIModel.java:60)
        at cz.zcu.kiv.nlp.semeval.cwi.TrainingCrossValidation.main(TrainingCrossValidation.java:51)
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
        at  edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:765)
        ... 5 more
    

    If I copy the model out of the jar and use e.g.

    tagger = new MaxentTagger("./english-left3words-distsim.tagger");
    

    then everything works perfectly.

    The problem is probably in the class IOUtils, method findStreamInClasspathOrFileSystem(String name).

    In the line

    InputStream is = IOUtils.class.getClassLoader().getResourceAsStream(name);
    

    the returned classloader is probably the JarClassLoader which loaded the library (stanford-corenlp-3.5.2.jar) and it does not have access to other libraries.

    This theory is supported by the following code

    InputStream stream = POSFeature.class.getResourceAsStream("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    System.out.println("Stream == null: " + (stream == null));
    
    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    which outputs

    Stream == null: false
    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
    ...
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
    
    opened by konkol 21
  • Why is there no description of how to set up the models jar with a build tool?

    In the README, the way to install the models jars is described, but a method using build tools (e.g. Gradle) is not. However, I tried this approach ( https://stackoverflow.com/a/68859054/3809427 ) and it succeeded. Why don't you document this useful method?
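    For what it's worth, the linked answer boils down to pulling the models jar in by its classifier. A sketch of the Gradle equivalent (version number as published on Maven Central; adjust as needed):

```groovy
dependencies {
    implementation "edu.stanford.nlp:stanford-corenlp:4.4.0"
    // the models jar is the same artifact with the "models" classifier
    implementation "edu.stanford.nlp:stanford-corenlp:4.4.0:models"
}
```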

    opened by lamrongol 3
  • EntityMentions returns null instead of empty list

    I ran into an issue where, if an empty string is passed in, getting the entityMentions returns null instead of an empty list, which I figured would be standard practice.

    Example code:

    StanfordCoreNLP processor = new StanfordCoreNLP(props);
    CoreDocument nlpDocument = new CoreDocument("");
    
    processor.annotate(nlpDocument);
    List<CoreEntityMention> entities = nlpDocument.entityMentions(); // <== returns null
    

    Just wanted to know if any light can be shed on this. If this is expected behavior, I will do my best to document it in the documentation.
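    Until that's settled, a caller-side guard is a one-liner (a sketch; orEmpty is a hypothetical helper, and the null stands in for what entityMentions() currently returns on empty input):

```java
import java.util.Collections;
import java.util.List;

public class EntityGuard {
    // Treat a null mention list as an empty one on the caller side
    static <T> List<T> orEmpty(List<T> maybeNull) {
        return (maybeNull == null) ? Collections.<T>emptyList() : maybeNull;
    }

    public static void main(String[] args) {
        List<String> entities = null; // stand-in for entityMentions() on empty input
        System.out.println(orEmpty(entities).size()); // prints 0
    }
}
```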

    opened by cholojuanito 2
  • 'email' tokenizing as 'em, ail, and '

    In the following sentence (from Twitter), 'email' is being tokenized as 'em, ail, and '. This is obviously incorrect. What can be done to stop this split?

    • It's official (according to the AP) it's 'email' not 'e-mail' and 'website' not 'web-site'!

    I have the following parameters set:

    tokenize.language: English
    tokenize.whitespace: false (because we want tokens like it's to separate into it and 's)
    tokenize.keepeol: false
    tokenize.verbose: false
    tokenize.options: invertible=true,splitAssimilations=false,splitHyphenated=false,splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,normalizeSpace=false,ellipses=original

    opened by saxtell-cb 6
  • parsing '`'

    curl 'http://localhost:9000/?properties={%22annotators%22%3A%22lemma%22%2C%22outputFormat%22%3A%22json%22}' -d '`'
    

    Gives me:

    {
      "sentences": [
        {
          "index": 0,
          "tokens": [
            {
              "index": 1,
              "word": "`",
              "originalText": "`",
              "lemma": "`",
              "characterOffsetBegin": 0,
              "characterOffsetEnd": 1,
              "pos": "``",
              "before": "",
              "after": ""
            }
          ]
        }
      ]
    }
    

    With the standard English model. Is this expected? I'm particularly surprised at the POS.

    opened by AntonOfTheWoods 5
  • Stanford CoreNLP slower after upgrade from 3.7.0 to 4.5.1

    A unit test that runs in a loop calling pipeline.annotate(document) appears to be taking about 50% longer. Our configuration properties didn't change during the upgrade, but maybe some new properties have been added in 4.5.1? Below is what we have. Is there a way to determine which annotator is using more time now?

    customAnnotatorClass.tokensregex=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    sutime.binders=0
    tokensregex.rules= .... (omitted)
    ssplit.eolonly=false
    customAnnotatorClass.tokenOverride_en= .... (omitted)
    annotators=tokenize, ssplit, tokenOverride_en, pos, lemmaOverride_en, ner, tokensregex, entitymentions, parse
    language=en
    tokenize.whitespace=false
    customAnnotatorClass.lemmaOverride_en=.... (omitted)
    tokenize.options=untokenizable=allKeep,americanize=false
    ssplit.isOneSentence=true
    nermention.acronyms=true

    opened by dsbanks99 1
Releases (v4.5.1)
  • v4.5.1 (Aug 30, 2022)

    CoreNLP 4.5.1

    Bugfixes!

    • Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word https://github.com/stanfordnlp/CoreNLP/commit/974383ab7336a254d260264885186dd77df0cf81
    • Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. https://github.com/stanfordnlp/CoreNLP/issues/1289 https://github.com/stanfordnlp/CoreNLP/commit/655018895e2f2870ce721de42d31b845fa991335
    • Fix \r\n not being properly processed on Windows: #1291 https://github.com/stanfordnlp/CoreNLP/commit/9889f4ef4ee9feb8b70f577db8353c8d6c896ae3
    • Handle one half of surrogate character pairs in the tokenizer w/o crashing https://github.com/stanfordnlp/CoreNLP/issues/1298 https://github.com/stanfordnlp/CoreNLP/commit/1b12faa64b9ea85f808b27ab74ccf9f79ccb01f4
    • Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: https://github.com/stanfordnlp/CoreNLP/issues/1296 https://github.com/stanfordnlp/CoreNLP/issues/1229 https://github.com/stanfordnlp/CoreNLP/issues/1169 https://github.com/stanfordnlp/CoreNLP/commit/f99b5ab87f073118a971c4d1e39df85ab9abbab1
    Source code(tar.gz)
    Source code(zip)
  • v4.5.0 (Jul 22, 2022)

    CoreNLP 4.5.0

    Main features are improved lemmatization of English, improved tokenization of both English and the non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.

    • All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example https://github.com/stanfordnlp/CoreNLP/commit/d46fecd93c6964f635efe85d9b7c327ee8880fb9

    • log4j removed entirely from public CoreNLP (internal "research" branch still has a use) https://github.com/stanfordnlp/CoreNLP/commit/f05cb54ec0a4f3c90395771817f44a81eb549baf

    • Fix NumberFormatException showing up in NER models: https://github.com/stanfordnlp/CoreNLP/issues/547 https://github.com/stanfordnlp/CoreNLP/commit/5ee2c391104109a338a28f35c647b7684b00ad41

    • Fix "seconds" in the lemmatizer: https://github.com/stanfordnlp/CoreNLP/commit/e7a073bde9ba7bbdb40ba81ed96d379455629e44

    • Fix double escaping of & in the online demos: https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb

    • Report the cause of an error if "tregex" is asked for but no parse annotator is added: https://github.com/stanfordnlp/CoreNLP/commit/4db80c051322697c983ecda873d8d38f808cb96c

    • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): https://github.com/stanfordnlp/CoreNLP/pull/1259

    • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: https://github.com/stanfordnlp/CoreNLP/pull/1263

    • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: https://github.com/stanfordnlp/CoreNLP/commit/3c40ba32ca51af02936b907d03406e2158883f7b https://github.com/stanfordnlp/CoreNLP/commit/58a2288239f631df47fac3eed105fe78c08b1a5d https://github.com/stanfordnlp/CoreNLP/commit/8b97d64e48e6d4161f62a8635d2bb4cee2e95553

    • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas https://github.com/stanfordnlp/CoreNLP/commit/9476a8eb724e01df4b05bce38789dd8a7e61397c https://github.com/stanfordnlp/CoreNLP/commit/6193934af8ae0abb0b4c6a2522d7efdfa426e5b3 https://github.com/stanfordnlp/CoreNLP/commit/afb1ea89c874acd58bab584f1e29a059c44dfd20 https://github.com/stanfordnlp/CoreNLP/commit/7c84960df4ac9d391ef37855572e2f8bc301ee17

    • Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases https://github.com/stanfordnlp/CoreNLP/pull/1266

    • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) https://github.com/stanfordnlp/CoreNLP/commit/45b47e245c367663bba2e81a26ea7c29262ad0d8

    • Trim words in the NER training process. Spaces can still occur inside a word, but stray whitespace won't ruin the performance of the models https://github.com/stanfordnlp/CoreNLP/commit/0d9e9c829bfa75bb661cccea03fc682a0f955f0d

    • Fix NBSP in the Chinese segmenter https://github.com/stanfordnlp/stanza/issues/1052 https://github.com/stanfordnlp/CoreNLP/pull/1279

  • v4.4.0(Jan 25, 2022)

    Enhancements

    • added -preTokenized option, which assumes the text is already tokenized on whitespace and sentence-split on newlines

    • tsurgeon CLI - python side added to stanza
      https://github.com/stanfordnlp/CoreNLP/pull/1240

    • sutime WORKDAY definition https://github.com/stanfordnlp/CoreNLP/commit/0dfb11817c2b46a532985c24289e128fbb81a2c0
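The -preTokenized option above is a command-line flag to the pipeline; a typical invocation might look like the following (the annotator list and file name are illustrative):

    java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
        -annotators tokenize,ssplit,pos \
        -preTokenized \
        -file pretok.txt

With the flag set, each whitespace-separated string in pretok.txt is treated as one token and each line as one sentence.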

    Fixes

    • rebuilt Italian dependency parser using CoreNLP predicted tags

    • XML security issue: https://github.com/stanfordnlp/CoreNLP/pull/1241

    • NER server security issue: https://github.com/stanfordnlp/CoreNLP/commit/5ee097dbede547023e88f60ed3f430ff09398b87

    • fix infinite loop in tregex: https://github.com/stanfordnlp/CoreNLP/pull/1238

    • json utf-8 output on windows https://github.com/stanfordnlp/CoreNLP/pull/1231 https://github.com/stanfordnlp/stanza/issues/894

    • fix openie crash in certain unusual graphs https://github.com/stanfordnlp/CoreNLP/pull/1230 https://github.com/stanfordnlp/CoreNLP/issues/1082

    • fix nondeterministic results in certain SemanticGraph structures https://github.com/stanfordnlp/CoreNLP/pull/1228 https://github.com/stanfordnlp/CoreNLP/commit/cc806f265292977b69fd55f36408fe5ad3a695a0

    • workaround for NLTK sending % unescaped to the server https://github.com/stanfordnlp/CoreNLP/issues/1226 https://github.com/stanfordnlp/CoreNLP/commit/20fe1e996455b1c1434022d6e7f0b8524f41f253

    • make TimingTest function on Windows https://github.com/stanfordnlp/CoreNLP/commit/4aafb84f6ea5c0102c921a503cbfb8e3d34f3e22

  • v4.3.2(Nov 18, 2021)

  • v4.3.1(Oct 22, 2021)

    Fixes

    • fixes character offset issue with StatTok
    • fixes path issue with default Hungarian properties
    • adds Hungarian and Italian to demo
    • fixes umlaut issue
  • v4.3.0(Oct 6, 2021)

    Overview

    This release adds new European languages, improvements to the parsers and tokenizers, and other misc. fixes.

    Enhancements

    • Hungarian pipeline
    • Italian pipeline
    • Improvements to English tokenizer
    • Better memory usage by dependency parser

    Fixes

    • issue with umlaut handling in German #1184
  • v4.2.2(May 14, 2021)

    This release includes some small fixes to version 4.2.1.

    It includes:

    • demo fixes for 4.2.2, resolving cache issues with demo resources
    • small fix to RegexNERSequenceClassifier issue allowing AnswerAnnotation to be overwritten
  • v4.2.1(May 5, 2021)

    Fix the server emitting some links as http instead of https https://github.com/stanfordnlp/CoreNLP/issues/1146

    Improve MWE expressions in the enhanced dependency conversion https://github.com/stanfordnlp/CoreNLP/commit/1ef9ef9c75e6948eed10092bf6d1c49c49cfabaa

    Add the ability for the command line semgrex processor to handle multiple calls in one process https://github.com/stanfordnlp/CoreNLP/commit/c9d50ef9cb2e1851257d06cda55b1456d69145b7

    Fix interaction between discarding tokens in ssplit and assigning NER tags https://github.com/stanfordnlp/CoreNLP/commit/a803bc357c32841beb3919f2e4dc22a1375dca4d

    Reduce the size of the sr parser models (not a huge amount, but some) https://github.com/stanfordnlp/CoreNLP/pull/1142

    Various QuoteAnnotator bug fixes https://github.com/stanfordnlp/CoreNLP/pull/1135 https://github.com/stanfordnlp/CoreNLP/issues/1134 https://github.com/stanfordnlp/CoreNLP/pull/1121 https://github.com/stanfordnlp/CoreNLP/issues/1118 https://github.com/stanfordnlp/CoreNLP/commit/9f1b015ea91f1db6dce6ab7f35aacb9cdc33e463 https://github.com/stanfordnlp/CoreNLP/issues/1147

    Switch to newer istack implementation https://github.com/stanfordnlp/CoreNLP/pull/1133

    Upgrade to newer protobuf https://github.com/stanfordnlp/CoreNLP/pull/1150

    Add a CoNLL-U output format to some of the segmenter code, useful for testing with the official test scripts https://github.com/stanfordnlp/CoreNLP/commit/c70ddec9736e9d3c7effd4660f63e363caeb333d

    Fix Turkish locale enums https://github.com/stanfordnlp/CoreNLP/pull/1126 https://github.com/stanfordnlp/stanza/issues/580

    Use StringBuilder instead of StringBuffer where possible https://github.com/stanfordnlp/CoreNLP/pull/1010

  • v4.2.0(Nov 17, 2020)

    Overview

    This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.

    Enhancements

    • Upgrade libraries (EJML, JUnit, JFlex)
    • Add character offsets to Tregex responses from server
    • Improve cleaning of treebanks for English models
    • Speed up loading of Wikidict annotator
    • New utility for tagging CoNLL-U files in place
    • Command line tool for processing TokensRegex

    Fixes

    • Output single token NER entities in inline XML output format
    • Add currency symbol part of speech training data
    • Fix issues with tree binarizing
  • v4.0.0(May 4, 2020)

    Overview

    The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.

    Enhancements

    • UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
    • Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
    • Have WhitespaceTokenizer support the same newline processing as PTBTokenizer
    • New mwt annotator for handling multiword tokens in French, German, and Spanish.
    • New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
    • Add French NER
    • New Chinese segmentation based off CTB9
    • Improved handling of double codepoint characters
    • Easier syntax for specifying language specific pipelines and NER pipeline properties
    • Improved CoNLL-U processing
    • Improved speed and memory performance for CRF training
    • Tregex support in CoreSentence
    • Updated library dependencies
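For users who relied on the pre-4.0 PTB-style escaping, the old behavior can likely be restored through tokenizer options; the option values below are assumptions based on PTBTokenizer's option names, not something this changelog guarantees:

    tokenize.options = splitHyphenated=false,normalizeParentheses=true,normalizeOtherBrackets=true,latexQuotes=true

Here splitHyphenated=false keeps most hyphenated words as single tokens, while the remaining options turn parentheses and quotes back into PTB escape sequences.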

    Fixes

    • NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
    • NPE in EntityMentionsAnnotator during language check
    • NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
    • NPE in NERCombinerAnnotator in certain configurations of models on/off
    • Incorrect handling of eolonly option in ArabicSegmenterAnnotator
    • Apply named entity granularity change prior to coref mention detection
    • Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
    • Incorrect handling of reading in German treebank files
    • SR parser crashes when given bad training input
    • New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special-casing 'Alex.' for 'Alex. Brown'
    • Fix ancient bug in printing constituency tree with multiple roots.
    • Fix parser from failing on word "STOP" because it treated it as a special word