ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Overview

ANTLR v4

Java 7+ License

Build status

Github CI Build Status (MacOSX) AppVeyor CI Build Status (Windows) Circle CI Build Status (Linux)

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

Donate

Authors and major contributors

Useful information

You might also find the following pages useful, particularly if you want to mess around with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language—ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference at amazon or an electronic version at the publisher's site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions where the root directory name is the all-lowercase name of the language parsed by the grammar. For example, java, cpp, csharp, c, etc...

Comments
  • New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    Fixes #276 .

    This used to be a WIP PR, but it's now ready for review.

    This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.

    The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.

    I bumped the UUID, since this changes the serialized ATN format.

    I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.

    type:feature unicode 
    opened by bhamiltoncx 115
  • Terrible Golang performance

    Terrible Golang performance

    Stackoverflow: https://stackoverflow.com/questions/72266899/golang-performance-issues

    Google group: https://groups.google.com/g/antlr-discussion/c/OdhAIsy2GfI

    Example code: https://github.com/movelazar/perf-repro

    A simple rule such as:

    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2
    

    takes exponentially longer to parse the more 1 EQ 2 OR clauses there are. This does not happen in python (by my testing) or CSharp, Dart, Java (by stackoverflow comment).

    On my machine, # of lines vs parse time:

    11: 0.5s
    12: 1.2s
    13: 3.2s
    14: 8.1s
    15: 21.9s
    16: 57.5s
    

    Given that Python doesn't face this issue I can't imagine I'm doing something terrible in my grammar.

    Issue goes away if I put parens on things but that's not a real solution.

    On 4.10.1, first noticed with 4.9.1.

    Any help is greatly appreciated. Surprised I can't find others with this issue.

    type:bug target:go comp:performance 
    opened by movelazar 101
  • splitting version numbers for targets

    splitting version numbers for targets

    Hiya: @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos

    Eric has raised the point that it would be nice to be able to make quick patches to the various runtimes; e.g., there is a stopping bug now in the JavaScript target. He proposes something along these lines:

    • any change in the tool or the runtime algorithm bumps the middle version #: 4.9 -> 4.10 -> 4.11
    • any bug fix in a runtime we bump the last digit of that runtime only: 4.9 -> 4.9.1 -> 4.9.2
    • if bumping the java runtime for bug fix we also bump the tool since it contains the runtime

    This is in optimal as people have criticized me in the past for bumping, say, 4.6 to 4.7 for some minor changes. It also has the problem that 4.9.x will not mean the same thing in two different targets possibly, as each target will now have their own version number.

    Rather than break up all of the targets into separate repositories or similar, can you guys think of a better solution? Any suggestions? The goal here is to allow more rapid target releases, and independent of me having to do a major release of the tool.

    type:question 
    opened by parrt 94
  • Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (Unicode code points > U+FFFF).

    I split out the internal storage into a class CodePointBuffer which has a CodePointBuffer.Builder class which has the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

    I found the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

    Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

    To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses: CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream which directly access the array of underlying code points in the CodePointBuffer without virtual method calls.

    lexers target:java comp:performance 
    opened by bhamiltoncx 85
  • initial discussion to start integration of new targets

    initial discussion to start integration of new targets

    As promised, I am now ready to integrate the new ANTLR target languages you folks have been working on. This issue is meant to get everybody in sync, check status, and discuss the proper order of integration and resolve issues etc.

    There are two administrative details to get out of the way first:

    1. Please let me know if there is another github user that should be added to one of the categories. Or, of course, if you would like your user ID removed from this discussion.
    2. Nothing can be merged into antlr/antlr4 unless every single committer has added themselves to the contributors.txt file. It's onerous, particularly for simple commits, but it is requirement for anything merged into the master. Eclipse foundation lawyers tell me that we have one of the cleanest licenses out there and it contributes to ANTLR's widespread use because companies are not afraid to use the software. See the genesis of such heinous requirements in SCO v IBM. This means lead target authors have to go back through their committers list quickly and ask them to sign the contributors file with a new commit. Or, they can remove that commit and enter their own version of the functionality, being careful not to violate copyright on the previous.

    As we proceed, please keep in mind that I have a difficult role, balancing the needs of multiple targets and keeping discussions in the civil and practical zone. Decisions I make come from the perspective of over 25 years managing and leading this project. I look forward to incorporating your hard work into the main antlr repo.

    C++ current location

    • @mike-lischke
    • @DanMcLaughlin
    • @nburles
    • @davesisson

    Go current location, previous discussion

    • @pboyer

    Swift current location: unclear, previous discussion

    • @jeffreyguenther
    • @hanjoes
    • @janyou
    • @ewanmellor

    Likely interested/supporting humans (scraped from github issues):

    • @RYDB3RG
    • @wjkohnen
    • @willfaught
    • @parrt
    • @sharwell
    • @ericvergnaud
    type:improvement target:swift target:cpp target:go 
    opened by parrt 84
  • Add a new CharStream that converts the symbols to upper or lower case.

    Add a new CharStream that converts the symbols to upper or lower case.

    This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at https://github.com/antlr/antlr4test-maven-plugin/issues/1

    It would be used like so:

    input, _ := antlr.NewFileStream("filename")
    
    in = antlr.NewCaseChangingStream(is, true) // true forces upper case symbols, false forces lower case.
    
    lexer := parser.NewAbnfLexer(in)
    

    While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

    I would love for the grammar to have a option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

    lexers comp:runtime target:java target:javascript target:go 
    opened by bramp 69
  • Swift Target

    Swift Target

    I did a quick search and I didn't see anything written about this yet. What's the likelihood of a Swift target for ANTLR?

    There are C#, Javascript, and Python targets at the moment.

    What does it take to implement a target? Given that Swift is more Java-like, it seems like it should be possible. Maybe start with a code translator if there is one for (Java to Swift), and iterate towards a more idiomatic implementation.

    type:question 
    opened by jeffreyguenther 69
  • Clean up ATN serialization: rm UUID and shifting by value of 2

    Clean up ATN serialization: rm UUID and shifting by value of 2

    • I think we don't need the UUID in the serialization, since it has not changed in a decade. We can bump the version number and remove the UU ID
    • I did some tests and there seems to be no reason to shift the values in the serialized ATN by 2 for the purposes of improving the UTF-8 encoding for the Java target.

    If you guys agree, we can make this small change for cleanup purposes. I'm happy to do it if you guys don't want to. The second fix will require changes to each target but it's trivial to fix.

    atn-analysis type:cleanup 
    opened by parrt 68
  • Preparing for 4.9.3 release

    Preparing for 4.9.3 release

    It's that time of year again! @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos Shall we do a 4.9.3 release?

    I went through and marked all of the merged PRs and related issues with 4.9.3 and try to tag them according to their target. Would you guys like to go through the PRs to see if there's something that should be merged quickly?

    opened by parrt 68
  • [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    @ericvergnaud Hi, I modified csproj options a bit, now I can get a working nuget package locally without the issue we described in #2021. I added .net 3.5 as a target to "main" csproj along with netstandard, since it's easier to keep track of requirements for both sets of api's when editing code and, ideally, both targets can be packed into a nuget package with a single command. Right now it's possible only on Windows via msbuild /t:pack or Visual Studio; unfortunately, due to https://github.com/Microsoft/msbuild/issues/1333, right now dotnet build pack does not work for .net 3.5 target the way it should, so I adjusted the existing script to create packages from .nuspec and different solutions for different targets.

    comp:build target:csharp 
    opened by listerenko 68
  • A few updates to the Unicode documentation.

    A few updates to the Unicode documentation.

    It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.

    comp:doc 
    opened by mike-lischke 61
  • Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    This is a serious bug in the CSharp runtime, ParserATNSimulator.cs.

    reachesIntoOuterContext is an integer that tracks the depth of how far we dip into the outer context. This field must be interpreted with the bit map mask SUPPRESS_PRECEDENCE_FILTER, and careful attention placed on whether to access the field raw or with the bit map mask.

    In Java, this code accesses the field raw.

    In CSharp, the field is accessed through a method that applies the bit map mask. This is a serious problem in that it causes ATN parser trace divergence.

    I did a cursory check on the other targets, and I am not confident the field access is done correctly across targets.

    The fix is the change ParserATNInterpreter.cs:

    diff --git a/runtime/CSharp/src/Atn/ParserATNSimulator.cs b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    index 4a7a6a3d5..8463bfcc3 100644
    --- a/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    +++ b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    @@ -1579,7 +1579,7 @@ namespace Antlr4.Runtime.Atn
     						// This assignment also propagates the
     						// isPrecedenceFilterSuppressed() value to the new
     						// configuration.
    -						c.reachesIntoOuterContext = config.OuterContextDepth;
    +                                                c.reachesIntoOuterContext = config.reachesIntoOuterContext;
     						ClosureCheckingStopState(c, configSet, closureBusy, collectPredicates,
     												 fullCtx, depth - 1, treatEofAsEpsilon);
     					}
    

    (Note, the source code should really not be using tabs. Tab expansion is editor specific.)

    atn-analysis type:bug target:csharp 
    opened by kaby76 3
  • Python runtime test failure with 4.10 and later

    Python runtime test failure with 4.10 and later

    After upgrading the antlr4 python runtime past 4.9.1 (tested 4.10.1 and 4.11.1) in nixpkgs, we've been seeing its test suite fail with

    Traceback (most recent call last):
      File "/build/source/runtime/Python3/tests/ctest.py", line 10, in <module>
        from parser.cparser import CParser
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 631, in <module>
        class CParser ( Parser ):
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 635, in CParser
        atn = ATNDeserializer().deserialize(serializedATN())
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 28, in deserialize
        self.checkVersion()
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 50, in checkVersion
        raise Exception("Could not deserialize ATN with version " + str(version) + " (expected " + str(SERIALIZED_VERSION) + ").")
    Exception: Could not deserialize ATN with version  (expected 4).
    

    The version returned seems to be \x03, while the test suite expects it to be 4.

    While similar to #3997 and #3895 we are running the test suite, not our own code.

    We're running python ctest.py from the tests directory.

    opened by mweinelt 4
  • Suggestion for the Test Rig GUI tool

    Suggestion for the Test Rig GUI tool

    This is not a defect or reporting a problem. It is a suggestion that the GUI tool have a new button added between "OK" and "Export as PNG". That can be named "Reload".

    The idea being that I can edit/save a test source file and just click "reload" rather than doing what I do which is close the tool and go back to my command window and rerun the batch file that starts the GUI.

    Thoughts?

    opened by Korporal 0
  •  invalid syntax `self.from = None` in python3 generated parsers

    invalid syntax `self.from = None` in python3 generated parsers

    Please include the following information

    python3 antlr4 version is 4.9.3

    • smallest possible grammar and code that reproduces the behavior
    import sys
    import antlr4
    from PrestoSqlLexer import PrestoSqlLexer
    from PrestoSqlParser import PrestoSqlParser
    
    
    def main():
        input_stream = antlr4.CharStream('select 1')
        lexer = PrestoSqlLexer(input_stream)
        tokens = antlr4.CommonTokenStream(lexer)
        tokens.fill()
        print([token.text for token in tokens.tokens][:-1])
    
    
    if __name__ == '__main__':
        main()
    
    • description of the expected behavior and actual behavior Pointers to suspicious code regions are also very welcome.

    error:

    Traceback (most recent call last):
      File "/Users/clients/autocomplete/parseQuery.py", line 4, in <module>
        from PrestoSqlParser import PrestoSqlParser
      File "/Users/clients/autocomplete/PrestoSqlParser.py", line 1837
        self.from = None # QualifiedNameContext
             ^^^^
    SyntaxError: invalid syntax
    
    opened by chengchengpei 2
Releases(4.11.1)
Owner
Antlr Project
The Project organization for the ANTLR parser generator.
Antlr Project
This project converts your human voice input to its text transcript and to an automated voice too.

Human Voice to Automated Voice & Text Introduction: In this project, whenever you'll speak, it will turn your voice into a robot voice and furthermore

Hassan Shahzad 3 Oct 15, 2021
Journey is a NLP-Powered Developer assistant

Journey Journey is a NLP-Powered Developer assistant Using on the powerful Natural Language Processing library Mindmeld, this projects aims to assist

Christian Eilers 21 Dec 11, 2022
Russian words synonyms and antonyms

ru_synonyms Russian words synonyms and antonyms. Install pip install git+https://github.com/ahmados/rusynonyms.git Usage from ru_synonyms import Anto

sumekenov 7 Dec 14, 2022
A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Ethan 66 Dec 26, 2022
Spooky Skelly For Python

_____ _ _____ _ _ _ | __| ___ ___ ___ | |_ _ _ | __|| |_ ___ | || | _ _ |__ || . || . || . || '

Kur0R1uka 1 Dec 23, 2021
Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Jan 2 Apr 20, 2022
American Sign Language (ASL) to Text Converter

Signterpreter American Sign Language (ASL) to Text Converter Recommendations Although there is grayscale and gaussian blur, we recommend that you use

0 Feb 20, 2022
NLP command-line assistant powered by OpenAI

NLP command-line assistant powered by OpenAI

Axel 16 Dec 09, 2022
Bnagla hand written document digiiztion

Bnagla hand written document digiiztion This repo addresses the problem of digiizing hand written documents in Bangla. Documents have definite fields

Mushfiqur Rahman 1 Dec 10, 2021
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 03, 2023
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
A fast and easy implementation of Transformer with PyTorch.

FasySeq FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which

宁羽 7 Jul 18, 2022
CorNet Correlation Networks for Extreme Multi-label Text Classification

CorNet Correlation Networks for Extreme Multi-label Text Classification Prerequisites python==3.6.3 pytorch==1.2.0 torchgpipe==0.0.5 click==7.0 ruamel

Guangxu Xun 38 Dec 31, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

ProphetNet-X This repo provides the code for reproducing the experiments in ProphetNet. In the paper, we propose a new pre-trained language model call

Microsoft 394 Dec 17, 2022
Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani 184 Nov 27, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 06, 2023