ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Overview

ANTLR v4

Java 7+ License

Build status

Github CI Build Status (MacOSX) AppVeyor CI Build Status (Windows) Circle CI Build Status (Linux)

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

Donate

Authors and major contributors

Useful information

You might also find the following pages useful, particularly if you want to mess around with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language—ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference at amazon or an electronic version at the publisher's site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions where the root directory name is the all-lowercase name of the language parsed by the grammar. For example, java, cpp, csharp, c, etc...

Comments
  • New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    Fixes #276 .

    This used to be a WIP PR, but it's now ready for review.

    This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.

    The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.

    I bumped the UUID, since this changes the serialized ATN format.

    I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.

    type:feature unicode 
    opened by bhamiltoncx 115
  • Terrible Golang performance

    Terrible Golang performance

    Stackoverflow: https://stackoverflow.com/questions/72266899/golang-performance-issues

    Google group: https://groups.google.com/g/antlr-discussion/c/OdhAIsy2GfI

    Example code: https://github.com/movelazar/perf-repro

    A simple rule such as:

    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2
    

    takes exponentially longer to parse the more 1 EQ 2 OR clauses there are. This does not happen in python (by my testing) or CSharp, Dart, Java (by stackoverflow comment).

    On my machine, # of lines vs parse time:

    11: 0.5s
    12: 1.2s
    13: 3.2s
    14: 8.1s
    15: 21.9s
    16: 57.5s
    

    Given that Python doesn't face this issue I can't imagine I'm doing something terrible in my grammar.

    Issue goes away if I put parens on things but that's not a real solution.

    On 4.10.1, first noticed with 4.9.1.

    Any help is greatly appreciated. Surprised I can't find others with this issue.

    type:bug target:go comp:performance 
    opened by movelazar 101
  • splitting version numbers for targets

    splitting version numbers for targets

    Hiya: @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos

    Eric has raised the point that it would be nice to be able to make quick patches to the various runtimes; e.g., there is a stopping bug now in the JavaScript target. He proposes something along these lines:

    • any change in the tool or the runtime algorithm bumps the middle version #: 4.9 -> 4.10 -> 4.11
    • any bug fix in a runtime we bump the last digit of that runtime only: 4.9 -> 4.9.1 -> 4.9.2
    • if bumping the java runtime for bug fix we also bump the tool since it contains the runtime

    This is in optimal as people have criticized me in the past for bumping, say, 4.6 to 4.7 for some minor changes. It also has the problem that 4.9.x will not mean the same thing in two different targets possibly, as each target will now have their own version number.

    Rather than break up all of the targets into separate repositories or similar, can you guys think of a better solution? Any suggestions? The goal here is to allow more rapid target releases, and independent of me having to do a major release of the tool.

    type:question 
    opened by parrt 94
  • Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (Unicode code points > U+FFFF).

    I split out the internal storage into a class CodePointBuffer which has a CodePointBuffer.Builder class which has the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

    I found the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

    Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

    To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses: CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream which directly access the array of underlying code points in the CodePointBuffer without virtual method calls.

    lexers target:java comp:performance 
    opened by bhamiltoncx 85
  • initial discussion to start integration of new targets

    initial discussion to start integration of new targets

    As promised, I am now ready to integrate the new ANTLR target languages you folks have been working on. This issue is meant to get everybody in sync, check status, and discuss the proper order of integration and resolve issues etc.

    There are two administrative details to get out of the way first:

    1. Please let me know if there is another github user that should be added to one of the categories. Or, of course, if you would like your user ID removed from this discussion.
    2. Nothing can be merged into antlr/antlr4 unless every single committer has added themselves to the contributors.txt file. It's onerous, particularly for simple commits, but it is requirement for anything merged into the master. Eclipse foundation lawyers tell me that we have one of the cleanest licenses out there and it contributes to ANTLR's widespread use because companies are not afraid to use the software. See the genesis of such heinous requirements in SCO v IBM. This means lead target authors have to go back through their committers list quickly and ask them to sign the contributors file with a new commit. Or, they can remove that commit and enter their own version of the functionality, being careful not to violate copyright on the previous.

    As we proceed, please keep in mind that I have a difficult role, balancing the needs of multiple targets and keeping discussions in the civil and practical zone. Decisions I make come from the perspective of over 25 years managing and leading this project. I look forward to incorporating your hard work into the main antlr repo.

    C++ current location

    • @mike-lischke
    • @DanMcLaughlin
    • @nburles
    • @davesisson

    Go current location, previous discussion

    • @pboyer

    Swift current location: unclear, previous discussion

    • @jeffreyguenther
    • @hanjoes
    • @janyou
    • @ewanmellor

    Likely interested/supporting humans (scraped from github issues):

    • @RYDB3RG
    • @wjkohnen
    • @willfaught
    • @parrt
    • @sharwell
    • @ericvergnaud
    type:improvement target:swift target:cpp target:go 
    opened by parrt 84
  • Add a new CharStream that converts the symbols to upper or lower case.

    Add a new CharStream that converts the symbols to upper or lower case.

    This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at https://github.com/antlr/antlr4test-maven-plugin/issues/1

    It would be used like so:

    input, _ := antlr.NewFileStream("filename")
    
    in = antlr.NewCaseChangingStream(is, true) // true forces upper case symbols, false forces lower case.
    
    lexer := parser.NewAbnfLexer(in)
    

    While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

    I would love for the grammar to have a option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

    lexers comp:runtime target:java target:javascript target:go 
    opened by bramp 69
  • Swift Target

    Swift Target

    I did a quick search and I didn't see anything written about this yet. What's the likelihood of a Swift target for ANTLR?

    There are C#, Javascript, and Python targets at the moment.

    What does it take to implement a target? Given that Swift is more Java-like, it seems like it should be possible. Maybe start with a code translator if there is one for (Java to Swift), and iterate towards a more idiomatic implementation.

    type:question 
    opened by jeffreyguenther 69
  • Clean up ATN serialization: rm UUID and shifting by value of 2

    Clean up ATN serialization: rm UUID and shifting by value of 2

    • I think we don't need the UUID in the serialization, since it has not changed in a decade. We can bump the version number and remove the UU ID
    • I did some tests and there seems to be no reason to shift the values in the serialized ATN by 2 for the purposes of improving the UTF-8 encoding for the Java target.

    If you guys agree, we can make this small change for cleanup purposes. I'm happy to do it if you guys don't want to. The second fix will require changes to each target but it's trivial to fix.

    atn-analysis type:cleanup 
    opened by parrt 68
  • Preparing for 4.9.3 release

    Preparing for 4.9.3 release

    It's that time of year again! @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos Shall we do a 4.9.3 release?

    I went through and marked all of the merged PRs and related issues with 4.9.3 and try to tag them according to their target. Would you guys like to go through the PRs to see if there's something that should be merged quickly?

    opened by parrt 68
  • [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    @ericvergnaud Hi, I modified csproj options a bit, now I can get a working nuget package locally without the issue we described in #2021. I added .net 3.5 as a target to "main" csproj along with netstandard, since it's easier to keep track of requirements for both sets of api's when editing code and, ideally, both targets can be packed into a nuget package with a single command. Right now it's possible only on Windows via msbuild /t:pack or Visual Studio; unfortunately, due to https://github.com/Microsoft/msbuild/issues/1333, right now dotnet build pack does not work for .net 3.5 target the way it should, so I adjusted the existing script to create packages from .nuspec and different solutions for different targets.

    comp:build target:csharp 
    opened by listerenko 68
  • A few updates to the Unicode documentation.

    A few updates to the Unicode documentation.

    It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.

    comp:doc 
    opened by mike-lischke 61
  • Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    This is a serious bug in the CSharp runtime, ParserATNSimulator.cs.

    reachesIntoOuterContext is an integer that tracks the depth of how far we dip into the outer context. This field must be interpreted with the bit map mask SUPPRESS_PRECEDENCE_FILTER, and careful attention placed on whether to access the field raw or with the bit map mask.

    In Java, this code accesses the field raw.

    In CSharp, the field is accessed through a method that applies the bit map mask. This is a serious problem in that it causes ATN parser trace divergence.

    I did a cursory check on the other targets, and I am not confident the field access is done correctly across targets.

    The fix is the change ParserATNInterpreter.cs:

    diff --git a/runtime/CSharp/src/Atn/ParserATNSimulator.cs b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    index 4a7a6a3d5..8463bfcc3 100644
    --- a/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    +++ b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    @@ -1579,7 +1579,7 @@ namespace Antlr4.Runtime.Atn
     						// This assignment also propagates the
     						// isPrecedenceFilterSuppressed() value to the new
     						// configuration.
    -						c.reachesIntoOuterContext = config.OuterContextDepth;
    +                                                c.reachesIntoOuterContext = config.reachesIntoOuterContext;
     						ClosureCheckingStopState(c, configSet, closureBusy, collectPredicates,
     												 fullCtx, depth - 1, treatEofAsEpsilon);
     					}
    

    (Note, the source code should really not be using tabs. Tab expansion is editor specific.)

    atn-analysis type:bug target:csharp 
    opened by kaby76 3
  • Python runtime test failure with 4.10 and later

    Python runtime test failure with 4.10 and later

    After upgrading the antlr4 python runtime past 4.9.1 (tested 4.10.1 and 4.11.1) in nixpkgs, we've been seeing its test suite fail with

    Traceback (most recent call last):
      File "/build/source/runtime/Python3/tests/ctest.py", line 10, in <module>
        from parser.cparser import CParser
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 631, in <module>
        class CParser ( Parser ):
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 635, in CParser
        atn = ATNDeserializer().deserialize(serializedATN())
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 28, in deserialize
        self.checkVersion()
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 50, in checkVersion
        raise Exception("Could not deserialize ATN with version " + str(version) + " (expected " + str(SERIALIZED_VERSION) + ").")
    Exception: Could not deserialize ATN with version  (expected 4).
    

    The version returned seems to be \x03, while the test suite expects it to be 4.

    While similar to #3997 and #3895 we are running the test suite, not our own code.

    We're running python ctest.py from the tests directory.

    opened by mweinelt 4
  • Suggestion for the Test Rig GUI tool

    Suggestion for the Test Rig GUI tool

    This is not a defect or reporting a problem. It is a suggestion that the GUI tool have a new button added between "OK" and "Export as PNG". That can be named "Reload".

    The idea being that I can edit/save a test source file and just click "reload" rather than doing what I do which is close the tool and go back to my command window and rerun the batch file that starts the GUI.

    Thoughts?

    opened by Korporal 0
  •  invalid syntax `self.from = None` in python3 generated parsers

    invalid syntax `self.from = None` in python3 generated parsers

    Please include the following information

    python3 antlr4 version is 4.9.3

    • smallest possible grammar and code that reproduces the behavior
    import sys
    import antlr4
    from PrestoSqlLexer import PrestoSqlLexer
    from PrestoSqlParser import PrestoSqlParser
    
    
    def main():
        input_stream = antlr4.CharStream('select 1')
        lexer = PrestoSqlLexer(input_stream)
        tokens = antlr4.CommonTokenStream(lexer)
        tokens.fill()
        print([token.text for token in tokens.tokens][:-1])
    
    
    if __name__ == '__main__':
        main()
    
    • description of the expected behavior and actual behavior Pointers to suspicious code regions are also very welcome.

    error:

    Traceback (most recent call last):
      File "/Users/clients/autocomplete/parseQuery.py", line 4, in <module>
        from PrestoSqlParser import PrestoSqlParser
      File "/Users/clients/autocomplete/PrestoSqlParser.py", line 1837
        self.from = None # QualifiedNameContext
             ^^^^
    SyntaxError: invalid syntax
    
    opened by chengchengpei 2
Releases(4.11.1)
Owner
Antlr Project
The Project organization for the ANTLR parser generator.
Antlr Project
Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.

Word2Wave is a simple method for text-controlled GAN audio generation. You can either follow the setup instructions below and use the source code and CLI provided in this repo or you can have a play

Ilaria Manco 91 Dec 23, 2022
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA Introduction ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using

Google Research 2.1k Dec 28, 2022
☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Accuracy of BBC Weather forecasts for Honolulu This repository records the forecasts made by BBC Weather for the city of Honolulu, USA. Essentially, t

Max Halford 12 Oct 15, 2022
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 02, 2023
Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph",

K-BERT Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework. R

Weijie Liu 834 Jan 09, 2023
NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

source code for NeurIPS21 paper robabilistic Margins for Instance Reweighting in Adversarial Training

9 Dec 20, 2022
AutoGluon: AutoML for Text, Image, and Tabular Data

AutoML for Text, Image, and Tabular Data AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in yo

Amazon Web Services - Labs 5.2k Dec 29, 2022
Client library to download and publish models and other files on the huggingface.co hub

huggingface_hub Client library to download and publish models and other files on the huggingface.co hub Do you have an open source ML library? We're l

Hugging Face 644 Jan 01, 2023
Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks arXiv link: upcoming To be published in Findings of NA

Allen 16 Nov 12, 2022
TFPNER: Exploration on the Named Entity Recognition of Token Fused with Part-of-Speech

TFPNER TFPNER: Exploration on the Named Entity Recognition of Token Fused with Part-of-Speech Named entity recognition (NER), which aims at identifyin

1 Feb 07, 2022
Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

Maksim Zhdanov 7 Sep 20, 2022
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
A curated list of efficient attention modules

awesome-fast-attention A curated list of efficient attention modules

Sepehr Sameni 891 Dec 22, 2022
Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Applied Natural Language Processing in the Enterprise This is the companion repo for Applied Natural Language Processing in the Enterprise, an O'Reill

Applied Natural Language Processing in the Enterprise 95 Jan 05, 2023
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Kundan Krishna 6 Jun 04, 2021
Predicting the usefulness of reviews given the review text and metadata surrounding the reviews.

Predicting Yelp Review Quality Table of Contents Introduction Motivation Goal and Central Questions The Data Data Storage and ETL EDA Data Pipeline Da

Jeff Johannsen 3 Nov 27, 2022
Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Low-resource-Machine-Translation This repository contains the code for the project relative to the course Deep Natural Language Processing. The goal o

Andrea Cavallo 3 Jun 22, 2022
Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022