ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Overview

ANTLR v4

Java 7+ License

Build status

Github CI Build Status (MacOSX) AppVeyor CI Build Status (Windows) Circle CI Build Status (Linux)

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees and also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

Donate

Authors and major contributors

Useful information

You might also find the following pages useful, particularly if you want to mess around with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it’s a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language—ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference at amazon or an electronic version at the publisher's site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions where the root directory name is the all-lowercase name of the language parsed by the grammar. For example, java, cpp, csharp, c, etc...

Comments
  • New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF

    Fixes #276 .

    This used to be a WIP PR, but it's now ready for review.

    This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.

    The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.

    I bumped the UUID, since this changes the serialized ATN format.

    I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.

    type:feature unicode 
    opened by bhamiltoncx 115
  • Terrible Golang performance

    Terrible Golang performance

    Stackoverflow: https://stackoverflow.com/questions/72266899/golang-performance-issues

    Google group: https://groups.google.com/g/antlr-discussion/c/OdhAIsy2GfI

    Example code: https://github.com/movelazar/perf-repro

    A simple rule such as:

    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2 OR
    1 EQ 2
    

    takes exponentially longer to parse the more 1 EQ 2 OR clauses there are. This does not happen in python (by my testing) or CSharp, Dart, Java (by stackoverflow comment).

    On my machine, # of lines vs parse time:

    11: 0.5s
    12: 1.2s
    13: 3.2s
    14: 8.1s
    15: 21.9s
    16: 57.5s
    

    Given that Python doesn't face this issue I can't imagine I'm doing something terrible in my grammar.

    Issue goes away if I put parens on things but that's not a real solution.

    On 4.10.1, first noticed with 4.9.1.

    Any help is greatly appreciated. Surprised I can't find others with this issue.

    type:bug target:go comp:performance 
    opened by movelazar 101
  • splitting version numbers for targets

    splitting version numbers for targets

    Hiya: @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos

    Eric has raised the point that it would be nice to be able to make quick patches to the various runtimes; e.g., there is a stopping bug now in the JavaScript target. He proposes something along these lines:

    • any change in the tool or the runtime algorithm bumps the middle version #: 4.9 -> 4.10 -> 4.11
    • any bug fix in a runtime we bump the last digit of that runtime only: 4.9 -> 4.9.1 -> 4.9.2
    • if bumping the java runtime for bug fix we also bump the tool since it contains the runtime

    This is in optimal as people have criticized me in the past for bumping, say, 4.6 to 4.7 for some minor changes. It also has the problem that 4.9.x will not mean the same thing in two different targets possibly, as each target will now have their own version number.

    Rather than break up all of the targets into separate repositories or similar, can you guys think of a better solution? Any suggestions? The goal here is to allow more rapid target releases, and independent of me having to do a major release of the tool.

    type:question 
    opened by parrt 94
  • Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    Improve memory usage and perf of CodePointCharStream: Use 8-bit, 16-bit, or 32-bit buffer

    This greatly improves the memory usage and performance of CodePointCharStream by ensuring the internal storage uses either an 8-bit buffer (for Unicode code points <= U+00FF), 16-bit buffer (for Unicode code points <= U+FFFF), or a 32-bit buffer (Unicode code points > U+FFFF).

    I split out the internal storage into a class CodePointBuffer which has a CodePointBuffer.Builder class which has the logic to upgrade from 8-bit to 16-bit to 32-bit storage.

    I found the perf hotspot in CodePointCharStream on master was the virtual method calls from CharStream.LA(offset) into IntBuffer.

    Refactoring it into CodePointBuffer didn't help (in fact, it added another virtual method call).

    To fix the perf, I made CodePointCharStream an abstract class and made three concrete subclasses: CodePoint8BitCharStream, CodePoint16BitCharStream, and CodePoint32BitCharStream which directly access the array of underlying code points in the CodePointBuffer without virtual method calls.

    lexers target:java comp:performance 
    opened by bhamiltoncx 85
  • initial discussion to start integration of new targets

    initial discussion to start integration of new targets

    As promised, I am now ready to integrate the new ANTLR target languages you folks have been working on. This issue is meant to get everybody in sync, check status, and discuss the proper order of integration and resolve issues etc.

    There are two administrative details to get out of the way first:

    1. Please let me know if there is another github user that should be added to one of the categories. Or, of course, if you would like your user ID removed from this discussion.
    2. Nothing can be merged into antlr/antlr4 unless every single committer has added themselves to the contributors.txt file. It's onerous, particularly for simple commits, but it is requirement for anything merged into the master. Eclipse foundation lawyers tell me that we have one of the cleanest licenses out there and it contributes to ANTLR's widespread use because companies are not afraid to use the software. See the genesis of such heinous requirements in SCO v IBM. This means lead target authors have to go back through their committers list quickly and ask them to sign the contributors file with a new commit. Or, they can remove that commit and enter their own version of the functionality, being careful not to violate copyright on the previous.

    As we proceed, please keep in mind that I have a difficult role, balancing the needs of multiple targets and keeping discussions in the civil and practical zone. Decisions I make come from the perspective of over 25 years managing and leading this project. I look forward to incorporating your hard work into the main antlr repo.

    C++ current location

    • @mike-lischke
    • @DanMcLaughlin
    • @nburles
    • @davesisson

    Go current location, previous discussion

    • @pboyer

    Swift current location: unclear, previous discussion

    • @jeffreyguenther
    • @hanjoes
    • @janyou
    • @ewanmellor

    Likely interested/supporting humans (scraped from github issues):

    • @RYDB3RG
    • @wjkohnen
    • @willfaught
    • @parrt
    • @sharwell
    • @ericvergnaud
    type:improvement target:swift target:cpp target:go 
    opened by parrt 84
  • Add a new CharStream that converts the symbols to upper or lower case.

    Add a new CharStream that converts the symbols to upper or lower case.

    This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at https://github.com/antlr/antlr4test-maven-plugin/issues/1

    It would be used like so:

    input, _ := antlr.NewFileStream("filename")
    
    in = antlr.NewCaseChangingStream(is, true) // true forces upper case symbols, false forces lower case.
    
    lexer := parser.NewAbnfLexer(in)
    

    While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

    I would love for the grammar to have a option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

    lexers comp:runtime target:java target:javascript target:go 
    opened by bramp 69
  • Swift Target

    Swift Target

    I did a quick search and I didn't see anything written about this yet. What's the likelihood of a Swift target for ANTLR?

    There are C#, Javascript, and Python targets at the moment.

    What does it take to implement a target? Given that Swift is more Java-like, it seems like it should be possible. Maybe start with a code translator if there is one for (Java to Swift), and iterate towards a more idiomatic implementation.

    type:question 
    opened by jeffreyguenther 69
  • Clean up ATN serialization: rm UUID and shifting by value of 2

    Clean up ATN serialization: rm UUID and shifting by value of 2

    • I think we don't need the UUID in the serialization, since it has not changed in a decade. We can bump the version number and remove the UU ID
    • I did some tests and there seems to be no reason to shift the values in the serialized ATN by 2 for the purposes of improving the UTF-8 encoding for the Java target.

    If you guys agree, we can make this small change for cleanup purposes. I'm happy to do it if you guys don't want to. The second fix will require changes to each target but it's trivial to fix.

    atn-analysis type:cleanup 
    opened by parrt 68
  • Preparing for 4.9.3 release

    Preparing for 4.9.3 release

    It's that time of year again! @pboyer, @mike-lischke, @janyou, @ewanmellor, @hanjoes, @ericvergnaud, @lingyv-li, @marcospassos Shall we do a 4.9.3 release?

    I went through and marked all of the merged PRs and related issues with 4.9.3 and try to tag them according to their target. Would you guys like to go through the PRs to see if there's something that should be merged quickly?

    opened by parrt 68
  • [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    [CSharp] #2021 fixes nuget packaging options to avoid missing dll exceptions

    @ericvergnaud Hi, I modified csproj options a bit, now I can get a working nuget package locally without the issue we described in #2021. I added .net 3.5 as a target to "main" csproj along with netstandard, since it's easier to keep track of requirements for both sets of api's when editing code and, ideally, both targets can be packed into a nuget package with a single command. Right now it's possible only on Windows via msbuild /t:pack or Visual Studio; unfortunately, due to https://github.com/Microsoft/msbuild/issues/1333, right now dotnet build pack does not work for .net 3.5 target the way it should, so I adjusted the existing script to create packages from .nuspec and different solutions for different targets.

    comp:build target:csharp 
    opened by listerenko 68
  • A few updates to the Unicode documentation.

    A few updates to the Unicode documentation.

    It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.

    comp:doc 
    opened by mike-lischke 61
  • Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    Assorted problems in calling the wrong wrapper for reachesIntoOuterContext

    This is a serious bug in the CSharp runtime, ParserATNSimulator.cs.

    reachesIntoOuterContext is an integer that tracks the depth of how far we dip into the outer context. This field must be interpreted with the bit map mask SUPPRESS_PRECEDENCE_FILTER, and careful attention placed on whether to access the field raw or with the bit map mask.

    In Java, this code accesses the field raw.

    In CSharp, the field is accessed through a method that applies the bit map mask. This is a serious problem in that it causes ATN parser trace divergence.

    I did a cursory check on the other targets, and I am not confident the field access is done correctly across targets.

    The fix is the change ParserATNInterpreter.cs:

    diff --git a/runtime/CSharp/src/Atn/ParserATNSimulator.cs b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    index 4a7a6a3d5..8463bfcc3 100644
    --- a/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    +++ b/runtime/CSharp/src/Atn/ParserATNSimulator.cs
    @@ -1579,7 +1579,7 @@ namespace Antlr4.Runtime.Atn
     						// This assignment also propagates the
     						// isPrecedenceFilterSuppressed() value to the new
     						// configuration.
    -						c.reachesIntoOuterContext = config.OuterContextDepth;
    +                                                c.reachesIntoOuterContext = config.reachesIntoOuterContext;
     						ClosureCheckingStopState(c, configSet, closureBusy, collectPredicates,
     												 fullCtx, depth - 1, treatEofAsEpsilon);
     					}
    

    (Note, the source code should really not be using tabs. Tab expansion is editor specific.)

    atn-analysis type:bug target:csharp 
    opened by kaby76 3
  • Python runtime test failure with 4.10 and later

    Python runtime test failure with 4.10 and later

    After upgrading the antlr4 python runtime past 4.9.1 (tested 4.10.1 and 4.11.1) in nixpkgs, we've been seeing its test suite fail with

    Traceback (most recent call last):
      File "/build/source/runtime/Python3/tests/ctest.py", line 10, in <module>
        from parser.cparser import CParser
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 631, in <module>
        class CParser ( Parser ):
      File "/build/source/runtime/Python3/tests/parser/cparser.py", line 635, in CParser
        atn = ATNDeserializer().deserialize(serializedATN())
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 28, in deserialize
        self.checkVersion()
      File "/nix/store/ch8i929c63av55h9nxkinifh61mazf1h-python3.10-antlr4-python3-runtime-4.11.1/lib/python3.10/site-packages/antlr4/atn/ATNDeserializer.py", line 50, in checkVersion
        raise Exception("Could not deserialize ATN with version " + str(version) + " (expected " + str(SERIALIZED_VERSION) + ").")
    Exception: Could not deserialize ATN with version  (expected 4).
    

    The version returned seems to be \x03, while the test suite expects it to be 4.

    While similar to #3997 and #3895 we are running the test suite, not our own code.

    We're running python ctest.py from the tests directory.

    opened by mweinelt 4
  • Suggestion for the Test Rig GUI tool

    Suggestion for the Test Rig GUI tool

    This is not a defect or reporting a problem. It is a suggestion that the GUI tool have a new button added between "OK" and "Export as PNG". That can be named "Reload".

    The idea being that I can edit/save a test source file and just click "reload" rather than doing what I do which is close the tool and go back to my command window and rerun the batch file that starts the GUI.

    Thoughts?

    opened by Korporal 0
  •  invalid syntax `self.from = None` in python3 generated parsers

    invalid syntax `self.from = None` in python3 generated parsers

    Please include the following information

    python3 antlr4 version is 4.9.3

    • smallest possible grammar and code that reproduces the behavior
    import sys
    import antlr4
    from PrestoSqlLexer import PrestoSqlLexer
    from PrestoSqlParser import PrestoSqlParser
    
    
    def main():
        input_stream = antlr4.CharStream('select 1')
        lexer = PrestoSqlLexer(input_stream)
        tokens = antlr4.CommonTokenStream(lexer)
        tokens.fill()
        print([token.text for token in tokens.tokens][:-1])
    
    
    if __name__ == '__main__':
        main()
    
    • description of the expected behavior and actual behavior Pointers to suspicious code regions are also very welcome.

    error:

    Traceback (most recent call last):
      File "/Users/clients/autocomplete/parseQuery.py", line 4, in <module>
        from PrestoSqlParser import PrestoSqlParser
      File "/Users/clients/autocomplete/PrestoSqlParser.py", line 1837
        self.from = None # QualifiedNameContext
             ^^^^
    SyntaxError: invalid syntax
    
    opened by chengchengpei 2
Releases(4.11.1)
Owner
Antlr Project
The Project organization for the ANTLR parser generator.
Antlr Project
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 6.4k Jan 09, 2023
Pretty-doc - Composable text objects with python

pretty-doc from __future__ import annotations from dataclasses import dataclass

Taine Zhao 2 Jan 17, 2022
Weird Sort-and-Compress Thing

Weird Sort-and-Compress Thing A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by

Douglas 1 Jan 03, 2022
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
A python framework to transform natural language questions to queries in a database query language.

__ _ _ _ ___ _ __ _ _ / _` | | | |/ _ \ '_ \| | | | | (_| | |_| | __/ |_) | |_| | \__, |\__,_|\___| .__/ \__, | |_| |_| |___/

Machinalis 1.2k Dec 18, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
PyTorch original implementation of Cross-lingual Language Model Pretraining.

XLM NEW: Added XLM-R model. PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes: Monolingual language model pretrain

Facebook Research 2.7k Dec 27, 2022
Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

japanese-gpt2 This repository provides the code for training Japanese GPT-2 models. This code has been used for producing japanese-gpt2-medium release

rinna Co.,Ltd. 491 Jan 07, 2023
A list of NLP(Natural Language Processing) tutorials

NLP Tutorial A list of NLP(Natural Language Processing) tutorials built on PyTorch. Table of Contents A step-by-step tutorial on how to implement and

Allen Lee 1.3k Dec 25, 2022
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
A tool helps build a talk preview image by combining the given background image and talk event description

talk-preview-img-builder A tool helps build a talk preview image by combining the given background image and talk event description Installation and U

PyCon Taiwan 4 Aug 20, 2022
Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

japanese-ebook-analysis This aim of this project is to make analysing the contents of a japanese ebook easy and streamline the process for non-technic

Christoffer Aakre 14 Jul 23, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
API for the GPT-J language model 🦜. Including a FastAPI backend and a streamlit frontend

gpt-j-api 🦜 An API to interact with the GPT-J language model. You can use and test the model in two different ways: Streamlit web app at http://api.v

Víctor Gallego 276 Dec 31, 2022
An open-source NLP library: fast text cleaning and preprocessing.

An open-source NLP library: fast text cleaning and preprocessing

Iaroslav 21 Mar 18, 2022
Interactive Jupyter Notebook Environment for using the GPT-3 Instruct API

gpt3-instruct-sandbox Interactive Jupyter Notebook Environment for using the GPT-3 Instruct API Description This project updates an existing GPT-3 san

312 Jan 03, 2023
pyMorfologik MorfologikpyMorfologik - Python binding for Morfologik.

Python binding for Morfologik Morfologik is Polish morphological analyzer. For more information see http://github.com/morfologik/morfologik-stemming/

Damian Mirecki 18 Dec 29, 2021
Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

Linear Multihead Attention (Linformer) PyTorch Implementation of reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer:

Kui Xu 58 Dec 23, 2022
Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

Twitter-News-Summarizer Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline 1.) Extracts all tweets fr

Rohit Govindan 1 Jan 27, 2022