While building the Analytics Workbench within Infoveave, our objective was to make it easy for statisticians and data scientists to work with the R editor. Anyone who has worked hands-on with R knows that the default user experience is limited: there is no autocomplete, no hints or suggestions, and no language-aware editor. Providing this richer experience while executing R code required us to understand what the user was typing in real time and to respond with intelligent feedback in the form of hints, help, and autocomplete text. In short, we needed a language parser that would do this for us.
Creating a rich language parser for R
To understand R, it was not enough to parse a single function; we had to parse complex code blocks. Anyone who has worked on this kind of problem knows how difficult it is to build a parser. We wanted something proven and time-tested that runs on the web and is fairly fast. This is where ANTLR came in.
What is “ANTLR”?
For the uninitiated, ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
ANTLR made it fairly simple to build a parser and a JSON representation of the AST (abstract syntax tree), which the Infoveave R editor consumes to provide hints and handle syntax errors.
Where to begin
To master any language, you have to learn its rules. In ANTLR, those rules are defined in two layers: lexer rules and parser rules.
Lexer: A lexer takes a series of characters and converts it into tokens. This is the same process your brain applies while reading this article: you read a series of characters and form words from them.
Parser: A parser is a set of rules defined over these tokens, applied in an order that makes it possible to understand the language.
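To make the lexer step concrete, here is a minimal hand-written sketch in Python. It is not ANTLR output and not the Infoveave code; the token names (NUMBER, IDENT, ASSIGN, OP) and the tiny R-like token set are assumptions chosen only for illustration.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical token names, chosen for this sketch
    text: str


# One named group per token kind. WS is matched so that it can be skipped.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+(?:\.\d+)?)"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z0-9_.]*)"
    r"|(?P<ASSIGN><-)"
    r"|(?P<OP>[-+*/()])"
    r"|(?P<WS>\s+)"
)


def tokenize(source: str) -> Iterator[Token]:
    """Turn a string of characters into a stream of tokens."""
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected character {source[pos]!r} at offset {pos}"
            )
        pos = match.end()
        kind = match.lastgroup
        if kind is not None and kind != "WS":  # whitespace never reaches the parser
            yield Token(kind, match.group())


if __name__ == "__main__":
    for token in tokenize("x <- 3 + 42"):
        print(token.kind, token.text)
```

The parser never sees individual characters, only this token stream, which is what keeps the parser rules short.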
Lexer and parser rules are written into a file called a grammar file. Let's take a simple grammar for parsing arithmetic expressions as an example.
Such a grammar is easiest to read bottom-up, and it is fairly simple. Reading from the bottom, the first rule tells the lexer to ignore whitespace. The next one groups a series of digits into a "NUMBER" token, and it is followed by four rules for the simple arithmetic operations. ANTLR requires parser rule names to start with a lowercase letter and lexer rule names to start with an uppercase letter.
Refer to https://github.com/antlr/antlr4/blob/master/doc/grammars.md for more information.
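For readers who want to see what a parser built from such rules does, the following is a hand-written Python sketch of a recursive-descent parser for the same kind of arithmetic. It is an illustration under stated assumptions, not the code ANTLR generates: the rule names `expression`, `term` and `number` are hypothetical, and the tree is returned as nested tuples purely for readability.

```python
import re


def tokenize(text):
    # Lexer step: runs of digits become numbers, operators stand alone.
    # Whitespace matches nothing and is dropped. A real lexer would also
    # report characters it does not recognise; this sketch skips them.
    return re.findall(r"\d+|[-+*/]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self):
        # expression: term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        # term: number (('*' | '/') number)*
        node = self.number()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.number())
        return node

    def number(self):
        token = self.advance()
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        return int(token)


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


if __name__ == "__main__":
    print(parse("1 + 2 * 3"))
```

Splitting the rules into `expression` and `term` is one common way to give multiplication and division higher precedence than addition and subtraction; an ANTLR grammar may express the same thing differently.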
Let's build for the Web
With the parser and lexer code generated, we are one step closer to our objective. We can now use the generated code to build our custom logic.
Listeners and Visitors
The generated parser can be used with either the Listener or the Visitor pattern. The primary difference between the two is the tree-walking mechanism:
Listener: The enter and exit functions of the listener are called by the ANTLR-provided walker as it walks the parse tree. We have no control over the traversal.
Visitor: With a visitor we control the tree traversal ourselves, with the caveat that we have to remember to visit the children of each expression.
We chose the visitor pattern because it is a better fit for our purpose: interpreting the language and converting it to JSON.
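The difference between the two patterns can be sketched in plain Python without the ANTLR runtime. Everything here is an assumption made for illustration: the `Number` and `BinaryOp` node classes, the `walk` function standing in for the ANTLR-provided walker, and the JSON shape produced by the visitor are hypothetical and not the Infoveave format.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: "Node"
    right: "Node"


Node = Union[Number, BinaryOp]


class TraceListener:
    """Listener: only reacts to events; the walker decides the order."""

    def __init__(self):
        self.events = []

    def enter(self, node):
        self.events.append(f"enter {type(node).__name__}")

    def exit(self, node):
        self.events.append(f"exit {type(node).__name__}")


def walk(listener, node):
    """Stand-in for a tree walker: always depth-first, left to right."""
    listener.enter(node)
    if isinstance(node, BinaryOp):
        walk(listener, node.left)
        walk(listener, node.right)
    listener.exit(node)


class JsonVisitor:
    """Visitor: each visit method has to descend into the children itself."""

    def visit(self, node):
        method = getattr(self, "visit_" + type(node).__name__)
        return method(node)

    def visit_Number(self, node):
        return {"type": "number", "value": node.value}

    def visit_BinaryOp(self, node):
        # Leaving out these two visit calls would silently drop the subtrees.
        return {
            "type": "binary",
            "operator": node.op,
            "left": self.visit(node.left),
            "right": self.visit(node.right),
        }


if __name__ == "__main__":
    tree = BinaryOp("+", Number(1), BinaryOp("*", Number(2), Number(3)))
    listener = TraceListener()
    walk(listener, tree)
    print(listener.events)
    print(json.dumps(JsonVisitor().visit(tree), indent=2))
```

Because each visit method returns a value, the visitor can build the JSON structure bottom-up as it goes, whereas a listener would need to keep its own stack to assemble the same result.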
The Final Output
We started with a quick and dirty implementation of a single-line expression parser.
With it, we could easily transform user-entered code into JSON that we could work with.
ANTLR is very powerful, and one can do a lot more with it than just parse expressions. It can be used to build a grammar for an entirely new language. We were able to build an expression parser quickly, in a week.
The initial version of the single-line parser is also available on GitHub. You can see how a full implementation was done, and as an added bonus there is also an implementation of an errorListener!