JJTree Tutorial for Advanced Java Parsing

The Problem

JJTree is part of JavaCC, a parser/scanner generator for Java. JJTree is a preprocessor for JavaCC that inserts parse tree building actions at various places in the JavaCC source. To follow along you need to understand the core concepts of parsing. Also review the basic JJTree documentation and samples provided in the JavaCC distribution (version 4.0).

JJTree is magically powerful, but it is just as complex. We used it quite successfully at my startup www.moola.com. After some basic research into grammar rules, lookaheads and node annotations, plus some prototyping, I felt quite comfortable with the tool. However, just recently, when I had to use JJTree again, I hit the same steep learning curve as if I had never seen JJTree before.

How do you write a tutorial that gets you back in shape quickly without forcing a full relearning?

The Solution

Here I capture my notes in a form that should spare me that same learning curve in the future. You can think of my approach as a layered improvement to a grammar that follows these steps:

  • get lexer
  • complete grammar
  • optimize produced AST
  • define custom node
  • define actions
  • write evaluator

I always start simple and then need to go more complex, and this is exactly how I will document it. In each example I start with a trivial portion of the grammar and then add some more to it to force specific behavior. New code is always in green. Let's hope this saves all of us the relearning.

Reorder tokens from more specific to less specific

Tokens in the TOKEN section can be declared in any order, but you have to pay very close attention to that order. JavaCC takes the longest match, and when several tokens match the same text the one declared first wins. For example, notice how "interface" and "exception" are defined before TERM. If we had defined "interface" after TERM, "interface" would never get matched; TERM would.

TOKEN : {
	  <INTERFACE: "interface" >
	| < EXCEPTION: "exception" >
	| < ENUM: "enum" >
	| < STRUCT: "struct" >

	| < STRING_LITERAL: "'" (~["'","\n","\r"])* "'" >
	| < TERM: <LETTER> (<LETTER>|<DIGIT>)* >

	| < NUMBER: <INTEGER> | <FLOAT> >
	// "#" marks helper expressions that are never returned as tokens themselves
	| < #INTEGER: ["0"-"9"] (["0"-"9"])* >
	| < #FLOAT: (["0"-"9"])+ "." (["0"-"9"])* >
	| < #DIGIT: ["0"-"9"] >
	| < #LETTER: ["_","a"-"z","A"-"Z"] >
}

Ordering is also the reason why we can't just use "interface" inline in the definition of productions. Assuming the productions come after the TOKEN section, the inline literal is declared after TERM, so TERM will always match first.
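
The matching rule can be sketched in plain Java. This is only a toy model, not the code JavaCC generates, and it assumes the rule as I understand it from the JavaCC FAQ: the longest match wins, and among equally long matches the token declared first wins. The class TokenOrderDemo and its methods are made-up names for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of token selection: longest match wins, ties go to the
// token declared first. Not JavaCC's real implementation.
final class TokenOrderDemo {

    // Returns the kind of the token matched at the start of input, or null.
    static String kindOf(Map<String, String> tokensInOrder, String input) {
        String bestKind = null;
        int bestLength = 0;
        for (Map.Entry<String, String> e : tokensInOrder.entrySet()) {
            Matcher m = Pattern.compile(e.getValue()).matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The comparison is strict, so an equally long token declared
            // later never replaces an earlier one.
            if (m.lookingAt() && m.end() > bestLength) {
                bestKind = e.getKey();
                bestLength = m.end();
            }
        }
        return bestKind;
    }

    // Keyword declared before TERM, as in the TOKEN section above.
    static Map<String, String> keywordFirst() {
        Map<String, String> t = new LinkedHashMap<>();
        t.put("INTERFACE", "interface");
        t.put("TERM", "[_a-zA-Z][_a-zA-Z0-9]*");
        return t;
    }

    // Same tokens, but TERM declared first.
    static Map<String, String> termFirst() {
        Map<String, String> t = new LinkedHashMap<>();
        t.put("TERM", "[_a-zA-Z][_a-zA-Z0-9]*");
        t.put("INTERFACE", "interface");
        return t;
    }

    public static void main(String[] args) {
        System.out.println(kindOf(keywordFirst(), "interface"));  // INTERFACE
        System.out.println(kindOf(termFirst(), "interface"));     // TERM
        System.out.println(kindOf(keywordFirst(), "interfaces")); // TERM
    }
}
```

The last call shows the other half of the rule: "interfaces" is longer than the keyword, so it is a TERM no matter how the tokens are ordered.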

Remove some nodes from final AST

Some nodes do not have any special meaning and should be excluded from the final AST.  This is done by using #void like this:

void InterfaceDecl() #void : {
}{
	ExceptionClause()
	|
	EnumClause()
	|
	StructClause()
	|
	MethodDecl()
}

Add action to a production

You will definitely need to add actions to your productions for the parser to be useful. Here I capture the text of the current token (t.image) and put it into the jjtThis node, which resolves to my custom node class ASTTypeDecl. You bind a variable "t" to a token using "="; the action itself goes in curly braces right after the token and can refer to the current token as "t" and the current AST node as "jjtThis".

void TypeDecl() : {
	Token t;
}
{
	<VOID>
	|
	t=<TERM> { jjtThis.name = t.image; } ("[]")?
}

Here I further set isArray property to true only if "[]" is found after the <TERM>:

void TypeDecl() : {
	Token t;
}
{
	<VOID>
	|
	t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?
}
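
The actions above assume that the node class has "name" and "isArray" fields. Here is a minimal sketch of what that class could look like. It is a stand-alone version so that it compiles on its own; I am assuming that in a real JJTree project (with MULTI=true) the class would extend the generated SimpleNode instead. The toString() method is my own addition for the demo.

```java
// Stand-alone sketch of the custom node used by the TypeDecl actions.
// Assumption: in a real JJTree project this class extends the generated
// SimpleNode; the superclass is left out so the sketch compiles alone.
final class ASTTypeDecl {
    // Filled in by the grammar action: jjtThis.name = t.image;
    public String name;
    // Set to true only when "[]" follows the <TERM>.
    public boolean isArray;

    @Override
    public String toString() {
        // The <VOID> branch never assigns a name, so name stays null.
        String base = (name == null) ? "void" : name;
        return isArray ? base + "[]" : base;
    }

    public static void main(String[] args) {
        ASTTypeDecl node = new ASTTypeDecl();
        node.name = "Account";    // what the action does for <TERM>
        node.isArray = true;      // what the action does for "[]"
        System.out.println(node); // Account[]
    }
}
```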

Multiple actions inside one production rule

As we saw earlier, you can access the values of multiple tokens in one production rule. Notice how I declare two separate tokens, "t" and "n", here:

void ConstDecl() : {
	Token t;
	Token n;
}
{
	LOOKAHEAD(2)

	t=<TERM> { jjtThis.name = t.image; } "=" n=<NUMBER> { jjtThis.value = Integer.valueOf(n.image); }
	|
	<TERM>
}
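
One caveat with the action above: as declared in the TOKEN section, <NUMBER> also matches floats such as 1.5, and Integer.valueOf throws a NumberFormatException on those. A possible way around it is a small helper that the action calls instead. NumberImages.parse is a hypothetical name, and this is only a sketch of one option.

```java
// Hypothetical helper for actions that convert the image of a <NUMBER> token.
final class NumberImages {

    // Returns an Integer for images like "42" and a Double for images
    // that contain a decimal point, like "1.5" or "1.".
    static Number parse(String image) {
        if (image.indexOf('.') >= 0) {
            return Double.valueOf(image);
        }
        return Integer.valueOf(image);
    }

    public static void main(String[] args) {
        System.out.println(parse("42"));  // 42
        System.out.println(parse("1.5")); // 1.5
    }
}
```

With such a helper the action would read jjtThis.value = NumberImages.parse(n.image); and the value field would have to be declared as a Number.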

Lookaheads

There are certain points in complex grammars that cannot be parsed unambiguously with just one token of lookahead. If you are writing a high-performance parser you might need to rewrite the grammar. But if you do not care about performance you can force a lookahead of more than one token.

JavaCC will warn you about these choice conflicts when it processes the grammar. Go to the rule it refers to and set a lookahead of 2 or more, like this:

void EnumDeclItem() : {}
{
	LOOKAHEAD(2)

	<TERM> "=" <NUMBER>
	|
	<TERM>
}
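
To see why one token is not enough here, it helps to write the decision by hand. The sketch below is a toy model in plain Java, not generated parser code; token kinds are plain strings, and LookaheadDemo and chooseAlternative are made-up names.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the choice in EnumDeclItem. Token kinds are plain strings.
final class LookaheadDemo {

    // Decides which alternative to take by peeking at up to two tokens,
    // which is what LOOKAHEAD(2) asks the generated parser to do.
    static String chooseAlternative(List<String> kinds, int pos) {
        boolean firstIsTerm = pos < kinds.size() && kinds.get(pos).equals("TERM");
        if (!firstIsTerm) {
            return "none";
        }
        // Both alternatives start with TERM, so the first token cannot
        // decide. Only the second token tells them apart.
        boolean secondIsEquals = pos + 1 < kinds.size() && kinds.get(pos + 1).equals("=");
        return secondIsEquals ? "TERM = NUMBER" : "TERM";
    }

    public static void main(String[] args) {
        System.out.println(chooseAlternative(Arrays.asList("TERM", "=", "NUMBER"), 0)); // TERM = NUMBER
        System.out.println(chooseAlternative(Arrays.asList("TERM", ","), 0));           // TERM
    }
}
```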

Node return values

Productions can return nodes, just as functions return values. Here I declare that TypeDecl returns an ASTTypeDecl.

ASTTypeDecl TypeDecl() : {
	Token t;
}
{
	<VOID>
	|
	t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?

	{ return jjtThis; }
}

Once you have several expansions in one production it is better to group them so that the return statement applies to all of them. The example above actually contains a bug: the return statement is attached to only one branch of the "|" choice, not to both. We can easily fix this by using parentheses to force the order of precedence:

ASTTypeDecl TypeDecl() : {
	Token t;
}
{
	(
		<VOID>
		|
		t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?
	)

	{ return jjtThis; }
}

Build abstract syntax tree as you go

Once all productions return their nodes you can build the AST on the fly while parsing. Just provide overloaded add() methods in the ASTInterfaceDecl class and call them like this (note that the production is no longer #void, because jjtThis needs a node to refer to):

void InterfaceDecl() : {
	ASTExceptionClause ex;
	ASTEnumClause en;
	ASTStructClause st;
	ASTMethodDecl me;
}
{
	ex=ExceptionClause() { jjtThis.add(ex); }
	|
	en=EnumClause() { jjtThis.add(en); }
	|
	st=StructClause() { jjtThis.add(st); }
	|
	me=MethodDecl() { jjtThis.add(me); }
}
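
The overloaded add() methods could look roughly like this. It is a stand-alone sketch: the four clause classes are empty stand-ins for the generated node classes, the list fields are my own choice, and in a real project ASTInterfaceDecl would presumably extend the generated SimpleNode.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the node that collects its children through overloaded add().
// Assumption: the real class extends the generated SimpleNode.
final class ASTInterfaceDecl {
    final List<ASTExceptionClause> exceptions = new ArrayList<>();
    final List<ASTEnumClause> enums = new ArrayList<>();
    final List<ASTStructClause> structs = new ArrayList<>();
    final List<ASTMethodDecl> methods = new ArrayList<>();

    // The compiler picks the overload from the static type of the argument,
    // so every grammar action can simply call jjtThis.add(...).
    void add(ASTExceptionClause ex) { exceptions.add(ex); }
    void add(ASTEnumClause en) { enums.add(en); }
    void add(ASTStructClause st) { structs.add(st); }
    void add(ASTMethodDecl me) { methods.add(me); }

    public static void main(String[] args) {
        ASTInterfaceDecl decl = new ASTInterfaceDecl();
        decl.add(new ASTEnumClause());
        decl.add(new ASTMethodDecl());
        decl.add(new ASTMethodDecl());
        System.out.println(decl.enums.size() + " enum, " + decl.methods.size() + " methods");
    }
}

// Empty stand-ins for the generated node classes.
class ASTExceptionClause {}
class ASTEnumClause {}
class ASTStructClause {}
class ASTMethodDecl {}
```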

Use <EOF>

Quite often you get your grammar written and start celebrating, only to notice that part of the file is not being parsed... This happens because you did not tell the parser to read everything up to the end of the file, so it feels free to stop parsing early. Force parsing to reach the end of the file by demanding the <EOF> token in the topmost production:

void InterfaceDecl() #void : {
}{
	ExceptionClause()
	|
	EnumClause()
	|
	StructClause()
	|
	MethodDecl()
	|
	<EOF>
}
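
The effect is easy to reproduce with a toy hand-written parser in plain Java. This is not what JavaCC generates; it is a small model in which a "declaration" is just the token pair TERM ";", and EofDemo and its methods are made-up names.

```java
import java.util.Arrays;
import java.util.List;

// Toy recursive-descent model: a declaration is the pair TERM ";".
final class EofDemo {

    // Consumes as many declarations as it can and returns how many tokens
    // it used. Anything after the last declaration is silently ignored.
    static int parseDeclarations(List<String> kinds) {
        int pos = 0;
        while (pos + 1 < kinds.size()
                && kinds.get(pos).equals("TERM")
                && kinds.get(pos + 1).equals(";")) {
            pos += 2;
        }
        return pos;
    }

    // Same loop, but the input must end right after the last declaration,
    // which is the effect of requiring <EOF> in the topmost production.
    static int parseAll(List<String> kinds) {
        int pos = parseDeclarations(kinds);
        if (pos != kinds.size()) {
            throw new IllegalStateException(
                    "unexpected token " + kinds.get(pos) + " at index " + pos);
        }
        return pos;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("TERM", ";", "NUMBER", "TERM", ";");
        // Stops after the first declaration without complaining.
        System.out.println(parseDeclarations(input)); // 2
        try {
            parseAll(input);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // unexpected token NUMBER at index 2
        }
    }
}
```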

The Final Word

JJTree works incredibly well. There is no excuse for regex parsing anymore... Don't even try to convince me!

Drop me a line if you need help with JJTree. I will be glad to share my experience with you.

References

  1. The JavaCC FAQ by Theodore S. Norvell
Time: 2024-10-30 15:07:37
