User Tools

Site Tools


cc18:assignment_3

Syntax Analysis

Introduction

This assignment is the second phase of the compiler that you construct in this course. You will develop a parser for Extended MiniJava in this assignment. A parser typically works on top of a lexer, so you will extend your code from the first phase. Part of your tasks for this assignment is to fix any problems that you had in your lexer code (if any) according to the feedback that you have received. This assignment is rather longer comparing to the previous one. We can only recommend that you start early and try to make sure you understand every step. If you have any questions about this assignment please do not hesitate to ask the instructor.

Abstract Syntax Tree

Your first task is to define a class hierarchy to represent nodes in the Abstract Syntax Tree (AST) for eMiniJava programs. If you choose to program in Java, you can refer to the AST structure of the while language in the first assignment. The AST of eMiniJava is an extension of the while language.

LL(1) Grammar

As we saw in the Lecture 10, there are several ways to parse languages corresponding to context-free grammars. For example, any context-free grammar (after CNF transformation) can be parsed using the CYK algorithm. However, this algorithm is too slow: its complexity is in O(n$^3$) where n the size of the program. This will not scale for larger programs. On the other hand, a more restricted grammar in form of LL(1) can be parsed in linear time. So it is desirable to develop an LL(1) version of the grammar for eMiniJava.

As we discussed in the Lecture 11, transforming a grammar to LL(1) is impossible in some cases unless we accept some restrictions to our language. This is the case for eMiniJava. The problem is with nested if-statements that makes the language ambiguous. This is why we always assume an else keyword always applies to the closest if:

if(expr1) if(expr2) a=1; else a=2;

is equivalent to the code:

if(expr1) { if(expr2) { a=1; } else { a=2; } }

and is not equivalent to the following code:

if(expr1) { if(expr2) { a=1; } } else { a=2; }

This is not too restrictive because we can always use braces.

Parser

Write a recursive-descent Parser for eMiniJava by manually writing mutually recursive functions, as sketched in the Lecture 12 and Lecture 13 on recursive descent parsing and generating the appropriate Abstract Syntax Trees. Some notes related to parsing input programs:

  • Recall from our discussions in class that LL parsers may require rewriting the production rules of the grammar (factoring and left-recursion removal) to remove LL violations and ambiguities.
  • It's not allowed in eMiniJava to have a variable with a name that matches a keyword. You will never have a variable or a class called length, for instance. (Although that would be allowed in Java)
  • You need to properly encode the operator precedence as in Java. From highest priority to lowest: !, then * and /, then + and -, then < and ==, then &&, then ||. Except for the unary operators, all operators are left-associative. There is no precedence between operators of the same level. For instance: $4 / 3 * 6$ reads as $((4 / 3) * 6)$ and $6 * 4 / 3$ as $((6 * 4) / 3$.

Although we highly recommend using the official language of the course (Java) to implement this phase, however, if you are more productive with another language you are allowed to pick that language.

Interface

Since we are not imposing any code structure for your programs, it is extremely important for your compiler to exactly follow a fixed interface as described here. This allows us to uniformly run and test all the projects from different groups of students. Command-line is the primary interface for your users to interact with your compiler. A general form for the command-line interface is as follows:

emjc [options] <source file>

For this phase of assignment, there are three possible options. If the user does not provide a correct option, the compiler treats it the same as the –help option.

  1. ––help: Prints a synopsis of options
  2. ––lex: Generates output from lexical analysis as described in the Assignment 2.
  3. ––ast: Generates output from syntactic analysis.

After executing the compiler with the option ––ast for the source file filename.emj an output file named filename.ast is generated to provide the result of parsing the source file. The format of the ast file is in Lisp-like notation (also known as S-expression) format. This format represents tree nodes with parenthesized items. Each node has the form

(operator operand$_1$ … operand$_n$)

The operands are either tree nodes themselves or they are atoms: quoted strings, integer literals or symbols. For example, for the factorial class

class Fac {
    public int ComputeFac(int num){
	int num_aux ;
	if (num < 1)
	    num_aux = 1 ;
	else 
	    num_aux = num * (this.ComputeFac(num-1)) ;
	return num_aux ;
    }
}

the output of the parser will be the following S-expression:

  (CLASS (ID Fac) 
  (MTD-DECL INT (ID ComputeFac) (TY-ID-LIST (INT (ID num)))
  (BLOCK
  (VAR-DECL INT (ID num_aux))
  (IF (< (ID num) (INTLIT 1))
    (EQSIGN (ID num_aux) (INTLIT 1))
    (EQSIGN (ID num_aux) (* (ID num) (DOT THIS (FUN-CALL (ID ComputeFac) (- (ID num) (INTLIT 1))))))
  )
  (RETURN (ID num_aux)))))

The operators in the S-expressions are the same as token classes with no arguments, with an extra set of operators that define the structure of the AST: MTD-DECL, CLASS-DECL, FUN-CALL, TY-ID-LIST, ID-LIST, VAR-DECL, BLOCK. You can add more operators according to your structure of your AST definition. However, please do not change the operator names that are already defined in this assignment (for example, you are not allowed to use BLK instead of BLOCK. This will make grading your assignment more difficult).

Error Message

If the source file is a syntactically invalid eMiniJava program, the compiler does not generate an ast file for it. It instead prints out the following line to the output. The compiler should not crash or have any other reaction to errors.

<line>:<column> error:<description>

where <line> and <column> indicate the beginning position of the error, and <description> details the error.

No Parser Generator

There are parser generator tools that can automatically generate parsers for LL grammars (e.g. ANTLR). In this assignment you should not use any of those tools. Although it seems that using those tools can probably reduce your work, however, implementing a parser manually has the benefit of having more control on your code. If a parser generator fails, it is usually nothing much you can do. You can almost always make a recursive-descent parser work.

Benchmarks Corpus

Please ensure to test your code extensively before submitting your code. You can compare the outputs of your submission with other groups, but sharing code is forbidden among groups. You may also want to use the reference compiler (the same that was introduced in the Assignment 2 ) to check your syntax analysis phase. The benchmarks are available in the following folder which is accessible from the CS Department Linux systems (e.g., queeg.cs.rit.edu and ICLs 1,2, and 3).

/usr/local/pub/hh/cc/benchmarks

Deadline and Deliverables

The deadline of the this assignment is February 28th at 11:59pm (https://mycourses.rit.edu). Please upload all your files as a single zip file. You should include a readme.txt file in your package to describe how your source code can be compiled and built. Ideally your project should include a user-friendly compiling technique with Makefile, Ant or similar tools.

When grading, we are serious about simple programming errors. You are writing a compiler to check the programs of other people, so your compiler should not have simple errors itself! Your compiler should never crash for an incorrect input. We expect your compiler to give a comprehensive error when the user does not provide a valid input to it.

cc18/assignment_3.txt · Last modified: 2018/02/14 13:21 by hossein