### Site Tools

cc18:assignment_3

This is an old revision of the document!

A PCRE internal error occured. This might be caused by a faulty plugin

====== Syntax Analysis ====== ===== Introduction ===== This assignment is the second phase of the compiler that you construct in this course. You will develop a parser for [[cc17:eminijava|Extended MiniJava]] in this assignment. A parser typically works on top of a lexer, so you will extend your code from the first phase. Part of your tasks for this assignment is to fix any problems that you had in your lexer code (if any) according to the feedback that you have received. This assignment is rather longer comparing to the previous one. We can only recommend that you start early and try to make sure you understand every step. If you have any questions about this assignment please do not hesitate to ask the instructor. ==== Abstract Syntax Tree ==== Your first task is to define a class hierarchy to represent nodes in the Abstract Syntax Tree (AST) for [[cc17:eminijava|eMiniJava]] programs. If you choose to program in Java, you can refer to the AST structure of the while language in the [[cc18:assignment_1|first assignment]]. The AST of [[cc17:eminijava|eMiniJava]] is an extension of the while language. ==== LL(1) Grammar ==== As we saw in the [[cc18:lecture_10|Lecture 10]], there are several ways to parse languages corresponding to context-free grammars. For example, any context-free grammar (after CNF transformation) can be parsed using the CYK algorithm. However, this algorithm is too slow: its complexity is in O(n$^3$) where n the size of the program. This will not scale for larger programs. On the other hand, a more restricted grammar in form of LL(1) can be parsed in linear time. So it is desirable to develop an LL(1) version of the grammar for [[cc17:eminijava|eMiniJava]]. As we discussed in the [[cc18:lecture_11|Lecture 11]], transforming a grammar to LL(1) is impossible in some cases unless we accept some restrictions to our language. This is the case for [[cc17:eminijava|eMiniJava]]. The problem is with nested if-statements that makes the language ambiguous. This is why we always assume an ''else'' keyword always applies to the closest ''if'': <code java> if(expr1) if(expr2) a=1; else a=2; </code> is equivalent to the code: <code java> if(expr1) { if(expr2) { a=1; } else { a=2; } } </code> and is **not** equivalent to the following code: <code java> if(expr1) { if(expr2) { a=1; } } else { a=2; } </code> This is not too restrictive because we can always use braces. ==== Parser ==== Write a recursive-descent Parser for [[cc17:eminijava|eMiniJava]] by manually writing mutually recursive functions, as sketched in the [[cc18:lecture_12|Lecture 12]] and [[cc18:lecture_13|Lecture 13]] on recursive descent parsing and generating the appropriate Abstract Syntax Trees. Some notes related to parsing input programs: * Recall from our discussions in class that LL parsers may require rewriting the production rules of the grammar (factoring and left-recursion removal) to remove LL violations and ambiguities. * It's not allowed in [[cc17:eminijava|eMiniJava]] to have a variable with a name that matches a keyword. You will never have a variable or a class called ''length'', for instance. (Although that would be allowed in Java) * You need to properly encode the operator precedence as in Java. From highest priority to lowest: **!**, then **%%*%%** and **/**, then **+** and **-**, then **<** and **==**, then **&&**, then **||**. Except for the unary operators, all operators are left-associative. There is no precedence between operators of the same level. For instance: $4 / 3 * 6$ reads as $((4 / 3) * 6)$ and $6 * 4 / 3$ as $((6 * 4) / 3$. Although we highly recommend using the official language of the course (Java) to implement this phase, however, if you are more productive with another language you are allowed to pick that language. ==== Interface ==== Since we are not imposing any code structure for your programs, it is extremely important for your compiler to exactly follow a fixed interface as described here. This allows us to uniformly run and test all the projects from different groups of students. Command-line is the primary interface for your users to interact with your compiler. A general form for the command-line interface is as follows: ''emjc [options] <source file>'' For this phase of assignment, there are three possible options. If the user does not provide a correct option, the compiler treats it the same as the --help option. - ''––help'': Prints a synopsis of options - ''––lex'': Generates output from lexical analysis as described in [[cc17:assignment_2|Assignment 2]]. - ''––ast'': Generates output from syntactic analysis. After executing the compiler with the option ''––ast'' for the source file ''filename.emj'' an output file named ''filename.ast'' is generated to provide the result of parsing the source file. The format of the ''ast'' file is in Lisp-like notation (also known as [[https://en.wikipedia.org/wiki/S-expression|S-expression]]) format. This format represents tree nodes with parenthesized items. Each node has the form (operator operand$_1$ ... operand$_n$) The operands are either tree nodes themselves or they are atoms: quoted strings, integer literals or symbols. For example, for the factorial class <code java> class Fac { public int ComputeFac(int num){ int num_aux ; if (num < 1) num_aux = 1 ; else num_aux = num * (this.ComputeFac(num-1)) ; return num_aux ; } } </code> the output of the parser will be the following S-expression: (CLASS (ID Fac) (MTD-DECL INT (ID ComputeFac) (TY-ID-LIST (INT (ID num))) (BLOCK (VAR-DECL INT (ID num_aux)) (IF (< (ID num) (INTLIT 1)) (EQSIGN (ID num_aux) (INTLIT 1)) (EQSIGN (ID num_aux) (* (ID num) (DOT THIS (FUN-CALL (ID ComputeFac) (- (ID num) (INTLIT 1)))))) ) (RETURN (ID num_aux))))) The operators in the S-expressions are the same as token classes with no arguments, with an extra set of operators that define the structure of the AST: ''MTD-DECL'', ''CLASS-DECL'', ''FUN-CALL'', ''TY-ID-LIST'', ''ID-LIST'', ''VAR-DECL'', ''BLOCK''. You can add more operators according to your structure of your AST definition. However, please do **not** change the operator names that are already defined in this assignment (for example, you are not allowed to use ''BLK'' instead of ''BLOCK''. This will make grading your assignment more difficult). ==== Error Message ==== If the source file is a syntactically invalid [[cc17:eminijava|eMiniJava]] program, the compiler does not generate an ''ast'' file for it. It instead prints out the following line to the output. The compiler should not crash or have any other reaction to errors. <line>:<column> error:<description> where <line> and <column> indicate the beginning position of the error, and <description> details the error. ==== No Parser Generator ==== There are parser generator tools that can automatically generate parsers for LL grammars (e.g. [[http://www.antlr.org/|ANTLR]]). In this assignment you should **not** use any of those tools. Although it seems that using those tools can probably reduce your work, however, implementing a parser manually has the benefit of having more control on your code. If a parser generator fails, it is usually nothing much you can do. You can almost always make a recursive-descent parser work. ==== Benchmarks Corpus ==== Please ensure to test your code extensively before submitting your code. You can compare the outputs of your submission with other groups, but sharing code is forbidden among groups. You may also want to use the reference compiler (the same that was introduced in the [[cc18:assignment_2|Assignment 2]] ) to check your syntax analysis phase. The benchmarks are available in the following folder which is accessible from the CS Department Linux systems (e.g., queeg.cs.rit.edu and ICLs 1,2, and 3). /usr/local/pub/hh/cc/benchmarks ==== Deadline and Deliverables ==== The deadline of the this assignment is February 28th at 11:59pm (https://mycourses.rit.edu). Please upload all your files as a single zip file. You should include a readme.txt file in your package to describe how your source code can be compiled and built. Ideally your project should include a user-friendly compiling technique with Makefile, Ant or similar tools. When grading, we are serious about simple programming errors. You are writing a compiler to check the programs of other people, so your compiler should not have simple errors itself! Your compiler should never crash for an incorrect input. We expect your compiler to give a comprehensive error when the user does not provide a valid input to it.