Lexical Analysis


This assignment is the first phase of the compiler that you will construct in this course. Extended MiniJava (eMiniJava) is the language for which you will construct a compiler. Make sure you understand the language and its constructs before you start implementing the project. If you have any questions about the language, please do not hesitate to ask the instructor.

Task 1

Your first task is to write two small eMiniJava programs. We will combine the input from all groups of students into a corpus of benchmarks, which will be shared among the groups. With these benchmarks you can test the different phases of your compiler throughout the course. Try to be creative with your benchmarks: do not write two versions of the same simple program that performs a trivial task such as swapping two numbers. While it is important for your compiler to process all the benchmarks, it is equally important not to limit your testing to the provided examples. We will gradually add more benchmarks to the corpus. Your (unofficial) goal in this task is to come up with benchmarks that challenge the projects of the other groups!

Benchmarks Corpus

All the benchmarks will be shared in the following folder. This folder is accessible from the CS Department Linux systems (e.g., and ICLs 1,2, and 3). The folder already contains a number of benchmarks written by previous students of the compiler construction course; you can look at those programs to get an idea of how Extended MiniJava (eMiniJava) programs are written.


Please submit your benchmark files (eMiniJava programs) with the extension .emj. Use meaningful names for your test files, not test.emj or mybenchmark.emj. For example, if you implement bubble sort, use bubblesort.emj, bubble-sort.emj, or something similar. Please put a comment on the first line of each of your test files with the names of the people in your group. This will help us contact you if there is a problem in any of your programs.

//author: Jack Sparrow & Alice Wonderland

Task 2

Write the lexer for eMiniJava. As we discussed in Lecture 3, the main approach to describing the tokens of a language is to use regular expressions. After writing the regular expression description of the tokens, you can construct the lexer manually (Lecture 4) or automatically (Lecture 5). In this phase you may choose either of the following techniques.

  • Manual: convert your regular expressions to programs directly. Use the $\text{FIRST}$ sets of the regular expressions to decide between alternatives in a token description.
  • Automatic: give your regular expressions to JFlex and let the tool automatically generate a lexer for you.
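As a rough illustration of the manual approach, the sketch below dispatches on the first character of the remaining input, which plays the role of the FIRST set for each token class. The class name `FirstSetLexerSketch`, the method `tokenize`, and the identifier rule (a letter followed by letters, digits, or underscores) are assumptions made for this example. Only the token names that appear in the sample output later on this page are used, and position tracking is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, and only a handful of tokens.
final class FirstSetLexerSketch {

    // Returns the token types found in src, in order, followed by EOF().
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++; // whitespace separates tokens but produces none
            } else if (Character.isLetter(c)) {
                // FIRST(identifier) and FIRST(keyword) are both "a letter":
                // read the whole word, then decide which one it is.
                int start = i;
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) {
                    i++;
                }
                String word = src.substring(start, i);
                out.add(word.equals("this") ? "THIS()" : "ID(" + word + ")");
            } else if (Character.isDigit(c)) {
                // FIRST(integer literal) is "a digit".
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) {
                    i++;
                }
                out.add("INTLIT(" + src.substring(start, i) + ")");
            } else {
                // Single-character tokens: the character itself selects the token.
                switch (c) {
                    case '=': out.add("EQSIGN()"); break;
                    case '.': out.add("DOT()"); break;
                    case '(': out.add("LPAREN()"); break;
                    case ')': out.add("RPAREN()"); break;
                    case '-': out.add("MINUS()"); break;
                    default:
                        throw new IllegalArgumentException(
                                "unexpected character '" + c + "' at offset " + i);
                }
                i++;
            }
        }
        out.add("EOF()");
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("x = this.func(x - 1)")) {
            System.out.println(token);
        }
    }
}
```

A real lexer would cover every eMiniJava token, use one character of lookahead where two tokens share a first character, and report an error rather than throw on unknown input.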

We highly recommend using the official language of the course (Java) to implement this phase. However, if you are more productive with another language, you are allowed to use it.


Since we are not imposing any code structure on your programs, it is extremely important that your compiler follows exactly the fixed interface described here. This allows us to run and test the projects of all groups uniformly. The command line is the primary interface through which users interact with your compiler. As your compiler matures, its command-line interface will support more options. The general form of the command-line interface is as follows:

emjc [options] <source file>

For this phase of the assignment, the only possible option is --lex. For example,

emjc --lex filename.emj

After executing the command above, an output file named filename.lexed is generated that contains the result of lexing the source file. Each line in the output file corresponds to one token in the source file, in the following format: <line>:<column> <token-type> where <line> and <column> indicate the starting position of the token, and <token-type> is one of the token types of eMiniJava. For example, if the input file contains only the following line:

x = this.func(x - 1);

The content of the generated lexed file is the following:

1:1 ID(x)
1:3 EQSIGN()
1:5 THIS()
1:9 DOT()
1:10 ID(func)
1:14 LPAREN()
1:15 ID(x)
1:17 MINUS()
1:19 INTLIT(1)
1:20 RPAREN()
1:22 EOF()
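The output format can be produced by a small token class. The following is only a sketch: the class name `TokenSketch`, its fields, and the `format` method are hypothetical, and it assumes that tokens without a payload print empty parentheses, as in the listing above.

```java
// Illustrative sketch only: the class and member names are hypothetical.
final class TokenSketch {
    final int line;     // 1-based line of the token's first character
    final int column;   // 1-based column of the token's first character
    final String type;  // token type such as ID, INTLIT, or EQSIGN
    final String text;  // payload for ID and INTLIT; empty for other tokens

    TokenSketch(int line, int column, String type, String text) {
        this.line = line;
        this.column = column;
        this.type = type;
        this.text = text;
    }

    // Renders the token as "<line>:<column> <token-type>".
    String format() {
        return line + ":" + column + " " + type + "(" + text + ")";
    }

    public static void main(String[] args) {
        System.out.println(new TokenSketch(1, 1, "ID", "x").format());
        System.out.println(new TokenSketch(1, 3, "EQSIGN", "").format());
    }
}
```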

When reporting the positions of tokens, consider tab (\t) as 4 spaces.
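One possible way to track positions under this rule is sketched below. It assumes that a tab simply advances the column by 4 (rather than jumping to the next tab stop) and that lines and columns start at 1, as in the example output. The names `PositionSketch`, `advance`, and `positionOf` are hypothetical.

```java
// Illustrative sketch only: names are hypothetical, and a tab is assumed
// to advance the column by exactly 4.
final class PositionSketch {
    static final int TAB_WIDTH = 4;

    int line = 1;
    int column = 1;

    // Updates the position after consuming the character c.
    void advance(char c) {
        if (c == '\n') {
            line++;
            column = 1;
        } else if (c == '\t') {
            column += TAB_WIDTH;
        } else {
            column++;
        }
    }

    // Returns {line, column} of the character at the given offset in src.
    static int[] positionOf(String src, int offset) {
        PositionSketch p = new PositionSketch();
        for (int i = 0; i < offset; i++) {
            p.advance(src.charAt(i));
        }
        return new int[] { p.line, p.column };
    }

    public static void main(String[] args) {
        int[] pos = positionOf("\tx = 1;", 1);
        System.out.println(pos[0] + ":" + pos[1]);
    }
}
```

With these assumptions, an identifier that follows a single leading tab is reported at column 5.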

Reference Compiler

There is a reference compiler (compiler.jar) in the following folder, written by the grader of the course (Bhavin Navin Shah).


For example, you can run this compiler as follows to check the validity of your benchmarks for the first task. If you find behavior in the reference compiler that you think does not match the description of Extended MiniJava (eMiniJava), please feel free to send Bhavin an email.

java -jar compiler.jar --type test.emj 

Deadline and Deliverables

The deadline of this project is February 13th at 11:59pm. Please upload all your files as a single zip file. Include a readme.txt file in your package that reports the following items (omitting the readme can significantly reduce your grade):

  • The approach that you have chosen for your implementation.
  • A description of the way that your source code can be compiled and built. Ideally your project should include a user-friendly compiling technique with Makefile, Ant or similar tools.
  • A simple description of source files to help us grade your project better.

When grading, we take simple programming errors seriously. You are writing a compiler to check other people's programs, so your compiler should not contain simple errors itself! Your compiler should never crash on incorrect input. We expect it to report a clear, informative error when the user does not provide valid input.
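A minimal sketch of defensive command-line handling for this phase is shown below. It checks the arguments of `emjc --lex <source file>` before doing any work and derives the `.lexed` output name from the source name. The class name `EmjcCliSketch`, the helper names, the wording of the messages, and the exit code 1 are assumptions; the lexing step itself is omitted.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

// Illustrative sketch only: names, messages, and exit code are hypothetical.
final class EmjcCliSketch {

    // Maps "filename.emj" to "filename.lexed"; other names just get ".lexed" appended.
    static String lexedNameFor(String source) {
        String stem = source.endsWith(".emj")
                ? source.substring(0, source.length() - ".emj".length())
                : source;
        return stem + ".lexed";
    }

    // Returns an error message for invalid arguments, or null if they are usable.
    static String validate(String[] args) {
        if (args.length != 2) {
            return "usage: emjc --lex <source file>";
        }
        if (!args[0].equals("--lex")) {
            return "unknown option: " + args[0];
        }
        try {
            if (!Files.isReadable(Paths.get(args[1]))) {
                return "cannot read source file: " + args[1];
            }
        } catch (InvalidPathException e) {
            return "invalid file name: " + args[1];
        }
        return null;
    }

    public static void main(String[] args) {
        String error = validate(args);
        if (error != null) {
            // Report the problem instead of crashing with a stack trace.
            System.err.println("emjc: " + error);
            System.exit(1);
        }
        // The lexer would run here and write its result to this file.
        System.out.println("output file: " + lexedNameFor(args[1]));
    }
}
```

Validating the arguments up front keeps every failure path in one place, so malformed invocations produce a short message rather than an uncaught exception.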

cc18/assignment_2.txt · Last modified: 2018/02/01 19:10 by hossein