Introduction to Compilers

Course Focus

What is a Compiler?

Building Language Translators

We use tombstone diagrams to represent programs and compilers, and to express manipulations on them. The basic tombstone items are shown below[2].

(a) shows a program P expressed in language L. (b) is a representation of machine M. (c) is a translator that converts language S to language T, written in language L. (d) is an interpreter for language S, written in language L.

The next figure shows a compilation.

The sort program written in Java, compiled by a Java-to-x86 compiler written in x86 running on an x86, produces a sort program for the x86, which can then be executed.

Next is a cross-compiler: a compiler that runs on one machine but generates code for a different machine.
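The tombstone rules above can be sketched in code. The following is a minimal model, not part of the original notes: the names Program, Translator, and translate are invented for illustration, and "ARM" is an assumed target chosen only to show a cross-compiler. It checks the two conditions a tombstone diagram expresses visually: the translator must be expressed in the language of the machine it runs on, and the program must be expressed in the translator's source language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str      # what the program does, e.g. "sort"
    lang: str      # language the program is expressed in


@dataclass(frozen=True)
class Translator:
    source: str    # language S accepted as input
    target: str    # language T produced as output
    lang: str      # language L the translator itself is expressed in


def translate(program: Program, translator: Translator, machine: str) -> Program:
    """Run `translator` on `machine` to translate `program`."""
    # The translator must be expressed in the machine's own language to run.
    if translator.lang != machine:
        raise ValueError(f"a {translator.lang} program cannot run directly on {machine}")
    # The program must be expressed in the translator's source language.
    if program.lang != translator.source:
        raise ValueError(f"translator expects {translator.source}, got {program.lang}")
    # Translation changes the language of the program, not what it does.
    return Program(program.name, translator.target)


# Ordinary compilation: Java-to-x86 compiler, expressed in x86, run on an x86.
sort_java = Program("sort", "Java")
native = translate(sort_java, Translator("Java", "x86", "x86"), machine="x86")

# Cross-compilation: the compiler runs on an x86 but emits code for another
# machine ("ARM" is an illustrative assumption, not taken from the notes).
cross = translate(sort_java, Translator("Java", "ARM", "x86"), machine="x86")

print(native)
print(cross)
```

The cross-compiled result is a sort program in ARM code, so it cannot be executed on the x86 that produced it; it must be moved to the target machine first.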

Interpreters

An interpreter accepts a program written in some language and executes it immediately. Interpreters are useful when:

  1. working in an interactive mode
  2. runtime speed isn't important
  3. language consists of simple instruction formats

The drawback is that an interpreted program typically runs much more slowly than a compiled one.

Examples include the UNIX shell, LISP, and SQL.

Here is the tombstone diagram for an interpreter.

Real and Abstract Machines

We can also build interpreters for low-level languages. Why? To test a hardware design before incurring the cost of actually building it. Such an interpreter is called an emulator. Thus, a machine can be viewed as an interpreter implemented in hardware. Conversely, an emulator can be viewed as hardware implemented in software.

Interpretive Compilers

An interpretive compiler is a combination of compiler and interpreter: it translates the source into an intermediate language, then interprets the intermediate language. The best-known example is Java, which is translated to JVM bytecode that is then interpreted.
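The translate-then-interpret scheme can be illustrated with a small sketch. Everything here is an assumption made for illustration: the intermediate language is a made-up stack code (PUSH, ADD, SUB, MUL), not JVM bytecode, and Python's standard ast module stands in for the front end so that the example stays short. The function names compile_expr and interpret are hypothetical.

```python
import ast
import operator

# Hypothetical stack-machine opcodes for the intermediate language.
BINARY_OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}
APPLY = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def compile_expr(source: str) -> list:
    """Translate an arithmetic expression into stack-machine instructions."""
    code = []

    def emit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            code.append(("PUSH", node.value))
        elif isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            # Post-order walk: operands first, then the operator.
            emit(node.left)
            emit(node.right)
            code.append((BINARY_OPS[type(node.op)], None))
        else:
            raise SyntaxError(f"unsupported construct: {ast.dump(node)}")

    emit(ast.parse(source, mode="eval").body)
    return code


def interpret(code: list):
    """Execute the intermediate code on a simple operand stack."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(APPLY[op](left, right))
    return stack.pop()


bytecode = compile_expr("2 + 3 * 4")
print(bytecode)             # PUSH 2, PUSH 3, PUSH 4, MUL, ADD
print(interpret(bytecode))  # 14
```

The intermediate code is simple enough to interpret with a single loop, which is the usual reason for choosing a simple instruction format in the intermediate language.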

Bootstrapping

It is possible for a compiler to be written in the very language that it compiles! This is called bootstrapping and turns out to be extremely useful.

Consider bootstrapping an Ada compiler for machine M. First, we write a subset-of-Ada (AdaS) to M compiler in C, then compile it, producing v1.0. Then we write an Ada to M compiler in AdaS (v2.0). Finally, we compile v2.0 using compiler v1.0, producing compiler v3.0 (a full-blown Ada compiler whose source is written in a subset of Ada).

Phases of a Compiler

  1. Lexical Analysis: Break the source file into words, or tokens. E.g., x = y + z * 60
  2. Parsing: Determine if the sequence of words is legal, according to some grammar. Produce an abstract syntax tree (AST) or a parse tree.
  3. Contextual/Semantic Analysis: Determine what each phrase means. Use scope rules to associate each occurrence of an identifier with a specific binding occurrence. Use type rules to infer the type of each identifier and detect type errors.
  4. Code Generation: Convert into machine language. Constants are bound to a value; replace occurrences of constants with actual values. Variables are bound to a storage cell; replace occurrences with references to the memory cell.
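As an illustration of the first phase, here is a minimal lexical analyzer for the example statement in step 1. It is a sketch only: the token categories (NUMBER, IDENT, OP) and the function name tokenize are assumptions, and a real lexer would recognize many more token kinds.

```python
import re

# Token categories, tried in order; names are illustrative assumptions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[=+\-*/]"),
    ("SKIP", r"\s+"),   # whitespace separates tokens but is not itself a token
    ("ERROR", r"."),    # any other character is a lexical error
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> list:
    """Break the source text into (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {text!r} at position {match.start()}")
        tokens.append((kind, text))
    return tokens


print(tokenize("x = y + z * 60"))
# [('IDENT', 'x'), ('OP', '='), ('IDENT', 'y'), ('OP', '+'),
#  ('IDENT', 'z'), ('OP', '*'), ('NUMBER', '60')]
```

The token list is the interface to the parser: the parser never sees individual characters or whitespace.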

We will be concerned with the interface between the phases. Sometimes the interface will be a data structure (e.g., tree), other times it might be a set of functions.

Design Issues

Passes: the number of complete traversals through the source program or its internal representation. Two basic choices: single-pass or multi-pass.

Single-pass Compiler: Syntax analyzer controls the other phases. Contextual analysis and code generation are done on-the-fly while syntax analysis is occurring. Useful only for simple languages.

Multi-pass Compiler: One compiler "driver" and three modules: syntax, contextual, and code generation. Driver calls the syntax analyzer, which reads the source program and produces an AST. Driver calls the contextual analyzer, which reads the AST and produces a decorated AST. Driver calls the code generator, which reads the decorated AST and produces machine code. The compiler makes at least three passes.

Other Compiler Design Issues

Speed
Single-pass is faster. E.g., consider the extra cost if the AST is stored on disk between passes.
Space
Multi-pass is smaller. Multi-pass compiler must allocate space to store the AST, but only one module is active at any one time. One-pass compilers must keep all data structures active all the time.
Modularity
Multi-pass compilers are typically more modular.
Flexibility
Multi-pass wins here too. Later phases don't rely on the workings of the syntax analyzer. Each phase can be customized individually.
Code optimization
Multi-pass wins. Code optimization requires analysis of large portions of the source code before code generation.
Language properties
Multi-pass is required for some language properties. E.g., if you don't have to declare a variable before it's used, you need a multi-pass compiler.
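The declaration-order point can be made concrete with a toy checker. This is a sketch, not a real contextual analyzer: the program representation (a list of "declare" and "use" statements) and both function names are invented for illustration. A single traversal only knows about declarations it has already seen, while a second traversal can rely on the complete set collected by the first.

```python
# Hypothetical toy program: each statement either declares or uses a name.
program = [("use", "total"), ("declare", "total"), ("use", "count")]


def check_single_pass(statements):
    """One traversal: a name is known only if declared earlier in the text."""
    declared, errors = set(), []
    for kind, name in statements:
        if kind == "declare":
            declared.add(name)
        elif name not in declared:
            errors.append(name)
    return errors


def check_two_pass(statements):
    """Pass 1 collects every declaration; pass 2 checks every use."""
    declared = {name for kind, name in statements if kind == "declare"}
    return [name for kind, name in statements if kind == "use" and name not in declared]


print(check_single_pass(program))  # ['total', 'count']
print(check_two_pass(program))     # ['count']
```

The single-pass check wrongly rejects the use of total, which is declared later; only count is a genuine error.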

Language Specification

We must specify the language's syntax, contextual constraints, and semantics.

There are two methods of specification. First, formal: an unambiguous, precise, complete description, but one only decipherable by those who know the notation (see Formal Semantics for more information). Second, informal: a specification in English, which is difficult to do well.

We'll use both methods: syntax uses a formal notation; contextual constraints and semantics use informal notation.

References

[1] The Compiler Course in Today's Curriculum: Three Strategies. William Waite. Proceedings of SIGCSE 2006.

[2] Modern Compiler Implementation in Java. 2nd ed. A. Appel. Cambridge University Press. 2005.