Introduction to Compilers
Course Focus
- Three approaches to a compiler course[1]...
- Software project: Uses compilers to motivate a large software
project
- Language theory: Look at language and computation theory by
examining what compilers can and can't do
- Tool use: Focus is on how compilers work and building up a useful
toolset
- This course is about the last choice: Look at the tricks, techniques,
and tools that compilers use and show how they can be used in domains
other than compiling programming languages.
What is a Compiler?
- We'll use the term language processor.
- A language processor is any program that manipulates
sentences expressed by a particular
grammar.
- This might mean manipulating a program written in a particular
programming language.
- Analysis-Synthesis: analyze the source program to form an intermediate
representation (often a tree), synthesize the desired output format
from the intermediate representation.
- Examples include...
- Editors: allow programs to be entered, modified, and saved
- Translators: translate one high-level language (HLL) to another
HLL
- Compilers: translate one HLL to a low-level language (LLL)
- Interpreters: run a program immediately
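The analysis-synthesis structure described above can be sketched with a toy translator. This is only an illustration under stated assumptions: the prefix input notation, the tuple-based tree, and the function names `analyze`, `synthesize`, and `translate` are all invented for this example.

```python
def analyze(source):
    """Analysis: turn a prefix expression such as "(+ 1 (* 2 3))" into a tree."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # Returns (subtree, position of the next unread token).
        if tokens[pos] == "(":
            op = tokens[pos + 1]
            left, pos = read(pos + 2)
            right, pos = read(pos)
            if tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            return (op, left, right), pos + 1
        return tokens[pos], pos + 1

    tree, pos = read(0)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree


def synthesize(tree):
    """Synthesis: walk the tree and emit infix notation."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({synthesize(left)} {op} {synthesize(right)})"
    return tree


def translate(source):
    return synthesize(analyze(source))


print(translate("(+ 1 (* 2 3))"))  # prints "(1 + (2 * 3))"
```

The tree is the only thing the two halves share, so either half could be replaced (a different input notation, a different output format) without touching the other.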
Building Language Translators
- Translator: translates a source language into a target language (e.g.,
Java to C).
- Assembler translates assembly language into machine
code.
- Compiler translates HLL into LLL. Translators, assemblers, and compilers
are themselves programs written in some language.
Use tombstone diagrams to represent programs, compilers, and machines,
and to express manipulations on them. The basic tombstone items are shown
below[2].
(a) shows a program P expressed in
language L. (b) is a representation of machine
M. (c) is a translator that converts language
S to language T, written in
language L. (d) is an interpreter for language
S, written in language
L.
The next figure shows a compilation.
The sort program written in Java,
compiled by
a Java-to-x86 compiler expressed in x86 machine code and running on an x86,
produces a sort program for the x86, which can then be executed.
Next is a cross-compiler: a compiler that runs on one machine but generates
code for a different machine.
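The rules that tombstone diagrams encode (a translator must run on the machine at hand, and the program must be in the translator's source language) can be checked mechanically. The sketch below is a rough model only: the class names, the `translate` and `runs_on` helpers, and the choice of ARM as the cross-compilation target are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str
    language: str      # language the program is expressed in


@dataclass(frozen=True)
class Translator:
    source: str        # language accepted
    target: str        # language produced
    language: str      # language the translator itself is expressed in


def runs_on(machine, language):
    """A program runs directly only on a machine for its own language."""
    return machine == language


def translate(program, translator, machine):
    """Run `translator` on `machine` to translate `program`."""
    if not runs_on(machine, translator.language):
        raise ValueError("translator cannot run on this machine")
    if program.language != translator.source:
        raise ValueError("program is not in the translator's source language")
    return Program(program.name, translator.target)


sort_java = Program("sort", "Java")

# Ordinary compilation: Java-to-x86 compiler expressed in x86, run on an x86.
native = Translator("Java", "x86", "x86")
sort_x86 = translate(sort_java, native, machine="x86")
print(sort_x86)                              # Program(name='sort', language='x86')

# Cross-compilation: the compiler runs on x86 but targets ARM (assumed target).
cross = Translator("Java", "ARM", "x86")
sort_arm = translate(sort_java, cross, machine="x86")
print(runs_on("x86", sort_arm.language))     # False
print(runs_on("ARM", sort_arm.language))     # True
```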
Interpreters
An interpreter accepts a program written in some language and executes it
immediately. Useful when...
- working in an interactive mode
- runtime speed isn't important
- language consists of simple instruction formats
The drawback is that interpreted programs usually run much more slowly than
compiled ones.
Examples include the UNIX shell, LISP, SQL.
Here is the tombstone diagram for an interpreter.
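To make "executes it immediately" concrete, here is a minimal sketch of an interpreter for a language with a simple instruction format. The three instructions (`set`, `add`, `print`) are a made-up language for illustration, not taken from any real shell or system.

```python
def interpret(lines):
    """Execute each instruction as soon as it is read; no translation step."""
    env = {}       # variable name -> integer value
    output = []    # values produced by "print"
    for line in lines:
        parts = line.split()
        if not parts:
            continue                       # ignore blank lines
        op, args = parts[0], parts[1:]
        if op == "set":                    # set NAME VALUE
            env[args[0]] = int(args[1])
        elif op == "add":                  # add NAME VALUE
            env[args[0]] += int(args[1])
        elif op == "print":                # print NAME
            output.append(env[args[0]])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output


print(interpret(["set x 5", "add x 3", "print x"]))  # prints [8]
```

Each instruction is decoded again every time it is executed, which is one reason interpretation costs more at run time than running compiled code.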
Real and Abstract Machines
We can build interpreters for low-level languages. Why? To test a hardware
design before expending the cost of actually building it. This is called an
emulator. Thus, a machine can be viewed as a hardware interpreter.
Conversely, an emulator can be viewed as hardware implemented in software.
Interpretive Compilers
An interpretive compiler combines a compiler and an interpreter: it
translates the source into an intermediate language, then interprets the
intermediate language. Best known example: Java. The Java compiler translates
Java to JVM bytecode, and the JVM then interprets the bytecode.
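The compile-then-interpret scheme can be sketched in a few lines. The instruction set below (`PUSH`, `LOAD`, `ADD`, `MUL`) is a hypothetical stack machine invented for this example; it is not real JVM bytecode, and the expression tree is supplied by hand rather than parsed.

```python
def compile_expr(tree):
    """Compile an expression tree into instructions for a stack machine."""
    if isinstance(tree, tuple):
        op, left, right = tree
        # Operands first, then the operator (postfix order).
        return compile_expr(left) + compile_expr(right) + [(op,)]
    if isinstance(tree, int):
        return [("PUSH", tree)]
    return [("LOAD", tree)]          # anything else is a variable name


def run(code, env):
    """Interpret the intermediate code, one instruction at a time."""
    stack = []
    for instr in code:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])
        elif op == "LOAD":
            stack.append(env[instr[1]])
        elif op in ("ADD", "MUL"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack.pop()


# y + z * 60, written directly as a tree
tree = ("ADD", "y", ("MUL", "z", 60))
code = compile_expr(tree)
print(code)
# [('LOAD', 'y'), ('LOAD', 'z'), ('PUSH', 60), ('MUL',), ('ADD',)]
print(run(code, {"y": 1, "z": 2}))   # prints 121
```

The intermediate code is compiled once and can be run many times with different variable values.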
Bootstrapping
It is possible for a compiler to be written in the very language that it
compiles! This is called
bootstrapping and turns out to be extremely useful.
Example: bootstrapping an Ada compiler. First, we write a
subset-of-Ada (AdaS) to M compiler in C, then compile it, producing v1.0.
Then we write an Ada to M compiler in AdaS (v2.0). Finally, we compile v2.0
using compiler v1.0, producing compiler v3.0 (a full-blown Ada compiler
whose source is written in AdaS, itself a subset of Ada).
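The bootstrapping steps above can be traced with a small model in which compiling a compiler changes only the language it is expressed in. This is a sketch under assumptions: the `Compiler` class and `compile_compiler` helper are invented here, and a C-to-M compiler running on M is assumed to exist already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compiler:
    source: str        # language it accepts
    target: str        # language it produces
    written_in: str    # language the compiler itself is expressed in


def compile_compiler(subject, using, machine):
    """Compile `subject` with the compiler `using`, which runs on `machine`."""
    if using.written_in != machine:
        raise ValueError("the compiling compiler cannot run on this machine")
    if subject.written_in != using.source:
        raise ValueError("the compiling compiler does not accept this language")
    # Same source and target languages, now expressed in using.target.
    return Compiler(subject.source, subject.target, using.target)


c_compiler = Compiler("C", "M", "M")        # assumed to exist already

# Step 1: AdaS-to-M compiler written in C, compiled with the C compiler.
v1 = compile_compiler(Compiler("AdaS", "M", "C"), c_compiler, machine="M")

# Step 2: full Ada-to-M compiler written in AdaS.
v2 = Compiler("Ada", "M", "AdaS")

# Step 3: compile v2 with v1 to get a full Ada compiler that runs on M.
v3 = compile_compiler(v2, v1, machine="M")
print(v3)  # Compiler(source='Ada', target='M', written_in='M')
```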
Phases of a Compiler
- Lexical Analysis: Break source file into words, or
tokens. E.g., x = y + z * 60
- identifier x
- assignment symbol =
- identifier y
- plus sign +
- identifier z
- multiplication sign *
- number 60
- Parsing: Determine if sequence of words is legal, according to
some grammar. Produce abstract syntax tree (AST) or a
parse tree.
- Contextual/Semantic Analysis: Determine what each phrase means.
Use scope rules to associate each occurrence of an identifier with a
specific binding occurrence. Use type rules to infer the type of each
expression and detect type errors.
- Code Generation: Convert into machine language. Constants are
bound to a value; replace occurrences of constants with actual values.
Variables are bound to a storage cell; replace occurrences with references
to the memory cell.
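The first two phases can be sketched for the example statement x = y + z * 60. This is a minimal illustration, not a full front end: the token names, the tuple-based AST, and the tiny grammar (assignment, +, *) are assumptions chosen to match the example.

```python
import re

TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("assign",     r"="),
    ("plus",       r"\+"),
    ("times",      r"\*"),
    ("skip",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))


def tokenize(source):
    """Lexical analysis: source text -> list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if match.lastgroup != "skip":          # drop whitespace
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parsing: tokens -> AST.
    statement := identifier '=' expr
    expr      := term ('+' term)*
    term      := factor ('*' factor)*
    factor    := number | identifier
    """
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "end"

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        text = tokens[pos][1]
        pos += 1
        return text

    def factor():
        if peek() == "number":
            return int(take("number"))
        return take("identifier")

    def term():
        node = factor()
        while peek() == "times":
            take("times")
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "plus":
            take("plus")
            node = ("+", node, term())
        return node

    target = take("identifier")
    take("assign")
    tree = ("=", target, expr())
    if peek() != "end":
        raise SyntaxError(f"unexpected {peek()} after statement")
    return tree


tokens = tokenize("x = y + z * 60")
print(tokens)
print(parse(tokens))   # ('=', 'x', ('+', 'y', ('*', 'z', 60)))
```

The token list is the interface between the two phases, and the grammar gives * a higher precedence than +, which is why z * 60 ends up as a subtree of the addition.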
We will be concerned with the interface between the phases. Sometimes the
interface will be a data structure (e.g., tree), other times it might be a set
of functions.
Design Issues
Passes: the number of complete traversals through
the source program or its internal representation. Two basic choices:
single-pass or multi-pass.
Single-pass Compiler: Syntax analyzer controls
the other phases. Contextual analysis and code generation are done
on-the-fly while syntax analysis is occurring. Useful only for simple
languages.
Multi-pass Compiler: One compiler "driver" and
three modules: syntax, contextual, and code generation. Driver calls syntax
analyzer which reads the source program and produces an AST. Driver calls
contextual analyzer which reads the AST and produces a decorated AST. Driver
calls code generator which reads the decorated AST and produces machine code.
The compiler makes at least three passes.
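The driver-plus-three-modules organization might look like the sketch below. Everything here is illustrative: the two-statement toy language (`name = integer` and `print name`), the tuple ASTs, and the textual "machine code" are assumptions, chosen only to show each pass consuming the previous pass's output.

```python
def syntax_analyzer(source):
    """Pass 1: source text -> AST (a list of statement tuples)."""
    ast = []
    for line in source.splitlines():
        words = line.split()
        if len(words) == 3 and words[1] == "=":
            ast.append(("assign", words[0], int(words[2])))
        elif len(words) == 2 and words[0] == "print":
            ast.append(("print", words[1]))
        elif words:
            raise SyntaxError(f"cannot parse: {line!r}")
    return ast


def contextual_analyzer(ast):
    """Pass 2: AST -> decorated AST (names replaced by storage addresses)."""
    addresses = {}
    decorated = []
    for node in ast:
        kind, name = node[0], node[1]
        if kind == "assign":
            address = addresses.setdefault(name, len(addresses))
            decorated.append(("assign", address, node[2]))
        else:
            if name not in addresses:
                raise NameError(f"{name} used before assignment")
            decorated.append(("print", addresses[name]))
    return decorated


def code_generator(decorated):
    """Pass 3: decorated AST -> (made-up) machine instructions."""
    code = []
    for node in decorated:
        if node[0] == "assign":
            code.append(f"STORE {node[2]} -> [{node[1]}]")
        else:
            code.append(f"PRINT [{node[1]}]")
    return code


def compile_program(source):
    """The driver: calls each module in turn."""
    ast = syntax_analyzer(source)
    decorated = contextual_analyzer(ast)
    return code_generator(decorated)


print(compile_program("x = 5\ny = 7\nprint y"))
# ['STORE 5 -> [0]', 'STORE 7 -> [1]', 'PRINT [1]']
```

In a single-pass design, the body of `syntax_analyzer` would instead call the contextual and code-generation steps directly as each statement is recognized.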
Other Compiler Design Issues
- Speed
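- One-pass is faster. A multi-pass compiler must write and re-read its
intermediate representation between phases, which is costly, e.g., if the
AST is stored on disk between phases.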
- Space
- Multi-pass is smaller.
Multi-pass compiler must allocate space to store the AST, but only one
module is active at any one time. One-pass compilers must keep all data
structures active all the time.
- Modularity
- Multi-pass compilers are typically more modular.
- Flexibility
- Multi-pass wins
here too. Later phases don't rely on the workings of the syntax analyzer.
Each phase can be customized individually.
- Code optimization
- Multi-pass
wins. Code optimization requires analysis of large portions of the source
code before code generation.
- Language properties
- Multi-pass is required for some language properties. E.g.,
if you don't have to declare a variable before it's used, you need a
multi-pass compiler.
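The language-properties point can be illustrated with a toy checker for calls to routines that are declared later in the file. This is a simplified sketch: the program representation (a list of routine names with the names they call) and both checker functions are invented for illustration.

```python
# Each entry: (routine name, names of the routines it calls).
program = [
    ("main", ["helper"]),    # main calls helper, which is declared later
    ("helper", []),
]


def check_single_pass(program):
    """One traversal: a name must be declared before it is used."""
    declared = set()
    errors = []
    for name, calls in program:
        declared.add(name)
        for callee in calls:
            if callee not in declared:
                errors.append(f"{name}: {callee} is not declared")
    return errors


def check_two_pass(program):
    """Pass 1 collects every declaration; pass 2 checks the uses."""
    declared = {name for name, _ in program}
    errors = []
    for name, calls in program:
        for callee in calls:
            if callee not in declared:
                errors.append(f"{name}: {callee} is not declared")
    return errors


print(check_single_pass(program))  # ['main: helper is not declared']
print(check_two_pass(program))     # []
```

The single traversal rejects a program that the language would accept, because the declaration of `helper` has not been seen yet when the call is checked.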
Language Specification
We must specify:
- Syntax: concerned with the form of programs, symbols used,
how phrases and sub-phrases can be generated, commands, expressions,
declarations, etc.
- Contextual constraints: scope rules, type rules (called
"contextual" because correct use depends on context).
- Semantics: meaning of programs.
Two methods of specification. First, formal: unambiguous,
precise, complete description. Only decipherable by those who know the
notation (see Formal
Semantics for more information). Second, informal: specification in
English. Difficult to do well.
We'll use both methods: syntax uses a formal notation, contextual
constraints and semantics use informal notation.
References
[1] The Compiler Course in Today's Curriculum: Three Strategies. William Waite. Proceedings of SIGCSE 2006.
[2] Modern Compiler Implementation in Java. 2nd ed. A. Appel. Cambridge University Press. 2005.