Languages and Compilers (SProg og Oversættere)

Bent Thomsen, Department of Computer Science, Aalborg University. With acknowledgement to Norm Hutchinson, on whose slides this lecture is based.

Curricula (Studie ordning) The purpose of the course is for the student to gain knowledge of important principles in programming languages and for the student to gain an understanding of techniques for describing and compiling programming languages.

What was this course about? • Programming Language Design • Concepts and Paradigms • Ideas and philosophy • Syntax and Semantics • Compiler Construction • Tools and Techniques • Implementations • The nuts and bolts

The principal paradigms • Imperative Programming (C) • Object-Oriented Programming (C++) • Logic/Declarative Programming (Prolog) • Functional/Applicative Programming (Lisp) • New paradigms? • Agent Oriented Programming • Business Process Oriented (Web computing) • Grid Oriented • Aspect Oriented Programming

Criteria in a good language design • Writability: The quality of a language that enables a programmer to use it to express a computation clearly, correctly, concisely, and quickly. • Readability: The quality of a language that enables a programmer to understand and comprehend the nature of a computation easily and accurately. • Orthogonality: The quality of a language whose features have as few restrictions as possible and can be combined in any meaningful way. • Reliability: The quality of a language that assures a program will not behave in unexpected or disastrous ways during execution. • Maintainability: The quality of a language that makes it easy to find and correct errors and to add new features.

Criteria (Continued) • Generality: The quality of a language that avoids special cases in the availability or use of constructs and combines closely related constructs into a single, more general one. • Uniformity: The quality of a language in which similar features look similar and behave similarly. • Extensibility: The quality of a language that provides some general mechanism for the user to add new constructs to the language. • Standardability: The quality of a language that allows programs written in it to be transported from one computer to another without significant change in language structure. • Implementability: The quality of a language that allows a translator or interpreter to be written for it. This relates to the complexity of the language definition.

Important! • Syntax is the visible part of a programming language • Programming Language designers can waste a lot of time discussing unimportant details of syntax • The language paradigm is the next most visible part • The choice of paradigm, and therefore language, depends on how humans best think about the problem • There are no right models of computations – just different models of computations, some more suited for certain classes of problems than others • The most invisible part is the language semantics • Clear semantics usually leads to simple and efficient implementations

Levels of Programming Languages • High-level program: class Triangle { ... float surface() { return b*h/2; } } • Low-level program: LOAD r1,b LOAD r2,h MUL r1,r2 DIV r1,#2 RET • Executable machine code: 0001001001000101001001001110110010101101001...

Terminology • Picture: input -> Translator -> output • The source program (input) is expressed in the source language • The object program (output) is expressed in the target language • The translator is expressed in the implementation language Q: Which programming languages play a role in this picture? A: All of them!

Tombstone Diagrams What are they? • diagrams consisting of a set of “puzzle pieces” we can use to reason about language processors and programs • different kinds of pieces: • Program P implemented in L • Translator from S to T (S -> T) implemented in L • Machine M implemented in hardware • Interpreter for language M implemented in L • combination rules (not all diagrams are “well formed”)

Syntax Specification Syntax is specified using “Context Free Grammars”: • A finite set of terminal symbols • A finite set of non-terminal symbols • A start symbol • A finite set of production rules Usually CFGs are written in “Backus-Naur Form” or BNF notation. A production rule in BNF notation is written as: N ::= α where N is a non-terminal and α is a sequence of terminals and non-terminals. N ::= α | β | ... is an abbreviation for several rules with N as left-hand side.

Concrete and Abstract Syntax The previous grammar specified the concrete syntax of Mini Triangle. The concrete syntax is important for the programmer who needs to know exactly how to write syntactically well-formed programs. The abstract syntax omits irrelevant syntactic details and only specifies the essential structure of programs. Example: different concrete syntaxes for an assignment • v := e • (set! v e) • e -> v • v = e

Abstract Syntax Trees Abstract Syntax Tree for: d := d + 10*n [Figure: tree rooted at AssignmentCmd, built from BinaryExpression, VNameExp, IntegerExp and SimpleVName nodes over the leaves d, d, +, 10, * and n]
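One way to picture such a tree in code is as a small class hierarchy. The following is a minimal sketch, not the course's actual AST classes: the class names follow the node labels on the slide, while the fields, the show() helper and the AstDemo class are assumptions made for illustration. It also assumes Mini Triangle's left-to-right grouping without operator precedence, so d+10*n is built as (d+10)*n.

```java
// Hypothetical AST classes: names follow the slide's node labels,
// fields and show() are illustrative assumptions.
abstract class Expression {
    abstract String show();
}

class IntegerExp extends Expression {
    final String literal;
    IntegerExp(String literal) { this.literal = literal; }
    String show() { return "IntegerExp(" + literal + ")"; }
}

class VNameExp extends Expression {
    final String ident;
    VNameExp(String ident) { this.ident = ident; }
    String show() { return "VNameExp(" + ident + ")"; }
}

class BinaryExpression extends Expression {
    final Expression left;
    final String op;
    final Expression right;
    BinaryExpression(Expression left, String op, Expression right) {
        this.left = left;
        this.op = op;
        this.right = right;
    }
    String show() {
        return "BinaryExpression(" + left.show() + " " + op + " " + right.show() + ")";
    }
}

class AssignmentCmd {
    final String target;      // the variable being assigned (SimpleVName on the slide)
    final Expression value;
    AssignmentCmd(String target, Expression value) {
        this.target = target;
        this.value = value;
    }
    String show() { return "AssignmentCmd(" + target + " := " + value.show() + ")"; }
}

class AstDemo {
    // Builds the tree for d := d + 10*n, grouped as (d + 10) * n (assumption, see above).
    static AssignmentCmd example() {
        Expression sum = new BinaryExpression(new VNameExp("d"), "+", new IntegerExp("10"));
        return new AssignmentCmd("d", new BinaryExpression(sum, "*", new VNameExp("n")));
    }
}
```

Note that the tree keeps only the essential structure: the ":=" token, whitespace and any parentheses from the concrete syntax do not appear as nodes.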

Contextual Constraints Syntax rules alone are not enough to specify the format of well-formed programs. Example 1 (Scope Rules): let const m~2 in m + x (x is undefined!) Example 2 (Type Rules): let const m~2 ; var n:Boolean in begin n := m<4; n := n+1 end (n+1 is a type error!)

Semantics Specification of semantics is concerned with specifying the “meaning” of well-formed programs. • Terminology: • Expressions are evaluated and yield values (and may or may not perform side effects) • Commands are executed and perform side effects. • Declarations are elaborated to produce bindings • Side effects: • change the values of variables • perform input/output

Phases of a Compiler A compiler’s phases are steps in transforming source code into object code. The different phases correspond roughly to the different parts of the language specification: • Syntax analysis <-> Syntax • Contextual analysis <-> Contextual constraints • Code generation <-> Semantics

The “Phases” of a Compiler Source Program -> Syntax Analysis (Error Reports) -> Abstract Syntax Tree -> Contextual Analysis (Error Reports) -> Decorated Abstract Syntax Tree -> Code Generation -> Object Code

Compiler Passes • A pass is a complete traversal of the source program, or a complete traversal of some internal representation of the source program. • A pass can correspond to a “phase” but it does not have to! • Sometimes a single “pass” corresponds to several phases that are interleaved in time. • What and how many passes a compiler does over the source program is an important design decision.

Single Pass Compiler A single pass compiler makes a single pass over the source text, parsing, analyzing and generating code all at once. Dependency diagram of a typical Single Pass Compiler: the Compiler Driver calls the Syntactic Analyzer, which in turn calls the Contextual Analyzer and the Code Generator.

Multi Pass Compiler A multi pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases. Dependency diagram of a typical Multi Pass Compiler: the Compiler Driver calls the Syntactic Analyzer (input: Source Text, output: AST), the Contextual Analyzer (input: AST, output: Decorated AST) and the Code Generator (input: Decorated AST, output: Object Code).

Syntax Analysis Dataflow chart: Source Program (Stream of Characters) -> Scanner (Error Reports) -> Stream of “Tokens” -> Parser (Error Reports) -> Abstract Syntax Tree

Regular Expressions • REs are a notation for expressing a set of strings of terminal symbols. • Different kinds of RE: • ε The empty string • t Generates only the string t • X Y Generates any string xy such that x is generated by X and y is generated by Y • X | Y Generates any string which is generated either by X or by Y • X* The concatenation of zero or more strings generated by X • (X) For grouping
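The operators above map directly onto the standard java.util.regex syntax. The sketch below is illustrative only: the token definitions (identifier, integer literal, keyword) and the class name RegexDemo are assumptions, not taken from a particular language specification.

```java
import java.util.regex.Pattern;

class RegexDemo {
    // X Y and X*: one letter followed by zero or more letters or digits.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");

    // One digit followed by zero or more digits.
    static final Pattern INT_LITERAL = Pattern.compile("[0-9][0-9]*");

    // X | Y with (X) grouping: exactly one of three keywords.
    static final Pattern KEYWORD = Pattern.compile("(if|then|else)");

    // matches() succeeds only if the whole string is generated by the expression.
    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }
    static boolean isIntLiteral(String s) { return INT_LITERAL.matcher(s).matches(); }
    static boolean isKeyword(String s) { return KEYWORD.matcher(s).matches(); }
}
```

The character classes such as [a-z] are shorthand for a long alternation (a | b | ... | z), so they add convenience but no expressive power.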

FA and the implementation of Scanners • Regular expressions, NDFA-ε, NDFA and DFA are all equivalent formalisms in terms of what languages can be defined with them. • Regular expressions are a convenient notation for describing the “tokens” of programming languages. • Regular expressions can be converted into FAs (the algorithm for conversion into NDFA-ε is straightforward) • DFAs can be easily implemented as computer programs.
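To illustrate the last point, here is a minimal sketch of a hand-coded DFA that recognizes identifiers and integer literals. The state numbering, the token names and the class TokenDfa are assumptions chosen for this example.

```java
class TokenDfa {
    // States: 0 = start, 1 = inside identifier (accepting),
    //         2 = inside integer literal (accepting), 3 = error (dead state).
    static int step(int state, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case 0:  return letter ? 1 : (digit ? 2 : 3);
            case 1:  return (letter || digit) ? 1 : 3;
            case 2:  return digit ? 2 : 3;
            default: return 3;   // once in the dead state, stay there
        }
    }

    // Runs the DFA over the whole string and reports the accepting state reached.
    static String classify(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            state = step(state, s.charAt(i));
        }
        if (state == 1) return "IDENTIFIER";
        if (state == 2) return "INTLITERAL";
        return "ERROR";
    }
}
```

The transition function is deterministic: each (state, character) pair has exactly one successor, which is what makes the translation into a switch statement so direct.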

Parsing Parsing == Recognition + determining phrase structure (for example by generating AST) • Different types of parsing strategies • bottom up • top down • Recursive descent parsing • What is it • How to implement one given an EBNF specification • Bottom up parsing algorithms

Top-Down vs Bottom-Up parsing • LR analysis (Bottom-Up): Reduction, Look-Ahead • LL analysis (Top-Down): Derivation, Look-Ahead

Development of Recursive Descent Parser (1) Express grammar in EBNF (2) Grammar Transformations: Left factorization and Left recursion elimination (3) Create a parser class with • private variable currentToken • methods to call the scanner: accept and acceptIt (4) Implement private parsing methods: • add a private parseN method for each non-terminal N • a public parse method that • gets the first token from the scanner • calls parseS (S is the start symbol of the grammar)
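The steps above can be sketched for a tiny grammar, S ::= T ('+' T)* and T ::= digit | '(' S ')'. This grammar, the class name MiniParser and the choice to treat every character as one token are assumptions made to keep the example short. Instead of building an AST, the parsing methods return the value of the phrase they recognize.

```java
class MiniParser {
    private static final char EOT = '$';   // end-of-text marker (assumption)
    private final String input;            // each character counts as one token
    private int pos = 0;
    private char currentToken;

    MiniParser(String input) { this.input = input; }

    // Stand-in for the scanner: returns the next token or EOT.
    private char scan() {
        return pos < input.length() ? input.charAt(pos++) : EOT;
    }

    // Unconditionally moves on to the next token.
    private void acceptIt() { currentToken = scan(); }

    // Checks that the current token is the expected one, then moves on.
    private void accept(char expected) {
        if (currentToken != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + currentToken);
        }
        acceptIt();
    }

    // S ::= T ('+' T)*
    private int parseS() {
        int value = parseT();
        while (currentToken == '+') {
            acceptIt();
            value += parseT();
        }
        return value;
    }

    // T ::= digit | '(' S ')'
    private int parseT() {
        if (Character.isDigit(currentToken)) {
            int value = currentToken - '0';
            acceptIt();
            return value;
        } else if (currentToken == '(') {
            acceptIt();
            int value = parseS();
            accept(')');
            return value;
        }
        throw new IllegalStateException("unexpected token " + currentToken);
    }

    // Gets the first token, parses the start symbol, and requires end of text.
    public int parse() {
        currentToken = scan();
        int value = parseS();
        accept(EOT);
        return value;
    }
}
```

Each choice in parseT is made by looking at currentToken only, which is exactly the one-token lookahead that a recursive descent parser of this kind relies on.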

LL1 Grammars • The presented algorithm to convert EBNF into a parser does not work for all possible grammars. • It only works for so-called “LL1” grammars. • Basically, an LL1 grammar is a grammar which can be parsed with a top-down parser with a lookahead (in the input stream of tokens) of one token. • What grammars are LL1? How can we recognize that a grammar is (or is not) LL1? • We can deduce the necessary conditions from the parser generation algorithm. • We can use a formal definition

Converting EBNF into RD parsers • The conversion of an EBNF specification into a Java implementation for a recursive descent parser is so “mechanical” that it can easily be automated! • => JavaCC “Java Compiler Compiler”

JavaCC and JJTree

LR parsing • The algorithm makes use of a stack. • The first item on the stack is the initial state of a DFA • A state of the automaton is a set of LR0/LR1 items. • The initial state is constructed from productions of the form S ::= •α [, $] (where S is the start symbol of the CFG) • The stack contains (in alternating order): • A DFA state • A terminal symbol or part (subtree) of the parse tree being constructed • The items on the stack are related by transitions of the DFA • There are two basic actions in the algorithm: • shift: get next input token • reduce: build a new node (remove children from stack)

Bottom Up Parsers: Overview of Algorithms • LR0 : The simplest algorithm, theoretically important but rather weak (not practical) • SLR : An improved version of LR0 more practical but still rather weak. • LR1 : LR0 algorithm with extra lookahead token. • very powerful algorithm. Not often used because of large memory requirements (very big parsing tables) • LALR : “Watered down” version of LR1 • still very powerful, but has much smaller parsing tables • most commonly used algorithm today

JavaCUP: A LALR generator for Java • Definition of tokens (Regular Expressions) -> JFlex -> Java File: Scanner Class (Recognizes Tokens) • Grammar (BNF-like Specification) -> JavaCUP -> Java File: Parser Class (Uses Scanner to get Tokens, Parses Stream of Tokens) • Scanner Class and Parser Class together form the Syntactic Analyzer

Steps to build a compiler with SableCC • Create a SableCC specification file • Call SableCC • Create one or more working classes, possibly inherited from classes generated by SableCC • Create a Main class activating lexer, parser and working classes • Compile with javac

Contextual Analysis Phase • Purposes: • Finish syntax analysis by deriving context-sensitive information • Associate semantic routines with individual productions of the context free grammar or subtrees of the AST • Start to interpret meaning of program based on its syntactic structure • Prepare for the final stage of compilation: Code generation

Contextual Analysis -> Decorated AST Annotations: • result of identification • :type result of type checking [Figure: decorated AST of a LetCommand declaring n: Integer and c: Char, with the commands c := ‘&’ and n := n+1; expression and variable nodes are annotated with :char and :int]

Nested Block Structure A language exhibits nested block structure if blocks may be nested one within another (typically with no upper bound on the level of nesting that is allowed). • There can be any number of scope levels (depending on the level of nesting of blocks) • Typical scope rules: • no identifier may be declared more than once within the same block (at the same level). • for any applied occurrence there must be a corresponding declaration, either within the same block or in a block in which it is nested.
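These scope rules are typically enforced with an identification table that keeps one map per open block. The following is a minimal sketch under stated assumptions: the class name IdTable, its method names and the use of plain strings as attributes are illustrative choices, not the course compiler's actual interface.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class IdTable {
    // One map per open block; the innermost block is at the head of the deque.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    IdTable() { openScope(); }   // the outermost (global) block

    void openScope() { scopes.push(new HashMap<>()); }

    void closeScope() { scopes.pop(); }

    // Declares id in the current block.
    // Returns false if id is already declared at this level (first scope rule).
    boolean enter(String id, String attribute) {
        Map<String, String> current = scopes.peek();
        if (current.containsKey(id)) {
            return false;
        }
        current.put(id, attribute);
        return true;
    }

    // Looks id up from the innermost block outwards (second scope rule).
    // Returns null if there is no corresponding declaration.
    String retrieve(String id) {
        for (Map<String, String> scope : scopes) {
            String attribute = scope.get(id);
            if (attribute != null) {
                return attribute;
            }
        }
        return null;
    }
}
```

Searching from the innermost block outwards means that an inner declaration hides an outer one with the same name until the inner block is closed.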

Type Checking For most statically typed programming languages, a bottom up algorithm over the AST: • Types of expression AST leaves are known immediately: • literals => obvious • variables => from the ID table • named constants => from the ID table • Types of internal nodes are inferred from the type of the children and the type rule for that kind of expression
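A minimal sketch of such a bottom-up checker follows. The class names (TypedExpr, IntLit, VarRef, BinOp), the use of strings as types and the two type rules for "+" and "<" are assumptions for illustration; the identification table is simplified to a map from names to types.

```java
import java.util.Map;

abstract class TypedExpr {
    // Returns the inferred type: "int", "bool" or "error".
    abstract String check(Map<String, String> idTable);
}

class IntLit extends TypedExpr {
    // Leaf: the type of a literal is known immediately.
    String check(Map<String, String> idTable) { return "int"; }
}

class VarRef extends TypedExpr {
    final String name;
    VarRef(String name) { this.name = name; }
    // Leaf: the type of a variable comes from the ID table.
    String check(Map<String, String> idTable) {
        String type = idTable.get(name);
        return type != null ? type : "error";
    }
}

class BinOp extends TypedExpr {
    final String op;
    final TypedExpr left;
    final TypedExpr right;
    BinOp(String op, TypedExpr left, TypedExpr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }
    // Internal node: check the children first, then apply the operator's type rule.
    String check(Map<String, String> idTable) {
        String l = left.check(idTable);
        String r = right.check(idTable);
        boolean bothInt = l.equals("int") && r.equals("int");
        if (op.equals("+") && bothInt) return "int";    // int + int : int
        if (op.equals("<") && bothInt) return "bool";   // int < int : bool
        return "error";
    }
}
```

With m declared as an integer and n as a Boolean, m<4 checks to "bool" while n+1 yields "error", because the rule for "+" requires two integer operands.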

Contextual Analysis Identification and type checking are combined into a depth-first traversal of the abstract syntax tree. [Figure: AST of a LetCommand declaring n: Integer and c: Char, with the commands c := ‘&’ and n := n+1, traversed depth-first from the Program root]

Visitor Solution • Node: Accept(NodeVisitor v) • VariableRefNode: Accept(NodeVisitor v) {v->VisitVariableRef(this)} • AssignmentNode: Accept(NodeVisitor v) {v->VisitAssignment(this)} • NodeVisitor, TypeCheckingVisitor and CodeGeneratingVisitor each have: VisitAssignment(AssignmentNode), VisitVariableRef(VariableRefNode) • Nodes accept visitors and call the appropriate method of the visitor • Visitors implement the operations and have one method for each type of node they visit
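The same structure can be written down in Java. This is a minimal sketch: the node and visitor names follow the slide, while the fields of the nodes and the TracingVisitor (a stand-in for a real type-checking or code-generating visitor) are assumptions made for illustration.

```java
interface NodeVisitor {
    void visitAssignment(AssignmentNode node);
    void visitVariableRef(VariableRefNode node);
}

abstract class Node {
    abstract void accept(NodeVisitor v);
}

class VariableRefNode extends Node {
    final String name;
    VariableRefNode(String name) { this.name = name; }
    // The node knows its own type, so it picks the matching visitor method.
    void accept(NodeVisitor v) { v.visitVariableRef(this); }
}

class AssignmentNode extends Node {
    final VariableRefNode target;
    final Node value;
    AssignmentNode(VariableRefNode target, Node value) {
        this.target = target;
        this.value = value;
    }
    void accept(NodeVisitor v) { v.visitAssignment(this); }
}

// Hypothetical visitor that records the order in which nodes are visited.
class TracingVisitor implements NodeVisitor {
    final StringBuilder trace = new StringBuilder();

    public void visitAssignment(AssignmentNode node) {
        trace.append("assign(");
        node.target.accept(this);   // the visitor decides how to traverse children
        trace.append(",");
        node.value.accept(this);
        trace.append(")");
    }

    public void visitVariableRef(VariableRefNode node) {
        trace.append(node.name);
    }
}
```

Adding a new operation over the tree means adding one new visitor class; the node classes stay unchanged.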

Runtime organization • Data Representation: how to represent values of the source language on the target machine. • Primitives, arrays, structures, unions, pointers • Expression Evaluation: How to organize computing the values of expressions (taking care of intermediate results) • Register vs. stack machine • Storage Allocation: How to organize storage for variables (considering different lifetimes of global, local and heap variables) • Activation records, static links • Routines: How to implement procedures, functions (and how to pass their parameters and return values) • Value vs. reference, closures, recursion • Object Orientation: Runtime organization for OO languages • Method tables

RECAP: TAM Frame Layout Summary • arguments: Arguments for the current procedure; they were put here by the caller. • LB: base of the current frame • Link data: dynamic link, static link, return address • Local data: local variables and intermediate results; grows and shrinks during execution. • ST: top of the stack

Garbage Collection: Conclusions • Relieves the burden of explicit memory allocation and deallocation. • Software module coupling related to memory management issues is eliminated. • An extremely dangerous class of bugs is eliminated. • The compiler generates code for allocating objects • The compiler must also generate code to support GC • The GC must be able to recognize root pointers from the stack • The GC must know about data-layout and objects descriptors

Code Generation Source program: let var n: integer; var c: char in begin c := ‘&’; n := n+1 end ~ Target program: PUSH 2 LOADL 38 STORE 1[SB] LOAD 0 LOADL 1 CALL add STORE 0[SB] POP 2 HALT Source and target program must be “semantically equivalent”. Semantic specification of the source language is structured in terms of phrases in the SL: expressions, commands, etc. => Code generation follows the same “inductive” structure.

Specifying Code Generation with Code Templates The code generation functions for Mini Triangle (phrase class, function, effect of the generated code): • Program: run P. Run program P then halt, starting and finishing with an empty stack. • Command: execute C. Execute command C. May update variables but does not shrink or grow the stack! • Expression: evaluate E. Evaluate E; the net result is pushing the value of E on the stack. • V-name: fetch V. Push the value of the constant or variable on the stack. • V-name: assign V. Pop a value from the stack and store it in variable V. • Declaration: elaborate D. Elaborate the declaration; make space on the stack for the constants and variables in the declaration.

Code Generation with Code Templates While command • execute [while E do C] = • JUMP h • g: execute [C] • h: evaluate [E] • JUMPIF(1) g
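The forward jump to h is the interesting part of this template: its target address is not known when the JUMP is emitted, so it has to be patched afterwards. Below is a minimal sketch of that backpatching step. The class WhileCodeGen, the string representation of instructions and the idea of passing pre-generated code for E and C as lists are all assumptions; a real code generator would visit the AST instead.

```java
import java.util.ArrayList;
import java.util.List;

class WhileCodeGen {
    // The object code generated so far; the list index is the instruction address.
    final List<String> code = new ArrayList<>();

    // Appends one instruction and returns its address.
    int emit(String instruction) {
        code.add(instruction);
        return code.size() - 1;
    }

    // Applies the template: JUMP h; g: execute C; h: evaluate E; JUMPIF(1) g
    void encodeWhile(List<String> conditionCode, List<String> bodyCode) {
        int jumpAddress = emit("JUMP ?");        // target h is not known yet
        int g = code.size();                     // g: first instruction of the body
        code.addAll(bodyCode);                   // execute C
        int h = code.size();                     // h: first instruction of the condition
        code.set(jumpAddress, "JUMP " + h);      // backpatch the forward jump
        code.addAll(conditionCode);              // evaluate E
        emit("JUMPIF(1) " + g);                  // loop back while E is true
    }
}
```

Placing the condition after the body means each iteration executes only one jump instruction, the conditional one at the end.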

Developing a Code Generator “Visitor” • execute [C1;C2] = • execute[C1] • execute[C2] public Object visitSequentialCommand( SequentialCommand com,Object arg) { com.C1.visit(this,arg); com.C2.visit(this,arg); return null; } LetCommand, IfCommand, WhileCommand => later. - LetCommand is more complex: memory allocation and addresses - IfCommand and WhileCommand: complications with jumps

Code improvement (optimization) The code generated by our compiler is not efficient: • It computes values at runtime that could be known at compile time • It computes values more times than necessary We can do better! • Constant folding • Common sub-expression elimination • Code motion • Dead code elimination
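As an illustration of the first technique, here is a minimal sketch of constant folding over an expression tree. The classes FoldExpr, Const, Name and Binary are assumptions made for this example, and only "+" and "*" are folded. It also assumes that the target machine's integer arithmetic agrees with Java's int arithmetic; where it does not, folding at compile time could change the program's result.

```java
abstract class FoldExpr {
    abstract FoldExpr fold();    // returns an equivalent, possibly simpler tree
    abstract String show();
}

class Const extends FoldExpr {
    final int value;
    Const(int value) { this.value = value; }
    FoldExpr fold() { return this; }
    String show() { return Integer.toString(value); }
}

class Name extends FoldExpr {
    final String id;
    Name(String id) { this.id = id; }
    FoldExpr fold() { return this; }   // value only known at runtime
    String show() { return id; }
}

class Binary extends FoldExpr {
    final char op;
    final FoldExpr left;
    final FoldExpr right;
    Binary(char op, FoldExpr left, FoldExpr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    FoldExpr fold() {
        // Fold the children first, so constants propagate upwards.
        FoldExpr l = left.fold();
        FoldExpr r = right.fold();
        if (l instanceof Const && r instanceof Const) {
            int a = ((Const) l).value;
            int b = ((Const) r).value;
            switch (op) {
                case '+': return new Const(a + b);
                case '*': return new Const(a * b);
                default:  break;   // other operators are left unfolded
            }
        }
        return new Binary(op, l, r);
    }

    String show() { return "(" + left.show() + " " + op + " " + right.show() + ")"; }
}
```

For example, folding n + 2*5 replaces the subtree 2*5 by the constant 10 but leaves the addition in place, since n is not known at compile time.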

Optimization implementation • Is the optimization correct or safe? • Is the optimization an improvement? • What sort of analyses do we need to perform to get the required information? • Local • Global

Concurrency, distributed computing, the Internet • Traditional view: • Let the OS deal with this • => It is not a programming language issue! • End of Lecture • Wait-a-minute … • Maybe “the traditional view” is getting out of date?

Languages with concurrency constructs • Maybe the “traditional view” was always out of date? • Simula • Modula-3 • Occam • Concurrent Pascal • Ada • Linda • CML • Facile • JoCaml • Java • C# • …
