CS 3304 Comparative Languages • Lecture 15: Midterm Exam Review • 13 March 2012
Why Study Programming Languages? (Ch. 1) • Help you choose a language: • Make it easier to learn new languages: some languages are similar; easy to walk down the family tree. • Help you make better use of whatever language you use: • Understand obscure features. • Understand implementation costs: choose between alternative ways of doing things, based on knowledge of what will be done underneath. • Figure out how to do things in languages that don't support them explicitly.
Language Groups • Imperative: • von Neumann (Fortran, Pascal, Basic, C). • Object-oriented (Smalltalk, Eiffel, C++?). • Scripting languages (Perl, Python, JavaScript, PHP). • Declarative: • Functional (Scheme, ML, pure Lisp, FP). • Logic, constraint-based (Prolog, VisiCalc, RPG). • Imperative languages, particularly the von Neumann languages, predominate: • They will occupy the bulk of our attention. • We also plan to spend a lot of time on functional and logic languages.
The von Neumann Architecture (Ch. 2) • Fetch-execute cycle (on a von Neumann architecture computer):
    initialize the program counter
    repeat forever
        fetch the instruction pointed to by the counter
        increment the counter
        decode the instruction
        execute the instruction
    end repeat
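The same loop can be sketched in a few lines of C for a made-up accumulator machine; the opcodes, instruction format, and sample program below are invented purely for illustration.

    /* A tiny fetch-execute loop in C for a hypothetical accumulator machine.
       Opcodes, instruction format, and the sample program are illustrative only. */
    #include <stdio.h>

    enum { HALT = 0, LOAD = 1, ADD = 2 };      /* invented opcodes */

    int main(void) {
        /* each instruction is an opcode/operand pair stored in two words */
        int memory[] = { LOAD, 5, ADD, 7, HALT, 0 };
        int pc = 0;                            /* program counter */
        int acc = 0;                           /* accumulator     */
        for (;;) {                             /* repeat forever        */
            int opcode  = memory[pc];          /* fetch the instruction */
            int operand = memory[pc + 1];
            pc += 2;                           /* increment the counter */
            switch (opcode) {                  /* decode and execute    */
                case LOAD: acc = operand;  break;
                case ADD:  acc += operand; break;
                case HALT: printf("acc = %d\n", acc); return 0;
            }
        }
    }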
Programming Methodologies • 1950s and early 1960s - Simple applications; worry about machine efficiency. • Late 1960s - People efficiency became important; readability, better control structures: • Structured programming. • Top-down design and step-wise refinement. • Late 1970s - Process-oriented to data-oriented: • Data abstraction. • Middle 1980s - Object-oriented programming: • Data abstraction + inheritance + polymorphism.
Compilation vs. Interpretation • Not opposites. • Not a clear-cut distinction. • Interpretation: • Greater flexibility. • Better diagnostics (error messages). • Compilation: • Better performance.
Defining Languages • Recognizers: • A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language. • Example: syntax analysis part of a compiler (scanning). • Generators: • A device that generates sentences of a language. • One can determine if the syntax of a particular sentence is syntactically correct by comparing it to the structure of the generator.
Regular Expressions • A regular expression is one of the following: • A character. • The empty string, denoted by ε. • Two regular expressions concatenated. • Two regular expressions separated by | (i.e., or). • A regular expression followed by the Kleene star (concatenation of zero or more strings). • Numerical literals in Pascal, for example, can be generated by regular expressions along the lines of the sketch below.
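A sketch of such definitions for unsigned Pascal-style numeric literals (this follows the general shape of the textbook's example, but is not reproduced verbatim):

    digit            → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    unsigned_integer → digit digit*
    unsigned_number  → unsigned_integer ( ( . unsigned_integer ) | ε )
                                        ( ( e ( + | - | ε ) unsigned_integer ) | ε )

Here digit, unsigned_integer, and unsigned_number are simply names for regular expressions: a number is an integer, optionally followed by a fractional part, optionally followed by an exponent.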
Context-Free Grammars • Context-Free Grammars: • Developed by Noam Chomsky in the mid-1950s. • Language generators, meant to describe the syntax of natural languages. • Define a class of languages called context-free languages. • Backus-Naur Form (1959): • Invented by John Backus to describe Algol 58. • BNF is equivalent to context-free grammars (CFGs). • A CFG consists of: • A set of terminals T. • A set of non-terminals N. • A start symbol S (a non-terminal). • A set of productions.
BNF Fundamentals • In BNF, abstractions are used to represent classes of syntactic structures: they act like syntactic variables (also called nonterminal symbols, or just nonterminals). • Terminals are lexemes or tokens. • A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals. • Nonterminals are often italic or enclosed in angle brackets. • Examples of BNF rules:
    <ident_list> → identifier | identifier, <ident_list>
    <if_stmt> → if <logic_expr> then <stmt>
 • Grammar: a finite non-empty set of rules. • A start symbol is a special element of the nonterminals of a grammar.
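As a quick worked example using the first rule above, the sentence "identifier, identifier" is derived from <ident_list> in two steps:

    <ident_list> ⇒ identifier, <ident_list>
                 ⇒ identifier, identifier

Each step rewrites a nonterminal using the right-hand side of one of its rules.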
Scanner Responsibilities • Tokenizing source. • Removing comments. • (Often) dealing with pragmas (i.e., significant comments). • Saving text of identifiers, numbers, strings. • Saving source locations (file, line, column) for error messages.
Deterministic Finite Automaton • Pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. • This is a deterministic finite automaton (DFA): • Lex, scangen, ANTLR, etc. build these things automatically from a set of regular expressions. • Specifically, they construct a machine that accepts the language.
Building Scanners • Scanners tend to be built in one of three ways: • Ad-hoc. • Semi-mechanical pure DFA (usually as nested case statements). • Table-driven DFA. • Ad-hoc generally yields the fastest, most compact code by doing lots of special-purpose things, though good automatically-generated scanners come very close. • Writing a pure DFA as a set of nested case statements is a surprisingly useful programming technique (Figure 12.1; see the sketch below): • It is often easier to use perl, awk, sed, or similar tools. • Table-driven DFA is what lex and scangen produce: • lex (flex): C code. • scangen: numeric tables and a separate driver (Figure 2.12). • ANTLR: Java code.
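A small C sketch of the nested-case style, for a scanner that recognizes only identifiers and unsigned integer literals. The token names and interface are invented for illustration; a real scanner would also handle operators, comments, bounds checking, errors, and source locations.

    /* Hand-written DFA scanner for two token classes.  Each if-branch
       corresponds to a state of the automaton; the loops follow the
       self-transitions on that state. */
    #include <ctype.h>
    #include <stdio.h>

    typedef enum { TOK_ID, TOK_NUM, TOK_OTHER, TOK_EOF } token_t;

    token_t scan(FILE *in, char *text) {
        int c = getc(in);
        while (isspace(c)) c = getc(in);      /* start state: skip white space */
        if (c == EOF) return TOK_EOF;

        int i = 0;
        if (isalpha(c)) {                     /* state: inside an identifier */
            do { text[i++] = (char)c; c = getc(in); } while (isalnum(c));
            ungetc(c, in);                    /* give back the lookahead character */
            text[i] = '\0';
            return TOK_ID;
        }
        if (isdigit(c)) {                     /* state: inside a number */
            do { text[i++] = (char)c; c = getc(in); } while (isdigit(c));
            ungetc(c, in);
            text[i] = '\0';
            return TOK_NUM;
        }
        text[0] = (char)c; text[1] = '\0';    /* anything else: single character */
        return TOK_OTHER;
    }

    int main(void) {                          /* tiny driver for demonstration */
        char buf[256];
        token_t t;
        while ((t = scan(stdin, buf)) != TOK_EOF)
            printf("token %d: %s\n", (int)t, buf);
        return 0;
    }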
Parsing • By analogy to regular expressions and DFAs, a context-free grammar (CFG) is a generator for a context-free language (CFL): • A parser is a language recognizer. • There is an infinite number of grammars for every context-free language: • Not all grammars are created equal, however. • It turns out that for any CFG we can create a parser that runs in O(n3) time. • There are two well-known parsing algorithms that permit this: • Earley's algorithm. • Cocke-Younger-Kasami (CYK) algorithm. • O(n3) is unacceptable for a parser in a compiler: too slow.
Faster Parsing • Fortunately, there are large classes of grammars for which we can build parsers that run in linear time: • The two most important classes are called LL and LR. • LL stands for ‘Left-to-right, Leftmost derivation’. • LR stands for ‘Left-to-right, Rightmost derivation’. • LL parsers are also called ‘top-down’, or 'predictive' parsers & LR parsers are also called ‘bottom-up’, or 'shift-reduce' parsers. • There are several important sub-classes of LR parsers. • Simple LR parser (SLR). • Look-ahead LR parser (LALR). • We won't be going into detail on the differences between them.
LL Parsing • Like the bottom-up grammar, this one captures associativity and precedence, but most people don't find it as pretty: • For one thing, the operands of a given operator aren't in a RHS together! • However, the simplicity of the parsing algorithm makes up for this weakness. • How do we parse a string with this grammar? • By building the parse tree incrementally.
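A minimal recursive-descent (top-down, LL) sketch in C for the toy grammar expr → term { '+' term }, term → digit, just to show the shape of predictive parsing. The grammar, function names, and error handling are chosen only for illustration.

    /* Recursive-descent recognizer for:  expr -> term { '+' term }
                                          term -> digit
       Purely illustrative; a real parser would build a syntax tree
       and report errors with source locations. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *input;          /* current position in the input string */

    static void error(void) { fprintf(stderr, "syntax error\n"); exit(1); }

    static void match(char expected) { /* consume one expected token */
        if (*input == expected) input++; else error();
    }

    static void term(void) {           /* term -> digit */
        if (isdigit((unsigned char)*input)) input++; else error();
    }

    static void expr(void) {           /* expr -> term { '+' term } */
        term();
        while (*input == '+') { match('+'); term(); }
    }

    int main(void) {
        input = "1+2+3";
        expr();
        if (*input != '\0') error();   /* require that all input is consumed */
        puts("accepted");
        return 0;
    }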
LR Parsing • LR parsers are almost always table-driven: • Like a table-driven LL parser, an LR parser uses a big loop in which it repeatedly inspects a two-dimensional table to find out what action to take. • Unlike the LL parser, the LR driver has non-trivial state (like a DFA), and the table is indexed by current input token and current state. • The stack contains a record of what has been seen SO FAR (NOT what is expected).
Core Issues (Ch. 3) • The early development of programming languages was driven by two complementary goals: machine independence and ease of programming. • Machine independence: a programming language should not rely on the features of any particular instruction set for its efficient implementation (e.g., Java). • Ease of programming: more elusive; more a matter of aesthetics and trial and error than of science. • Core issues for the midterm: • Names, scopes, and bindings: Chapter 3. • Control-flow constructs: Chapter 6. • Data types: Chapter 7.
Name, Scope, and Binding • A name is a mnemonic character string used to represent something else: • Most names are identifiers. • Symbols (like '+') can also be names. • A binding is an association between two things, such as a name and the thing it names. • The scope of a binding is the part of the program (textually) in which the binding is active. • Binding Time is the point at which a binding is created or, more generally, the point at which any implementation decision is made. • The terms “static” and “dynamic” are generally used to refer to things bound before run time and at run time, respectively.
Storage Allocation Mechanisms • Static: objects are given an absolute address that is retained throughout the program’s execution. • Stack: objects are allocated and deallocated in last-in, first-out order, usually in conjunction with subroutine calls and returns. • Heap: objects may be allocated and deallocated at arbitrary times. They require a more general (and expensive) storage management algorithm.
Static Allocation Examples • Global variables: accessible throughout the program. • Code: the machine instructions. • Static local variables: retain their values from one invocation to the next. • Explicit constants (including strings, sets, etc.): • Small constants may be stored in the instructions. • Tables: most compilers produce a variety of tables used by runtime support routines (debugging, dynamic-type checking, garbage collection, exception handling).
Stack • Central stack for parameters, local variables, and temporaries. • Why a stack? • Allocate space for recursive routines: not necessary if there is no recursion. • Reuse space: useful in all programming languages. • Contents of a stack frame (Figure 3.1): arguments and returns, local variables, temporaries, bookkeeping (saved registers, line number, static link, etc.). • Local variables and arguments are assigned fixed offsets from the stack pointer or frame pointer at compile time. • Maintenance of the stack is the responsibility of the calling sequence and the subroutine prolog and epilog: • Space may be saved by putting as much as possible in the prolog and epilog. • Time may be saved by putting stuff in the caller instead, or by combining what's known in both places (interprocedural optimization).
Heap-Based Allocation • Heap: a region of storage in which subblocks can be allocated and deallocated at arbitrary times. • Heap space management: speed vs. space tradeoffs. • Space concerns: • Internal fragmentation: allocating a block larger than required. • External fragmentation: unused space is fragmented, so the ability to meet allocation requests degrades over time.
Static Scoping • Static (lexical) scope rules: a scope is defined in terms of the physical (lexical) structure of the program: • The determination of scopes can be made by the compiler. • All bindings for identifiers can be resolved by examining the program. • Typically, we choose the most recent, active binding made at compile time. • Most compiled languages, C and Pascal included, employ static scope rules. • The classical example of static scope rules is the most closely nested rule used in block structured languages such as Algol 60 and Pascal (nested subroutines): • An identifier is known in the scope in which it is declared and in each enclosed scope, unless it is re-declared in an enclosed scope. • To resolve a reference to an identifier, we examine the local scope and statically enclosing scopes until a binding is found.
Dynamic Scoping • The key idea in static scope rules is that bindings are defined by the physical (lexical) structure of the program. • With dynamic scope rules, bindings depend on the current state of program execution: • They cannot always be resolved by examining the program because they are dependent on calling sequences. • To resolve a reference, we use the most recent, active binding made at run time. • Dynamic scope rules are usually encountered in interpreted languages: • Early LISP dialects assumed dynamic scope rules. • Such languages do not normally have type checking at compile time because type determination isn’t always possible when dynamic scope rules are in effect.
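A classic illustration of the difference, sketched here in C-like syntax. C itself is statically scoped, so the dynamic-scoping outcome is described only in the comments.

    /* Sketch: the same program under static vs. dynamic scope rules. */
    #include <stdio.h>

    int x = 10;                 /* "outer" binding of x */

    void print_x(void) {
        /* Under static (lexical) scoping, x here always refers to the
           global x: that is the binding visible in the program text. */
        printf("%d\n", x);
    }

    void f(void) {
        int x = 20;             /* a new, local binding of x */
        /* Under dynamic scoping, the call below would find this x,
           because it is the most recent active binding at run time. */
        print_x();
    }

    int main(void) {
        f();   /* prints 10 under static scoping; would print 20 under dynamic scoping */
        return 0;
    }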
Binding of Referencing Environments • The referencing environment of a statement at run time is the set of active bindings. • A referencing environment corresponds to a collection of scopes that are examined (in order) to find a binding. • Scope rules determine that collection and its order. • Binding rules determine which instance of a scope should be used to resolve references when calling a procedure that was passed as a parameter: • They govern the binding of referencing environments to formal procedures. • Shallow binding: the referencing environment is not created until the subroutine is actually called. • Deep binding: the referencing environment is captured at the time the subroutine is passed as a parameter.
Semantic Analyzer (Ch. 4) • The principal job of the semantic analyzer is to enforce static semantic rules: • Constructs a syntax tree (usually first). • Information gathered is needed by the code generator. • This interface is a boundary between the front end and the back end. • There is considerable variety in the extent to which parsing, semantic analysis, and intermediate code generation are interleaved. • Fully separated phases: a full parse tree, then a syntax tree, then semantic checks. • Fully interleaved phases: no need to build both parse and syntax trees. • A common approach interleaves construction of a syntax tree with parsing (no explicit parse tree), followed by separate, sequential phases for semantic analysis and code generation.
Static Analysis • Compile-time algorithms that predict run-time behavior. • An analysis is precise if it allows the compiler to determine whether a given program will always follow the rules: type checking is an example. • Also useful when not precise: a combination of compile-time checks and code for run-time checking. • Static analysis is also used for code improvement: • Alias analysis: determines when values can be safely cached in registers. • Escape analysis: determines whether all references to a value are confined to a given context. • Subtype analysis: determines whether an OO variable is guaranteed to be of a certain subtype. • Unsafe and speculative optimization. • Conservative and optimistic compilers. • Some languages have tighter semantic rules to avoid dynamic checking.
Attribute Grammars • Both semantic analysis and (intermediate) code generation can be described in terms of annotation, or “decoration” of a parse or syntax tree. • Attribute grammars provide a formal framework for decorating such a tree. • The attribute grammar serves to define the semantics of the input program. • Attribute rules are best thought of as definitions, not assignments. • They are not necessarily meant to be evaluated at any particular time, or in any particular order, though they do define their left-hand side in terms of the right-hand side.
Synthesized Attributes • The S-attributed grammar uses only synthesized attributes. • Its attribute flow (attribute dependence graph) is purely bottom-up. • The arguments to semantic functions in an S-attributed grammar are always attributes of symbols on the right-hand side of the current production. • The return value is always placed into an attribute of the left hand side of the production. • The intrinsic properties of tokens are synthesized attributes initialized by the scanner.
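A small S-attributed example (a sketch in the usual notation, not a specific textbook figure): a synthesized attribute val flows bottom-up through a constant-expression grammar, computing the value of the expression at the root.

    E → E1 + T        E.val = E1.val + T.val
    E → T             E.val = T.val
    T → const         T.val = const.val    (const.val is an intrinsic attribute set by the scanner)

Every rule defines an attribute of the left-hand side purely in terms of attributes of the right-hand side, which is exactly what makes the grammar S-attributed.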
Inherited Attributes • Inherited attributes: values are calculated when their symbol is on the right-hand side of the current production. • Contextual information flows into a symbol from above or from the side, providing context from elsewhere in the tree. • Symbol table information is commonly passed by means of inherited attributes. • Inherited attributes of the root of the parse tree can be used to represent the external environment. • Example: left-to-right associativity may create a situation where an S-attributed grammar would be cumbersome to use. By passing attribute values left-to-right in the tree, things are much simpler.
Parsers and Attribute Grammars • Each synthesized attribute of an LHS symbol (by definition of synthesized) depends only on attributes of its RHS symbols. • A bottom-up parser is in general paired with an S-attributed grammar. • Each inherited attribute of an RHS symbol (by definition of L-attributed) depends only on: • Inherited attributes of the LHS symbol, or • Synthesized or inherited attributes of symbols to its left in the RHS. • A top-down parser is in general paired with an L-attributed grammar. • There are certain tasks, such as generation of code for short-circuit Boolean expression evaluation, that are easiest to express with non-L-attributed attribute grammars. • Because of the potential cost of complex traversal schemes, however, most real-world compilers insist that the grammar be L-attributed.
Translation Scheme • There are automatic tools that construct a semantic analyzer (attribute evaluator) for a given attribute grammar. In other words, they generate translation schemes for context-free grammars or tree grammars (which describe the possible structure of a syntax tree): • These tools are heavily used in syntax-based editors and incremental compilers. • Most ordinary compilers, however, use ad-hoc techniques. • Most production compilers use an ad hoc, handwritten translation scheme: • Interleave parsing with at least the initial construction of a syntax tree. • Possibly all of semantic analysis and intermediate code generation. • Since the attributes of each production are evaluated as the production is parsed, there is no need for the full parse tree.
Action Routines • An ad-hoc translation scheme that is interleaved with parsing takes the form of a set of action routines: • An action routine is a semantic function that we tell the compiler to execute at a particular point in the parse. • Semantic analysis & code generation interleaved with parsing: action routines can be used to perform semantic checks and generate code. • LL parser generator: an action routine can appear anywhere within a right-hand side. • Implementation: when the parser predicts a production, it pushes all of the right-hand side onto the stack, including any embedded action routines, and calls each routine when it reaches the top of the stack. • If semantic analysis & code generation are separate phases, then action routines can be used to build a syntax tree: • A parse tree could be built completely automatically. • We wouldn't need action routines for that purpose.
Space Management for Attributes • If there is a parse tree, the attributes can be stored in nodes. • For a bottom-up parser with an S-attributed grammar, maintain an attribute stack mirroring the parse stack: • Next to every state number is an attribute record for the symbol shifted when entering the state. • Entries are pushed and popped automatically. • For a top-down parser with an L-attributed grammar: • Automatic: an attribute stack that does not mirror the parse stack. • Short-cutting copy rules: action routines allocate and deallocate space for attributes explicitly. • Contextual information: • Symbol table that always represents the current referencing environment.
Chapter 5 • Not included in the exam.
Control Flow (Ch. 6) • Control flow (or ordering) in program execution. • Eight principal categories of language mechanisms used to specify ordering: • Sequencing. • Selection. • Iteration. • Procedural abstraction. • Recursion. • Concurrency. • Exception handling and speculation. • Nondeterminacy. • The relative importance of different categories of control flow varies significantly among the different classes of programming languages.
Expression Evaluation • Expression: • A simple object: e.g., a literal constant, a named variable or constant. • An operator or function applied to a collection of operands or arguments, each of which in turn is an expression. • Function calls: a function name followed by a parenthesized, comma-separated list of arguments. • Operator: a built-in function that uses special, simple syntax – one or two arguments, no parentheses or commas. • Sometimes operators are "syntactic sugar" for more "normal" looking functions (in C++, a+b is short for a.operator+(b)). • Operand: an argument of an operator.
Precedence and Associativity • Infix notation requires the use of parentheses to avoid ambiguity. • The choice among alternative evaluation orders depends on the precedence and associativity of the operators: • C has a very rich precedence structure: problems with remembering all the precedence levels (15 levels). • Pascal has a relatively flat precedence hierarchy (3 levels). • APL and Smalltalk: all operators are of equal precedence. • Associativity rules specify whether sequences of operators of equal precedence group to the right or to the left: • Usually the operators associate left-to-right. • Fortran: the exponentiation operator ** associates right-to-left. • C: the assignment operator associates right-to-left.
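A quick worked example of why associativity matters: with operators of equal precedence, the grouping rule determines the result.

    8 - 3 - 2  grouped left-to-right:  (8 - 3) - 2 = 3   (the usual rule for -)
    8 - 3 - 2  grouped right-to-left:  8 - (3 - 2) = 7

Similarly, in Fortran 2 ** 3 ** 2 groups as 2 ** (3 ** 2) = 512 because ** associates right-to-left; left-to-right grouping would give (2 ** 3) ** 2 = 64.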
References and Values • Subtle but important differences exist in the semantics of assignment in different imperative languages. • Based on the context, a variable may refer to the value of the variable (r-value) or to its location (l-value) – a named container for a value. • Value model of variables: an expression can be either an l-value or an r-value, based on the context in which it appears. • Built-in types can't be passed uniformly to methods expecting class-type parameters: wrapper classes, automatic boxing/unboxing. • Reference model of variables: a variable is a named reference for a value – every variable is an l-value. • E.g., integer values (like 2) are immutable. • A variable has to be dereferenced to obtain its value. • (Figure omitted: variables a = 4, b = 2, c = 2 shown under the value model, where each variable holds its own copy, and under the reference model, where b and c refer to the same value 2.)
Structured and Unstructured Flow • Assembly language: conditional and unconditional branches. • Early Fortran: relied heavily on goto statements (and labels):
        IF (A .LT. B) GOTO 10
        ...
    10  ...
 • Late 1960s: the abandonment of goto statements began. • Move to structured programming in the 1970s: • Top-down design (progressive refinement). • Modularization of code. • Descriptive variable names. • Within a subroutine, a well-designed imperative algorithm can be expressed with only sequencing, selection, and iteration. • Most of the structured control-flow constructs were introduced by Algol 60.
Sequencing • The principal means of controlling the order in which side effects occur. • Compound statement: a delimited list of statements. • Block: a compound statement optionally preceded by a set of declarations. • The value of a list of statements: • The value of its final element (Algol 68). • The programmer's choice (Common Lisp – not purely functional). • Can have side effects; very imperative, von Neumann. • There are situations where side effects in functions are desirable: random number generators. • Euclid and Turing: functions are not permitted to have side effects.
Selection • Selection statement: mostly some variant of if…then…else. • Languages differ in the details of the syntax. • Short-circuited conditions: • The Boolean expression is not used to compute a value but to cause control to branch to various locations. • Provides a way to generate efficient (jump) code. • Parse tree: inherited attributes of the root inform it of the address to which control should branch. • Example source:
    if ((A > B) and (C > D)) or (E ≠ F) then
        then_clause
    else
        else_clause
 • Corresponding jump code:
        r1 := A
        r2 := B
        if r1 <= r2 goto L4
        r1 := C
        r2 := D
        if r1 > r2 goto L1
    L4: r1 := E
        r2 := F
        if r1 = r2 goto L2
    L1: then_clause
        goto L3
    L2: else_clause
    L3:
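Programs also rely on short-circuit evaluation for safety, not just for efficient code; a minimal C sketch (function and variable names invented for illustration):

    /* && evaluates its right operand only when the left operand is true,
       so the dereference below never happens when p is NULL. */
    #include <stddef.h>
    #include <stdio.h>

    static int is_positive(const int *p) {
        return p != NULL && *p > 0;
    }

    int main(void) {
        int x = 3;
        printf("%d %d\n", is_positive(&x), is_positive(NULL));  /* prints: 1 0 */
        return 0;
    }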
Iteration • Iteration: a mechanism that allows a computer to perform similar operations repeatedly. • Favored in imperative languages. • Mostly some form of loops executed for their side effects: • Enumeration-controlled loops: executed once for every value in a given finite set. • Logically controlled loops: executed until some Boolean condition changes value. • Combination loops: combine the properties of enumeration-controlled and logically controlled loops (Algol 60). • Iterators: executed over the elements of a well-defined set (often called containers or collections in object-oriented code).
Recursion • Recursion requires no special syntax: why? • Recursion and iteration are equally powerful. • Most languages provide both iteration (more "imperative") and recursion (more "functional"). • Tail-recursive function: additional computation never follows a recursive call. The compiler can reuse the space, i.e., there is no need for dynamic allocation of stack space:
    int gcd(int a, int b) {
        if (a == b) return a;
        else if (a > b) return gcd(a - b, b);
        else return gcd(a, b - a);
    }
 • Sometimes simple transformations are sufficient to produce tail-recursive code: continuation-passing style.
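Because both recursive calls in gcd are in tail position, a compiler can replace them with jumps that reuse the current stack frame. A sketch of the loop it can effectively produce:

    /* The tail-recursive gcd rewritten as the loop a tail-call-optimizing
       compiler can effectively generate: each "call" becomes a jump back
       to the top of the function, reusing a single stack frame. */
    int gcd_iter(int a, int b) {
        while (a != b) {
            if (a > b) a = a - b;      /* was: return gcd(a - b, b); */
            else       b = b - a;      /* was: return gcd(a, b - a); */
        }
        return a;
    }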
Data Types (Ch. 7) • Implicit context for many operations: • The programmer does not have to specify the context explicitly. • Example: in C, the expression a+b will use integer addition if a and b are integers, and floating-point addition if a and b are floating-point values. • Limit the set of operations that may be performed in a semantically valid program: • Example: prevent adding a character and a record. • Type checking cannot prevent all meaningless operations. • It catches enough of them to be useful.
Classification of Types • Discrete (ordinal) types – countable: integer, boolean, char, enumeration, and subrange. • Scalar (simple) types - one-dimensional: discrete, rational, real, and complex. • Composite types: • Records (structures). • Variant records (unions). • Arrays; strings are arrays of characters. • Sets: the mathematical powerset of their base types. • Pointers: l-values. • Lists: no notion of mapping or indexing. • Files.
Type Systems • A type system consists of: • A mechanism to define types and associate them with certain language constructs. • A set of rules for type equivalence, type compatibility, and type inference: • Type equivalence: when are the types of two values the same? • Type compatibility: when can a value of type A be used in a context that expects type B? • Type inference: what is the type of an expression, given the types of the operands? • Compatibility is the more useful concept, because it tells you what you can do. • Polymorphism results when the compiler finds that it doesn't need to know certain things. • Subroutines need to have types if they are first- or second-class values.
Type Checking • Type checking is the process of ensuring that a program obeys the language's type compatibility rules. • Type clash: a violation of these rules. • Strong typing means that the language prevents you from applying an operation to data on which it is not appropriate. • Static typing: the compiler can check everything at compile time. • Examples: • Common Lisp is strongly typed, but not statically typed. • Ada is statically typed; Pascal is almost statically typed. • C is less strongly typed than Pascal, e.g. unions (see the sketch below), subroutines with variable numbers of parameters. • Java is strongly typed, with a non-trivial mix of things that can be checked statically and things that have to be checked dynamically. • Scripting languages are generally dynamically typed.
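A small C sketch of one such loophole: a union lets the same storage be read back as an unrelated type without any complaint from the type checker. The value printed depends on the platform's float representation, so this is purely illustrative.

    /* Unions as a type-system loophole in C: the compiler does not track
       which member was last written, so nothing stops us from writing a
       float and reading the same bits back as an int. */
    #include <stdio.h>

    union loophole {
        int   i;
        float f;
    };

    int main(void) {
        union loophole u;
        u.f = 1.0f;              /* store a float...                    */
        printf("%d\n", u.i);     /* ...and reinterpret the bits as int  */
        return 0;
    }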