Languages and grammars • A (formal) language is a set of finite strings over a finite alphabet. • An alphabet is just a finite set of symbols (e.g., ASCII, Unicode).
Programs and languages • A string is a program in programming language L iff it is a member of L. • Question: how to determine whether a string is a member of a language L? • The answer depends on the notions of constituency and linear order.
Grammar rules • Language can be defined in terms of rules (or productions). • These rules specify the constituents of its members (and of their constituents), and the order in which they must appear. • Pieces without constituents (terminals) must be members of the alphabet.
A sample grammar rule • The rule • <program> ::= begin <block> end . • says that a program may (not must) consist of • the terminal "begin" • followed by a block constituent, • followed by the terminal "end" • followed by the terminal "." • The symbol <block> is a nonterminal -- it does not correspond to a member of the alphabet. • Actually, if the alphabet is ASCII or Unicode, none of the three terminal symbols belongs to the alphabet either, although unlike <block>, these symbols are expected to appear in the program text. We address this issue later.
Grammars • Additional rules can give the constituents of constituents like blocks. • Rules that define a language in this way make up a grammar. • Grammars must also identify • which nonterminals correspond to language members (e.g., <program>) • which symbols are terminal symbols
Context-free grammars • The only grammars we will consider are context-free grammars (CFGs). • In a CFG, each rule has a nonterminal on its left-hand side (LHS), and a string of symbols (terminals or nonterminals) on its right-hand side (RHS). • Nonterminals are also called variables.
Notation for rules • Nonterminals may be distinguished from terminals by • delimiting them with angle brackets, or • beginning them with a capital letter, or writing them in italics, or • printing terminals in bold face. • The LHS and RHS of a rule are separated by a "->" or "::=" symbol.
Notation for combining rules • Rules with the same LHS represent optionality, e.g. • <operator> ::= + • <operator> ::= - • Such rules may be combined using a vertical bar convention, e.g. • <operator> ::= + | - • Any number of rules with the same LHS can be combined this way. • Note that the vertical bar is neither a terminal nor a nonterminal. • Sometimes such a symbol is called a metasymbol. • The notational conventions described above are called Backus-Naur form, or BNF.
Grammar summary • In summary, a grammar is specified by • a finite set of terminals, • a finite set of nonterminals, • a finite set of rules (or productions), and • a start symbol (a nonterminal) • The start symbol tells what it is that the grammar is defining.
Grammars and parsing • Any CFG G defines a language L(G) -- the set of strings that can be generated from its start symbol using its rules. • Proving that a string has the correct constituency and linear ordering properties to be in L(G) is called parsing.
Parsing and parse trees • One way of summarizing a parse is with a parse tree (cf. T&N, Sec 2.1.3). • Here parent nodes correspond to LHSs of rules, children (in order) to RHSs, and leaves to terminals. • The string of leaves (from left to right) is the yield of the parse tree.
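As a concrete illustration (not from T&N), a parse tree can be represented as nested tuples, with the yield computed by collecting leaves left to right. The tree below is a hypothetical parse of "the dog sleeps":

```python
# A parse tree as nested tuples: (nonterminal, child, child, ...);
# leaves (terminals) are plain strings.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "sleeps")))

def tree_yield(t):
    """Return the left-to-right string of leaves of a parse tree."""
    if isinstance(t, str):          # a leaf (terminal)
        return [t]
    leaves = []
    for child in t[1:]:             # children, in order
        leaves.extend(tree_yield(child))
    return leaves

print(" ".join(tree_yield(tree)))   # the dog sleeps
```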
Terminals and the alphabet • We still have not resolved a mismatch between terminals and the alphabet: • We have suggested that alphabets are sets of characters. • We have said that terminals must be members of the alphabet. • But our sample terminals (e.g., begin, end) were not characters.
Tokens • One way out: replace "begin" by <begin> and add a rule with five RHS terminals, i.e. • <begin> -> b e g i n • However there are independent reasons to instead treat such substrings as special entities called tokens.
Typed tokens • It helps to identify tokens representing identifiers, integer literals, etc. as instances of a type (or category). • Reserved words like "begin" would be the only instances of their types. • Type names can be represented as strings with the same spelling as the category name, or as members of an enumerated type.
Lexical analysis • Real parsers group characters into (typed) tokens in a preprocessing step called lexical analysis (or scanning). • This step allows CFGs • to have terminals with multiple characters • to treat nonterminals representing types as special terminals called preterminals. • The constituency of preterminals is handled by special scanning rules.
Why "lexical"? • CFGs for English tend to have preterminals N, V, P, etc. for lexical categories noun, verb, preposition, etc. • Rules for these preterminals form a lexicon -- a list of words in the language labeled with their categories • Omitting these rules allows generation of much of English by a simple CFG.
A CFG for a fragment of English • S -> NP VP • NP -> Det N • NP -> Det N PP • PP -> P NP • VP -> V
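The fragment above can be made runnable by adding a small hypothetical lexicon (the Det/N/P/V entries below are illustrative assumptions, not part of the slide's grammar). A sketch of random generation from the start symbol S:

```python
import random

# The slide's CFG, plus an assumed lexicon for the preterminals.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "N", "PP"]],
    "PP":  [["P", "NP"]],
    "VP":  [["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["park"]],
    "P":   [["in"]],
    "V":   [["sleeps"]],
}

def generate(symbol, depth=0):
    """Expand a symbol top-down; terminals are strings not in GRAMMAR."""
    if symbol not in GRAMMAR:
        return [symbol]
    rules = GRAMMAR[symbol]
    # Fall back to the first (non-recursive) RHS past a depth bound,
    # so the NP -> Det N PP recursion cannot run forever.
    rhs = rules[0] if depth > 4 else random.choice(rules)
    out = []
    for sym in rhs:
        out.extend(generate(sym, depth + 1))
    return out

print(" ".join(generate("S")))
```

Since VP has only one expansion here, every generated sentence ends in "sleeps"; the PP recursion is what makes the language infinite.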
EBNF • Occasionally, certain extensions to BNF notation are convenient. • The term EBNF (for extended Backus-Naur form) is used to cover these extensions. • These extensions introduce new metasymbols, given below with their interpretations.
EBNF constructions • ( ) parentheses, for removing ambiguity, e.g. • (a|b)c vs. a | bc • [ ] brackets, for optionality (0 or 1 times) • { } braces, for indefinite repetition (0 or more times) • Sometimes the first of these is considered part of ordinary BNF.
A very simple grammar • S -> x | x S S • This grammar generates all strings of x's of odd length.
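A quick sanity check of this claim: the derivable lengths L satisfy 1 ∈ L, and by the rule S -> x S S, 1 + a + b ∈ L whenever a, b ∈ L. A small fixed-point computation (a sketch) confirms that exactly the odd lengths appear:

```python
def derivable_lengths(max_len):
    """Lengths of strings of x's derivable from S, up to max_len."""
    lengths = {1}                       # S -> x
    changed = True
    while changed:
        changed = False
        for a in list(lengths):
            for b in list(lengths):
                n = 1 + a + b           # S -> x S S
                if n <= max_len and n not in lengths:
                    lengths.add(n)
                    changed = True
    return lengths

print(sorted(derivable_lengths(15)))    # [1, 3, 5, 7, 9, 11, 13, 15]
```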
An ambiguous grammar for algebraic expressions • E -> E + E | E * E • E -> x | y • E -> ( E ) • Note that here the parenthesis symbols are terminal symbols of the grammar (not metasymbols) • An unambiguous grammar for algebraic expressions • E -> T | E + T • T -> F | T * F • F -> x | y | ( E ) • Once again the parenthesis symbols are terminals
A grammar for a simple class of identifiers • <identifier> ::= <nondigit> • <identifier> ::= <identifier> <nondigit> • <identifier> ::= <identifier> <digit> • Note that we assume that digits and nondigits are identified by the scanner, and not the parser
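Since the scanner classifies the characters, the grammar amounts to: a nondigit followed by any mixture of nondigits and digits. A membership-check sketch (treating letters plus underscore as the nondigit set is an assumption):

```python
import string

NONDIGIT = set(string.ascii_letters + "_")   # assumed nondigit set
DIGIT = set(string.digits)

def is_identifier(s):
    # <identifier> ::= <nondigit>
    #               |  <identifier> <nondigit>
    #               |  <identifier> <digit>
    # i.e., a nondigit followed by any nondigits and digits.
    return (len(s) > 0
            and s[0] in NONDIGIT
            and all(c in NONDIGIT or c in DIGIT for c in s))
```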
if-statements in C • <selection-statement> ::= • if ( <expression> ) <statement> [ else <statement> ] | … • <statement> ::= • <compound-statement> | … • <compound-statement> ::= • { [<declaration-list>] [<statement-list>] } • Here the braces are terminal symbols
if-statements in Ada • <if-statement> ::= • if <boolean-condition> then • <sequence-of-statements> • { elsif <boolean-condition> then • <sequence-of-statements> } • [else <sequence-of-statements>] • end if ;
statements in Ada • <statement> ::= • null | • <assignment-statement> | • <if-statement> | • <loop-statement> | ...  • <sequence-of-statements> ::= • <statement> { <statement> }
Translation steps (idealized) • character string • lexical analysis (scanning, tokenizing) • string of tokens • syntactic analysis (parsing) • parse tree (or syntax tree) • semantic analysis, ...
Scanning (lexical analysis) • Scanning could be done by a parser, but a special-purpose scanner is generally more efficient. • Issues in recognizing tokens include the longest-substring rule and the handling of white space. • The scanner needs to identify categories of tokens for the parser.
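A minimal scanner sketch (the token categories and reserved words below are illustrative assumptions, not a real language's): at each position the category patterns are tried in order, the regex engine's greedy matching implements the longest-substring rule within a category, and white space is consumed without producing a token.

```python
import re

# Illustrative token categories; each pattern is matched greedily,
# so "x12" becomes one ID token, not three.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+*()=]"),
    ("SKIP",   r"\s+"),
]
KEYWORDS = {"begin", "end"}   # reserved words: IDs with special spellings

def scan(text):
    """Turn a character string into a list of (category, spelling) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if kind == "ID" and m.group() in KEYWORDS:
                    tokens.append((m.group().upper(), m.group()))
                elif kind != "SKIP":   # white space yields no token
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"bad character {text[pos]!r}")
    return tokens
```

A reserved word like "begin" is recognized as an ID first and then reclassified, which is one common way real scanners handle keywords.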
Two parsing strategies • bottom up (shift-reduce) • match tokens with RHS's of rules • when a full RHS is found, replace it by the LHS • top down (recursive descent) • expand the rules, matching input tokens as predicted by rules
Categories of tokens • keywords • reserved words, predefined identifiers • literals (cf. constants) • numeric, string, Boolean, array, enumeration members, Lisp lists, … • identifiers
Recursive descent parsing • A recursive descent parser has one recognizer function per nonterminal. • In the simplest case, each recognizer calls the recognizers for the nonterminals on the RHS. • e.g., the rule S -> NP VP would have a recognizer s() with body • np(); vp();
Complications in recursive descent • scanning issues • RHSs with terminals • conflict between two rules with the same LHS • optionality (including indefinite repetition) • output and error handling
Terminal symbols • Terminal symbols may be handled by matching them with the next unread symbol in the input. • That is, one lookahead symbol is checked. • If there is a match, the next unread symbol is updated. • Else there is a syntax error in the input.
Example with terminal symbols • For example, the rule F -> ( E ) could give a recognizer f() with body • match('('); • e(); • match(')');
Rule conflict • If there is more than one rule for a nonterminal, a conditional statement can be used. • The condition can involve the lookahead token. • An example is given for the nonterminal "primary" in T&N, p. 79.
Optionality • Optionality (the use of brackets in EBNF) effectively gives multiple rules for the nonterminal on the LHS. • e.g., the "factor" recognizer, T&N, p. 79. • The same applies to indefinite repetition (the use of braces in EBNF). • Here the repetition may be handled by a while loop, (cf. "term", T&N p. 79).
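Putting these pieces together, here is a sketch of a recursive descent recognizer for the unambiguous expression grammar, rewritten in EBNF so that repetition is handled by while loops: E -> T { + T }, T -> F { * F }, F -> x | y | ( E ). Tokens are single characters here, so no separate scanner is needed (an assumption made for brevity, not T&N's code):

```python
class Parser:
    def __init__(self, text):
        self.tokens = list(text)   # one-character tokens
        self.pos = 0

    def lookahead(self):
        """The next unread token, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, t):
        """Consume terminal t, or report a syntax error."""
        if self.lookahead() == t:
            self.pos += 1
        else:
            raise SyntaxError(f"expected {t!r}, saw {self.lookahead()!r}")

    def e(self):                       # E -> T { + T }
        self.t()
        while self.lookahead() == "+":
            self.match("+")
            self.t()

    def t(self):                       # T -> F { * F }
        self.f()
        while self.lookahead() == "*":
            self.match("*")
            self.f()

    def f(self):                       # F -> x | y | ( E )
        if self.lookahead() in ("x", "y"):
            self.match(self.lookahead())
        elif self.lookahead() == "(":
            self.match("(")
            self.e()
            self.match(")")
        else:
            raise SyntaxError(f"unexpected {self.lookahead()!r}")

def accepts(text):
    """True iff text is generated by E and all input is consumed."""
    p = Parser(text)
    try:
        p.e()
        return p.pos == len(p.tokens)
    except SyntaxError:
        return False
```

Note how each recognizer's conditional and loop tests consult only the one lookahead token, as described above.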
Rule conflict -- details • If a nonterminal Y has several rules with RHSs α, β, γ, ..., we've seen that Y's recognizer uses a conditional statement. • The conditional's first case will be used if the lookahead symbol is in First(α), the second case if it's in First(β), etc. • Here, First(X) is the set of terminals that may begin the yield of X.
The First function • T&N describe an algorithm for computing the First function for any grammar symbol. • It may be used to find all values of First(X). • In recursive descent parsing, First(X) must be disjoint from First(Y) for any two RHSs X and Y (for the same LHS).
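This is not T&N's presentation verbatim, but a sketch of the standard fixed-point computation of First for a grammar without epsilon rules (epsilon handling is omitted for brevity), using the unambiguous expression grammar from an earlier slide:

```python
# The unambiguous expression grammar; terminals are any symbols
# that do not appear as keys.
GRAMMAR = {
    "E": [["T"], ["E", "+", "T"]],
    "T": [["F"], ["T", "*", "F"]],
    "F": [["x"], ["y"], ["(", "E", ")"]],
}

def first_sets(grammar):
    """Compute First for every nonterminal (epsilon-free grammars only)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                head = rhs[0]
                # A terminal contributes itself; a nonterminal
                # contributes its current First set.
                new = {head} if head not in grammar else first[head]
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first
```

For this grammar First(E) = First(T) = First(F) = {x, y, (}, which is why F's single recognizer can branch on the lookahead but E's two RHSs could not (they share First sets, foreshadowing the left recursion problem below).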
Left recursion • Recursive descent parsing requires the absence of left recursion. • In left recursion, a nonterminal starts the RHS of one of its rules, as in • E -> E + E | T • If t is the first token of a string generated from T, and also the lookahead token, the parser can't decide which E rule to apply; worse, expanding the left-recursive rule calls E's recognizer again without consuming any input, so the recursion never terminates.
Another potential problem • Another problem for recursive descent parsers arises from optionality. • Such a parser using a rule NP -> Det {Adj} N can't tell whether "rich" is an N or an Adj in a sentence beginning with • the rich • This problem can be dealt with in terms of a Follow function (cf. Louden).
Abstract syntax trees • (Abstract) syntax trees may replace parse trees as the interface between syntactic and semantic processing -- cf. T&N, Section 2.5.1. • Symbols irrelevant to semantic processing needn’t appear in syntax trees. • So the form of a syntax tree is not completely determined by the grammar.
Bindings • A variable (cf. T&N, p. 88) is an entity with attributes including name, address, type, and value. • Much semantic behavior can be understood in terms of the binding of attributes to their values. • Many differences between programming languages can be understood in terms of when and how such bindings are made.
Static vs. dynamic • The following terms apply to bindings (and many other concepts): • Static • pertaining to compilation time • (or more generally, to a time before execution time) • Dynamic: • pertaining to run time (execution time)
Examples of static bindings • value • of predefined identifiers • of largest possible int • address • for global variables (relative to beginning of program storage)
More examples of static bindings • isConstant • for variables • type • for variables • body & arity & return type & parameter type(s) • for functions in C (local or external)
Examples of dynamic bindings • value • for typical variables • address • for local variables (cf. use of new) • parameter value • for functions • method body • for methods in Java
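The last case can be illustrated in Python rather than Java (the classes here are hypothetical): which method body runs is bound dynamically, by the run-time class of the receiver, not by anything the compiler can see at the call site.

```python
class Animal:
    def speak(self):
        return "..."

class Dog(Animal):
    def speak(self):          # overrides Animal's body
        return "woof"

def call_speak(a):
    # The body for speak() is chosen at run time from a's class;
    # the same call site can run different bodies.
    return a.speak()
```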
Lvalues and Rvalues • Most languages give different interpretations to variables on different sides of an assignment operator, e.g. • a := b; • The first (the lvalue) refers to an address while the second (the rvalue) refers to a value.
Creating bindings • Bindings for user-defined variables are created by declarations, which for us include • explicit declaration, • implicit declaration, and • definition
Resolving bindings • Many languages allow reuse of names during program execution. • So when a program uses a name, it needs to know what the bindings are for that name then and there. • That is, it needs to know which declaration for the name is being used to determine the bindings.
Scope and scoping policies • Every language has a scoping policy to determine which bindings apply. • For T&N, the scope of a name is that portion of a program for which the name's bindings are in effect. • Scoping policies may specify those program segments that may serve as scopes of bindings -- T&N call these segments simply scopes.
Languages and scopes • What may count as a scope is language-dependent (cf. Table 4.1, p. 90, T&N). • Typical possibilities are • function definitions • class definitions • loops • compilation units • compound statements (blocks)
Overlapping scopes • Languages may allow certain types of scopes to be nested. • Scopes may not otherwise overlap. • So the notion of moving outward from one scope to another is always well-defined.
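Nested scopes can be illustrated with Python's nested functions (a sketch; function bodies are one kind of scope that Python allows to nest): each inner scope may rebind a name, and lookup of a name moves outward scope by scope.

```python
x = "global"

def outer():
    x = "outer"               # shadows the global binding of x
    def inner():
        x = "inner"           # shadows outer's binding of x
        return x
    # inner() sees its own x; outer still sees its own.
    return inner(), x

print(outer())   # ('inner', 'outer')
print(x)         # global -- the outermost binding is untouched
```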