
  1. CS 152, Programming Paradigms • Fall 2011, SJSU • Jeff Smith

  2. Programs as text strings • We can think of programs as strings of • characters, or • tokens (substrings similar to words), or • typed tokens • We’ll consider (untyped) tokens first. • So programs are sequences of tokens • cf. English sentences as sequences of words • And languages are sets of programs

  3. Identifying tokens • Splitting a program into tokens is called • lexical analysis, or • lexing, or • scanning • We’ll touch on algorithms for doing this later.
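
  As a concrete illustration, here is a minimal Python sketch of a lexer; the token classes, the := operator, and the sample input are assumptions made just for this example, not part of the slides:

      import re

      def lex(program):
          # identifiers, numbers, the := operator, or any other
          # single non-whitespace character, scanning left to right
          return re.findall(r"[A-Za-z_]\w*|\d+|:=|\S", program)

      print(lex("begin x := x + 12 end."))
      # ['begin', 'x', ':=', 'x', '+', '12', 'end', '.']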

  4. Languages and grammars • A (formal) language is a set of finite strings over a finite alphabet. • An alphabet is a finite set of symbols, e.g. • the ASCII characters • the Unicode characters • the legal token types of Java • But not the legal tokens of Java!

  5. Programs and languages • A string is a program in programming language L iff it is a member of L. • So how does one determine whether a string is a member of a language L? • The answer depends on the notions of constituency and linear order.

  6. Grammar rules • A language can be defined in terms of rules (or productions). • These rules specify the constituents of its members (and of their constituents), and the order in which they must appear. • Pieces without constituents (terminals) must be members of the alphabet.

  7. A sample grammar rule • The rule • <program> ::= begin <block> end . • says that a program may consist of • the terminal "begin" • followed by a block constituent, • followed by the terminal "end" • followed by the terminal "." • Here, terminals correspond to tokens.

  8. Grammars • Additional rules would give the constituents of blocks and other constituents. • A grammar specifies a language by • listing the legal terminal symbols, • listing the legal nonterminal symbols, • (e.g., <program>, <block>), • listing the rules, • saying which nonterminal is the start symbol (represents a language member)
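
  To make these four components concrete, here is one hypothetical Python encoding of the grammar fragment from the previous slide (the representation conventions are assumptions for the sketch):

      # A grammar as data: terminals, nonterminals, rules, start symbol.
      grammar = {
          "terminals":    {"begin", "end", "."},
          "nonterminals": {"<program>", "<block>"},
          "start":        "<program>",
          "rules": [
              ("<program>", ["begin", "<block>", "end", "."]),
              # rules giving the constituents of <block> would follow
          ],
      }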

  9. Context-free grammars • The only grammars we will consider are context-free grammars (CFGs). • In a CFG, each rule has • a nonterminal on its left-hand side (LHS) • a string of symbols (terminals or nonterminals) on its right-hand side (RHS). • Nonterminals are also called variables.

  10. Notation for rules • Nonterminals may be distinguished from terminals by • delimiting them with angle brackets, or • beginning them with a capital letter, or writing them in italics, or • printing terminals in bold face. • The LHS and RHS of a rule are separated by a "->" or "::=" symbol.

  11. Notation for combining rules • Rules with the same LHS represent optionality, e.g. • <operator> ::= + • <operator> ::= - • Such rules may be combined using a vertical bar convention, e.g. • <operator> ::= + | - • Any number of rules with the same LHS can be combined this way.

  12. Metasymbols and BNF • In the rule • <operator> ::= + | - • the vertical bar is a metasymbol • it’s neither a terminal nor a nonterminal. • The notational conventions described above are called Backus-Naur form, or BNF.

  13. Grammars and parsing • Any CFG G defines a language L(G) • L(G) is the set of strings that can be generated from G’s start symbol using its rules. • Proving that a string is in L(G) is called parsing. • Parsing algorithms use the rules of G to show that the string has the correct constituency with the correct linear order to be in L(G).

  14. Parsing and parse trees • One way of summarizing a parse is with a parse tree (cf. T&N, Section 2.1.3). • Here parent nodes correspond to LHSs of rules, children (in order) to RHSs, and leaves to terminals. • The string of leaves (from left to right) is the yield of the parse tree.
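
  As a sketch, a parse tree can be stored as nested tuples, with a nonterminal label followed by its children in order; the yield is then a simple left-to-right traversal. The encoding (and the sample tree, which uses the English grammar of slide 22) is an assumption:

      def tree_yield(node):
          """The string of leaves, from left to right."""
          if isinstance(node, str):        # a leaf: a terminal
              return [node]
          leaves = []
          for child in node[1:]:           # children, in order
              leaves += tree_yield(child)
          return leaves

      t = ("S", ("NP", ("Det", "the"), ("N", "dog")),
                ("VP", ("V", "chased"),
                       ("NP", ("Det", "a"), ("N", "cat"))))
      print(tree_yield(t))   # ['the', 'dog', 'chased', 'a', 'cat']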

  15. Identifying tokens (again) • We haven’t yet said how to identify tokens • given that programs are character strings • Languages may require tokens to be • single characters, or • delimited by whitespace characters, or • of bounded length • But these restrictions are rare • they inconvenience programmers

  16. Grammars for token types • CFGs may be used for token types, e.g. • <identifier> ::= <letter> • <identifier> ::= <letter> <identifier> • would recognize nonempty strings of letters • But this can introduce ambiguity • e.g., is the string doif 1 token? 2? 3? 4?

  17. Scanning in real parsers • It’s generally efficient for a parser to have a special preprocessing step for scanning. • Scanning algorithms are simplified parsers • using CFGs for token types • such CFGs are generally simple • for details, see CS 154 • and its treatment of regular expressions

  18. Scanning issues • It’s common for scanners to • work left to right • treat whitespace characters as delimiters • otherwise disambiguate by choosing the longest of the possible tokens • determine a type for each token • Types may be singleton types or not, e.g. • a token if might be in a type by itself • a token type identifier would presumably have many instances
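
  A Python sketch of a scanner following these conventions; the token types, their patterns, and the tie-breaking rule (list order, so keywords beat identifiers of the same length) are assumptions for the example:

      import re

      TOKEN_TYPES = [                       # tried in order on ties
          ("KEYWORD",    r"if|else|while"),
          ("IDENTIFIER", r"[A-Za-z_]\w*"),
          ("NUMBER",     r"\d+"),
          ("OPERATOR",   r"[+\-*/=<>]"),
      ]

      def scan(program):
          """Left to right; skip whitespace; prefer the longest match."""
          tokens, i = [], 0
          while i < len(program):
              if program[i].isspace():      # whitespace delimits tokens
                  i += 1
                  continue
              best = None
              for ttype, pattern in TOKEN_TYPES:
                  m = re.match(pattern, program[i:])
                  if m and (best is None or len(m.group()) > len(best[1])):
                      best = (ttype, m.group())
              if best is None:
                  raise SyntaxError("bad character %r" % program[i])
              tokens.append(best)
              i += len(best[1])
          return tokens

      print(scan("if x1 < 2"))
      # [('KEYWORD', 'if'), ('IDENTIFIER', 'x1'),
      #  ('OPERATOR', '<'), ('NUMBER', '2')]

  Note that the input ifx1 would scan as a single IDENTIFIER, since that match is strictly longer than the keyword if.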

  19. Categories of tokens and token types • Keywords • reserved words, predefined identifiers • generally in a type by themselves • Literals (cf. constants) • numeric, string, Boolean, array, enumeration members, lists, … • Identifiers • for variables, functions, data types, …

  20. Typed tokens • Grammar symbols representing token types • are sometimes called preterminals • are treated as terminals by the parser • The tokens themselves are to be recognized by a scanner • may appear as nonleaves in parse trees • with a single child representing the token

  21. Why "lexical"? • CFGs for English can have preterminals • especially for lexical categories • e.g., N (for nouns), V (for verbs), … • Rules for these preterminals form a lexicon • a list of words labeled with their categories • This allows for a simple CFG for English • or at least a healthy fragment of English

  22. A CFG for a fragment of English • S -> NP VP • NP -> Det N • NP -> Det N PP • PP -> P NP • VP -> V NP • Here N, V, P, and Det are preterminals. • The sentence the dog chased a cat would look like Det N V Det N to the parser.
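
  For example, the token string Det N V Det N is justified by the derivation

      S => NP VP => Det N VP => Det N V NP => Det N V Det N

  in which each step expands the leftmost nonterminal using one rule of the grammar.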

  23. EBNF • Occasionally, certain extensions to BNF notation are convenient. • The term EBNF (for extended Backus-Naur form) is used to cover these extensions. • These extensions introduce new metasymbols, given below with their interpretations.

  24. EBNF constructions • ( ) parentheses, for removing ambiguity, • e.g., (a|b)c vs. a|bc • [ ] brackets, for optionality • 0 or 1 times • { } braces, for indefinite repetition • 0 or more times • Sometimes the first of these is considered part of ordinary BNF.

  25. A very simple grammar • S -> x | x S S • Here, the single terminal represents a token. • This grammar generates all strings of x's of odd length.
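
  For example, the length-5 string x x x x x has the derivation

      S => x S S => x x S => x x x S S => x x x x S => x x x x x

  Each use of the second rule adds one S on balance, so it forces exactly one extra use of the first rule; the total number of x's is therefore 1 + 2·(uses of the second rule), which is always odd.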

  26. Grammars for algebraic expressions • An ambiguous grammar G: • E -> E + E | E * E | ( E ) | x | y • Here the parentheses aren’t metasymbols • They are terminal symbols of the grammar • Like the other terminals, they represent tokens • An unambiguous grammar for L(G) • E -> T | E + T • T -> F | T * F • F -> x | y | ( E )

  27. Ambiguity • Ambiguity is a property of grammars • A language can have both ambiguous and unambiguous grammars – see the previous slide • A grammar G is ambiguous iff some string in L(G) has two or more legal parse trees • That slide’s second grammar resolves the ambiguity with respect to both associativity and precedence • cf. T&N, Sections 2.1.4, 2.1.5, and 7.2.2
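
  For example, in the ambiguous grammar G above, the string x + y * x has two parse trees, summarized by two leftmost derivations:

      E => E + E => x + E => x + E * E => x + y * E => x + y * x   (grouping x + (y * x))
      E => E * E => E + E * E => x + E * E => x + y * E => x + y * x   (grouping (x + y) * x)

  The unambiguous grammar allows only the first grouping, giving * higher precedence than +.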

  28. A grammar for a simple class of identifiers • <identifier> ::= <nondigit> • <identifier> ::= <identifier> <nondigit> • <identifier> ::= <identifier> <digit> • Note the absence of rules for <digit> and <nondigit> • These are preterminal symbols • corresponding to token types to be identified by the scanner rather than the parser

  29. if-statements in C (cf. Kernighan & Ritchie) • <selection-statement> ::= • if ( <expression> ) <statement> • [ else <statement> ] | … • <statement> ::= • <compound-statement> | … • <compound-statement> ::= • { [<declaration-list>] [<statement-list>] } • Here the braces are terminal symbols

  30. if-statements in Ada • <if-statement> ::= • if <boolean-condition> then • <sequence-of-statements> • { elsif <boolean-condition> then • <sequence-of-statements> } • [else <sequence-of-statements>] • end if ;

  31. statements in Ada • <statement> ::= • null | • <assignment-statement > | • <if-statement> | • <loop-statement> | ... • <sequence-of-statements> ::= • <statement> { <statement> } • Translation steps (idealized) • character string • lexical analysis (scanning, tokenizing) • string of tokens • syntactic analysis (parsing) • parse tree (or syntax tree) • semantic analysis, ...

  32. A BNF grammar for the Scheme language • <expression> ::= <atom> | <list> • <atom> ::= <literal> | <identifier> • <list> ::= () | ( <expressions> ) • <expressions> ::= <expression> | • <expression> <expressions> • Here <literal> and <identifier> are preterminals; the parentheses are terminals in types by themselves.

  33. Two parsing strategies • bottom up (shift-reduce) • match tokens with RHS's of rules • when a full RHS is found, replace it by the LHS • top down (recursive descent) • expand the rules, matching input tokens as predicted by rules

  34. Recursive descent parsing • A recursive descent parser has one recognizer function per nonterminal. • In the simplest case, each recognizer calls the recognizers for the nonterminals on the RHS. • e.g., the rule S -> NP VP would have a recognizer s() with body • np( ); vp( );

  35. Complications in recursive descent • scanning issues • RHSs with terminals • conflict between rules with the same LHS • optionality • including indefinite repetition • output • error handling

  36. Terminal symbols • Terminal symbols may be handled by matching them with the next unread symbol in the input. • That is, one lookahead symbol is checked. • If there is a match, the next unread symbol is updated. • If not, there is a syntax error in the input.

  37. Example with terminal symbols • For example, the rule F -> ( E ) could give a recognizer f() with body • match( '(' ); • e( ); • match( ')' );

  38. Rule conflict • If there is more than one rule for a nonterminal, a conditional statement can be used. • The condition can involve the lookahead token. • An example is given for the nonterminal primary in T&N, p. 79.

  39. Optionality • Optionality (the use of brackets in EBNF) effectively gives multiple rules for the nonterminal on the LHS. • e.g., the factor recognizer, T&N, p. 79. • The same applies to indefinite repetition (the use of braces in EBNF) • Here the repetition may be handled by a while loop, • e.g. the term recognizer, T&N p. 79
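
  Putting slides 34-39 together, here is a Python sketch of a recursive descent recognizer for the unambiguous expression grammar of slide 26, written in the EBNF style E -> T { + T }, T -> F { * F }, F -> x | y | ( E ). Tokens are single characters, and all names and conventions are assumptions:

      tokens = []      # the input, as a list of one-character tokens
      pos = 0          # index of the lookahead token

      def lookahead():
          return tokens[pos] if pos < len(tokens) else None

      def match(expected):
          global pos
          if lookahead() == expected:
              pos += 1                  # consume the lookahead token
          else:
              raise SyntaxError("expected %r, saw %r" % (expected, lookahead()))

      def e():                          # E -> T { + T }
          t()
          while lookahead() == '+':     # indefinite repetition: a while loop
              match('+')
              t()

      def t():                          # T -> F { * F }
          f()
          while lookahead() == '*':
              match('*')
              f()

      def f():                          # F -> x | y | ( E )
          if lookahead() in ('x', 'y'): # rule conflict: test the lookahead
              match(lookahead())
          elif lookahead() == '(':
              match('(')
              e()
              match(')')
          else:
              raise SyntaxError("unexpected token %r" % (lookahead(),))

      def parse(token_list):
          global tokens, pos
          tokens, pos = token_list, 0
          e()
          if lookahead() is not None:
              raise SyntaxError("extra input after the expression")

      parse(list("x+(y*x)"))            # succeeds quietly
      # parse(list("x+*y"))             # would raise SyntaxError

  Repetition in the EBNF becomes a while loop (slide 39), and the conflict among the three F rules is resolved by checking the lookahead token (slide 38).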

  40. Rule conflict -- details • If a nonterminal Y has several rules with RHSs α, β, γ, ..., we've seen that Y's recognizer uses a conditional statement. • If the conditional's lookahead symbol • is in First(α), one case will be used • is in First(β), another case will be used • etc. • Here, First(X) is the set of terminals that may begin the yield of X.

  41. The First function • For simple grammars, the First function may be easy to compute by hand. • T&N give a general algorithm for computing it -- which may be used even if the argument is a sequence of symbols. • In recursive descent parsing, First(X) must be disjoint from First(Y) for any two RHSs X and Y (for the same LHS).
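
  A sketch of this computation in Python, for grammars without ε-rules (handling ε needs extra bookkeeping, as in T&N's general algorithm); the rule encoding matches the earlier grammar-as-data sketch and is an assumption:

      def first_sets(rules, nonterminals):
          """rules: a list of (LHS, RHS) pairs, each RHS a list of symbols."""
          first = {n: set() for n in nonterminals}
          changed = True
          while changed:                # iterate to a fixed point
              changed = False
              for lhs, rhs in rules:
                  head = rhs[0]         # no epsilon-rules, so RHS is nonempty
                  new = first[head] if head in first else {head}
                  if not new <= first[lhs]:
                      first[lhs] |= new
                      changed = True
          return first

      rules = [("E", ["T"]), ("E", ["E", "+", "T"]),
               ("T", ["F"]), ("T", ["T", "*", "F"]),
               ("F", ["x"]), ("F", ["y"]), ("F", ["(", "E", ")"])]
      print(first_sets(rules, {"E", "T", "F"}))
      # each of E, T, F gets the First set {'x', 'y', '('}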

  42. Left recursion • Recursive descent parsing requires the absence of left recursion. • In left recursion, a nonterminal starts the RHS of one or more of its rules, as in • E -> E + E | T • If the lookahead token t is also the first token of a string generated from T, the parser won’t know which E rule to apply.
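
  The standard cure is to rewrite left recursion as repetition. For example, E -> E + E | T can be replaced by the EBNF rule

      E -> T { + T }

  or, in pure BNF, by right-recursive rules using a new nonterminal:

      E -> T E'
      E' -> + T E' | ε

  Either way, the parser can consume a T first and then use the lookahead token to decide whether a + follows.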

  43. Another potential problem • Another problem for recursive descent parsers arises from optionality. • Given a rule NP -> Det {Adj} N, there’d be a conflict between parsing rich as an N and as an Adj in a sentence beginning with • the rich • This problem can be dealt with in terms of a Follow function (cf. Louden).

  44. Abstract syntax trees • (Abstract) syntax trees can be better than parse trees as the interface between syntactic and semantic processing • e.g., T&N, Figures 2.10, 2.13, and 7.1 • cf. Section 2.5.1 & Section 7.2.1 • For syntax trees (unlike parse trees) • Nonterminal symbols needn’t appear • The form isn’t completely determined by the grammar
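
  As a sketch of the difference: the recognizer after slide 39 can be modified to return syntax trees, with operators as interior nodes and no E, T, or F nodes at all, reusing lookahead() and match() from that sketch (the tuple encoding is an assumption):

      def e():                          # E -> T { + T }, now returning an AST
          node = t()
          while lookahead() == '+':
              match('+')
              node = ('+', node, t())   # left-associative grouping
          return node

      # t() is modified analogously; f() returns the bare token 'x' or 'y',
      # or the tree for a parenthesized E (the parentheses themselves do
      # not appear in the AST).  For x + y * x the result is
      # ('+', 'x', ('*', 'y', 'x')) -- the parse tree, by contrast,
      # records every E, T, and F node the grammar introduced.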
