CS 152, Programming Paradigms, Fall 2012, SJSU. Jeff Smith
Programs as text strings • We can think of programs as strings of • characters, or • tokens (substrings similar to words), or • token types (or typed tokens) • We’ll consider (untyped) tokens first. • So programs are finite sequences of tokens • as English sentences are sequences of words • And languages are sets of programs
Identifying tokens • Splitting a program into tokens is called • lexical analysis, or • lexing, or • scanning • A scanning algorithm may give each token a type, e.g. • identifier, integer literal, addition operator • We’ll touch on scanning algorithms later.
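As a quick illustration (the input string and type names here are invented, not from the course text), scanning the character string x = y + 42; might produce
  tokens:  x           =          y           +        42           ;
  types:   identifier  assign-op  identifier  add-op   int-literal  semicolon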
Languages and grammars • A (formal) language is a set of finite strings over a finite alphabet. • An alphabet is a finite set of symbols, e.g. • the ASCII characters • the Unicode characters • the legal token types of Java • but not the infinite set of legal tokens of Java!
Programs and languages • A string is a program in programming language L iff it is a member of L. • So how does one determine whether a string is a member of a language L? • The answer depends on the notions of constituency and linear order.
Grammar rules • A language can be defined in terms of rules (or productions). • These rules specify the constituents of its members (and of their constituents), and the order in which they must appear. • Constituents without their own constituents are called terminals. Terminals must be members of the alphabet.
A sample grammar rule • The rule • <program> ::= begin <block> end . • says that a program may consist of • the terminal "begin" • followed by a block constituent, • followed by the terminal "end" • followed by the terminal "." • Here, terminals correspond to tokens • or more precisely, to token types
Grammar rules and constituents • Terminals like begin and . and end • correspond to token types with just one instance • The constituency of <block> would be given by one or more other rules of the grammar
Grammars • A grammar specifies a language by • listing the legal terminal symbols, • listing the legal nonterminal symbols, • (e.g., <program>, <block>), • listing the rules, and • saying which nonterminal is the start symbol • and thus represents a member of the language
Context-free grammars • The only grammars we will consider are context-free grammars (CFGs). • In a CFG, each rule has • a nonterminal on its left-hand side (LHS) • a string of symbols (terminals or nonterminals) on its right-hand side (RHS).
Notation for rules • Nonterminals may be distinguished from terminals by • delimiting them with angle brackets, or • beginning them with a capital letter, or writing them in italics, or • printing terminals in bold face. • The LHS and RHS of a rule are separated by a "->" or "::=" symbol.
Notation for combining rules • Rules with the same LHS represent optionality, e.g. • <operator> ::= + • <operator> ::= - • Such rules may be combined using a vertical bar convention, e.g. • <operator> ::= + | - • Any number of rules with the same LHS can be combined this way.
Terminology for rules • Grammar rules are sometimes called productions. • Nonterminal symbols are sometimes called variables.
Metasymbols and BNF • In the rule • <operator> ::= + | - • the vertical bar is a metasymbol • it’s neither a terminal nor a nonterminal. • The notational conventions described above are called Backus-Naur form, or BNF.
Grammars and parsing • Any CFG G defines a language L(G) • L(G) is the set of strings that can be generated from G’s start symbol, consistent with G’s rules • Proving that a string is in L(G) is called parsing. • Such a proof may involve either a derivation or a parse tree.
Derivations • In this course we won’t talk much about derivations. • But intuitively, a derivation is a sequence of strings (cf. Figure 5.3, p 210) such that • the first string is the start symbol • the last string is the string to be parsed • every string can be obtained from its predecessor by a rewriting step that is licensed by a rule of the grammar
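For example, with the toy grammar E ::= E + E | x (invented for illustration) and start symbol E, the string x + x has the derivation
  E  =>  E + E  =>  x + E  =>  x + x
where each step rewrites one nonterminal using a rule of the grammar.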
Parse trees • In a parse tree (cf. L&L, Section 6.3), • parent nodes correspond to LHSs of rules • children correspond (in order) to RHSs, • leaves correspond to terminals. • The string of leaves (from left to right) is the yield of the parse tree.
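Continuing the toy example above: a parse tree for x + x under E ::= E + E | x, whose yield is x + x:
        E
      / | \
     E  +  E
     |     |
     x     x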
Identifying tokens (again) • We haven’t yet said how to identify tokens, given a program as a character string. • It would be simplest to require tokens to be • single characters, or • delimited by whitespace characters, or • of bounded length • But these restrictions are rare • they greatly inconvenience programmers
Grammars for token types • CFGs may be used for token types, e.g. • <identifier> ::= <letter> • <identifier> ::= <letter> <identifier> • would recognize nonempty strings of letters • But this can introduce ambiguity • e.g., is the string doif one token, two, three, or four?
Scanning in real parsers • It’s generally efficient for a parser to have a special preprocessing step for scanning. • Scanning algorithms are simplified parsers • using CFGs for token types • such CFGs are generally simple • for details, see CS 154 • and its treatment of regular expressions
Scanning issues • It’s common for scanners to • work left to right • treat whitespace characters as delimiters • otherwise disambiguate by choosing the longest of the possible tokens • determine a type for each token • Types may be singleton types or not, e.g. • a token if might be in a type by itself • a token type identifier might have infinitely many instances
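A minimal sketch of these conventions in Python (the token types and patterns are invented for illustration; real scanners are usually generated from regular expressions):

  import re

  # Hypothetical token types and patterns, invented for illustration.
  TOKEN_PATTERNS = [
      ("IF",     r"if"),
      ("IDENT",  r"[A-Za-z]+"),
      ("INTLIT", r"[0-9]+"),
      ("ADDOP",  r"\+"),
  ]

  def scan(text):
      tokens = []
      i = 0
      while i < len(text):
          if text[i].isspace():              # whitespace delimits tokens
              i += 1
              continue
          best = None                        # longest match seen so far
          for ttype, pattern in TOKEN_PATTERNS:
              m = re.match(pattern, text[i:])
              if m and (best is None or len(m.group()) > len(best[1])):
                  best = (ttype, m.group())
          if best is None:
              raise SyntaxError("illegal character: " + text[i])
          tokens.append(best)                # a token paired with its type
          i += len(best[1])
      return tokens

  # scan("if doif 42") == [("IF", "if"), ("IDENT", "doif"), ("INTLIT", "42")]
  # Longest match makes doif a single identifier, while if stays a keyword
  # (a tie between IF and IDENT is broken in favor of the earlier pattern).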
Categories of tokens and token types • Keywords • reserved words, predefined identifiers • generally in a type by themselves • Literals (cf. constants) • numeric, string, Boolean, array, enumeration members, lists, … • Identifiers • for variables, functions, data types, …
Typed tokens • Grammar symbols representing token types • are sometimes called preterminals • are treated as terminals by the parser • the tokens themselves are to be recognized by a scanner • may appear as nonleaves in parse trees • with a single child representing the token
Why "lexical"? • CFGs for English can have preterminals • especially for lexical categories • e.g., N (for nouns), V (for verbs), … • Rules for these preterminals form a lexicon • a list of words labeled with their categories • This allows for a simple CFG for English • or at least a healthy fragment of English
A CFG for a fragment of English • S -> NP VP • NP -> Det N • NP -> Det N PP • PP -> P NP • VP -> V NP • Here N, V, P, and Det are preterminals. • the sentence the dog chased a cat would look like Det N V Det N to the parser.
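For example, the dog chased a cat is in this language:
  S => NP VP => Det N VP => Det N V NP => Det N V Det N
and the lexicon (the rules for the preterminals) supplies the -> Det, dog -> N, chased -> V, a -> Det, cat -> N.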
EBNF • Occasionally, certain extensions to BNF notation are convenient. • The term EBNF (for extended Backus-Naur form) is used to cover these extensions. • These extensions introduce new metasymbols, given below with their interpretations.
EBNF constructions • ( ) parentheses, for removing ambiguity, • e.g., (a|b)c vs. a|bc • [ ] brackets, for optionality • 0 or 1 times • { } braces, for indefinite repetition • 0 or more times • Sometimes the first of these is considered part of ordinary BNF.
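For example (the rule names here are illustrative, not from a particular language), the EBNF rules
  <if-stmt> ::= if <expr> then <stmt> [ else <stmt> ]
  <expr> ::= <term> { + <term> }
abbreviate the plain BNF rules
  <if-stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt>
  <expr> ::= <term> | <expr> + <term>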
A very simple grammar • S -> x | x S S • Here, the single terminal represents a token. • This grammar generates all strings of x's of odd length.
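For example, xxx is generated by
  S => x S S => x x S => x x x
and applying S -> x S S to either S of x S S before finishing yields the strings of length 5, and so on.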
Grammars for algebraic expressions • An ambiguous grammar G: • E -> E + E | E * E | ( E ) | x | y • Here the parentheses aren’t metasymbols • They are terminal symbols of the grammar • Like the other terminals, they represent tokens • An unambiguous grammar for L(G) • E -> T | E + T • T -> F | T * F • F -> x | y | ( E )
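Under the ambiguous grammar G, the string x + y * x has two parse trees, one grouping it as x + (y * x):
         E
       / | \
      E  +  E
      |   / | \
      x  E  *  E
         |     |
         y     x
and one grouping it as (x + y) * x:
           E
         / | \
        E  *  E
      / | \   |
     E  +  E  x
     |     |
     x     y
The unambiguous grammar allows only the first grouping, giving * higher precedence than +.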
Ambiguity • Ambiguity is a property of grammars • A language can have both ambiguous and unambiguous grammars – see the previous slide • A grammar G is ambiguous iff some string in L(G) has two or more legal parse trees • That slide’s second grammar resolves the ambiguity with respect to both associativity and precedence • cf. L&L, Section 6.4
A grammar for a simple class of identifiers • <identifier> ::= <nondigit> • <identifier> ::= <identifier> <nondigit> • <identifier> ::= <identifier> <digit> • Note the absence of rules for <digit> and <nondigit> • These are preterminal symbols • the corresponding token types are to be identified by the scanner rather than the parser
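Assuming <nondigit> covers exactly the letters, this token type corresponds to the regular expression [A-Za-z][A-Za-z0-9]* (cf. the regular expressions of CS 154; in C, <nondigit> also includes the underscore).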
C language if statements (cf. Kernighan & Ritchie) • <selection-statement> ::= • if ( <expression> ) <statement> • [ else <statement> ] | … • <statement> ::= • <compound-statement> | … • <compound-statement> ::= • { [<declaration-list>] [<statement-list>] } • Here the braces are terminal symbols
Ada’s if statements • <if-statement> ::= • if <boolean-condition> then • <sequence-of-statements> • { elsif <boolean-condition> then • <sequence-of-statements> } • [else <sequence-of-statements>] • end if ;
General statements in Ada • <statement> ::= • null | • <assignment-statement> | • <if-statement> | • <loop-statement> | ... • <sequence-of-statements> ::= • <statement> { <statement> }
Translation steps (idealized) • character string • lexical analysis (scanning, tokenizing) • string of tokens • syntactic analysis (parsing) • parse tree (or syntax tree) • semantic analysis, ...
A BNF grammar for the Scheme language • <expression> ::= <atom> | <list> • <atom> ::= <literal> | <identifier> • <list> ::= () | ( <expressions> ) • <expressions> ::= <expression> | <expression> <expressions> • Here <literal> and <identifier> are preterminals; the parentheses are terminals in types by themselves.
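For example, (f x) is generated as follows, with f and x supplied as <identifier> tokens by the scanner:
  <expression> => <list> => ( <expressions> )
  => ( <expression> <expressions> ) => ( <atom> <expressions> )
  => ( <identifier> <expressions> ) => ( <identifier> <expression> )
  => ( <identifier> <atom> ) => ( <identifier> <identifier> )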
Two parsing strategies • bottom up (shift-reduce) • match tokens with RHS's of rules • when a full RHS is found, replace it by the LHS • top down (recursive descent) • expand the rules, matching input tokens as predicted by rules
Recursive descent parsing • A recursive descent parser has one recognizer function per nonterminal. • In the simplest case, each recognizer calls the recognizers for the nonterminals on the RHS. • e.g., the rule S -> NP VP would have a recognizer s() with body • np( ); vp( );
Complications in recursive descent • scanning issues • RHSs with terminals • conflict between rules with the same LHS • optionality • including indefinite repetition • output • error handling
Terminal symbols • Terminal symbols may be handled by matching them with the next unread symbol in the input. • That is, one lookahead symbol is checked. • If there is a match, the next unread symbol is updated. • If not, there is a syntax error in the input.
Example with terminal symbols • For example, the rule F -> ( E ) could give a recognizer f() with body • match( '(' ); • e( ); • match( ')' );
Rule conflict • If there is more than one rule for a nonterminal, a conditional statement can be used. • The condition can involve the lookahead token. • An example is given for the nonterminal factor in L&L, p. 228.
Optionality • Optionality effectively gives multiple rules for the nonterminal on the LHS. • e.g., the ifStatement code, L&L, p. 227 • The same applies to indefinite repetition. • Here the repetition may be handled by a while loop. • e.g. the expr recognizer, L&L p. 229
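Putting the last few slides together, here is a minimal recursive descent recognizer in Python for the expression grammar in its EBNF form E -> T { + T }, T -> F { * F }, F -> x | y | ( E ), with repetition handled by while loops as just described (a sketch with invented helper names, not L&L's code; tokens are single characters for simplicity):

  # Recognizer for E -> T { + T }, T -> F { * F }, F -> x | y | ( E )
  # tokens: a list of one-character tokens; pos: index of the lookahead token

  def parse(token_list):
      global tokens, pos
      tokens, pos = token_list, 0
      e()
      if pos != len(tokens):
          error("extra input after expression")

  def lookahead():
      return tokens[pos] if pos < len(tokens) else None

  def match(expected):                  # check and consume one terminal
      global pos
      if lookahead() == expected:
          pos += 1
      else:
          error("expected " + expected)

  def e():                              # E -> T { + T }
      t()
      while lookahead() == '+':         # indefinite repetition: a while loop
          match('+')
          t()

  def t():                              # T -> F { * F }
      f()
      while lookahead() == '*':
          match('*')
          f()

  def f():                              # F -> x | y | ( E )
      if lookahead() in ('x', 'y'):     # rule chosen by the lookahead token
          match(lookahead())
      elif lookahead() == '(':
          match('(')
          e()
          match(')')
      else:
          error("expected x, y, or (")

  def error(message):
      raise SyntaxError(message)

  # parse(list("x+y*(x+y)")) succeeds; parse(list("x+")) raises SyntaxError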
Rule conflict -- details • If a nonterminal Y has several rules with RHSs α, β, γ, ..., we've seen that Y's recognizer uses a conditional statement. • If the conditional's lookahead symbol • is in First(α), one case will apply • is in First(β), another case will apply • etc. • Here, First(X) is the set of terminals that may begin the yield of X.
The First function • For simple grammars, the First function may be easy to compute by hand. • Tucker & Noonan give a general algorithm for finding First(X) for a symbol X • it works even for sequences of symbols. • Recursive descent works only if First(α) is disjoint from First(β) for all α and β • in the situation of the previous slide
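A sketch of that computation in Python, for grammars without ε-rules (the grammar representation is invented for illustration; Tucker & Noonan's algorithm is more general):

  def first_sets(grammar, terminals):
      # grammar maps each nonterminal to its alternatives, each a tuple
      # of symbols, e.g. {'E': [('T',), ('E', '+', 'T')],
      #                   'T': [('F',), ('T', '*', 'F')],
      #                   'F': [('x',), ('y',), ('(', 'E', ')')]}
      first = {nt: set() for nt in grammar}
      changed = True
      while changed:                        # iterate to a fixed point
          changed = False
          for nt, alternatives in grammar.items():
              for rhs in alternatives:
                  head = rhs[0]             # with no ε-rules, only the
                                            # first RHS symbol matters
                  new = {head} if head in terminals else first[head]
                  if not new <= first[nt]:
                      first[nt] |= new
                      changed = True
      return first

  # For the expression grammar above, First(E) = First(T) = First(F)
  # = {'x', 'y', '('} -- so E's two rules can't be told apart by lookahead.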
Left recursion • Recursive descent parsing requires the absence of left recursion. • In left recursion, a nonterminal starts the RHS of one or more of its rules, as in • E -> E + E | T • If the lookahead token t is also the first token of a string generated from T, the parser won’t know which E rule to apply.
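A standard remedy is to replace left recursion by right recursion or repetition; for example
  E -> E + T | T
can be rewritten as
  E -> T E'     E' -> + T E' | <empty>
or, in EBNF, simply as E -> T { + T }, which is what the while-loop recognizers above implement.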
Another potential problem • Another problem for recursive descent parsers arises from optionality. • Given a rule NP -> Det {Adj} N, there’d be a conflict between parsing rich as an N and as an Adj in a sentence beginning with • the rich • This problem can be dealt with in terms of a Follow function (cf. L&L, p. 232).
Abstract syntax trees • Parse trees aren’t the best interface between syntactic and semantic processing • (Abstract) syntax trees can be better • cf. L&L, p. 216 • For syntax trees (unlike parse trees) • nonterminal symbols needn’t appear • the form isn’t completely determined by the grammar
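As an illustration (hypothetical node classes, not L&L's), the expression x + y * x might be represented by a syntax tree built in Python as:

  from dataclasses import dataclass

  @dataclass
  class Var:                 # leaf node: a variable name
      name: str

  @dataclass
  class BinOp:               # interior node: an operator and two subtrees
      op: str
      left: object
      right: object

  # Syntax tree for x + y * x: no E/T/F chain nodes, no parenthesis tokens;
  # grouping is implicit in the tree's shape.
  tree = BinOp('+', Var('x'), BinOp('*', Var('y'), Var('x')))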