CS 152, Programming Paradigms. Fall 2011, SJSU. Jeff Smith
Programs as text strings • We can think of programs as strings of • characters, or • tokens (substrings similar to words), or • typed tokens • We’ll consider (untyped) tokens first. • So programs are sequences of tokens • cf. English sentences as sequences of words • And languages are sets of programs
Identifying tokens • Splitting a program into tokens is called • lexical analysis, or • lexing, or • scanning • We’ll touch on algorithms for doing this later.
Languages and grammars • A (formal) language is a set of finite strings over a finite alphabet. • An alphabet is a finite set of symbols, e.g. • the ASCII characters • the Unicode characters • the legal token types of Java • But not the legal tokens of Java!
Programs and languages • A string is a program in programming language L iff it is a member of L. • So how does one determine whether a string is a member of a language L? • The answer depends on the notions of constituency and linear order.
Grammar rules • A language can be defined in terms of rules (or productions). • These rules specify the constituents of its members (and of their constituents), and the order in which they must appear. • Pieces without constituents (terminals) must be members of the alphabet.
A sample grammar rule • The rule • <program> ::= begin <block> end . • says that a program may consist of • the terminal "begin" • followed by a block constituent, • followed by the terminal "end" • followed by the terminal "." • Here, terminals correspond to tokens.
Grammars • Additional rules would give the constituents of blocks and other constituents. • A grammar specifies a language by • listing the legal terminal symbols, • listing the legal nonterminal symbols, • (e.g., <program>, <block>), • listing the rules, • saying which nonterminal is the start symbol (represents a language member)
Context-free grammars • The only grammars we will consider are context-free grammars (CFGs). • In a CFG, each rule has • a nonterminal on its left-hand side (LHS) • a string of symbols (terminals or nonterminals) on its right-hand side (RHS). • Nonterminals are also called variables.
Notation for rules • Nonterminals may be distinguished from terminals by • delimiting them with angle brackets, or • beginning them with a capital letter, or • writing them in italics, or • printing terminals in bold face. • The LHS and RHS of a rule are separated by a "->" or "::=" symbol.
Notation for combining rules • Rules with the same LHS represent optionality, e.g. • <operator> ::= + • <operator> ::= - • Such rules may be combined using a vertical bar convention, e.g. • <operator> ::= + | - • Any number of rules with the same LHS can be combined this way.
Metasymbols and BNF • In the rule • <operator> ::= + | - • the vertical bar is a metasymbol • it’s neither a terminal nor a nonterminal. • The notational conventions described above are called Backus-Naur form, or BNF.
Grammars and parsing • Any CFG G defines a language L(G) • L(G) is the set of strings that can be generated from G’s start symbol using its rules. • Showing that a string is in L(G), by exhibiting how the rules generate it, is called parsing. • Parsing algorithms use the rules of G to show that the string has the correct constituents, in the correct linear order, to be in L(G).
Parsing and parse trees • One way of summarizing a parse is with a parse tree (cf. T&N, Section 2.1.3). • Here parent nodes correspond to LHSs of rules, children (in order) to RHSs, and leaves to terminals. • The string of leaves (from left to right) is the yield of the parse tree.
Identifying tokens (again) • We haven’t yet said how to identify tokens • given that programs are character strings • Languages may require tokens to be • single characters, or • delimited by whitespace characters, or • of bounded length • But these restrictions are rare • they inconvenience programmers
Grammars for token types • CFGs may be used for token types, e.g. • <identifier> ::= <letter> • <identifier> ::= <letter> <identifier> • would recognize nonempty strings of letters • But this can introduce ambiguity • e.g., is the string doif 1 token? 2? 3? 4?
Scanning in real parsers • It’s generally efficient for a parser to have a special preprocessing step for scanning. • Scanning algorithms are simplified parsers • using CFGs for token types • such CFGs are generally simple • for details, see CS 154 • and its treatment of regular expressions
Scanning issues • It’s common for scanners to • work left to right • treat whitespace characters as delimiters • otherwise disambiguate by choosing the longest of the possible tokens • determine a type for each token • Types may be singleton types or not, e.g. • a token if might be in a type by itself • a token type identifier would presumably have many instances
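To make these conventions concrete, here is a minimal sketch of such a scanner in Python. The token types and patterns are hypothetical, chosen only for illustration; real scanners are typically generated from regular expressions (cf. CS 154).

import re

# Hypothetical token types for a tiny language; a real scanner
# would have many more categories.
TOKEN_PATTERNS = [
    ("KEYWORD",    r"(?:if|do|else)\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"[+\-*/=]"),
]

def scan(source):
    """Scan left to right, returning a list of (type, token) pairs."""
    tokens = []
    i = 0
    while i < len(source):
        if source[i].isspace():          # whitespace delimits tokens
            i += 1
            continue
        best = None                      # longest match seen so far
        for ttype, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, source[i:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (ttype, m.group())
        if best is None:
            raise SyntaxError(f"illegal character {source[i]!r}")
        tokens.append(best)              # longest possible token wins
        i += len(best[1])
    return tokens

# doif scans as a single IDENTIFIER by the longest-match rule,
# while "do if" scans as two KEYWORDs.
print(scan("do if doif x1 + 42"))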
Categories of tokens and token types • Keywords • reserved words, predefined identifiers • generally in a type by themselves • Literals (cf. constants) • numeric, string, Boolean, array, enumeration members, lists, … • Identifiers • for variables, functions, data types, …
Typed tokens • Grammar symbols representing token types • are sometimes called preterminals • are treated as terminals by the parser • The tokens themselves are to be recognized by a scanner • Preterminals may appear as nonleaves in parse trees • with a single child representing the token
Why "lexical"? • CFGs for English can have preterminals • especially for lexical categories • e.g., N (for nouns), V (for verbs), … • Rules for these preterminals form a lexicon • a list of words labeled with their categories • This allows for a simple CFG for English • or at least a healthy fragment of English
A CFG for a fragment of English • S -> NP VP • NP -> Det N • NP -> Det N PP • PP -> P NP • VP -> V NP • Here N, V, P, and Det are preterminals. • The sentence the dog chased a cat would look like Det N V Det N to the parser, as in the parse tree sketched below.
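For example, the parse tree for the dog chased a cat can be sketched as follows; indentation shows the parent-child structure, and the yield is the string of leaves read left to right:

S
  NP
    Det  the
    N    dog
  VP
    V    chased
    NP
      Det  a
      N    cat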
EBNF • Occasionally, certain extensions to BNF notation are convenient. • The term EBNF (for extended Backus-Naur form) is used to cover these extensions. • These extensions introduce new metasymbols, given below with their interpretations.
EBNF constructions • ( ) parentheses, for removing ambiguity, • e.g., (a|b)c vs. a|bc • [ ] brackets, for optionality • 0 or 1 times • { } braces, for indefinite repetition • 0 or more times • Sometimes the first of these is considered part of ordinary BNF.
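Each extension is only an abbreviation for plain BNF. For example, a schematic if-statement rule with an optional else part

<if-statement> ::= if <expression> then <statement> [ else <statement> ]

abbreviates the two BNF rules

<if-statement> ::= if <expression> then <statement>
<if-statement> ::= if <expression> then <statement> else <statement>

and braces abbreviate a recursive rule, as the rules for <identifier> on a later slide illustrate.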
A very simple grammar • S -> x | x S S • Here, the single terminal represents a token. • This grammar generates all strings of x's of odd length.
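For example, the string xxx has the derivation S => x S S => x x S => x x x, where each step rewrites one nonterminal using a rule of the grammar. The odd length is no accident: a derivation starts with the single symbol S, and each use of the rule S -> x S S adds exactly two symbols, so every generated string has length 1 + 2k.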
Grammars for algebraic expressions • An ambiguous grammar G: • E -> E + E | E * E | ( E ) | x | y • Here the parentheses aren’t metasymbols • They are terminal symbols of the grammar • Like the other terminals, they represent tokens • An unambiguous grammar for L(G) • E -> T | E + T • T -> F | T * F • F -> x | y | ( E )
Ambiguity • Ambiguity is a property of grammars • A language can have both ambiguous and unambiguous grammars – see the previous slide • A grammar G is ambiguous iff some string in L(G) has two or more legal parse trees • That slide’s second grammar resolves both associativity and precedence, as illustrated below • cf. T&N, Sections 2.1.4, 2.1.5, and 7.2.2
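For instance, G gives x + y * x two parse trees: one whose root applies E -> E + E (with the * below the +) and one whose root applies E -> E * E (with the + below the *). The unambiguous grammar allows only the first shape, so * effectively has higher precedence than +; and because its recursive rules E -> E + T and T -> T * F recurse on the left, repeated operators associate to the left.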
A grammar for a simple class of identifiers • <identifier> ::= <nondigit> • <identifier> ::= <identifier> <nondigit> • <identifier> ::= <identifier> <digit> • Note the absence of rules for <digit> and <nondigit> • These are preterminal symbols • corresponding to token types to be identified by the scanner rather than the parser
if-statements in C (cf. Kernighan & Ritchie) • <selection-statement> ::= • if ( <expression> ) <statement> • [ else <statement> ] | … • <statement> ::= • <compound-statement> | … • <compound-statement> ::= • { [<declaration-list>] [<statement-list>] } • Here the braces are terminal symbols
if-statements in Ada • <if-statement> ::= • if <boolean-condition> then • <sequence-of-statements> • { elsif <boolean-condition> then • <sequence-of-statements> } • [else <sequence-of-statements>] • end if ;
statements in Ada • <statement> ::= • null | • <assignment-statement> | • <if-statement> | • <loop-statement> | ... • <sequence-of-statements> ::= • <statement> { <statement> }
Translation steps (idealized) • character string • lexical analysis (scanning, tokenizing) • string of tokens • syntactic analysis (parsing) • parse tree (or syntax tree) • semantic analysis, ...
A BNF grammar for the Scheme language • <expression> ::= <atom> | <list> • <atom> ::= <literal> | <identifier> • <list> ::= () | ( <expressions> ) • <expressions> ::= <expression> | • <expression> <expressions> • Here <literal> and <identifier> are preterminals; the parentheses are terminals in types by themselves.
Two parsing strategies • bottom up (shift-reduce) • match tokens with RHS's of rules • when a full RHS is found, replace it by the LHS • top down (recursive descent) • expand the rules, matching input tokens as predicted by rules
Recursive descent parsing • A recursive descent parser has one recognizer function per nonterminal. • In the simplest case, each recognizer calls the recognizers for the nonterminals on the RHS. • e.g., the rule S -> NP VP would have a recognizer s() with body • np( ); vp( );
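Here is a minimal sketch of this scheme in Python, for the English grammar given earlier (omitting the PP rules). It assumes the scanner has already turned the input into a list of preterminal labels; match is the lookahead-checking operation described on the next slides.

# Recursive descent recognizer for
#   S -> NP VP     NP -> Det N     VP -> V NP
tokens = []   # the input, as a list of preterminal labels
pos = 0       # index of the lookahead token

def match(expected):
    """Consume the lookahead token if it has the expected type."""
    global pos
    if pos < len(tokens) and tokens[pos] == expected:
        pos += 1
    else:
        raise SyntaxError(f"expected {expected} at position {pos}")

def s():                 # S -> NP VP
    np(); vp()

def np():                # NP -> Det N
    match("Det"); match("N")

def vp():                # VP -> V NP
    match("V"); np()

def recognize(input_tokens):
    global tokens, pos
    tokens, pos = input_tokens, 0
    s()
    return pos == len(tokens)    # succeed iff all input is consumed

# "the dog chased a cat", as the parser sees it:
print(recognize(["Det", "N", "V", "Det", "N"]))   # True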
Complications in recursive descent • scanning issues • RHSs with terminals • conflict between rules with the same LHS • optionality • including indefinite repetition • output • error handling
Terminal symbols • Terminal symbols may be handled by matching them with the next unread symbol in the input. • That is, one lookahead symbol is checked. • If there is a match, the next unread symbol is updated. • If not, there is a syntax error in the input.
Example with terminal symbols • For example, the rule F -> ( E ) could give a recognizer f() with body • match( '(' ); • e( ); • match( ')' );
Rule conflict • If there is more than one rule for a nonterminal, a conditional statement can be used. • The condition can involve the lookahead token. • An example is given for the nonterminal primary in T&N, p. 79.
Optionality • Optionality (the use of brackets in EBNF) effectively gives multiple rules for the nonterminal on the LHS. • e.g., the factor recognizer, T&N, p. 79. • The same applies to indefinite repetition (the use of braces in EBNF) • Here the repetition may be handled by a while loop, • e.g. the term recognizer, T&N p. 79
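Putting these pieces together, here is a sketch in Python of a recursive descent parser for the expression grammar given earlier, rewritten in EBNF as E -> T { + T } and T -> F { * F }. It is modeled on, but not identical to, the recognizers in T&N: braces become while loops, and the rule conflict in F is resolved by the lookahead token.

# Recursive descent parser for the EBNF grammar
#   E -> T { + T }     T -> F { * F }     F -> x | y | ( E )
tokens, pos = [], 0

def look():
    return tokens[pos] if pos < len(tokens) else None

def match(expected):
    global pos
    if look() == expected:
        pos += 1
    else:
        raise SyntaxError(f"expected {expected!r}, saw {look()!r}")

def e():                     # E -> T { + T }
    t()
    while look() == "+":     # indefinite repetition as a while loop
        match("+"); t()

def t():                     # T -> F { * F }
    f()
    while look() == "*":
        match("*"); f()

def f():                     # F -> x | y | ( E )
    if look() in ("x", "y"):     # rule conflict: choose by lookahead
        match(look())
    elif look() == "(":
        match("("); e(); match(")")
    else:
        raise SyntaxError(f"unexpected token {look()!r}")

def recognize(input_tokens):
    global tokens, pos
    tokens, pos = input_tokens, 0
    e()
    return pos == len(tokens)

print(recognize(list("x+y*(x+y)")))   # True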
Rule conflict -- details • If a nonterminal Y has several rules with RHSs α, β, γ, ..., we've seen that Y's recognizer uses a conditional statement. • If the conditional's lookahead symbol • is in First(α), one case will be used • is in First(β), another case will be used • etc. • Here, First(X) is the set of terminals that may begin the yield of X.
The First function • For simple grammars, the First function may be easy to compute by hand. • T&N give a general algorithm for computing it -- which may be used even if the argument is a sequence of symbols. • In recursive descent parsing, First(X) must be disjoint from First(Y) for any two RHSs X and Y (for the same LHS) .
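Here is a minimal sketch, not T&N's general algorithm, that computes First sets by iterating to a fixed point. It assumes a grammar with no empty right-hand sides, so only the first symbol of each RHS matters.

# Compute First sets for a CFG with no empty RHSs.
# grammar maps each nonterminal to a list of RHSs (lists of
# symbols); any symbol not in grammar is a terminal.
def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                   # iterate to a fixed point
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                head = rhs[0]        # only the first symbol matters
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# The unambiguous expression grammar from the earlier slide:
grammar = {
    "E": [["T"], ["E", "+", "T"]],
    "T": [["F"], ["T", "*", "F"]],
    "F": [["x"], ["y"], ["(", "E", ")"]],
}
print(first_sets(grammar))
# {'E': {'x', 'y', '('}, 'T': {'x', 'y', '('}, 'F': {'x', 'y', '('}}
# (set element order may vary)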
Left recursion • Recursive descent parsing requires the absence of left recursion. • In left recursion, a nonterminal starts the RHS of one or more of its rules, as in • E -> E + E | T • If the lookahead token t is also the first token of a string generated from T, the parser won’t know which E rule to apply.
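The standard remedy is to transform the grammar. The rules E -> E + E | T generate the same language as the EBNF rule E -> T { + T } (in plain BNF: E -> T E' with E' -> + T E' | ε, where ε is the empty string), and the latter form is exactly what the while-loop recognizers sketched above implement.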
Another potential problem • Another problem for recursive descent parsers arises from optionality. • Given a rule NP -> Det {Adj} N, there’d be a conflict between parsing rich as an N and as an Adj in a sentence beginning with • the rich • This problem can be dealt with in terms of a Follow function (cf. Louden).
Abstract syntax trees • (Abstract) syntax trees can be better than parse trees as the interface between syntactic and semantic processing • e.g., T&N, Figures 2.10, 2.13, and 7.1 • cf. Section 2.5.1 & Section 7.2.1 • For syntax trees (unlike parse trees) • Nonterminal symbols needn’t appear • The form isn’t completely determined by the grammar