CSCE 330 Programming Language Structures. Chapter 3: Lexical and Syntactic Analysis (based mainly on Tucker and Noonan; Watt and Brown). Fall 2010. Marco Valtorta, mgv@cse.sc.edu. "Syntactic sugar causes cancer of the semicolon." A. Perlis.


Presentation Transcript


  1. CSCE 330 Programming Language Structures • Chapter 3: Lexical and Syntactic Analysis (based mainly on Tucker and Noonan; Watt and Brown) • Fall 2010 • Marco Valtorta • mgv@cse.sc.edu • Syntactic sugar causes cancer of the semicolon. A. Perlis

  2. Contents • 3.1 Chomsky Hierarchy • 3.2 Lexical Analysis • 3.3 Syntactic Analysis

  3. 3.1 Chomsky Hierarchy • Regular grammar -- least powerful • Context-free grammar (BNF) • Context-sensitive grammar • Unrestricted grammar

  4. Regular Grammar • Simplest; least powerful • Equivalent to: • Regular expression • Finite-state automaton • Right regular grammar: ω ∈ T*, B ∈ N • A → ω B • A → ω

  5. Example • Integer→ 0 Integer | 1 Integer | ... | 9 Integer | 0 | 1 | ... | 9
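
The Integer grammar on the slide above generates exactly the strings matched by the regular expression [0-9]+. The following is a minimal Java sketch of that equivalence, not code from the course: the class name `IntegerGrammar` and the method `derives` are hypothetical names chosen for illustration. `derives` follows the two production shapes directly (Integer → d Integer and Integer → d).

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original slides.
final class IntegerGrammar {
    // Regular expression equivalent to the Integer grammar: one or more digits.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Follows the productions directly, starting at position pos:
    //   Integer -> d Integer   (consume one digit, keep deriving)
    //   Integer -> d           (consume the last digit)
    static boolean derives(String s, int pos) {
        if (pos >= s.length() || !isDigit(s.charAt(pos))) {
            return false;                 // no production starts with this symbol
        }
        if (pos == s.length() - 1) {
            return true;                  // Integer -> d
        }
        return derives(s, pos + 1);       // Integer -> d Integer
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "2010", "", "12a"}) {
            System.out.println("\"" + s + "\": grammar=" + derives(s, 0)
                    + ", regex=" + INTEGER.matcher(s).matches());
        }
    }
}
```

Both checks agree on every input, which is the practical meaning of "equivalent to a regular expression" on the previous slide.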

  6. Regular Grammars • Left regular grammar: equivalent • Used in construction of tokenizers (scanners, lexers) • Less powerful than context-free grammars • Example of a language that is not regular: { aⁿ bⁿ | n ≥ 1 }, i.e., a regular grammar cannot balance ( ), { }, begin end
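
To see why { aⁿ bⁿ | n ≥ 1 } is out of reach for a finite-state machine, it helps to look at what a recognizer needs. The sketch below (hypothetical class `AnBn`, written for illustration only) uses an integer counter that can grow without bound. A finite automaton has only a fixed number of states, so it cannot keep such a count.

```java
// Hypothetical illustration class; not part of the original slides.
final class AnBn {
    // Recognizes { a^n b^n | n >= 1 } by counting a's up and b's down.
    // The counter is the unbounded memory that a finite-state automaton lacks.
    static boolean accepts(String s) {
        int i = 0;
        int count = 0;
        while (i < s.length() && s.charAt(i) == 'a') {
            count++;
            i++;
        }
        if (count == 0) {
            return false;                 // n must be at least 1
        }
        while (i < s.length() && s.charAt(i) == 'b') {
            count--;
            i++;
        }
        // All input consumed and every 'a' matched by exactly one 'b'.
        return i == s.length() && count == 0;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ab", "aabb", "aab", "abab"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```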

  7. Context-free Grammars • BNF: a stylized form of CFG • Equivalent to a pushdown automaton • For a wide class of unambiguous CFGs, there are table-driven, linear time parsers

  8. Context-Sensitive Grammars • Production: α → β, where |α| ≤ |β| • α, β ∈ (N ∪ T)* • i.e., the left-hand side can be composed of strings of terminals and nonterminals

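  9. Undecidable Properties of CSGs • Given a string ω and grammar G: is ω ∈ L(G)? (decidable for CSGs, because |α| ≤ |β| means sentential forms never shrink during a derivation; undecidable for unrestricted grammars) • Is L(G) non-empty? (undecidable) • Defn: Undecidable means that you cannot write a computer program that is guaranteed to halt to decide the question for all L(G).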

  10. Unrestricted Grammar • Equivalent to: • Turing machine • von Neumann machine • C++, Java • That is, can compute any computable function.

  11. Contents • 3.1 Chomsky Hierarchy • 3.2 Lexical Analysis • 3.3 Syntactic Analysis

  12. Lexical Analysis • Purpose: transform program representation • Input: printable ASCII characters • Output: tokens • Discard: whitespace, comments • Defn: A token is a logically cohesive sequence of characters representing a single symbol.

  13. Example Tokens • Identifiers • Literals: 123, 5.67, 'x', true • Keywords: bool char ... • Operators: + - * / ... • Punctuation: ; , ( ) { }

  14. Other Sequences • Whitespace: space tab • Comments // any-char* end-of-line • End-of-line • End-of-file

  15. Why a Separate Phase? • Simpler, faster machine model than parser • 75% of time spent in lexer for non-optimizing compiler • Differences in character sets • End of line convention differs

  16. Regular Expressions
      RegExpr     Meaning
      x           a character x
      \x          an escaped character, e.g., \n
      { name }    a reference to a name
      M | N       M or N
      M N         M followed by N
      M*          zero or more occurrences of M

  17. RegExpr     Meaning
      M+          one or more occurrences of M
      M?          zero or one occurrence of M
      [aeiou]     the set of vowels
      [0-9]       the set of digits
      .           any single character
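
Most of the operators listed above carry over unchanged to `java.util.regex`. The `{ name }` reference is a lexer-generator feature and has no direct equivalent there. The sketch below is an illustration only: the class `RegexOperators` and its `matches` helper are hypothetical names, and the patterns are small examples chosen here, not taken from the slides.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original slides.
final class RegexOperators {
    // True when the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("ab*", "a"));       // M*: zero or more b's
        System.out.println(matches("ab+", "a"));       // M+: needs at least one b
        System.out.println(matches("ab?", "abb"));     // M?: at most one b
        System.out.println(matches("a|b", "b"));       // M | N: alternation
        System.out.println(matches("[aeiou]", "e"));   // character set
        System.out.println(matches("[0-9]+", "330"));  // range combined with M+
        System.out.println(matches(".", "x"));         // any single character
    }
}
```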

  18. Clite Lexical Syntax
      Category      Definition
      anyChar       [ -~]
      Letter        [a-zA-Z]
      Digit         [0-9]
      Whitespace    [ \t]
      Eol           \n
      Eof           \004

  19. Category      Definition
      Keyword       bool | char | else | false | float | if | int | main | true | while
      Identifier    {Letter}({Letter} | {Digit})*
      integerLit    {Digit}+
      floatLit      {Digit}+\.{Digit}+
      charLit       '{anyChar}'

  20. Category      Definition
      Operator      = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
      Separator     ; | , | { | } | ( | )
      Comment       // ({anyChar} | {Whitespace})* {Eol}
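
The category definitions above translate almost literally into `java.util.regex` patterns if the named categories are treated as string building blocks. This is a hedged sketch, not the course's implementation: the class `CliteCategories`, its constants, and the `classify` method are hypothetical names introduced for illustration.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original slides.
final class CliteCategories {
    // Building blocks that play the role of {Letter}, {Digit}, {anyChar}.
    static final String LETTER = "[a-zA-Z]";
    static final String DIGIT = "[0-9]";
    static final String ANY_CHAR = "[ -~]";

    static final Set<String> KEYWORDS = Set.of(
            "bool", "char", "else", "false", "float",
            "if", "int", "main", "true", "while");

    static final Pattern IDENTIFIER =
            Pattern.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*");
    static final Pattern INTEGER_LIT = Pattern.compile(DIGIT + "+");
    static final Pattern FLOAT_LIT = Pattern.compile(DIGIT + "+\\." + DIGIT + "+");
    static final Pattern CHAR_LIT = Pattern.compile("'" + ANY_CHAR + "'");

    // Names the category of a complete lexeme, or "unknown".
    static String classify(String lexeme) {
        // Keywords also match the Identifier pattern, so they are checked first.
        if (KEYWORDS.contains(lexeme)) return "Keyword";
        if (IDENTIFIER.matcher(lexeme).matches()) return "Identifier";
        if (INTEGER_LIT.matcher(lexeme).matches()) return "integerLit";
        if (FLOAT_LIT.matcher(lexeme).matches()) return "floatLit";
        if (CHAR_LIT.matcher(lexeme).matches()) return "charLit";
        return "unknown";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"while", "x1", "123", "5.67", "'x'", "1x"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Checking keywords before identifiers mirrors the later advice to drop keywords from the automaton and handle them separately.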

  21. Generators • Input: usually regular expression • Output: table (slow), code • C/C++: Lex, Flex • Java: JLex

  22. Finite State Automata • Set of states: representation – graph nodes • Input alphabet + unique end symbol • State transition function: representation – arcs in the graph, labelled using the alphabet • Unique start state • One or more final states

  23. Deterministic FSA • Defn: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol.

  24. A Finite State Automaton for Identifiers

  25. Definitions • A configuration on an FSA consists of a state and the remaining input. • A move consists of traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming it. If no such arc, then: • If no input and state is final, then accept. • Otherwise, error.

  26. An input is accepted if, starting with the start state, the automaton consumes all the input and halts in a final state.

  27. Example • (S, a2i$) ├ (I, 2i$) • ├ (I, i$) • ├ (I, $) • ├ (F, ) • Thus: (S, a2i$) ├* (F, )
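
The sequence of configurations above can be reproduced by simulating the automaton. The diagram for the identifier automaton is not part of this transcript, so the sketch below assumes the three states S, I and F that appear in the trace, with `$` as the end symbol. The class `IdentifierDfa` and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; states S, I, F are inferred from the trace.
final class IdentifierDfa {
    enum State { S, I, F }

    static boolean isLetter(char c) {
        return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z';
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Deterministic transition function: at most one arc per state and symbol.
    // Returns null when there is no arc for the symbol.
    static State move(State state, char c) {
        switch (state) {
            case S:
                return isLetter(c) ? State.I : null;
            case I:
                if (isLetter(c) || isDigit(c)) return State.I;
                return c == '$' ? State.F : null;
            default:
                return null;              // F has no outgoing arcs
        }
    }

    // Lists every configuration (state, remaining input); ends with "error"
    // if the automaton gets stuck or stops outside the final state.
    static List<String> trace(String input) {
        List<String> configs = new ArrayList<>();
        State state = State.S;
        int pos = 0;
        configs.add("(" + state + ", " + input + ")");
        while (pos < input.length()) {
            State next = move(state, input.charAt(pos));
            if (next == null) {
                configs.add("error");
                return configs;
            }
            state = next;
            pos++;                        // the move consumes the leftmost symbol
            configs.add("(" + state + ", " + input.substring(pos) + ")");
        }
        if (state != State.F) configs.add("error");
        return configs;
    }

    static boolean accepts(String input) {
        List<String> configs = trace(input);
        return !configs.get(configs.size() - 1).equals("error");
    }

    public static void main(String[] args) {
        trace("a2i$").forEach(System.out::println);
    }
}
```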

  28. Some Conventions • Explicit terminator used only for program as a whole, not each token. • An unlabeled arc represents any other valid input symbol. • Recognition of a token ends in a final state. • Recognition of a non-token transitions back to start state.

  29. Recognition of end symbol (end of file) ends in a final state. • Automaton must be deterministic. • Drop keywords; handle separately. • Must consider all sequences with a common prefix together.

  30. Lexer Code • Parser calls lexer whenever it needs a new token. • Lexer must remember where it left off. • Greedy consumption goes 1 character too far • peek function • pushback function • no symbol consumed by start state
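
The "one character too far" problem and the pushback remedy can be illustrated with the standard `java.io.PushbackReader`, whose `unread` method returns a character to the stream. This is a sketch under assumptions, not the lexer from the course: the class `PushbackDemo` and both of its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration class; not part of the original slides.
final class PushbackDemo {
    // Greedily reads a run of digits. The loop only stops after reading one
    // character that is not a digit, so that character is pushed back.
    static String readDigits(PushbackReader in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c);                 // undo the one-character overshoot
        }
        return digits.toString();
    }

    // Returns the digit run and the character that follows it ("" at end of input).
    static String[] digitsThenNext(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            String digits = readDigits(in);
            int next = in.read();
            return new String[] {digits, next == -1 ? "" : String.valueOf((char) next)};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = digitsThenNext("123+4");
        System.out.println("digits=" + parts[0] + ", next=" + parts[1]);
    }
}
```

The alternative used in the slides that follow is to keep the overshoot character in a variable (`ch`) that the next call starts from, which avoids the pushback call entirely.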

  31. From Design to Code
      private char ch = ' ';
      public Token next() {
          do {
              switch (ch) {
                  ...
              }
          } while (true);
      }

  32. Remarks • Loop only exited when a token is found • Loop exited via a return statement. • Variable ch must be global. Initialized to a space character. • Exact nature of a Token irrelevant to design.

  33. Translation Rules • Traversing an arc from A to B: • If labeled with x: test ch == x • If unlabeled: else/default part of if/switch. If only arc, no test need be performed. • Get next character if A is not start state

  34. A node with an arc to itself is a do-while. • Condition corresponds to whichever arc is labeled.

  35. Otherwise the move is translated to an if/switch: • Each arc is a separate case. • Unlabeled arc is default case. • A sequence of transitions becomes a sequence of translated statements.

  36. A complex diagram is translated by boxing its components so that each box is one node. • Translate each box using an outside-in strategy.

  37. private boolean isLetter(char c) {
          return c >= 'a' && c <= 'z' ||
                 c >= 'A' && c <= 'Z';
      }

  38. // Collects the current character, then every following character found in set.
      private String concat(String set) {
          StringBuilder r = new StringBuilder();
          do {
              r.append(ch);
              ch = nextChar();
          } while (set.indexOf(ch) >= 0);
          return r.toString();
      }

  39. public Token next() {
          do {
              if (isLetter(ch)) { // ident or keyword
                  String spelling = concat(letters + digits);
                  return Token.keyword(spelling);
              } else if (isDigit(ch)) { // int or float literal
                  String number = concat(digits);
                  if (ch != '.')
                      return Token.mkIntLiteral(number);
                  number += concat(digits); // appends the '.' and the fraction digits
                  return Token.mkFloatLiteral(number);

  40.         } else switch (ch) {
                  case ' ': case '\t': case '\r': case eolnCh:
                      ch = nextChar(); break;
                  case eofCh: return Token.eofTok;
                  case '+': ch = nextChar();
                      return Token.plusTok;
                  …
                  case '&': check('&'); return Token.andTok;
                  case '=': return chkOpt('=', Token.assignTok,
                                          Token.eqeqTok);
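
The fragments on the last few slides leave out `nextChar`, `chkOpt` and the `Token` class, so they cannot be run as shown. The following is a self-contained sketch in the same style, under several assumptions: tokens are plain strings instead of `Token` objects, keywords are not separated from identifiers, only a few operators are handled, and the body of `chkOpt` is a guess at what such a helper would do. The class name `MiniLexer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lexer in the style of the slides; not the course's code.
final class MiniLexer {
    private static final char EOF_CH = '\004';   // Eof from the Clite lexical syntax
    private static final String LETTERS =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String DIGITS = "0123456789";

    private final String input;
    private int pos = 0;
    private char ch = ' ';                       // current character, starts as a space

    MiniLexer(String input) {
        this.input = input;
    }

    private char nextChar() {
        return pos < input.length() ? input.charAt(pos++) : EOF_CH;
    }

    // Collects the current character, then every following character found in set.
    private String concat(String set) {
        StringBuilder r = new StringBuilder();
        do {
            r.append(ch);
            ch = nextChar();
        } while (set.indexOf(ch) >= 0);
        return r.toString();
    }

    // Assumed behavior: returns 'two' if the next character is c, otherwise 'one'.
    private String chkOpt(char c, String one, String two) {
        ch = nextChar();
        if (ch != c) return one;
        ch = nextChar();
        return two;
    }

    // Returns the next token; the loop is only left through a return statement.
    String next() {
        while (true) {
            if (LETTERS.indexOf(ch) >= 0) {
                return "Word(" + concat(LETTERS + DIGITS) + ")";
            }
            if (DIGITS.indexOf(ch) >= 0) {
                String number = concat(DIGITS);
                if (ch != '.') return "Int(" + number + ")";
                number += concat(DIGITS);        // appends the '.' and the fraction
                return "Float(" + number + ")";
            }
            switch (ch) {
                case ' ': case '\t': case '\r': case '\n':
                    ch = nextChar(); break;      // skip whitespace, stay in the loop
                case EOF_CH: return "Eof";
                case '+': ch = nextChar(); return "+";
                case ';': ch = nextChar(); return ";";
                case '=': return chkOpt('=', "=", "==");
                default:
                    throw new IllegalStateException("Illegal character: " + ch);
            }
        }
    }

    // Convenience wrapper: all tokens of a source string, ending with "Eof".
    static List<String> tokens(String source) {
        MiniLexer lexer = new MiniLexer(source);
        List<String> out = new ArrayList<>();
        String t;
        do {
            t = lexer.next();
            out.add(t);
        } while (!t.equals("Eof"));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("i = c + 3;"));
    }
}
```

Because `ch` always holds the first character not yet consumed, each call to `next()` resumes exactly where the previous one stopped, which is the "lexer must remember where it left off" requirement.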

  41. Source:
      // a first program
      // with 2 comments
      int main ( ) {
          char c;
          int i;
          c = 'h';
          i = c + 3;
      } // main
      Tokens (first few): int main ( ) { char Identifier c ;

  42. JLex: A Lexical Analyzer Generator for Java • We will look at an example JLex specification (adapted from the manual). Consult the manual for details on how to write your own JLex specifications. • Diagram: definition of tokens (regular expressions) → JLex → Java file: scanner class (recognizes tokens)

  43. The JLex tool • Layout of JLex file:
      user code (added to start of generated file)
      %%
      options
      %{
      user code (added inside the scanner class declaration)
      %}
      macro definitions
      %%
      lexical declaration
      • User code is copied directly into the output class • JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting, line counting, manage EOF, etc. • Macro definitions give names to useful regexps • Regular expression rules define the tokens to be recognised and the actions to be taken

  44. java.io.StreamTokenizer • An alternative to JLex is to use the class StreamTokenizer from java.io • The class recognizes 4 types of lexical elements (tokens): • number (a sequence of decimal digits, optionally starting with a minus sign and/or containing a decimal point) • word (a sequence of letters and digits starting with a letter) • line separator • end of file
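
A short sketch of the four token types as `StreamTokenizer` reports them. The class `StreamTokenizerDemo` and its `tokens` method are hypothetical names for illustration. The tokenizer calls used (`nextToken`, `ttype`, `nval`, `sval`, `eolIsSignificant` and the `TT_*` constants) are part of `java.io.StreamTokenizer`.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the original slides.
final class StreamTokenizerDemo {
    // Describes each token of the text as "word ...", "number ...", "eol" or "eof".
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);        // report line separators instead of skipping them
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);   // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("eol");
                        break;
                    default:
                        out.add("char " + (char) st.ttype);  // any other single character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.add("eof");
        return out;
    }

    public static void main(String[] args) {
        tokens("count 42\nx -3.5").forEach(System.out::println);
    }
}
```

Numbers are always delivered as `double` values, so a lexer that must distinguish integer from floating-point literals, as Clite does, needs extra work on top of this class.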

  45. Parsing • Some terminology • Different types of parsing strategies • bottom up • top down • Recursive descent parsing • What is it • How to implement one given an EBNF specification • (How to generate one using tools – later) • (Bottom up parsing algorithms)

  46. Parsing: Some Terminology • Recognition: to answer the question "does the input conform to the syntax of the language?" • Parsing: recognition + determination of phrase structure (for example by generating AST data structures) • (Un)ambiguous grammar: a grammar is unambiguous if there is at most one way to parse any input (i.e., for every syntactically correct program there is precisely one parse tree)

  47. Different kinds of Parsing Algorithms • Two big groups of algorithms can be distinguished: • bottom up strategies • top down strategies • Example: parsing of "Micro-English"
      Sentence ::= Subject Verb Object .
      Subject  ::= I | a Noun | the Noun
      Object   ::= me | a Noun | the Noun
      Noun     ::= cat | mat | rat
      Verb     ::= like | is | see | sees
      Example sentences: The cat sees the rat. The rat sees me. I like a cat. The rat like me. I see the rat. I sees a rat.

  48. Top-down parsing • The parse tree is constructed starting at the top (root). • [Figure: parse tree for "The cat sees a rat ." with root Sentence, children Subject, Verb, Object and ".", and a Noun node above each of "cat" and "rat".]
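
Top-down construction can be sketched as a recursive-descent recognizer with one method per nonterminal of the Micro-English grammar, starting from Sentence. This is an illustration under assumptions, not the course's parser: the class `MicroEnglish` is a hypothetical name, words are compared case-insensitively, and the final period is split off as its own token. It only recognizes sentences and does not build the parse tree.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of the original slides.
final class MicroEnglish {
    private static final Set<String> NOUNS = Set.of("cat", "mat", "rat");
    private static final Set<String> VERBS = Set.of("like", "is", "see", "sees");

    private final List<String> words;
    private int pos = 0;

    private MicroEnglish(List<String> words) {
        this.words = words;
    }

    private String peek() {
        return pos < words.size() ? words.get(pos) : "";
    }

    // Consumes the next word if it equals w.
    private boolean accept(String w) {
        if (!peek().equals(w)) return false;
        pos++;
        return true;
    }

    // Consumes the next word if it belongs to the set.
    private boolean acceptAny(Set<String> set) {
        if (!set.contains(peek())) return false;
        pos++;
        return true;
    }

    // Sentence ::= Subject Verb Object .
    private boolean sentence() {
        return subject() && verb() && object() && accept(".") && pos == words.size();
    }

    // Subject ::= I | a Noun | the Noun
    private boolean subject() {
        return accept("i") || articleNoun();
    }

    // Object ::= me | a Noun | the Noun
    private boolean object() {
        return accept("me") || articleNoun();
    }

    // Shared alternative: a Noun | the Noun
    private boolean articleNoun() {
        return (accept("a") || accept("the")) && acceptAny(NOUNS);
    }

    // Verb ::= like | is | see | sees
    private boolean verb() {
        return acceptAny(VERBS);
    }

    static boolean recognizes(String text) {
        // Assumption: lower-case everything and treat "." as a separate token.
        String spaced = text.toLowerCase(Locale.ROOT).replace(".", " . ").trim();
        if (spaced.isEmpty()) return false;
        return new MicroEnglish(Arrays.asList(spaced.split("\\s+"))).sentence();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"The cat sees the rat.", "I sees a rat.", "I like a cat"}) {
            System.out.println(s + " -> " + recognizes(s));
        }
    }
}
```

The grammar as written does not encode subject-verb agreement, so the sketch accepts "I sees a rat." and "The rat like me." just as it accepts "The cat sees the rat."; it rejects input that lacks the final period.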
