Syntax Juan Carlos Guzmán CS 3123 Programming Languages Concepts Southern Polytechnic State University
What does your DOS computer do when …? • > copy a.txt b.txt • > copy a.txt a.txt • > del *.* • > del *01.* • > type a.txt > null: • > type a.txt > nul:
Semiotic • Synthesized from Merriam-Webster (m-w.com) • a general philosophical theory of signs and symbols that deals especially with their function in both artificially constructed and natural languages and comprises: • syntactics • the formal relations between signs or expressions in abstraction from their signification and their interpreters • semantics • the relations between signs and what they refer to • pragmatics • the relation between signs or linguistic expressions and their users
Syntax • Two levels: • The language level, properly known as parsing • The lexeme level, known as lexing • More information about this topic can be found in • Aho, Sethi, Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1988. (on reserve, The Dragon book)
Lexing • Specification of the lexemes of the language • A class of lexemes is known as a token • Tokens are specified in regular expressions: • letter, empty string • concatenation • choice • closure • Many convenient extensions • Recognized by Finite Automata • Limited in Power: cannot count, cannot recognize aⁿbⁿ
Sample Regular Expressions • digit ::= (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) • ldigit ::= (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) • natural ::= ldigit digit* • integer ::= (+ | - | ε) (natural | 0) • How about floating points? • W/o exponents • add the exponents
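The definitions above map almost one-to-one onto a regex library. The following is a minimal sketch using Python's `re` module; the names `DIGIT`, `LDIGIT`, `NATURAL`, `INTEGER` and the helper `matches` are illustrative, and `FLOAT` is only one possible answer to the floating-point question (without exponents), not necessarily the one intended in the course.

```python
import re

# Direct transcriptions of the slide's definitions.
DIGIT = r"[0-9]"
LDIGIT = r"[1-9]"
NATURAL = rf"{LDIGIT}{DIGIT}*"            # natural ::= ldigit digit*
INTEGER = rf"[+-]?(?:{NATURAL}|0)"        # integer ::= (+ | - | ε) (natural | 0)

# Assumption: one possible floating-point form without exponents.
FLOAT = rf"{INTEGER}\.{DIGIT}+"


def matches(pattern: str, text: str) -> bool:
    """True when the whole of `text` is a lexeme of `pattern`."""
    return re.fullmatch(pattern, text) is not None
```

Under these definitions `matches(INTEGER, "-42")` holds, while `matches(INTEGER, "007")` does not, because `natural` forbids a leading zero.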
Parsing • Specification of the language structure • The parser • recognizes the phrase, and • reconstructs its structure (parse tree)
Context-Free Grammars • Generate Context-Free Languages • Allow recursion • Are specified as G=(N,T,P,S) where • N is the set of “non-terminals”, or variables • T is the alphabet • P the “production set” • S the starting symbol for every phrase
CFG (Example) • G1 = ({S,A,B}, {a,b}, P, S) where P = {S → ASB, S → BSA, S → ε, A → a, B → b} • G2 = ({E}, {a,+,*,(,)}, P, E) where P = {E → E+E, E → E*E, E → a, E → (E)}
Grammars (conventions) • The empty string: ε • First uppercase letters of the alphabet (A, B, C, …) => Non-terminal • First lowercase letters of the alphabet (a, b, c, …), or numbers (1, 2, …) => Terminal • First lowercase Greek letters (α, β, γ, …) => string of terminals and non-terminals • Last lowercase letters of the alphabet (t, u, v, …) => string of terminals
Derivation • How do we generate phrases in the language? • By using a derivation: αAβ => αγβ iff (A → γ) ∈ P • E => E+E => E+E*E => a+E*E => a+E*a => a+a*a
The Language Generated • The language generated by the grammar is composed of all strings of terminals that can be derived from S by applying production rules one or more times • Anything derived from S is called a sentential form
Derivations • Leftmost derivation: the leftmost non-terminal is always the one expanded: E => E*E => E+E*E => a+E*E => a+a*E => a+a*a • Rightmost derivation: the rightmost non-terminal is always the one expanded: E => E+E => E+E*E => E+E*a => E+a*a => a+a*a
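The two derivations can be replayed mechanically. The sketch below assumes the ambiguous grammar G2 from the CFG example slide; the helper `derive` and the dictionary `P` are illustrative names, and each step simply rewrites the leftmost (or rightmost) non-terminal with the chosen right-hand side.

```python
# Productions of G2 = ({E}, {a,+,*,(,)}, P, E).
P = {"E": ["E+E", "E*E", "a", "(E)"]}


def derive(start, choices, leftmost=True):
    """Apply each right-hand side in `choices` to the leftmost
    (or rightmost) non-terminal and return every sentential form."""
    form = [start]
    steps = [start]
    for rhs in choices:
        positions = [i for i, sym in enumerate(form) if sym in P]
        if not positions:
            raise ValueError("no non-terminal left to expand")
        i = positions[0] if leftmost else positions[-1]
        if rhs not in P[form[i]]:
            raise ValueError(f"{form[i]} -> {rhs} is not a production")
        form[i:i + 1] = list(rhs)
        steps.append("".join(form))
    return steps


left = derive("E", ["E*E", "E+E", "a", "a", "a"], leftmost=True)
right = derive("E", ["E+E", "E*E", "a", "a", "a"], leftmost=False)
print(" => ".join(left))   # E => E*E => E+E*E => a+E*E => a+a*E => a+a*a
print(" => ".join(right))  # E => E+E => E+E*E => E+E*a => E+a*a => a+a*a
```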
Parse Tree • A structured sequence of derivations • Visually appealing • From previous example (bracket form, [X children]): • leftmost derivation: [E [E [E a] + [E a]] * [E a]] • rightmost derivation: [E [E a] + [E [E a] * [E a]]]
Ambiguous Grammar • Two different parse trees for a single phrase • Just one phrase with two trees is proof of ambiguity • Not ambiguous? All phrases must have only one parse tree! • An ambiguous grammar is quite different from an inherently ambiguous language
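One way to see the ambiguity concretely is to count parse trees. The sketch below is specific to the grammar E → E+E | E*E | a | (E) from the CFG example slide; `count_trees` is a hypothetical helper that counts, for every substring, how many trees derive it from E.

```python
from functools import lru_cache


def count_trees(s: str) -> int:
    """Number of distinct parse trees of `s` under
    E -> E+E | E*E | a | (E)."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        n = 0
        if j - i == 1 and s[i] == "a":                      # E -> a
            n += 1
        if j - i >= 3 and s[i] == "(" and s[j - 1] == ")":  # E -> (E)
            n += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                       # E -> E+E | E*E
            if s[k] in "+*":
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(s))


print(count_trees("a+a*a"))    # 2
print(count_trees("(a+a)*a"))  # 1
```

A count above 1 for a single phrase is enough to show that the grammar is ambiguous; a count of 1 for some phrases proves nothing about the grammar as a whole.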
Grammars vs. Languages • A language is a set • A grammar is a medium by which the set can be formally specified • Many grammars specify the same set
An Expression Grammar • The grammar for expressions presented before was ambiguous • Non-ambiguous, with correct precedence (relative priority given to + and *): E → E + T | T, T → T * F | F, F → a | ( E ) • Parse tree of a+a*a (bracket form): [E [E [T [F a]]] + [T [T [F a]] * [F a]]]
Parsing Styles • Top-down: to derive w from S, start from S, derive until w is obtained • Bottom-up: to derive w from S, try doing ‘reverse derivations’ from w until S is obtained
Parsing Styles • Top-down: LL(k) • Easy to implement and understand • hand-coded • table-driven • Limited use, many problems • Bottom-up: LR(k) • More difficult to understand • table driven • A nice trade-off between complexity and generality
An Expression Grammar G = ({E,T,F},{a,+,*,(,)},P,E) where P = {E → T+E | T, T → F*T | F, F → a | (E)} • Is a+a*a in L(G)? • Parse tree (bracket form): [E [T [F a]] + [E [T [F a] * [T [F a]]]]]
A Grammar for a Small Language • program → begin stmt_list end • stmt_list → stmt | stmt ; stmt_list • stmt → var = expression • var → A | B | C • expression → var + var | var - var | var
Predictive Parsing • How many characters of look-ahead are needed to predict the next production to take? • Is this a finite number? • Is it 1?
Another Expression Grammar G’ = ({E,E’,T,T’,F},{a,+,*,(,)},P,E) where P = {E → T E’, E’ → + T E’ | ε, T → F T’, T’ → * F T’ | ε, F → a | (E)} • Is a+a*a in L(G’)? • Parse tree (bracket form): [E [T [F a] [T’ ε]] [E’ + [T [F a] [T’ * [F a] [T’ ε]]] [E’ ε]]]
LL(1) Algorithm
Parse(a1 … an, X1 … Xm) {    // arguments: input, stack
    if (a1 = $) & (X1 = $) accept
    else if X1 is a terminal and (X1 = a1)
        Parse(a2 … an, X2 … Xm)            // match
    else if Table[X1,a1] = X1 → Y1 … Yk
        Parse(a1 … an, Y1 … Yk X2 … Xm)    // derive
    else fail
}
• Call initially with Parse(w$, S$), where w is the phrase to parse and S is the starting symbol of the grammar
• ai is a terminal
• Xj, Yk are terminals or nonterminals
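The algorithm can be tried out on the grammar G’ with a small table-driven loop. This is a sketch under stated assumptions: the `TABLE` below was filled in by hand for G’ (the slides do not list it), the recursion is replaced by a loop, and `ll1_parse` and the returned list of operation names are illustrative choices.

```python
# Table[X, a] -> right-hand side to push, for the grammar G'
# (E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> a | (E)).
# An empty list stands for an ε-production.
TABLE = {
    ("E", "a"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "a"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "a"): ["a"], ("F", "("): ["(", "E", ")"],
}
TERMINALS = set("a+*()")


def ll1_parse(w, start="E"):
    """Return the list of operations (derive/match/accept/fail)."""
    inp = list(w) + ["$"]
    stack = [start, "$"]          # stack[0] is the top, like X1 on the slide
    ops = []
    while True:
        a, x = inp[0], stack[0]
        if a == "$" and x == "$":
            ops.append("accept")
            return ops
        if x in TERMINALS and x == a:
            inp.pop(0)
            stack.pop(0)
            ops.append("match")
        elif (x, a) in TABLE:
            stack[0:1] = TABLE[(x, a)]   # replace X1 by Y1 … Yk
            ops.append("derive")
        else:
            ops.append("fail")
            return ops


print(ll1_parse("a+a*a"))
```

For `a+a*a` the loop performs 17 operations and ends in `accept`; an input such as `a+*a` ends in `fail` because `Table[T, *]` has no entry.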
Parser Operation on a+a*a
# | INPUT | STACK | OPERATION | SENTENTIAL FORM
1 | a + a * a $ | E $ | derive | E $
2 | a + a * a $ | T E’ $ | derive | T E’ $
3 | a + a * a $ | F T’ E’ $ | derive | F T’ E’ $
4 | a + a * a $ | a T’ E’ $ | match | a T’ E’ $
5 | + a * a $ | T’ E’ $ | derive | a T’ E’ $
6 | + a * a $ | E’ $ | derive | a E’ $
7 | + a * a $ | + T E’ $ | match | a + T E’ $
8 | a * a $ | T E’ $ | derive | a + T E’ $
9 | a * a $ | F T’ E’ $ | derive | a + F T’ E’ $
10 | a * a $ | a T’ E’ $ | match | a + a T’ E’ $
11 | * a $ | T’ E’ $ | derive | a + a T’ E’ $
12 | * a $ | * F T’ E’ $ | match | a + a * F T’ E’ $
13 | a $ | F T’ E’ $ | derive | a + a * F T’ E’ $
14 | a $ | a T’ E’ $ | match | a + a * a T’ E’ $
15 | $ | T’ E’ $ | derive | a + a * a T’ E’ $
16 | $ | E’ $ | derive | a + a * a E’ $
17 | $ | $ | accept | a + a * a $
Note how the leftmost derivation of a+a*a is done • Sentential forms, in order: E => T E’ => F T’ E’ => a T’ E’ => a E’ => a + T E’ => a + F T’ E’ => a + a T’ E’ => a + a * F T’ E’ => a + a * a T’ E’ => a + a * a E’ => a + a * a • Parse tree (bracket form): [E [T [F a] [T’ ε]] [E’ + [T [F a] [T’ * [F a] [T’ ε]]] [E’ ε]]]
What’s the Table Lookup • Note that the predictive nature of the parser guarantees the uniqueness of the entry for Table[A,b] (or no entry at all) • When attempting to derive nonterminal A, the look-ahead b must give the correct rule to apply • This b can be • the initial character of the derivation of A, i.e., A =>* bβ, • or, it can be the initial character of the derivation of what follows A! (A =>* ε)
First Sets • first(α) is the set of one-character prefixes of strings of terminals that can be derived from α • If the empty string can be derived from α, then ε will also be in the set • if α =>* aw then a ∈ first(α) • if α =>* ε then ε ∈ first(α)
First Sets (II) • first(ε) = {ε} • first(a) = {a} • first(A) = first(α1) ∪ … ∪ first(αn) if A → α1 ∈ P, …, A → αn ∈ P • first(Xα) = first(X) ⊕ first(α) where X is either terminal or nonterminal
Bounded Concatenation • In computing first(Xα), our interest is to obtain one-character prefixes (or ε) • Consider the operation ⊕ at the char level • ε ⊕ x = x, where x is either ε or a terminal • a ⊕ x = a • Generalize it to work on sets • A ⊕ B = {v ⊕ w | v ∈ A, w ∈ B}, where A & B are sets
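The first-set rules and bounded concatenation can be combined into a small fixed-point computation. The sketch below assumes the grammar G’ from the earlier slide, encoded as a dictionary of right-hand sides (an empty list is an ε-production); `bconcat`, `first_of_string` and `first_sets` are illustrative names, and the iteration strategy is one simple choice among several.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["a"], ["(", "E", ")"]],
}


def bconcat(A, B):
    """Bounded concatenation on sets: ε⊕w = w, a⊕w = a.
    If B is still empty (not yet computed) the result is empty;
    later iterations fill it in."""
    return {w if v == EPS else v for v in A for w in B}


def first_of_string(alpha, first):
    """first(X1 … Xn) = first(X1) ⊕ … ⊕ first(Xn); first(ε) = {ε}."""
    result = {EPS}
    for X in alpha:
        result = bconcat(result, first[X] if X in first else {X})
    return result


def first_sets(grammar):
    """Iterate the equations until no first set grows any more."""
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, alternatives in grammar.items():
            for alpha in alternatives:
                new = first_of_string(alpha, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first


FIRST = first_sets(GRAMMAR)
print(FIRST["E"], FIRST["E'"], FIRST["T'"])
```

For G’ this yields first(E) = first(T) = first(F) = {a, (}, first(E’) = {+, ε} and first(T’) = {*, ε}.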
Follow Sets • Follow(A) is the set of prefixes of strings of terminals that can follow any derivation of A in G • $ ∈ follow(S) • if (B → αAβ) ∈ P, then • first(β) ⊕ follow(B) ⊆ follow(A) • The definition of follow usually results in recursive set definitions. In order to solve them, you need to do several iterations on the equations • ε never appears in any follow set • Note: I had promised a closed definition of follow, but it will be unnecessarily complex. JCG.
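The "several iterations on the equations" can be sketched as a loop that repeats until nothing changes. The code below assumes the grammar G’ and takes its first sets as given (hard-coded in `FIRST`, as they follow from the first-set rules); `bconcat`, `first_then` and `follow_sets` are illustrative names.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["a"], ["(", "E", ")"]],
}

# Assumed input: the first sets of the non-terminals of G'.
FIRST = {
    "E": {"a", "("}, "E'": {"+", EPS},
    "T": {"a", "("}, "T'": {"*", EPS},
    "F": {"a", "("},
}


def bconcat(A, B):
    """Bounded concatenation on sets: ε⊕w = w, a⊕w = a."""
    return {w if v == EPS else v for v in A for w in B}


def first_then(beta, tail):
    """first(β) ⊕ tail; a terminal is its own first set."""
    result = {EPS}
    for X in beta:
        result = bconcat(result, FIRST.get(X, {X}))
    return bconcat(result, tail)


def follow_sets(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                      # $ ∈ follow(S)
    changed = True
    while changed:                              # iterate to a fixed point
        changed = False
        for B, alternatives in grammar.items():
            for rhs in alternatives:
                for i, A in enumerate(rhs):
                    if A not in grammar:
                        continue                # terminals have no follow set
                    # first(β) ⊕ follow(B) ⊆ follow(A) for B → αAβ
                    new = first_then(rhs[i + 1:], follow[B])
                    if not new <= follow[A]:
                        follow[A] |= new
                        changed = True
    return follow


FOLLOW = follow_sets(GRAMMAR, "E")
print(FOLLOW)
```

Because every follow set starts out without ε and ⊕ replaces ε by an element of the right operand, ε never enters any follow set.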
How to Fill In the Table • For each production (A → α) ∈ P let X = first(α) ⊕ follow(A); then for all x ∈ X, (A → α) ∈ Table[A,x] • After processing all productions, each cell of the table must have, at most, one production • if not, your grammar is not LL(1) (nice try!)
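The table-filling rule can be sketched directly. The code below assumes the grammar G’ and takes its first and follow sets as given (hard-coded in `FIRST` and `FOLLOW`); `build_table` and `is_ll1` are illustrative names, and each cell is kept as a list so that conflicts remain visible.

```python
EPS = "ε"

PRODUCTIONS = [
    ("E", ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", []),
    ("T", ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F", ["a"]), ("F", ["(", "E", ")"]),
]

# Assumed inputs: first and follow sets of G'.
FIRST = {
    "E": {"a", "("}, "E'": {"+", EPS},
    "T": {"a", "("}, "T'": {"*", EPS},
    "F": {"a", "("},
}
FOLLOW = {
    "E": {"$", ")"}, "E'": {"$", ")"},
    "T": {"+", "$", ")"}, "T'": {"+", "$", ")"},
    "F": {"*", "+", "$", ")"},
}


def bconcat(A, B):
    """Bounded concatenation on sets: ε⊕w = w, a⊕w = a."""
    return {w if v == EPS else v for v in A for w in B}


def first_of_string(alpha):
    result = {EPS}
    for X in alpha:
        result = bconcat(result, FIRST.get(X, {X}))
    return result


def build_table(productions):
    """Table[A, x] collects every A → α with x ∈ first(α) ⊕ follow(A)."""
    table = {}
    for A, alpha in productions:
        for x in bconcat(first_of_string(alpha), FOLLOW[A]):
            table.setdefault((A, x), []).append(alpha)
    return table


def is_ll1(table):
    """LL(1) means no cell holds more than one production."""
    return all(len(cell) == 1 for cell in table.values())


TABLE = build_table(PRODUCTIONS)
print(is_ll1(TABLE), len(TABLE))  # True 13
```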
Yet Another Expression Grammar (it’s in the book!) G = ({E,T,F},{a,+,*,(,)},P,E) where P = {(1) E → E+T, (2) E → T, (3) T → T*F, (4) T → F, (5) F → (E), (6) F → a} • Is a+a*a in L(G)? • Parse tree (bracket form): [E [E [T [F a]]] + [T [T [F a]] * [F a]]]
LR(1) Parsing Table • Sn: shift to state n • Rn: reduce according to production n
LR(1) Algorithm
Parse(S0 X1 S1 X2 S2 … Xr Sr … Xm Sm, a1 … an) {    // arguments: stack, input
    if Action[Sm,a1] == Shift S
        Parse(S0 X1 S1 X2 S2 … Xm Sm a1 S, a2 … an)
    else if Action[Sm,a1] == Reduce A → Xr+1 … Xm and GOTO[Sr,A] == S
        Parse(S0 X1 S1 X2 S2 … Xr Sr A S, a1 … an)
    else if Action[Sm,a1] == Accept
        accept
    else if Action[Sm,a1] == Error
        error
}
• Call initially with Parse(S0, w$), where w is the phrase to parse and S0 is the initial state of the table
• ai is a terminal
• Xj are terminals or nonterminals
• Si is a “state”
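The driver loop can be sketched in a few lines once a table is available. The `ACTION` and `GOTO` entries below are an assumption: they are the standard SLR(1) table for this grammar as given in the Dragon book (with `a` in place of `id`), since the table from the slide is not reproduced here. The sketch keeps only states on the stack (the grammar symbols are redundant), and `lr_parse` is an illustrative name.

```python
# production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

# Assumed table: "sN" = shift to state N, "rN" = reduce by production N.
ACTION = {
    0: {"a": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"a": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"a": "s5", "(": "s4"},
    7: {"a": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def lr_parse(w):
    """Return the list of operations performed on input w."""
    inp = list(w) + ["$"]
    states = [0]                      # state stack; top is states[-1]
    ops = []
    while True:
        act = ACTION[states[-1]].get(inp[0])
        if act is None:               # empty cell means error
            ops.append("error")
            return ops
        if act == "acc":
            ops.append("accept")
            return ops
        n = int(act[1:])
        if act[0] == "s":             # shift: push state n, consume input
            states.append(n)
            inp.pop(0)
        else:                         # reduce: pop |rhs| states, then GOTO
            lhs, length = PRODUCTIONS[n]
            del states[len(states) - length:]
            states.append(GOTO[states[-1]][lhs])
        ops.append(act.upper())


print(lr_parse("a+a*a"))
```

On `a+a*a` the sketch performs 13 shift/reduce steps (S5, R6, R4, R2, S6, S5, R6, R4, S7, S5, R6, R3, R1) followed by `accept`.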
Parser Operation on a+a*a
# | STACK | INPUT | OPERATION | SENTENTIAL FORM
1 | 0 | a + a * a $ | S 5 | a + a * a $
2 | 0 a 5 | + a * a $ | R 6, G[0,F] | a + a * a $
3 | 0 F 3 | + a * a $ | R 4, G[0,T] | F + a * a $
4 | 0 T 2 | + a * a $ | R 2, G[0,E] | T + a * a $
5 | 0 E 1 | + a * a $ | S 6 | E + a * a $
6 | 0 E 1 + 6 | a * a $ | S 5 | E + a * a $
7 | 0 E 1 + 6 a 5 | * a $ | R 6, G[6,F] | E + a * a $
8 | 0 E 1 + 6 F 3 | * a $ | R 4, G[6,T] | E + F * a $
9 | 0 E 1 + 6 T 9 | * a $ | S 7 | E + T * a $
10 | 0 E 1 + 6 T 9 * 7 | a $ | S 5 | E + T * a $
11 | 0 E 1 + 6 T 9 * 7 a 5 | $ | R 6, G[7,F] | E + T * a $
12 | 0 E 1 + 6 T 9 * 7 F 10 | $ | R 3, G[6,T] | E + T * F $
13 | 0 E 1 + 6 T 9 | $ | R 1, G[0,E] | E + T $
14 | 0 E 1 | $ | accept | E $
Note how the rightmost derivation of a+a*a is done (read the parser's sentential forms in reverse) • E => E + T => E + T * F => E + T * a => E + F * a => E + a * a => T + a * a => F + a * a => a + a * a • Parse tree (bracket form): [E [E [T [F a]]] + [T [T [F a]] * [F a]]]