Discrete Maths. 241-303 , Semester 1 2014-2015. Objectives to introduce grammars and show their importance for defining programming languages and writing compilers; to show the connection between REs and grammars. 8 . Grammars. Overview. 1. Why Grammars? 2. Languages
Discrete Maths 241-303, Semester 1 2014-2015
8. Grammars
• Objectives
  • to introduce grammars and show their importance for defining programming languages and writing compilers;
  • to show the connection between REs and grammars
Overview 1. Why Grammars? 2. Languages 3. Using a Grammar 4. Parse Trees 5. Ambiguous Grammars 6. Top-down and Bottom-up Parsing continued
7. Building Recursive Descent Parsers 8. Making the Translation Easy 9. Building a Parse Tree 10. Kinds of Grammars 11. From RE to a Grammar 12. Context-free Grammars vs. REs
1. Why Grammars? • Grammars are the standard way of defining programming languages. • Tools exist for translating grammars into compilers (e.g. JavaCC, lex, yacc, ANTLR) • this saves weeks of work
2. Languages • We use a natural language to communicate • its grammar rules are very complex • the rules don’t cover important things • We use a formal language to define a programming language • its grammar rules are fairly simple • the rules cover almost everything continued
A formal language is a set of legal strings. • The strings are legal if they correctly use the language’s alphabet and grammar rules. • The alphabet is often called the language’s terminal symbols (or terminals).
Example 1
• Alphabet (terminals) = {1, 2, 3}
• Using the grammar rules (not shown here; see later), the language is: L1 = {11, 12, 13, 21, 22, 23, 31, 32, 33}
• L1 is the set of strings of length 2.
Example 2 • Terminals = {1, 2, 3} • Using different grammar rules, the language is: L2 = { 111, 222, 333} • L2 is the set of strings of length 3, where all the terminals are the same.
Example 3 • Terminals = {1, 2, 3} • Using different grammar rules, the language is: L3 = {2, 12, 22, 32, 112, 122, 132, ...} • L3 is the set of strings whose numerical value is divisible by 2.
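The three example languages can be written down directly as sets of strings. The sketch below is only an illustration: the names ALPHABET, L1, L2 and in_L3 are made up here, and since the slides do not show the grammar rules at this point, the sets are built from their English descriptions rather than from productions.

```python
from itertools import product

ALPHABET = "123"  # the terminals shared by all three examples

# L1: every string of length 2 over the alphabet.
L1 = {"".join(pair) for pair in product(ALPHABET, repeat=2)}

# L2: strings of length 3 in which all terminals are the same.
L2 = {terminal * 3 for terminal in ALPHABET}

def in_L3(string):
    """L3 is infinite, so it is described by a membership test, not a set."""
    uses_alphabet = len(string) > 0 and all(ch in ALPHABET for ch in string)
    return uses_alphabet and int(string) % 2 == 0

print(sorted(L1))
print(sorted(L2))
print([s for s in ["2", "12", "13", "132", "31"] if in_L3(s)])
```

A finite language can be listed in full, while an infinite one such as L3 needs a rule that decides membership; a grammar is one way of giving such a rule.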
3. Using a Grammar
• A grammar is a notation for defining a language, and is made from 4 parts:
  • the terminal symbols
  • the syntactic categories (nonterminal symbols)
    • e.g. statement, expression, noun, verb
  • the grammar rules (productions)
    • e.g. A => B1 B2 ... Bn
  • the starting nonterminal
    • the top-most syntactic category for this grammar
continued
We define a grammar G as a 4-tuple: G = (T, N, P, S) • T = terminal symbols • N = nonterminal symbols • P = productions • S = starting nonterminal
3.1. Example 1
• Consider the grammar:
  T = {0, 1}
  N = {S, R}
  P = { S => 0
        S => 0 R
        R => 1 S }
  S is the starting nonterminal
• The right-hand sides of productions usually use a mix of terminals and nonterminals.
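As a rough illustration, the 4-tuple G = (T, N, P, S) for this grammar could be stored as a small record. The class name Grammar, its field names and the helper is_well_formed are assumptions made for this sketch, not part of the course material.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    productions: tuple        # P, as (left-hand side, right-hand side) pairs
    start: str                # S

G = Grammar(
    terminals=frozenset({"0", "1"}),
    nonterminals=frozenset({"S", "R"}),
    productions=(
        ("S", ("0",)),
        ("S", ("0", "R")),
        ("R", ("1", "S")),
    ),
    start="S",
)

def is_well_formed(g):
    """Check that T and N are disjoint and every symbol used is declared."""
    symbols = g.terminals | g.nonterminals
    return (
        g.start in g.nonterminals
        and not (g.terminals & g.nonterminals)
        and all(
            lhs in g.nonterminals and all(sym in symbols for sym in rhs)
            for lhs, rhs in g.productions
        )
    )

print(is_well_formed(G))
```

Keeping the four parts separate mirrors the definition; a tool would typically derive T and N from the productions instead of asking for them.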
Is “01010” in the language?
• Start with an S rule:
  Rule          String Generated
  --            S
  S => 0 R      0 R
  R => 1 S      0 1 S
  S => 0 R      0 1 0 R
  R => 1 S      0 1 0 1 S
  S => 0        0 1 0 1 0
• No more rules can be applied since there are no more nonterminals left in the string.
• Yes, it is in the language.
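The derivation of “01010” can be replayed mechanically. This is a minimal sketch: the helper apply_rule and the list-of-symbols representation are assumptions chosen for the illustration.

```python
def apply_rule(form, lhs, rhs):
    """Rewrite the leftmost occurrence of nonterminal lhs in form with rhs."""
    i = form.index(lhs)
    return form[:i] + rhs + form[i + 1:]

# The rules used in the derivation, in order.
steps = [
    ("S", ["0", "R"]),
    ("R", ["1", "S"]),
    ("S", ["0", "R"]),
    ("R", ["1", "S"]),
    ("S", ["0"]),
]

form = ["S"]          # start with the starting nonterminal
trace = [form]
for lhs, rhs in steps:
    form = apply_rule(form, lhs, rhs)
    trace.append(form)

for sentential_form in trace:
    print(" ".join(sentential_form))
```

Each printed line corresponds to one entry of the "String Generated" column.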
Example 2
• Consider the grammar:
  T = {a, b, c, d, z}
  N = {S, R, U, V}
  P = { S => R U z | z
        R => a | b
        U => d V U | c
        V => b | c }
  S is the starting nonterminal
The notation:
  X => Y | Z
is shorthand for the two rules:
  X => Y
  X => Z
• Read ‘|’ as ‘or’.
Is “adbdbcz” in the language?
  Rule            String Generated
  --              S
  S => R U z      R U z
  R => a          a U z
  U => d V U      a d V U z
  V => b          a d b U z
  U => d V U      a d b d V U z
  V => b          a d b d b U z
  U => c          a d b d b c z
• Yes!
• This grammar has choices about how to rewrite the string.
Is “abdbcz” in the language? No
  Rule            String Generated
  --              S
  S => R U z      R U z
  R => a          a U z
  which U rule?
• U must be replaced by something beginning with a ‘b’, but the only U rules are: U => d V U | c
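The two membership questions for the Example 2 grammar can be answered by trying the choices systematically. The sketch below is one simple way to do that, under stated assumptions: the function name in_language is invented, every terminal is a single character, and the grammar has no empty right-hand sides (so a sentential form longer than the target can be abandoned). It is a brute-force search, not how a real parser works.

```python
PRODUCTIONS = {
    "S": [("R", "U", "z"), ("z",)],
    "R": [("a",), ("b",)],
    "U": [("d", "V", "U"), ("c",)],
    "V": [("b",), ("c",)],
}

def in_language(productions, start, target):
    """Search leftmost derivations of target, pruning hopeless forms."""
    def search(form):
        # Locate the leftmost nonterminal, if there is one.
        for i, symbol in enumerate(form):
            if symbol in productions:
                break
        else:
            return "".join(form) == target   # only terminals are left
        # The terminals before position i must already match the target,
        # and (with no empty rules) the form may not outgrow the target.
        if "".join(form[:i]) != target[:i] or len(form) > len(target):
            return False
        return any(
            search(form[:i] + rhs + form[i + 1:])
            for rhs in productions[symbol]
        )
    return search((start,))

print(in_language(PRODUCTIONS, "S", "adbdbcz"))
print(in_language(PRODUCTIONS, "S", "abdbcz"))
```

For “abdbcz” the search fails at the same point as the hand derivation: after "a U z", neither U rule produces a string starting with ‘b’.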
3.2. BNF • BNF is a shorthand notation for productions • Backus Normal Form, or • Backus-Naur Form • We have already used ‘|’: X => Y1 | Y2 | ... | Yn continued
X => Y [ Z ]
is shorthand for two rules:
  X => Y
  X => Y Z
• [ Z ] means 0 or 1 occurrences of Z.
continued
X => Y { Z }
is shorthand for an infinite number of rules:
  X => Y
  X => Y Z
  X => Y Z Z
  X => Y Z Z Z
  :
• { Z } means 0 or more occurrences of Z.
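The BNF shorthands can be unfolded into plain rules by a few lines of code. In this sketch the function names expand_optional and expand_repetition are hypothetical, and the repetition is cut off at a chosen maximum because the full expansion is infinite.

```python
def expand_optional(lhs, prefix, optional):
    """Rules that `lhs => prefix [ optional ]` stands for."""
    return [(lhs, [prefix]), (lhs, [prefix, optional])]

def expand_repetition(lhs, prefix, repeated, max_repeats):
    """First rules that `lhs => prefix { repeated }` stands for."""
    return [(lhs, [prefix] + [repeated] * n) for n in range(max_repeats + 1)]

for lhs, rhs in expand_optional("X", "Y", "Z"):
    print(lhs, "=>", " ".join(rhs))

for lhs, rhs in expand_repetition("X", "Y", "Z", 3):
    print(lhs, "=>", " ".join(rhs))
```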
3.3. A Grammar for Expressions
• Consider the grammar:
  T = { 0, 1, 2, ..., 9, +, -, *, /, (, ) }
  N = { Expr, Number }
  P = { Expr => Number
        Expr => ( Expr )
        Expr => Expr + Expr | Expr - Expr | Expr * Expr | Expr / Expr }
  Expr is the starting nonterminal
Defining Number
• The RE definition for a number is:
  number = digit digit*
  digit = [0-9]
• The productions for Number are:
  Number => Digit { Digit }
  Digit => 0 | 1 | 2 | 3 | ... | 9
  or
  Number => Number Digit | Digit
  Digit => 0 | 1 | 2 | 3 | ... | 9
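The RE and the productions for Number describe the same strings, which can be checked on a few inputs. This is only a sketch: is_number_re uses Python's re module for the RE, while is_number and is_digit are hypothetical helpers that follow the productions Number => Number Digit | Digit by hand.

```python
import re

NUMBER_RE = re.compile(r"[0-9][0-9]*")   # number = digit digit*

def is_number_re(string):
    return NUMBER_RE.fullmatch(string) is not None

def is_digit(string):
    # Digit => 0 | 1 | 2 | ... | 9
    return len(string) == 1 and string in "0123456789"

def is_number(string):
    # Number => Digit
    if is_digit(string):
        return True
    # Number => Number Digit: the last character is the Digit,
    # everything before it must itself be a Number.
    return len(string) > 1 and is_digit(string[-1]) and is_number(string[:-1])

for sample in ["125", "2", "", "12a", "007"]:
    print(repr(sample), is_number_re(sample), is_number(sample))
```

Both tests agree on every sample, accepting "125", "2" and "007" and rejecting the empty string and "12a".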
Using Productions
• Expand Expr into (125-2)*3
  Expr => Expr * Expr
       => ( Expr ) * Expr
       => ( Expr - Expr ) * Expr
       => ( Number - Number ) * Number
       :
       => ( 125 - 2 ) * 3
continued
Expand Number into 125
  Number => Number Digit
         => Number Digit Digit
         => Digit Digit Digit
         => 1 2 5
3.4. Grammars are not Unique
• Two grammars that do the same thing (e is the empty string):
  Balanced => e
  Balanced => ( Balanced ) Balanced
  and:
  Balanced => e
  Balanced => ( Balanced )
  Balanced => Balanced Balanced
• Both generate the same strings, e.g.:
  (()(()))    ()    e    (()())
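One way to gain confidence that the two Balanced grammars agree is to list every string up to some length from each and compare. The sketch below does this by repeatedly feeding known strings back into the productions until nothing new appears; the function name strings_up_to and the length bound of 8 are assumptions for the illustration. Agreement up to a bound is evidence, not a proof of equivalence.

```python
from itertools import product

# The empty tuple () stands for the empty string e.
GRAMMAR_1 = {"B": [(), ("(", "B", ")", "B")]}
GRAMMAR_2 = {"B": [(), ("(", "B", ")"), ("B", "B")]}

def strings_up_to(productions, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    known = {nt: set() for nt in productions}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in productions.items():
            for rhs in alternatives:
                # A nonterminal offers its known strings, a terminal itself.
                choices = [
                    tuple(known[sym]) if sym in productions else (sym,)
                    for sym in rhs
                ]
                for parts in product(*choices):
                    string = "".join(parts)
                    if len(string) <= max_len and string not in known[nt]:
                        known[nt].add(string)
                        changed = True
    return known[start]

first = strings_up_to(GRAMMAR_1, "B", 8)
second = strings_up_to(GRAMMAR_2, "B", 8)
print(first == second, len(first))
```

Dropping strings longer than the bound is safe here because every piece of a short string is itself short, so nothing within the bound is lost.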
3.5. Productions for parts of C
• Control structures:
  Statement => while ( Cond ) Statement
  Statement => if ( Cond ) Statement
  Statement => if ( Cond ) Statement else Statement
• Testing (conditionals):
  Cond => Expr < Expr | Expr > Expr | ...
continued
Statement blocks:
  Statement => ‘{’ StatList ‘}’
  StatList => Statement ; StatList | Statement ;
Using the Statement Production
  Statement => while ( Cond ) Statement
            => while ( Expr < Expr ) Statement
            => while ( Expr < Expr ) { StatList }
            => while ( Expr < Expr ) { Statement ; Statement ; }
            :
            => while (x < 10) { y++; x++; }
• This example requires an extra Expr production for variables: Expr => VariableName
3.6. Generating a Language • For a given grammar, what strings can it generate? • the language is the set of legal strings • Most languages contain an infinite number of strings (e.g. English) • but there is a process for generating them continued
For each production, list the strings that can be derived immediately. • On the 2nd round, put those strings back into the productions to generate more strings. • On the 3rd round, put those strings back... • Continue for as many rounds as you want.
Example
• Consider the grammar:
  T = { w, c, s, ‘{’, ‘}’, ‘;’ }
  N = { S, L }
  P = { S => w c S | ‘{’ L ‘}’ | s ‘;’
        L => L S | e }
  S is the starting nonterminal
Strings in First 3 Rounds
             S                        L
  Round 1:   s;                       e
  Round 2:   wcs;  {}                 s;
  Round 3:   wcwcs;  wc{}  {s;}       wcs;  {}  s;s;  s;wcs;  s;{}
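The round-by-round process can be sketched in code and compared against the table. The generator name rounds is hypothetical, and the empty tuple () is used to stand for the empty string e. Each round uses only the strings known at the end of the previous round and reports the strings that are new.

```python
from itertools import product

PRODUCTIONS = {
    "S": [("w", "c", "S"), ("{", "L", "}"), ("s", ";")],
    "L": [("L", "S"), ()],
}

def rounds(productions, count):
    """Yield, for each round, the newly generated strings per nonterminal."""
    known = {nt: set() for nt in productions}
    for _ in range(count):
        new = {nt: set() for nt in productions}
        for nt, alternatives in productions.items():
            for rhs in alternatives:
                # Nonterminals contribute strings from earlier rounds only.
                choices = [
                    tuple(known[sym]) if sym in productions else (sym,)
                    for sym in rhs
                ]
                for parts in product(*choices):
                    string = "".join(parts)
                    if string not in known[nt]:
                        new[nt].add(string)
        for nt in productions:
            known[nt] |= new[nt]
        yield new

results = list(rounds(PRODUCTIONS, 3))
for number, new in enumerate(results, start=1):
    print("Round", number, sorted(new["S"]), sorted(new["L"]))
```

In round 1 only productions without nonterminals on the right-hand side can fire, which is why S starts with "s;" and L with the empty string.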
4. Parse Trees • A parse tree is a graphical way of showing how productions are used to generate a string. • Data structures representing parse trees are used inside compilers to store information about the program being compiled.
Example 1
• Consider the grammar:
  T = { a, b }
  N = { S }
  P = { S => S S | a S b | a b | b a }
  S is the starting nonterminal
Parse Tree for “aabbba”
• The root of the tree is the start symbol S. At each step, one nonterminal leaf is expanded:
  1. Expand the root using S => S S
  2. Expand the left S using S => a S b
  3. Expand the inner S using S => a b
  4. Expand the right S using S => b a
• Stop when there are no more nonterminals in leaf positions.
• Read off the string by reading the leaves left to right.
• The finished tree, with the children of each S listed in square brackets:
  S[ S[ a S[ a b ] b ] S[ b a ] ]
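A parse tree for “aabbba” can be stored as nested nodes and the string read back from its leaves. This is a minimal sketch; the (symbol, children) tuple layout and the helpers leaf, leaves and uses_only are assumptions made for the example, not the representation a particular compiler uses.

```python
PRODUCTIONS = {"S": {("S", "S"), ("a", "S", "b"), ("a", "b"), ("b", "a")}}

def leaf(terminal):
    """A terminal node has no children."""
    return (terminal, [])

tree = ("S", [
    ("S", [leaf("a"), ("S", [leaf("a"), leaf("b")]), leaf("b")]),  # a S b
    ("S", [leaf("b"), leaf("a")]),                                 # b a
])

def leaves(node):
    """Collect the leaf symbols from left to right."""
    symbol, children = node
    if not children:
        return [symbol]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result

def uses_only(node, productions):
    """Check that every internal node matches one of the productions."""
    symbol, children = node
    if not children:
        return True   # leaves are not checked (this grammar has no empty rules)
    rhs = tuple(child[0] for child in children)
    return rhs in productions.get(symbol, set()) and all(
        uses_only(child, productions) for child in children
    )

print("".join(leaves(tree)))
print(uses_only(tree, PRODUCTIONS))
```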
Example 2
• Consider the grammar:
  T = { a, +, *, (, ) }
  N = { E, T, F }
  P = { E => T | T + E
        T => F | F * T
        F => a | ( E ) }
  E is the starting nonterminal
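Is “a+a*a” in the Language?
• Start with the root E and expand using E => T + E
• Expand the left T using T => F
• Continue expansion until there are no nonterminals in leaf positions. The finished tree, with the children of each nonterminal listed in square brackets:
  E[ T[ F[ a ] ] + E[ T[ F[ a ] * T[ F[ a ] ] ] ] ]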
5. Ambiguous Grammars • A grammar is ambiguous when a string can be represented by more than one parse tree • it means that the string has more than one “meaning” in the language • e.g. a variant of the last grammar example: P = { E => E + E | E * E | ( E ) | a }
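Parse Trees for “a+a*a”
• Two different trees generate the same string (children of each E listed in square brackets):
  Left hand tree:   E[ E[ a ] + E[ E[ a ] * E[ a ] ] ]
  Right hand tree:  E[ E[ E[ a ] + E[ a ] ] * E[ a ] ]
continued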
The two parse trees allow a string like “5+5*5” to be read in two different ways:
• 5 + 25 (the left hand tree)
• 10 * 5 (the right hand tree)
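The two readings of “5+5*5” can be made concrete by evaluating each tree. In this sketch a tree is simplified to either a number or a (left, operator, right) triple, and the names evaluate, left_tree and right_tree are made up for the illustration.

```python
import operator

OPERATORS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a tree that is a number or a (left, op, right) triple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPERATORS[op](evaluate(left), evaluate(right))

left_tree = (5, "+", (5, "*", 5))     # root is E => E + E
right_tree = ((5, "+", 5), "*", 5)    # root is E => E * E

print(evaluate(left_tree))    # 5 + 25
print(evaluate(right_tree))   # 10 * 5
```

The same string yields 30 under one tree and 50 under the other, which is exactly the problem an ambiguous grammar creates.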
Why is Ambiguity Bad?
• In a programming language, a string with more than one meaning leaves the compiler and run-time system unable to decide how to process it.
• e.g. in C:
  x = 5 + 5 * 5;  // what is the value in x?
6. Top-down and Bottom-up Parsing • Top-down parsing creates a parse tree starting from the start symbol and moves down towards the leaves. • used in most compilers • usually implemented as recursive-descent parsing continued
Bottom-up parsing creates a parse tree starting from the leaves, and moves up towards the start symbol. • productions are used in ‘reverse’ • Both kinds of parsing often require “guessing” to decide which productions to use to parse a string.
Example
• Consider the grammar:
  T = { a, +, *, (, ) }
  N = { E, T, F }
  P = { E => T | T + E
        T => F | F * T
        F => a | ( E ) }
  E is the starting nonterminal
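Top-down Parse of “a+a*a”
• The tree is built from the root E downwards to the leaves (children in square brackets):
  E[ T[ F[ a ] ] + E[ T[ F[ a ] * T[ F[ a ] ] ] ] ]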
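Bottom-up Parse of “a+a*a”
• The same tree is built from the leaves upwards to the root E (children in square brackets):
  E[ T[ F[ a ] ] + E[ T[ F[ a ] * T[ F[ a ] ] ] ] ]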
Guessing when Building
• Guessing occurs when there are several rules which can apply to the current nonterminal.
• Compilers are very bad at guessing, and so programming language designers try to make grammars as simple as possible.
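For the example grammar, a top-down parser can avoid guessing by looking at the next input symbol before choosing a rule. The sketch below is a recognizer only (it answers yes or no and builds no tree), written in the recursive-descent style with one function per nonterminal. The class name Recognizer and the helper accepts are hypothetical, and the one-symbol lookahead is an assumption that happens to be enough for this grammar.

```python
class Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        """Return the next input symbol, or None at the end of the input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def expect(self, symbol):
        if self.peek() != symbol:
            raise SyntaxError(f"expected {symbol!r} at position {self.pos}")
        self.pos += 1

    def parse_E(self):            # E => T | T + E
        self.parse_T()
        if self.peek() == "+":    # a '+' means the rule must be T + E
            self.expect("+")
            self.parse_E()

    def parse_T(self):            # T => F | F * T
        self.parse_F()
        if self.peek() == "*":    # a '*' means the rule must be F * T
            self.expect("*")
            self.parse_T()

    def parse_F(self):            # F => a | ( E )
        if self.peek() == "(":
            self.expect("(")
            self.parse_E()
            self.expect(")")
        else:
            self.expect("a")

def accepts(text):
    """True if the whole of text can be derived from E."""
    recognizer = Recognizer(text)
    try:
        recognizer.parse_E()
    except SyntaxError:
        return False
    return recognizer.pos == len(text)

print(accepts("a+a*a"))
print(accepts("a+"))
```

Both alternatives of E begin with T, and both alternatives of T begin with F, so the choice can be postponed until the common part has been read. That is what removes the guess.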