Discrete Maths 242-213, Semester 2, 2013-2014 • Objectives • to introduce grammars and show their importance for defining programming languages; • to show the connection between REs and grammars 14. Grammars
Overview • Why Grammars? • Languages • Using a Grammar • Parse Trees • Ambiguous Grammars • Kinds of Grammars • More Information
1. Why Grammars? • Grammars are the standard way of defining programming languages. • Tools exist for semi-automatically translating grammars into compilers (e.g. JavaCC, lex, yacc, ANTLR) • this saves weeks of work
2. Languages • We use a natural language to communicate • its grammar rules are very complex • the rules don’t cover important things • We use a formal language to define a programming language • its grammar rules are fairly simple • the rules cover almost everything
A formal language is a set of legal strings. • The strings are legal if they correctly use the language’s alphabet and grammar rules. • The alphabet is often called the language’s terminal symbols (or terminals).
Example 1 • Alphabet (terminals) = {1, 2, 3} • Using the grammar rules (not shown here; see later), the language is: L1 = { 11, 12, 13, 21, 22, 23, 31, 32, 33} • L1 is the set of all strings of length 2 over this alphabet.
Example 2 • Terminals = {1, 2, 3} • Using different grammar rules, the language is: L2 = { 111, 222, 333} • L2 is the set of strings of length 3, where all the terminals are the same.
Example 3 • Terminals = {1, 2, 3} • Using different grammar rules, the language is: L3 = {2, 12, 22, 32, 112, 122, 132, ...} • L3 is the set of strings whose numerical value is divisible by 2.
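The three example languages can be written down directly as sets or membership tests. The sketch below is only an illustration; the names `ALPHABET`, `L1`, `L2` and `in_L3` are made up for this example and do not come from the slides.

```python
from itertools import product

ALPHABET = "123"  # the terminals shared by all three examples

# L1: every string of length 2 over the alphabet.
L1 = {"".join(p) for p in product(ALPHABET, repeat=2)}

# L2: strings of length 3 in which all terminals are the same.
L2 = {t * 3 for t in ALPHABET}

def in_L3(s: str) -> bool:
    """L3 is infinite, so test membership instead of listing it:
    a non-empty string over the alphabet whose numeric value is even."""
    return len(s) > 0 and all(c in ALPHABET for c in s) and int(s) % 2 == 0

print(sorted(L1))
print(sorted(L2))
print([s for s in ["2", "12", "13", "132", "321"] if in_L3(s)])
```

L1 and L2 are finite, so they can be listed in full; L3 has to be described by a rule, which is the job a grammar does.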
3. Using a Grammar • A grammar is a notation for defining a language, and is made from 4 parts: • the terminal symbols • the syntactic categories (nonterminal symbols) • e.g. statement, expression, noun, verb • the grammar rules (productions) • e.g. A => B1 B2 ... Bn • the starting nonterminal • the top-most syntactic category for this grammar
We define a grammar G as a 4-tuple: G = (T, N, P, S) • T = terminal symbols • N = nonterminal symbols • P = productions • S = starting nonterminal
3.1. Example 1 • Consider the grammar:
  T = {0, 1}
  N = {S, R}
  P = { S => 0
        S => 0 R
        R => 1 S }
• S is the starting nonterminal • The right hand sides of productions usually use a mix of terminals and nonterminals.
Is “01010” in the language? • Start with an S rule:
  Rule          String Generated
  --            S
  S => 0 R      0 R
  R => 1 S      0 1 S
  S => 0 R      0 1 0 R
  R => 1 S      0 1 0 1 S
  S => 0        0 1 0 1 0
• No more rules can be applied since there are no more nonterminals left in the string. Yes, it is in the language.
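The derivation of 01010 can be replayed mechanically. The sketch below assumes that every symbol is a single character and that uppercase letters are the nonterminals; `apply_leftmost` and `derive` are hypothetical helper names chosen for this illustration.

```python
# Productions of Example 1, keyed by nonterminal (uppercase letters).
PRODUCTIONS = {"S": ["0", "0R"], "R": ["1S"]}

def apply_leftmost(form: str, rule: tuple[str, str]) -> str:
    """Rewrite the leftmost nonterminal of `form` using rule (lhs, rhs)."""
    lhs, rhs = rule
    if rhs not in PRODUCTIONS.get(lhs, []):
        raise ValueError(f"{lhs} => {rhs} is not a production")
    for i, symbol in enumerate(form):
        if symbol.isupper():                 # first nonterminal found
            if symbol != lhs:
                raise ValueError(f"leftmost nonterminal is {symbol}, not {lhs}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to rewrite")

def derive(start: str, rules: list[tuple[str, str]]) -> list[str]:
    """Return every sentential form produced by applying `rules` in order."""
    forms = [start]
    for rule in rules:
        forms.append(apply_leftmost(forms[-1], rule))
    return forms

steps = [("S", "0R"), ("R", "1S"), ("S", "0R"), ("R", "1S"), ("S", "0")]
forms = derive("S", steps)
for form in forms:
    print(form)
```

Each printed line corresponds to one entry of the "String Generated" column; the last form contains no uppercase letter, so no further rule applies.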
Example 2 • Consider the grammar:
  T = {a, b, c, d, z}
  N = {S, R, U, V}
  P = { S => R U z | z
        R => a | b R
        U => d V U | c
        V => b | c }
• S is the starting nonterminal
The notation: X => Y | Z is shorthand for the two rules: X => Y and X => Z • Read ‘|’ as ‘or’.
Is “adbdbcz” in the language?
  Rule          String Generated
  --            S
  S => R U z    R U z
  R => a        a U z
  U => d V U    a d V U z
  V => b        a d b U z
  U => d V U    a d b d V U z
  V => b        a d b d b U z
  U => c        a d b d b c z
• Yes! This grammar has choices about how to rewrite the string.
Is “abdbcz” in the language? No
  Rule          String Generated
  --            S
  S => R U z    R U z
  R => a        a U z
  which U rule?
• U must be replaced by something beginning with a ‘b’, but the only U rules are: U => d V U | c, which begin with ‘d’ or ‘c’.
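Both membership questions for Example 2 can be checked with a small recursive-descent recognizer, one function per nonterminal. This is a sketch, not a general method: it relies on the observation that in this particular grammar every alternative of a nonterminal begins with a different terminal, so looking at one character is enough to pick the rule. The function names are invented for the illustration.

```python
def in_language(text: str) -> bool:
    """Recognise Example 2's language with one function per nonterminal.

    Each function takes a position in `text` and returns the position
    just after the part it matched, or None if no rule applies.
    """
    def peek(i):
        return text[i] if i < len(text) else ""

    def parse_R(i):                      # R => a | b R
        if peek(i) == "a":
            return i + 1
        if peek(i) == "b":
            return parse_R(i + 1)
        return None

    def parse_V(i):                      # V => b | c
        return i + 1 if peek(i) in ("b", "c") else None

    def parse_U(i):                      # U => d V U | c
        if peek(i) == "c":
            return i + 1
        if peek(i) == "d":
            j = parse_V(i + 1)
            return parse_U(j) if j is not None else None
        return None

    def parse_S(i):                      # S => R U z | z
        if peek(i) == "z":
            return i + 1
        j = parse_R(i)
        if j is None:
            return None
        k = parse_U(j)
        if k is None or peek(k) != "z":
            return None
        return k + 1

    return parse_S(0) == len(text)

print(in_language("adbdbcz"))
print(in_language("abdbcz"))
```

For "abdbcz" the recognizer fails at the same point as the hand derivation: after R has consumed the "a", no U rule can start with "b".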
3.2. BNF • BNF is a shorthand notation for productions • Backus Normal Form, or • Backus-Naur Form • We have already used ‘|’: X => Y1 | Y2 | ... | Yn • Named after John Backus (1924 – 2007) and Peter Naur (1928 – 2016)
X => Y [Z] is shorthand for two rules: X => Y and X => Y Z • [Z] means 0 or 1 occurrences of Z.
X => Y { Z } is shorthand for an infinite number of rules: X => Y, X => Y Z, X => Y Z Z, X => Y Z Z Z, ... • { Z } means 0 or more occurrences of Z.
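The two bracket notations can be expanded into plain rules by a few lines of code. The helpers `expand_optional` and `expand_repetition` below are hypothetical names for this sketch; since { Z } stands for infinitely many rules, the second helper only lists the first few, up to an assumed cut-off `max_repeats`.

```python
def expand_optional(x: str, y: str, z: str) -> list[str]:
    """X => Y [Z] stands for exactly two plain rules."""
    return [f"{x} => {y}", f"{x} => {y} {z}"]

def expand_repetition(x: str, y: str, z: str, max_repeats: int) -> list[str]:
    """X => Y { Z } stands for infinitely many rules; list the first few."""
    return [" ".join([f"{x} =>", y] + [z] * n) for n in range(max_repeats + 1)]

for rule in expand_optional("X", "Y", "Z"):
    print(rule)
for rule in expand_repetition("X", "Y", "Z", 3):
    print(rule)
```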
3.3. A Grammar for Expressions • Consider the grammar:
  T = { 0, 1, 2, ..., 9, +, -, *, /, (, ) }
  N = { Expr, Number }
  P = { Expr => Number
        Expr => ( Expr )
        Expr => Expr + Expr | Expr - Expr | Expr * Expr | Expr / Expr }
• Expr is the starting nonterminal
Defining Number • The RE definition for a number is:
  number = digit digit*
  digit = [0-9]
• The productions for Number are:
  Number => Digit { Digit }
  Digit => 0 | 1 | 2 | 3 | ... | 9
or
  Number => Number Digit | Digit
  Digit => 0 | 1 | 2 | 3 | ... | 9
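The RE and the production Number => Digit { Digit } describe the same strings, which can be spot-checked on a few samples. The sketch below uses Python's `re` module for the RE and a hand-written function, `is_number_grammar` (a name made up here), that follows the production step by step.

```python
import re

NUMBER_RE = re.compile(r"[0-9][0-9]*")   # number = digit digit*

def is_number_re(s: str) -> bool:
    return NUMBER_RE.fullmatch(s) is not None

def is_number_grammar(s: str) -> bool:
    """Follow Number => Digit { Digit }: one digit, then zero or more."""
    digits = "0123456789"
    if not s or s[0] not in digits:          # the mandatory first Digit
        return False
    return all(c in digits for c in s[1:])   # the { Digit } repetition

for sample in ["125", "0", "", "12a", "007"]:
    assert is_number_re(sample) == is_number_grammar(sample)
    print(repr(sample), is_number_re(sample))
```

Agreement on a handful of samples is evidence rather than proof that the two definitions are equivalent.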
Using Productions • Expand Expr into (125-2)*3
  Expr => Expr * Expr
       => ( Expr ) * Expr
       => ( Expr - Expr ) * Expr
       => ( Number - Number ) * Number
        :
       => ( 125 - 2 ) * 3
Expand Number into 125
  Number => Number Digit
         => Number Digit Digit
         => Digit Digit Digit
         => 1 2 5
3.4. Grammars are not Unique • Two grammars that do the same thing (e is the empty string):
  Balanced => e
  Balanced => ( Balanced ) Balanced
and:
  Balanced => e
  Balanced => ( Balanced )
  Balanced => Balanced Balanced
• Both generate the same strings, e.g.: (()(()))   ()   e   (()())
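One way to gain confidence that the two Balanced grammars agree is to generate every string each one produces up to some length and compare the results. The sketch below does this by repeatedly applying the productions to the set of strings found so far until nothing new appears; `language_up_to`, `grammar_one` and `grammar_two` are names invented for the illustration, and the length limit of 8 is an arbitrary choice.

```python
def language_up_to(step, max_len: int) -> set[str]:
    """Apply `step` until no new strings of length <= max_len appear."""
    strings: set[str] = set()
    while True:
        new = {s for s in step(strings) if len(s) <= max_len}
        if new == strings:
            return strings
        strings = new

def grammar_one(known):
    # Balanced => e | ( Balanced ) Balanced
    return {""} | {"(" + x + ")" + y for x in known for y in known}

def grammar_two(known):
    # Balanced => e | ( Balanced ) | Balanced Balanced
    return ({""}
            | {"(" + x + ")" for x in known}
            | {x + y for x in known for y in known})

one = language_up_to(grammar_one, 8)
two = language_up_to(grammar_two, 8)
print(one == two, len(one))
```

Agreement up to a length bound does not prove the grammars are equivalent, but a disagreement would show immediately that they are not.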
4. Parse Trees • A parse tree is a graphical way of showing how productions are used to generate a string. • Data structures representing parse trees are used inside compilers to store information about the program being compiled.
Example 1 • Consider the grammar: T = { a, b } N = { S } P = { S => S S | a S b | a b | b a } S is the starting nonterminal
Parse Tree for “aabbba” • The root of the tree is the start symbol S. Expand it using S => S S:
        S
       / \
      S   S
• Expand the left S using S => a S b:
          S
        /   \
       S     S
     / | \
    a  S  b
• Expand the inner S using S => a b:
          S
        /   \
       S     S
     / | \
    a  S  b
      / \
     a   b
• Expand the right S using S => b a:
          S
        /    \
       S      S
     / | \   / \
    a  S  b b   a
      / \
     a   b
• Stop when there are no more nonterminals in leaf positions. Read off the string by reading the leaves left to right: a a b b b a.
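A parse tree is easy to hold in a data structure. In the sketch below a node is a `(symbol, children)` pair and a leaf has no children; this representation and the helper names `leaf`, `leaves` and `is_valid` are assumptions made for the example, not something the slides prescribe.

```python
# Productions of the grammar, as tuples of right-hand-side symbols.
PRODUCTIONS = {"S": {("S", "S"), ("a", "S", "b"), ("a", "b"), ("b", "a")}}

def leaf(terminal):
    return (terminal, [])

# The finished parse tree for "aabbba".
tree = ("S", [
    ("S", [leaf("a"), ("S", [leaf("a"), leaf("b")]), leaf("b")]),  # S => a S b, inner S => a b
    ("S", [leaf("b"), leaf("a")]),                                  # S => b a
])

def leaves(node) -> str:
    """Read off the string by visiting the leaves left to right."""
    symbol, children = node
    if not children:
        return symbol
    return "".join(leaves(child) for child in children)

def is_valid(node) -> bool:
    """Check that every internal node matches one of the productions."""
    symbol, children = node
    if not children:
        return symbol not in PRODUCTIONS      # leaves must be terminals
    rhs = tuple(child[0] for child in children)
    return rhs in PRODUCTIONS.get(symbol, set()) and all(is_valid(c) for c in children)

print(leaves(tree), is_valid(tree))
```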
Example 2 • Consider the grammar:
  T = { a, +, *, (, ) }
  N = { E, T, F }
  P = { E => T | T + E
        T => F | F * T
        F => a | ( E ) }
• E is the starting nonterminal
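Is “a+a*a” in the Language? • Start at the root E and expand using E => T + E:
      E
    / | \
   T  +  E
• Expand using T => F:
      E
    / | \
   T  +  E
   |
   F
• Continue expansion until:
      E
    / | \
   T  +  E
   |     |
   F     T
   |   / | \
   a  F  *  T
      |     |
      a     F
            |
            a
• The leaves read a + a * a, so the string is in the language.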
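The tree for "a+a*a" can also be built automatically. The following is a minimal recursive-descent sketch for this one grammar, assuming one function per nonterminal and single-character symbols; the tuple layout `(nonterminal, child, ...)` is a choice made for the example.

```python
def parse(text: str):
    """Return the parse tree of `text` as nested tuples.

    Raises SyntaxError if `text` is not in the language.
    """
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def expect(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1
        return ch

    def parse_E():                       # E => T | T + E
        t = parse_T()
        if peek() == "+":
            return ("E", t, expect("+"), parse_E())
        return ("E", t)

    def parse_T():                       # T => F | F * T
        f = parse_F()
        if peek() == "*":
            return ("T", f, expect("*"), parse_T())
        return ("T", f)

    def parse_F():                       # F => a | ( E )
        if peek() == "(":
            return ("F", expect("("), parse_E(), expect(")"))
        return ("F", expect("a"))

    tree = parse_E()
    if pos != len(text):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return tree

print(parse("a+a*a"))
```

The nesting of the printed tuples mirrors the finished tree: the root E has the children T, + and E, and the `*` sits lower down, inside the right-hand E.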
5. Ambiguous Grammars • A grammar is ambiguous when a string can be represented by more than one parse tree • it means that the string has more than one “meaning” in the language • e.g. a variant of the last grammar example: P = { E => E + E | E * E | ( E ) | a }
Parse Trees for “a+a*a”
  Left hand tree:            Right hand tree:
        E                          E
      / | \                      / | \
     E  +  E                    E  *  E
     |   / | \                / | \   |
     a  E  *  E              E  +  E  a
        |     |              |     |
        a     a              a     a
The two parse trees allow a string like “5+5*5” to be read in two different ways: • 5 + 25 = 30 (the left hand tree) • 10 * 5 = 50 (the right hand tree)
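The two readings can be reproduced by evaluating each tree. In this sketch the trees are simplified to nested `(left, operator, right)` tuples, with the E nodes left implicit, and every `a` is given the value 5; both simplifications are assumptions made for the illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

left_tree = ("a", "+", ("a", "*", "a"))    # root uses E => E + E
right_tree = (("a", "+", "a"), "*", "a")   # root uses E => E * E

def leaves(tree) -> str:
    """Both trees have the same leaves, read left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(leaves(part) for part in tree)

def evaluate(tree, a_value: int) -> int:
    """Evaluate a tree bottom-up, substituting a_value for each 'a'."""
    if tree == "a":
        return a_value
    left, op, right = tree
    return OPS[op](evaluate(left, a_value), evaluate(right, a_value))

print(leaves(left_tree), evaluate(left_tree, 5))
print(leaves(right_tree), evaluate(right_tree, 5))
```

The same string of leaves gives two different values, which is exactly what makes the grammar ambiguous.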
Why is Ambiguity Bad? • In a programming language, a string with more than one meaning means that the compiler and run-time system will not know how to process it. • e.g. in C: x = 5 + 5 * 5; // what is the value in x? • C avoids the problem: its grammar makes * bind more tightly than +, so x is 30.
6. Kinds of Grammars • There are 4 main kinds of grammar, of increasing expressive power: • regular (type 3) grammars • context-free (type 2) grammars • context-sensitive (type 1) grammars • unrestricted (type 0) grammars • They vary in the kinds of productions they allow. Avram Noam Chomsky (1928 – )
6.1. Regular Grammars • Example productions: S => wT, T => xT, T => a • Every production is of the form: A => a | a B | e • A, B are nonterminals, a is a terminal, e is the empty string • These are sometimes called right linear rules because if a nonterminal appears in the rule body, then it must appear last. • Regular grammars are equivalent to REs (and also to automata).
An Equivalence Diagram • Regular grammars, automata and REs all have the same expressive power.
Example •
  Integer => + UInt | - UInt | 0 Digits | 1 Digits | ... | 9 Digits
  UInt => 0 Digits | 1 Digits | ... | 9 Digits
  Digits => 0 Digits | 1 Digits | ... | 9 Digits | e
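The link between regular grammars, automata and REs can be seen on the Integer grammar. In the sketch below each nonterminal becomes a state, each production A => a B becomes a transition, and a nonterminal with an e production becomes an accepting state. The RE `[+-]?[0-9][0-9]*` is my own translation of the grammar, not taken from the slides, and the table happens to be deterministic only because no nonterminal here has two productions starting with the same terminal.

```python
import re

DIGITS = "0123456789"

# Each production A => a B becomes a transition (A, a) -> B.
TRANSITIONS = {}
for d in DIGITS:
    TRANSITIONS[("Integer", d)] = "Digits"
    TRANSITIONS[("UInt", d)] = "Digits"
    TRANSITIONS[("Digits", d)] = "Digits"
TRANSITIONS[("Integer", "+")] = "UInt"
TRANSITIONS[("Integer", "-")] = "UInt"

ACCEPTING = {"Digits"}    # nonterminals that have an `e` production

def accepts(text: str) -> bool:
    """Run the automaton from the starting nonterminal."""
    state = "Integer"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

INTEGER_RE = re.compile(r"[+-]?[0-9][0-9]*")

for sample in ["-42", "7", "+", "", "4-2", "+007"]:
    assert accepts(sample) == (INTEGER_RE.fullmatch(sample) is not None)
    print(repr(sample), accepts(sample))
```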
6.2. Context-Free Grammars • Example productions: A => a, A => aBcd, B => ae • Every production is of the form: A => d • A is a nonterminal, d can be any sequence of nonterminals and terminals • Most of our examples have been context-free grammars • used widely to define programming languages • they subsume regular grammars
6.3. Context-Sensitive Grammars • Example productions: A => a, 11A => aB2d, B2 => ae • Every production is of the form: a => d • here a and d stand for strings that can contain any number of terminals and nonterminals • a must contain at least 1 nonterminal • size(d) >= size(a) • d cannot be e (the empty string)
Context-sensitive rules allow the grammar to specify a context for a rewrite • e.g. A1a0 => 1b00 • the string 2A1a01 becomes 21b001 • Context-sensitive grammars are more powerful than context-free grammars because a rewrite can depend on the symbols around a nonterminal.
Example • The language: E = {012, 001122, 000111222, ... } or, in brief, E = {0^n 1^n 2^n | n >= 1} cannot be expressed by a context-free grammar; it needs a context-sensitive grammar:
  S => 0 A 1 2 | 0 1 2
  A => 0 A 1 C | 0 1 C
  C 1 => 1 C
  C 2 => 2 2
Rewrite S to 001122
  S             => 0 A 1 2        (using S => 0 A 1 2)
  0 A 1 2       => 0 0 1 C 1 2    (using A => 0 1 C)
  0 0 1 C 1 2   => 0 0 1 1 C 2    (using C 1 => 1 C)
  0 0 1 1 C 2   => 0 0 1 1 2 2    (using C 2 => 2 2)
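Because no production of this grammar makes the string shorter, membership can be decided by a brute-force search over sentential forms. The sketch below is one way to do that, assuming single-character symbols; `RULES` and `derivable` are names chosen for the example, and the search is meant for short strings only.

```python
from collections import deque

# Productions of the context-sensitive grammar, as (left side, right side).
RULES = [
    ("S", "0A12"), ("S", "012"),
    ("A", "0A1C"), ("A", "01C"),
    ("C1", "1C"), ("C2", "22"),
]

def derivable(target: str, start: str = "S") -> bool:
    """Breadth-first search over sentential forms.

    No rule shortens the string, so any form longer than `target`
    can be discarded, which keeps the search finite.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in RULES:
            i = form.find(lhs)
            while i != -1:               # try every place the left side occurs
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False

print(derivable("001122"))
print(derivable("0122"))
```

Unlike the earlier examples, the left side of a rule such as C 1 => 1 C is matched as a substring of two symbols, which is where the context comes in.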
6.4. Unrestricted Grammars • Example productions: A => e, 11A => a, B2 => aeA • Every production is of the form: a => d • a, d can contain any number of terminals and nonterminals; a must contain at least 1 nonterminal • no restrictions on size(d) • it may be smaller than size(a) • d can be e (the empty string) • Also called phrase-structure grammars • more general than context-sensitive grammars
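Example • The language: E = {e, 012, 001122, 000111222, ... } or, in brief, E = {0^n 1^n 2^n | n >= 0} contains the empty string, so under the definitions used here (a context-sensitive production cannot produce e) it needs an unrestricted grammar:
  S => 0 A 1 2 | e
  A => 0 A 1 C | e
  C 1 => 1 C
  C 2 => 2 2
• The new features are the e alternatives for S and A.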
Rewrite S to 012 • S => 0 A 1 2 • 0 A 1 2 => 0 1 2 (using A => e)
6.5. Why so many Grammar Kinds? • More powerful grammars are more expressive, but also harder to implement efficiently • a trade-off between power and implementation
For example, most compilers have two grammar-based components: • the lexical analyzer • uses REs (regular grammars) to parse basic nonterminals such as identifier and number • the syntax analyzer • uses (context-free) grammars to deal with complex syntactic categories such as loops and expressions
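The lexical-analysis half of this split can be sketched with a single RE. The tokenizer below is a hypothetical illustration for the expression language of section 3.3: it uses Python's `re` module to cut the input into `number` and `symbol` tokens, which a syntax analyzer based on a context-free grammar would then consume. The token names and the decision to skip whitespace are assumptions of the example.

```python
import re

# One RE per token kind, combined with named groups; leading spaces are skipped.
TOKEN_RE = re.compile(r"\s*(?:(?P<number>[0-9]+)|(?P<symbol>[-+*/()]))")

def tokenize(text: str) -> list[tuple[str, str]]:
    """Split `text` into (kind, value) tokens using the RE above."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize input at position {pos}")
        kind = match.lastgroup           # "number" or "symbol"
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens

print(tokenize("(125-2)*3"))
```

After this step the syntax analyzer never sees individual digits: 125 arrives as one `number` token, so the context-free grammar only has to deal with the structure of the expression.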