Compilation: Backus-Naur Form (BNF) and Context Free Grammars (CFGs)
We care about • Completeness of specification • Determination of legal expressions • Resolution of ambiguities • Avoiding small mistakes that cause major errors
BNF/CFG • Backus-Naur Form • Context Free Grammars • Similar mechanisms for specifying legal syntax • Both focused on production rules
A language is a set S of sentences composed of words from a vocabulary Σ formed according to rules of grammar G. (informally) • e.g.: Let Σ = {Bill, Tom, likes, eats, cheese, rice}
<sentence> is <subject><verb><object> <subject> is Bill or Tom <verb> is likes or eats <object> is cheese or rice • To see if a sentence is legal, we parse it (possibly using a parse tree) • <> around non-terminal symbols – things that are not in the vocabulary • Other words are terminal symbols – things in the vocabulary
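To make the parsing idea concrete, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides) that checks whether a three-word sentence is legal under this toy grammar:

# Toy grammar: <sentence> ::= <subject> <verb> <object>
SUBJECTS = {"Bill", "Tom"}
VERBS = {"likes", "eats"}
OBJECTS = {"cheese", "rice"}

def is_legal_sentence(words):
    # Legal iff the sentence is exactly subject, verb, object in that order
    return (len(words) == 3
            and words[0] in SUBJECTS
            and words[1] in VERBS
            and words[2] in OBJECTS)

print(is_legal_sentence("Bill likes cheese".split()))  # True
print(is_legal_sentence("cheese eats Tom".split()))    # False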
Definitions • An alphabet or vocabulary Σ is a (usually finite) set of symbols (words, characters) • A string or sentence is a sequence of (0 or more) symbols from an alphabet • The empty string λ has no characters • A dictionary over an alphabet Σ, denoted Σ*, is the set of all strings formed from the symbols of Σ • A language L (over an alphabet Σ) is a subset of Σ*, restricted by certain syntactical rules (of grammar)
Strings • The length of a string x, |x|, is the total number of symbols in x. • The concatenation of string x and string y, xy, is the string composed of symbols of x followed by symbols of y • Let x = 001 and y = 01001 • What is |x|? • What is |y|? • What is xy? • What is |xy|?
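For reference, the answers can be checked directly in Python (variable names taken from the exercise):

x = "001"
y = "01001"
print(len(x))      # |x|  = 3
print(len(y))      # |y|  = 5
print(x + y)       # xy   = 00101001
print(len(x + y))  # |xy| = 8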
Strings • The length of a string x, |x|, is the total number of symbols in x. • The concatenation of string x and string y, xy, is the string composed of symbols of x followed by symbols of y • x^i is the string of i instances of x concatenated • x* (Kleene closure) is the set of all concatenations of 0 or more instances of x • x+ (positive closure) is the set of all concatenations of 1 or more instances of x
Sets of strings • A = {0, 1}* • B = {01, 1000, 1} • C = {01, 1} • Describe the following sets of strings in words • A • AB • C* • B^2 • C+
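One way to build intuition for these sets is to enumerate small pieces of them in Python; closures like A and C* are infinite, so the sketch below (the helper name is my own) only computes concatenations up to a fixed number of factors:

from itertools import product

B = {"01", "1000", "1"}
C = {"01", "1"}

def concat_sets(X, Y):
    # XY = every string of X followed by every string of Y
    return {x + y for x, y in product(X, Y)}

print(sorted(concat_sets(B, B)))   # B^2: all two-factor concatenations from B

# A finite slice of C+: concatenations of 1, 2, or 3 strings from C
layer, Cplus_slice = set(C), set(C)
for _ in range(2):
    layer = concat_sets(layer, C)
    Cplus_slice |= layer
print(sorted(Cplus_slice))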
Grammars • A (phrase-structure) grammar (PSG) is a quadruple (N, T, P, S) where • N is the set of non-terminal symbols • T (=Σ) is the set of terminal symbols • P is the set of productions (rewrite rules) • S (in N) is the start symbol • The language L(G) defined by the grammar G is the set of sentences that can be derived from the start symbol S by applying productions in P • A sentential form of G is S or any string that can be derived from S • A sentence of G is a sentential form that contains only terminals
N = {<sentence>, <subject>, <verb>, <object>} • T = {Bill, Tom, likes, eats, cheese, rice} • S = <sentence> • P = { <sentence> → <subject><verb><object>, <subject> → Bill, <subject> → Tom, <verb> → likes, <verb> → eats, <object> → cheese, <object> → rice }
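One way this quadruple might be written down in Python (the data-structure layout and the derive helper are illustrative choices, not a standard API):

import random

N = {"<sentence>", "<subject>", "<verb>", "<object>"}
T = {"Bill", "Tom", "likes", "eats", "cheese", "rice"}
S = "<sentence>"
P = {
    "<sentence>": [["<subject>", "<verb>", "<object>"]],
    "<subject>":  [["Bill"], ["Tom"]],
    "<verb>":     [["likes"], ["eats"]],
    "<object>":   [["cheese"], ["rice"]],
}

def derive(symbol):
    # Terminals are returned as-is; non-terminals are expanded by picking
    # one of their productions at random.
    if symbol in T:
        return [symbol]
    return [word for part in random.choice(P[symbol]) for word in derive(part)]

print(" ".join(derive(S)))   # e.g. "Tom eats rice" -- a sentence of L(G)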
BNF (Backus-Naur Form) • A type of grammar and a notation for expressing it. • 4 meta-symbols: ::=, <, >, | • ::= replaces the arrow (→) in production rules • <, > surround non-terminals • | indicates alternatives • Things not in <, > are terminals • Left-hand side MUST be a single non-terminal • This is a very important restriction! • This restriction on its own classifies a PSG as a CFG
The grammar using BNF N = {<sentence>, <subject>, <verb>, <object>} • T = {Bill, Tom, likes, eats, cheese, rice} • S = <sentence> • P = { <sentence> ::= <subject><verb><object> <subject> ::= Bill | Tom <verb> ::= likes | eats <object> ::= cheese | rice }
How does this apply to programming languages? • https://docs.python.org/3/reference/grammar.html • http://inst.eecs.berkeley.edu/~cs164/fa11/python-grammar.html
Extended BNF • [ ] enclose optional items • { } enclose a string which can be repeated 0 or more times • { }+ enclose a string which can be repeated 1 or more times • { } with bounds i and j (i as subscript, j as superscript) encloses a string which can be repeated between i and j times (inclusive) • You may sometimes see slightly different notation; non-terminals denoted by uppercase letters, → instead of ::=, quotation marks around terminals, etc. Use your noggin :) • When would you want one or the other?
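As a small made-up illustration of the extra EBNF brackets (this fragment is not from the slides), a signed integer could be written:

<integer> ::= [ - ] <digit> { <digit> }
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The [ - ] makes the sign optional, and { <digit> } allows any number of further digits; with { <digit> }+ you could drop the separate leading <digit>.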
<sentence> ::= <subject><verb><object> <subject> ::= Bill | Tom <verb> ::= likes | eats <object> ::= cheese | rice • To see if a sentence is legal, we parse it (possibly using a parse tree)
Ambiguity • A grammar that produces more than one parse tree (equivalently, more than one leftmost derivation) for the same sentence is ambiguous • E.g.: E → E+E | E*E | (E) | -E | id • There are two derivations for id+id*id – find them both.
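For reference, the two leftmost derivations of id+id*id are:

E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

In the first, + sits at the root of the parse tree (so * binds tighter); in the second, * sits at the root.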
Ambiguity continued • There are some languages for which there is no unambiguous grammar • HOWEVER there are some rules of thumb we can use to help us deal with many ambiguous grammars • Ambiguity often arises if the right side of a production rule contains 2 or more occurrences of the same non-terminal • Ambiguity can cause problems if the grammar needs to observe rules of precedence or associativity
E → E+E | E*E | (E) | -E | id • Unambiguous rewrite, with one non-terminal per precedence level: E → E+T | T, T → T*F | F, F → (E) | -F | id
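With the rewritten grammar, id+id*id has exactly one leftmost derivation, and the * ends up deeper in the parse tree, so it binds tighter than +:

E ⇒ E+T ⇒ T+T ⇒ F+T ⇒ id+T ⇒ id+T*F ⇒ id+F*F ⇒ id+id*F ⇒ id+id*id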
Derivations • A leftmost derivation is one in which only the leftmost non-terminal in a sentential form is replaced at each step. • A rightmost derivation is one in which only the rightmost non-terminal in a sentential form is replaced at each step • Remember ambiguity? For an unambiguous grammar, each sentence has exactly one leftmost derivation and one rightmost derivation • Generally pick leftmost or rightmost and stick with it • If some sentence has more than one leftmost derivation (equivalently, more than one parse tree), the grammar is ambiguous
Parse trees are central to compiler theory • They allow us to identify the production rule corresponding to each chunk of code, and from there replace that chunk of code with a chunk of assembly.
Phases of compilation • Lexical analysis (tokenization) – grouping the characters into tokens (int, {, x, etc.) (linear scan) • Syntax analysis (parsing) – grouping tokens into expressions or statements (int x = 10;) (recursion; CFG) • Semantic analysis (syntax-directed translation) – type checking, implicit typecasting, checking indices to arrays, that variables are declared before use, etc. Necessary because most programming languages can't be completely captured by a CFG • Code generation (next) – actually generating assembler code • Code optimization – looking for ways to make the assembler faster
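A toy illustration of the first phase in Python; the token categories and regular expressions below are simplified inventions for the example, not any particular compiler's token set:

import re

# Very small tokenizer for statements like "int x = 10;"
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),   # keywords such as 'int' could be split out later
    ("ASSIGN", r"="),
    ("SEMI",   r";"),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    # Linear scan: classify each lexeme, dropping whitespace
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("int x = 10;")))
# [('ID', 'int'), ('ID', 'x'), ('ASSIGN', '='), ('NUMBER', '10'), ('SEMI', ';')]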
Code generation: Using the parse tree to generate assembler code To prune the leaves of a parse tree means to eliminate all the leaves of a node, replacing the leaves (and their parent) with the intended “meaning” (a value or a chunk of code)
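A sketch of this pruning idea in Python, where the "meaning" is just a numeric value and the parse tree for 2 + 3 * 4 is built by hand (all names here are invented for illustration):

# Internal nodes are operators, leaves are literal values
tree = ("+", 2, ("*", 3, 4))

def prune(node):
    # Replace a node whose children have been reduced to values with its own
    # value, working bottom-up until only one value remains
    if not isinstance(node, tuple):
        return node                # already a leaf
    op, left, right = node
    left, right = prune(left), prune(right)
    return left + right if op == "+" else left * right

print(prune(tree))   # 14 -- the "meaning" of the whole tree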
Compilation vs interpretation • An interpreted language is a programming language for which most implementations execute instructions directly, without first compiling the program into machine-language instructions. The interpreter executes the program directly, translating each statement into a sequence of one or more subroutines that have already been compiled into machine code.