Compilation: Backus-Naur Form (BNF) and Context Free Grammars (CFGs)
We care about • Completeness of specification • Determination of legal expressions • Resolution of ambiguities • Avoiding small mistakes that cause major errors
BNF/CFG • Backus-Naur Form • Context Free Grammars • Similar mechanisms for specifying legal syntax • Both focused on production rules
A language is a set S of sentences composed of words from a vocabulary Σ formed according to rules of grammar G. (informally) • e.g.: Let Σ = {Bill, Tom, likes, eats, cheese, rice}
<sentence> is <subject><verb><object> <subject> is Bill or Tom <verb> is likes or eats <object> is cheese or rice • To see if a sentence is legal, we parse it (possibly using a parse tree) • <> around non-terminal symbols – things that are not in the vocabulary • Other words are terminal symbols – things in the vocabulary
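To make the parsing idea concrete, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides) that checks whether a three-word sentence is legal under this toy grammar:

# Toy grammar: <sentence> ::= <subject> <verb> <object>
SUBJECTS = {"Bill", "Tom"}
VERBS = {"likes", "eats"}
OBJECTS = {"cheese", "rice"}

def is_legal_sentence(words):
    # Legal iff the sentence is exactly subject, verb, object in that order
    return (len(words) == 3
            and words[0] in SUBJECTS
            and words[1] in VERBS
            and words[2] in OBJECTS)

print(is_legal_sentence("Bill likes cheese".split()))  # True
print(is_legal_sentence("cheese eats Tom".split()))    # False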
Definitions • An alphabet or vocabulary Σ is a (usually finite) set of symbols (words, characters) • A string or sentence is a sequence of (0 or more) symbols from an alphabet • The empty string λ has no characters • A dictionary over an alphabet Σ, denoted Σ*, is the set of all strings formed from the symbols of Σ • A language L (over an alphabet Σ) is a subset of Σ*, restricted by certain syntactical rules (of grammar)
Strings • The length of a string x, |x|, is the total number of symbols in x. • The concatenation of string x and string y, xy, is the string composed of symbols of x followed by symbols of y • Let x = 001 and y = 01001 • What is |x|? • What is |y|? • What is xy? • What is |xy|?
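For reference, the answers can be checked directly in Python (variable names taken from the exercise):

x = "001"
y = "01001"
print(len(x))      # |x|  = 3
print(len(y))      # |y|  = 5
print(x + y)       # xy   = 00101001
print(len(x + y))  # |xy| = 8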
Strings • The length of a string x, |x|, is the total number of symbols in x. • The concatenation of string x and string y, xy, is the string composed of symbols of x followed by symbols of y • x^i is the string of i instances of x concatenated • x* (Kleene closure) is the set of all concatenations of 0 or more instances of x • x+ (positive closure) is the set of all concatenations of 1 or more instances of x
Sets of strings • A = {0, 1}* • B = {01, 1000, 1} • C = {01, 1} • Describe the following sets of strings in words • A • AB • C* • B^2 • C+
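One way to build intuition for these sets is to enumerate small pieces of them in Python; closures like A and C* are infinite, so the sketch below (the helper name is my own) only computes concatenations up to a fixed number of factors:

from itertools import product

B = {"01", "1000", "1"}
C = {"01", "1"}

def concat_sets(X, Y):
    # XY = every string of X followed by every string of Y
    return {x + y for x, y in product(X, Y)}

print(sorted(concat_sets(B, B)))   # B^2: all two-factor concatenations from B

# A finite slice of C+: concatenations of 1, 2, or 3 strings from C
layer, Cplus_slice = set(C), set(C)
for _ in range(2):
    layer = concat_sets(layer, C)
    Cplus_slice |= layer
print(sorted(Cplus_slice))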
Grammars • A (phrase-structure) grammar (PSG) is a quadruple (N, T, P, S) where • N is the set of non-terminal symbols • T (=Σ) is the set of terminal symbols • P is the set of productions (rewrite rules) • S (in N) is the start symbol • The language L(G) defined by the grammar G is the set of sentences that can be derived from the start symbol S by applying productions in P • A sentential form of G is S or any string that can be derived from S • A sentence of G is a sentential form that contains only terminals
N = {<sentence>, <subject>, <verb>, <object>} • T = {Bill, Tom, likes, eats, cheese, rice} • S = <sentence> • P = { <sentence> → <subject><verb><object>, <subject> → Bill, <subject> → Tom, <verb> → likes, <verb> → eats, <object> → cheese, <object> → rice }
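One way this quadruple might be written down in Python (the data-structure layout and the derive helper are illustrative choices, not a standard API):

import random

N = {"<sentence>", "<subject>", "<verb>", "<object>"}
T = {"Bill", "Tom", "likes", "eats", "cheese", "rice"}
S = "<sentence>"
P = {
    "<sentence>": [["<subject>", "<verb>", "<object>"]],
    "<subject>":  [["Bill"], ["Tom"]],
    "<verb>":     [["likes"], ["eats"]],
    "<object>":   [["cheese"], ["rice"]],
}

def derive(symbol):
    # Terminals are returned as-is; non-terminals are expanded by picking
    # one of their productions at random.
    if symbol in T:
        return [symbol]
    return [word for part in random.choice(P[symbol]) for word in derive(part)]

print(" ".join(derive(S)))   # e.g. "Tom eats rice" -- a sentence of L(G)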
BNF (Backus-Naur Form) • A type of grammar and a notation for expressing it. • 4 meta-symbols: ::=, <, >, | • ::= replaces the arrow (→) in production rules • <, > surround non-terminals • | indicates alternatives • Things not in <, > are terminals • Left-hand side MUST be a single non-terminal • This is a very important restriction! • This restriction on its own classifies a PSG as a CFG
The grammar using BNF N = {<sentence>, <subject>, <verb>, <object>} • T = {Bill, Tom, likes, eats, cheese, rice} • S = <sentence> • P = { <sentence> ::= <subject><verb><object> <subject> ::= Bill | Tom <verb> ::= likes | eats <object> ::= cheese | rice }
How does this apply to programming languages? • https://docs.python.org/3/reference/grammar.html • http://inst.eecs.berkeley.edu/~cs164/fa11/python-grammar.html
Extended BNF • [ ] enclose optional items • { } enclose a string which can be repeated 0 or more times • { }+ enclose a string which can be repeated 1 or more times • { } with bounds i and j (i as subscript, j as superscript) encloses a string which can be repeated between i and j times (inclusive) • You may sometimes see slightly different notation; non-terminals denoted by uppercase letters, → instead of ::=, quotation marks around terminals, etc. Use your noggin :) • When would you want one or the other?
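As a small made-up illustration of the extra EBNF brackets (this fragment is not from the slides), a signed integer could be written:

<integer> ::= [ - ] <digit> { <digit> }
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The [ - ] makes the sign optional, and { <digit> } allows any number of further digits; with { <digit> }+ you could drop the separate leading <digit>.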
<sentence> ::= <subject><verb><object> <subject> ::= Bill | Tom <verb> ::= likes | eats <object> ::= cheese | rice • To see if a sentence is legal, we parse it (possibly using a parse tree)
Ambiguity • A grammar that produces more than one parse tree (equivalently, more than one leftmost derivation) for the same sentence is ambiguous • E.g.: E → E+E | E*E | (E) | -E | id • There are two derivations for id+id*id – find them both.
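For reference, the two leftmost derivations of id+id*id are:

E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

In the first, + sits at the root of the parse tree (so * binds tighter); in the second, * sits at the root.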
Ambiguity continued • There are some languages for which there is no unambiguous grammar • HOWEVER there are some rules of thumb we can use to help us deal with many ambiguous grammars • Ambiguity often arises if the right side of a production rule contains 2 or more occurrences of the same non-terminal • Ambiguity can cause problems if the grammar needs to observe rules of precedence or associativity
E → E+E | E*E | (E) | -E | id • Unambiguous rewrite, with one non-terminal per precedence level: E → E+T | T, T → T*F | F, F → (E) | -F | id
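With the rewritten grammar, id+id*id has exactly one leftmost derivation, and the * ends up deeper in the parse tree, so it binds tighter than +:

E ⇒ E+T ⇒ T+T ⇒ F+T ⇒ id+T ⇒ id+T*F ⇒ id+F*F ⇒ id+id*F ⇒ id+id*id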
Derivations • A leftmost derivation is one in which only the leftmost non-terminal in a sentential form is replaced at each step. • A rightmost derivation is one in which only the rightmost non-terminal in a sentential form is replaced at each step • Remember ambiguity? For an unambiguous grammar, each sentence has exactly one leftmost derivation and one rightmost derivation • Generally pick leftmost or rightmost and stick with it • If some sentence has more than one leftmost derivation (equivalently, more than one parse tree), the grammar is ambiguous
Parse trees are central to compiler theory • They allow us to identify the production rule corresponding to each chunk of code, and from there replace that chunk of code with a chunk of assembly.
Phases of compilation • Lexical analysis (tokenization) – grouping the characters into tokens (int, {, x, etc.) (linear scan) • Syntax analysis (parsing) – grouping tokens into expressions or statements (int x = 10;) (recursion; CFG) • Semantic analysis (syntax-directed translation) – type checking, implicit typecasting, checking indices to arrays, that variables are declared before use, etc. Necessary because most programming languages can't be completely captured by a CFG • Code generation (next) – actually generating assembler code • Code optimization – looking for ways to make the assembler faster
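A toy illustration of the first phase in Python; the token categories and regular expressions below are simplified inventions for the example, not any particular compiler's token set:

import re

# Very small tokenizer for statements like "int x = 10;"
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),   # keywords such as 'int' could be split out later
    ("ASSIGN", r"="),
    ("SEMI",   r";"),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    # Linear scan: classify each lexeme, dropping whitespace
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("int x = 10;")))
# [('ID', 'int'), ('ID', 'x'), ('ASSIGN', '='), ('NUMBER', '10'), ('SEMI', ';')]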
Code generation: Using the parse tree to generate assembler code To prune the leaves of a parse tree means to eliminate all the leaves of a node, replacing the leaves (and their parent) with the intended “meaning” (a value or a chunk of code)
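A sketch of this pruning idea in Python, where the "meaning" is just a numeric value and the parse tree for 2 + 3 * 4 is built by hand (all names here are invented for illustration):

# Internal nodes are operators, leaves are literal values
tree = ("+", 2, ("*", 3, 4))

def prune(node):
    # Replace a node whose children have been reduced to values with its own
    # value, working bottom-up until only one value remains
    if not isinstance(node, tuple):
        return node                # already a leaf
    op, left, right = node
    left, right = prune(left), prune(right)
    return left + right if op == "+" else left * right

print(prune(tree))   # 14 -- the "meaning" of the whole tree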
Compilation vs interpretation • An interpreted language is a programming language for which most implementations execute instructions directly, without first compiling the program into machine-language instructions. The interpreter executes the program directly, translating each statement into a sequence of one or more subroutines that have already been compiled into machine code.