Programming Languages and Design Lecture 2 Syntax Specifications of Programming Languages. Instructor: Li Ma Department of Computer Science Texas Southern University, Houston. January, 2008. Review and Preview. Last lecture Introduction to programming languages Fundamental concepts
Programming Languages and Design Lecture 2 Syntax Specifications of Programming Languages Instructor: Li Ma Department of Computer Science Texas Southern University, Houston January, 2008
Review and Preview • Last lecture • Introduction to programming languages • Fundamental concepts • Computation models • Programming models/paradigms • Program processing • Today’s lecture • Syntax specifications of programming languages • Reference: Chapter 4 of “Foundations of Programming Languages: Design and Implementation”, S. H. Roosta • Three mechanisms: regular expressions, formal grammars, attribute grammars
Language Description • A formal language is any set of character strings whose characters are chosen from a fixed, finite alphabet of symbols • The strings that belong to the language are called its constructs, or phrases • Any programming language description can be classified according to its • Syntax, which deals with the formation of phrases • Semantics, which deals with the meaning of phrases • Pragmatics, which deals with the practical use of phrases
Syntax • Syntax refers to the formation of constructs in the language and defines relations between them • It describes the structure of the language without addressing the meaning of its constructs • Syntax of a programming language is similar to the grammar of a natural language • Three mechanisms are used to describe syntax in the design and implementation of programming languages • Regular expressions • Formal grammars • Attribute grammars
Regular Expressions • Introduced by Stephen Kleene in the early 1950s • Represent a form of language definition • Each regular expression E denotes some language L(E) defined over the alphabet of the language • Defined by the following set of rules • Alternation • If a and b are regular expressions, then so is (a+b) • The language defined by (a+b) contains all the strings from the language identified by a and all the strings from the language identified by b • Concatenation • If a and b are regular expressions, then so is (a·b) • The language defined by (a·b) contains all the strings formed by appending a string from the language identified by b to the end of a string from the language identified by a
Regular Expressions (cont’) • Defined by the following set of rules (cont’) • Kleene closure • If a is a regular expression, then so is a* • The language defined by a* consists of all the strings formed by concatenating zero or more strings from the language identified by a • Positive closure • If a is a regular expression, then so is a+ • The language defined by a+ consists of all the strings formed by concatenating one or more strings from the language identified by a • a+ is the same as a* except that ε is excluded, unless ε already belongs to the language identified by a • Empty • ∅ is a regular expression whose language contains no strings • Atom • Any single symbol, such as a, or the empty string ε, is a regular expression whose language consists of a single string: {a} or {ε}
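The rules above can be tried out by treating each language as a Python set of strings. This is only a minimal sketch: the function names are illustrative, and because a true Kleene closure is infinite, the closure here is cut off at a hypothetical `max_repeats` bound.

```python
from itertools import product

def alternation(la, lb):
    """L(a+b): every string from either language."""
    return la | lb

def concatenation(la, lb):
    """L(ab): a string from la followed by a string from lb."""
    return {x + y for x, y in product(la, lb)}

def kleene_closure(la, max_repeats=3):
    """Finite approximation of L(a*): zero up to max_repeats concatenations."""
    result = {""}      # zero repetitions gives the empty string ε
    current = {""}
    for _ in range(max_repeats):
        current = concatenation(current, la)
        result |= current
    return result

def positive_closure(la, max_repeats=3):
    """Finite approximation of L(a+): one up to max_repeats concatenations."""
    return concatenation(la, kleene_closure(la, max_repeats - 1))

A = {"a"}
B = {"b"}
print(sorted(alternation(A, B)))      # strings of (a+b)
print(sorted(concatenation(A, B)))    # strings of ab
print(sorted(kleene_closure(A)))      # strings of a*, up to three repeats
print(sorted(positive_closure(A)))    # strings of a+, up to three repeats
```

Note that the empty language behaves as the Empty rule suggests: concatenating anything with `set()` yields `set()` again.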
Formal Grammars • A grammar is a notation that you can use to specify a structural description of the various constructs in the language • Four components of the grammar of a programming language • Terminal symbols • Variable symbols (nonterminal) • Production rules • Start symbol
Production Rules • Each production rule has • a string of symbols containing at least one variable as its left side • the symbol => • a string over the set of terminals and variables as its right side • A production rule indicates that the left-side symbols derive, or simply imply, the right-side symbols • Derivation begins with the start symbol • Each successive string in the sequence is derived from the preceding string by applying one production rule
Definitions for Grammar • The grammar of a programming language can be defined as a quadruple, G = (T, V, P, S) • T is a finite set of terminal symbols, conventionally written as lowercase characters • V is a finite set of variable symbols (V∩T = ∅), conventionally written as uppercase characters • P is a finite set of production rules of the form αXβ => δ, where α, β, and δ are in (V∪T)* and X is in V • S in V is the start symbol • Two grammars, G1 and G2, are equivalent if and only if L(G1) = L(G2), that is, they generate the same language
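As a concrete instance of the quadruple, here is a minimal sketch of a grammar and one derivation in Python. The grammar (for strings of n a's followed by n b's) and the helper names `derive_step` and `derive` are assumptions made for illustration, not taken from the lecture.

```python
# Hypothetical grammar for the language { a^n b^n | n >= 1 }.
T = {"a", "b"}                      # terminal symbols (lowercase)
V = {"S"}                           # variable symbols (uppercase)
P = [("S", "aSb"), ("S", "ab")]     # production rules as (left, right) pairs
S = "S"                             # start symbol

def derive_step(string, rule):
    """Apply one production rule to the leftmost occurrence of its left side."""
    left, right = rule
    if left not in string:
        raise ValueError(f"{left!r} does not occur in {string!r}")
    return string.replace(left, right, 1)

def derive(start, rules):
    """Return the sequence of strings produced by applying rules in order."""
    steps = [start]
    for rule in rules:
        steps.append(derive_step(steps[-1], rule))
    return steps

steps = derive(S, [P[0], P[0], P[1]])
print(" => ".join(steps))
```

Each string in the printed sequence is obtained from the previous one by a single rule application, and the last one contains only terminal symbols.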
Classification of Grammars • Type 0: unrestricted grammar • Requires at least one nonterminal symbol on the left side of a production rule • Form α => β, where α in (V∪T)+ and β in (V∪T)* • Also called a phrase-structure grammar; generates the recursively enumerable languages • Type 1: context-sensitive grammar • Requires that the right side of a production rule have no fewer symbols than the left side • Form α => β, where α = δ1Aδ2, β = δ1ωδ2, A in V, ω in (V∪T)+ and δ1, δ2 in (V∪T)*
Classification of Grammars (cont’) • Type 2: context-free grammar • Requires that the left side of a production rule be a single variable symbol and the right side be a combination of terminal and variable symbols • Form A => α, where A in V and α in (V∪T)* • Backus-Naur Form (BNF) grammar • Equivalent to context-free grammar • Differs only in the notation • Nonterminals are enclosed in < > • The symbol ::= is used in place of =>
Classification of Grammars (cont’) • Type 3: regular grammar • Restricted to only one terminal, or one terminal and one variable, on the right side of a production rule • The most restrictive grammar type • Right-linear grammar • Form A => xB or A => x, where A, B in V, x in T • The variable, if present, is the rightmost symbol of the right side • Left-linear grammar • Form A => Bx or A => x, where A, B in V, x in T • The variable, if present, is the leftmost symbol of the right side
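The four types can be told apart mechanically by inspecting the shape of each production rule. The sketch below assumes the convention from the definitions (uppercase letters are variables, lowercase letters are terminals), uses the length condition for Type 1, and checks only the right-linear form for Type 3; the function names are hypothetical.

```python
def rule_type(left, right):
    """Return the most restrictive type (0-3) that a single rule satisfies."""
    if not any(ch.isupper() for ch in left):
        raise ValueError("left side needs at least one variable symbol")
    if len(left) == 1:
        # Type 3 (right-linear form): A => x or A => xB
        if len(right) == 1 and right.islower():
            return 3
        if len(right) == 2 and right[0].islower() and right[1].isupper():
            return 3
        # Type 2: a single variable on the left, any right side
        return 2
    # Type 1: the right side is no shorter than the left side
    if len(right) >= len(left):
        return 1
    # Type 0: no restriction beyond a variable on the left
    return 0

def grammar_type(rules):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(left, right) for left, right in rules)

print(grammar_type([("S", "aS"), ("S", "b")]))        # right-linear rules
print(grammar_type([("S", "aSb"), ("S", "ab")]))      # context-free rules
print(grammar_type([("S", "aSBc"), ("cB", "Bc")]))    # a two-symbol left side
print(grammar_type([("AB", "a")]))                    # right side shorter than left
```

Taking the minimum over all rules reflects that the types are nested: every regular grammar is also context-free, and so on up to Type 0.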
Syntax Tree • Two parts of programming language syntax • Lexical syntax: describes the smallest units with significance, called tokens • Phrase-structure syntax: explains how tokens are arranged into programs • The syntactic structure of a phrase can be represented with a syntax tree (derivation tree or parse tree) • Terminal nodes – terminal symbols • Internal nodes – variable symbols • Root – start symbol • The label of an internal node – left side of the production rule; the labels of the children of the node (from left to right) – right side of the production rule
Syntax Tree (cont’) • Recognition/representation • Determining whether the phrase is syntactically valid • Production rules are used to construct a syntax tree • The grammar-oriented compiling technique consists of two components • A lexical analyzer: converts the stream of input characters to a stream of tokens • A syntactic analyzer: forms a derivation tree from the token list; it is a combination of • A parser • An intermediate code generator
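The lexical analyzer's job, turning characters into tokens, can be sketched with Python's `re` module. The token classes in `TOKEN_SPEC` are hypothetical and chosen for a tiny expression language; a real language would define its own.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Convert a character stream into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":      # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("x = 3 + 42"))
```

Each token class is itself described by a regular expression, which is why regular expressions are the usual mechanism for lexical syntax.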
Parsers • Parsing: deriving the parse tree • Two basic approaches to deriving parse trees • Top-down parsers • Begin with the start symbol as the root of the tree • Repeatedly replace variable symbols with the right sides of production rules until only terminal symbols remain • Bottom-up parsers • Begin with a string of terminal symbols • Repeatedly replace sequences in the string with variable symbols • The process continues until the start symbol is produced • In both cases, the tree is the result of a syntactic analysis of the input phrase according to the grammar
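A top-down parser is often written as one function per variable symbol (recursive descent). The sketch below assumes a small illustrative grammar, `<expr> ::= <term> + <expr> | <term>` and `<term> ::= a | ( <expr> )`, and returns the parse tree as nested tuples; none of these names come from the lecture.

```python
def parse(text):
    """Top-down (recursive-descent) parse; returns a nested-tuple parse tree."""
    tree, pos = parse_expr(text, 0)
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return tree

def parse_expr(text, pos):
    # <expr> ::= <term> + <expr> | <term>
    term, pos = parse_term(text, pos)
    if pos < len(text) and text[pos] == "+":
        rest, pos = parse_expr(text, pos + 1)
        return ("expr", term, "+", rest), pos
    return ("expr", term), pos

def parse_term(text, pos):
    # <term> ::= a | ( <expr> )
    if pos < len(text) and text[pos] == "a":
        return ("term", "a"), pos + 1
    if pos < len(text) and text[pos] == "(":
        inner, pos = parse_expr(text, pos + 1)
        if pos >= len(text) or text[pos] != ")":
            raise SyntaxError(f"expected ')' at position {pos}")
        return ("term", "(", inner, ")"), pos + 1
    raise SyntaxError(f"expected 'a' or '(' at position {pos}")

print(parse("a+a"))
```

The root of the returned tuple is the start symbol `expr`, each inner tuple is labeled with a variable symbol, and the string leaves are terminal symbols, matching the syntax tree description.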
Ambiguity • Ambiguous grammar: a grammar is ambiguous if some phrase of its language has two or more distinct derivation trees • Caused by the grammar not imposing enough syntactic structure on the phrase • Ambiguity should be eliminated whenever possible • Revise the grammar • Introduce a disambiguating rule
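Ambiguity can be observed directly by counting derivation trees. The sketch below assumes the classic ambiguous grammar `E => E+E | a` (an illustrative choice, not from the lecture) and counts how many distinct trees derive a given string.

```python
from functools import lru_cache

def count_trees(text):
    """Count derivation trees of text under the grammar E => E+E | a."""
    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways to derive text[i:j] from E.
        total = 1 if text[i:j] == "a" else 0          # rule E => a
        for k in range(i + 1, j - 1):                 # rule E => E+E, split at each '+'
            if text[k] == "+":
                total += count(i, k) * count(k + 1, j)
        return total
    return count(0, len(text))

for phrase in ["a", "a+a", "a+a+a", "a+a+a+a"]:
    print(phrase, count_trees(phrase))
```

Any count above one shows that the phrase is ambiguous. Revising the grammar to `E => E+a | a` would force a single left-leaning tree for the same strings.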
BNF Variations • Other notational variations • Example: the notation { … }, with a lower bound i and an upper bound j attached to the closing brace, expresses any number n of occurrences of the enclosed sequence of symbols, for i ≤ n ≤ j • Extended BNF grammar • Adds some extra notations to allow easier description of languages • Anything that can be specified with BNF can also be specified with Extended BNF (EBNF) grammar • Increases the readability and writability of the production rules • Syntax diagram • A pictorial technique, equivalent to BNF grammar • In this approach, each production rule is represented as a directed graph whose vertices are symbols • Terminal symbols: circles • Variable symbols: rectangles
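The bounded repetition notation has a close analogue in regular-expression syntax, where `{i,j}` after a group means between i and j copies. A minimal sketch using Python's `re` module follows; the helper name `repeats_within` is hypothetical.

```python
import re

def repeats_within(unit, text, i, j):
    """True if text is n consecutive copies of unit with i <= n <= j."""
    # Build a pattern such as (?:ab){1,3}; re.escape keeps unit literal.
    pattern = f"(?:{re.escape(unit)}){{{i},{j}}}"
    return re.fullmatch(pattern, text) is not None

print(repeats_within("ab", "abab", 1, 3))       # two copies, within 1..3
print(repeats_within("ab", "abababab", 1, 3))   # four copies, outside 1..3
print(repeats_within("ab", "", 0, 3))           # zero copies, allowed when i = 0
```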
Attribute Grammars • Developed by Donald Knuth in 1968 • Powerful and elegant mechanisms that formalize both the context-free and context-sensitive aspects of a language’s syntax • Can be used to determine whether a variable has been declared and whether the use of the variable is consistent with its declaration • An extension of a context-free grammar with certain formal primitives that enable syntax aspects of a language to be specified more precisely
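The declared-before-use check can be sketched as an attribute (a table of declared names and types) passed along a sequence of statements. This is a deliberate simplification: a full attribute grammar attaches attributes and evaluation rules to the nodes of the parse tree, and the statement format used here is an assumption for illustration.

```python
def check(statements):
    """Walk statements in order, threading a symbol table as an attribute.

    Each statement is ("decl", name, type) or ("assign", name, type_of_value).
    Returns a list of error messages; an empty list means the checks passed.
    """
    declared = {}   # attribute: maps each declared name to its type
    errors = []
    for kind, name, typ in statements:
        if kind == "decl":
            if name in declared:
                errors.append(f"{name} declared twice")
            declared[name] = typ
        elif kind == "assign":
            if name not in declared:
                errors.append(f"{name} used before declaration")
            elif declared[name] != typ:
                errors.append(f"{name} is {declared[name]}, assigned {typ}")
    return errors

program = [
    ("decl", "x", "int"),
    ("assign", "x", "int"),     # consistent with the declaration
    ("assign", "y", "int"),     # y was never declared
    ("assign", "x", "real"),    # type differs from the declaration
]
for message in check(program):
    print(message)
```

Neither error can be caught by a context-free grammar alone, since both depend on information from an earlier part of the program.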