CSE305 Programming Languages. Syntax What is it? How is it specified? Who uses it? Why is it needed?. Note. These notes are based on the Sebesta text.
CSE305 Programming Languages Syntax What is it? How is it specified? Who uses it? Why is it needed?
Note • These notes are based on the Sebesta text. • The tree diagrams in these slides are from the lecture slides provided in the instructor resources for the text, and were made by David Garrett.
Introduction: syntax and semantics • Syntax: a formal description of the structure of programs in a given language. • Semantics: a formal description of the meaning of programs in a given language. • Together the syntax and semantics define a language.
Who uses a language definition? • Those who design a language • Those who implement a language (e.g. write compilers for it) • Those who use the language (i.e. software developers) • Those who make tools for developers (e.g. JDT in Eclipse)
Language & grammar • A given language can have more than one grammar which describes it. • The grammar presented to a user is not necessarily the same as the grammar used in an implementation. • implementation requires a very detailed grammar • user needs a human-readable grammar
Syntax and semantics of programming languages • I have cautioned against getting too hung up on the syntax of a programming language. • But you still need to learn the syntax of any language you work with so that you can read and write programs in the language. • To understand the meaning of programs expressed in a language you also have to know the semantics of the language.
General background • Chomsky hierarchy • Context-free grammars • Backus-Naur form
Chomsky hierarchy • Noam Chomsky defined a hierarchy of grammars and languages known as the Chomsky hierarchy: • regular languages (most restrictive) • context-free languages • context-sensitive languages • unrestricted languages (least restrictive)
Context-free (CF) grammar • A CF grammar is formally presented as a 4-tuple G = (T, NT, P, S), where: • T is a set of terminal symbols (the alphabet) • NT is a set of non-terminal symbols • P is a set of productions (or rules), where P ⊆ NT × (T ∪ NT)* • S ∈ NT is the start symbol
Example 1: A small formal language L1 = { 0, 00, 1, 11 } G1 = ( {0, 1}, {S}, { S → 0, S → 00, S → 1, S → 11 }, S )
Example 2: A small fragment of English L2 = { the dog chased the dog, the dog chased a dog, a dog chased the dog, a dog chased a dog, the dog chased the cat, … } G2 = ({a, the, dog, cat, chased}, {S, NP, VP, Det, N, V}, {S → NP VP, NP → Det N, Det → a | the, N → dog | cat, VP → V | V NP, V → chased}, S) Notes: S = Sentence, NP = Noun Phrase, N = Noun, VP = Verb Phrase, V = Verb, Det = Determiner
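The 4-tuple definition and the two example grammars can be sketched in code. The following is only an illustration, not part of the original notes: the function name `sentences` and the dictionary encoding of the productions are assumptions made here. It enumerates every sentence of a grammar by repeatedly expanding the leftmost non-terminal, which terminates only because L1 and L2 are finite.

```python
from collections import deque

def sentences(grammar):
    """Enumerate all sentences of a grammar G = (T, NT, P, S).

    Hypothetical helper: it only terminates when the language is finite.
    """
    terminals, nonterminals, productions, start = grammar
    found = set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal in the current sentential form.
        idx = next((i for i, sym in enumerate(form) if sym in nonterminals), None)
        if idx is None:
            found.add(form)  # only terminals remain: this is a sentence
            continue
        for rhs in productions[form[idx]]:
            queue.append(form[:idx] + rhs + form[idx + 1:])
    return found

# G1 from Example 1, with each right-hand side written as a tuple of symbols.
G1 = ({"0", "1"}, {"S"},
      {"S": [("0",), ("0", "0"), ("1",), ("1", "1")]},
      "S")

# G2 from Example 2.
G2 = ({"a", "the", "dog", "cat", "chased"},
      {"S", "NP", "VP", "Det", "N", "V"},
      {"S": [("NP", "VP")],
       "NP": [("Det", "N")],
       "Det": [("a",), ("the",)],
       "N": [("dog",), ("cat",)],
       "VP": [("V",), ("V", "NP")],
       "V": [("chased",)]},
      "S")

L1 = sentences(G1)
L2 = sentences(G2)
```

Because VP can also expand to V alone, this encoding of G2 generates short sentences such as "the dog chased" in addition to the ones listed for L2.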
Language terminology (from Sebesta, p. 125) • A language is a set of strings of symbols, drawn from some finite set of symbols (called the alphabet of the language). • “The strings of a language are called sentences” • “Formal descriptions of the syntax […] do not include descriptions of the lowest-level syntactic units […] called lexemes.” • “A token of a language is a category of its lexemes.” • Syntax of a programming language is often presented in two parts: • regular grammar for token structure (e.g. structure of identifiers) • context-free grammar for sentence structure
Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe ALGOL 58, modified by Peter Naur for ALGOL 60 • BNF is equivalent to context-free grammar • BNF is a metalanguage used to describe another language, the object language • Extended BNF: adds syntactic sugar to produce more readable descriptions
BNF Fundamentals • Sample rules [p. 128] <assign> → <var> = <expression> <if_stmt> → if <logic_expr> then <stmt> <if_stmt> → if <logic_expr> then <stmt> else <stmt> • non-terminals/tokens surrounded by < and > • lexemes are not surrounded by < and > • keywords in language are in bold • → separates LHS from RHS • | expresses alternative expansions for LHS <if_stmt> → if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt> • = is in this example a lexeme
BNF Rules • A rule has a left-hand side (LHS) and a right-hand side (RHS), and consists of terminal and nonterminal symbols • A grammar is often given simply as a set of rules (terminal and non-terminal sets are implicit in rules, as is start symbol)
Describing Lists • There are many situations in which a programming language allows a list of items (e.g. parameter list, argument list). • Such a list can typically be empty or contain just one item. • Such lists typically have no upper bound on their length. • How is their structure described?
Describing lists • They are described using recursive rules. • Here is a pair of rules describing a list of identifiers, whose minimum length is one: <ident_list> -> ident | ident , <ident_list> • Notice that ‘,’ is part of the object language (the language being described by the grammar).
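A recursive rule maps naturally onto a recursive function. The sketch below is an illustration only: the name `is_ident_list` is made up here, and it assumes the input has already been split into tokens and that Python's `str.isidentifier` is an acceptable stand-in for the `ident` token.

```python
def is_ident_list(tokens):
    """Recognize <ident_list> -> ident | ident , <ident_list>."""
    # Both alternatives start with an ident, so an empty list is rejected.
    if not tokens or not tokens[0].isidentifier():
        return False
    # First alternative: a single ident.
    if len(tokens) == 1:
        return True
    # Second alternative: ident , <ident_list> (the recursive case).
    return tokens[1] == "," and is_ident_list(tokens[2:])
```

For example, `is_ident_list(["x", ",", "y"])` is accepted, while a trailing comma as in `["x", ","]` is rejected because the recursive call receives an empty list.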
Derivation of sentences from a grammar • A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols)
Recall example 2 G2 = ({a, the, dog, cat, chased}, {S, NP, VP, Det, N, V}, {S → NP VP, NP → Det N, Det → a | the, N → dog | cat, VP → V | V NP, V → chased}, S)
Example: derivation from G2 • Example: derivation of the dog chased a cat S => NP VP => Det N VP => the N VP => the dog VP => the dog V NP => the dog chased NP => the dog chased Det N => the dog chased a N => the dog chased a cat
Example 3 L3 = { 0, 1, 00, 11, 000, 111, 0000, 1111, … } G3 = ( {0, 1}, {S, ZeroList, OneList}, {S → ZeroList | OneList, ZeroList → 0 | 0 ZeroList, OneList → 1 | 1 OneList }, S )
Example: derivations from G3 • Example: derivation of 0 0 0 0 S => ZeroList => 0 ZeroList => 0 0 ZeroList => 0 0 0 ZeroList => 0 0 0 0 • Example: derivation of 1 1 1 S => OneList => 1 OneList => 1 1 OneList => 1 1 1
Observations about derivations • Every string of symbols in the derivation is a sentential form. • A sentence is a sentential form that has only terminal symbols. • A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded. • A derivation can be leftmost, rightmost, or neither.
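To make the idea of a leftmost derivation concrete, here is a minimal sketch for G3. The names `G3_RULES` and `leftmost_derivation` are illustrative assumptions; the caller supplies the right-hand side to use at each step, and the function always rewrites the leftmost non-terminal, recording every sentential form along the way.

```python
# Productions of G3, keyed by non-terminal.
G3_RULES = {
    "S": [["ZeroList"], ["OneList"]],
    "ZeroList": [["0"], ["0", "ZeroList"]],
    "OneList": [["1"], ["1", "OneList"]],
}

def leftmost_derivation(start, choices):
    """Apply the chosen right-hand sides, always to the leftmost non-terminal."""
    form = [start]
    forms = [" ".join(form)]
    for rhs in choices:
        idx = next((i for i, sym in enumerate(form) if sym in G3_RULES), None)
        if idx is None:
            raise ValueError("no non-terminal left to expand")
        if rhs not in G3_RULES[form[idx]]:
            raise ValueError(f"{form[idx]} -> {' '.join(rhs)} is not a rule of G3")
        form[idx:idx + 1] = rhs  # replace the non-terminal by the chosen RHS
        forms.append(" ".join(form))
    return forms

steps = leftmost_derivation(
    "S",
    [["ZeroList"], ["0", "ZeroList"], ["0", "ZeroList"], ["0", "ZeroList"], ["0"]],
)
```

Each entry of `steps` is a sentential form; only the last one, which contains terminals alone, is a sentence.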
An example programming language grammar fragment <program> -> <stmt-list> <stmt-list> -> <stmt> | <stmt> ; <stmt-list> <stmt> -> <var> = <expr> <var> -> a | b | c | d <expr> -> <term> + <term> | <term> - <term> <term> -> <var> | const
A leftmost derivation of a = b + const <program> => <stmt-list> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const
Parse tree • A parse tree is a hierarchical representation of a derivation:
<program>
  <stmt-list>
    <stmt>
      <var>
        a
      =
      <expr>
        <term>
          <var>
            b
        +
        <term>
          const
Parse trees and compilation • A compiler builds a parse tree for a program (or for different parts of a program). • If the compiler cannot build a well-formed parse tree from a given input, it reports a compilation error. • The parse tree serves as the basis for semantic interpretation/translation of the program.
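As a rough illustration of how a parse tree can be built, or a syntax error reported, here is a small recursive-descent sketch for the grammar fragment with `<program>`, `<stmt-list>`, `<stmt>`, `<var>`, `<expr>` and `<term>` given earlier. It is not how any particular compiler works: the function names, the nested-tuple tree representation, and the whitespace-separated token format are all assumptions made for this example.

```python
VARS = {"a", "b", "c", "d"}

def parse_program(tokens):
    """<program> -> <stmt-list>; the whole input must be consumed."""
    tree, rest = parse_stmt_list(tokens)
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return ("<program>", tree)

def parse_stmt_list(tokens):
    """<stmt-list> -> <stmt> | <stmt> ; <stmt-list>"""
    stmt, rest = parse_stmt(tokens)
    if rest and rest[0] == ";":
        tail, rest = parse_stmt_list(rest[1:])
        return ("<stmt-list>", stmt, ";", tail), rest
    return ("<stmt-list>", stmt), rest

def parse_stmt(tokens):
    """<stmt> -> <var> = <expr>"""
    var, rest = parse_var(tokens)
    if not rest or rest[0] != "=":
        raise SyntaxError("expected '='")
    expr, rest = parse_expr(rest[1:])
    return ("<stmt>", var, "=", expr), rest

def parse_var(tokens):
    """<var> -> a | b | c | d"""
    if tokens and tokens[0] in VARS:
        return ("<var>", tokens[0]), tokens[1:]
    raise SyntaxError("expected a variable")

def parse_expr(tokens):
    """<expr> -> <term> + <term> | <term> - <term>"""
    left, rest = parse_term(tokens)
    if not rest or rest[0] not in ("+", "-"):
        raise SyntaxError("expected '+' or '-'")
    right, rest2 = parse_term(rest[1:])
    return ("<expr>", left, rest[0], right), rest2

def parse_term(tokens):
    """<term> -> <var> | const"""
    if tokens and tokens[0] == "const":
        return ("<term>", "const"), tokens[1:]
    var, rest = parse_var(tokens)
    return ("<term>", var), rest

tree = parse_program("a = b + const".split())
```

Input that cannot be fitted to the grammar, such as `a = b` (this fragment requires every `<expr>` to contain an operator), raises `SyntaxError`, mirroring the compilation error described above.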
Extended BNF • Optional parts are placed in brackets [ ] <proc_call> -> ident [(<expr_list>)] • Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> -> <term>(+|-) const • Repetitions (0 or more) are placed inside braces { } <ident> -> letter {letter|digit}
Comparison of BNF and EBNF • sample grammar fragment expressed in BNF <expr> -> <expr> + <term> | <expr> - <term> | <term> <term> -> <term> * <factor> | <term> / <factor> | <factor> • same grammar fragment expressed in EBNF <expr> -> <term> {(+ | -) <term>} <term> -> <factor> {(* | /) <factor>}
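One practical reason to prefer the EBNF form is that each `{ ... }` repetition corresponds directly to a loop in a hand-written parser. The sketch below evaluates expressions while parsing them. It rests on assumptions not in the notes: `<factor>` is taken to be a non-negative integer literal, tokens are separated by whitespace, and the name `evaluate` is made up for this example.

```python
def evaluate(text):
    """Evaluate text using <expr> -> <term> {(+ | -) <term>} and
    <term> -> <factor> {(* | /) <factor>}, with <factor> assumed to be an integer."""
    tokens = text.split()
    pos = 0

    def factor():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError("expected an integer literal")
        value = int(tokens[pos])
        pos += 1
        return value

    def term():
        nonlocal pos
        value = factor()
        # {(* | /) <factor>}: zero or more repetitions become a while loop.
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():
        nonlocal pos
        value = term()
        # {(+ | -) <term>}: same pattern one level up.
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

Because each loop folds its operands from left to right, `evaluate("10 - 8 - 6")` groups as (10 - 8) - 6, and because `term` is called from inside `expr`, multiplication and division bind more tightly than addition and subtraction.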
Ambiguity in grammars • A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees • Operator precedence and operator associativity are two examples of ways in which a grammar can provide an unambiguous interpretation.
Operator precedence ambiguity The following grammar is ambiguous: <expr> -> <expr> <op> <expr> | const <op> -> / | - The grammar treats the '/' and '-' operators equivalently.
An ambiguous grammar for arithmetic expressions <expr> -> <expr> <op> <expr> | const <op> -> / | - [Figure: two distinct parse trees for const - const / const. In one, the root expands to <expr> / <expr> with const - const as the left operand; in the other, the root expands to <expr> - <expr> with const / const as the right operand.]
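The ambiguity can be checked mechanically by enumerating every parse tree that the rule `<expr> -> <expr> <op> <expr> | const` allows for a token sequence. The sketch below is illustrative only: `parse_trees` is a made-up name, trees are simplified to nested tuples that omit the `<expr>` and `<op>` node labels, and the input is assumed to be a well-formed alternation of operands and operators.

```python
def parse_trees(tokens):
    """Return every tree for tokens under <expr> -> <expr> <op> <expr> | const."""
    if len(tokens) == 1:
        return [tokens[0]]  # base case: a single const
    trees = []
    # Any operator (at an odd index) may serve as the root of the tree.
    for i in range(1, len(tokens), 2):
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((left, tokens[i], right))
    return trees

trees = parse_trees(["const", "-", "const", "/", "const"])
```

Here `trees` contains two entries, one rooted at `-` and one rooted at `/`, which is exactly the ambiguity at issue: with the operands 8, 4 and 2, the two groupings give 8 - (4 / 2) = 6 and (8 - 4) / 2 = 2.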
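Disambiguating the grammar • If we use the parse tree to indicate precedence levels of the operators, we can remove the ambiguity. • The following rules give / a higher precedence than - <expr> -> <expr> - <term> | <term> <term> -> <term> / const | const [Figure: the single parse tree for const - const / const under these rules. The root expands to <expr> - <term>; the left <expr> derives const through <term>, and the right <term> expands to <term> / const, so / sits lower in the tree than -.]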
Links to BNF-style grammars for actual programming languages Below are some links to grammars for real programming languages. Look at how the grammars are expressed. • http://www.schemers.org/Documents/Standards/R5RS/ • http://www.sics.se/isl/sicstuswww/site/documentation.html In the ones listed below, find the parts of the grammar that deal with operator precedence. • http://java.sun.com/docs/books/jls/index.html • http://www.lykkenborg.no/java/grammar/JLS3.html • http://www.enseignement.polytechnique.fr/profs/informatique/Jean-Jacques.Levy/poly/mainB/node23.html • http://www.lrz-muenchen.de/~bernhard/Pascal-EBNF.html
Derivation of 2+5*3 using C grammar [Figure: parse tree. A chain of single-child nodes runs from <expression> through <assignment-expression>, <conditional-expression>, <logical-OR-expression>, <logical-AND-expression>, <inclusive-OR-expression>, <exclusive-OR-expression>, <AND-expression>, <equality-expression>, <relational-expression> and <shift-expression> down to <additive-expression>, which expands to <additive-expression> + <multiplicative-expression>. The left operand derives the constant 2. The right operand expands to <multiplicative-expression> * <cast-expression>, deriving 5 and 3 respectively. Each operand reaches its constant through <cast-expression>, <unary-expression>, <postfix-expression>, <primary-expression> and <constant>.]
Recursion and parentheses • To generate 2+3*4 or 3*4+2, the parse tree is built so that + is higher in the tree than *. • To force an addition to be done prior to a multiplication we must use parentheses, as in (2+3)*4. • Grammar captures this in the recursive case of an expression, as in the following grammar fragment: <expr> -> <expr> + <term> | <term> <term> -> <term> * <factor> | <factor> <factor> -> <variable> | <constant> | “(” <expr> “)”
Associativity of operators • When multiple operators appear in an expression, we need to know how to interpret the expression. • Some operators (e.g. +) are associative, meaning that the meaning of an expression with multiple instances of the operator is the same no matter how it is interpreted: (a+b)+c = a+(b+c) • Some operators (e.g. -) are not associative: (a-b)-c ≠ a-(b-c) e.g. try a=10, b=8, c=6: (10-8)-6 = -4 but 10-(8-6) = 8 • - and / are both left-associative, meaning a-b-c is interpreted as (a-b)-c. • Exponentiation (**) is right-associative. This means that 2**3**2 is interpreted as 2**(3**2) (i.e. 2**9) rather than (2**3)**2 (i.e. 8**2 or 2**6).
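These claims are easy to check in a language that follows the same conventions. Python is used here purely as a convenient example (it is not mentioned in the notes): its `-` operator is left-associative and its `**` operator is right-associative.

```python
a, b, c = 10, 8, 6

left_grouping = (a - b) - c    # (10 - 8) - 6
right_grouping = a - (b - c)   # 10 - (8 - 6)
unparenthesized = a - b - c    # Python groups this to the left

power_right = 2 ** (3 ** 2)    # 2 ** 9
power_left = (2 ** 3) ** 2     # 8 ** 2
power_plain = 2 ** 3 ** 2      # Python groups this to the right
```

Here `unparenthesized` equals `left_grouping` (-4), not `right_grouping` (8), and `power_plain` equals `power_right` (512), not `power_left` (64).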
Associativity of Operators • Operator associativity can be encoded by a grammar. The following grammar fragment does not do this: the left and right operands of '-' are treated symmetrically. <expr> -> <expr> - <expr> | <term> <term> -> <var> | <const> | “(” <expr> “)” [Figure: two distinct parse trees for an expression with three terms and two '-' operators. In one, the left operand of the root '-' is itself an <expr> - <expr>; in the other, the right operand is.]
Associativity of Operators • However, the following rules ensure that '-' is left-associative, because they prevent direct recursion with '-' in the right-hand operand. <expr> -> <expr> - <term> | <term> <term> -> <var> | <const> | “(” <expr> “)” [Figure: the single parse tree for an expression with three terms and two '-' operators. The root expands to <expr> - <term>, and its left <expr> expands again to <expr> - <term>, so the leftmost subtraction is lowest in the tree.]
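A recursive-descent parser cannot call itself directly on a left-recursive rule such as `<expr> -> <expr> - <term>` without looping forever, so a common workaround is to parse one `<term>` and then fold further `- <term>` pairs into the tree from the left. The sketch below takes that approach. The function names, the nested-tuple trees, and the treatment of any token other than `-`, `(` and `)` as a `<var>` or `<const>` are assumptions made for illustration.

```python
def parse_expr(tokens, pos=0):
    """Build a left-associative tree for <expr> -> <expr> - <term> | <term>."""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "-":
        right, pos = parse_term(tokens, pos + 1)
        tree = (tree, "-", right)  # the tree built so far becomes the LEFT operand
    return tree, pos

def parse_term(tokens, pos):
    """<term> -> <var> | <const> | "(" <expr> ")" """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "(":
        tree, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return tree, pos + 1
    if tokens[pos] in ("-", ")"):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tokens[pos], pos + 1  # a variable or constant

plain, _ = parse_expr("a - b - c".split())
grouped, _ = parse_expr("a - ( b - c )".split())
```

The tree for `a - b - c` nests on the left, matching the left-recursive rule, while explicit parentheses are the only way to obtain the right-nested tree.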
Decision timing: Design time vs. Implementation time • (to come) • Java and precedence/associativity/left-to-right evaluation vs. C++ (?)
Dealing with fixed-size numeric representations: Theory vs. Reality • (to come) • Java/C# vs. C/C++ (size of representation – but this is not the slide to address this on: see next point). • Also, effect of fixed size of representations on associativity: • mathematically, (x+y)+z = x+(y+z) • in practice (+ is not always associative): • (large+small)+small = large • large+(small+small) > large
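The large/small effect can be reproduced with 64-bit IEEE 754 floating-point numbers, which is what Python's `float` uses on common platforms. The specific values below are chosen for illustration and do not come from the notes: near 1e16 adjacent representable doubles are 2 apart, so adding 1.0 is lost to rounding while adding 2.0 is not.

```python
large = 1e16
small = 1.0

# Each addition of 1.0 is rounded away, so the sum never moves.
left_grouping = (large + small) + small

# The two small values are combined first, and 2.0 is big enough to register.
right_grouping = large + (small + small)
```

With these values `left_grouping == large` holds while `right_grouping > large`, so floating-point `+` is not associative even though mathematical addition is.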