460 likes | 592 Views
Structure of programming languages. Syntax and Semantics. Introduction. Syntax : the form or structure of the expressions, statements, and program units Semantics : the meaning of the expressions, statements, and program units. Introduction.
E N D
Structure of programming languages Syntax and Semantics
Introduction • Syntax: the form or structure of the expressions, statements, and program units • Semantics: the meaning of the expressions, statements, and program units
Introduction • Syntax and semantics provide a language’s definition • Users of a language definition • Other language designers • Implementers • Programmers (the users of the language)
Syntax of a Languages • Motivation for using a subset of C: Grammar Language (pages) Reference Pascal 5 Jensen & Wirth C 6 Kernighan & Richie C++ 22 Stroustrup Java 14 Gosling, et. al. • The grammar fits on one page, so it’s a far better tool for studying language design.
C Grammar: Statements Program int main ( ){ Declarations Statements } Declarations {Declaration } Declaration Type Identifier [ [ Integer ] ]{ , Identifier [ [ Integer ] ] }; Type int | bool | float | char Statements {Statement } Statement ; | Block | Assignment | IfStatement | WhileStatement Block { Statements } Assignment Identifier [ [ Expression ] ]=Expression ; IfStatement if (Expression ) Statement [else Statement ] WhileStatement while (Expression ) Statement
C Grammar: Expressions Expression Conjunction {|| Conjunction } Conjunction Equality {&& Equality } Equality Relation [ EquOp Relation ] EquOp == | != Relation Addition [ RelOp Addition ] RelOp < | <= | > | >= Addition Term { AddOp Term } AddOp + | - Term Factor { MulOp Factor } MulOp * | / | % Factor [ UnaryOp ]Primary UnaryOp - | ! Primary Identifier [[ Expression ]] | Literal | ( Expression ) | Type ( Expression )
Describing Syntax: Terminology • A sentenceis a string of characters over some alphabet • A language is a set of sentences • Alexemeis the lowest level syntactic unit of a language (e.g., *, sum, begin) • A tokenis a category of lexemes (e.g., identifier)
Formal Definition of Languages • Recognizers • A recognition device reads input strings of the language and decides whether the input strings belong to the language • Example: syntax analysis part of a compiler • Generators • A device that generates sentences of a language • One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator
Lexical Analysis • The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. • In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( + x 1 ) ) • Similarly, in Java, we tokenizeSystem.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;
Lexical Analysis • Lexical analyzer splits it into tokens • Token = sequence of characters (symbolic name) representing a single terminal symbol • Identifiers: myVariable … • Literals: 123 5.67 true … • Keywords: char sizeof … • Operators: + - * / … • Punctuation: ; , } { … • Discards whitespace and comments
Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens
program confusing; const true = false; begin if (a<b) = true then f(a) else … Redefining Identifiers can be dangerous
Parsing Process • Call the scanner to get tokens • Build a parse tree from the stream of tokens • A parse tree shows the syntactic structure of the source program. • Add information about identifiers in the symbol table • Report error, when found, and recover from the error
Describing Syntax • Backus-Naur Form and Context-Free Grammars (BNF) • Most widely known method for describing programming language syntax • Extended BNF • Improves readability and writability of BNF
Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe Algol 58 • BNF is equivalent to context-free grammars • BNF is a meta-language used to describe another language • In BNF, abstractions are used to represent classes of syntactic structures--they act like syntactic variables (also called non-terminal symbols)
Consider the grammar: binaryDigit 0 binaryDigit 1 or equivalently: binaryDigit 0 | 1 Here, | is a metacharacter that separates alternatives. Example: Binary Digits
Example: Decimal Numbers • Grammar for unsigned decimal integers • Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 • Non-terminal symbols: Digit, Integer • Production rules: • Integer Digit | Integer Digit • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Start symbol: Integer • Can derive any unsigned integer using this grammar • Language = set of all unsigned decimal integers
Derivation of 352 as an Integer • Production rules: • Integer Digit | Integer Digit • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Integer Integer Digit Integer 2 Integer Digit 2 Integer 5 2 Digit 5 2 3 5 2 • Rightmost derivation At each step, the rightmost non-terminal is replaced
The following grammar defines the language of arithmetic expressions with 1-digit integers, addition, and subtraction. Expr Expr + Term | Expr – Term | Term Term 0 | ... | 9 | ( Expr ) Arithmetic Expression Grammar
BNF Fundamentals • Non-terminals: BNF abstractions • Terminals: lexemes and tokens • Grammar: a collection of rules • Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> <if_stmt> → if <logic_expr> then <stmt>
Regular Expressions • x character x • \x escaped character, e.g., \n • { name } reference to a name • M | N M or N • M N M followed by N • M* 0 or more occurrences of M • M+ 1 or more occurrences of M • [x1 … xn] One of x1 … xn • Example: [aeiou] – vowels, [0-9] - digits
An Example Grammar <program> <stmts> <stmts> <stmt> | <stmt> ; <stmts> <stmt> <var> = <expr> <var> a | b | c | d <expr> <term> + <term> | <term> - <term> <term> <var> | const
Derivation • A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols), e.g., <program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const
Parse Tree • A hierarchical representation of a derivation <program> <stmts> <stmt> <var> = <expr> a <term> + <term> <var> const b
CFG For Floating Point Numbers ::= stands for production rule; <…> are non-terminals; | represents alternatives for the right-hand side of a production rule Sample parse tree:
Associativity and Precedence • A grammar can be used to define associativity and precedence among the operators in an expression. E.g., + and - are left-associative operators in mathematics; * and / have higher precedence than + and - . • Consider the grammar G1: Expr -> Expr + Term | Expr – Term | Term Term -> Term * Factor | Term / Factor | Term % Factor | Factor Factor -> Primary ** Factor | Primary Primary -> 0 | ... | 9 | ( Expr )
E E * E E + E E E E id3 + E E * id1 id2 id1 id2 id3 E E + E F F F * F id1 id2 id3 Precedence E E + E E E - E E E * E E E / E E id E E + E E E - E E F F F * F F F / F Fid
Precedence Associativity Operators 3 right ** 2 left * / % \ 1 left + - Note: These relationships are shown by the structure of the parse tree: highest precedence at the bottom, and left-associativity on the left at each level. Associativity and Precedence for Grammar
Associativity of Operators • Operator associativity can also be indicated by a grammar <expr> -> <expr> + <expr> | const (ambiguous) <expr> -> <expr> + const | const (unambiguous) <expr> <expr> <expr> + const <expr> + const const
E F X / F id4 * X F X id3 ( E ) + E F F X X id2 id1 Associativity • Left-associative operators E E + F | E - F | F F F * X | F / X | X X ( E ) | id (id1 + id2) * id3 / id4 = (((id1 + id2) * id3) / id4)
Ambiguity in Grammars • A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees
With which ‘if’ does the following ‘else’ associate? if (x < 0) if (y < 0) y = y - 1; else y = 0; Answer: either one! Example of Dangling Else
An Ambiguous Expression Grammar <expr> <expr> <op> <expr> | const <op> / | - <expr> <expr> <expr> <op> <expr> <expr> <op> <op> <expr> <expr> <op> <expr> <expr> <op> <expr> const - const / const const - const / const
Solving the dangling else ambiguity • Algol 60, C, C++: associate each else with closest if; use {} or begin…end to override. • Algol 68, Modula, Ada: use explicit delimiter to end every conditional (e.g., if…fi) • Java: rewrite the grammar to limit what can appear in a conditional: IfThenStatement -> if ( Expression ) Statement IfThenElseStatement -> if ( Expression )StatementNoShortIf else Statement The category StatementNoShortIf includes all except IfThenStatement.
Ambiguity in Expressions • Which operation is to be done first? • solved by precedence • An operator with higher precedence is done before one with lower precedence. • An operator with higher precedence is placed in a rule (logically) further from the start symbol. • solved by associativity • If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). • Right-associated : W + (X + (Y + Z)) • Left-associated : ((W + X) + Y) + Z
An Unambiguous Expression Grammar • If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity <expr> <expr> - <term> | <term> <term> <term> / const| const <expr> <expr> - <term> <term> <term> / const const const
Extended BNF • Optional parts are placed in brackets [ ] <proc_call> -> ident [(<expr_list>)] • Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> → <term>(+|-) const • Repetitions (0 or more) are placed inside braces { } <ident> → letter {letter|digit}
BNF and EBNF • BNF <expr> <expr> + <term> | <expr> - <term> | <term> <term> <term> * <factor> | <term> / <factor> | <factor> • EBNF <expr> <term> {(+ | -) <term>} <term> <factor> {(* | /) <factor>}
Finite State Automata • Set of states • Usually represented as graph nodes • Input alphabet + unique “end of program” symbol • State transition function • Usually represented as directed graph edges (arcs) • Automaton is deterministic if, for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol • Unique start state • One or more final (accepting) states
Syntax Diagram for Expressions with Addition