Syntax and Semantics

Structure of programming languages Syntax and Semantics

Introduction • Syntax: the form or structure of the expressions, statements, and program units • Semantics: the meaning of the expressions, statements, and program units

Introduction • Syntax and semantics provide a language’s definition • Users of a language definition • Other language designers • Implementers • Programmers (the users of the language)

Syntax of a Languages • Motivation for using a subset of C: Grammar Language (pages) Reference Pascal 5 Jensen & Wirth C 6 Kernighan & Richie C++ 22 Stroustrup Java 14 Gosling, et. al. • The grammar fits on one page, so it’s a far better tool for studying language design.

C Grammar: Statements Program int main ( ){ Declarations Statements } Declarations  {Declaration } Declaration  Type Identifier [ [ Integer ] ]{ , Identifier [ [ Integer ] ] }; Type  int | bool | float | char Statements  {Statement } Statement  ; | Block | Assignment | IfStatement | WhileStatement Block  { Statements } Assignment  Identifier [ [ Expression ] ]=Expression ; IfStatement  if (Expression ) Statement [else Statement ] WhileStatement  while (Expression ) Statement

C Grammar: Expressions Expression  Conjunction {|| Conjunction } Conjunction  Equality {&& Equality } Equality  Relation [ EquOp Relation ] EquOp  == | != Relation  Addition [ RelOp Addition ] RelOp  < | <= | > | >= Addition  Term { AddOp Term } AddOp  + | - Term  Factor { MulOp Factor } MulOp  * | / | % Factor  [ UnaryOp ]Primary UnaryOp  - | ! Primary  Identifier [[ Expression ]] | Literal | ( Expression ) | Type ( Expression )

Describing Syntax: Terminology • A sentenceis a string of characters over some alphabet • A language is a set of sentences • Alexemeis the lowest level syntactic unit of a language (e.g., *, sum, begin) • A tokenis a category of lexemes (e.g., identifier)

Formal Definition of Languages • Recognizers • A recognition device reads input strings of the language and decides whether the input strings belong to the language • Example: syntax analysis part of a compiler • Generators • A device that generates sentences of a language • One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator

Lexical Analysis • The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. • In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( + x 1 ) )‏ • Similarly, in Java, we tokenizeSystem.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;

Lexical Analysis • Lexical analyzer splits it into tokens • Token = sequence of characters (symbolic name) representing a single terminal symbol • Identifiers: myVariable … • Literals: 123 5.67 true … • Keywords: char sizeof … • Operators: + - * / … • Punctuation: ; , } { … • Discards whitespace and comments

Examples of Tokens in C

Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens

program confusing; const true = false; begin if (a<b) = true then f(a) else … Redefining Identifiers can be dangerous

Parsing Process • Call the scanner to get tokens • Build a parse tree from the stream of tokens • A parse tree shows the syntactic structure of the source program. • Add information about identifiers in the symbol table • Report error, when found, and recover from the error

Describing Syntax • Backus-Naur Form and Context-Free Grammars (BNF) • Most widely known method for describing programming language syntax • Extended BNF • Improves readability and writability of BNF

Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe Algol 58 • BNF is equivalent to context-free grammars • BNF is a meta-language used to describe another language • In BNF, abstractions are used to represent classes of syntactic structures--they act like syntactic variables (also called non-terminal symbols)

Consider the grammar: binaryDigit 0 binaryDigit 1 or equivalently: binaryDigit 0 | 1 Here, | is a metacharacter that separates alternatives. Example: Binary Digits

Example: Decimal Numbers • Grammar for unsigned decimal integers • Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 • Non-terminal symbols: Digit, Integer • Production rules: • Integer  Digit | Integer Digit • Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Start symbol: Integer • Can derive any unsigned integer using this grammar • Language = set of all unsigned decimal integers

Derivation of 352 as an Integer • Production rules: • Integer  Digit | Integer Digit • Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Integer  Integer Digit  Integer 2  Integer Digit 2  Integer 5 2  Digit 5 2  3 5 2 • Rightmost derivation At each step, the rightmost non-terminal is replaced

Parse Tree for 352 as an Integer

Parse of the String 5-4+3

BNF Fundamentals • Non-terminals: BNF abstractions • Terminals: lexemes and tokens • Grammar: a collection of rules • Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> <if_stmt> → if <logic_expr> then <stmt>

Regular Expressions • x character x • \x escaped character, e.g., \n • { name } reference to a name • M | N M or N • M N M followed by N • M* 0 or more occurrences of M • M+ 1 or more occurrences of M • [x1 … xn] One of x1 … xn • Example: [aeiou] – vowels, [0-9] - digits

Derivation • A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols), e.g., <program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const

Parse Tree • A hierarchical representation of a derivation <program> <stmts> <stmt> <var> = <expr> a <term> + <term> <var> const b

CFG For Floating Point Numbers ::= stands for production rule; <…> are non-terminals; | represents alternatives for the right-hand side of a production rule Sample parse tree:

Associativity and Precedence • A grammar can be used to define associativity and precedence among the operators in an expression. E.g., + and - are left-associative operators in mathematics; * and / have higher precedence than + and - . • Consider the grammar G1: Expr -> Expr + Term | Expr – Term | Term Term -> Term * Factor | Term / Factor | Term % Factor | Factor Factor -> Primary ** Factor | Primary Primary -> 0 | ... | 9 | ( Expr )

E E * E E + E E E E id3 + E E * id1 id2 id1 id2 id3 E E + E F F F * F id1 id2 id3 Precedence E E + E E  E - E E  E * E E  E / E E  id E E + E E  E - E E  F F F * F F F / F Fid

Precedence Associativity Operators 3 right ** 2 left * / % \ 1 left + - Note: These relationships are shown by the structure of the parse tree: highest precedence at the bottom, and left-associativity on the left at each level. Associativity and Precedence for Grammar

Associativity of Operators • Operator associativity can also be indicated by a grammar <expr> -> <expr> + <expr> | const (ambiguous) <expr> -> <expr> + const | const (unambiguous) <expr> <expr> <expr> + const <expr> + const const

E F X / F id4 * X F X id3 ( E ) + E F F X X id2 id1 Associativity • Left-associative operators E E + F | E - F | F F F * X | F / X | X X ( E ) | id (id1 + id2) * id3 / id4 = (((id1 + id2) * id3) / id4)

Ambiguity in Grammars • A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees

With which ‘if’ does the following ‘else’ associate? if (x < 0) if (y < 0) y = y - 1; else y = 0; Answer: either one! Example of Dangling Else

An Ambiguous Expression Grammar <expr>  <expr> <op> <expr> | const <op>  / | - <expr> <expr> <expr> <op> <expr> <expr> <op> <op> <expr> <expr> <op> <expr> <expr> <op> <expr> const - const / const const - const / const

The Dangling Else Ambiguity

Solving the dangling else ambiguity • Algol 60, C, C++: associate each else with closest if; use {} or begin…end to override. • Algol 68, Modula, Ada: use explicit delimiter to end every conditional (e.g., if…fi) • Java: rewrite the grammar to limit what can appear in a conditional: IfThenStatement -> if ( Expression ) Statement IfThenElseStatement -> if ( Expression )StatementNoShortIf else Statement The category StatementNoShortIf includes all except IfThenStatement.

Ambiguity in Expressions • Which operation is to be done first? • solved by precedence • An operator with higher precedence is done before one with lower precedence. • An operator with higher precedence is placed in a rule (logically) further from the start symbol. • solved by associativity • If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). • Right-associated : W + (X + (Y + Z)) • Left-associated : ((W + X) + Y) + Z

An Unambiguous Expression Grammar • If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity <expr>  <expr> - <term> | <term> <term>  <term> / const| const <expr> <expr> - <term> <term> <term> / const const const

Extended BNF • Optional parts are placed in brackets [ ] <proc_call> -> ident [(<expr_list>)] • Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> → <term>(+|-) const • Repetitions (0 or more) are placed inside braces { } <ident> → letter {letter|digit}

BNF and EBNF • BNF <expr>  <expr> + <term> | <expr> - <term> | <term> <term>  <term> * <factor> | <term> / <factor> | <factor> • EBNF <expr>  <term> {(+ | -) <term>} <term>  <factor> {(* | /) <factor>}

Finite State Automata • Set of states • Usually represented as graph nodes • Input alphabet + unique “end of program” symbol • State transition function • Usually represented as directed graph edges (arcs) • Automaton is deterministic if, for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol • Unique start state • One or more final (accepting) states

DFA for C Identifiers

Syntax Diagram for Expressions with Addition

Syntax and Semantics