Exploring the syntax and semantics of programming languages, defining languages and sentences, lexemes, tokens, and language recognition and generation methods easily. Informative guide for beginners.
Chapter 3 Describing Syntax and Semantics
Syntax and Semantics • Syntax of a programming language: the form of its expressions, statements, and program units • Semantics: the meaning of the expressions, statements, and program units • Syntax and semantics provide a language’s definition
An Example of Syntax and Semantics • The syntax of a Java while statement: while (<boolean_expr>) <statement> • The semantics of the above statement: when the current value of the Boolean expression is true, the embedded statement is executed, and control then implicitly returns to the Boolean expression to repeat the process; otherwise, control continues after the while construct.
Description Easiness • Describing syntax is easier than describing semantics • partly because a concise and universally accepted notation is available for syntax description • but none has yet been developed for semantics
Definitions of Languages and Sentences • A language, whether natural (such as English) or artificial (such as Java), is a set of strings of characters from some alphabet. • The strings of a language are called sentences or statements.
Syntax Rules of a Language • The syntax rules of a language specify which strings of characters from the language’s alphabet are in the language.
Definition of Literal [Word Origin & History] • A letter or symbol that stands for itself as opposed to a feature or function in a programming language: • Example: • $ can be a symbol that refers to the end of a line, but as a literal, it is a dollar sign.
Definitions of Lexemes • The lexemes of a programming language include its • numeric literals • operators • and special words, among others. • (e.g., *, sum, begin) • A lexeme is the lowest-level syntactic unit of a language: • for simplicity’s sake, formal descriptions of the syntax of programming languages often do not include descriptions of lexemes • However, the description of lexemes can be given by a lexical specification, which is usually separate from the syntactic description of the language.
An Informal Definition of a Program • A program can be thought of as a string of lexemes. (P.S.: Ignore the definition in the textbook.)
Lexeme Group • Lexemes are partitioned into groups. • For example, in a programming language, the names of variables, methods, classes, and so forth form a group called identifier. • Each of these groups is represented by a name, or token.
Definitions of Tokens • A token is a category of lexemes (e.g., identifier) • a token may have only a single lexeme • For example, the token for the arithmetic operator symbol +, which may have the name plus_op, has just one possible lexeme. • a lexeme of a token can also be called an instance of the token
Example of Tokens • Identifier is a token that can have lexemes, or instances, such as sum and total.
Lexemes and Tokens [csusb], [Lee] • A lexeme is a string of characters in a language that is treated as a single unit in the syntax and semantics. • For example, identifiers, numbers, and operators are often lexemes. • In a programming language, there is a very large number of lexemes, perhaps even an infinite number; however, there is only a small number of tokens.
Examples of Lexemes and Tokens [Lee] • while (y <= t) y = y - 3 ; will be represented by the set of pairs: [table of pairs not reproduced; one entry is annotated as a token with only one lexeme]
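The pairing of lexemes with tokens can be sketched in Python. This is only an illustration: the token names (while_kw, le_op, and so on) and the regular expressions are assumptions made for this sketch, not the names used in the original table.

```python
import re

# Hypothetical token names and patterns; order matters ("<=" before "=").
TOKEN_SPEC = [
    ("while_kw",    r"while\b"),
    ("identifier",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("int_literal", r"\d+"),
    ("le_op",       r"<="),
    ("assign_op",   r"="),
    ("minus_op",    r"-"),
    ("lparen",      r"\("),
    ("rparen",      r"\)"),
    ("semicolon",   r";"),
    ("skip",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return the list of (lexeme, token) pairs for text."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "skip":          # whitespace separates lexemes, it is not one
            pairs.append((m.group(), m.lastgroup))
        pos = m.end()
    return pairs

pairs = tokenize("while (y <= t) y = y - 3 ;")
for lexeme, token in pairs:
    print(f"{lexeme:6} {token}")
```

In this sketch y and t are two lexemes of the same token (identifier), while a token such as semicolon has exactly one possible lexeme.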
Summary • [Figure: diagram relating language, program, character string, token, lexeme, and character]
Ways to Define a Language • In general, languages can be formally defined in two distinct ways: by recognition and by generation. • P.S.: Neither approach, by itself, provides a definition that is practical for people trying to learn or even use a programming language.
Language Recognizer • a recognition device that • reads strings of characters from the alphabet of a language and • decides whether an input string was or was not in the language • is not used to enumerate all of the sentences of a language • Example: • syntax analysis part of a compiler is a recognizer for the language the compiler translates
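A recognizer can be sketched as a function that answers yes or no for each input string. The toy language used here (n copies of a followed by n copies of b) is an assumption chosen for brevity; a compiler's syntax analyzer plays the same role for a full programming language.

```python
def recognizes(s):
    """Return True if s is n copies of 'a' followed by n copies of 'b' (n >= 0)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

# The recognizer only decides membership; it never enumerates the language.
for candidate in ["", "ab", "aabb", "aab", "abab", "ba"]:
    print(repr(candidate), recognizes(candidate))
```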
Language Generators • a device that generates the sentences of a language. • We can think of the generator as having a button that produces a sentence of the language every time it is pushed • However, the particular sentence that is produced by a generator when its button is pushed is unpredictable • One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator • People learn a language from examples of its sentences
Grammars [wiki] • In computer science and linguistics, a formal grammar, or sometimes simply grammar, is a precise description of a formal language — that is, of a set of strings. • Commonly used to describe the syntax of programming languages
Formal Grammars [wiki] • A formal grammar, or sometimes simply grammar, consists of: • a finite set of terminal symbols; • a finite set of nonterminal symbols; • a finite set of production rules with a left- and a right-hand side consisting of a sequence of the above symbols • a start symbol.
Grammar Example [wiki] • Nonterminals are usually represented by uppercase letters • terminals by lowercase letters • the start symbol by S. • For example, the grammar with • terminals {a,b}, • nonterminals {S,A,B}, • production rules • S → ABS • S → ε (where ε is the empty string) • BA → AB • BS → b • Bb → bb • Ab → ab • Aa → aa • start symbol S, • defines the language of all words of the form a^n b^n (i.e. n copies of a followed by n copies of b).
Formal Grammars and Formal Languages [wiki] • A formal grammar defines (or generates) a formal language, which is a (possibly infinite) set of sequences of terminal symbols that may be constructed by applying production rules to a sequence of symbols which initially contains just the start symbol.
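As a rough check of the example grammar, the word aabb can be derived by plain string rewriting. The helper names (apply_rule, derive) and the particular order of rule applications are choices made for this sketch; other orders reach the same word.

```python
# Production rules of the example grammar, as (left-hand side, right-hand side).
RULES = {
    ("S", "ABS"), ("S", ""),      # "" stands for the empty string ε
    ("BA", "AB"), ("BS", "b"),
    ("Bb", "bb"), ("Ab", "ab"), ("Aa", "aa"),
}

def apply_rule(form, lhs, rhs):
    """Rewrite the leftmost occurrence of lhs in form with rhs."""
    i = form.find(lhs)
    if i < 0:
        raise ValueError(f"{lhs!r} does not occur in {form!r}")
    return form[:i] + rhs + form[i + len(lhs):]

def derive(steps, start="S"):
    """Apply the given rules in order and return every intermediate string."""
    forms = [start]
    for lhs, rhs in steps:
        if (lhs, rhs) not in RULES:
            raise ValueError(f"{lhs} -> {rhs} is not a rule of the grammar")
        forms.append(apply_rule(forms[-1], lhs, rhs))
    return forms

forms = derive([
    ("S", "ABS"), ("S", "ABS"), ("BA", "AB"), ("BS", "b"),
    ("Bb", "bb"), ("Ab", "ab"), ("Aa", "aa"),
])
print(" => ".join(forms))
```

Each rule here may have more than one symbol on its left-hand side, which is why this grammar is written with rules such as BA → AB rather than in context-free form.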
Grammar Classes [wiki] • In the mid-1950s, according to the format of the production rules, Chomsky described four classes of generative devices or grammars that define four classes of languages: • recursively enumerable • context-sensitive • context-free • regular
Application of Context-free and Regular Grammars • The tokens of programming languages can be described by regular grammars. • Whole programming languages, with minor exceptions, can be described by context-free grammars.
Review of Context-Free Grammars • Context-Free Grammars • Developed by Noam Chomsky in the mid-1950s • Language generators, meant to describe the syntax of natural languages • Define a class of languages called context-free languages
Context-Free Grammar [McCallum] • G = <T, N, S, R> • T is a set of terminals • N is a set of non-terminals • S is a start symbol (one of the nonterminals) • R is a set of production rules of the form X → γ where X is a nonterminal and γ is a sequence of terminals and nonterminals (may be empty). • A grammar G generates a language L.
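The four-part definition G = <T, N, S, R> maps directly onto a small data structure. The sketch below is one possible encoding, not a standard API; the names Grammar and is_well_formed, and the sample grammar S → a S b | ε, are assumptions made for illustration.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S, must be one of the nonterminals
    rules: tuple              # R: pairs (X, gamma); gamma is a tuple of symbols, possibly empty

def is_well_formed(g):
    """Check the side conditions stated in the definition of a context-free grammar."""
    symbols = g.terminals | g.nonterminals
    return (
        g.terminals.isdisjoint(g.nonterminals)
        and g.start in g.nonterminals
        and all(
            lhs in g.nonterminals and all(sym in symbols for sym in rhs)
            for lhs, rhs in g.rules
        )
    )

# Hypothetical sample grammar: S -> a S b | ε
G = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    start="S",
    rules=(("S", ("a", "S", "b")), ("S", ())),
)
print(is_well_formed(G))
```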
Terminology - metalanguage • A metalanguage is a language that is used to describe another language.
Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe Algol 58 • Most widely known method for describing programming language syntax • BNF is equivalent to context-free grammars • BNF is a metalanguage used to describe another language
Extended BNF • Extended BNF • Improves readability and writability of BNF
BNF Abstraction • BNF uses abstractions for syntactic structures • For example, • A simple Java assignment statement might be represented by the abstraction <assign>. • (pointed brackets are often used to delimit names of abstractions) • Tokens are also abstractions.
Synonyms • Nonterminal symbols (nonterminals): BNF abstractions • Terminals: lexemes/alphabet of a language
BNF Rules • A rule, also called a production, has a left-hand side (LHS) and a right-hand side (RHS), and consists of terminal and nonterminal symbols. • An abstraction is defined by a rule. • Examples of BNF rules: • <ident_list> → identifier | identifier , <ident_list> • <if_stmt> → if <logic_expr> then <stmt> • A grammar contains a finite nonempty set of rules.
Multiple Definitions of a Nonterminal • Nonterminal symbols can have two or more distinct definitions, representing two or more possible syntactic forms in the language. • Multiple definitions can be written as a single rule, with different definitions separated by the symbol |.
Example of Multiple Definitions of a Nonterminal • For example, an Ada if statement can be described with the rules: • <if_stmt> → if <logic_expr> then <stmt> • <if_stmt> → if <logic_expr> then <stmt> else <stmt> • Or with the rule • <if_stmt> → if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt>
A Rule Example • <assign> → <var> = <expression> • The above rule specifies that the abstraction <assign> is defined as an instance of the abstraction <var>, followed by the lexeme =, followed by an instance of the abstraction <expression>. • One example whose syntactic structure is described by the rule is: total = subtotal1 + subtotal2
Expressive Capability of BNF • Although BNF is simple, it is sufficiently powerful to describe nearly all of the syntax of programming languages. • BNF can describe • Lists of similar constructs • The order in which different constructs must appear • Nested structures to any depth • Operator precedence (by implication) • Operator associativity (by implication)
Variable-length Lists and Recursive Rules • Variable-length lists in mathematics are often written using an ellipsis (…). • 1,2, … is an example. • BNF does not include the ellipsis; it uses recursive rules as an alternative. • A rule is recursive if its LHS appears in its RHS.
Example of a Recursive Rule • Syntactic lists are described using recursion • <ident_list> → ident | ident, <ident_list> • The above rule defines <ident_list> as either a single token (identifier) or an identifier followed by a comma followed by another instance of <ident_list>.
Grammar and Derivation • A grammar is a generative device for defining languages. • The sentences of a language are created through derivations. • A derivation is a repeated application of rules, starting with a special nonterminal of the grammar called the start symbol and ending with a sentence (all terminal symbols)
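The recursive rule translates almost word for word into a recursive function. In this sketch the input is assumed to be already split into lexemes, and Python's str.isidentifier is used as a stand-in for the language's own definition of an identifier.

```python
def is_ident_list(tokens):
    """<ident_list> -> ident | ident , <ident_list>"""
    if not tokens or not tokens[0].isidentifier():
        return False                      # every alternative starts with an identifier
    if len(tokens) == 1:
        return True                       # first alternative: a single identifier
    # second alternative: identifier, comma, then another <ident_list>
    return tokens[1] == "," and is_ident_list(tokens[2:])

print(is_ident_list(["sum"]))                        # single identifier
print(is_ident_list(["sum", ",", "total", ",", "x"]))  # list of three
print(is_ident_list(["sum", ","]))                   # dangling comma
```

The recursion in the function mirrors the recursion in the rule: the function calls itself exactly where <ident_list> appears on the right-hand side.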
An Example of Start Symbols • In a grammar for a complete programming language, the start symbol represents a complete program and is usually named <program>.
Example 3.1 - An Example Grammar • What follows is a grammar for a small language. P.S.: <program> is the start symbol.
Language Defined by the Grammar in Example 3.1 • The language in Example 3.1 has only one statement form: assignment. • A program consists of the special word begin, followed by a list of statements separated by semicolons, followed by the special word end. • An expression is either a single variable, or two variables separated by either a + or – operator. • The only variable names in this language are A, B, and C.
An Example Derivation
<program> => begin <stmt_list> end
=> begin <stmt> ; <stmt_list> end
=> begin <var> = <expression> ; <stmt_list> end
=> begin A = <expression> ; <stmt_list> end
=> begin A = <var> + <var> ; <stmt_list> end
=> begin A = B + <var> ; <stmt_list> end
=> begin A = B + C ; <stmt_list> end
=> begin A = B + C ; <stmt> end
=> begin A = B + C ; <var> = <expression> end
=> begin A = B + C ; B = <expression> end
=> begin A = B + C ; B = <var> end
=> begin A = B + C ; B = C end
• The symbol => is read "derives".
Sentential Form • In the previous slide, each successive string in the sequence is derived from the previous string by replacing one of the nonterminals with one of that nonterminal’s definitions. • Every string of symbols in the derivation is a sentential form.
Order of Derivation • A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded. • The derivation continues until the sentential form contains no nonterminals. • The sentential form, consisting of only terminals, or lexemes, is the generated sentence. • A rightmost derivation is one in which the rightmost nonterminal in each sentential form is the one that is expanded. • A derivation may be neither leftmost nor rightmost. • However, the derivation order has no effect on the language generated by a grammar.
Using the Grammar in Example 3.1 to Generate Different Sentences • By choosing alternative RHSs of rules with which to replace nonterminals in the derivation, different sentences in the language can be generated. • By exhaustively choosing all combinations of choices, the entire language can be generated. • This language, like most others, is infinite, so one cannot generate all the sentences in the language in finite time.
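Choosing alternative RHSs at random gives a simple sentence generator. The grammar table below is a reconstruction of Example 3.1 from its prose description (begin/end, statements separated by semicolons, variables A, B, C, operators + and -), so treat the exact rule set as an assumption; the function name generate is also just for this sketch.

```python
import random

# Assumed reconstruction of the Example 3.1 grammar: nonterminal -> list of alternative RHSs.
GRAMMAR = {
    "<program>":    [["begin", "<stmt_list>", "end"]],
    "<stmt_list>":  [["<stmt>"], ["<stmt>", ";", "<stmt_list>"]],
    "<stmt>":       [["<var>", "=", "<expression>"]],
    "<var>":        [["A"], ["B"], ["C"]],
    "<expression>": [["<var>", "+", "<var>"], ["<var>", "-", "<var>"], ["<var>"]],
}

def generate(symbol="<program>", rng=random):
    """Expand symbol into a list of terminals, picking one RHS at random each time."""
    if symbol not in GRAMMAR:
        return [symbol]                   # terminals stand for themselves
    lexemes = []
    for sym in rng.choice(GRAMMAR[symbol]):
        lexemes.extend(generate(sym, rng))
    return lexemes

# Each call is one push of the generator's "button".
for seed in range(3):
    print(" ".join(generate(rng=random.Random(seed))))
```

Because <stmt_list> is recursive, there is no upper bound on sentence length, which is why the language is infinite even though the grammar is finite.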
Example 3.2 • This grammar describes assignment statements whose right sides are arithmetic expressions with multiplication and addition operators and parentheses.
The Leftmost Derivation for A=B*(A+C)
<assign> => <id> = <expr>
=> A = <expr>
=> A = <id> * <expr>
=> A = B * <expr>
=> A = B * ( <expr> )
=> A = B * ( <id> + <expr> )
=> A = B * ( A + <expr> )
=> A = B * ( A + <id> )
=> A = B * ( A + C )
• The final string, A = B * ( A + C ), is also a sentential form.
Parse Trees • One of the most attractive features of grammars is that they naturally describe the hierarchical syntactic structure of the sentences of the languages they define. • These hierarchical structures are called parse trees.
Figure 3.1 - a Parse Tree for A=B*(A+C) • The figure on the right is a parse tree. • This parse tree shows the structure of the assignment statement derived previously.
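A parse tree for this statement can be built with a small recursive-descent parser. The rules used below (<assign> → <id> = <expr>; <id> → A | B | C; <expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>) are inferred from the leftmost derivation, so they are an assumption about Example 3.2; the nested-tuple tree format and function names are choices made for this sketch.

```python
def parse_id(tokens, pos):
    """<id> -> A | B | C"""
    if pos < len(tokens) and tokens[pos] in ("A", "B", "C"):
        return ("<id>", tokens[pos]), pos + 1
    raise SyntaxError(f"expected A, B, or C at position {pos}")

def parse_expr(tokens, pos):
    """<expr> -> <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>"""
    if pos < len(tokens) and tokens[pos] == "(":
        inner, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return ("<expr>", "(", inner, ")"), pos + 1
    node_id, pos = parse_id(tokens, pos)
    if pos < len(tokens) and tokens[pos] in ("+", "*"):
        op = tokens[pos]
        right, pos = parse_expr(tokens, pos + 1)
        return ("<expr>", node_id, op, right), pos
    return ("<expr>", node_id), pos

def parse_assign(tokens):
    """<assign> -> <id> = <expr>"""
    node_id, pos = parse_id(tokens, 0)
    if pos >= len(tokens) or tokens[pos] != "=":
        raise SyntaxError("expected '='")
    node_expr, pos = parse_expr(tokens, pos + 1)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return ("<assign>", node_id, "=", node_expr)

def leaves(node):
    """Read the terminals off the tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

tree = parse_assign("A = B * ( A + C )".split())
print(tree)
print(" ".join(leaves(tree)))
```

Every internal node of the tree is labeled with a nonterminal and every leaf with a terminal, so reading the leaves from left to right gives back the original sentence.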