CS 3304 Comparative Languages. Lecture 3: Scanning 24 January 2012. Introduction. Syntax: the form or structure of the expressions, statements, and program units. Semantics: the meaning of the expressions, statements, and program units.
CS 3304 Comparative Languages • Lecture 3: Scanning • 24 January 2012
Introduction • Syntax: the form or structure of the expressions, statements, and program units. • Semantics: the meaning of the expressions, statements, and program units. • Syntax and semantics provide a language’s definition. • Users of a language definition: • Other language designers. • Implementers. • Programmers (the users of the language). • Basic terminology: • A sentence is a string of characters over some alphabet. • A language is a set of sentences. • A lexeme is the lowest level syntactic unit of a language. • A token is a category of lexemes (e.g., identifier).
Defining Languages • Recognizers: • A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language. • Example: syntax analysis part of a compiler (scanning). • Generators: • A device that generates sentences of a language. • One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator.
Regular Expressions • A regular expression is one of the following: • A character. • The empty string, denoted by ε. • Two regular expressions concatenated. • Two regular expressions separated by | (i.e., or). • A regular expression followed by the Kleene star (concatenation of zero or more strings generated by that expression). • Numerical literals in Pascal may be generated by the following:
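The slide's own definition of Pascal numeric literals is not reproduced in this transcript. As a rough sketch only, assuming the common shape "digits, optional fraction, optional exponent", the same building blocks can be written with Python's `re` module. The names `DIGIT`, `INTEGER`, `EXPONENT`, `NUMBER`, and `is_number` are illustrative, not taken from the lecture.

```python
import re

# Assumed shape of an unsigned numeric literal (not the slide's exact
# definition): digits, an optional fraction, an optional exponent.
# Only concatenation, alternation, and repetition are used; '?' is
# shorthand for "this or the empty string".
DIGIT = r"[0-9]"
INTEGER = rf"{DIGIT}{DIGIT}*"               # one digit, then zero or more
EXPONENT = rf"(?:[eE][+-]?{INTEGER})"       # e or E, optional sign, digits
NUMBER = re.compile(rf"{INTEGER}(?:\.{INTEGER})?{EXPONENT}?")


def is_number(text: str) -> bool:
    """Return True if the whole string is a numeric literal."""
    return NUMBER.fullmatch(text) is not None
```

Under these assumptions, `is_number("6.02e23")` holds while `is_number("3.")` does not, because the fraction must contain at least one digit.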
Context-Free Grammars • Context-Free Grammars: • Developed by Noam Chomsky in the mid-1950s. • Language generators, meant to describe the syntax of natural languages. • Define a class of languages called context-free languages. • Backus-Naur Form (1959): • Invented by John Backus to describe Algol 58. • BNF is equivalent to context-free grammars (CFGs). • A CFG consists of: • A set of terminals T. • A set of non-terminals N. • A start symbol S (a non-terminal). • A set of productions.
BNF Fundamentals • In BNF, abstractions are used to represent classes of syntactic structures: they act like syntactic variables (also called nonterminal symbols, or just nonterminals). • Terminals are lexemes or tokens. • A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals. • Nonterminals are often italic or enclosed in angle brackets. • Examples of BNF rules: • <ident_list> → identifier | identifier, <ident_list> • <if_stmt> → if <logic_expr> then <stmt> • Grammar: a finite non-empty set of rules. • A start symbol is a special element of the nonterminals of a grammar.
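To illustrate a grammar acting as a generator, the `<ident_list>` rule can be stored as data and expanded mechanically. This is a minimal sketch; the dictionary layout, the `sentences` function, and the depth limit are assumptions made for illustration.

```python
from itertools import product

# The <ident_list> rule as data: each nonterminal maps to a list of
# alternative right-hand sides. Symbols without an entry are terminals.
GRAMMAR = {
    "<ident_list>": [
        ["identifier"],
        ["identifier", ",", "<ident_list>"],
    ],
}


def sentences(symbol, depth):
    """Yield token lists derivable from symbol with at most `depth`
    nested expansions of nonterminals."""
    if symbol not in GRAMMAR:       # a terminal derives only itself
        yield [symbol]
        return
    if depth == 0:                  # no budget left to expand a nonterminal
        return
    for rhs in GRAMMAR[symbol]:
        # Expand every symbol of the right-hand side, then combine the
        # pieces in order.
        parts = [list(sentences(s, depth - 1)) for s in rhs]
        for combo in product(*parts):
            yield [tok for piece in combo for tok in piece]


for sentence in sentences("<ident_list>", 3):
    print(" ".join(sentence))
```

With a depth of 3 the loop prints lists of one, two, and three identifiers; the depth limit is needed because the rule is recursive and the language is infinite.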
Context-Free Grammar Example • Expression grammar with precedence and associativity:
Parse Tree Example 1 • Parse tree for expression grammar (with precedence) for 3 + 4 * 5
Parse Tree Example 2 • Parse tree for expression grammar (with left associativity) for 10 - 4 - 3
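The trees on these slides are not reproduced in this transcript. The sketch below shows how precedence and left associativity determine tree shape for the two example inputs. It is a hand-written recursive-descent parser, not the lecture's method: the left-recursive rules are replaced by loops, and the tuple tree format and all function names are assumptions.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def tokenize(text):
    """Split into integer and operator tokens (other characters are ignored)."""
    return re.findall(r"[0-9]+|[-+*/()]", text)


def parse(tokens):
    """Build a nested tuple tree (op, left, right) from the token list."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def factor():
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok == "-":
            return ("neg", factor())
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    def term():
        node = factor()
        while peek() in ("*", "/"):      # binds tighter than + and -
            node = (advance(), node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):      # the loop nests to the left
            node = (advance(), node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def evaluate(node):
    """Evaluate a tree produced by parse."""
    if isinstance(node, int):
        return node
    if node[0] == "neg":
        return -evaluate(node[1])
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))
```

For `3 + 4 * 5` the multiplication ends up lower in the tree, `('+', 3, ('*', 4, 5))`, so it is evaluated first. For `10 - 4 - 3` the tree leans left, `('-', ('-', 10, 4), 3)`, which evaluates to 3 rather than 9.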
Using ANTLR • Syntax similar to CFG. • Non-terminal symbols: lower case letters. • Terminal symbols: upper case letters. • An example of rule syntax (parsing): expr: ID | NUMBER | '-' expr | '(' expr ')' | expr OP expr; • An example of rules used for tokens (scanning): OP: '+' | '-' | '*' | '/';
ANTLR Grammar for Example 2.8
grammar Example2b;
expr: term | expr ADD_OP term;
term: factor | term MULT_OP factor;
factor: ID | NUMBER | '-' factor | '(' expr ')';
ADD_OP: '+' | '-';
MULT_OP: '*' | '/';
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: INTEGER | REAL;
fragment INTEGER: '0'..'9'+;
REAL: ('0'..'9')+ '.' ('0'..'9')* EXPONENT? | '.' ('0'..'9')+ EXPONENT? | ('0'..'9')+ EXPONENT;
EXPONENT: ('e'|'E') ('+'|'-')? ('0'..'9')+;
Scanner Responsibilities • Tokenizing source. • Removing comments. • (Often) dealing with pragmas (i.e., significant comments). • Saving text of identifiers, numbers, strings. • Saving source locations (file, line, column) for error messages.
Scanning Example I • Suppose we are building an ad-hoc (hand-written) scanner for Pascal: • We read the characters one at a time with look-ahead. • If it is one of the one-character tokens: { ( ) [ ] < > , ; = + - etc } we announce that token. • If it is a ., we look at the next character: • If that is a dot, we announce .. • Otherwise, we announce . and reuse the look-ahead.
Scanning Example II • If it is a <, we look at the next character: • If that is an =, we announce <=. • Otherwise, we announce < and reuse the look-ahead, etc. • If it is a letter, we keep reading letters and digits and maybe underscores until we can't anymore: • Then we check to see if it is a reserved word. • If it is a digit, we keep reading until we find a non-digit: • If that is not a ., we announce an integer. • Otherwise, we keep looking for a real number. • If the character after the . is not a digit, we announce an integer and reuse the . and the look-ahead.
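The ad-hoc rules from the two scanning slides can be sketched as a single loop over the input. This is a simplified illustration, not a full Pascal scanner: the reserved-word set, the one-character token set, the (kind, lexeme) pair format, and the name `scan` are assumptions, and comments, strings, and exponents are left out.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
RESERVED = {"begin", "end", "if", "then", "else", "while", "do"}  # illustrative subset
ONE_CHAR = set("()[]>,;=+-*")                                     # illustrative subset


def scan(text):
    """Return a list of (kind, lexeme) pairs for the input string."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in ONE_CHAR:
            tokens.append((ch, ch))
            i += 1
        elif ch == ".":
            # One character of look-ahead decides between '.' and '..'.
            if i + 1 < n and text[i + 1] == ".":
                tokens.append(("..", ".."))
                i += 2
            else:
                tokens.append((".", "."))
                i += 1
        elif ch == "<":
            if i + 1 < n and text[i + 1] == "=":
                tokens.append(("<=", "<="))
                i += 2
            else:
                tokens.append(("<", "<"))
                i += 1
        elif ch in LETTERS:
            j = i
            while j < n and (text[j] in LETTERS or text[j] in DIGITS or text[j] == "_"):
                j += 1
            word = text[i:j]
            kind = "reserved" if word.lower() in RESERVED else "identifier"
            tokens.append((kind, word))
            i = j
        elif ch in DIGITS:
            j = i
            while j < n and text[j] in DIGITS:
                j += 1
            # A '.' continues the number only if a digit follows it;
            # otherwise the '.' is left for the next token.
            if j + 1 < n and text[j] == "." and text[j + 1] in DIGITS:
                j += 1
                while j < n and text[j] in DIGITS:
                    j += 1
                tokens.append(("real", text[i:j]))
            else:
                tokens.append(("integer", text[i:j]))
            i = j
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens
```

Here `scan("x1 <= 3..5")` yields an identifier, `<=`, the integer `3`, `..`, and the integer `5`, while `3.14` in `scan("if a < 3.14 then")` comes out as a single real.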
Deterministic Finite Automaton • Pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. • This is a deterministic finite automaton (DFA): • Lex, scangen, ANTLR, etc. build these things automatically from a set of regular expressions. • Specifically, they construct a machine that accepts the language.
The Longest Possible Token Rule • We scan over and over to get one token after another. • Nearly universal rule: always take the longest possible token from the input, thus: foobar is foobar and never f or foo or foob. • The rule means you return only when the next character can't be used to continue the current token: • The next character will generally be saved for the next token. • In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed: • In Pascal, for example, when you have a 3 and you see a dot: • Do you proceed (in hopes of getting 3.14)? or • Do you stop (in fear of getting 3..5)? • Regular expressions "generate" a regular language. • DFAs "recognize" a regular language.
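A minimal sketch of the longest-possible-token rule: try every token pattern at the current position and keep the longest match. The pattern list and the names `PATTERNS`, `longest_match`, and `munch` are illustrative assumptions. The extra look-ahead for `3.14` versus `3..5` is handled here by the regular-expression engine, which gives up on the real-number pattern when no digit follows the dot.

```python
import re

# Illustrative token patterns. List order only breaks ties between
# matches of equal length (the earlier pattern wins).
PATTERNS = [
    ("real", re.compile(r"[0-9]+\.[0-9]+")),
    ("integer", re.compile(r"[0-9]+")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("dotdot", re.compile(r"\.\.")),
    ("dot", re.compile(r"\.")),
]


def longest_match(text, start):
    """Return (kind, lexeme) for the longest pattern matching at start, or None."""
    best = None
    for kind, pattern in PATTERNS:
        m = pattern.match(text, start)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best


def munch(text):
    """Split text into tokens, always taking the longest possible one."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        found = longest_match(text, pos)
        if found is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append(found)
        pos += len(found[1])
    return tokens
```

With these patterns, `munch("foobar")` gives a single identifier, `munch("3.14")` a single real, and `munch("3..5")` the integer `3`, then `..`, then the integer `5`.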
Building Scanners • Scanners tend to be built in one of three ways: • Ad-hoc. • Semi-mechanical pure DFA (usually as nested case statements). • Table-driven DFA. • Ad-hoc generally yields the fastest, most compact code by doing lots of special-purpose things, though good automatically generated scanners come very close. • Writing a pure DFA as a set of nested case statements is a surprisingly useful programming technique (Figure 12.1): • It is often easier to use perl, awk, sed or similar tools. • Table-driven DFA is what lex and scangen produce: • lex (flex): C code. • scangen: numeric tables and a separate driver (Figure 2.12). • ANTLR: Java code.
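As a rough illustration of the table-driven style, the sketch below keeps the automaton in a dictionary and uses a small generic driver loop. It is hand-written for unsigned integers and reals only, not output from lex or scangen; the state names, `TABLE`, `ACCEPTING`, and `run_dfa` are assumptions.

```python
START, INT, AFTER_DOT, REAL = "start", "int", "after_dot", "real"

# Transition table: (state, character class) -> next state.
# A missing entry means the automaton cannot continue.
TABLE = {
    (START, "digit"): INT,
    (INT, "digit"): INT,
    (INT, "dot"): AFTER_DOT,
    (AFTER_DOT, "digit"): REAL,
    (REAL, "digit"): REAL,
}

# Accepting states and the token kind each one announces.
ACCEPTING = {INT: "integer", REAL: "real"}


def char_class(ch):
    """Map a single character to the column used in TABLE."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def run_dfa(text, start=0):
    """Return (kind, lexeme) for the longest accepted prefix at start, or None."""
    state = START
    last_accept = None              # (kind, end index) of the latest accepting state
    pos = start
    while pos < len(text):
        nxt = TABLE.get((state, char_class(text[pos])))
        if nxt is None:
            break
        state = nxt
        pos += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], pos)
    if last_accept is None:
        return None
    kind, end = last_accept
    return kind, text[start:end]
```

Remembering the last accepting state lets the driver back up: on `3..5` it enters the non-accepting state after the first dot, fails on the second dot, and falls back to the integer `3`. Supporting another token means adding table entries rather than changing the driver.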
Summary • BNF and context-free grammars are equivalent meta-languages that are well-suited for describing the syntax of programming languages. • Syntax analysis is a common part of language implementation. • Scanners (lexical analyzers) use pattern matching to isolate small-scale parts of a program. • ANTLR provides support for scanners (lexers), parsers, and tree-parsers.