1. Lexical Analysis (Scanning) SC4312 – Compiler Construction
Originally prepared by Dr. Songsak Channarukul
Modified by Dr. Kwankamol Nongpong
sc4312 Lecture 2a 1
2. Computer languages must be precise.
Both syntax and semantics must not be ambiguous.
Language designers use formal notation for syntactic and semantic definitions.
In this class, we are going to discuss the formal syntactic notation.
sc4312 Lecture 2a 2 Introduction
3. Formal specification of syntax requires a set of rules.
A computer language needs two sets of rules to specify its syntax.
Lexical rules define what combinations of characters are valid for the language.
Syntactic rules define what combinations of tokens are valid. sc4312 Lecture 2a 3 Specifying Syntax
4. Regular expressions (REs) are used to specify lexical rules.
An RE generates a regular language.
Context-free grammars (CFGs) are used to specify syntactic rules.
A CFG generates a context-free language. sc4312 Lecture 2a 4 Formal Notations for Syntax
5. The Arabic numerals are composed of digits (0, 1, 2, …, 9).
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
A natural number is an arbitrary-length string of digits, beginning with a nonzero digit.
nonzero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → nonzero_digit digit* sc4312 Lecture 2a 5 Numerals in Formal Notation
6. The Kleene star (*) indicates zero or more repetitions of the symbol it follows.
It is named after its inventor Stephen Kleene (pronounced klay’nee). sc4312 Lecture 2a 6 The Kleene Star (*)
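As a quick check of the notation, the natural_number rule can be tried out as a Python regular expression. This is only a sketch: the names DIGIT, NONZERO_DIGIT, NATURAL_NUMBER and is_natural_number are chosen here for illustration and are not part of the lecture.

```python
import re

# Lexical rules from the numerals slide, written as Python regex fragments.
DIGIT = r"[0-9]"
NONZERO_DIGIT = r"[1-9]"

# natural_number -> nonzero_digit digit*
# The trailing "*" is the Kleene star: zero or more repetitions of DIGIT.
NATURAL_NUMBER = re.compile(NONZERO_DIGIT + DIGIT + r"*")


def is_natural_number(text: str) -> bool:
    """Return True if the whole string matches the natural_number rule."""
    return NATURAL_NUMBER.fullmatch(text) is not None


for sample in ["7", "42", "1000", "007", ""]:
    print(repr(sample), is_natural_number(sample))
```

A single nonzero digit such as "7" is accepted because the star allows zero repetitions, while "007" is rejected because it does not begin with a nonzero digit.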
7. A scanner converts a stream of characters into a stream of tokens. sc4312 Lecture 2a 7 Lexical Analysis
8. The goal of a scanner is to simplify the parser’s task.
Scanners are usually much faster than parsers.
Scanners discard irrelevant characters in a program (e.g., comments, white space).
Parsers are concerned only with tokens.
They don’t care what an identifier is named, only that it is an identifier. sc4312 Lecture 2a 8 Lexical Analysis (cont.)
9. Groups a stream of characters into tokens
Removes irrelevant characters such as comments and white spaces
Deals with pragmas
sc4312 Lecture 2a 9 Tasks of Lexical Analysis
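The first two tasks can be sketched with a small regex-driven scanner. The token classes below (COMMENT, WHITESPACE, NUMBER, IDENT, OP) and the "#" comment syntax belong to a hypothetical toy language, not to any language discussed in the course, and pragma handling is left out.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical token classes for a toy language. Order matters: Python tries
# the alternatives left to right and takes the first one that matches.
TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),
    ("WHITESPACE", r"\s+"),
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[-+*/=()]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
IGNORED = {"COMMENT", "WHITESPACE"}  # irrelevant to the parser


def scan(source: str) -> Iterator[Token]:
    """Group a stream of characters into tokens, dropping comments and white space."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup not in IGNORED:
            yield Token(match.lastgroup, match.group())


for token in scan("x = y1 + 42  # add"):
    print(token)
```

The parser would only ever see the five tokens IDENT, OP, IDENT, OP, NUMBER; the spaces and the comment never leave the scanner.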
10. Pragmas are significant comments.
They provide directives or hints to the compiler.
They do not affect the meaning of the program, only the compilation process.
Examples include inclusion or exclusion of code, uses of code, and code improvements. sc4312 Lecture 2a 10 Pragmas
11. An alphabet (Σ) is a finite set of symbols (e.g., {0, 1}, {A, B, C, . . . , Z}, ASCII, Unicode).
A string is a finite sequence of symbols from an alphabet (e.g., ε, ABAC, student).
A language is a set of strings over an alphabet (e.g., ∅ (the empty language), all English words, strings that start with a letter followed by any sequence of letters and digits). sc4312 Lecture 2a 11 Describing Tokens
12. We need a way to describe which patterns of characters are legal according to the language specification.
Such models include finite automata and regular expressions. sc4312 Lecture 2a 12 Now What?
13. Any set of strings that can be defined using the regular operations is called a regular language. sc4312 Lecture 2a 13 Regular Language
14. Let A, B ∈ RL.
A ∪ B = {x | x ∈ A or x ∈ B}
A ∩ B = {x | x ∈ A and x ∈ B}
A* = {x1x2…xk | k ≥ 0 and ∀i, xi ∈ A}
The class of RL is closed under ∪, ∩, *, −, and ?.
Prove? sc4312 Lecture 2a 14 Regular Operations
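The set-builder definitions on this slide (strings in A or in B, strings in A and in B, and concatenations of k ≥ 0 strings from A) can be tried on small finite languages. This is a sketch only: A* is infinite whenever A contains a nonempty string, so the hypothetical helper star_up_to cuts it off at a chosen k.

```python
from itertools import product


def union(a: set, b: set) -> set:
    """{x | x in A or x in B}"""
    return a | b


def intersection(a: set, b: set) -> set:
    """{x | x in A and x in B}"""
    return a & b


def star_up_to(a: set, max_k: int) -> set:
    """Strings x1 x2 ... xk with every xi in A, for 0 <= k <= max_k.

    The real A* has no upper bound on k; max_k is an assumption made here
    so that the result stays finite.
    """
    result = set()
    for k in range(max_k + 1):
        for parts in product(a, repeat=k):
            result.add("".join(parts))
    return result


A = {"a", "ab"}
B = {"ab", "b"}
print(sorted(union(A, B)))
print(sorted(intersection(A, B)))
print(sorted(star_up_to({"a", "b"}, 2)))
```

Note that k = 0 contributes the empty string, so the star of any language always contains ε.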
15. Models for computers with an extremely limited amount of memory
Singular: An automaton
Plural: Automata
sc4312 Lecture 2a 15 Finite Automata
16. A finite automaton is a 5-tuple (Q, Σ, δ, q0, F), where
Q is a finite set of states
Σ is a finite alphabet
δ : Q × Σ → Q is the transition function
q0 ∈ Q is the starting state
F ⊆ Q is the set of final states.
A language is regular (hence regular language) if some finite automaton recognizes it.
sc4312 Lecture 2a 16 Finite Automata
17. Transition Function: Like any function, it can also be represented in the form of a table.
For instance, sc4312 Lecture 2a 17
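One way to write such a table down is as a Python dictionary keyed by (state, symbol). The automaton below is an illustrative one for the natural_number rule from the numerals slide; the state names "start", "number" and "dead" are assumptions made for this sketch, not taken from the lecture.

```python
DIGITS = "0123456789"

# The five components of a finite automaton for: nonzero_digit digit*
Q = {"start", "number", "dead"}   # finite set of states
SIGMA = set(DIGITS)               # alphabet
Q0 = "start"                      # starting state
F = {"number"}                    # final states

# Transition function as a table: one entry per (state, symbol) pair.
DELTA = {}
for d in DIGITS:
    DELTA[("start", d)] = "dead" if d == "0" else "number"
    DELTA[("number", d)] = "number"
    DELTA[("dead", d)] = "dead"


def accepts(text: str) -> bool:
    """Run the automaton on text and report whether it ends in a final state."""
    state = Q0
    for symbol in text:
        if symbol not in SIGMA:
            return False  # symbol outside the alphabet
        state = DELTA[(state, symbol)]
    return state in F


# Print two columns of the table as a spot check.
for state in sorted(Q):
    print(state, "on 0 ->", DELTA[(state, "0")], "| on 7 ->", DELTA[(state, "7")])

print(accepts("42"), accepts("042"), accepts(""))
```

Because the function is total, every state has an entry for every symbol; the "dead" state is what makes that possible for inputs that start with 0.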
18. A finite automaton is nondeterministic if, for the same current state and input symbol, there can be more than one possible next state.
Nondeterminism is a generalization of determinism.
The formal definition is the same as that of a deterministic finite automaton (DFA) except for the transition function. sc4312 Lecture 2a 18 Nondeterminism
19. A Nondeterministic Finite Automaton (NFA) is (Q, Σ, δ, q0, F) where
Q is a finite set of states
Σ is a finite alphabet
δ : Q × (Σ ∪ {ε}) → P(Q) is the transition function
q0 is the starting state
F is the set of final states.
An NFA accepts its input if at least one path leads to an accepting state. sc4312 Lecture 2a 19 Nondeterministic Finite Automaton
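The acceptance condition can be simulated by tracking the set of all states the NFA could be in at once. The two-state NFA below, which recognizes a*b*, is a hypothetical example made up for this sketch; the empty string is used as the label for ε-moves.

```python
EPSILON = ""  # label used for epsilon-moves in this sketch

# Transition function returning sets of states; a missing key means the empty set.
DELTA = {
    (0, "a"): {0},
    (0, EPSILON): {1},
    (1, "b"): {1},
}
START = 0
ACCEPTING = {1}


def epsilon_closure(states):
    """All states reachable from the given states using only epsilon-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in DELTA.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(text: str) -> bool:
    """Accept if at least one path through the NFA ends in an accepting state."""
    current = epsilon_closure({START})
    for symbol in text:
        moved = set()
        for state in current:
            moved |= DELTA.get((state, symbol), set())
        current = epsilon_closure(moved)
    return bool(current & ACCEPTING)


for sample in ["", "aabb", "ab", "ba", "aba"]:
    print(repr(sample), nfa_accepts(sample))
```

If the set of current states ever becomes empty, every path has died and the input is rejected.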
20. sc4312 Lecture 2a 20 Transition Functions
21. Regular expressions are a standard way to describe the languages of tokens.
A regular expression is
a, for some a ∈ Σ (i.e., the language {a})
ε (i.e., the empty string)
∅ (i.e., the empty language)
R1 ∪ R2, where R1, R2 ∈ RE
R1R2, where R1, R2 ∈ RE
R1*, where R1 ∈ RE
A language is regular if some RE describes it. sc4312 Lecture 2a 21 Regular Expressions
22. RE → NFA
DFA → NFA → RE
RE → NFA → DFA sc4312 Lecture 2a 22 Conversions
23. Construct a finite automaton for 0*1(1 | 0)* sc4312 Lecture 2a 23 Exercise
24. Convert the following NFA to a DFA sc4312 Lecture 2a 24 Exercise
25. The new start state is the set containing the original start state (in this case, 1) together with every state that can be reached from it via ε-transitions.
The accept states are those sets that include an original accept state (in this case, 1). sc4312 Lecture 2a 25
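The two rules above are the core of the subset construction, which can be sketched as follows. The NFA from the exercise slide is a figure that is not reproduced in this transcript, so the three-state NFA below (start state 1, accept state 1, alphabet {a, b}) is an assumed stand-in and may differ from the one shown in class.

```python
from collections import deque

EPSILON = ""  # label used for epsilon-moves in this sketch
ALPHABET = ["a", "b"]

# Assumed NFA: states 1, 2, 3; start state 1; accept state 1.
NFA_DELTA = {
    (1, "b"): {2},
    (1, EPSILON): {3},
    (2, "a"): {2, 3},
    (2, "b"): {3},
    (3, "a"): {1},
}
NFA_START = 1
NFA_ACCEPTING = {1}


def epsilon_closure(states):
    """The given states plus everything reachable from them by epsilon-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in NFA_DELTA.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def subset_construction():
    """Build the reachable part of the equivalent DFA; each DFA state is a set of NFA states."""
    start = epsilon_closure({NFA_START})
    dfa_delta = {}
    seen = {start}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for symbol in ALPHABET:
            moved = set()
            for state in subset:
                moved |= NFA_DELTA.get((state, symbol), set())
            target = epsilon_closure(moved)
            dfa_delta[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A DFA state accepts if it contains at least one NFA accept state.
    accepting = {subset for subset in seen if subset & NFA_ACCEPTING}
    return start, dfa_delta, accepting, seen


DFA_START, DFA_DELTA, DFA_ACCEPTING, DFA_STATES = subset_construction()
print("start:", sorted(DFA_START))
print("accepting:", sorted(sorted(s) for s in DFA_ACCEPTING))
print("number of reachable DFA states:", len(DFA_STATES))
```

Only subsets reachable from the start state are generated, so the result can be much smaller than the full power set; the empty set shows up as the dead state.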
26. DFA sc4312 Lecture 2a 26
27. Pictorial representation of a Pascal scanner as a finite automaton sc4312 Lecture 2a 27
28. Lexical errors occur in the scanning phase of the compilation process.
They are relatively easy to handle:
Throw away the current, invalid token.
Skip forward until a character is found that can legitimately begin a new token.
Restart the scanning algorithm.
Count on the error-recovery mechanism of the parser… sc4312 Lecture 2a 28 Lexical Errors
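The first three steps can be sketched as a scanner loop that records the bad characters and carries on. The token pattern and the name scan_with_recovery are hypothetical, and the sketch only handles characters that cannot begin any token; a real scanner also has to deal with tokens that start validly and then go wrong.

```python
import re

# Hypothetical toy lexicon: white space, numbers, identifiers, one-character operators.
TOKEN = re.compile(r"\s+|[0-9]+|[A-Za-z_][A-Za-z0-9_]*|[-+*/=()]")


def scan_with_recovery(source: str):
    """Return (tokens, errors); each error is an (offset, discarded_text) pair."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match:
            if not match.group().isspace():
                tokens.append(match.group())
            pos = match.end()
            continue
        # Lexical error: throw away characters until one can begin a new token,
        # then fall back into the normal scanning loop.
        bad_start = pos
        while pos < len(source) and TOKEN.match(source, pos) is None:
            pos += 1
        errors.append((bad_start, source[bad_start:pos]))
    return tokens, errors


print(scan_with_recovery("x = 4 @$ y"))
```

The scanner reports the discarded text and still delivers the remaining tokens, leaving any structural damage to the parser's own error recovery.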
29. Write regular expressions for the following:
Java keywords
Integer numbers
C#’s identifiers
Java’s float numbers (e.g., 1e1f, 2.f, .3f, 0f, 3.14f, 6.022137e+23f)
Hexadecimal numbers (e.g., 0x1E3a)
sc4312 Lecture 2a 29 Practice