Compiler Design
Lecture 2: Lexical Analysis
Lecturer: Batool Almasri
Lexical Analysis
- Lexical analysis is the first phase of a compiler; the lexical analyzer is also known as the scanner.
- It breaks the input into the smallest meaningful sequences, called TOKENS, which the parser uses for syntax analysis.
- Tokens can be classified as IDENTIFIERS, KEYWORDS, OPERATORS, PUNCTUATION MARKS, ...
• It removes:
• 1- white space (such as tabs, blanks, and newlines)
• 2- comments
Lexical analysis
- The part of the input stream that qualifies as a token is called a LEXEME.
- Example: the character sequence if qualifies as a keyword in the C language, hence “if” is the lexeme in this case.
• The lexical analyzer keeps track of newline characters so that it can report the line number if there are any errors in the source program.
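The line-number bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `scan_with_line_numbers` is hypothetical, a lexeme is taken to be either a run of letters and digits or a single symbol from a small assumed punctuation set, and any other character is treated as an error.

```python
def scan_with_line_numbers(source):
    """Yield (lexeme, line_number) pairs; raise on an illegal character.

    Assumption for this sketch: a lexeme is a run of letters/digits or
    one character from a small, made-up punctuation set.
    """
    punctuation = set("()=;<>+-*/{}")
    line = 1
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            line += 1              # count newlines so errors can name a line
            i += 1
        elif ch in " \t":
            i += 1                 # skip blanks and tabs
        elif ch.isalnum():
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            yield source[start:i], line
        elif ch in punctuation:
            yield ch, line
            i += 1
        else:
            raise SyntaxError(f"line {line}: illegal character {ch!r}")
```

For the two-line input "if (x)\n y = 1;" the scanner reports the lexemes if, (, x, ) on line 1 and y, =, 1, ; on line 2. An illegal character such as @ on the third line of an input raises an error whose message starts with "line 3".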
Lexical analysis
• The goal of lexical analysis is to divide the program text into its words, or what we call in compiler-speak the TOKENS.
• E.g.: if (x==y) then printf(“hello %d”, a); else printf(“%d”, b);
• A token class corresponds to a set of strings:
• - IDENTIFIER: a string of letters and digits starting with a letter
• - INTEGER: a non-empty string of digits
• - KEYWORD: “else” or “if” or “begin” or ...
• - WHITESPACE: a non-empty sequence of blanks, newlines, and tabs
- The lexical analyzer reads the character stream of the source code, checks for legal tokens, and passes them to the syntax analyzer when it demands them.
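As a rough illustration of the token classes listed above, the following Python sketch builds one regular expression per class and classifies each lexeme. The IDENTIFIER, INTEGER, KEYWORD and WHITESPACE patterns follow the slide; the OPERATOR and PUNCTUATION sets, and the names `TOKEN_SPEC`, `MASTER` and `tokenize`, are assumptions made for the example.

```python
import re

# One pattern per token class. Order matters: KEYWORD is tried
# before IDENTIFIER so that "if" is not classified as an identifier.
TOKEN_SPEC = [
    ("WHITESPACE",  r"[ \t\n]+"),
    ("KEYWORD",     r"(?:if|else|begin)\b"),
    ("IDENTIFIER",  r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER",     r"[0-9]+"),
    ("OPERATOR",    r"==|="),      # assumed operator set
    ("PUNCTUATION", r"[();]"),     # assumed punctuation set
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return a list of (token_class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("if (x==y) z=10;")` yields the keyword if, the punctuation (, the identifier x, the operator ==, and so on. A real scanner would hand these pairs to the parser one at a time instead of building the whole list first.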
Lexical analysis
• The lexical analyzer classifies program substrings according to their role (token class) and communicates the tokens to the parser.
• if (i==j)
• z=0;
• else
• z=1;
• The lexical analyzer reads this as:
• \tif (i==j)\n\t\tz=0;\n\telse\n\t\tz=1;
• \t: tab (whitespace), \n: newline
• - The lexical analyzer removes all whitespace and blank characters from the source code.
Regular language
• To define the regular languages we generally use regular expressions; each regular expression denotes a set of strings (a language).
• - There are two base regular expressions:
• 1- Single character: ‘c’ = { “c” }
• This expression denotes a language containing exactly one string, the one-character string “c”.
• 2- Epsilon: ε = { “” }
• This language again contains just a single string, this time the empty string. It is important to keep in mind that epsilon is not the empty language: { “” } has one element, whereas the empty language { } has none.
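Regular language
• Three compound regular expressions:
• 1- Union: A+B, also written in lex as A|B, means either A or B.
• 2- Concatenation: AB means a string from A followed by a string from B.
• 3- Iteration: A* is the Kleene star or closure; A* equals the union of A^i for all i ≥ 0, where A^i is A concatenated with itself i times and A^0 = ε.
• - Positive closure: A+ = AA* (here + is a postfix operator, not the union operator above).
• E.g.: (a)* means a can occur 0 or more times.
• E.g.: (a+) means a can occur 1 or more times; a must appear at least once.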
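Since each regular expression denotes a set of strings, the compound operations can be tried out directly on Python sets. This is a sketch only: A* is an infinite language, so `star` and `plus` take a hypothetical `max_repeats` bound and return a finite approximation. All function names are illustrative.

```python
from itertools import product

def union(a, b):
    """A+B (A|B in lex): strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}

def star(a, max_repeats):
    """Finite approximation of A*: union of A^0 up to A^max_repeats."""
    result = {""}          # A^0 is the language holding only the empty string
    power = {""}
    for _ in range(max_repeats):
        power = concat(power, a)   # A^(i+1) = A^i followed by A
        result |= power
    return result

def plus(a, max_repeats):
    """Finite approximation of A+ = AA*, with 1..max_repeats copies of A."""
    return concat(a, star(a, max_repeats - 1))
```

With A = {"a"} and B = {"b"}, `union(A, B)` is {"a", "b"}, `concat(A, B)` is {"ab"}, `star(A, 3)` is {"", "a", "aa", "aaa"}, and `plus(A, 3)` is the same set without the empty string. The difference between epsilon and the empty language also shows up: `concat({""}, A)` returns A unchanged, while `concat(set(), A)` returns the empty set.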
Representing symbols using regular expressions
• 1- LETTER = [A-Z]
• 2- DIGIT = 0|1|2|3|4|5|6|7|8|9
• 3- SIGN = + | -
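The three symbol definitions above translate almost literally into Python's `re` syntax. In the sketch below, LETTER, DIGIT and SIGN mirror the slide; the composite patterns `SIGNED_INTEGER` and `IDENTIFIER` are hypothetical additions that show how named symbols can be combined.

```python
import re

LETTER = r"[A-Z]"                    # as on the slide: uppercase letters only
DIGIT = r"(?:0|1|2|3|4|5|6|7|8|9)"   # equivalent to [0-9]
SIGN = r"(?:\+|-)"                   # '+' is escaped: it is a regex operator

# Hypothetical composite definitions built from the symbols above.
SIGNED_INTEGER = re.compile(f"{SIGN}?{DIGIT}+")
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")

def is_signed_integer(text):
    """True if the whole string is an optional sign followed by digits."""
    return SIGNED_INTEGER.fullmatch(text) is not None

def is_identifier(text):
    """True if the whole string is a letter followed by letters or digits."""
    return IDENTIFIER.fullmatch(text) is not None
```

Under these definitions "-42", "+7" and "42" are signed integers while "4-2" is not, and "X1" is an identifier while "1X" is not. Because LETTER covers only A to Z here, a lowercase name such as "x1" is rejected; widening the class to [A-Za-z] would accept it.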
HOW TO RECOGNIZE TOKENS?
• FINITE STATE AUTOMATA
FINITE AUTOMATA
• Finite automata are at the heart of how lex turns its input program into a lexical analyzer.
• A finite automaton has:
• 1- one start state
• 2- a set of final (accepting) states, possibly many
• 3- states, each labeled with a state name
• 4- directed edges labeled with symbols
• Two types:
• - Nondeterministic
• - Deterministic
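The components listed above map naturally onto a small table-driven program. The following sketch simulates a deterministic finite automaton that accepts identifiers (a letter followed by letters or digits). The state names, the `TRANSITIONS` table and the helper functions are made up for this example. Because the automaton is deterministic, each (state, symbol class) pair has at most one target state, so a plain dictionary is enough.

```python
import string

def symbol_class(ch):
    """Map an input character to the edge label used in the table."""
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"

START = "start"                     # the single start state
FINAL = {"in_id"}                   # the set of final (accepting) states
TRANSITIONS = {                     # directed, labeled edges
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
}

def accepts(text):
    """Run the DFA over text and report whether it ends in a final state."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, symbol_class(ch)))
        if state is None:           # no edge for this symbol: reject
            return False
    return state in FINAL
```

Here `accepts("x1")` is true, while `accepts("1x")`, `accepts("")` and `accepts("a_b")` are all false. A nondeterministic automaton would instead allow several target states per edge label, so a simulation would have to track a set of current states.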