Lexical Analysis

Organization • What is Lexical Analysis? • How do you Build a Lexical Analyzer? • Lexical Specifications • How do Lexical Analysis Generators Work?

What is Lexical Analysis?

Token, lexeme and pattern • A token is a pair consisting of a token name and an optional attribute value • A pattern is a description of the form that the lexemes of a token may take • A lexeme is a sequence of characters in the source program that matches the pattern for a token

Lexical Analysis • Reads the input source program and produces a sequence of tokens

Lexical Analysis • Removes comments and white space (blanks, tabs and newlines) from the source program • In conjunction with the parser, creates the symbol table used in various stages of the compiler

Lexical Analysis • It reads the character stream of the source code, checks for legal tokens, and passes each token to the syntax analyzer (parser) on demand.

How do you Build a Lexical Analyzer?

An Approach • Take the input character by character and check for the various constructs of the language. • The rules for identifying a keyword, operator, identifier or string literal vary from language to language.
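The character-by-character approach can be sketched in C as follows. This is a minimal illustration, not a complete scanner: the token classes, the keyword list, and the set of one-character operators are assumptions made for the example, and string literals are recognized without escape sequences.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token classes for this sketch; a real language defines its own. */
typedef enum { TOK_EOF, TOK_KEYWORD, TOK_IDENTIFIER, TOK_NUMBER,
               TOK_STRING, TOK_OPERATOR, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the lexeme inside the input */
    size_t length;      /* number of characters in the lexeme */
} Token;

/* A deliberately tiny keyword list, assumed for illustration only. */
static int is_keyword(const char *s, size_t n) {
    static const char *const keywords[] = { "int", "return", "if", "else", "while" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == n && strncmp(keywords[i], s, n) == 0)
            return 1;
    }
    return 0;
}

/* Scan one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor) {
    const char *p = *cursor;
    while (isspace((unsigned char)*p)) p++;            /* skip white space */

    Token t = { TOK_EOF, p, 0 };
    if (*p == '\0') { *cursor = p; return t; }         /* end of input */

    if (isalpha((unsigned char)*p) || *p == '_') {     /* identifier or keyword */
        while (isalnum((unsigned char)*p) || *p == '_') p++;
        t.kind = is_keyword(t.start, (size_t)(p - t.start))
                     ? TOK_KEYWORD : TOK_IDENTIFIER;
    } else if (isdigit((unsigned char)*p)) {           /* unsigned integer */
        while (isdigit((unsigned char)*p)) p++;
        t.kind = TOK_NUMBER;
    } else if (*p == '"') {                            /* string literal, no escapes */
        p++;
        while (*p != '\0' && *p != '"') p++;
        if (*p == '"') { p++; t.kind = TOK_STRING; }
        else t.kind = TOK_ERROR;                       /* unterminated string */
    } else if (strchr("+-*/=<>!?:;,(){}", *p) != NULL) {
        p++;                                           /* one-character operator or punctuation */
        t.kind = TOK_OPERATOR;
    } else {
        p++;                                           /* character no rule accepts */
        t.kind = TOK_ERROR;
    }
    t.length = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

A caller would set a `const char *` cursor to the start of the source text and call `next_token(&cursor)` in a loop until it returns `TOK_EOF`. Every rule change means editing this code by hand, which is the motivation for generating the scanner from a specification instead.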

Lexical Analyzer Generators • A Lexical Analyzer Generator is a tool that can generate code to perform Lexical Analysis of the input, given the rules for the basic building blocks of the Language. • Rules for the basic building blocks of a Language are called its Lexical Specifications.

A Lexical Analyzer

Number of tokens?

int main(x,y)
int x,y;
/* find max of x and y */
{
    return (x>y?x:y);
}

lex • Illustration of generation of a Lexical Analyzer using ‘lex’ – A Lexical Analyzer Generator

Lexical Specifications

Introductory Concepts • Regular Expressions • A regular expression is a pattern that describes a set of strings • A regular expression is a finite pattern over a finite alphabet of symbols, but the set of strings it describes may be infinite (for example, all identifiers). • A grammar equivalent to a regular expression is known as a regular grammar. The language it defines is known as a regular language.

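A Regular Expression can be recursively defined as follows: • ε is a regular expression denoting the language that contains only the empty string: L(ε) = {ε} • If a is a symbol in Σ, then a is a regular expression and L(a) = {a}, a language with one string of length one. • If r and s are regular expressions, then so are r|s, rs and r*, denoting the union, concatenation and Kleene closure of their languages.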

Operations • The various operations on languages are: • Union of two languages L and M is written as L ∪ M = {s | s is in L or s is in M} • Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M} • The Kleene closure of a language L is written as L* = the set of strings formed by concatenating zero or more strings from L (so ε is always in L*).

If x is a regular expression, then: • x* -> zero or more occurrences of x, i.e., it can generate {ε, x, xx, xxx, xxxx, …} • x+ -> one or more occurrences of x, i.e., it can generate {x, xx, xxx, xxxx, …}, the same as xx* • x? -> at most one occurrence of x, i.e., it can generate either {x} or {ε} • [ ] -> character class (any one of the listed characters): [ab] -> {a, b} • ( ) -> grouping: (abc) -> {abc}, so that (abc)* repeats the whole group • [a-z] is all lower-case letters of the English alphabet. • [A-Z] is all upper-case letters of the English alphabet. • [0-9] is all decimal digits.

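[-az] -> {-, a, z} • [a-z] -> {a, b, c, d, …, z} • [a\-z] -> {a, -, z} • [^a-d] -> any single character other than a, b, c or d (among lower-case letters: {e, f, g, h, i, j, k, l, m, …, z}) • ^[a-z] -> a lower-case letter at the beginning of a line • a$ -> an a at the end of a line • . -> any character except newline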

hello -> {hello} • gray|grey -> {gray, grey} • gr(a|e)y -> {gray, grey} • gr[ae]y -> {gray, grey} • b[aeiou]bble -> {babble, bebble, bibble, bobble, bubble} • [b-chm-pP]at|ot -> {bat, cat, hat, mat, nat, oat, pat, Pat, ot}

colou?r -> {color, colour} • go*gle -> {ggle, gogle, google, gooogle, …} • go+gle -> {gogle, google, gooogle, …} • z{3} -> {zzz} • z{3,6} -> {zzz, zzzz, zzzzz, zzzzzz} • ^dog -> dog at the beginning of a line • dog$ -> dog at the end of a line
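The examples above can be tried out with the POSIX `regcomp`/`regexec` functions from `<regex.h>`, which accept the same operators in extended mode. The sketch below is only an illustration: the helper name `full_match` is hypothetical, and it wraps the pattern in `^(` … `)$` so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the POSIX extended regular
   expression pattern, 0 if it does not, and -1 if the pattern is
   invalid or too long for the local buffer. */
int full_match(const char *pattern, const char *text) {
    char anchored[256];
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    regex_t re;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                                /* pattern did not compile */

    int rc = regexec(&re, text, 0, NULL, 0);      /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

For instance, `full_match("colou?r", "colour")` and `full_match("go*gle", "ggle")` both return 1, while `full_match("go+gle", "ggle")` returns 0 because `+` requires at least one `o`.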

Representing occurrence of symbols using regular expressions
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = + | - (as a character class: [+-])
Representing language tokens using regular expressions:
Decimal = (sign)?(digit)+
Identifier = letter(letter | digit)*
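Each of these two definitions corresponds to a short hand-written recognizer. The following C sketch mirrors them directly; the function names are made up for this example, and the letter test assumes a character set such as ASCII in which the letters are contiguous.

```c
/* letter = [a-z] | [A-Z]  (assumes contiguous letters, as in ASCII) */
static int is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* digit = [0-9] */
static int is_digit_char(char c) {
    return c >= '0' && c <= '9';
}

/* Decimal = (sign)?(digit)+ */
int is_decimal(const char *s) {
    if (*s == '+' || *s == '-') s++;     /* (sign)? : optional sign */
    if (!is_digit_char(*s)) return 0;    /* (digit)+ : at least one digit */
    while (is_digit_char(*s)) s++;
    return *s == '\0';                   /* nothing may follow */
}

/* Identifier = letter(letter | digit)* */
int is_identifier(const char *s) {
    if (!is_letter(*s)) return 0;        /* must start with a letter */
    s++;
    while (is_letter(*s) || is_digit_char(*s)) s++;
    return *s == '\0';                   /* nothing may follow */
}
```

With these definitions `-42` is a Decimal and `x1` is an Identifier, while `1x` is neither. Note that the Identifier definition as given does not admit underscores.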

Lexical Specification File • Three sections, separated by lines containing %% • Declarations • Translation rules • Auxiliary functions

Declarations • Regular definitions (a name followed by a pattern) that can be used in Translation Rules • C declarations, enclosed within %{ and %} • #defines and C prototype declarations of the functions used in Translation Rules • #include statements for the library functions used in Translation Rules

Translation Rules • Pattern-Action Pairs • Where Pattern is a regular Expression and the Action is a C language Program Segment • The action is typically a return Statement indicating the type of token that has been matched

Translation Rules • Generated global variables that can be used in the action statements: • yytext contains the lexeme • yyleng gives the length of the lexeme • For tokens that have no significance for the parser (white space, newlines, etc.), the action has no return statement, so scanning continues with the next lexeme
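The behaviour of pattern-action pairs can be imitated in plain C with a table of rules. This is a rough sketch of the idea only and not the code that lex generates (lex compiles the patterns into a single automaton). The token codes, the rule table, and the `scan` function are all assumptions for the example; the `lexeme` buffer and `*length` play the roles of `yytext` and `yyleng`. As in lex, the longest match wins, and on a tie the rule listed first wins.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; 0 marks end of input, as with yylex(). */
enum { T_END = 0, T_INT = 1, T_ID = 2, T_NUM = 3, T_SKIP = -1, T_BAD = -2 };

typedef struct { const char *pattern; int token; } Rule;

/* Each pattern starts with ^ so it can only match at the scan position. */
static const Rule rules[] = {
    { "^int",                  T_INT  },
    { "^[A-Za-z][A-Za-z0-9]*", T_ID   },
    { "^[0-9]+",               T_NUM  },
    { "^[ \t\n]+",             T_SKIP },  /* white space: action without return */
};

static void copy_lexeme(char *dst, size_t cap, const char *src, size_t n) {
    if (cap == 0) return;
    if (n > cap - 1) n = cap - 1;         /* truncate if the buffer is too small */
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Return the next token code and advance *cursor past its lexeme. */
int scan(const char **cursor, char *lexeme, size_t cap, size_t *length) {
    for (;;) {
        const char *p = *cursor;
        if (*p == '\0') return T_END;

        int best_token = T_BAD;
        size_t best_len = 0;
        /* Recompiling every pattern on every call is wasteful but keeps
           the sketch short. */
        for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
            regex_t re;
            regmatch_t m;
            if (regcomp(&re, rules[i].pattern, REG_EXTENDED) != 0) continue;
            /* Strictly longer matches replace the current best, so the
               earliest rule wins a tie. */
            if (regexec(&re, p, 1, &m, 0) == 0 && (size_t)m.rm_eo > best_len) {
                best_len = (size_t)m.rm_eo;
                best_token = rules[i].token;
            }
            regfree(&re);
        }
        if (best_len == 0) best_len = 1;  /* no rule matched: consume one character */

        *cursor = p + best_len;
        if (best_token == T_SKIP) continue;   /* no return: keep scanning */

        copy_lexeme(lexeme, cap, p, best_len);
        *length = best_len;
        return best_token;
    }
}
```

On the input `int integer 42` repeated calls return `T_INT` for `int`, `T_ID` for `integer` (the longer match beats the keyword rule), `T_NUM` for `42`, and then `T_END`.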

Auxiliary Functions • Definitions of the C functions used in the action statements. • The whole section is copied “as is” into lex.yy.c. • The yylex routine is called repeatedly to get the next token until the end of the input, where by default it returns 0
