
Lexical Analysis


jharris

Presentation Transcript


  1. Lexical Analysis

  2. Organization • What is Lexical Analysis? • How do you Build a Lexical Analyzer? • Lexical Specifications • How do Lexical Analysis Generators Work?

  3. What is Lexical Analysis?

  4. Token, lexeme and pattern • A token is a pair consisting of a token name and an optional token value • A pattern is a description of the form that the lexemes of a token may take • A lexeme is a sequence of characters in the source program that matches the pattern for a token
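To make the three terms concrete, here is a minimal Python sketch. The `Token` class, the token name `ID` and the pattern for it are assumptions chosen for illustration; they are not part of the slides.

```python
import re
from typing import NamedTuple, Optional

class Token(NamedTuple):
    name: str                    # token name, e.g. "ID"
    value: Optional[str] = None  # optional token value

# Hypothetical pattern for the token "ID": a letter followed by letters or digits.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

source = "count1 = 42"
match = ID_PATTERN.match(source)  # try the pattern at the start of the source text
lexeme = match.group()            # the characters that matched the pattern
token = Token("ID", lexeme)       # the pair handed on to the parser
print(token)
```

Here the pattern is the regular expression, the lexeme is the matched text `count1`, and the token is the pair built from the name and that text.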

  5. Lexical Analysis • Reads the input source program and produces a sequence of tokens

  6. Lexical Analysis • Removes comments and white space (blanks, tabs and newlines) from the source program • In conjunction with the parser, creates the symbol table used in various stages of the compiler

  7. Lexical Analysis • It reads the character stream of the source code, checks for legal tokens, and passes each token to the syntax analyzer on demand.
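The "on demand" hand-off can be sketched with a Python generator: nothing is scanned until the consumer asks for the next token. The function name `token_stream` is hypothetical, and treating tokens as whitespace-separated words is a deliberate simplification.

```python
import re

def token_stream(source):
    """Yield one lexeme at a time; scanning advances only when the consumer asks."""
    for m in re.finditer(r"\S+", source):  # simplification: tokens are separated by white space
        yield m.group()

stream = token_stream("a + b")
first = next(stream)   # the "parser" demands a token
second = next(stream)  # and then the next one
print(first, second)
```

A real parser would call the scanner in the same way, one token per request, rather than receiving the whole token list up front.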

  8. How do you Build a Lexical Analyzer?

  9. An Approach • Take the input character by character and check for the various constructs of the language. • The rules defining how to identify a keyword, operator, identifier or string literal vary from language to language
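The character-by-character approach can be sketched as a small hand-written scanner. This is only an illustration: the keyword set, the token names and the rule that every other character is a one-character operator are assumptions, not rules from the slides.

```python
KEYWORDS = {"int", "return", "if", "else", "while"}  # illustrative subset

def scan(text):
    """Return a list of (token_name, lexeme) pairs, reading one character at a time."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                    # skip blanks, tabs and newlines
            i += 1
        elif ch.isalpha() or ch == "_":     # identifier or keyword
            start = i
            while i < len(text) and (text[i].isalnum() or text[i] == "_"):
                i += 1
            word = text[start:i]
            tokens.append(("KEYWORD" if word in KEYWORDS else "ID", word))
        elif ch.isdigit():                  # integer literal
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("NUM", text[start:i]))
        elif ch == '"':                     # string literal (no escape handling)
            end = text.find('"', i + 1)
            if end == -1:
                raise ValueError(f"unterminated string at offset {i}")
            tokens.append(("STRING", text[i:end + 1]))
            i = end + 1
        else:                               # anything else: a one-character operator
            tokens.append(("OP", ch))
            i += 1
    return tokens

print(scan("int x = 10;"))
```

Each new construct needs another hand-coded branch, which is the main motivation for generating the scanner from a specification instead.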

  10. Lexical Analyzer Generators • A Lexical Analyzer Generator is a tool that can generate code to perform Lexical Analysis of the input, given the rules for the basic building blocks of the Language. • Rules for the basic building blocks of a Language are called its Lexical Specifications.

  11. A Lexical Analyzer

  12. Number of tokens?
      int main(x,y) int x,y;
      /* find max of x and y */
      { return (x>y?x:y); }
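One way to check the count is a small regular-expression tokenizer, sketched below in Python. It assumes that each keyword, identifier and punctuation mark counts as one token and that the comment and white space are discarded; the group names are made up for this example. Under those assumptions the snippet has 25 tokens.

```python
import re

SOURCE = """int main(x,y) int x,y;
/* find max of x and y */
{ return (x>y?x:y); }"""

TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/)              # C block comment, discarded
  | (?P<SPACE>\s+)                      # white space, discarded
  | (?P<WORD>[A-Za-z_][A-Za-z0-9_]*)    # keyword or identifier
  | (?P<PUNCT>[(){},;?:>])              # punctuation and operators used in this snippet
""", re.VERBOSE | re.DOTALL)

def tokens(text):
    """Return the lexemes of all tokens in text, skipping comments and white space."""
    out = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        if m.lastgroup not in ("COMMENT", "SPACE"):
            out.append(m.group())
        pos = m.end()
    return out

print(len(tokens(SOURCE)))  # 25
```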

  13. lex • Illustration of generation of a Lexical Analyzer using ‘lex’ – A Lexical Analyzer Generator

  14. Lexical Specifications

  15. Introductory Concepts • Regular Expressions • A regular expression is a pattern that describes a set of strings • Each string is a finite sequence of symbols, but the set described may be infinite (for example, a* describes ε, a, aa, aaa, …) • A language that can be described by a regular expression is known as a regular language; a grammar that generates such a language is known as a regular grammar.

  16. A regular expression can be defined recursively. The base cases are: • ε is a regular expression denoting the language that contains only the empty string: L(ε) = {ε} • If a is a symbol in Σ, then a is a regular expression and L(a) = {a}, that is, a language with one string, of length one.

  17. Operations • The operations on languages are: • Union of two languages L and M is written as L ∪ M = {s | s is in L or s is in M} • Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M} • The Kleene closure of a language L is written as L* = the set of strings formed by concatenating zero or more strings from L.
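These three operations can be tried out on small finite languages represented as Python sets of strings. This is a sketch only: the function names are hypothetical, and because L* is infinite for any L containing a non-empty string, `kleene` takes a bound on the number of concatenated pieces.

```python
from itertools import product

def union(L, M):
    """Strings that are in L or in M."""
    return L | M

def concat(L, M):
    """Every string st with s taken from L and t taken from M."""
    return {s + t for s, t in product(L, M)}

def kleene(L, max_repeats):
    """Strings built from at most max_repeats pieces of L (a finite slice of L*)."""
    result = {""}          # zero pieces: the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, L)
        result |= layer
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene({"a"}, 3)))
```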

  18. If x is a regular expression, then: • x* -> zero or more occurrences of x, i.e., it can generate {ε, x, xx, xxx, xxxx, …} • x+ -> one or more occurrences of x, i.e., it can generate {x, xx, xxx, xxxx, …}; equivalent to xx* • x? -> at most one occurrence of x, i.e., it can generate either {x} or {ε} • [ ] -> character class, matches any one of the listed characters: [ab] -> {a, b} (inside brackets, | is an ordinary character; alternation is written a|b) • ( ) -> grouping: (abc) -> {abc} • [a-z] is all lower-case letters of the English alphabet. • [A-Z] is all upper-case letters of the English alphabet. • [0-9] is all decimal digits.
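The operators on this slide can be checked with Python's `re` module, used here as a stand-in for lex patterns (the two notations agree on these operators). The helper `accepts` is a hypothetical name. Note that a character class such as `[a|b]` also accepts the `|` character itself, so a two-letter class is written `[ab]`.

```python
import re

def accepts(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(accepts("x*", ""))          # zero occurrences are allowed
print(accepts("x+", ""))          # at least one occurrence is required
print(accepts("x?", "xx"))        # at most one occurrence
print(accepts("[ab]", "b"))       # any one character from the class
print(accepts("[a|b]", "|"))      # inside brackets, | is an ordinary character
print(accepts("(abc)+", "abcabc"))  # parentheses group a sub-expression
```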

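  19. [-az] -> {-, a, z} • [a-z] -> {a, b, c, d, …, z} • [a\-z] -> {a, -, z} • [^a-d] -> any single character other than a, b, c, d (among lower-case letters: {e, f, g, h, i, j, k, l, m, …, z}) • ^[a-z] -> a lower-case letter at the beginning of a line • a$ -> an a at the end of a line • . -> any character except newline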
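The character-class and anchor rules can be verified in the same way. The sketch below again uses Python's `re` as a stand-in for lex; `full` and `found` are hypothetical helper names, with `found` used for the anchored patterns because they describe a position within a longer string.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

def found(pattern, s):
    """True if pattern matches somewhere in s."""
    return re.search(pattern, s) is not None

# A leading or escaped hyphen is a literal character, not a range.
assert full(r"[-az]", "-") and not full(r"[-az]", "m")
assert full(r"[a\-z]", "-") and not full(r"[a\-z]", "m")
assert full(r"[a-z]", "m")

# A negated class accepts every character outside the range, not only letters.
assert full(r"[^a-d]", "e") and full(r"[^a-d]", "5") and not full(r"[^a-d]", "c")

# Anchors tie the match to the start or the end.
assert found(r"^[a-z]", "abc1") and not found(r"^[a-z]", "1abc")
assert found(r"a$", "banana") and not found(r"a$", "band")

# The dot matches any single character except a newline.
assert full(r".", "x") and not full(r".", "\n")
print("all checks passed")
```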

  20. hello -> {hello} • gray|grey -> {gray, grey} • gr(a|e)y -> {gray, grey} • gr[ae]y -> {gray, grey} • b[aeiou]bble -> {babble, bebble, bibble, bobble, bubble} • [b-chm-pP]at|ot -> {bat, cat, hat, mat, nat, oat, pat, Pat, ot}

  21. colou?r -> {color, colour} • go*gle -> {ggle, gogle, google, gooogle, …} • go+gle -> {gogle, google, gooogle, …} • z{3} -> {zzz} • z{3,6} -> {zzz, zzzz, zzzzz, zzzzzz} • ^dog -> begins with dog • dog$ -> ends with dog
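The worked examples on the last two slides can be replayed against a list of candidate words. This is a sketch using Python's `re`; the candidate list and the helper `matching` are made up for the illustration.

```python
import re

CANDIDATES = ["gray", "grey", "griy", "color", "colour", "colouur",
              "ggle", "gogle", "google", "bat", "Pat", "oat", "ot", "rat",
              "zzz", "zzzzzzz"]

def matching(pattern):
    """Candidates that the pattern matches in full, in list order."""
    return [w for w in CANDIDATES if re.fullmatch(pattern, w)]

print(matching(r"gr(a|e)y"))         # ['gray', 'grey']
print(matching(r"colou?r"))          # ['color', 'colour']
print(matching(r"go*gle"))           # ['ggle', 'gogle', 'google']
print(matching(r"go+gle"))           # ['gogle', 'google']
print(matching(r"[b-chm-pP]at|ot"))  # ['bat', 'Pat', 'oat', 'ot']
print(matching(r"z{3,6}"))           # ['zzz']
```

The `[b-chm-pP]at|ot` case shows operator precedence: alternation binds more loosely than concatenation, so the pattern reads as `([b-chm-pP]at)|(ot)`.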

  22. Representing occurrence of symbols using regular expressions • letter = [a-zA-Z] • digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9] • sign = + | - or [+-] • Representing language tokens using regular expressions: • Decimal = (sign)?(digit)+ • Identifier = letter(letter | digit)*
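The named definitions compose directly into token patterns. A minimal Python sketch, assuming the character-class forms of letter, digit and sign, with the definitions spliced together as strings:

```python
import re

letter = r"[A-Za-z]"
digit = r"[0-9]"
sign = r"[+-]"

# Decimal = (sign)?(digit)+      Identifier = letter(letter | digit)*
decimal = re.compile(f"{sign}?{digit}+")
identifier = re.compile(f"{letter}({letter}|{digit})*")

for text in ["-42", "7", "+", "4.2"]:
    print(text, bool(decimal.fullmatch(text)))
for text in ["x1", "total", "1x"]:
    print(text, bool(identifier.fullmatch(text)))
```

As defined on the slide, Decimal describes optionally signed whole numbers only, so `4.2` is rejected.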

  23. Lexical Specification File • Three sections, separated by lines containing %%: • Declarations • Translation rules • Auxiliary functions

  24. Declarations • Regular definitions that can be used in the translation rules • C code enclosed within %{ and %}, which typically holds: • #defines and C prototype declarations of the functions used in the translation rules • #include statements for the library functions used in the translation rules

  25. Translation Rules • Pattern-Action Pairs • Where Pattern is a regular Expression and the Action is a C language Program Segment • The action is typically a return Statement indicating the type of token that has been matched

  26. Translation Rules • Generated global variables that can be used in the action statements: • yytext contains the lexeme • yyleng gives the length of the lexeme • For lexemes that have no significance for the parser (white space, newlines, etc.), the action has no return statement

  27. Auxiliary Functions • Definitions of the C functions used in the action statements. • The whole section is copied "as is" into lex.yy.c. • The yylex routine is called repeatedly, returning the next token each time, until the end of the input
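The pattern-action idea can be imitated in Python to show how the pieces fit together. This is a rough analogue, not lex itself and not the code lex generates: the rule table, the token names and the functions `next_token` and `tokenize` are all hypothetical. It does follow lex's matching policy of preferring the longest match and, on a tie, the rule listed first.

```python
import re

# Hypothetical rule table of (pattern, action) pairs. An action receives the
# lexeme (the role yytext plays in lex) and returns a token name, or None
# when nothing should be returned to the parser.
RULES = [
    (r"[ \t\n]+",             lambda text: None),   # white space: no token returned
    (r"[0-9]+",               lambda text: "NUM"),
    (r"[A-Za-z][A-Za-z0-9]*", lambda text: "ID"),
    (r"[-+*/=]",              lambda text: "OP"),
]
COMPILED = [(re.compile(p), action) for p, action in RULES]

def next_token(source, pos):
    """Analogue of one yylex call: (token, lexeme, new_pos), or None at end of input."""
    while pos < len(source):
        best, best_action = None, None
        for regex, action in COMPILED:
            m = regex.match(source, pos)
            # Prefer the longest match; an earlier rule wins a tie.
            if m and (best is None or m.end() > best.end()):
                best, best_action = m, action
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        pos = best.end()
        token = best_action(best.group())
        if token is not None:
            return token, best.group(), pos
    return None

def tokenize(source):
    """Call next_token repeatedly until the end of the input."""
    out, pos = [], 0
    result = next_token(source, pos)
    while result is not None:
        token, lexeme, pos = result
        out.append((token, lexeme, len(lexeme)))  # len(lexeme) plays the role of yyleng
        result = next_token(source, pos)
    return out

print(tokenize("x1 = 42"))
```

The white-space rule shows the "no return statement" case: its action yields nothing, so scanning simply continues with the next lexeme.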
