Chapter 7 Lexical Analysis and Stoplists
9042608 薛茹分 9042609 吳佳真 9042610 吳寶鳳
Outline
• Introduction
• What counts as a word or token?
• Implementing a Lexical Analyzer
• Implementing Stoplists
  • Hashing
  • Finite State Machine
• A Lexical Analyzer Generator
• DFA for stoplists
Introduction
• Lexical analysis
  • The process of converting an input stream of characters into a stream of words or tokens.
  • The first stage of automatic indexing and of query processing.
  • Tokens are groups of characters with collective significance.
  • Automatic indexing is the process of algorithmically examining information items to generate lists of index terms.
  • Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items.
Introduction (Cont’d)
• Stoplists
  • Many of the most frequently occurring words in English (like “the,” “of,” “and,” “to,” etc.) are worthless as index terms.
  • Eliminating such words from consideration early in automatic indexing speeds processing, saves huge amounts of space in indexes, and does not damage retrieval effectiveness.
  • A list of words filtered out during automatic indexing because they make poor index terms is called a stoplist or a negative dictionary.
What counts as a word or token?
• Some considerations
  • Digits (e.g. B6, B12)
  • Hyphens (e.g. F-16, MS-DOS)
  • Other punctuation (e.g. COMMAND.COM, OS/2)
  • Case (upper case folded to lower case)
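The considerations above amount to policy switches in a tokenizer. The following is a minimal sketch, not the chapter's implementation: the function name `tokenize` and its flags `keep_digits`, `keep_hyphens`, and `fold_case` are hypothetical, and other punctuation (as in COMMAND.COM or OS/2) is simply treated as a separator here.

```python
import re

def tokenize(text, keep_digits=True, keep_hyphens=True, fold_case=True):
    """Split text into tokens under a few configurable policies.

    All parameter names are illustrative assumptions, not from the slides.
    """
    # Characters allowed inside a token.
    chars = "A-Za-z"
    if keep_digits:
        chars += "0-9"
    # A token is a run of allowed characters...
    pattern = f"[{chars}]+"
    if keep_hyphens:
        # ...optionally joined by single internal hyphens (F-16, MS-DOS).
        pattern = f"[{chars}]+(?:-[{chars}]+)*"
    tokens = re.findall(pattern, text)
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return tokens

print(tokenize("The F-16 uses MS-DOS and vitamin B12."))
print(tokenize("The F-16 uses MS-DOS and vitamin B12.",
               keep_digits=False, keep_hyphens=False))
```

Each flag trades recall against precision: keeping "F-16" as one token makes it searchable as a unit, while splitting it lets a query for "F" or "16" match, often spuriously.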
Implementing a Lexical Analyzer
• Finite state machine
[Figure: state diagram with nodes numbered 0 to 8; transition labels: letter, digit, (, ), space, &, |, ^, eos, other]
Implementing a Lexical Analyzer • GetToken
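The body of GetToken is not reproduced on this slide. As a rough sketch of what a routine driven by the transition labels of the finite state machine (letter, digit, parentheses, space, &, |, ^, eos, other) might look like, the following assumes that a term starts with a letter and continues with letters or digits, and that &, |, and ^ stand for AND, OR, and NOT. The name `get_tokens` and the token type names are hypothetical.

```python
from typing import Iterator, Tuple

# Assumed meaning of the single-character symbols in a query.
OPERATORS = {"(": "LPAREN", ")": "RPAREN", "&": "AND", "|": "OR", "^": "NOT"}

def get_tokens(query: str) -> Iterator[Tuple[str, str]]:
    """Yield (type, text) pairs for a query string, ending with EOS."""
    i, n = 0, len(query)
    while i < n:
        ch = query[i]
        if ch.isspace():
            i += 1                      # skip whitespace between tokens
        elif ch in OPERATORS:
            yield (OPERATORS[ch], ch)   # single-character token
            i += 1
        elif ch.isalpha():
            start = i                   # a term: letter, then letters/digits
            while i < n and query[i].isalnum():
                i += 1
            yield ("TERM", query[start:i].lower())
        else:
            yield ("ERROR", ch)         # the "other" transition
            i += 1
    yield ("EOS", "")                   # end of string

for token in get_tokens("(cat & dog) | ^B12"):
    print(token)
```

Writing the scanner as a generator keeps the one-token-at-a-time behavior of a GetToken call while letting the caller iterate over the whole query.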
Implementing Stoplists
• There are two ways to filter stoplist words from an input token stream:
  • Examine lexical analyzer output and remove any stopwords (hashing).
  • Remove stopwords as part of lexical analysis (deterministic finite automaton, DFA).
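The first approach, filtering the analyzer's output through a hash table, can be sketched in a few lines. This is an illustration only: the stoplist contents and the name `remove_stopwords` are assumptions, and Python's built-in `frozenset` stands in for a hand-built hash table.

```python
# A tiny illustrative stoplist; real stoplists hold hundreds of words.
STOPWORDS = frozenset({"a", "an", "and", "in", "into", "to", "the", "of"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token found in the stoplist (expected O(1) per lookup)."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["lexical", "analysis", "and", "the", "stoplists"]))
```

Every token is still produced by the lexical analyzer before being looked up and discarded, which is the extra work the DFA approach avoids.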
A Lexical Analyzer Generator
[Figure: flowchart. Stop words file → Read stop words file → Create a DFA → Lexical analysis (reading the input file) → Close the input file and return terms → terms]
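DFA for stoplists
• L0 = {a, an, and, in, into, to}
[Figure: DFA with states q0 to q6. The start state q0 is labeled L0; the other states are labeled with the sets of remaining suffixes {λ, n, nd}, {λ, d}, {λ}, {n, nto}, {λ, to}, and {o}, where λ denotes the empty string. Transition labels: a, n, d, i, t, o]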
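The state labels in this DFA are sets of suffixes still to be matched, which suggests one way to build it: start from the full stoplist and, for each character, move to the set of remainders of the words that begin with that character. The sketch below follows that idea under stated assumptions. The names `build_dfa` and `is_stopword` are hypothetical, the empty Python string plays the role of the empty-string symbol, and this is not necessarily the construction used in the chapter.

```python
def build_dfa(words):
    """Build a DFA whose states are frozensets of remaining suffixes."""
    start = frozenset(words)
    transitions = {}          # (state, char) -> next state
    accepting = set()         # states containing the empty suffix
    pending, seen = [start], {start}
    while pending:
        state = pending.pop()
        if "" in state:
            accepting.add(state)
        # Group the non-empty suffixes by their first character.
        by_char = {}
        for suffix in state:
            if suffix:
                by_char.setdefault(suffix[0], set()).add(suffix[1:])
        for ch, rest in by_char.items():
            target = frozenset(rest)
            transitions[(state, ch)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return start, transitions, accepting

def is_stopword(dfa, token):
    """Run the DFA over token; a missing transition means rejection."""
    start, transitions, accepting = dfa
    state = start
    for ch in token:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting

dfa = build_dfa({"a", "an", "and", "in", "into", "to"})
for word in ["and", "into", "int", "andy"]:
    print(word, is_stopword(dfa, word))
```

Because identical suffix sets collapse into one state, the stoplist on this slide yields seven states, the same number as q0 to q6 in the figure. In a full lexical analyzer these transitions would be merged into the token-recognizing machine so that stopwords are rejected as they are scanned.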
Thank you!!! Questions and Comments