Chapter 7 Lexical Analysis and Stoplists

Chapter 7 Lexical Analysis and Stoplists 9042608 薛茹分 9042609 吳佳真 9042610 吳寶鳳

Outline • Introduction • What counts as a word or token? • Implementing a Lexical Analyzer • Implementing Stoplists • Hashing • Finite State Machine • A Lexical Analyzer Generator • DFA for stoplists

Introduction • Lexical analysis • The process of converting an input stream of characters into a stream of words or tokens. • The first stage of automatic indexing and of query processing. • Tokens are groups of characters with collective significance. • Automatic indexing is the process of algorithmically examining information items to generate lists of index terms. • Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items.

Introduction (Cont’d) • Stoplists • Many of the most frequently occurring words in English (like “the,” “of,” “and,” “to,” etc.) are worthless as index terms. • Eliminating such words from consideration early in automatic indexing speeds processing, saves huge amounts of space in indexes, and does not damage retrieval effectiveness. • A list of words filtered out during automatic indexing because they make poor index terms is called a stoplist or a negative dictionary.

What counts as a word or token? • Some considerations • Digits (e.g., B6, B12) • Hyphens (e.g., F-16, MS-DOS) • Other punctuation (e.g., COMMAND.COM, OS/2) • Case (e.g., converting upper case to lower case)
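The considerations above amount to policy choices that a tokenizer has to make explicit. The following is a minimal sketch in Python, not the presenters' implementation: the function name `tokenize` and its three flags are hypothetical, and other punctuation (such as "." and "/") is simply treated as a separator.

```python
import re

def tokenize(text, keep_digits=True, keep_hyphens=True, fold_case=True):
    """Split text into tokens under a configurable policy (illustrative only)."""
    # Characters allowed inside a token: letters, plus digits if requested.
    chars = "A-Za-z0-9" if keep_digits else "A-Za-z"
    run = f"[{chars}]+"
    # Optionally let a hyphen join two runs, so "MS-DOS" stays one token.
    pattern = f"{run}(?:-{run})*" if keep_hyphens else run
    tokens = re.findall(pattern, text)
    # Case folding: map every token to lower case.
    return [t.lower() for t in tokens] if fold_case else tokens

print(tokenize("The F-16 runs MS-DOS and OS/2"))
# ['the', 'f-16', 'runs', 'ms-dos', 'and', 'os', '2']
```

With `keep_hyphens=False` the same input yields "f" and "16" as separate tokens, which shows how much the index vocabulary depends on these decisions.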

Implementing a Lexical Analyzer • Finite state machine • [Figure: state-transition diagram with states numbered 0 to 8; edges labeled letter, digit, "(", ")", space, "&", "|", "^", eos, and other]

Implementing a Lexical Analyzer • GetToken
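A token-fetching routine driven by a finite state machine can be sketched as follows. This is an illustration under assumptions, not the GetToken code from the slide: it uses only two states (`START` and `IN_TERM`) rather than the slide's numbered states, and the token kinds "term", "op", "error", and "eos" are names chosen here. A term is taken to be a letter followed by letters or digits, and the operator characters are the ones that appear as edge labels in the diagram.

```python
START, IN_TERM = 0, 1          # hypothetical state names, not the slide's numbering
OPERATORS = ("(", ")", "&", "|", "^")

def char_class(ch):
    """Map one character to the input class used by the state machine."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if ch in OPERATORS:
        return ch
    return "other"

def get_tokens(query):
    """Yield (kind, text) pairs for a query string, ending with ('eos', '')."""
    state, buf = START, []
    for ch in query:
        cls = char_class(ch)
        if state == IN_TERM:
            if cls in ("letter", "digit"):
                buf.append(ch)             # stay in IN_TERM
                continue
            yield ("term", "".join(buf).lower())
            state, buf = START, []         # term ended; re-examine ch from START
        if cls == "letter":
            buf.append(ch)
            state = IN_TERM
        elif cls == "space":
            pass                           # skip whitespace between tokens
        elif cls in OPERATORS:
            yield ("op", ch)
        else:
            yield ("error", ch)            # leading digit or any other character
    if state == IN_TERM:
        yield ("term", "".join(buf).lower())
    yield ("eos", "")

print(list(get_tokens("(Cat & dog2) | ^fish")))
```

Keeping the character classification separate from the transition logic mirrors the diagram: the edge labels become input classes, and each branch corresponds to one transition.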

Implementing Stoplists • There are two ways to filter stoplist words from an input token stream: • Examine lexical analyzer output and remove any stopwords. - Hashing • Remove stopwords as part of lexical analysis. - Deterministic finite automata (DFA)
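The first approach, filtering the lexical analyzer's output with a hash table, can be sketched in a few lines. Python's built-in `frozenset` is hash-based, so each membership test takes constant time on average. The stoplist contents and the name `remove_stopwords` are illustrative assumptions.

```python
# A tiny illustrative stoplist; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "in", "into", "to", "the", "of"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token found in the hash-based stoplist."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["the", "cat", "sat", "in", "a", "hat"]))
# ['cat', 'sat', 'hat']
```

This approach is simple, but every token is first built by the lexical analyzer and then looked up separately; the DFA approach folds the stoplist check into the scanning step itself.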

A Lexical Analyzer Generator • Read the stop words file • Create a DFA from the stop words • Run lexical analysis on the input file • Close the input file and return the terms

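DFA for stoplists • Stoplist L0 = {a, an, and, in, into, to} • [Figure: DFA with states q0 to q6. Each state is labeled with the set of suffixes still to be matched, where λ is the empty string. q0 = L0; a leads to {λ, n, nd}, then n leads to {λ, d}, then d leads to {λ}; i leads to {n, nto}, then n leads to {λ, to}; t from q0 or from {λ, to} leads to {o}; o leads to {λ}.]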
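One way to build such a DFA is to let each state be the set of suffixes that can still complete a stopword, and to follow a character by stripping it from every suffix that starts with it. The sketch below assumes that construction; the names `build_dfa` and `is_stopword` are hypothetical, and unlike a full lexical analyzer generator it only recognizes stopwords rather than also emitting non-stopword terms.

```python
def build_dfa(stoplist):
    """Build a DFA whose states are frozensets of suffixes left to match."""
    start = frozenset(stoplist)
    transitions = {}               # (state, char) -> next state
    accepting = set()              # states containing the empty suffix
    seen, pending = {start}, [start]
    while pending:
        state = pending.pop()
        if "" in state:
            accepting.add(state)   # some stopword ends exactly here
        by_char = {}
        for suffix in state:
            if suffix:
                # Group remaining suffixes by their first character.
                by_char.setdefault(suffix[0], set()).add(suffix[1:])
        for ch, rest in by_char.items():
            target = frozenset(rest)
            transitions[(state, ch)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return start, transitions, accepting

def is_stopword(word, dfa):
    """Run the DFA over word; a missing transition means rejection."""
    start, transitions, accepting = dfa
    state = start
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting

dfa = build_dfa(["a", "an", "and", "in", "into", "to"])
print(is_stopword("into", dfa), is_stopword("int", dfa))
# True False
```

For the stoplist L0 = {a, an, and, in, into, to} this construction produces seven states, because suffix sets that coincide (such as {o}, reached after "t" and after "int") are merged into one state.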

Thank you!!! Questions and Comments
