Lecture 4: Indexing Files

Lecture 4: Indexing Files • Inverted File • Lexical Analysis • Stop lists

Indexing • Arrangement of data (data structure) to permit fast searching • Which list is easier to search?
Unsorted: sow fox pig eel yak hen ant cat dog hog
Sorted: ant cat dog eel fox hen hog pig sow yak

Creating inverted files • Original Documents -> Word Extraction -> Inverted Files • Each inverted-file entry maps a word ID to document IDs:
W1: d1, d2, d3
W2: d2, d4, d7, d9
…
Wn: di, …, dn

Creating Inverted file • Map the file names to file IDs • Consider the following original documents:
D1 The Department of Computer Science was established in 1984.
D2 The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3 followed by the MSc in Computer Science which was started in 1991.
D4 The Department also produced its first PhD graduate in 1994.
D5 Our staff have contributed intellectually and professionally to the advancements in these fields.

Creating Inverted file • Red: stop word
D1 The Department of Computer Science was established in 1984.
D2 The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3 followed by the MSc in Computer Science which was started in 1991.
D4 The Department also produced its first PhD graduate in 1994.
D5 Our staff have contributed intellectually and professionally to the advancements in these fields.

Creating Inverted file • After removing stop words, stemming, lowercasing (optional), and deleting numbers (optional):
D1 depart comput scienc establish
D2 depart launch bsc hons comput studi
D3 follow msc comput scienc start
D4 depart produc phd graduat
D5 staff contribut intellectu profession advanc field
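The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the lecture's own code: the `STOP_WORDS` set is an assumption inferred by comparing the original documents with the processed ones, the name `preprocess` is hypothetical, and stemming is deliberately left out (a real stemmer such as Porter's would turn "department" into "depart").

```python
import re

# Assumed stop list: the words that disappear between the original
# documents and the processed documents on the slides.
STOP_WORDS = {
    "the", "of", "was", "in", "its", "first", "by", "which",
    "also", "our", "have", "and", "to", "these",
}

def preprocess(text):
    """Extract words, lowercase them, drop numbers and stop words (no stemming)."""
    # Matching letters only means digits such as "1987" are never extracted.
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

d2 = "The Department launched its first BSc(Hons) in Computer Studies in 1987."
print(preprocess(d2))
# ['department', 'launched', 'bsc', 'hons', 'computer', 'studies']
```

Applying a stemmer to each surviving token would then give the stemmed forms listed for D2.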

Creating Inverted file (unsorted)
Words        Documents
depart       d1,d2,d4
comput       d1,d2,d3
scienc       d1,d3
establish    d1
launch       d2
bsc          d2
hons         d2
studi        d2
follow       d3
msc          d3
start        d3
produc       d4
phd          d4
graduat      d4
staff        d5
contribut    d5
intellectu   d5
profession   d5
advanc       d5
field        d5

Creating Inverted file (sorted)
Words        Documents
advanc       d5
bsc          d2
comput       d1,d2,d3
contribut    d5
depart       d1,d2,d4
establish    d1
field        d5
follow       d3
graduat      d4
hons         d2
intellectu   d5
launch       d2
msc          d3
phd          d4
produc       d4
profession   d5
scienc       d1,d3
staff        d5
start        d3
studi        d2
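One way to build the sorted inverted file from the stemmed documents is sketched below. The function name `build_inverted_file` and the dictionary layout are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Stemmed documents, as produced by the preprocessing step.
docs = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}

def build_inverted_file(docs):
    """Map each word to the list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for word in docs[doc_id].split():
            if doc_id not in postings[word]:  # record each document once per word
                postings[word].append(doc_id)
    # Sorting by word gives the "sorted" inverted file.
    return dict(sorted(postings.items()))

index = build_inverted_file(docs)
print(index["depart"])  # ['d1', 'd2', 'd4']
print(index["comput"])  # ['d1', 'd2', 'd3']
```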

Searching on Inverted File • Binary search • Used at small scale • Build a thesaurus and combine it with techniques such as: • Hashing • B+tree • Pointer to the address in the indexed file
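Binary search over a sorted inverted file might look like the sketch below. It uses Python's standard `bisect` module and only a few entries from the example; the `entries` layout and the `lookup` name are assumptions made for illustration.

```python
from bisect import bisect_left

# A few (word, documents) entries from the sorted inverted file.
entries = [
    ("advanc", ["d5"]),
    ("comput", ["d1", "d2", "d3"]),
    ("depart", ["d1", "d2", "d4"]),
    ("scienc", ["d1", "d3"]),
    ("studi", ["d2"]),
]
words = [word for word, _ in entries]  # sorted keys used for the search

def lookup(word):
    """Return the documents for a word, or an empty list if it is not indexed."""
    i = bisect_left(words, word)  # O(log n) comparisons on the sorted word list
    if i < len(words) and words[i] == word:
        return entries[i][1]
    return []

print(lookup("comput"))  # ['d1', 'd2', 'd3']
print(lookup("zebra"))   # []
```

Binary search needs the whole word list in sorted order, which is why the slide pairs it with small collections; hashing or a B+tree serves the same lookup role at larger scale.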

Lexical Analysis for indexing • Word extraction • Spaces as English word boundaries • Chinese word segmentation • Stop word elimination • “a”, “an”, “the”, “about”, “etc”, “every”, “you”, etc. • Word stemming

Lexical Analysis • Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens • Lexical analysis is the first stage of: • Automatic indexing • Query processing

Lexical Analysis for Automatic Indexing • What counts as a word or token in the indexing scheme? (an easy problem?) • Digits • “Year 2000”, “Y2K” • Hyphens • “F-16”, “MS-DOS” • Other punctuation • “COMMAND.COM”, “max_size” (often in C code) • Case • IBM or ibm

Lexical Analysis for Automatic Indexing (cont.) • No technical difficulty in solving any of these problems • Must think about them carefully • Tradeoff between recall and precision • Breaking up hyphenated terms increases recall but decreases precision • Preserving case distinctions enhances precision but decreases recall
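The hyphen and case decisions can be exposed as tokenizer options, as in the rough sketch below. The function `tokenize` and its parameters `split_hyphens` and `fold_case` are hypothetical names chosen for this example.

```python
import re

def tokenize(text, split_hyphens=False, fold_case=True):
    """Tokenize text under two of the indexing choices discussed above."""
    if split_hyphens:
        # "MS-DOS" becomes "MS" and "DOS": more matches (recall), less exact (precision).
        text = text.replace("-", " ")
    # Letters and digits, optionally joined by internal hyphens.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    # Folding case makes "IBM" and "ibm" the same index term.
    return [t.lower() for t in tokens] if fold_case else tokens

print(tokenize("MS-DOS and F-16"))                      # ['ms-dos', 'and', 'f-16']
print(tokenize("MS-DOS and F-16", split_hyphens=True))  # ['ms', 'dos', 'and', 'f', '16']
print(tokenize("MS-DOS and F-16", fold_case=False))     # ['MS-DOS', 'and', 'F-16']
```

Whichever options are chosen for indexing must also be applied to queries, otherwise query terms will not match the index terms.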

Lexical Analysis for Query Processing • Depends on the design of the lexical analyzer for automatic indexing • Distinguish operators (Boolean operators, weighting function operators, etc.) • Process certain characters: • Control characters • “” for phrase search, {} for priority • Disallowed punctuation characters (error)
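A query lexer that separates Boolean operators, quoted phrases, and plain terms could be sketched as follows. The query syntax here (uppercase `AND`, `OR`, `NOT` and straight double quotes for phrases) is an assumption for illustration; priority braces and error handling for disallowed punctuation are left out.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}  # assumed operator spelling
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')  # a quoted phrase, or a run of non-space characters

def lex_query(query):
    """Turn a query string into (kind, value) tokens."""
    tokens = []
    for phrase, word in TOKEN_RE.findall(query):
        if phrase:
            tokens.append(("PHRASE", phrase.lower()))
        elif word in OPERATORS:
            tokens.append(("OP", word))
        elif word:
            tokens.append(("TERM", word.lower()))
    return tokens

print(lex_query('"Computer Science" AND NOT studies'))
# [('PHRASE', 'computer science'), ('OP', 'AND'), ('OP', 'NOT'), ('TERM', 'studies')]
```

Terms and phrases are lowercased here on the assumption that the index was built with case folding; the query analyzer has to mirror whatever the indexing analyzer did.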

STOPLISTS • Many of the most frequently occurring words in English (“the”, “of”, etc.) are worthless as index terms • Eliminating such words • Speeds processing • Saves huge amounts of space in indexes • Does not damage retrieval effectiveness • Stoplists are used to eliminate such words. E.g., • http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words • http://bll.epnet.com/help/ehost/Stop_Words.htm • http://www.syger.com/jsc/docs/stopwords/english.htm

STOPLISTS • Choices of words in stop list may vary from person to person. • The general idea is to find words that occur often so that they are not good terms for information retrieval. • How to use vector space model to find out a list of stop words? • How to find stop words in Chinese?
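One simple heuristic for the "words that occur often" idea is to flag words that appear in most documents of the collection. The sketch below uses document frequency on the five example documents; the function name `candidate_stop_words` and the 0.8 threshold are assumptions, and this is only one possible approach, not the lecture's answer to the questions above.

```python
import re
from collections import Counter

docs = [
    "The Department of Computer Science was established in 1984.",
    "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "followed by the MSc in Computer Science which was started in 1991.",
    "The Department also produced its first PhD graduate in 1994.",
    "Our staff have contributed intellectually and professionally "
    "to the advancements in these fields.",
]

def candidate_stop_words(docs, min_fraction=0.8):
    """Return words that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in docs:
        # A set counts each word once per document (document frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    cutoff = min_fraction * len(docs)
    return sorted(word for word, n in doc_freq.items() if n >= cutoff)

print(candidate_stop_words(docs))  # ['in', 'the']
```

Words found in nearly every document have very low inverse document frequency, so they contribute little to distinguishing documents in a vector space model. On a tiny collection the threshold is crude: lowering it to 0.6 would also flag "department" and "computer", which are useful index terms.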
