Lecture 4: Indexing Files • Inverted File • Lexical Analysis • Stop lists
Indexing • Arrangement of data (data structure) to permit fast searching • Which list is easier to search? • Unsorted: sow fox pig eel yak hen ant cat dog hog • Sorted: ant cat dog eel fox hen hog pig sow yak
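A minimal sketch of why the sorted list is easier to search, using the two word lists from the slide. The function names and the comparison counters are illustrative assumptions, not part of the lecture.

```python
unsorted_words = ["sow", "fox", "pig", "eel", "yak", "hen", "ant", "cat", "dog", "hog"]
sorted_words = sorted(unsorted_words)

def linear_search(words, target):
    """Scan entries one by one; works on any list. Returns (index, comparisons)."""
    for i, word in enumerate(words):
        if word == target:
            return i, i + 1
    return -1, len(words)

def binary_search(words, target):
    """Halve the search range each step; requires `words` to be sorted."""
    lo, hi, steps = 0, len(words), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if words[mid] == target:
            return mid, steps
        if words[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return -1, steps

print(linear_search(unsorted_words, "hog"))  # scans the whole unsorted list
print(binary_search(sorted_words, "hog"))    # needs only a few probes
```

With ten words the difference is small, but the number of probes in the sorted list grows with the logarithm of the list size, while the scan grows linearly.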
Creating inverted files • Original Documents → Word Extraction → Inverted Files • Documents are identified by Document IDs, words by Word IDs • W1: d1,d2,d3 • W2: d2,d4,d7,d9 • Wn: di,…dn
Creating Inverted file • Map the file names to file IDs • Consider the following Original Documents
D1 The Department of Computer Science was established in 1984.
D2 The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3 followed by the MSc in Computer Science which was started in 1991.
D4 The Department also produced its first PhD graduate in 1994.
D5 Our staff have contributed intellectually and professionally to the advancements in these fields.
Creating Inverted file • Red: stop word
D1 The Department of Computer Science was established in 1984.
D2 The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3 followed by the MSc in Computer Science which was started in 1991.
D4 The Department also produced its first PhD graduate in 1994.
D5 Our staff have contributed intellectually and professionally to the advancements in these fields.
Creating Inverted file • After stemming, make lowercase (option), delete numbers (option)
D1 depart comput scienc establish
D2 depart launch bsc hons comput studi
D3 follow msc comput scienc start
D4 depart produc phd graduat
D5 staff contribut intellectu profession advanc field
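The steps before stemming can be sketched as follows. The stop list here is an assumption: it contains only the words that disappear from D1 to D5 on the slides, not a standard list. Stemming is left out because it needs a real stemmer (for example a Porter-style one), so the output shows unstemmed words.

```python
import re

# Assumed stop list: just the words dropped from D1-D5 in the example above.
STOP_WORDS = {"the", "of", "was", "in", "its", "first", "by", "which",
              "also", "our", "have", "and", "to", "these"}

def preprocess(text, lowercase=True, drop_numbers=True):
    """Extract words, optionally lowercase and drop numbers, then remove stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)      # word extraction
    if lowercase:
        tokens = [t.lower() for t in tokens]        # optional case folding
    if drop_numbers:
        tokens = [t for t in tokens if not t.isdigit()]  # optional number removal
    return [t for t in tokens if t.lower() not in STOP_WORDS]

d1 = "The Department of Computer Science was established in 1984."
d2 = "The Department launched its first BSc(Hons) in Computer Studies in 1987."
print(preprocess(d1))
print(preprocess(d2))
```

Applying a stemmer to each remaining token would then give forms such as "depart" and "comput".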
Creating Inverted file (unsorted)
Words        Documents    Words        Documents
depart       d1,d2,d4     produc       d4
comput       d1,d2,d3     phd          d4
scienc       d1,d3        graduat      d4
establish    d1           staff        d5
launch       d2           contribut    d5
bsc          d2           intellectu   d5
hons         d2           profession   d5
studi        d2           advanc       d5
follow       d3           field        d5
msc          d3
start        d3
Creating Inverted file (sorted)
Words        Documents    Words        Documents
advanc       d5           intellectu   d5
bsc          d2           launch       d2
comput       d1,d2,d3     msc          d3
contribut    d5           phd          d4
depart       d1,d2,d4     produc       d4
establish    d1           profession   d5
field        d5           scienc       d1,d3
follow       d3           staff        d5
graduat      d4           start        d3
hons         d2           studi        d2
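The construction above can be sketched in a few lines. The input is the stemmed form of D1 to D5 taken from the slides; the function and variable names are illustrative assumptions.

```python
from collections import defaultdict

# Stemmed documents as shown on the slide (stop words and numbers already removed).
stemmed_docs = {
    "d1": ["depart", "comput", "scienc", "establish"],
    "d2": ["depart", "launch", "bsc", "hons", "comput", "studi"],
    "d3": ["follow", "msc", "comput", "scienc", "start"],
    "d4": ["depart", "produc", "phd", "graduat"],
    "d5": ["staff", "contribut", "intellectu", "profession", "advanc", "field"],
}

def build_inverted_file(docs):
    """Map each word to the list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            if doc_id not in postings[term]:   # record each document once per word
                postings[term].append(doc_id)
    return dict(postings)

inverted = build_inverted_file(stemmed_docs)   # unsorted: first-occurrence order
sorted_inverted = sorted(inverted.items())     # sorted alphabetically by word

for term, doc_ids in sorted_inverted:
    print(term, ",".join(doc_ids))
```

The dictionary keeps words in order of first occurrence, which corresponds to the unsorted table; sorting the entries by word gives the sorted table.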
Searching on Inverted File • Binary Search • Used at small scale • Create a thesaurus and combine techniques such as: • Hashing • B+tree • Pointer to the address in the indexed file
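A rough sketch of two of these lookup techniques, under stated assumptions: a Python list stands in for the indexed file, and each word stores an (offset, length) pointer into it. The three words and all names are illustrative; a B+tree is not shown.

```python
from bisect import bisect_left

# Sorted word list; pointers[i] = (offset, length) of the postings of terms[i].
terms = ["comput", "depart", "scienc"]
pointers = [(0, 3), (3, 3), (6, 2)]
# Flat list standing in for the indexed file on disk (assumption for illustration).
postings_file = ["d1", "d2", "d3", "d1", "d2", "d4", "d1", "d3"]

def read_postings(pointer):
    """Follow a pointer into the postings file."""
    offset, length = pointer
    return postings_file[offset:offset + length]

def lookup_binary(term):
    """Binary search over the sorted word list."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    return read_postings(pointers[i])

# Hashing: a dict maps each word directly to its pointer.
hash_index = dict(zip(terms, pointers))

def lookup_hash(term):
    pointer = hash_index.get(term)
    return read_postings(pointer) if pointer is not None else []

print(lookup_binary("depart"))
print(lookup_hash("scienc"))
```

Binary search needs no extra structure beyond the sorted list, while the hash table trades memory for constant-time lookup on average.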
Lexical Analysis for indexing • Word extraction • Spaces as English word boundaries • Chinese word segmentation • Stop words elimination • “a”, “an”, “the”, “about”, “etc”, “every”, “you”, etc. • Word stemming
Lexical Analysis • Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens • Lexical analysis is the first stage of: • Automatic indexing • Query processing
Lexical Analysis for Automatic Indexing • What counts as a word or token in the indexing scheme? (an easy problem?) • Digits • “Year 2000”, “Y2K” • Hyphens • “F-16”, “MS-DOS” • Other Punctuation • “COMMAND.COM”, “max_size” (often in C code) • Case • IBM or ibm
Lexical Analysis for Automatic Indexing (cont.) • No technical difficulty in solving any of these problems • Must think about them carefully • Tradeoff between recall and precision • Breaking up hyphenated terms increases recall but decreases precision • Preserving case distinctions enhances precision but decreases recall
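The hyphen and case decisions can be made explicit as tokenizer options. This is a minimal sketch; the function name, the option names, and the regular expressions are assumptions chosen for illustration.

```python
import re

def tokenize(text, split_hyphens=False, fold_case=False):
    """Turn a character stream into tokens under two indexing policies."""
    if split_hyphens:
        pattern = r"[A-Za-z0-9]+"                        # "MS-DOS" -> "MS", "DOS"
    else:
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"      # "MS-DOS" stays one token
    tokens = re.findall(pattern, text)
    return [t.lower() for t in tokens] if fold_case else tokens

text = "F-16 and MS-DOS at IBM"
print(tokenize(text))                                      # precision-oriented
print(tokenize(text, split_hyphens=True, fold_case=True))  # recall-oriented
```

With splitting and case folding, a query for "dos" or "ibm" matches more documents (higher recall) but can also match unrelated ones (lower precision).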
Lexical Analysis for Query Processing • Depends on the design of the lexical analyzer for automatic indexing • Distinguish operators (Boolean operators, weighting function operators etc.) • Process certain characters: • Control characters • “” for phrase search, {} for priority • Disallowed punctuation characters (error)
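A small query lexer along these lines might look as follows. The operator set (AND, OR, NOT), the token labels, and the choice to replace control characters with spaces are assumptions; the slide only fixes quotes for phrases and braces for priority.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}  # assumed Boolean operator set

TOKEN_RE = re.compile(
    r'\s*(?:"(?P<phrase>[^"]*)"|(?P<brace>[{}])|(?P<word>[A-Za-z0-9]+)|(?P<bad>\S))'
)

def lex_query(query):
    """Split a query into (kind, value) tokens; reject disallowed punctuation."""
    query = re.sub(r"[\x00-\x1f\x7f]", " ", query)   # control characters become spaces
    tokens = []
    for m in TOKEN_RE.finditer(query):
        if m.group("phrase") is not None:
            tokens.append(("PHRASE", m.group("phrase")))
        elif m.group("brace"):
            kind = "LBRACE" if m.group("brace") == "{" else "RBRACE"
            tokens.append((kind, m.group("brace")))
        elif m.group("word"):
            word = m.group("word")
            if word in OPERATORS:
                tokens.append(("OP", word))
            else:
                tokens.append(("TERM", word.lower()))
        else:
            raise ValueError("disallowed character in query: " + repr(m.group("bad")))
    return tokens

print(lex_query('{"inverted file" AND index} OR NOT hashing'))
```

Terms are lowercased here on the assumption that the index was built with case folding; the query lexer has to follow whatever the indexing lexer did.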
STOPLISTS • Many of the most frequently occurring words in English (“the”, “of”, etc.) are worthless as index terms • Eliminating such words • Speeds processing • Saves huge amounts of space in indexes • Does not damage retrieval effectiveness • Stoplists are used to eliminate such words. E.g., • http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words • http://bll.epnet.com/help/ehost/Stop_Words.htm • http://www.syger.com/jsc/docs/stopwords/english.htm
STOPLISTS • Choices of words in stop list may vary from person to person. • The general idea is to find words that occur often so that they are not good terms for information retrieval. • How to use vector space model to find out a list of stop words? • How to find stop words in Chinese?
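One possible way to approach the first question, offered as a sketch rather than the lecture's answer: in the vector space model a term that occurs in almost every document has an inverse document frequency (idf) near zero, so low-idf terms are stop word candidates. The threshold `max_idf` and the function name are assumptions, and the corpus is the five example documents.

```python
import math
import re
from collections import Counter

docs = [
    "The Department of Computer Science was established in 1984.",
    "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "followed by the MSc in Computer Science which was started in 1991.",
    "The Department also produced its first PhD graduate in 1994.",
    "Our staff have contributed intellectually and professionally to the "
    "advancements in these fields.",
]

def stop_word_candidates(texts, max_idf=0.3):
    """Return words whose idf = ln(N / df) is at most max_idf (assumed threshold)."""
    n = len(texts)
    df = Counter()
    for text in texts:
        df.update(set(re.findall(r"[a-z]+", text.lower())))  # count each doc once
    return sorted(t for t, count in df.items() if math.log(n / count) <= max_idf)

print(stop_word_candidates(docs))
```

On a corpus this small, content words such as "department" also appear in most documents, so the threshold has to be chosen with care and a much larger collection would be needed in practice.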