20-760 Web-based Information Architectures: How to Construct an Inverted List
Parsing & Indexing: Overview • Tasks • Build a set of indices • inverted list, idf, document id, normalized tf, word positions, … • Speed (Example) • On a PC with a 750 MHz CPU and 256 MB of memory, a C++ program that builds indices without positions takes 46-56 seconds on the 50 MB HTML collection. (The cleaned-up collection is 30 MB.) • A few seconds for your Java program on the Reuters-1000 collection • Memory • 1-5% of the size of the total uncompressed documents • E.g. 128 MB RAM for 2 GB text
Document Parsing • Read the corpus file “reut2-1000.plain” • Identify the document boundaries • <REUTERS ID="document id"> • Process each document: • Extract the document ID • Segment the text into tokens • e.g. Apple, REUTERS, U.S. … • In our case, split the text on whitespace and newlines • Case conversion (make all tokens lowercase) • Discard stopwords and other non-content words (e.g. numbers) • Word stemming • Count term frequencies, record positions • Update indices • Write the index out to a file, with terms in alphabetical order from a to z
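The first parsing steps above (finding document boundaries, tokenizing on whitespace, lowercasing, dropping numbers) can be sketched in Java as follows. This is only a minimal sketch under stated assumptions: the class and method names are hypothetical, the boundary pattern is taken literally from the slide, the rule for what counts as a "number" is an illustrative guess, and closing tags, stopword removal and stemming are not handled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocumentParserSketch {
    // Assumed boundary format, copied from the slide: <REUTERS ID="document id">
    private static final Pattern DOC_START =
            Pattern.compile("<REUTERS\\s+ID=\"([^\"]*)\">");

    /** Splits the corpus text into documents keyed by document ID, in file order. */
    static Map<String, String> splitDocuments(String corpus) {
        Map<String, String> docs = new LinkedHashMap<>();
        Matcher m = DOC_START.matcher(corpus);
        String currentId = null;
        int bodyStart = 0;
        while (m.find()) {
            if (currentId != null) {
                // The previous document ends where the next boundary tag starts.
                docs.put(currentId, corpus.substring(bodyStart, m.start()));
            }
            currentId = m.group(1);
            bodyStart = m.end();
        }
        if (currentId != null) {
            docs.put(currentId, corpus.substring(bodyStart));
        }
        return docs;
    }

    /** Splits on whitespace, lowercases, and drops tokens made only of digits, '.' and ','. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.matches("[0-9.,]+")) {
                continue; // hypothetical rule for "numbers"
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String corpus = "<REUTERS ID=\"1\">\nApple shares rose 3 points\n"
                + "<REUTERS ID=\"2\">\nU.S. exports fell\n";
        for (Map.Entry<String, String> doc : splitDocuments(corpus).entrySet()) {
            System.out.println(doc.getKey() + " -> " + tokenize(doc.getValue()));
        }
    }
}
```

In a real run the corpus string would be read from the corpus file instead of being hard-coded.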
Data Structure • You can use whatever you like, but a hashtable is simple to implement • Hashtable • Java provides such classes in java.util • Perl has hashes as a datatype, e.g. %words • C++ implements the associated list in the Standard Template Library (STL). The template class is called map. It keeps its keys in sorted order and is typically implemented as a balanced binary search tree (red-black tree) rather than a hash table; hash-based containers are separate (std::unordered_map since C++11). • You can also implement your own hashtable (see Ch. 13 of “Information Retrieval: Data Structures & Algorithms” by William B. Frakes and Ricardo Baeza-Yates) • Searching a hashtable is fast, O(1) on average, but scanning the keys in sorted order is not possible without sorting them first • B-tree and B+ tree (see section 2.3 of the above book for details)
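The trade-off above (fast hashed lookup versus ordered scanning) can be illustrated with two java.util classes. This is a minimal sketch, not a prescribed design: the class name TermMapSketch and its methods are hypothetical. It counts terms in a HashMap, then copies them into a TreeMap (a red-black tree) when the terms need to be written out alphabetically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class TermMapSketch {
    /** Counts term occurrences with a hash table: O(1) average insert and lookup. */
    static Map<String, Integer> countTerms(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Copies the entries into a TreeMap so that keys iterate in sorted order. */
    static TreeMap<String, Integer> sortedView(Map<String, Integer> counts) {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        String[] tokens = {"oil", "bank", "oil", "trade", "bank", "oil"};
        Map<String, Integer> counts = countTerms(tokens);
        // HashMap iteration order is unspecified; the TreeMap prints a to z.
        System.out.println(sortedView(counts)); // prints {bank=2, oil=3, trade=1}
    }
}
```

Building in a HashMap and sorting once at the end is one reasonable choice; using a TreeMap from the start would also work, at O(log n) per update.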
Associated List • An associated list (also known as an associative array or map) is a data structure that holds a list of pairs. Each pair is composed of a key and a value. The value can be a complex data structure. • In our case: key/value -> term / associated posting list • Accessing an associated list: you have the key and want to reach the associated value quickly. • There are many ways to implement an associated list: hash, B-tree, array
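For the term / posting list case above, the value side of the associated list can be sketched as follows. The names PostingListSketch, Posting, add and lookup are hypothetical, and the sketch assumes documents are processed one at a time, so that all occurrences of a term in one document arrive together.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PostingListSketch {
    /** One entry of a posting list: a document and the term's positions in it. */
    static class Posting {
        final String docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(String docId) {
            this.docId = docId;
        }

        int tf() {
            return positions.size(); // term frequency = number of recorded positions
        }
    }

    // Key: term. Value: posting list, one Posting per document containing the term.
    private final Map<String, List<Posting>> index = new HashMap<>();

    /** Records one occurrence of term at the given position in the given document. */
    void add(String term, String docId, int position) {
        List<Posting> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
        Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last == null || !last.docId.equals(docId)) {
            last = new Posting(docId); // first occurrence of the term in this document
            postings.add(last);
        }
        last.positions.add(position);
    }

    /** Returns the posting list for a term, or an empty list if the term is unknown. */
    List<Posting> lookup(String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        PostingListSketch sketch = new PostingListSketch();
        sketch.add("oil", "1", 1);
        sketch.add("oil", "1", 4);
        sketch.add("oil", "2", 2);
        for (Posting p : sketch.lookup("oil")) {
            System.out.println(p.docId + ": tf=" + p.tf() + " positions=" + p.positions);
        }
    }
}
```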
Hashtable • A hashtable provides insertion of and access to the associated value in constant time (on average) • A hashtable uses a hash function to map the key to the address where the associated value is stored: Hash(key) → value
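To make the Hash(key) mapping concrete, here is a minimal sketch of a hashtable with chaining. It is an illustration only, with assumptions labeled: the class name is hypothetical, the table has a fixed number of buckets and never resizes, and it reuses String.hashCode() as the hash function. Lookups stay near constant time only while the chains remain short.

```java
import java.util.ArrayList;
import java.util.List;

class ChainedHashTableSketch {
    private static class Entry {
        final String key;
        String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Each bucket holds the entries whose keys hash to the same index.
    private final List<List<Entry>> buckets = new ArrayList<>();

    ChainedHashTableSketch(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    /** Hash(key): maps the key to a bucket index in [0, capacity). */
    int bucketIndex(String key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    /** Inserts the pair, or overwrites the value if the key is already present. */
    void put(String key, String value) {
        List<Entry> bucket = buckets.get(bucketIndex(key));
        for (Entry e : bucket) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        bucket.add(new Entry(key, value));
    }

    /** Returns the value stored for the key, or null if the key is absent. */
    String get(String key) {
        for (Entry e : buckets.get(bucketIndex(key))) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedHashTableSketch table = new ChainedHashTableSketch(16);
        table.put("oil", "posting list for oil");
        System.out.println(table.get("oil"));
    }
}
```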
Indices • Format • <term> <idf> <doc id>:<normalized tf>:<tf>:<positions> • positions are separated by commas • IDF(t) = log2(N/n), where N is the number of documents in the whole collection and n is the number of documents that contain the term t • TFnorm = TF/TFmax • Sample
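The two formulas and the entry format above can be worked through in a few lines of Java. Everything concrete here is an assumption made for illustration: the class and method names, the numbers in main (N = 1000, n = 125, TF = 2, TFmax = 4), the reading of TFmax as the largest term frequency in the same document, and the choice of four decimal places, which the slide does not specify.

```java
import java.util.Locale;

class IndexWeightsSketch {
    /** IDF(t) = log2(N / n), computed via natural logarithms. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm) / Math.log(2.0);
    }

    /** TFnorm = TF / TFmax. */
    static double normalizedTf(int tf, int maxTf) {
        return (double) tf / maxTf;
    }

    /** Builds "<term> <idf> <doc id>:<normalized tf>:<tf>:<positions>" for one document. */
    static String formatEntry(String term, double idf, String docId,
                              double normTf, int tf, int[] positions) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < positions.length; i++) {
            if (i > 0) {
                joined.append(',');
            }
            joined.append(positions[i]);
        }
        return String.format(Locale.ROOT, "%s %.4f %s:%.4f:%d:%s",
                term, idf, docId, normTf, tf, joined.toString());
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1000 documents, 125 of them contain "oil".
        double idf = idf(1000, 125);        // log2(8), i.e. 3
        double normTf = normalizedTf(2, 4); // 0.5
        System.out.println(formatEntry("oil", idf, "17", normTf, 2, new int[] {3, 9}));
        // prints: oil 3.0000 17:0.5000:2:3,9
    }
}
```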
Stopword Recognition • There are usually fewer than 500 stopwords • Some systems have very few • Every word token is checked, so the test should be very fast • Store the stopword list in a hash table • Since stopword lists evolve slowly, calculate a perfect hash code • Look up each word token in the hash table • If found, the token is a stopword, so discard it • Document length & word locations should count stopwords • Example: “Library of Congress” has a length of 3, with locations 1 2 3
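A minimal sketch of the stopword check follows, showing how positions keep counting stopwords even though the stopwords themselves are not indexed. It swaps in an ordinary java.util.HashSet in place of the perfect hash mentioned above, and the five-word stopword list and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StopwordFilterSketch {
    // Tiny illustrative list; a real list would have up to a few hundred entries.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    /**
     * Returns term -> 1-based positions. Stopwords advance the position
     * counter but are not added to the result.
     */
    static Map<String, List<Integer>> indexPositions(List<String> tokens) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int position = 0;
        for (String token : tokens) {
            position++; // every token counts toward locations and document length
            if (STOPWORDS.contains(token)) {
                continue; // discard the stopword itself
            }
            positions.computeIfAbsent(token, t -> new ArrayList<>()).add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("library", "of", "congress");
        System.out.println("length = " + tokens.size()); // prints length = 3
        System.out.println(indexPositions(tokens));      // prints {library=[1], congress=[3]}
    }
}
```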
Good Luck! • Due at 7:00pm on July 19.