20-760 Web-based Information Architectures

20-760 Web-based Information Architectures: How to Construct an Inverted List

Parsing & Indexing: Overview • Tasks • Build a set of indices • inverted list, idf, document id, normalized tf, word positions, … • Speed (Example) • On a PC with a 750 MHz CPU and 256 MB of memory, a C++ program that builds indices without positions runs in 46-56 seconds on the 50 MB HTML collection (the cleaned-up collection is 30 MB) • Expect a few seconds for your Java program on the Reuters-1000 collection • Memory • 1-5% of the size of the total uncompressed documents • E.g. 128 MB RAM for 2 GB of text

Document Parsing: sample document

Document Parsing • Read the corpus file “reut2-1000.plain” • Identify the document boundaries • <REUTERS ID="document id"> • Process each document: • Extract the document ID • Segment the text into tokens • e.g. Apple, REUTERS, U.S. … • In our case, split the text on whitespace and newlines • Case conversion (make all tokens lowercase) • Discard stopwords and other non-content words (e.g. numbers) • Word stemming • Count term frequencies, record positions • Update the indices • Write the index out to a file in alphabetical order, from a to z
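The per-document steps above can be sketched roughly as follows. This is a minimal illustration, not the assignment's reference solution: the class name `ParseSketch`, the five-word stopword list, and the rule for spotting numbers are all hypothetical, the boundary pattern assumes the ID attribute uses straight double quotes, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParseSketch {
    // Assumed boundary format: <REUTERS ID="123"> with straight double quotes.
    static final Pattern DOC_START = Pattern.compile("<REUTERS ID=\"([^\"]+)\">");

    // Tiny illustrative stopword list; a real one would be loaded from a file.
    static final Set<String> STOPWORDS = Set.of("of", "the", "a", "in", "and");

    /** Returns the document id found on a boundary line, or null if there is none. */
    static String documentId(String line) {
        Matcher m = DOC_START.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Maps each kept term to its 1-based token positions; tf is the list size. */
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return positions;
        }
        int position = 0;
        for (String raw : trimmed.split("\\s+")) {   // split on spaces and newlines
            position++;                              // discarded tokens still advance the position
            String token = raw.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(token)) {
                continue;
            }
            if (token.matches("[0-9.,]+")) {         // assumed rule for "numbers"
                continue;
            }
            // A stemmer would be applied here; it is omitted in this sketch.
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(documentId("<REUTERS ID=\"42\">"));
        System.out.println(termPositions("Library of Congress 1987 library"));
    }
}
```

Keeping the positions in a list per term means the term frequency never has to be counted separately: it is simply the length of that list.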

Data Structure • You can use whatever you like, but a hashtable is simple to implement • Hashtable • Java provides such classes in java.util • Perl has hashes as a datatype, e.g. %words • C++ implements the associated list in the Standard Template Library (STL). The template class is called map. It is typically implemented as a balanced binary search tree (a red-black tree) rather than a hash, so it keeps its keys in sorted order. • You can also implement your own hashtable (see Ch. 13 of “Information Retrieval: Data Structures & Algorithms” by William B. Frakes and Ricardo Baeza-Yates) • Searching a hashtable is fast, O(1) on average, but scanning the keys in sorted order is not possible without sorting them first • B-tree and B+ tree (see section 2.3 of the above book for details)

Associated List • An associated list (also called an associative array) is a data structure made of a list of pairs. Each pair is composed of a key and a value. The value can be a complex data structure. • In our case: Key/value -> Term / Associated posting list • Accessing an associated list: you have the key, and you want to reach the associated value quickly. • There are many ways of implementing the associated list: Hash, B-tree, Array
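To make the term / posting list pairing concrete, here is a small sketch built on `java.util.HashMap`. The class `AssociatedListSketch`, its `Posting` type, and the method names are hypothetical, and a posting here carries only a document id and a raw tf.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class AssociatedListSketch {
    /** One posting: a document and how often the term occurs in it. */
    static final class Posting {
        final String docId;
        final int tf;

        Posting(String docId, int tf) {
            this.docId = docId;
            this.tf = tf;
        }

        @Override
        public String toString() {
            return docId + ":" + tf;
        }
    }

    // Key = term, value = posting list for that term.
    private final Map<String, List<Posting>> index = new HashMap<>();

    /** Appends a posting to the term's list, creating the list on first use. */
    void add(String term, String docId, int tf) {
        index.computeIfAbsent(term, k -> new ArrayList<>()).add(new Posting(docId, tf));
    }

    /** Returns the posting list for a term, or an empty list if the term is unknown. */
    List<Posting> postings(String term) {
        return index.getOrDefault(term, List.of());
    }

    /** A HashMap has no useful key order, so the terms are sorted before output. */
    List<String> termsInOrder() {
        return new ArrayList<>(new TreeSet<>(index.keySet()));
    }

    public static void main(String[] args) {
        AssociatedListSketch idx = new AssociatedListSketch();
        idx.add("oil", "7", 2);
        idx.add("apple", "3", 1);
        idx.add("oil", "9", 1);
        for (String term : idx.termsInOrder()) {
            System.out.println(term + " " + idx.postings(term));
        }
    }
}
```

Sorting the keys once at output time keeps lookups fast during indexing while still allowing the index file to be written from a to z.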

Hashtable • A hashtable inserts and retrieves the associated value in constant time on average • A hashtable uses a hash function to map the key to the address where the associated value is stored: Hash(key) -> address of value
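The mapping from key to storage address can be illustrated with a deliberately small table that resolves collisions by chaining. This is a teaching sketch under stated assumptions, not how `java.util.HashMap` is implemented: `TinyHashTable` and its methods are hypothetical names, the bucket count is fixed, and there is no resizing, so lookups stay near constant time only while the chains stay short.

```java
import java.util.ArrayList;
import java.util.List;

class TinyHashTable {
    private static final class Entry {
        final String key;
        int value;

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<List<Entry>> buckets = new ArrayList<>();

    TinyHashTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    /** Hash(key) -> bucket index; floorMod keeps the result non-negative. */
    int bucketFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    /** Stores the value, overwriting any previous value for the same key. */
    void put(String key, int value) {
        List<Entry> bucket = buckets.get(bucketFor(key));
        for (Entry e : bucket) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        bucket.add(new Entry(key, value));
    }

    /** Returns the stored value, or null if the key is absent. */
    Integer get(String key) {
        for (Entry e : buckets.get(bucketFor(key))) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TinyHashTable table = new TinyHashTable(16);
        table.put("apple", 3);
        table.put("oil", 7);
        System.out.println(table.get("apple") + " " + table.get("oil") + " " + table.get("gold"));
    }
}
```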

Indices • Format • <term> <idf> <doc id>:<normalized tf>:<tf>:<positions> • Positions are separated by commas • IDF(t) = log2(N/n), where N is the number of documents in the whole collection and n is the number of documents that contain the term t • TFnorm = TF/TFmax • Sample
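The two formulas and the line layout above might be coded along these lines. Several details are assumptions that the slide does not settle: TFmax is taken to be the largest term frequency within the same document, values are printed with four decimal places, and multiple postings on one line are separated by single spaces. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class IndexFormatSketch {
    /** IDF(t) = log2(N / n), computed via natural logarithms. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm) / Math.log(2.0);
    }

    /** TFnorm = TF / TFmax; tfMax is assumed to be the document's largest tf. */
    static double normalizedTf(int tf, int tfMax) {
        return (double) tf / tfMax;
    }

    /** Builds "<doc id>:<normalized tf>:<tf>:<positions>" with comma-separated positions. */
    static String posting(String docId, int tf, int tfMax, List<Integer> positions) {
        String joined = positions.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        return String.format(Locale.ROOT, "%s:%.4f:%d:%s",
                docId, normalizedTf(tf, tfMax), tf, joined);
    }

    /** Builds "<term> <idf> <posting> <posting> ..." (space separator is an assumption). */
    static String line(String term, double idf, List<String> postings) {
        return String.format(Locale.ROOT, "%s %.4f %s", term, idf, String.join(" ", postings));
    }

    public static void main(String[] args) {
        // Made-up numbers: 8 documents in total, the term appears in 2 of them.
        String p = posting("12", 2, 4, List.of(3, 17));
        System.out.println(line("apple", idf(8, 2), List.of(p)));
    }
}
```

With the made-up numbers in `main`, N/n is 4, so the idf is 2, and a tf of 2 against a TFmax of 4 gives a normalized tf of 0.5.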

Stopword Recognition • There are usually fewer than 500 stopwords • Some systems have very few • Every word token is checked, so the test should be very fast • Store the stopword list in a hash table • Since stopword lists evolve slowly, calculate a perfect hash code • Look up each word token in the hash table • If found, the token is a stopword, so discard it • Document length & word locations should count stopwords • Example: “Library of Congress” has a length of 3; locations: 1, 2, 3
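The “Library of Congress” example can be reproduced with a short sketch. It uses an ordinary `java.util.HashSet` in place of the perfect hash suggested above, which is a simplification, and the class and method names are hypothetical. The point it shows is that discarded stopwords still occupy a location and still count toward the document length.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopwordSketch {
    private final Set<String> stopwords;

    StopwordSketch(Set<String> stopwords) {
        this.stopwords = new HashSet<>(stopwords);
    }

    /** Hash lookup on the lowercased token. */
    boolean isStopword(String token) {
        return stopwords.contains(token.toLowerCase(Locale.ROOT));
    }

    /** Maps 1-based location -> kept token; stopwords are skipped but still numbered. */
    Map<Integer, String> contentTokensByLocation(String[] tokens) {
        Map<Integer, String> kept = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!isStopword(tokens[i])) {
                kept.put(i + 1, tokens[i].toLowerCase(Locale.ROOT));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        StopwordSketch sketch = new StopwordSketch(Set.of("of", "the"));
        String[] tokens = {"Library", "of", "Congress"};
        System.out.println("length = " + tokens.length);          // stopwords included
        System.out.println(sketch.contentTokensByLocation(tokens));
    }
}
```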

Good Luck! • Due at 7:00 pm on July 19.
