IR Homework #1

By J. H. Wang, Mar. 21, 2014

Programming Exercise #1: Vector Space Retrieval • Goal: to build an inverted index for a text collection, and to search for documents relevant to a given query • Input: a set of text documents, and a user query • Output: relevant documents in a ranked list • Tools: either use open source tools or write your own code in any programming language

Major Tasks • Indexing • Given a set of text documents, build an inverted index • Searching • Given a user query, find the most relevant documents in a ranked list

Steps in Vector Space Retrieval • [Figure not reproduced: steps 1 and 2] (NTUT CSIE)

Some Open Source Tools • Apache Lucene/Solr (in Java) • The Lemur Project: Indri (in C++), Galago (in Java) – by CMU/UMass • Terrier – by U. Glasgow (in Java) • …

Input 1: the Test Collection • ClueWeb09 dataset • http://lemurproject.org/clueweb09.php/ • 1,040,809,705 Web pages in 10 languages, collected in Jan.-Feb. 2009 • 5TB compressed (25TB uncompressed) • File format: WARC (Web ARChive file format) • http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml • Sample files: http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Sample+Files • Each file contains about 40,000 Web pages and is about 1GB in size • Each team will be randomly allocated different files!

Other Test Collections • Reuters-RCV1 (used in the textbook): http://trec.nist.gov/data/reuters/reuters.html • About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed) • Requires signing agreements • Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/ • 21,578 news articles from 1987 (28.0MB uncompressed) • Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/ • LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI • Ex: The Time Collection: 423 documents (1.5MB)

Indexing: Building an Inverted Index • E.g.: using the standard positional index as the format (Chap. 1 & 2): • Dictionary file: a sorted list of vocabulary terms (one per line) • Postings list: for each term, a list of its occurrences in the original text • term_i, df_i: <doc_1, tf_i1: <pos_1, pos_2, …>; doc_2, tf_i2: <pos_1, pos_2, …>; …> (as in Fig. 2.11, Sec. 2.4, p.38) • df_i: document frequency of term_i • tf_ij: term frequency of term_i in doc_j • Ex: to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; …> • …
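The positional format above can be sketched in a few lines of Python. This is a minimal illustration, not a required design: the function names (`tokenize`, `build_positional_index`, `format_postings`), the regular-expression tokenizer, and the two toy documents are all assumptions made for the example. Positions are counted from 1, as in the sample entry on the slide.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Build {term: {doc_id: [positions]}} from docs given as {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(tokenize(docs[doc_id]), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def format_postings(term, postings):
    """Render one term as 'term, df: <doc, tf: <pos, ...>; ...>'."""
    entries = "; ".join(
        f"{doc_id}, {len(positions)}: <{', '.join(map(str, positions))}>"
        for doc_id, positions in sorted(postings.items())
    )
    return f"{term}, {len(postings)}: <{entries}>"

# Toy collection (hypothetical); real input would be the allocated ClueWeb09 files.
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
for term in sorted(index):  # sorted terms correspond to the dictionary file
    print(format_postings(term, index[term]))
```

Here df_i is the number of documents in a term's postings and tf_ij is the length of the position list for doc_j, so neither needs to be stored separately in memory.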

Design Issues • pos means the token position in the body of the document • Storing positions makes later steps, e.g., proximity search, easier to implement • You can design different index formats, as long as • The necessary information can be accessed for ranking • Dictionary: terms t_i and the corresponding document frequency df_i • Postings: (DocID, term frequency tf_ij, positions) for each term • Preprocessing should be handled with care • Different formats for different collections • Digits, hyphens, punctuation marks, …

Optional Functionality • Efficiency issues • A separate data structure (e.g., a trie) can be used to store the vocabulary and postings in your indexer • Skip pointers • Tokenization • Case folding • Stopword removal • Stemming • Each option should be switchable on/off by a parameter
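One way to make the tokenization options switchable is to expose each as a keyword parameter. The sketch below is only an assumption about how this could look: the function name `preprocess`, its parameters, and the tiny stopword list are hypothetical, and stemming is left as a caller-supplied function (for example, a Porter stemmer from a library of your choice) rather than implemented here.

```python
import re

# Tiny illustrative stopword list (an assumption); a real list would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

def preprocess(text, case_folding=True, remove_stopwords=False, stemmer=None):
    """Tokenize text, applying each optional step only when its parameter is set."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Compare in lowercase so stopwords are caught even without case folding.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemmer is not None:
        tokens = [stemmer(t) for t in tokens]
    return tokens

print(preprocess("The State of the Art"))
print(preprocess("The State of the Art", remove_stopwords=True))
print(preprocess("The State of the Art", case_folding=False))
```

Note that removing stopwords before assigning positions changes the token positions, which matters if the index is later used for phrase or proximity search.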

Input 2: User Query • Simple queries • Single keywords • Ex: Tucson, Microsoft, … • Free texts with multiple words • Ex: United States, Mount Carmel, … • Simple Boolean search • Ex: open source AND Linux, software engineer OR project manager, …
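For the simple Boolean case, AND and OR reduce to intersecting or merging the postings lists of the operands. The following is a minimal sketch, assuming each postings list is a sorted list of document IDs; the function names and the sample lists are illustrative only. Multi-word operands such as "open source" would first need their own phrase or free-text handling, which is not shown.

```python
def intersect(p1, p2):
    """AND: linear-time merge of two sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def union(p1, p2):
    """OR: all document IDs that appear in either list, in sorted order."""
    return sorted(set(p1) | set(p2))

# Hypothetical postings lists (sorted document IDs) for two terms.
linux = [1, 2, 4, 11, 31]
source = [2, 31, 54]
print(intersect(linux, source))
print(union(linux, source))
```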

Output: Ranked List • A ranked list of search results from the ClueWeb09 collection • Ranking: vector space model • Term weighting scheme: TF-IDF • Similarity estimation: cosine similarity between query and document vectors

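Searching: Scoring and Ranking Documents • Vector space model • Term weighting: TF-IDF • $w_{ij} = (1 + \log tf_{ij}) \cdot \log(N / df_i)$, where $tf_{ij}$ is the frequency of term $i$ in document $j$, $df_i$ is the document frequency of term $i$, and $N$ is the number of documents in the collection • Similarity estimation: cosine similarity between the query vector $q$ and each document vector $d_j$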
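The weighting and similarity computation can be sketched as follows. Several choices here are assumptions rather than part of the assignment: base-10 logarithms, weighting the query with the same TF-IDF formula as the documents, dropping documents whose score is zero, and the toy tokenized collection. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Weight each term as (1 + log tf) * log(N / df); base-10 logs are an assumption."""
    vec = {}
    for term, tf in Counter(tokens).items():
        if df.get(term, 0) > 0:  # ignore terms that never occur in the collection
            vec[term] = (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_tokens, docs):
    """Return (doc_id, score) pairs with nonzero score, best first."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    q_vec = tfidf_vector(query_tokens, df, n_docs)
    scores = [(doc_id, cosine(q_vec, tfidf_vector(tokens, df, n_docs)))
              for doc_id, tokens in docs.items()]
    hits = [(doc_id, score) for doc_id, score in scores if score > 0.0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# Toy tokenized collection (hypothetical).
docs = {
    1: ["hong", "kong", "film"],
    2: ["king", "kong", "film"],
    3: ["tax", "law"],
}
for doc_id, score in rank(["hong", "kong"], docs):
    print(doc_id, f"{score:.5f}")
```

A term that occurs in every document gets weight log(N/N) = 0 under this formula, so it contributes nothing to the score. In a real system the document vectors would be built from the inverted index rather than recomputed for every query.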

Example Output • Query: “Hong Kong” • Result: one <doc#> <similarity score> pair per retrieved document, in decreasing order of score • E.g.: 261 0.85135 0.67324 0.3…

Optional Features • Better user interface for search • Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3) • Query processing: spell-correction, phonetic correction, … (Ch.3) • Different term weighting schemes: variants of TF-IDF, … (Ch.6) • Inexact top-k retrieval: index elimination, champion lists, impact-ordering, tiered index, … (Ch.7) • Each feature should be switchable on/off by a parameter

Submission • Your submission *should* include • The source code (or your configuration of installed open source tool) • A one-page description that includes the following • Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …) • Instructions for compilation/execution environments (ex: Java Runtime Environment, special compilers, …) • Major difficulties encountered • Team members list: The names and the responsible parts of each individual member should be clearly identified • Due: three weeks (Apr. 18, 2014)

Submission Instructions • Programs or homework in electronic files must be submitted directly on the submission site: • Submission site: http://140.124.183.31/net2ftp • FTP server: localhost • User name & password: Your student ID • Preparing your submission file: as one single compressed file • Remember to specify the names and student IDs of your team members in the files and documentation • If you cannot successfully submit your work, please contact the TA (Mr. Huang, @ R1424, Technology Building) • Available time: Mon. morning or Tue. afternoon • E-mail: jsn900211 @ gmail . com

Evaluation • Minimum requirement: correctness for simple queries in vector space retrieval • Using the (partial) ClueWeb09 Test Collection and some sample queries as the input, the ranked list of documents retrieved by your system will be checked • Optional features will be considered as bonus • You may be required to give a demo if the TA is unable to compile or run your submitted program

Any Questions or Comments?
