
IR Homework #1

IR Homework #1. By J. H. Wang, Mar. 21, 2014. Programming Exercise #1: Vector Space Retrieval. Goal: to build an inverted index for a text collection, and to search for documents relevant to a given query. Input: a set of text documents, and a user query.




  1. IR Homework #1 By J. H. Wang Mar. 21, 2014

  2. Programming Exercise #1: Vector Space Retrieval • Goal: to build an inverted index for a text collection, and to search for documents relevant to a given query • Input: a set of text documents, and a user query • Output: relevant documents in a ranked list • Tools: either use open source tools or write your own code in any programming language

  3. Major Tasks • Indexing • Given a set of text documents, build an inverted index • Searching • Given a user query, find the most relevant documents in a ranked list

  4. Steps in Vector Space Retrieval (figure not reproduced in this transcript; it marks two steps, labeled 1 and 2) NTUT CSIE

  5. Some Open Source Tools • Apache Lucene/Solr (in Java) • The Lemur Project, Indri, Galago – by CMU/UMass (in C++) • Terrier – by U. Glasgow (in Java) • …

  6. Input 1: the Test Collection • ClueWeb09 dataset • http://lemurproject.org/clueweb09.php/ • 1,040,809,705 Web pages in 10 languages, collected in Jan.-Feb. 2009 • 5TB compressed (25TB uncompressed) • File format: WARC (Web ARChive file format) • http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml • Sample files: http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Sample+Files • Each file contains about 40,000 Web pages (about 1GB) • Each team will be randomly allocated different files!

  7. Other Test Collections • Reuters-RCV1 (used in the textbook): http://trec.nist.gov/data/reuters/reuters.html • About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed) • Requires signing agreements • Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/ • 21,578 news articles from 1987 (28.0MB uncompressed) • Test collections held at the University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/ • LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI • Ex: the Time collection: 423 documents (1.5MB)

  8. Indexing: Building an Inverted Index • E.g.: using the standard positional index as the format (Chap. 1 & 2): • Dictionary file: a sorted list of vocabulary terms (one per line) • Postings list: for each term, a list of its occurrences in the original text • term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4, p.38) • df_i: document frequency of term_i • tf_ij: term frequency of term_i in doc_j • to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; …> • …
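
The positional format above can be sketched in a few lines of Python. This is only an illustration, not the required implementation: the function names `tokenize`, `build_positional_index`, and `format_entry` are hypothetical, the tokenizer is a simple lowercase whitespace split, and positions are assumed to start at 1.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real indexer needs more care.
    return text.lower().split()


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for docs given as {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(tokenize(docs[doc_id]), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def format_entry(term, postings):
    """Render one entry as: term, df: <doc, tf: <pos, ...>; doc, tf: <pos, ...>>."""
    parts = [
        f"{doc_id}, {len(positions)}: <{', '.join(map(str, positions))}>"
        for doc_id, positions in sorted(postings.items())
    ]
    return f"{term}, {len(postings)}: <{'; '.join(parts)}>"


docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(sorted(index))                     # the dictionary: sorted vocabulary terms
print(format_entry("to", index["to"]))   # to, 2: <1, 2: <1, 5>; 2, 2: <1, 4>>
```

Here df is the number of documents in a term's postings and tf is the length of each position list, so neither needs to be stored separately in memory.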

  9. Design Issues • pos means the token position in the body of a document • Storing positions simplifies later steps, e.g., proximity search • You can design a different index format, as long as • The information necessary for ranking can be accessed • Dictionary: terms t_i and the corresponding document frequency df_i • Postings: (DocID, term frequency tf_ij, Loc) for each term • Preprocessing should be handled with care • Different formats for different collections • Digits, hyphens, punctuation marks, …
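
As a rough illustration of why token positions help, the sketch below answers a proximity query ("two terms within k tokens of each other") from positional postings. The index layout `{term: {doc_id: [sorted positions]}}` and the names `within_k` and `proximity_search` are assumptions made for this example, not part of the assignment.

```python
def within_k(positions_a, positions_b, k):
    """Return True if some position in each sorted list differs by at most k."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to shrink the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


def proximity_search(index, term_a, term_b, k):
    """Return sorted doc IDs where term_a and term_b occur within k tokens."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    common = sorted(set(postings_a) & set(postings_b))
    return [d for d in common if within_k(postings_a[d], postings_b[d], k)]


# Hypothetical positional postings: term -> {doc_id: [positions]}
index = {
    "open": {1: [3, 40], 2: [10]},
    "source": {1: [4], 2: [90], 3: [5]},
}
print(proximity_search(index, "open", "source", 1))    # [1]
print(proximity_search(index, "open", "source", 100))  # [1, 2]
```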

  10. Optional Functionality • Efficiency issues • A separate data structure (e.g., a trie) can be used to store the vocabulary and postings in your indexer • Skip pointers • Tokenization • Case folding • Stopword removal • Stemming • Each option should be switchable on/off by a parameter
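
One possible way to make preprocessing options switchable by parameters is sketched below. The function name `preprocess`, its flags, the alphanumeric tokenization rule, and the tiny stopword list are all assumptions for illustration; stemming is left out because it would normally rely on an external stemmer.

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def preprocess(text, case_folding=True, remove_stopwords=True):
    """Tokenize text, with case folding and stopword removal as optional steps."""
    # Assumption: a token is a maximal run of ASCII letters or digits.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        # Compare in lowercase so removal also works when case folding is off.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


text = "The University of Glasgow hosts the Time collection"
print(preprocess(text))
# ['university', 'glasgow', 'hosts', 'time', 'collection']
print(preprocess(text, case_folding=False, remove_stopwords=False))
# ['The', 'University', 'of', 'Glasgow', 'hosts', 'the', 'Time', 'collection']
```

Whatever options are chosen, the same settings should be applied to both documents and queries so that query terms match indexed terms.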

  11. Input 2: User Query • Simple queries • Single keywords • Ex: Tucson, Microsoft, … • Free texts with multiple words • Ex: United States, Mount Carmel, … • Simple Boolean search • Ex: open source AND Linux, software engineer OR project manager, …
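
For the simple Boolean queries above, AND and OR reduce to intersecting or merging sorted lists of document IDs. The sketch below assumes single-term operands and hypothetical postings; the names `intersect` and `union` are illustrative, and multi-word operands such as "open source" would first need to be resolved (for example, as a phrase) into a document-ID list.

```python
def intersect(p1, p2):
    """AND: linear merge of two sorted doc-ID lists."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: sorted list of doc IDs that appear in either list."""
    return sorted(set(p1) | set(p2))


# Hypothetical postings: term -> sorted doc IDs
postings = {"linux": [1, 3, 5, 8], "software": [2, 3, 8, 9], "manager": [4, 9]}
print(intersect(postings["linux"], postings["software"]))  # [3, 8]
print(union(postings["software"], postings["manager"]))    # [2, 3, 4, 8, 9]
```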

  12. Output: Ranked List • A ranked list of search results from the ClueWeb09 collection • Ranking: vector space model • Term weighting scheme: TF-IDF • Similarity estimation: cosine similarity between query and document vectors

  13. Searching: scoring and ranking documents • Vector space model • Term weighting: TF-IDF • w_ij = (1 + log tf_ij) * log(N / df_i) • Similarity estimation: cosine similarity between the query vector q and each document vector d_j
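
A minimal sketch of this scoring step, under stated assumptions: logarithms are taken base 10 (the slide does not give the base), the query is weighted with the same TF-IDF formula as the documents, and documents are already tokenized. The names `tfidf_vector`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, df, n_docs):
    """Sparse vector with w = (1 + log tf) * log(N / df); base 10 is an assumption."""
    tf = Counter(tokens)
    return {
        term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
        for term, count in tf.items()
        if term in df  # terms unseen in the collection get no weight
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    q = tfidf_vector(query_tokens, df, n_docs)
    scores = {d: cosine(q, tfidf_vector(toks, df, n_docs)) for d, toks in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy collection, invented for illustration only
docs = {
    1: "hong kong harbour".split(),
    2: "hong kong stock market news".split(),
    3: "taipei stock market".split(),
}
ranking = rank("hong kong".split(), docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 5))
```

On this toy collection, document 1 ranks above document 2 because its vector carries less weight on non-query terms, and document 3 scores 0 because it shares no term with the query. A real system would compute scores from the inverted index rather than rebuilding every document vector per query.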

  14. Example Output • Ex: • Query: “Hong Kong” • Result: <doc#> <similarity score> • E.g.: 261 0.85135 0.67324 0.3…

  15. Optional Features • Optional functionalities • Better user interface for search • Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3) • Query processing: spelling correction, phonetic correction, … (Ch.3) • Different term weighting schemes: variants of TF-IDF, … (Ch.6) • Inexact top-k retrieval: index elimination, champion lists, impact ordering, tiered indexes, … (Ch.7) • Each option should be switchable on/off by a parameter

  16. Submission • Your submission *should* include • The source code (or your configuration of the installed open source tool) • A one-page description that includes the following • Major features of your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …) • Instructions for the compilation/execution environment (ex: Java Runtime Environment, special compilers, …) • Major difficulties encountered • Team member list: the name and responsibilities of each member should be clearly identified • Due: three weeks (Apr. 18, 2014)

  17. Submission Instructions • Programs or homework in electronic files must be submitted directly on the submission site: • Submission site: http://140.124.183.31/net2ftp • FTP server: localhost • User name & password: your student ID • Prepare your submission as one single compressed file • Remember to specify the names and student IDs of your team members in the files and documentation • If you cannot successfully submit your work, please contact the TA (Mr. Huang, @ R1424, Technology Building) • Available time: Mon. morning or Tue. afternoon • E-mail: jsn900211 @ gmail . com

  18. Evaluation • Minimum requirement: correctness for simple queries in vector space retrieval • Using the (partial) ClueWeb09 test collection and some sample queries as input, the ranked list of documents retrieved by your system will be checked • Optional features will be considered as a bonus • You may be required to give a demo if the TA is unable to compile or run the submitted program

  19. Any Questions or Comments?
