IRTools Software Overview




Presentation Transcript


  1. IRTools Software Overview Gregory B. Newby UNC Chapel Hill gbnewby@ils.unc.edu

  2. Download & Participate • IRTools is a work in progress. Check back in the spring for more software and test cases. Currently, only some parts work • Want to help? We use CVS for distributed development • Our project page: http://sourceforge.net/projects/irtools

  3. Design Principles • For IR Researchers • A programming toolkit, not an IR system • Implements major approaches to IR (Boolean, VSM, Probabilistic & LSI) • Scalable to billions of documents • High performance algorithms and structures • Expandable • Documented: http://ils.unc.edu/tera

  4. Major Components

  5. Implementation • Mostly in C++, using the GNU compiler • Uses the Standard Template Library • Tested on Solaris & Linux (Alpha & 386) • Designed for modularity, so IR researchers can add their own components

  6. Why Might You Use IRTools? • If you have your own IR software, there’s probably no need • If you are looking for experimental IR software, this might be a good alternative (goal: to be suitable for general use in mid-2002) • IRTools should be useful for classroom use and demonstration • For production use, consider ht://dig

  7. Design Snippet: Word List • Berkeley DB is used to store the term → termID lookup table • A single file, accessed by hash or B+ tree • The lookup record:

      struct term_termID {
          char   *term;
          irt_int termID;
      };
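As an illustration, a minimal lookup sketch against the Berkeley DB C API (a DB 4.x-style open signature is assumed, as are the file name "terms.db", the function name, and the irt_int typedef):

    #include <db.h>
    #include <cstring>

    typedef int irt_int;                 // assumed width of IRTools' int type

    // Fetch a term's termID from the Berkeley DB lookup table, or -1.
    irt_int lookup_termID(const char *term)
    {
        DB *dbp;
        if (db_create(&dbp, NULL, 0) != 0) return -1;
        if (dbp->open(dbp, NULL, "terms.db", NULL, DB_BTREE, DB_RDONLY, 0) != 0)
            return -1;

        DBT key, data;
        std::memset(&key, 0, sizeof key);
        std::memset(&data, 0, sizeof data);
        key.data = const_cast<char *>(term);
        key.size = std::strlen(term) + 1;        // key includes the NUL

        irt_int id = -1;
        if (dbp->get(dbp, NULL, &key, &data, 0) == 0)
            id = *static_cast<irt_int *>(data.data);

        dbp->close(dbp, 0);
        return id;
    }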

  8. Design Snippet: 1st Inverted Index File • Binary file with fixed-length records • Accessed by seeking to offset termID * sizeof(struct) • Gives basic info needed for weighting • Points to more files for inverted entries (the actual documents for this term) • Some duplication (e.g., mean tf) to prevent additional I/O

  9. Design Snippet: 1st Inverted Index File

      struct inv_file1 {
          irt_int termID;
          irt_int term_doccount;    // Frequency
          irt_int meantf;           // For weighting
          irt_int nt;               // # terms in this doc
          irt_int file2_location;   // File for entries
          irt_int starting_offset;  // File 2 location
          irt_int entry_count;      // # occurrences of this term in file 2
      };
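A minimal read sketch for these fixed-length records, assuming the 1st file is simply an array of inv_file1 structs indexed by termID as slides 8-9 describe (function name and error handling are our own):

    #include <cstdio>

    typedef int irt_int;                 // assumption

    struct inv_file1 {
        irt_int termID;
        irt_int term_doccount;
        irt_int meantf;
        irt_int nt;
        irt_int file2_location;
        irt_int starting_offset;
        irt_int entry_count;
    };

    // Seek to termID * sizeof(struct) and read one record.
    bool read_file1(FILE *fp, irt_int termID, inv_file1 &rec)
    {
        long off = (long)termID * (long)sizeof(inv_file1);
        if (std::fseek(fp, off, SEEK_SET) != 0) return false;
        return std::fread(&rec, sizeof rec, 1, fp) == 1;
    }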

  10. Design Snippet: 2nd Inverted Index File • Info about documents with this term • Using Page Rank, the best docs can be listed earliest (avoiding subsequent disk I/O) • Multiple 2nd files for larger collections

      struct inv_file2 {
          irt_int termID;                         // Sanity check
          irt_int file_location;                  // Next file
          irt_int starting_offset, num_entries;   // As for file 1
      };

  11. Design Snippet: 2nd Inverted Index File • For each document with this term:

      struct inv_docentry {
          irt_int term_in_doc_count;
          // For weighting:
          irt_int doc_unique_terms;
          irt_int doc_total_terms;
          // 3rd file offset
          irt_int file3_location;
      };
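A sketch of pulling one term's document entries out of a 2nd file. It assumes the inv_docentry records lie contiguously at the offset recorded in the 1st file; the chaining across multiple 2nd files via file_location is not shown:

    #include <cstdio>
    #include <vector>

    typedef int irt_int;                 // assumption

    struct inv_docentry {
        irt_int term_in_doc_count;
        irt_int doc_unique_terms;
        irt_int doc_total_terms;
        irt_int file3_location;
    };

    std::vector<inv_docentry> read_docentries(FILE *fp, long starting_offset,
                                              irt_int entry_count)
    {
        std::vector<inv_docentry> entries;
        if (entry_count <= 0) return entries;
        entries.resize(entry_count);
        if (std::fseek(fp, starting_offset, SEEK_SET) != 0 ||
            std::fread(&entries[0], sizeof(inv_docentry),
                       entry_count, fp) != (size_t)entry_count)
            entries.clear();             // treat a short read as failure
        return entries;
    }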

  12. Design Snippet: 3rd Inverted Index File • This lists a term’s locations in documents • irt_int termID // Sanity check • Followed by term_in_doc_count irt_ints indicating the positions of this term in this document • Usable for a NEAR operator
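For illustration, one way a NEAR operator could use these position lists: walk two lists with two cursors and report a match when positions fall within a window. This is a sketch under the assumption that positions are stored in ascending order; the function name is hypothetical:

    #include <cstdlib>
    #include <vector>

    typedef int irt_int;                 // assumption

    // True if some position of term a is within `window` of one of term b.
    bool near_match(const std::vector<irt_int> &a,
                    const std::vector<irt_int> &b, irt_int window)
    {
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (std::abs((long)(a[i] - b[j])) <= window) return true;
            if (a[i] < b[j]) ++i; else ++j;   // advance the smaller position
        }
        return false;
    }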

  13. Planned & Current Components • Current: various stemmers and stoplists; various weighting schemes; sparse matrix formats for LSI etc.; Boolean AND & OR; TREC output; visual interfaces • Designed & Planned: Page Rank; integrated spider; Boolean NEAR; update & delete entries; concurrent retrieval engine clients; concurrent indexers

  14. Global Collection Variables • maxn: highest # of terms in any doc • maxUn: highest # of unique terms • Nterms: total known terms • Ndocs: total known documents

  15. Design Snippet: Boolean Candidate Merging • Works for OR or AND • Min. disk I/O (needed for inverted index only) • Doesn’t require inverted index to be sorted in docID order • The STL map can be problematic for more than about 20K candidates; using documents that are Page Rank’ed can help shrink the candidate set (and speed up everything) • Start with terms with the lowest frequency; we only continue until we have enough hits

  16. Design Snippet: Boolean Candidate Merging

      irt_int NFULL = 0;                      // stop with enough hits
      vector<irt_int> full;                   // docIDs w. all query terms
      map<docID, candidate_info> candidates;  // the candidate set

      struct candidate_info {     // For each doc:
          irt_int docID;          // this doc's ID
          irt_int nt;             // # terms in this doc, for weighting
          irt_int meantf;         // mean tf in this doc, for weighting
          float tf[NQUERYTERMS];  // For weighting
          irt_short qtcount;      // # query terms in doc
      };

  • The map eliminates sorting! • We must allocate memory for every candidate
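Here is a minimal sketch of that merging loop for the AND case: terms are processed rarest-first, candidates accumulate in a std::map keyed by docID (so no separate sort pass is needed), and scanning stops once enough full matches exist. ENOUGH, postings, and merge_and are hypothetical names, and the weighting fields are omitted from the bookkeeping:

    #include <map>
    #include <vector>

    typedef int irt_int;                   // assumption
    const irt_int ENOUGH = 1000;           // "enough hits" cutoff (assumed)

    struct cand_info { irt_int qtcount; }; // # query terms seen in this doc

    // postings[q] = docIDs containing query term q, rarest term first.
    std::vector<irt_int> merge_and(
            const std::vector< std::vector<irt_int> > &postings)
    {
        std::map<irt_int, cand_info> cand;   // keyed by docID; stays ordered
        std::vector<irt_int> full;           // docIDs with all query terms
        int nq = (int)postings.size();
        for (int q = 0; q < nq; ++q)
            for (size_t i = 0; i < postings[q].size(); ++i) {
                if ((irt_int)full.size() >= ENOUGH) return full;
                cand_info &c = cand[postings[q][i]];  // alloc per candidate
                if (++c.qtcount == nq)       // doc has seen every query term
                    full.push_back(postings[q][i]);
            }
        return full;   // for OR, every key in cand would be a hit instead
    }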

  17. Design Snippet: LSI & Information Space • We use a modified Harwell-Boeing sparse matrix format on disk (modified = binary files) • Berry’s svdpackc has been integrated • We’re doing scaling experiments now. Scaling is a major challenge for LSI • One solution: do smaller eigensystem problems on candidate subset on the fly, rather than pre-computing the entire collection’s semantic space. But this eliminates possibly interesting documents!
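To make the storage concrete, here is one possible binary dump of a compressed-column (Harwell-Boeing-style) matrix. The actual IRTools file layout isn't specified on the slide, so the header fields and array order below are assumptions:

    #include <cstdio>
    #include <vector>

    typedef int irt_int;                 // assumption

    struct csc_matrix {
        irt_int rows, cols, nnz;         // terms x docs, nonzero count
        std::vector<irt_int> colptr;     // cols + 1 column start offsets
        std::vector<irt_int> rowind;     // nnz row indices
        std::vector<float>   values;     // nnz nonzero values
    };

    bool write_csc(const char *path, const csc_matrix &m)
    {
        FILE *fp = std::fopen(path, "wb");
        if (!fp) return false;
        std::fwrite(&m.rows, sizeof(irt_int), 1, fp);
        std::fwrite(&m.cols, sizeof(irt_int), 1, fp);
        std::fwrite(&m.nnz,  sizeof(irt_int), 1, fp);
        std::fwrite(&m.colptr[0], sizeof(irt_int), m.colptr.size(), fp);
        if (m.nnz > 0) {
            std::fwrite(&m.rowind[0], sizeof(irt_int), m.rowind.size(), fp);
            std::fwrite(&m.values[0], sizeof(float),   m.values.size(), fp);
        }
        return std::fclose(fp) == 0;
    }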

  18. Hyperlink Map • The hyperlink map is a sparse asymmetric matrix of size D × D (for D documents) • We use a modified Harwell-Boeing format to store the matrix • An index file structure similar to the inverted index gives us rapid access to any document’s link list • We must store both sides of the matrix (so out-links and in-links can both be followed quickly)

  19. Web Document Metadata • Items stored during spidering. These are kept in a Berkeley DB (B+ tree/hash) file, with the document URL (or name) as key • Docname (the key) • docID • HTTP last update, as reported • Our last visit/update • HTTP-reported size • Checksum (simple) • # links out
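As a sketch, the fields above might serialize into a fixed-size record stored as the Berkeley DB value under the document name; these field names and types are assumptions, not IRTools' actual definitions:

    #include <ctime>

    typedef int irt_int;                 // assumption

    struct doc_metadata {                // value; the docname/URL is the key
        irt_int docID;
        time_t  http_last_update;        // as reported by the server
        time_t  our_last_visit;          // our last visit/update
        irt_int http_size;               // HTTP-reported size
        irt_int checksum;                // simple checksum
        irt_int links_out;               // # links out
    };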

  20. Design Snippet: Tokenizer • The tokenizer reads files (via spider or local disk) • Goal: few passes through the file • Goal: any character set • Process: • Keep a static array of word boundaries • Keep a static array of tag delimiters (<) • Fold everything to lower case • termID lookup can happen now or later • Simple transformations (like ditching extra white space) can happen now
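A byte-oriented sketch of that loop, assuming a 256-entry static boundary table and ASCII-style case folding (the slide's any-character-set goal would need a more careful table; all names here are our own):

    #include <cctype>
    #include <string>
    #include <vector>

    static bool boundary[256];           // static word-boundary lookup table

    static void init_boundaries()
    {
        for (int c = 0; c < 256; ++c)
            boundary[c] = !std::isalnum(c);  // assumption: alnum = word chars
        // '<' is already a boundary here; a real pass would also switch
        // into tag handling when it sees one.
    }

    std::vector<std::string> tokenize(const std::string &text)
    {
        static bool ready = false;
        if (!ready) { init_boundaries(); ready = true; }

        std::vector<std::string> tokens;
        std::string cur;
        for (size_t i = 0; i < text.size(); ++i) {
            unsigned char c = (unsigned char)text[i];
            if (boundary[c]) {           // boundary: emit any pending token
                if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
            } else {
                cur += (char)std::tolower(c);  // fold to lower case in-pass
            }
        }
        if (!cur.empty()) tokens.push_back(cur);
        return tokens;   // runs of white space collapse away automatically
    }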
