A programming toolkit for IR researchers, implementing the major approaches to IR with scalable, high-performance algorithms. Available for download, and participation is welcome; suitable for classroom use and demonstrations, with general use targeted for mid-2002.
IRTools Software Overview • Gregory B. Newby • UNC Chapel Hill • gbnewby@ils.unc.edu
Download & Participate • IRTools is a work in progress; currently, only some parts work. Check back in the spring for more software and test cases • Want to help? We use CVS for distributed development • Our project page: http://sourceforge.net/projects/irtools
Design Principles • For IR Researchers • A programming toolkit, not an IR system • Implements major approaches to IR (Boolean, VSM, Probabilistic & LSI) • Scalable to billions of documents • High performance algorithms and structures • Expandable • Documented: http://ils.unc.edu/tera
Implementation • Mostly in C++, using the GNU compiler • Uses the Standard Template Library • Tested on Solaris & Linux (Alpha & 386) • Designed for modularity, so IR researchers can add their own components
Why Might You Use IRTools? • If you have your own IR software, there's probably no need • If you are looking for experimental IR software, this might be a good alternative (goal: to be suitable for general use in mid-2002) • IRTools should be useful for classroom use and demonstration • For production use, consider ht://dig
Design Snippet: Word List • The Berkeley DB is used to store the term → termID lookup table • A single file, keyed by term and accessed through a B+ tree •
    struct term_termID {
        char   *term;
        irt_int termID;
    };
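As a rough illustration, a lookup against such a table might go through the Berkeley DB C API as sketched below; the function name is ours, and we assume the database handle is already open and that irt_int is a plain integer typedef:

    // Sketch: term -> termID lookup via the Berkeley DB get() call.
    // Assumes dbp is an already-opened handle on the word-list file.
    #include <db.h>
    #include <cstring>

    typedef int irt_int;   // assumed typedef

    irt_int lookup_termID(DB *dbp, const char *term) {
        DBT key, data;
        std::memset(&key, 0, sizeof(key));
        std::memset(&data, 0, sizeof(data));
        key.data = (void *)term;
        key.size = std::strlen(term);

        irt_int termID = -1;
        data.data  = &termID;
        data.ulen  = sizeof(termID);
        data.flags = DB_DBT_USERMEM;   // read into our own buffer

        if (dbp->get(dbp, NULL, &key, &data, 0) != 0)
            return -1;                 // term not in the word list
        return termID;
    }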
Design Snippet: 1st Inverted Index File • Binary file with fixed-length records • Accessed at offset termID * sizeof(struct) • Gives the basic info needed for weighting • Points to further files for inverted entries (the actual documents for this term) • Some duplication (e.g., meantf) to prevent additional I/O
Design Snippet: 1st Inverted Index File •
    struct inv_file1 {
        irt_int termID;
        irt_int term_doccount;    // Frequency
        irt_int meantf;           // For weighting
        irt_int nt;               // # terms in this doc
        irt_int file2_location;   // File for entries
        irt_int starting_offset;  // File 2 location
        irt_int entry_count;      // # occurrences of this term in file 2
    };
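Since the records are fixed-length, a term's entry is one seek plus one read; a minimal sketch, assuming C stdio and the struct above:

    // Sketch: fetch a term's record from the 1st inverted index file.
    // The fixed-length layout means the file offset is simply
    // termID * sizeof(inv_file1).
    #include <cstdio>

    bool read_file1_entry(std::FILE *fp, irt_int termID, inv_file1 &rec) {
        long offset = (long)termID * (long)sizeof(inv_file1);
        if (std::fseek(fp, offset, SEEK_SET) != 0)
            return false;
        return std::fread(&rec, sizeof(inv_file1), 1, fp) == 1;
    }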
Design Snippet: 2nd Inverted Index File • Info about documents with this term • Using Page Rank, best docs can be listed earliest (avoiding subsequent disk I/O) • Multiple 2nd files for larger collections •
    struct inv_file2 {
        irt_int termID;          // Sanity check
        irt_int file_location;   // Next file
        irt_int starting_offset, num_entries;  // As for file 1
    };
Design Snippet: 2nd Inverted Index File • For each document with this term:
    struct inv_docentry {
        irt_int term_in_doc_count;
        // For weighting:
        irt_int doc_unique_terms;
        irt_int doc_total_terms;
        // 3rd file offset
        irt_int file3_location;
    };
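Reading a term's per-document entries then becomes one sequential read at the offset recorded in file 1; a sketch under the same assumptions (and since the entries are Page Rank ordered, a caller can stop reading early):

    // Sketch: read a term's document entries from a 2nd inverted index
    // file, using the location fields from the term's file-1 record.
    #include <cstdio>
    #include <vector>

    std::vector<inv_docentry> read_doc_entries(std::FILE *fp2,
                                               const inv_file1 &t) {
        std::vector<inv_docentry> entries(t.entry_count);
        if (std::fseek(fp2, t.starting_offset, SEEK_SET) != 0)
            return std::vector<inv_docentry>();   // empty on error
        std::fread(&entries[0], sizeof(inv_docentry),
                   t.entry_count, fp2);
        return entries;
    }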
Design Snippet: 3rd Inverted Index File • This lists a term's locations in documents • irt_int termID // Sanity check • Followed by term_in_doc_count irt_ints indicating the positions of this term in this document • Usable for a NEAR operator
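The position lists make a NEAR operator a straightforward merge; a sketch (not the toolkit's actual implementation), assuming both lists are sorted:

    // Sketch of a NEAR test over two sorted position lists from file 3:
    // returns true if any position of term A is within `window` term
    // positions of some position of term B in the same document.
    #include <vector>
    #include <cstddef>

    bool near_match(const std::vector<irt_int> &a,
                    const std::vector<irt_int> &b, irt_int window) {
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            irt_int diff = (a[i] > b[j]) ? a[i] - b[j] : b[j] - a[i];
            if (diff <= window)
                return true;
            if (a[i] < b[j]) ++i; else ++j;   // advance the smaller list
        }
        return false;
    }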
Planned & Current Components • Current: various stemmers and stoplists; various weighting schemes; sparse matrix formats for LSI etc.; Boolean AND & OR; TREC output; visual interfaces • Designed & Planned: Page Rank; integrated spider; Boolean NEAR; update & delete entries; concurrent retrieval engine clients; concurrent indexers
Global Collection Variables • maxn: highest # of terms in any doc • maxUn: highest # of unique terms in any doc • Nterms: total known terms • Ndocs: total known documents
Design Snippet: Boolean Candidate Merging • Works for OR or AND • Minimal disk I/O (only the inverted index is needed) • Doesn't require the inverted index to be sorted in docID order • The STL map can be problematic for more than about 20K candidates; using Page Rank'ed documents can help shrink the candidate set (and speed everything up) • Start with the lowest-frequency terms; we only continue until we have enough hits
Design Snippet: Boolean Candidate Merging •
    irt_int NFULL = 0;                        // stop with enough hits
    vector<irt_int> full;                     // docIDs w. all q terms
    map<irt_int, candidate_info> candidates;  // Candidates, keyed by docID
    struct candidate_info {                   // For each doc
        irt_int   docID;                      // this doc's ID
        irt_int   nt;                         // # terms in this doc, for weighting
        irt_int   meantf;                     // mean tf in this doc, for weighting
        float     tf[NQUERYTERMS];            // For weighting
        irt_short qtcount;                    // # query terms in doc
    };
• The map eliminates sorting! • We must allocate memory for every candidate
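A sketch of one merge step under these definitions; the parallel docIDs vector and the helper name are our assumptions (the slide's inv_docentry record does not itself carry a docID):

    // Sketch: fold one query term's document entries into the candidate
    // map. Terms are processed from lowest to highest frequency; a doc
    // becomes a full hit once all query terms have been seen in it.
    #include <map>
    #include <vector>
    #include <cstddef>

    void merge_term(std::map<irt_int, candidate_info> &candidates,
                    std::vector<irt_int> &full,
                    const std::vector<inv_docentry> &entries,
                    const std::vector<irt_int> &docIDs,  // parallel to entries
                    irt_short nqueryterms, int qterm) {
        for (std::size_t i = 0; i < entries.size(); ++i) {
            candidate_info &c = candidates[docIDs[i]];   // find or zero-init
            c.docID = docIDs[i];
            c.tf[qterm] = (float)entries[i].term_in_doc_count;
            if (++c.qtcount == nqueryterms)              // doc has every term
                full.push_back(c.docID);                 // full AND hit
        }
    }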
Design Snippet: LSI & Information Space • We use a modified Harwell-Boeing sparse matrix format on disk (modified = binary files) • Berry's SVDPACKC has been integrated • We're doing scaling experiments now; scaling is a major challenge for LSI • One solution: do smaller eigensystem problems on a candidate subset on the fly, rather than pre-computing the entire collection's semantic space. But this eliminates possibly interesting documents!
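Harwell-Boeing is a compressed sparse column layout; a minimal sketch of what the binary variant might look like (the struct name and exact field order here are assumptions, not the toolkit's actual format):

    // Sketch: a term-by-document matrix in compressed sparse column
    // (Harwell-Boeing style) form, written as a raw binary file:
    // header, column pointers, row indices, then nonzero values.
    #include <cstdio>
    #include <vector>

    struct csc_matrix {
        irt_int nrows, ncols, nnz;
        std::vector<irt_int> colptr;  // ncols + 1 column start offsets
        std::vector<irt_int> rowind;  // nnz row indices
        std::vector<float>   values;  // nnz nonzero values
    };

    void write_csc(const csc_matrix &m, std::FILE *fp) {
        std::fwrite(&m.nrows, sizeof(irt_int), 1, fp);
        std::fwrite(&m.ncols, sizeof(irt_int), 1, fp);
        std::fwrite(&m.nnz,   sizeof(irt_int), 1, fp);
        std::fwrite(&m.colptr[0], sizeof(irt_int), m.colptr.size(), fp);
        std::fwrite(&m.rowind[0], sizeof(irt_int), m.rowind.size(), fp);
        std::fwrite(&m.values[0], sizeof(float),   m.values.size(), fp);
    }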
Hyperlink Map • The hyperlink map is a sparse asymmetric matrix of size D × D • We use a modified Harwell-Boeing format to store the matrix • An index file structure similar to the inverted index gives us rapid access to any document's link list • Because the matrix is asymmetric, we must store both sides (out-links and in-links)
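One plausible reading of that index structure, sketched below; the struct and field names are hypothetical:

    // Sketch: a fixed-length index record giving one-seek access to a
    // document's link lists in both directions.
    struct link_index_entry {
        irt_int docID;
        irt_int out_offset, out_count;  // out-links in the link file
        irt_int in_offset,  in_count;   // in-links in the link file
    };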
Web Document Metadata • Items stored during spidering. These are kept in a Berkeley DB file (B+ tree or hash), with the document URL (or name) as key:
    Docname    // key
    docID
    HTTP last update, as reported
    Our last visit/update
    HTTP-reported size
    Checksum (simple)
    # links out
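As a struct, the stored record might look like the following; the field types here are assumptions, not from the slides:

    // Sketch: the per-document metadata record kept under the URL key.
    #include <ctime>

    struct doc_metadata {
        irt_int     docID;
        std::time_t http_last_update;  // as reported by the server
        std::time_t our_last_visit;    // our last visit/update
        irt_int     http_size;         // HTTP-reported size
        irt_int     checksum;          // simple checksum
        irt_int     links_out;         // # links out
    };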
Design Snippet: tokenizer • The tokenizer reads files (from the spider or local disk) • Goal: as few passes through the file as possible • Goal: handle any character set • Process: • Keep a static array of word boundaries • Keep a static array of tag delimiters (<) • Fold everything to lower case • termID lookup can happen now or later • Simple transformations (like ditching extra white space) can happen now
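A single-pass sketch of that process; the lookup-table name is ours, and the real tokenizer would also handle tag contents and non-ASCII character sets:

    // Sketch: one-pass tokenizer using a static boundary table, folding
    // to lower case as it scans. is_boundary[] stands in for the static
    // arrays of word boundaries and tag delimiters described above.
    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <vector>

    std::vector<std::string> tokenize(const std::string &text) {
        static bool is_boundary[256] = { false };
        is_boundary[' ']  = is_boundary['\t'] = is_boundary['\n'] = true;
        is_boundary['<']  = is_boundary['>']  = true;  // tag delimiters

        std::vector<std::string> tokens;
        std::string cur;
        for (std::size_t i = 0; i < text.size(); ++i) {
            unsigned char c = (unsigned char)text[i];
            if (is_boundary[c]) {
                if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
            } else {
                cur += (char)std::tolower(c);  // fold to lower case
            }
        }
        if (!cur.empty()) tokens.push_back(cur);
        return tokens;
    }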