SPIMI Implementation v2.0: Ranking with BM25
By Boulat Oulmachev, Ivan Khrisanov, Sergiy Samus
Overview of the Changes
In SPIMI v2.0, we implemented the following major changes:
• Support for term frequencies: the framework that handles term-postings list pairs was expanded to record term frequencies.
• Ranking: a BM25 ranking module was created to rank the documents, and the querying mechanism was reworked accordingly.
• Caching docs: the mechanism for caching the Reuters documents was significantly optimized. Each doc is no longer saved in its own file; pointers into the sgm files are stored instead.
• Communication between Indexer and Searcher: a communication structure was defined to pass information from the Indexer to the Searcher.
Term Frequencies
• Re-engineered the inverted index representation
- The old TermListPair list did not provide an easy way to record term frequencies.
- We needed a scalable, flexible, robust framework.
- Created new classes to represent the needed abstractions: Index, Term, PostingsList and PostingsList.Entry.
- Access to all the information is now easy, flexible and fast.
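The slides name the classes but do not show their contents. A minimal Python sketch of how Index, Term, PostingsList and PostingsList.Entry could fit together is given below. The class names come from the slide; every method and attribute name (record_occurrence, term_frequency, document_frequency, lookup, and so on) is hypothetical, and the original implementation may differ.

```python
from typing import Dict, List, Optional


class PostingsList:
    """Postings for one term: one Entry per document containing the term."""

    class Entry:
        """One posting: a docID plus the term's frequency in that doc."""

        def __init__(self, doc_id: int) -> None:
            self.doc_id = doc_id
            self.term_frequency = 0

    def __init__(self) -> None:
        self._entries: Dict[int, "PostingsList.Entry"] = {}

    def record_occurrence(self, doc_id: int) -> None:
        # Create the entry on first sight of the doc, then count the occurrence.
        entry = self._entries.get(doc_id)
        if entry is None:
            entry = self._entries[doc_id] = PostingsList.Entry(doc_id)
        entry.term_frequency += 1

    def term_frequency(self, doc_id: int) -> int:
        entry = self._entries.get(doc_id)
        return entry.term_frequency if entry is not None else 0

    def document_frequency(self) -> int:
        # Number of distinct docs that contain the term.
        return len(self._entries)

    def doc_ids(self) -> List[int]:
        return sorted(self._entries)


class Term:
    """A dictionary term together with its postings list."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.postings = PostingsList()


class Index:
    """Inverted index: maps each token to its Term."""

    def __init__(self) -> None:
        self._terms: Dict[str, Term] = {}

    def add_occurrence(self, token: str, doc_id: int) -> None:
        term = self._terms.get(token)
        if term is None:
            term = self._terms[token] = Term(token)
        term.postings.record_occurrence(doc_id)

    def lookup(self, token: str) -> Optional[Term]:
        return self._terms.get(token)


# Usage: index two tiny made-up documents.
index = Index()
for doc_id, text in [(1, "oil price oil"), (2, "price")]:
    for token in text.split():
        index.add_occurrence(token, doc_id)

print(index.lookup("oil").postings.term_frequency(1))
print(index.lookup("price").postings.doc_ids())
```

Keeping the frequency inside the posting entry means a ranker can read both the document frequency and the per-doc term frequency from one lookup.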
BM25 Ranking
• Implemented a module to rank docs using BM25
- Created a class BM25Ranker that, given a list of docIDs and the query terms, computes the ranking score for each doc.
- For each supplied docID, it finds the query terms whose postings lists contain that doc and applies the BM25 formula to those terms to accumulate the doc's score.
- It uses global statistical data from the new Stats module.
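The scoring loop described above can be sketched as follows. This is an illustration, not the BM25Ranker from the slides: the function name bm25_rank, the plain-dict inputs standing in for the index and the Stats module, and the parameter values k1 = 1.2 and b = 0.75 are all assumptions. The sketch also assumes the textbook BM25 variant with IDF = ln(N / df); the slides do not say which variant was used.

```python
import math
from typing import Dict, Iterable, List, Tuple


def bm25_rank(
    doc_ids: Iterable[int],
    query_terms: Iterable[str],
    postings: Dict[str, Dict[int, int]],  # term -> {docID: term frequency}
    doc_lengths: Dict[int, int],          # docID -> length in tokens
    k1: float = 1.2,                      # assumed value, not from the slides
    b: float = 0.75,                      # assumed value, not from the slides
) -> List[Tuple[int, float]]:
    """Return (docID, score) pairs sorted by descending BM25 score."""
    query_terms = list(query_terms)
    n_docs = len(doc_lengths)                         # collection size
    avg_len = sum(doc_lengths.values()) / n_docs      # average doc length

    scores: Dict[int, float] = {}
    for doc_id in doc_ids:
        score = 0.0
        for term in query_terms:
            term_postings = postings.get(term, {})
            tf = term_postings.get(doc_id, 0)
            if tf == 0:
                continue  # this term does not occur in the doc
            idf = math.log(n_docs / len(term_postings))
            # Length normalization: long docs are penalized in proportion to b.
            norm = k1 * ((1 - b) + b * doc_lengths[doc_id] / avg_len)
            score += idf * (k1 + 1) * tf / (norm + tf)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Usage with a made-up three-document collection.
postings = {"oil": {1: 2, 2: 1}, "price": {1: 1}}
doc_lengths = {1: 3, 2: 1, 3: 2}
ranking = bm25_rank([1, 2, 3], ["oil", "price"], postings, doc_lengths)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

The collection size and average document length computed at the top of the function are the kind of global statistics the slide attributes to the Stats module; a real searcher would read them once rather than recompute them per query.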
Caching and Stats modules
• Caching
- Retrieval of document text was significantly optimized.
- Doc contents are no longer stored in individual text files on disk. Instead, an index file is created that maps docIDs to pointers into the sgm files.
- The indexing stage is now 3 to 4 times faster.
- Disk space is also saved.
• Communication
- The Indexer needed to communicate global stats, such as average doc length and collection size, to the Searcher.
- Implemented a module to provide that communication.
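Both ideas on this slide can be sketched in a few lines of Python. The sketch assumes that each document in an sgm file is delimited by <REUTERS ...> and </REUTERS> tags, that the NEWID attribute serves as the docID, and that a pointer is a (file path, byte offset, byte length) triple. It also assumes the stats are exchanged through a JSON file. The function names and the file format are hypothetical; the slides do not describe how the original modules store their data.

```python
import json
import re
from pathlib import Path
from typing import Dict, Iterable, Tuple

# Pointer into an sgm file: (file path, byte offset, byte length).
DocPointer = Tuple[str, int, int]

# Matches one whole document and captures its NEWID attribute.
DOC_PATTERN = re.compile(
    rb'<REUTERS[^>]*NEWID="(\d+)"[^>]*>.*?</REUTERS>', re.DOTALL
)


def build_doc_index(sgm_paths: Iterable[str]) -> Dict[int, DocPointer]:
    """Map each docID to the byte range of its document inside an sgm file."""
    index: Dict[int, DocPointer] = {}
    for path in sgm_paths:
        data = Path(path).read_bytes()
        for match in DOC_PATTERN.finditer(data):
            doc_id = int(match.group(1))
            index[doc_id] = (str(path), match.start(), match.end() - match.start())
    return index


def fetch_doc(index: Dict[int, DocPointer], doc_id: int) -> str:
    """Read one document by seeking straight to its byte range."""
    path, offset, length = index[doc_id]
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).decode("latin-1")


def save_stats(path: str, collection_size: int, avg_doc_length: float) -> None:
    """Indexer side: write the global stats for the Searcher to read."""
    stats = {"collection_size": collection_size, "avg_doc_length": avg_doc_length}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stats, handle)


def load_stats(path: str) -> Dict[str, float]:
    """Searcher side: read the global stats written by the Indexer."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

Because the index stores byte offsets rather than copies of the text, the Indexer writes one small mapping instead of one file per document, which is consistent with the speed and disk-space gains reported on the slide.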