1 / 13

Big, Bigger Biggest


Presentation Transcript


  1. Big, Bigger, Biggest: Large-scale issues with phrase queries and common words in OCR. Tom Burton West, Hathi Trust Project

  2. Hathi Trust Large Scale Search Challenges • Goal: Design a system for full-text search that will scale to 5 to 20 million volumes (at a reasonable cost). • Challenges: • Must scale to 20 million full-text volumes • Very long documents compared to most large-scale search applications • Multilingual collection • OCR quality varies

  3. Index Size, Caching, and Memory • Our documents average about 300 pages, roughly 700 KB of OCR each. • Our 5 million document index is between 2 and 3 terabytes, about 300 GB per million documents. • The large index means disk I/O is the bottleneck. • Tradeoff: JVM heap vs. OS memory. Solr relies on OS memory (the disk I/O cache) for caching postings. • Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming). • Fitting the entire index in memory is not feasible with a terabyte-size index.

  4. Response time varies with query Average: 673 ms Median: 91 ms 90th percentile: 328 ms 99th percentile: 7,504 ms

  5. Slowest 5% of queries • The slowest 5% of queries took about 1 second or longer. • The slowest 1% of queries took between 10 seconds and 2 minutes. • The slowest 0.5% of queries took between 30 seconds and 2 minutes. • These queries affect the response time of other queries: • Cache pollution • Contention for resources • The slowest queries are phrase queries containing common words

  6. Query processing • Phrase queries use the position index (Boolean queries do not). • The position index accounts for 85% of index size. • The position list for a common word such as “the” can be many GB in size, which causes lots of disk I/O. • Solr depends on the operating system's disk cache to reduce disk I/O for words that occur in more than one query. • I/O from phrase queries containing common words pollutes that cache.
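The distinction above can be sketched in a few lines. This is an illustrative toy index, not Solr/Lucene code; all names are made up for the example. It shows why a Boolean AND only touches doc-ID sets, while a phrase query must also read the (much larger) per-document position lists:

```python
# Toy positional index: term -> {doc_id: [positions]}.
# Illustrative sketch only, not the Solr/Lucene data structures.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, w1, w2):
    # Boolean AND: only the doc-ID sets are needed.
    return sorted(set(index[w1]) & set(index[w2]))

def phrase(index, w1, w2):
    # Phrase query: must also scan the position lists for adjacency.
    hits = []
    for doc_id in boolean_and(index, w1, w2):
        later = set(index[w2][doc_id])
        if any(p + 1 in later for p in index[w1][doc_id]):
            hits.append(doc_id)
    return hits

idx = build_index(["the beat generation", "beat the drum the"])
print(boolean_and(idx, "the", "beat"))  # [0, 1] -- both docs contain both words
print(phrase(idx, "the", "beat"))       # [0] -- only doc 0 has them adjacent
```

For a term like “the” that appears in nearly every document, the position lists dominate the I/O, which is why the phrase version of a query can read thousands of times more data than the Boolean version.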

  7. Slow Queries Slowest test query: “the lives and literature of the beat generation” took 2 minutes. 4 MB of data read for the Boolean query; 9,000+ MB read for the phrase query.

  8. Why not use Stop Words? • The word “the” occurs more than 4 billion times in our 1 million document index. • Removing “stop” words (“the”, “of”, etc.) is not desirable for our use cases; we couldn't search for many phrases: • “to be or not to be” • “the who” • “man in the moon” vs. “man on the moon” • Stop words in one language are content words in another language: • The German stop words “war” and “die” are content words in English • The English stop words “is” and “by” are content words (“ice” and “village”) in Swedish
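A quick sketch makes the first objection concrete. The stop list here is a tiny illustrative sample, not Solr's actual default list:

```python
# Illustrative stop-word filtering; the stop list is a made-up sample.
STOP_WORDS = {"the", "in", "on", "to", "be", "or", "not", "of", "and", "a"}

def strip_stop_words(query):
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(strip_stop_words("to be or not to be"))  # [] -- nothing left to search
print(strip_stop_words("man in the moon"))     # ['man', 'moon']
print(strip_stop_words("man on the moon"))     # ['man', 'moon'] -- now indistinguishable
```

“to be or not to be” disappears entirely, and the two “moon” phrases collapse into the same query.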

  9. “CommonGrams” • Ported the Nutch “CommonGrams” algorithm to Solr. • Creates bi-grams selectively for any two-word sequence containing a common term. • Slowest query: “The lives and literature of the beat generation” becomes: “the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “generation”
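A simplified query-side sketch of the idea follows. Solr's actual CommonGramsFilter/CommonGramsQueryFilter implementation is more involved (it tracks token positions and, at index time, emits unigrams as well); this toy version only shows how the slowest query above turns into bigram tokens:

```python
# Simplified sketch of query-time CommonGrams: any two-word sequence that
# contains a common word becomes a single bigram token, so the huge posting
# list for "the" by itself is never read. Not the actual Solr filter code.
COMMON = {"the", "and", "of"}  # illustrative subset of a common-words list

def common_grams_query(query):
    tokens = query.lower().split()
    out, covered = [], set()
    for i in range(len(tokens) - 1):
        if tokens[i] in COMMON or tokens[i + 1] in COMMON:
            out.append(tokens[i] + "-" + tokens[i + 1])
            covered.update((i, i + 1))
    # keep any word that did not end up inside a bigram
    out.extend(t for i, t in enumerate(tokens) if i not in covered)
    return out

print(common_grams_query("The lives and literature of the beat generation"))
# ['the-lives', 'lives-and', 'and-literature', 'literature-of',
#  'of-the', 'the-beat', 'generation']
```

Each bigram like “of-the” is a much rarer term than “of” or “the” alone, so its posting list is small and the phrase query stays fast.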

  10. Standard index vs. CommonGrams [side-by-side comparison of standard-index and CommonGrams tokenization, not preserved in the transcript]

  11. Comparison of Response time (ms) [chart not preserved in the transcript]

  12. Other issues • Analyze your slowest queries. • We analyzed the slowest queries from our query logs and discovered additional “common words” to add to our list. • We used the Solr Admin panel to run the slowest queries from our logs with the “debug” flag checked. • We discovered that words such as “l’art” were being split into two-token phrase queries. • We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
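The analyzer problem above is easy to reproduce in miniature. The regex tokenizer here is an illustrative stand-in for whatever the original analyzer did, not the actual Solr analyzer in question; it shows how splitting on non-word characters silently turns a one-word term into a two-token phrase query:

```python
# Sketch of the tokenization problem: a tokenizer that splits on non-word
# characters breaks "l'art" into two tokens, so what looks like a term
# query becomes a two-token phrase query. Illustrative stand-in only.
import re

def naive_tokenize(text):
    return re.findall(r"\w+", text, re.UNICODE)

print(naive_tokenize("l'art"))  # ['l', 'art'] -- two tokens, not one
```

A phrase query over two tokens must consult the position index, which is exactly the expensive case the earlier slides describe.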

  13. Other issues • We broke Solr … temporarily: • Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms. • The Solr/Lucene index was limited to 2.1 billion unique terms. • Patched: the limit is now 274 billion. • Dirty OCR is difficult to remove without removing “good” words. • Because the Solr/Lucene tii/tis term index uses pointers into the frequency and position files, we suspect the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
