
Inverted Indexing for Text Retrieval

Presentation Transcript


  1. Inverted Indexing for Text Retrieval (Chapter 4, Lin and Dyer)

  2. Introduction • Web search is a quintessential large-data problem. • So are any number of problems in genomics. • Google and Amazon (AWS) are both engaged in research and discovery in this area. • Web search, or full-text search, depends on a data structure called an inverted index. • The web search problem breaks down into three major components: • Gathering the web content (crawling) (like project 1) • Construction of the inverted index (indexing) • Ranking the documents given a query (retrieval) (exam 1)

  3. Issues with these components • Crawling and indexing have similar characteristics: resource consumption is high. • Both are typically offline batch processes, except of course in the Twitter (real-time) model. • There are many requirements for a web crawler, or in general any data aggregator: • Etiquette, bandwidth and resource limits, multilingual content, duplicate content, frequency of changes… • How often to collect: too infrequently and you may miss important updates; too often and you may gather too much redundant data.

  4. Web Crawling • Start with a “seed” URL, say a Wikipedia page, and start collecting content by following the links in the seed page; the depth of traversal is also specified as an input (a minimal crawler sketch follows). • What are the issues? • See page 67.
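A minimal breadth-first crawler sketch, assuming Python and only the standard library; the depth limit and one-second politeness delay are illustrative choices, not the book's design.

import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # collects href targets from anchor tags
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_depth=2, delay=1.0):
    seen = {seed}
    queue = deque([(seed, 0)])                # frontier of (url, depth)
    while queue:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                           # skip unreachable pages
        yield url, html                        # hand the page to the indexer
        if depth < max_depth:
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        time.sleep(delay)                      # crude etiquette: rate-limit requests

# usage: for url, page in crawl("https://en.wikipedia.org/wiki/Main_Page", max_depth=1): ...
A real crawler would also honor robots.txt, deduplicate content, and distribute the frontier across machines.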

  5. Retrieval • Retrieval is an online problem that demands stringent timing: sub-second response times. • Concurrent queries • Query latency • Load on the servers • Other circumstances: time of day • Resource consumption can be spiky or highly variable. • Resource requirements for indexing are more predictable.

  6. Indexes • Regular (forward) index: document → terms • Inverted index: term → documents • Example: term1 → {d1, p}, {d2, p}, {d23, p} term2 → {d2, p}, {d34, p} term3 → {d6, p}, {d56, p}, {d345, p} where d is the doc id and p is the payload (an example payload: term frequency… this can be blank too)
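A small Python sketch contrasting the two shapes; the three-document toy corpus is illustrative, not from the book.

from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "new home construction"}

# regular (forward) index: document -> terms
forward = {docid: text.split() for docid, text in docs.items()}

# inverted index: term -> postings list; each posting is (docid, payload),
# and here the payload is the term frequency
inverted = defaultdict(list)
for docid, terms in forward.items():
    for term in sorted(set(terms)):
        inverted[term].append((docid, terms.count(term)))

print(inverted["home"])   # [(1, 1), (2, 1), (3, 1)]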

  7. Inverted Index • An inverted index consists of postings lists, one associated with each term that appears in the corpus. • <t, posting>n • <t, <docid, tf>>n • <t, <docid, tf, other info>>n • A key-value pair where the key is the term (word) and the value is the docid, followed by a “payload” • The payload can be empty for a simple index • The payload can be complex: it provides details such as co-occurrences, additional linguistic processing, page rank of the doc, etc. • <t2, <d1, d4, d67, d89>> • <t3, <d4, d6, d7, d9, d22>> • Document numbers typically do not carry semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page rank.

  8. Retrieval • Once the inverted index is built, when a query comes in, retrieval involves fetching the appropriate docs. • The docs are ranked, and the top k docs are listed. • It is good to have the inverted index in memory. • If not, some queries may involve random disk access for decoding of postings. • Solution: organize the disk accesses so that random seeks are minimized. • A toy query-evaluation sketch follows.
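A toy query evaluator, assuming an in-memory index like the one sketched above, a conjunctive (AND) query, and ranking by summed term frequency; real engines use much richer scoring functions such as BM25.

import heapq

def retrieve(query_terms, inverted, k=10):
    # fetch the postings list for each query term as a docid -> tf map
    postings = [dict(inverted.get(t, [])) for t in query_terms]
    if not postings:
        return []
    # conjunctive query: keep only docs containing every term
    candidates = set(postings[0]).intersection(*postings[1:])
    # score each candidate by summed term frequency (a stand-in ranker)
    scored = [(sum(p[d] for p in postings), d) for d in candidates]
    return heapq.nlargest(k, scored)        # top-k (score, docid) pairs

inverted = {"new": [(1, 1), (3, 1)], "home": [(1, 1), (2, 1), (3, 1)]}
print(retrieve(["new", "home"], inverted))  # [(2, 3), (2, 1)]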

  9. Pseudo Code Pseudo code → baseline implementation → value-to-key conversion pattern implementation…

  10. Inverted Index: Baseline Implementation using MR • Input to the mapper consists of the docid and the actual content. • Each document is analyzed and broken down into terms. • Processing pipeline, assuming HTML docs (a sketch follows the list): • Strip HTML tags • Strip JavaScript code • Tokenize using a set of delimiters • Case fold • Remove stop words (a, an, the…) • Remove domain-specific stop words • Stem different forms (…ing, …ed, dogs → dog)
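A sketch of this analysis pipeline; the regexes, stop list, and crude suffix-stripping stemmer are simplifications (a real system would use an HTML parser and a proper stemmer such as Porter's).

import re

STOP_WORDS = {"a", "an", "the", "in", "of", "to", "over"}

def analyze(html):
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S)   # strip JavaScript
    text = re.sub(r"<[^>]+>", " ", text)                           # strip HTML tags
    tokens = re.split(r"[^a-zA-Z0-9]+", text)                      # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]                      # case fold
    tokens = [t for t in tokens if t not in STOP_WORDS]            # remove stop words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]         # toy stemming

print(analyze("<p>The dogs <b>jumped</b> over the fences</p>"))
# ['dog', 'jump', 'fence']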

  11. Baseline implementation

procedure Map(docid n, doc d)
  H ← new AssociativeArray
  for all terms t in doc d do
    H{t} ← H{t} + 1
  for all terms t in H do
    Emit(term t, posting <n, H{t}>)

  12. Reducer for baseline implementation

procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
  P ← new List
  for all postings <a, f> in postings do
    Append(P, <a, f>)
  Sort(P)   // sort by docid
  Emit(term t, postings P)

  13. Shuffle and sort phase • This is essentially a very large distributed “group by term” over the postings. • Let's look at a toy example. • See Fig. 4.3 (note: some items are incorrect in the figure).

  14. Baseline MR for II

class Mapper
  procedure Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t ∈ doc d do
      H[t] ← H[t] + 1
    for all term t ∈ H do
      Emit(term t, posting <n, H[t]>)

class Reducer
  procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
    P ← new List
    for all posting <a, f> ∈ postings do
      Append(P, <a, f>)
    Sort(P)
    Emit(term t, postings P)
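The same algorithm simulated in plain Python over a toy corpus (the corpus is illustrative; on Hadoop the grouping step in the middle is performed by the framework's shuffle and sort).

from collections import Counter, defaultdict

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}

# map phase: emit (term, (docid, tf)) pairs
pairs = []
for n, d in docs.items():
    for t, tf in Counter(d.split()).items():
        pairs.append((t, (n, tf)))

# shuffle and sort: group values by key (done by the MR runtime)
grouped = defaultdict(list)
for t, posting in pairs:
    grouped[t].append(posting)

# reduce phase: sort each postings list by docid, then emit it
index = {t: sorted(postings) for t, postings in grouped.items()}

print(index["fish"])   # [(1, 2), (2, 2)]
print(index["one"])    # [(1, 1), (3, 1)]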

  15. Revised Implementation • Issue: MR does not guarantee the sort order of values, only of keys. • So the sort in the reducer is an expensive operation, especially if the postings cannot be held in memory. • Let's look at a revised solution: change • (term t, posting <docid, f>) to • (tuple <t, docid>, tf f)

  16. Inverted Index: Revised implementation • From the baseline to an improved version • Observe the sort done by the reducer. Is there any way to push this into the MR runtime? • Instead of • (term t, posting <docid, f>) • emit • (tuple <t, docid>, tf f) • This is our previously studied value-to-key conversion design pattern. • This switch ensures the keys arrive at the reducer in sorted order. • Small memory footprint; less buffer space needed at the reducer. • See Fig. 4.4.

  17. Modified mapper

procedure Map(docid n, doc d)
  H ← new AssociativeArray
  for all terms t in doc d do
    H{t} ← H{t} + 1
  for all terms t in H do
    Emit(tuple <t, n>, tf H{t})

  18. Modified Reducer

method Initialize
  tprev ← 0
  P ← new PostingsList
method Reduce(tuple <t, n>, tf [f1, …])
  if t ≠ tprev ∧ tprev ≠ 0 then
    Emit(term tprev, postings P)
    P.Reset()
  P.Add(<n, f>)
  tprev ← t
method Close
  Emit(term tprev, postings P)

  19. Improved MR for II

class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t ∈ doc d do
      H[t] ← H[t] + 1
    for all term t ∈ H do
      Emit(tuple <t, n>, tf H[t])

class Reducer
  method Initialize
    tprev ← 0
    P ← new PostingsList
  method Reduce(tuple <t, n>, tf [f])
    if t ≠ tprev ∧ tprev ≠ 0 then
      Emit(term tprev, postings P)
      P.Reset()
    P.Add(<n, f>)
    tprev ← t
  method Close
    Emit(term tprev, postings P)
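The value-to-key pattern simulated in Python (same toy corpus; the single sort over composite keys stands in for the MR runtime's shuffle, so the reducer only streams and cuts on term boundaries).

from collections import Counter

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}

# map phase: the docid moves from the value into the key
pairs = []
for n, d in docs.items():
    for t, tf in Counter(d.split()).items():
        pairs.append(((t, n), tf))             # composite key (term, docid)

pairs.sort()                                   # done by the MR runtime, not the reducer

# reduce phase: stream sorted pairs, flushing P at each term boundary
index, t_prev, P = {}, None, []
for (t, n), tf in pairs:
    if t != t_prev and t_prev is not None:
        index[t_prev] = P                      # emit the finished postings list
        P = []
    P.append((n, tf))
    t_prev = t
if t_prev is not None:
    index[t_prev] = P                          # Close(): flush the final term

print(index["fish"])   # [(1, 2), (2, 2)], already sorted by docid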

  20. Other modifications • The partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer. • Use a custom partitioner so that all tuples <t, n> sharing the same term t go to the same reducer (a sketch follows). • Let's go through a numerical example.
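A sketch of such a partitioner in Python; Hadoop's actual Partitioner API differs, this just shows the routing rule.

# Default partitioning would hash the whole (term, docid) tuple and scatter
# one term's postings across reducers; hash the term component alone instead.
def partition(key, num_reducers):
    term, docid = key
    return hash(term) % num_reducers       # docid is ignored for routing

# all postings for "fish" land on the same reducer:
assert partition(("fish", 1), 4) == partition(("fish", 2), 4)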

  21. What about retrieval? • While MR is great for indexing (an offline batch job), it is not great for retrieval, which demands low-latency random lookups rather than batch scans.

  22. Index compression for space • Section 4.5 • Docids within a postings list are sorted, so we can store the gaps (deltas) between successive docids instead of absolute ids: • (5,2), (7,3), (12,1), (49,1), (51,2)… becomes • (5,2), (2,3), (5,1), (37,1), (2,2)… • The gaps are small and compress well with variable-length codes.
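A sketch of gap encoding plus VByte compression of the gaps, reusing the postings list above; this byte layout is one common convention, not necessarily the book's exact scheme.

def delta_encode(postings):
    # replace absolute docids with gaps from the previous docid
    out, prev = [], 0
    for docid, tf in postings:
        out.append((docid - prev, tf))
        prev = docid
    return out

def vbyte(n):
    # 7 payload bits per byte; the high bit marks the final byte
    out = bytearray()
    while n >= 128:
        out.append(n % 128)                # continuation byte
        n //= 128
    out.append(n + 128)                    # terminating byte with stop bit set
    return bytes(out)

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
print(delta_encode(postings))              # [(5, 2), (2, 3), (5, 1), (37, 1), (2, 2)]
print(vbyte(37))                           # b'\xa5': one byte vs. a fixed 4-byte int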

  23. Miscellaneous Stuff • How would you MapReduce the spam filtering (Naïve Bayes) solution discussed in Ch. 4 of DDS? MR fits in training the model. • Write the solution in the form of your main workflow configuration. • The prior is the baseline probability of x occurring, e.g., what is the probability that the next person who walks into the class is female? • A sketch of the training counts follows.
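A sketch of the training side in MR terms, assuming a toy labeled corpus; in a real job each mapper would emit ((label, word), 1) pairs and the reducers would sum them. This is not the DDS chapter's exact code.

from collections import Counter

train = [("spam", "win cash now"), ("spam", "win win prize"),
         ("ham", "meeting at noon"), ("ham", "cash your check at noon")]

label_counts = Counter(label for label, _ in train)   # reduce: sum per label
word_counts = Counter()                               # reduce: sum per (label, word)
for label, text in train:
    for w in text.split():
        word_counts[(label, w)] += 1

# prior: the baseline probability of each class, P(spam) = count(spam) / N
priors = {l: c / len(train) for l, c in label_counts.items()}
print(priors)                          # {'spam': 0.5, 'ham': 0.5}
print(word_counts[("spam", "win")])    # 3, feeds P(win | spam) with smoothing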

  24. NIH Solicitation in Big Data (2014) • This opportunity targets four topic areas of high need for researchers working with biomedical Big Data: 1. Data Compression/Reduction 2. Data Provenance 3. Data Visualization 4. Data Wrangling

  25. Odds Ratio Example from a 4/16/2014 news article • Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on the online betting site Bovada. Adam Scott has the next best odds at 12/1… • How to interpret this? Fractional odds of A/1 against an outcome imply a probability P = 1/(A + 1): • 10/1 → P = 1/(10 + 1) = 1/11 ≈ 0.091 for Woods or McIlroy • 12/1 → P = 1/(12 + 1) = 1/13 ≈ 0.077 for Scott • Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there: P = 1/(7 + 1) = 1/8 = 0.125
