
Compressed Data Structures for Annotated Web Search

Soumen Chakrabarti, Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, Rohit Saraf. http://soumen.in/doc/CSAW/


Presentation Transcript


  1. Compressed Data Structures for Annotated Web Search • Soumen Chakrabarti, Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, Rohit Saraf • http://soumen.in/doc/CSAW/

  2. Searching the annotated Web • Search engines increasingly supplement “ten blue links” using Web of objects • From object catalogs like • WordNet: basic types and common entities • Wikipedia: millions of entities • Freebase: tens of millions of entities • Product catalogs, LinkedIn, IMDB, Zagat … • Several new capabilities required • Recognizing and disambiguating entity mentions • Indexing these mentions along with text • Query execution and entity ranking

  3. Lemmas and entities • In (Web) text, noisy and ambiguous lemmas are used to mention entities • Lemma = word or phrase • Lemma-to-entity relation is many-to-many • Goal: given mention in context, find correct entity in catalog, if any • Lemma also called “leaf” because we use a trie to detect mention phrases [Figure: many-to-many links between lemmas (Michael, Michael Jordan, Jordan, Big Apple, City that never sleeps, New York) and entities (basketball player, Berkeley professor, country, river, New York City, a state in USA)]

  4. Features for disambiguation • “After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.” Context features: nonparametric, workshop, Bayesian, tutorial, after, UNC • “After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.” Context features: leap, slam, dunk, league, season • Each context becomes a feature vector x • Millions of features

  5. Inferring the correct entity • Each lemma is associated with a set of candidate entities • For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ,e) in the same space as feature vectors • When deployed to resolve an ambiguity about lemma ℓ, choose $\hat{e} = \arg\max_{e} w(\ell,e) \cdot x$ • Linear model; dot product
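The linear model on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the sparse-dictionary representation, the function names and the toy weights for the lemma "Jordan" are all assumptions made for the example.

```python
from typing import Dict

SparseVec = Dict[str, float]


def dot(w: SparseVec, x: SparseVec) -> float:
    """Sparse dot product, iterating over the features present in x."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def disambiguate(lemma: str, x: SparseVec,
                 models: Dict[str, Dict[str, SparseVec]]) -> str:
    """Return the candidate entity e that maximizes w(lemma, e) . x."""
    candidates = models[lemma]
    return max(candidates, key=lambda e: dot(candidates[e], x))


# Hypothetical toy weights; real models have millions of features.
models = {
    "Jordan": {
        "basketball_player": {"league": 1.2, "dunk": 1.5, "UNC": 0.4},
        "berkeley_professor": {"Bayesian": 1.8, "tutorial": 0.9, "UNC": 0.1},
    }
}
x = {"UNC": 1.0, "Bayesian": 1.0, "tutorial": 1.0}
print(disambiguate("Jordan", x, models))  # berkeley_professor
```

The feature "UNC" occurs in both contexts, so it is the remaining context words that separate the two candidates.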

  6. The ℓ, f, e w map • Uncompressed key, value takes 12+4 bytes = 128 bits per entry • ~500M entries  8GB just for map • No primitive type to hold keys • With Java overheads, easily 20GB RAM • From ~2M to ~100M entities? • Total marginal entropy: 33.6 bits per entry • From 128 down to 33.6 and beyond? • Must compress keys and values • And exploit correlations between them

  7. Lossy encoding: signed hash • To insert: one hash function picks a bucket, the other picks a sign ±1; accumulate ±w into the bucket • No need to remember ℓ, f, e • w cannot be easily compressed (all buckets same size for easy hash index) • Sign hash ensures expected values preserved • Value distortion and disambiguation accuracy
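A signed hash table of the kind described on this slide might look as follows. This is a sketch under assumptions: the class name, the use of BLAKE2b as the hash, and the salts that derive two hash functions from one are illustrative choices, not details from the talk.

```python
import hashlib


class SignedHashTable:
    """Fixed array of buckets; keys are never stored, so collisions distort values."""

    def __init__(self, num_buckets: int):
        self.buckets = [0.0] * num_buckets

    @staticmethod
    def _digest(key, salt: str) -> int:
        data = repr((salt, key)).encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _slot(self, key):
        index = self._digest(key, "bucket") % len(self.buckets)    # first hash: bucket
        sign = 1.0 if self._digest(key, "sign") & 1 else -1.0      # second hash: sign
        return index, sign

    def add(self, key, w: float) -> None:
        index, sign = self._slot(key)
        self.buckets[index] += sign * w

    def get(self, key) -> float:
        index, sign = self._slot(key)
        return sign * self.buckets[index]


table = SignedHashTable(num_buckets=1 << 16)
table.add(("Jordan", "Bayesian", "berkeley_professor"), 0.25)
print(table.get(("Jordan", "Bayesian", "berkeley_professor")))  # 0.25
```

A lookup multiplies by the same sign used at insert time, so a key's own contribution comes back unchanged, while colliding keys contribute with random signs that cancel in expectation.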

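  8. “Training through the collisions” • Linear multiclass SVM • Each class e has model vector w_e • From spot generate feature vector x • Predicted class (entity) is $\arg\max_{e} w_e \cdot x$ • Sign hash space with B buckets • Map each (feature, class) pair into the B buckets via the signed hash • Predicted class is the e with the highest score computed in the hashed space • Hashing loses information, SVM training compensates for it • Essential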

  9. Lossless (ℓ, f) → {e → w} organization • When scanning documents for disambiguation, we first encounter lemma ℓ and then features f from context around it • Initialize score accumulator for each candidate entity e • For each feature f in context • Probe data structure with (ℓ, f) • Retrieve sparse map {e → w} • For each entry in map • Update entity scores • Choose top candidate entity • This structure is the “LFE map” or LFEM [Figure: lemma ℓ1 with features f1 to f4, each pointing to its {e → w} map]
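The probe loop on this slide can be sketched with plain dictionaries standing in for the compressed structure. The function name, the dictionary layout and the toy weights are assumptions for illustration; the real LFEM is a compressed in-memory structure, not a Python dict.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

LFEMap = Dict[Tuple[str, str], Dict[Hashable, float]]


def annotate_spot(lemma: str, context_features: Iterable[str], lfem: LFEMap,
                  candidates: Dict[str, List[Hashable]]) -> Optional[Hashable]:
    """Score every candidate entity of `lemma` from the context features."""
    scores = {e: 0.0 for e in candidates.get(lemma, [])}
    for f in context_features:
        # Probe with (lemma, feature); features absent from the model give no entries.
        for e, w in lfem.get((lemma, f), {}).items():
            scores[e] = scores.get(e, 0.0) + w
    return max(scores, key=scores.get) if scores else None


# Hypothetical toy model; entities are referred to by short IDs 0 and 1.
lfem = {
    ("Jordan", "UNC"): {0: 0.2, 1: 0.1},
    ("Jordan", "Bayesian"): {1: 1.5},
    ("Jordan", "dunk"): {0: 1.4},
}
candidates = {"Jordan": [0, 1]}
print(annotate_spot("Jordan", ["UNC", "Bayesian", "tutorial"], lfem, candidates))  # 1
```

Only the (ℓ, f) pairs that actually occur in the context are probed, which is why fast random access on (ℓ, f) matters later in the talk.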

  10. Millions of entities globally but few for a given lemma • Use variable length integer codes • Frequent short ID has shortest code • Candidate entities sorted by decreasing occurrence frequency in reference corpus • Short entity IDs are assigned with respect to the lemma [Figure: lemma “Michael Jordan” with short entity IDs: 0 basketball player; 1 CBS, PepsiCo, Westinghouse exec; 2 machine learning researcher; 3 mycologist; 4 racing driver; 5 goalkeeper]

  11. Encoding of (ℓ, f) → {e → w} • e = short ID • We used a variable-length integer code; others may be better • For adjacent short IDs, we spend only one bit • Irregular record sizes • Must read from beginning to decompress • Index into start of segment for each lemma ID [Figure: segments for ℓ1 (records f1, f2) and ℓ2]
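One way to get the "one bit for adjacent short IDs" behaviour is to gap-encode the sorted short IDs with the Elias gamma code, in which the value 1 costs a single bit. The transcript does not preserve the name of the code actually used, so gamma is an assumption here, and the sketch encodes only the entity IDs, not the weights.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits: str) -> List[int]:
    """Decode a concatenation of gamma codes, reading from the beginning."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_short_ids(ids: List[int]) -> str:
    """Gap-encode a strictly increasing list of short IDs (first ID may be 0)."""
    out, prev = [], -1
    for e in ids:
        out.append(gamma_encode(e - prev))  # every gap is >= 1
        prev = e
    return "".join(out)


def decode_short_ids(bits: str) -> List[int]:
    ids, prev = [], -1
    for gap in gamma_decode_all(bits):
        prev += gap
        ids.append(prev)
    return ids


print(encode_short_ids([0, 1, 2, 5]))  # 111011
```

Because the codes have irregular lengths, a record can only be found by decoding from a known starting position, which is the motivation for the sync points discussed next.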

  12. Random access on (ℓ, f ) • Already support random access on ℓ • Number of distinct ℓ in O(10 million) • Cannot afford time to decompress from the beginning of ℓ block • Cannot afford (full) index array for (ℓ, f ) • Within each ℓ block, allocate sync points • Old technique in IR indexing • New issues: • Outer allocation of total sync among ℓ blocks • Tuning syncs to measured (ℓ, f ) probe distribution — inner allocation

  13. Inner sync point allocation policies • Say Kℓ sync points budgeted to lemma ℓ • To which features can we seek? • For others, sequential decode • DynProg: optimal expected probe time with dynamic program • Freq: allocate syncs at f with largest probe prob. p(f |ℓ) • Equi: measure off segments with about equal number of bits • EquiAndFreq: split budget [Figure: lemma block with records f1 to f4]
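The Freq and Equi policies can be compared on a toy lemma block. Everything in this sketch is an assumed simplification: probe cost is taken to be the number of bits decoded between the nearest preceding sync and the probed record, the start of the block is always seekable, and the sizes and probabilities are made up.

```python
from itertools import accumulate
from typing import List, Sequence


def expected_probe_cost(bits: Sequence[int], probs: Sequence[float],
                        syncs: Sequence[int]) -> float:
    """Expected bits decoded sequentially before reaching the probed record."""
    sync_set = set(syncs) | {0}          # block start is always reachable
    cost, since_sync = 0.0, 0
    for i, (b, p) in enumerate(zip(bits, probs)):
        if i in sync_set:
            since_sync = 0
        cost += p * since_sync
        since_sync += b
    return cost


def freq_syncs(bits: Sequence[int], probs: Sequence[float], k: int) -> List[int]:
    """Freq: syncs at the k most frequently probed records (record 0 needs none)."""
    order = sorted(range(1, len(probs)), key=lambda i: -probs[i])
    return sorted(order[:k])


def equi_syncs(bits: Sequence[int], probs: Sequence[float], k: int) -> List[int]:
    """Equi: syncs at records whose start offsets split the block into equal bit spans."""
    starts = [0] + list(accumulate(bits))[:-1]
    total = sum(bits)
    chosen = set()
    for j in range(1, k + 1):
        target = j * total / (k + 1)
        chosen.add(min(range(len(starts)), key=lambda i: abs(starts[i] - target)))
    return sorted(chosen)


# Two popular records followed by a long tail of rare ones (made-up numbers).
bits = [10] * 12
probs = [0.4, 0.3] + [0.03] * 10
cost_freq = expected_probe_cost(bits, probs, freq_syncs(bits, probs, 2))
cost_equi = expected_probe_cost(bits, probs, equi_syncs(bits, probs, 2))
print(cost_freq > cost_equi)  # True
```

In this toy block Freq spends its syncs next to the popular records at the front, so every tail probe pays for a long sequential decode (about 13.5 expected bits versus about 8.1 for Equi under the assumed cost model).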

  14. Outer allocation policies • Given overall budget K, how many syncs Kℓ does leaf ℓ get? • Hit prob pℓ, bits in leaf segment bℓ • Analytical expression for effect of inner allocation can be intractable • Hit: Kℓ ∝ pℓ • HitBit: Kℓ ∝ pℓ bℓ • SqrtHitBit: Assume equispaced inner allocation
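The three outer policies can be compared under a simplified cost model. With equispaced syncs, the expected sequential decode inside leaf ℓ is roughly proportional to bℓ / Kℓ, so the total expected cost is about Σ pℓ bℓ / Kℓ; minimizing that subject to Σ Kℓ = K gives Kℓ ∝ √(pℓ bℓ). The slide does not spell out the SqrtHitBit formula, so reading it as this square-root rule is an assumption, as are the fractional budgets and the toy numbers below.

```python
import math
from typing import List, Sequence


def allocate(weights: Sequence[float], budget: float) -> List[float]:
    """Split the sync budget in proportion to the given weights (fractional syncs)."""
    total = sum(weights)
    return [budget * w / total for w in weights]


def hit(p, b, budget):
    return allocate(p, budget)


def hit_bit(p, b, budget):
    return allocate([pi * bi for pi, bi in zip(p, b)], budget)


def sqrt_hit_bit(p, b, budget):
    return allocate([math.sqrt(pi * bi) for pi, bi in zip(p, b)], budget)


def modeled_cost(p, b, k) -> float:
    """Assumed cost model: sum over leaves of p * b / K (equispaced inner syncs)."""
    return sum(pi * bi / ki for pi, bi, ki in zip(p, b, k))


# Made-up leaves: hit probabilities and segment sizes in bits.
p = [0.6, 0.3, 0.1]
b = [1000, 4000, 20000]
K = 30
for policy in (hit, hit_bit, sqrt_hit_bit):
    print(policy.__name__, round(modeled_cost(p, b, policy(p, b, K)), 1))
```

Under this model the square-root rule gives the lowest cost by construction; measured results on real workloads depend on the inner policy and the actual probe distribution.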

  15. Experiments • 500 million pages, mostly English, spam-free • Catalog has about two million lemmas and entities • Our best policy compresses LFEM down to only 18 bits/entry compared to 33.6 bits/entry marginal entropy, and 128 bits/entry raw data [Figure: system pipeline with components: corpus (“reference” and “payload”), spotter, sampler, smoother, train and test folds, train and test contexts, ℓ,f workload, smoothed ℓ,f distribution, disambiguation trainer and cross-validator, ℓ,f → (e,w) model, compressor, L-F-E map, annotator, entity and type indexer, annotation index]

  16. Inner policies compared • Equi close to optimal DynProg but fast to compute • Freq surprisingly bad: long tail • Blending Equi and Freq worse than Equi alone • Relative order stable as sample size increased: long tail again [Plot: lookup cost by inner policy; lower is better]

  17. Diagnosis: Freq vs. Equi • Plots show cumulative seek cost starting at sync (note the scales) • Collapse back to zero at next sync • Features with largest frequency not evenly placed • Tail features in between lead to steep seek costs • Equi never lets seek cost get out of hand • (How about permuting features? See paper)

  18. Outer policies compared • Inner policy set to best (DynProg) • SqrtHitBit better than Bit better than HitBit • Not surprising, given DynProg behaves closer to Equi than Freq [Plot: probe cost vs. sync budget]

  19. SignHash, no training through collisions • Build w from separate lossless training • Distorted from SignHash • Most model values severely distorted • Give lossless and SignHash same RAM • Most keys collide • Completely unacceptable accuracy (random guessing is far better)

  20. SignHash, training through collisions • Used PEGASOS stochastic gradient descent for training • 77% of spots have label “NA” (no annotation) • 23% error by choosing NA for all spots • 11% error via lossless LFEM • SignHash given same RAM as LFEM • 18% error via SignHash • Much better than no training • But a lot worse than lossless LFEM • Surprising, given LFEM currently uses plain old naïve Bayes

  21. Comparison with other systems • Downloaded software or network services • Regression removes per-page, per-token overhead • LFEM wins, largely because of syncs • LFEM RAM << downloaded software

  22. Conclusion • Compressed in-memory multilevel maps for disambiguation • Random access via tuned sync allocation • >20 GB down to 1.15 GB • Faster than public disambiguation systems • Annotate 500M pages with 2M Wikipedia entities + index on 408 cores in ~18 hours • Sparse models for better storage? • Also in the paper: design of compressed annotation index posting list
