Compressed Data Structures for Annotated Web Search
Soumen Chakrabarti, Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, Rohit Saraf
Searching the annotated Web
• Search engines increasingly supplement the “ten blue links” using a Web of objects
• From object catalogs like:
  • WordNet: basic types and common entities
  • Wikipedia: millions of entities
  • Freebase: tens of millions of entities
  • Product catalogs, LinkedIn, IMDB, Zagat, …
• Several new capabilities are required:
  • Recognizing and disambiguating entity mentions
  • Indexing these mentions along with text
  • Query execution and entity ranking
Lemmas and entities
• In (Web) text, noisy and ambiguous lemmas are used to mention entities
• Lemma = a word or phrase
• The lemma-to-entity relation is many-to-many
• Goal: given a mention in context, find the correct entity in the catalog, if any
• A lemma is also called a “leaf” because we use a trie to detect mention phrases
• [Figure: bipartite graph from lemmas (“Michael”, “Jordan”, “Big Apple”, “city that never sleeps”, “New York”) to entities (Michael Jordan the basketball player, Michael Jordan the Berkeley professor, Jordan the country, Jordan the river, New York City, New York the US state)]
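The many-to-many lemma-to-entity relation and the trie-based phrase spotter can be sketched as follows. This is a toy illustration under stated assumptions: the catalog entries, entity names, and function names are hypothetical, not the system's actual data structures.

```python
# Hypothetical slice of the lemma-to-entity catalog (many-to-many).
lemma_to_entities = {
    "jordan": ["Michael_Jordan_(basketball)", "Michael_I._Jordan",
               "Jordan_(country)", "Jordan_River"],
    "big apple": ["New_York_City"],
    "new york": ["New_York_City", "New_York_(state)"],
}

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.lemma = None    # set when a complete lemma phrase ends here

def build_trie(lemmas):
    """Index lemma phrases token by token."""
    root = TrieNode()
    for phrase in lemmas:
        node = root
        for tok in phrase.split():
            node = node.children.setdefault(tok, TrieNode())
        node.lemma = phrase
    return root

def spot_mentions(tokens, root):
    """Greedy longest-match scan; returns (lemma, start, end) spans."""
    i, found = 0, []
    while i < len(tokens):
        node, j, best = root, i, None
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.lemma is not None:
                best = (node.lemma, i, j)
        if best:
            found.append(best)
            i = best[2]
        else:
            i += 1
    return found
```

Longest match matters: a scan of “big apple” should emit the two-token lemma, not stop at “big”.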
Features for disambiguation
• “After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.” → features: after, UNC, workshop, nonparametric, Bayesian, tutorial
• “After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.” → features: season, UNC, league, leap, slam dunk
• Feature vectors x live in a space of millions of features
Inferring the correct entity
• Each lemma is associated with a set of candidate entities
• For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ, e) in the same space as the feature vectors
• When deployed to resolve an ambiguity about lemma ℓ with context feature vector x, choose e* = argmax_e w(ℓ, e) · x (a linear model; a dot product)
The (ℓ, f, e) → w map
• An uncompressed key plus value takes 12 + 4 bytes = 128 bits per entry
• ~500M entries → 8 GB just for the map
• There is no primitive type to hold the keys
• With Java overheads, easily 20 GB of RAM
• What happens when we grow from ~2M to ~100M entities?
• Total marginal entropy: 33.6 bits per entry
• Can we get from 128 down to 33.6 bits per entry, and beyond?
• We must compress keys and values
• And exploit correlations between them
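The 33.6 bits/entry figure is a total marginal entropy: the sum of per-field empirical entropies of the (ℓ, f, e, w) columns. A minimal sketch of that recipe, on toy data (the real map's distributions are of course not reproduced here):

```python
import math
from collections import Counter

def marginal_entropy_bits(column):
    """Empirical entropy H = -sum p*log2(p) of one key/value field."""
    counts = Counter(column)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy (ℓ, f, e, w) entries; the total marginal entropy is the sum over
# the four fields, i.e. a lower bound if fields are coded independently.
entries = [(1, 7, 0, 2), (1, 9, 1, 2), (2, 7, 0, 3), (1, 7, 0, 2)]
total = sum(marginal_entropy_bits(col) for col in zip(*entries))
```

Exploiting correlations between fields (conditional coding) can push below this marginal-entropy baseline, which is why the slide says “and beyond”.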
(ℓ, f) → {e → w} organization (“LFE map” or LFEM)
• When scanning documents for disambiguation, we first encounter lemma ℓ and then features f from the context around it
• Initialize a score accumulator for each candidate entity e
• For each feature f in the context:
  • Probe the data structure with (ℓ, f)
  • Retrieve the sparse map {e → w}
  • For each entry in the map, update the entity scores
• Choose the top-scoring candidate entity
• [Figure: block layout, ℓ1 followed by f1 {e,w}, f2 {e,w}, f3 {e,w}, f4 {e,w}]
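The probe loop above can be sketched as follows, with a plain in-memory dict standing in for the compressed LFEM (an assumption for illustration; the real structure is the compressed map described on the next slides):

```python
from collections import defaultdict

def resolve(lemma, x, lfem):
    """x: sparse context feature vector {feature: value};
    lfem: (lemma, feature) -> sparse {entity: weight}.
    Accumulates w(ℓ,e)·x one (ℓ,f) probe at a time, then picks the top entity."""
    scores = defaultdict(float)
    for f, xf in x.items():
        for e, w in lfem.get((lemma, f), {}).items():
            scores[e] += w * xf
    return max(scores, key=scores.get) if scores else None
```

Note the access pattern: every probe is keyed by the (ℓ, f) pair, which is exactly why the structure is organized as (ℓ, f) → {e → w} rather than (ℓ, f, e) → w.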
Short entity IDs
• Millions of entities globally, but only a few for a given lemma
• Use variable-length integer codes: a frequent short ID gets the shortest code
• Short entity IDs are assigned with respect to the lemma: candidate entities are sorted by decreasing occurrence frequency in a reference corpus
• [Figure: lemma “Michael Jordan” with per-lemma short IDs 0: basketball player, 1: CBS/PepsiCo/Westinghouse exec, 2: machine learning researcher, 3: mycologist, 4: racing driver, 5: goalkeeper]
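The per-lemma short-ID assignment can be sketched as a frequency sort (toy counts and illustrative entity names, mirroring the slide's “Michael Jordan” example):

```python
def short_ids(candidate_counts):
    """candidate_counts: {entity: occurrences of this sense in a reference
    corpus}. The most frequent sense gets ID 0 and hence the shortest
    variable-length code."""
    ranked = sorted(candidate_counts, key=candidate_counts.get, reverse=True)
    return {e: i for i, e in enumerate(ranked)}

ids = short_ids({"mj_basketball": 9000, "mj_exec": 400,
                 "mj_ml_researcher": 120, "mj_mycologist": 30,
                 "mj_racing_driver": 12, "mj_goalkeeper": 5})
```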
Encoding of (ℓ, f) → {e → w}
• e = short ID
• We used the γ code; others may be better
• For adjacent short IDs, we spend only one bit
• Records have irregular sizes: we must read from the beginning of a block to decompress
• An index points to the start of the segment for each lemma ID
• [Figure: bit layout of the blocks for ℓ1 (records f1, f2) and ℓ2]
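Assuming the variable-length code here is the Elias γ code (the slide text is garbled at this point, so treat the specific code as an assumption), a sketch of encode/decode over a bit string. Note that the value 1, e.g. a gap of one between adjacent short IDs, costs a single bit, matching the slide.

```python
def gamma_encode(n):
    """Elias γ code for n >= 1: unary length prefix + binary value."""
    b = bin(n)[2:]                      # binary, MSB first
    return "0" * (len(b) - 1) + b       # len(b)-1 zeros, then the bits

def gamma_decode(bits, pos=0):
    """Decode one γ codeword starting at `pos`; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end
```

The irregular codeword sizes are exactly why, without extra index structure, a block must be decompressed from its beginning, motivating the sync points of the next slide.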
Random access on (ℓ, f)
• We already support random access on ℓ
• The number of distinct ℓ is O(10 million)
• Cannot afford the time to decompress from the beginning of an ℓ block
• Cannot afford a (full) index array for (ℓ, f)
• Within each ℓ block, allocate sync points (an old technique in IR indexing)
• New issues:
  • Outer allocation of the total sync budget among ℓ blocks
  • Tuning syncs to the measured (ℓ, f) probe distribution (the inner allocation)
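A sketch of a sync-point probe within one ℓ block, under a simplifying assumption: the real structure stores bit offsets into the compressed stream, while here a sync maps a feature ID to a record index so the seek-then-decode pattern is visible.

```python
import bisect

def seek(syncs, records, f):
    """syncs: sorted list of (feature_id, record_index) sync points;
    records: list of (feature_id, payload) in feature-ID order.
    Jump to the last sync at or before f, then decode sequentially.
    Returns (payload, number of records decoded after the jump)."""
    i = bisect.bisect_right([s[0] for s in syncs], f) - 1
    start = syncs[i][1]
    for steps, (fid, payload) in enumerate(records[start:]):
        if fid == f:
            return payload, steps + 1
    return None, len(records) - start
```

The number of records decoded per probe is the cost the allocation policies on the next slides try to minimize in expectation.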
Inner sync point allocation policies
• Say Kℓ sync points are budgeted to lemma ℓ
• To which features can we seek? For the others, sequential decode
• DynProg: optimal expected probe time via a dynamic program
• Freq: allocate syncs at the features f with the largest probe probability p(f | ℓ)
• Equi: measure off segments with about an equal number of bits
• EquiAndFreq: split the budget between the two
• [Figure: sync placement over features f1…f4]
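The Equi policy can be sketched as follows, assuming we know each feature record's compressed size in bits (`equi_syncs` is an illustrative name, not the system's code):

```python
def equi_syncs(record_bits, k):
    """Place k sync points so the segments between them hold roughly
    equal numbers of bits. record_bits[i] = compressed size of record i.
    Returns the indices of the records that receive sync points."""
    total = sum(record_bits)
    target = total / k            # ideal bits per segment
    syncs, acc = [0], 0           # a sync always sits at the block start
    for i, b in enumerate(record_bits):
        # open a new segment once the previous ones hold their share
        if acc >= target * len(syncs) and len(syncs) < k:
            syncs.append(i)
        acc += b
    return syncs
```

Because Equi ignores probe probabilities, it bounds the worst-case seek cost by roughly total/k bits; the talk's finding is that this simple bound beats the frequency-greedy Freq policy on long-tailed workloads.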
Outer allocation policies
• Given an overall budget K, how many syncs Kℓ does leaf ℓ get?
• Hit probability pℓ, bits in leaf segment bℓ
• An analytical expression for the effect of the inner allocation can be intractable
• Hit: Kℓ ∝ pℓ
• HitBit: Kℓ ∝ pℓ bℓ
• SqrtHitBit: Kℓ ∝ √(pℓ bℓ), assuming equispaced inner allocation (see Managing Gigabytes)
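The three outer policies can be sketched as proportional splits of the budget K. The proportionality rules are reconstructed from the policy names and the equispaced-inner-allocation remark, so treat them as an assumption; `allocate` is an illustrative name.

```python
import math

def allocate(K, p, b, policy):
    """Split sync budget K across lemma blocks given each block's probe
    probability p[i] and compressed size b[i] in bits."""
    if policy == "Hit":
        score = list(p)
    elif policy == "HitBit":
        score = [pi * bi for pi, bi in zip(p, b)]
    elif policy == "SqrtHitBit":
        # With equispaced inner syncs, expected decode cost scales like
        # p*b/K_leaf; minimizing the total under a fixed sum of K_leaf
        # gives K_leaf proportional to sqrt(p*b).
        score = [math.sqrt(pi * bi) for pi, bi in zip(p, b)]
    else:
        raise ValueError(policy)
    total = sum(score)
    return [max(1, round(K * s / total)) for s in score]
```

The square root damps the budget given to huge-but-popular blocks, spreading syncs to the tail; per the results slide, this is the strongest of the three heuristics.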
Experiments
• 500 million pages, mostly English, spam-free
• The catalog has about two million lemmas and entities
• Our best policy compresses the LFEM down to only 18 bits/entry, compared to 33.6 bits/entry marginal entropy and 128 bits/entry for the raw data
• [Figure: experimental pipeline. A spotter and sampler derive train/test folds, train/test contexts, and (ℓ, f) workloads from the “reference” corpus; a disambiguation trainer and cross-validator plus a smoother produce the (ℓ, f) → (e, w) model and a smoothed (ℓ, f) distribution; the compressor builds the L-F-E map; the annotator and the entity-and-type indexer build the annotation index and the “payload”]
Inner policies compared
• Equi is close to the optimal DynProg but fast to compute
• Freq is surprisingly bad: the probe distribution has a long tail
• Blending Equi and Freq is worse than Equi alone
• The relative order stays stable as the sample size increases: the long tail again
• [Plot: lookup cost per policy; lower is better]
Diagnosis: Freq vs. Equi
• The plots show cumulative seek cost starting at a sync, collapsing back to zero at the next sync (note the different scales)
• The features with the largest frequency are not evenly placed; tail features in between lead to steep seek costs
• Equi never lets the seek cost get out of hand
• (How about permuting features? See the paper)
Outer policies compared
• Inner policy set to the best (DynProg)
• SqrtHitBit better than Bit better than HitBit
• Not surprising, given that DynProg behaves closer to Equi than to Freq
• [Plot: probe cost vs. sync budget]
Comparison with other systems
• Compared against downloaded software and network services
• Regression removes per-page and per-token overheads
• LFEM wins, largely because of syncs
• LFEM needs far less RAM than the downloaded software
Conclusion
• Compressed in-memory multilevel maps for disambiguation
• Random access via tuned sync allocation
• From over 20 GB down to 1.15 GB
• Faster than public disambiguation systems
• Annotated 500M pages with 2M Wikipedia entities, plus indexing, on 408 cores in about 18 hours
• Also in the paper: the design of the compressed annotation index posting list