Ralf Schenkel, Andreas Broschart, Katja Hose, Steffen Metzger, Tom Crecelius, Aleksandar Stupar. Efficient Search in Semi-structured Data Spaces.
General Approach • Problem (information need on some data collection) • Model & Ranking (probabilistic, language models, authority, …) • Efficient algorithms • Evaluation of result quality • Evaluation of execution cost
Selected Projects • Efficient Information Retrieval • Social Networks • Distributed Knowledge Management • Whatever is left
Text Retrieval • Problem: Find the best documents d from a large collection that match a query {t1,…,tn} • Modeling and ranking: Define score for documents • Importance of t in the collection (the less frequent, the better) • Importance of t for document d (the more frequent, the better) • Linear combination for query scores • tf(d,t): frequency of tag t for doc d • df(t): #docs tagged with t
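The scoring formula itself is not preserved on this slide. As a minimal sketch, assuming a common tf·idf form, score(d,q) = Σ tf(d,t) · log(N / df(t)), the two importance notions and their linear combination could look as follows. The function names, the logarithmic weighting, and the toy collection are assumptions for illustration only.

```python
import math
from collections import Counter

def build_stats(docs):
    """docs maps a document id to its list of terms."""
    tf = {d: Counter(terms) for d, terms in docs.items()}  # tf[d][t]
    df = Counter()                                         # df[t]
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    return tf, df, len(docs)

def score(d, query, tf, df, n_docs):
    """Sum of tf(d,t) * log(N / df(t)) over the query terms (assumed form)."""
    total = 0.0
    for t in query:
        if df[t] > 0:  # terms that occur nowhere contribute nothing
            total += tf[d][t] * math.log(n_docs / df[t])
    return total

# Hypothetical toy collection.
docs = {
    "d1": ["french", "pianist", "pianist"],
    "d2": ["french", "cuisine"],
    "d3": ["jazz", "pianist"],
}
tf, df, n_docs = build_stats(docs)
query = ["french", "pianist"]
ranking = sorted(docs, key=lambda d: score(d, query, tf, df, n_docs), reverse=True)
```

Here `ranking` starts with `d1`, the only document that matches both terms and repeats one of them.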
What about efficiency? • Cannot compute this from scratch for each query (>>10^10 documents) • Solution: • Precompute per-term scores for each document • For each term, store list of (d, score(d,t)) on disk • When query arrives: • combine entries from lists • sort results • return top-k (merge-then-sort algorithm)
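The merge-then-sort steps above can be sketched as follows. This is an illustration under simple assumptions: the index is an in-memory dictionary rather than on-disk lists, per-term scores are summed, and the index contents are hypothetical.

```python
from collections import defaultdict

def merge_then_sort(index, query, k):
    """index maps a term to its precomputed list of (doc, score(doc, term))."""
    acc = defaultdict(float)
    for term in query:
        for doc, s in index.get(term, []):  # combine entries from all lists
            acc[doc] += s
    # Sort all candidates, then cut off after the k best.
    ranked = sorted(acc.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

# Hypothetical precomputed lists for two terms.
index = {
    "t1": [("T", 0.99), ("G", 0.77), ("B", 0.51), ("A", 0.15), ("D", 0.01)],
    "t2": [("B", 0.80), ("T", 0.30), ("A", 0.25)],
}
top2 = merge_then_sort(index, ["t1", "t2"], k=2)
top2_docs = [doc for doc, _ in top2]
```

Every entry of every list is read before the first result is known, which is the cost the threshold algorithms later in the talk try to avoid.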
Family of Threshold Algorithms • But: Lists can be very long (millions of entries) • Simple merge-then-sort algorithm too expensive • Observation: "Good" results have high scores • Order lists by decreasing scores • Have "intelligent" algorithm with different list access modes and early stopping [Figure: example list in decreasing score order: T: 0.99, G: 0.77, B: 0.51, A: 0.15, D: 0.01]
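As one member of this family, the classic threshold algorithm (TA) with sum aggregation can be sketched as below. It is a simplified illustration, not the group's own implementation: random access is simulated with in-memory dictionaries, scores are assumed to be non-negative, and the second list is hypothetical.

```python
def threshold_algorithm(lists, k):
    """lists: one list of (doc, score) per query term, sorted by descending score.

    Returns (top-k as (doc, score) pairs, scan depth reached).
    """
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    best = {}                              # doc -> full aggregated score
    max_depth = max(len(lst) for lst in lists)
    depth = 0
    while depth < max_depth:
        last_seen = []
        for lst in lists:
            if depth < len(lst):
                doc, s = lst[depth]        # sequential access
                last_seen.append(s)
                if doc not in best:        # random accesses to the other lists
                    best[doc] = sum(lk.get(doc, 0.0) for lk in lookup)
            else:
                last_seen.append(0.0)      # exhausted list cannot contribute
        depth += 1
        threshold = sum(last_seen)         # best possible score of unseen docs
        top = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth              # early stopping
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:k], depth

list_1 = [("T", 0.99), ("G", 0.77), ("B", 0.51), ("A", 0.15), ("D", 0.01)]
list_2 = [("B", 0.80), ("T", 0.30), ("A", 0.25)]  # hypothetical second list
top2, depth = threshold_algorithm([list_1, list_2], k=2)
top2_docs = [doc for doc, _ in top2]
```

In this toy run the scan stops at depth 2 of 5, because the second-best full score already exceeds the sum of the scores last seen in both lists.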
Experiments with TREC Benchmark • TREC Terabyte collection: ~24 million docs from .gov domain, ~420GB (unpacked) size (we now have one with 10^9 docs, 5TB compressed size) • 50 keyword queries from TREC Terabyte 2005 • Performance measures: • Number of sequential accesses (SA) and random accesses (RA) • Weighted cost: #SA + C · #RA • Wall-clock runtime
Experiments: (TA and) CA on TREC [Figure: two plots over k = 10, 50, 100, 200, 500, showing average abstract cost (#SA + 1000 x #RA) and average wall-clock runtime (milliseconds); curves: merge-then-sort, State-of-the-art-1, State-of-the-art-2, OURS, lower bound] • Lower bound: for each query [VLDB06, with H. Bast] • compute top-k results R and final mink • find minimum over all combinations of scan depths that see R • SA cost + RA cost for candidates with bestscore>mink • considers blocks of entries for tractability (You can safely ignore this part)
Beyond Exact Top-K Results • Improve performance by considering approximate results with probabilistic guarantees • drop candidate when probability for being top-k result is <ε • estimate probabilities from per-list score distributions • reasonable improvement in performance (stop earlier) • probabilistic guarantee: E[relative recall @ k] = 1-ε • Maximize result quality within fixed budget for execution cost (number of accesses, time) • adaptive scheduling: initially prefer high scores, later high score drops • Experimental results close to optimal (offline) results [VLDB04] [ICDE09]
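The candidate-dropping rule can be illustrated with a small sketch. It assumes that each list a candidate has not yet been seen in comes with a discrete score distribution (a dictionary from score to probability) and that lists are independent, so the distributions can be convolved. These modeling choices, the function names, and the numbers are assumptions for illustration, not the method of the cited papers.

```python
def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete scores."""
    out = {}
    for sa, pa in dist_a.items():
        for sb, pb in dist_b.items():
            key = round(sa + sb, 6)  # merge sums that differ only by rounding
            out[key] = out.get(key, 0.0) + pa * pb
    return out

def prob_exceeds(missing_dists, delta):
    """P(sum of the still unknown per-list scores > delta)."""
    total = {0.0: 1.0}
    for dist in missing_dists:
        total = convolve(total, dist)
    return sum(p for s, p in total.items() if s > delta)

def keep_candidate(worstscore, missing_dists, min_k, eps):
    """Drop the candidate if its chance of beating min_k is below eps."""
    return prob_exceeds(missing_dists, min_k - worstscore) >= eps

# Hypothetical distribution for each of two lists not yet seen for a candidate.
unseen = {0.0: 0.9, 0.5: 0.1}
p_weak = prob_exceeds([unseen, unseen], 1.0 - 0.4)
weak_kept = keep_candidate(0.4, [unseen, unseen], min_k=1.0, eps=0.05)
strong_kept = keep_candidate(0.8, [unseen, unseen], min_k=1.0, eps=0.05)
```

With these numbers the weak candidate needs both unseen scores to be 0.5, which has probability 0.01, so it is dropped. The strong candidate needs only one of them and is kept.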
Even More Heuristics: Proximity • Observation: "Good" results have term matches close together [SPIRE07] • Add second type of list: for each term pair, include documents with close occurrences of the terms, ordered by distance-based score [Figure: term lists TL(french) and TL(pianist) with (document: score) entries such as A:9.3, F:9.1, B:8.6, T:7.2, and combined list CL(french, pianist) with entries B:(3.0,8.6,4.5), F:(0.7,9.1,1.5), T:(0.5,3.0,7.2), G:(0.2,2.0,1.7); all lists in descending score order]
Query Processing • Observation: very small prefixes of the lists yield good results • Prune and reorganize index lists: keep only a prefix of TL(french), TL(pianist), and CL(french, pianist) in descending score order, then store it by ascending did • Merge join the pruned lists, sort, and return the top-k results • Parameters tuned through exhaustive search in the parameter space (4h on 80-core Hadoop cluster) • Resulting index approx. as large as the collection [Figure: example lists before (descending score) and after (ascending did) reorganization]
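The pruning, reorganization by document id, merge join, and final sort named on this slide can be sketched as follows. The sketch assumes that a document's final score is the sum of its entries across the pruned lists and that the combined list contributes a single proximity score per document. Both are assumptions, and the list contents are illustrative values patterned on the slide example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def prune_and_reorganize(score_sorted_list, prefix_len):
    """Keep the best prefix_len entries, then order them by document id."""
    return sorted(score_sorted_list[:prefix_len], key=itemgetter(0))

def merge_join_top_k(did_sorted_lists, k):
    """Merge join on document id, sum the scores, then sort for the top-k."""
    merged = heapq.merge(*did_sorted_lists, key=itemgetter(0))
    totals = [(did, sum(s for _, s in group))
              for did, group in groupby(merged, key=itemgetter(0))]
    return sorted(totals, key=lambda x: (-x[1], x[0]))[:k]

# Illustrative lists in descending score order.
term_list_1 = [("A", 9.3), ("T", 7.2), ("E", 5.0), ("B", 4.5)]
term_list_2 = [("F", 9.1), ("B", 8.6), ("A", 5.9), ("D", 4.6)]
proximity_list = [("B", 3.0), ("F", 0.7), ("T", 0.5), ("G", 0.2)]

pruned = [prune_and_reorganize(lst, 3)
          for lst in (term_list_1, term_list_2, proximity_list)]
top2 = merge_join_top_k(pruned, k=2)
top2_docs = [did for did, _ in top2]
```

Because all pruned lists are sorted by document id, a single linear pass suffices and no random access is needed. The price is that scores cut off by pruning are lost, so the ranking is approximate.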
Evaluation at INEX 2009 • Standard benchmark for XML retrieval • 2.6 million XML documents with semantic annotation from YAGO • 113 human-defined queries, 75 come with list of relevant results • Explicit efficiency task
Runtime vs. Quality at INEX 2009
Selected Projects • Efficient Information Retrieval • Social Networks • Distributed Knowledge Management • Whatever is left MMCI Retreat, Braunshausen
Querying Social Tagging Networks [Figure: users in a social network with their tagged items; example tags: travel, norway, vldb, icde, mexico, trip, harry, potter, probability, data mining, foundations]
Information Need 1: Globally Popular [Figure: tagging network with the query "harry potter" and two candidate items] • Most frequently tagged items are "best" • Tags by all users are equally important
Information Need 2: Similar Users [Figure: tagging network with the query "travel" and two candidate items] • Tags by users with similar tags/items ("brothers in spirit") are more important
Information Need 3: Trusted Friends [Figure: tagging network, extended by items tagged "probability" and "selling", with the query "probability" and two candidate items] • Tags by closely related and well-known users are more important
Towards Social-Aware Social Search • Search results may depend on • Global popularity of items • Spiritual context of the querying user (users with similar books and/or tags) • Social context of the querying user (known and trusted friends) • Combinations • Users can have different importance ("friendship strengths") in different searches • Importance of a user is a convex combination of the three weights (with params α, β)
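The convex combination can be written out as a short sketch. It assumes that α weights the social component, β the spiritual component, and the remainder 1 − α − β the global component. The slide states only that the combination is convex with parameters α and β, so this assignment and the function name are assumptions.

```python
def user_importance(global_w, spiritual_w, social_w, alpha, beta):
    """Convex combination of the three per-user weights (assumed assignment).

    alpha: share of the social weight, beta: share of the spiritual weight,
    1 - alpha - beta: share of the global weight.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return alpha * social_w + beta * spiritual_w + (1 - alpha - beta) * global_w

# Hypothetical weights of one user u' as seen from the querying user u.
pure_global = user_importance(0.1, 0.4, 0.9, alpha=0.0, beta=0.0)
pure_social = user_importance(0.1, 0.4, 0.9, alpha=1.0, beta=0.0)
mixed = user_importance(0.1, 0.4, 0.9, alpha=0.2, beta=0.5)
```

Setting α = β = 0 recovers a purely global search, while α = 1 lets only trusted friends count.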
Prototype [VLDB/SIGIR 2008 demo]: results of global search for "dragon"
Prototype [VLDB/SIGIR 2008 demo]: results of social search for "dragon"
Preliminary User Study • LibraryThing user study: [Data Engineering Bulletin, June 2008] • 6 LibraryThing users with reasonably large library and friend sets • 49 queries like "mystery magic", "wizard", "yakuza" • Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friend links • Measured NDCG[10] (weighted precision@10) for different settings of β (spiritual) and α (social) • Result quality generally very high • Combination of spiritual and social friends significantly better than pure global search
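For readers unfamiliar with the measure, NDCG at rank 10 can be computed as in the sketch below. It uses one common variant (gain divided by log2 of the rank plus one) and normalizes against the ideal ordering of the returned list only. Which variant and which normalization the study used is not stated on the slide, so both are assumptions.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain of the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """DCG divided by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades of a result list, best rank first.
perfect = ndcg([3, 2, 1, 0])
swapped = ndcg([1, 3, 2, 0])
empty = ndcg([0, 0, 0])
```

A list that is already in descending order of relevance scores 1.0, and any misordering among the top ranks lowers the value.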
Algorithmic Overview • Input: query q={t1…tn} for user u, parameters α, β • Output: k items with highest scores [Figure: example query "harry potter"]
Can we reuse Threshold Algorithms here? No, scores are specific to the querying user and the parameter setting! [Figure: one precomputed score list for "harry" per user and per parameter setting, for example (0.2, 0.5), (0.0, 0.8), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5), plus corresponding lists for "travel"] Number of lists to precompute would explode! (#tags × #users × parameter space)
Top-K in Social Networks: ContextMerge [SIGIR 2008] • Precomputed lists (already exist in systems): • ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t)↓ • USERITEMS(u',t): pairs <i, tf_u'(i,t)>, unsorted • FRIENDS(u): pairs <u', F(u,u')>, sorted by F(u,u')↓ [Figure: example lists ITEMS(harry), USERITEMS(u', harry), and FRIENDS(u)]
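To show how the precomputed lists fit together, the sketch below computes a user-specific score at query time from FRIENDS(u) and USERITEMS(u',t). It assumes the score of an item is the sum over friends u' of F(u,u') · tf_u'(i,t), and it reads all lists completely. It is therefore a naive baseline for illustration, not ContextMerge itself, which is described in [SIGIR 2008]. All names and values are hypothetical.

```python
from collections import defaultdict

def social_top_k(friends, user_items, k):
    """friends: list of (u', F(u,u')) sorted by descending strength.
    user_items: dict u' -> list of (item, tf_u'(item, t)) for one tag t.
    """
    scores = defaultdict(float)
    for friend, strength in friends:
        for item, tf in user_items.get(friend, []):
            scores[item] += strength * tf  # weight each tagging by friendship
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical FRIENDS(u) and USERITEMS(u', harry) lists.
friends = [("bob", 0.12), ("carol", 0.10), ("dave", 0.085)]
user_items = {
    "bob": [("book1", 1)],
    "carol": [("book1", 1), ("book2", 1)],
    "dave": [("book3", 1)],
}
top2 = social_top_k(friends, user_items, k=2)
top2_docs = [item for item, _ in top2]
```

Because the friendship strengths are applied at query time, no list has to be precomputed per querying user or per parameter setting.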
Experimental Evaluation: Efficiency • Testbed: 3 large crawls of real social networks • Flickr: 10 million pictures, ~50,000 users • Del.icio.us: ~175,000 bookmarks, ~12,000 users • LibraryThing: ~6.5 million books, ~10,000 users • Queries: • 150 frequent tag pairs • for each query pick user with "enough" results & friends • Abstract cost measure: disk load • Baseline: full merge + sort
Experimental Evaluation: Efficiency (β=0) [Figure: cost for varying α] • 2-8 times better than baseline
Selected Projects • Efficient Information Retrieval • Social Networks • Distributed Knowledge Management • Whatever is left
WisNetGrid: Semantic Search for D-Grid • D-Grid: German science grid providing computing and storage resources • Many topic-specific communities: Astro-, Text-, Medi-, Interlog-, Wiss- (Science-), Finance-, … • Two services missing so far (among others): • Integrated search over all data sources • Extraction of facts from data and fact-based search WisNetGrid: BMBF project with 10 national partners
Selected Projects • Efficient Information Retrieval • Social Networks • Distributed Knowledge Management • Whatever is left
Whatever is left • Everlast: Distributed Web Archiving (with A. Anand, S. Bedathur, MPI-INF) • IR on knowledge graphs (with S. Elbassuoni, M. Ramanath, G. Weikum, MPI-INF) • Summarization of knowledge about entities (with M. Sydow, U Warsaw, PL) • Assessments for XML IR with Amazon Mechanical Turk (with O. Alonso, Microsoft, and M. Theobald, MPI-INF) • INEX Efficiency Task (with M. Theobald, MPI-INF, and A. Trotman, U Otago, NZ)