Adding Ranking to DB (= Adding “Semantics” to IR)

Adding Ranking to DB(= Adding “Semantics” to IR) Keyword Search on Relational Graphs (BANKS, Discover, DBexplorer, …) IR Systems Search Engines Unstructured search (keywords) + Digital Libraries + Enterprise Search + Web 2.0 DB Systems Structured search (SQL, XQuery) Querying entities & relations from IE (Libra, ExDB, NAGA, … ) + Text + Relax. & Approx. + Ranking Structured data (records) Unstructured data (documents) Trend: quadrants getting blurred towards DB&IR technology integration

Why DB&IR ? – Application Needs Simplify life for application areas like: • Global health-care management for monitoring epidemics • News archives for journalists, press agencies, etc. • Product catalogs for houses, cars, vacation places, etc. • Customer support & CRM in insurances, telcom, retail, software, etc. • Bulletin boards for social communities • Enterprise search for projects, skills, know-how, etc. • Personalized & collaborative search in digital libraries, Web, etc. • Comprehensive archive of blogswith time-travel search Typical data: Disease (DId, Name, Category, Pathogen …) UMLS-Categories ( … ) Patient (… Age, HId, Date, Report, TreatedDId) Hospital (HId, Address …) Typical query: symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks

DB Tag Cloud Probability Theory Information Extraction Statistics Text Mining Pay-As-You-Go End-Users Ranking Schema-Free Dataspaces Heterogeneity Uncertainty Record Linkage E-Science Schema Evolution Lineage Data Integration Focus on Programmer Schema Structure Logic XQuery Transactions Database XML Scalability Workflows SQL Indexing XQuery Sensors Algebra Top-k Query Execution Plan Threshold Algorithm Query Optimizer Streams Skyline Selectivity Estimation Query Rewriting Similarity Search Hash Sketches Histogram Join Order Magic Sets

Outline • DB & IR Motivation 1: Text Matching • DB & IR Motivation 2: Too-Many-Answers Problem • DB & IR Motivation 3: Schema Relaxation • DB & IR Motivation 4: Information Extraction & Entity Search

DB & IR Motivation 1: Text Matching • Add keyword search to structured data (relations) • or semistructured data (XML trees) • Combine this with structured predicates (SQL, XPath) • Add scoring and ranking • (tf*idf, Prob IR, … + PR or HITS when applicable)

WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98] Scoring and ranking: xj ~ tf (word j in x)  idf (word j) s (<x,y>, q: A~B) = cosine (x.A, y.B) with dampening & normalization s (<x,y>, q1 …  qm) = Add text-similarity selection and join to relational algebra Example: Select * From Movies M, Reviews R Where M.Plot ~”fight“ And M.Year > 1990 And R.Rating > 3 And M.Title ~ R.Title And M.Plot ~ R.Comment Movies Reviews Title Plot … Year Title Comment … Rating 1999 Matrix 1 … cool fights … new techniques … Matrix In the near future … computer hacker Neo … … fight training … … 4 • DB&IR for query-time data integration • More recent work: MinorThird, Spider, DBLife, etc. • But scoring models fairly ad hoc Matrix Reloaded … fights … and more fights … … fairly boring … 1 Hero 2002 In ancient China … fights … sword fight … fights Broken Sword … Matrix Eigenvalues … matrix spectrum … orthonormal … 5 2004 Shrek 2 In Far Far Away … our lovely hero fights with cat killer … Ying xiong aka. Hero … fight for peace … … sword fight … dramatic colors … 5

XML Search for Higher Precision Keyword query: Madonna child Entity-Relation query: Person x = „Madonna“ & HasChild (x,y) Semantic XPath Full-Text query: /Article [ftcontains(//Person, ”Max Planck“)] or even: /Article [ftcontains(//Person, ”Max Planck“)] [ftcontains(//Work, ”quantum physics“)] //Children[@Gender = ”female“]//Birthdates Keyword query:Max Planck or or Concept query: Person = „Max Planck“ should ideally be automatically inferred from keyword query + user background and implicit feedback

XML Text Search with Result Ranking //Professor [//Country”Germany“)] [//Course”IR“)] [//Research”XML“)] Which professors from Germany give courses on IR and run projects on XML? • Query predicates: • tag-term conditions • (subtree contains terms) • Scoring and ranking: • element-specific • tf*idf or BM25for • content conditions • (tag-term scores) • score aggregation with • prob. independence • extended TA for • fast query execution Professor Name: Gerhard Weikum Address ... City: SB Research Teaching Country: Germany ... Course Project Title: Intelligent Search of Heterogeneous XML Data Title: IR Syllabus Description: Information retrieval ... ... Book Article ... ... Funding: EU

From Tables and Trees to Graphs [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS] Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges Example: Conferences (CId, Title, Location, Year) Journals (JId, Title) CPublications (PId, Title, CId) JPublications (PId, Title, Vol, No, Year) Authors (PId, Person) Editors (CId, Person) Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95 • Related use cases: • XML beyond trees • RDF graphs • ER graphs (e.g. from IE) • social networks Result is connected tree with nodes that contain as many query keywords as possible QP over relational DB: exploit schema, generate meaningful joins Ranking: with nodeScore based on tf*idf or prob. IR and edgeScore reflecting importance of relationships (or confidence, authority, etc.) Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

Keyword Search on Graphs: Semantics member citizen member member EU Subtleties of Interconnection Semantics [S. Cohen et al. 2005, B. Kimelfeld et al. 2007] trustee Martin Kersten Amsterdam director city country EDBT School member CWI Netherlands citizen Arjen de Vries content speaker VLDB Endowment city trustee program Gerhard Weikum speaker director city Bolzano MPII Saarbrücken Italy country country citizen Germany Paolo Atzeni trustee • Variations: • directed vs. undirected graphs, strict vs. relaxed • conditions on nodes, conditions on edges (node pairs) • all conditions mandatory or some optional • dependencies among conditions

DB & IR Motivation 2: Too-Many-Answers Problem Precise queries yield too many or too few results (actually indicating uncertainty in the search goal) • Search paradigms: • top-k, n-nearest-neighbors, score aggregation • preference search for skylines • Top-k results from ranked retrieval on • product catalog data: aggregate preference scores for • properties such as price, rating, sports facilities, beach type, etc. • multimedia data: aggregate similarity scores for color, shape, etc. • text, XML, Web documents: tf*idf, authority, recency, spamRisk, etc. • Internet sources: aggregate properties from distributed sources • e.g., site1: restaurant rating, site2: prices, site3: driving distances, etc. • social networks:ranking based on friends, tags, ratings, etc.

Probabilistic Ranking for SQL [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS‘06] SQL queries that return many answers need ranking • Examples: • Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …) • Select * From Houses Where View = ”Lake“ And City In (”Redmond“, ”Bellevue“) • Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …) • Select * From Movies Where Genre = ”Romance“ And Era = ”90s“ odds for tuple d with attributes XYrelevant for query q: X1=x1… Xm=xm Estimate prob‘s, exploiting workload W: • Example: frequent queries • … Where Genre = ”Romance“ And Actor1 = ”Hugh Grant“ • … Where Actor1 = ”Hugh Grant“ And Actor2 = ”Julia Roberts“ • boosts HG and JR movies in ranking for Genre = ”Romance“ And Era = ”90s“

Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99] Rank Doc Score Rank Doc Score Rank Doc Score 1 d10 2.1 1 d10 2.1 1 d10 2.1 Rank Doc Score 2 d78 1.5 2 d78 1.5 2 d78 1.5 1 d10 2.1 2 d64 0.9 1 d78 1.5 1 d78 1.5 2 d78 1.5 2 d64 1.2 simple & DB-style; needs only O(k) memory; for monotone score aggr. Threshold algorithm (TA): scan index lists; consider d at posi in Li; highi := s(ti,d); if d  top-k then { look up s(d) in all lists L with i; score(d) := aggr {s(d) | =1..m}; if score(d) > min-k then add d to top-k and remove min-score d’; min-k := min{score(d’) | d’  top-k}; threshold := aggr {high | =1..m}; if thresholdmin-k then exit; Data items: d1, …, dn d1 s(t1,d1) = 0.7 … s(tm,d1) = 0.2 Query: q = (t1, t2, t3) aggr: summation Index lists k = 2 Rank Doc Score d78 0.9 d23 0.8 d10 0.8 d1 0.7 d88 0.2 t1 Scan depth 1 … Scan depth 2 Scan depth 3 Scan depth 4 1 d78 0.9 d64 0.9 d23 0.6 d10 0.6 d12 0.2 d78 0.1 t2 … 2 d64 0.9 d10 0.7 d78 0.5 d64 0.3 d99 0.2 d34 0.1 t3 … STOP!

TA with Sorted Access Only (NRA)[Fagin 01, Güntzer et al. 01] sequential access (SA) faster than random access (RA) by factor of 20-1000 No-random-access algorithm (NRA): scan index lists; consider d at posi in Li; E(d) := E(d)  {i}; highi := s(ti,d); worstscore(d) := aggr{s(t,d) |  E(d)}; bestscore(d) := aggr{worstscore(d), aggr{high |   E(d)}}; if worstscore(d) > min-k then add d to top-k min-k := min{worstscore(d’) | d’  top-k}; else if bestscore(d) > min-k then cand := cand  {d}; threshold := max {bestscore(d’) | d’ cand}; if threshold  min-k then exit; Data items: d1, …, dn d1 s(t1,d1) = 0.7 … s(tm,d1) = 0.2 Query: q = (t1, t2, t3) aggr: summation Index lists k = 1 d78 0.9 d23 0.8 d10 0.8 d1 0.7 d88 0.2 t1 Scan depth 1 … Scan depth 2 Scan depth 3 d64 0.8 d23 0.6 d10 0.6 d12 0.2 d78 0.1 t2 … d10 0.7 d78 0.5 d64 0.4 d99 0.2 d34 0.1 t3 STOP! …

History of TA Family and Related Top-k Work Bruno 02 (distr.TA) Cao 04 (TPUT) Michel 05 (KLEE) Balke 05 (P2P top-k) Theobald 05 (XML) Kaushik 04 (XML) Ciaccia 00 (PAC top-k) Theobald 04 (prob. top-k) Bast 06 (sched.) Chang 02 (exp. pred.) Li 06 (ad-hoc) Xin 07 (ad-hoc) Chaudhuri 96 Kersten 02 (PQs) Fagin 97 (A0) Fagin 01 (TA) Fagin 03 Nepal 99 Natsev 01 Güntzer 00 Güntzer 01 Agrawal 03 (many answers) Balke 02 Ilyas 03 Ilyas 05 Moffat 96 Anh 01 Anh 06 Buckley 85 Pfeifer 97 Fuhr 06 Persin 06 time 1985 1995 2000 2005

DB & IR Motivation 3: Schema Relaxation • Traditional DB focus was one DB with perfectly clean data • New focus is many DBs with on-the-fly fusion (partial integration) • Modern apps (mashups etc.) even require many DBs • with non-DB sources (blogs, maps, sensors, etc.), many with • no schema, partial typing, or rapidly evolving schema • More appropriate abstraction is data spaces • (with pay-as-you-go schema) • Calls for schema-free or schema-relaxed querying, • which entails inherent need for ranking

XML IR with Background Ontology alchemist magician primadonna director artist wizard investigator intellectual Related (0.48) researcher professor Hyponym (0.749) scientist Lecturer scholar lecturer mentor academic, academician, faculty member Name: Ralf Schenkel teacher Activities Address: Saarland University, D-66123 Saarbrücken Seminar Scientific Other … Contents: Future of Web search … Name: INEX task coordinator (Initiative for the Evaluation of XML …) Sponsor: EU • Query expansion for tags & terms based on • related concepts in ontology/thesaurus & • strength of relatedness (or correlation) •  robustness and efficiency issues Literature: … //Professor [//Country”Germany“)] [//Course”IR“)] [//Research”XML“)] Professor Name: Gerhard Weikum Address ... City: SB Research Teaching Country: Germany ... Course Project Title: Intelligent Search of Heterogeneous XML Data Title: IR Syllabus Description: Information retrieval ... ... Book Article ... ... Funding: EU

XML: Structural Similarity Structural similarity and ranking based on tree edit distance (FleXPath, Timber, …) Query 1: //movie [ftcontains(/plot/location, ”Tibet“)] Query 2: //movie [ftcontains(//plot, ”happily ever after“)] //actor actor movie movie movie actor director actor plot plot director location location location Score = min. relaxation cost to unify structures of query and data trees based on node insertion, node deletion, node generalization, edge generalization, subtree promotion, etc.

DB & IR Motivation 4: Information Extraction & Entity Search • Best content producer (rate*quality) is text and speech • (scientific literature, news, blogs, photo/video tags, etc.) • Extract entities, attributes, relations from text • Build knowledge bases as ER graphs (with uncertainty) • Move from keyword search to the level of • entity/relation querying • Most likely already in use by leading search engines for • for attractive vertical domains (travel, electronics, entertainment)

Entity Search: Example Google Which politicians are also scientists ? • What is lacking? • data is not knowledge •  extraction and organization • keywords cannot express • advanced user intentions •  concepts, entities, properties, • relations

Information Extraction (IE): Text to Records Person BirthDate BirthPlace ... Max Planck 4/23, 1858 Kiel Albert Einstein 3/14, 1879 Ulm Mahatma Gandhi 10/2, 1869 Porbandar Person ScientificResult Max Planck Quantum Theory Person Collaborator Max Planck Albert Einstein Max Planck Niels Bohr Constant Value Dimension Planck‘s constant 6.2261023 Js Person Organization Max Planck KWG / MPG • extracted facts often • have confidence < 1 • DB with uncertainty (probabilistic DB) combine NLP, pattern matching, lexicons, statistical learning

Entity Reconciliation (Fuzzy Matching, Entity Matching/Resolution, Record Linkage) • Problem: • same entity appears in • different spellings (incl. mis-spellings, abbr., multilingual, etc.) • e.g. Brittnee Speers vs. Britney Spears, M-31 vs. NGC 224, • Microsoft Research vs. MS Research, Rome vs. Roma vs. Rom • different levels of completeness • e.g. Joe Hellerstein (UC Berkeley) vs. Prof. Joseph M. Hellerstein, CA • Larry Page (born Mar 1973) vs. Larry Page (born 26/3/73) • Microsoft (Redmond, USA) vs. Microsoft (Redmond, WA 98002) • different entities happen to look the same • e.g. George W. Bush vs. George W. Bush, Paris vs. Paris • Problem even occurs within structured databases and • requires data cleaning when integrating multiple databases • Current approaches are: • edit distance measures, context consideration, reference dictionaries, • statistical learning in graph models, etc.

Entity-Search Ranking with LM[Z. Nie et al.: WWW 2007; see also T. Cheng: VLDB 2007] Standard LM for docs with background model (smoothing): Assume entity e was seen in k records r1, …, rk extracted from k pages d1, …, dk with accuracy1, …, k record-level LM with context window around ri in di (default: only ri itself) alternatively consider individual attributes e.aj with importance j extracted from page di with accuracy ij

Entity-Search Ranking by Link Analysis[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007] • EntityAuthority (ObjectRank, PopRank, HubRank, EVA, etc.): • define authority transfer graph • among entities and pages with edges: • entity  page if entity appears in page • page  entity if entity is extracted from page • page1  page2 if there is hyperlink or implicit link between pages • entity1  entity2 if there is a semantic relation between entities • edges can be typed and (degree- or weight-) normalized and • are weighted by confidence and type-importance • also applicable to graph of DB records with foreign-key relations • (e.g. bibliography with different weights of publisher vs. location for conference record) • compared to standard Web graph, ER graphs of this kind • have higher variation of edge weights

YAGO Knowledge Base from Wikipedia & WordNet Entities Facts KnowItAll 30 000 SUMO 20 000 60 000 WordNet 120 000 800 000 Cyc 300 000 5 Mio. TextRunner n/a 8 Mio. YAGO 1.7 Mio. 15 Mio. DBpedia 1.9 Mio. 103 Mio. Entities & Relations Entity subclass subclass Person concepts Location subclass Scientist subclass subclass subclass subclass City Country Biologist Physicist instanceOf instanceOf Erwin_Planck Nobel Prize bornIn Kiel hasWon FatherOf individuals diedOn bornOn October 4, 1947 Max_Planck April 23, 1858 means means means “Max Karl Ernst Ludwig Planck” “Dr. Planck” “Max Planck” words Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

Graph IR on Knowledge Bases Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness discovery queries hasWon diedOn Nobel prize $a $x bornIn isa Kiel $x scientist > hasSon diedOn $y $b connectedness queries isa * German novelist Thomas Mann Goethe queries with regular expressions hasFirstName | hasLastName isa Ling $x scientist (coAuthor | advisor)* worksFor locatedIn* $y Zhejiang Beng Chin Ooi

Ranking Factors • Confidence: • Prefer results that are likely to be correct • Certainty of IE • Authenticity and Authority of Sources bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“ (Wikipedia) livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars“ (Martian Bloggeria) • Informativeness: • Prefer results that are likely important • May prefer results that are likely new to user • Frequency in answer • Frequency in corpus (e.g. Web) • Frequency in query log q: isa (Einstein, $y) isa (Einstein, scientist) isa (Einstein, vegetarian) q: isa ($x, vegetarian) isa (Einstein, vegetarian) isa (Al Nobody, vegetarian) • Compactness: • Prefer results that are tightly connected • Size of answer graph vegetarian Tom Cruise isa isa bornIn Einstein won 1962 won Nobel Prize Bohr diedIn

Entity-Relation Search: Example NAGA Query: $x isa politician $x isa scientist Results: Benjamin Franklin Paul Wolfowitz Angela Merkel …

Major Trends in DB and IR Database Systems Information Retrieval malleable schema (later) deep NLP, adding structure record linkage info extraction graph mining entity-relationship graph IR dataspaces Web entities ontologies statistical language models data uncertainty ranking programmability search as Web Service Web 2.0 Web 2.0

Thank You !

Adding Ranking to DB (= Adding “Semantics” to IR)