Beyond SQL: Structured Data Retrieval by Ranking

Beyond SQL: Structured Data Retrieval by Ranking Chengkai Li October 25, 2006 CS511 guest lecture

About Myself • Advisor: Kevin Chang (UIUC) • Research Interests • Structured Data Retrieval and Ranking Query Processing • Web Information Integration • XML Query Processing • Misc. • http://www.ews.uiuc.edu/~cli • cli [at] uiuc [dot] edu

Boolean Semantics of SQL: Success and Barrier • Example: SELECT * FROM Homes H WHERE 400K<price<600K AND #bedroom = 4 • Problems: • Query Semantics: • True or False, nothing in the middle • What do we want?: similarity, relevance, preference, etc. • Query Results Presentation: • Too many answers (“information overload” problem): hard to digest and analyze; or too few answers: no enough options to consider • What do we want?: a ranked list

“Google” on Databases:From Boolean Query to Fuzzy Retrieval

Fuzzy Retrieval by Ranking • Ranking (Top-k) Queries • Top k results by a ranking function with multiple criteria • SELECT * FROM Homes H Order By 0.3*big(H.size)+0.7*cheap(H.price) Limit 5

Beyond Boolean Semantics: Ranking Useful and important in many real-world database applications: • E-Commerce Find the best hotel deals by price, distance, etc. • Multimedia Databases Find the most similar images by color, shape, texture, etc. • Text Retrieval and Document Management Find the most relevant records/documents/pages. • OLAP, Decision Support Find the top profitable customers.

Application 1: E-commerce • The home search example • More details later

Application 2: Decision Support • What are the top 5 areas to advertise a new insurance product? SELECT zipcode, AVG(income*w1+age*w2+credit*w3) as score FROM customer WHERE occupation=‘student’ GROUP BY zipcode ORDER BY score LIMIT 5

Application 3: Multimedia Databases • Find the most similar images by color, shape, texture, etc. • Online demo: http://amazon.ece.utexas.edu/~qasim/research.htm

Application 4: Keyword Queries • Text data coexist with structured data • IR-style database queries [slides courtesy of Vagelis Hristidis]

Example – Complaints Database Schema

Example - Complaints Database Data Complaints Customers Products

Example – Keyword Query [Maxtor Netvista] Complaints Customers Products

Keyword Query Semantics (definition of “document” in databases) • Keywords are: • in same tuple • in same relation • in tuples connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • IR-style score of result tree

Example – Keyword Query [Maxtor Netvista] Complaints Customers Products Results: (1) c3, (2) p2c3, (3) p1c1

Great Interest in Database Community • Top-k Queries • Top-k to range query transformation: [ChaudhuriG99] • Indexing: [ChangBC+00], [TsaparasPK+03] • View: [HristidisKP01], [YiYY+03], [DasGK+06] • Query operators and algorithms: [CareyK97], [NatsevCS+01], [IlyasAE03], [IlyasSA+04], [ZhangHC+06] • Ranking algebra: [ChaudhuriRW05] [LiCI+05] • Ranking aggregate queries: [LiCI06] • Other Types of Queries • Keyword queries: [BhalotiaHN+02], [AgrawalCD02], [HristidisP02] • IR-style TFIDF ranking: [AgrawalCD+03] • Ranking of attributes: [DasHK+06] • Preference queries: [AgrawalW00], [Kießling02] • Skyline queries: [BorzsonyiSK01] • DBRank07

Theme Life Cycles: Bursting Topics(Current Bursting) [slide courtesy of Chengixang Zhai]

RankSQL: a RDBMS with Efficient Support of Ranking Queries • Algebraic Foundation and Optimization Framework SPJ queries (SELECT … FROM … WHERE … ORDER BY …) • Ad-Hoc Ranking Aggregate Queries top k groups instead of tuples (SELECT … FROM … WHERE … GROUP BY… ORDER BY …)

Example: Trip Planning Suggest a hotel to stay and a museum to visit: Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area Order By cheap(h.price) + close(h.addr, “BWI airport”) + related(m.collection,“dinosaur”) Limit 5 membership dimension: Boolean predicates, Boolean function B order dimension: ranking predicates, scoring function R

5 results R sort cheap+close+related B h.area=m.area σh.star=3 scan(m) scan(h) Processing Ranking Queries in Traditional RDBMS Select * From Hotel h, Museum m Where B Order By R Limit 5

5 results R B Problems of Traditional Approach • NaïveMaterialize-then-Sort scheme • Overkill total order of all results; only 5 top results are requested. • Very inefficient • Scan large base tables • Join large intermediate results • Evaluate every ranking on every tuple • Full sorting

Therefore the problem is: Unlike Boolean constructs, ranking is second class. Ranking is processed as a Monolithic component (R), always after the Boolean component (B).

σh.star=3AND h.area=m.area How did we make Boolean “first class”? Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter

h.area=m.area σh.star=3 scan(m) scan(h) Splitting and Interleaving σh.star=3AND h.area=m.area Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter (2) B is split into joins and selections, which interleave with each other.

Relational Algebra Enables Such Splitting and Interleaving • Algebra: Arabic word, “bring together broken parts”. • Operators: to break apart the Boolean condition (splitting) • selection, projection, join, etc. • Algebraic Laws: to bring them together (interleaving) • pushdown selection: σc(R ΧS) = σc(R) Χ S • split selection: Σc1Λc2 (R) = σc1(σc2 (R)) • Relational algebra is the foundation of query optimization

Rank-Relational Algebra By: • Splitting ranking function • Interleaving ranking with Boolean operations We achieve: • Support ranking as a first-class query type • Integrate rankingwith traditional Boolean constructs • Enable efficient query plans that were unavailable

5 results 5 results R μclose sort cheap+close+related rank-joinh.area=m.area B μrelated σh.star=3 h.area=m.area σh.star=3 rank-scan(h,cheap) scan(m) scan(m) scan(h) Ranking Query Plan split and interleave: reduction of intermediate results, thus processing cost materialize-then-sort: naïve, overkill

Possibly orders of magnitude improvement Implementation in PostgreSQL plan1: traditional materialize-then-sort plan plan2-4: new ranking query plans Observations: an extended plan space with plans of various costs.

μp3 μp2 Scanp1(H) Example • Ranking function split, predicates evaluated incrementally. • Iterator structure. • Output order: how “promising” according to the evaluated predicates. Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 GetNext() Hp1

hotel upper-bound μp3 hotel upper-bound μp2 hotel upper-bound Scanp1(H) Example Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 Hp1

Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example h2 2.9 Select * From Hotel H Order By p1+p2+p3 Limit 1

Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

In contrast: materialize-then-sort Sortp1+p2+p3 Select * From Hotel H Order By p1+p2+p3 Limit 1 Scan(H)

Data Model: of Rank-Relation • Two Logical Properties • Membershipof the tuples: evaluated Boolean predicates • Orderamong the tuples: evaluated ranking predciates rank-relation Membership(B): h.area=m.area, h.star=3 Order(R): close(h.addr, “BWI airport”) related(m.collection,“dinosaur”) cheap(h.price) B Membership(B): h.star=3 Order(R): cheap(h.price) A rank-relation

Ranking Principle:upper-bound determines the order F=cheap + close + related • Processing in the “promising” order, avoiding unnecessary processing. B tuple h1 tuple h2 • Without further processing h1, we cannot output any result; A

Ranking Operators To achieve splitting and interleaving: • New operator: • μ: evaluate ranking predicates piece by piece. • Extended operators: • rank-selection • rank-join • rank-scan • rank-union, rank-intersection.

parser, rewriter operatorsandalgebraic laws Logical Query Plan optimizer Plan enumerator cost model implementation of operators Physical Query Plan Results executor Impact of Rank-Relational Algebra Ranking Query

Skyline Queries • A real estate example dist P4 P3 Skyline on price & dist P1 P5 P2 price • skyline contains points that are not dominated by any other point. [slide courtesy of Yidong Yuan]

Preference Queries • Semantically more flexible than top-k queries • Instead of enforcing total order, it allows partial order. • Efficient processing is more challenging

Motivation - Preferences SELECT * FROM used_cars PREFERRING (make = ‘Honda’ ELSE make = ‘Ford’) AND price AROUND 40000 SELECT * FROM used_cars WHERE (make = ‘Honda’ OR make = ‘Ford’) AND price = 40000

Preferences Semantics: Strict Partial Order • Preference: strict partial order A > B : “I like A better than B” • Example Car Make: {Honda, Ford, Toyota} Honda > Ford > {Toyota, Pontiac}

Beyond SQL: Structured Data Retrieval by Ranking

Beyond SQL: Structured Data Retrieval by Ranking

Presentation Transcript

Toward Entity Retrieval over Structured and Text Data

Structured Query Language (SQL)

SQL (Structured Query Language)

Structured Query Language SQL

SQL – Structured Query Langauge

Structured Query Language (SQL)

Toward Entity Retrieval over Structured and Text Data

ExtMiner: Combining Multiple Ranking and Clustering Algorithms for Structured Document Retrieval

Structured Text Retrieval Models

SQL ( Structured Query Language)

Structured Query Language (SQL)

Structured Query Language (SQL)

Structured Query Language - SQL

SQL (Structured Query Language)

Structured Query Language (SQL)