Beyond SQL: Structured Data Retrieval by Ranking

Beyond SQL: Structured Data Retrieval by Ranking Chengkai Li October 25, 2006 CS511 guest lecture

About Myself • Advisor: Kevin Chang (UIUC) • Research Interests • Structured Data Retrieval and Ranking Query Processing • Web Information Integration • XML Query Processing • Misc. • http://www.ews.uiuc.edu/~cli • cli [at] uiuc [dot] edu

Boolean Semantics of SQL: Success and Barrier • Example: SELECT * FROM Homes H WHERE 400K<price<600K AND #bedroom = 4 • Problems: • Query Semantics: • True or False, nothing in the middle • What do we want?: similarity, relevance, preference, etc. • Query Results Presentation: • Too many answers (“information overload” problem): hard to digest and analyze; or too few answers: no enough options to consider • What do we want?: a ranked list

“Google” on Databases:From Boolean Query to Fuzzy Retrieval

Fuzzy Retrieval by Ranking • Ranking (Top-k) Queries • Top k results by a ranking function with multiple criteria • SELECT * FROM Homes H Order By 0.3*big(H.size)+0.7*cheap(H.price) Limit 5

Beyond Boolean Semantics: Ranking Useful and important in many real-world database applications: • E-Commerce Find the best hotel deals by price, distance, etc. • Multimedia Databases Find the most similar images by color, shape, texture, etc. • Text Retrieval and Document Management Find the most relevant records/documents/pages. • OLAP, Decision Support Find the top profitable customers.

Application 1: E-commerce • The home search example • More details later

Application 2: Decision Support • What are the top 5 areas to advertise a new insurance product? SELECT zipcode, AVG(income*w1+age*w2+credit*w3) as score FROM customer WHERE occupation=‘student’ GROUP BY zipcode ORDER BY score LIMIT 5

Application 3: Multimedia Databases • Find the most similar images by color, shape, texture, etc. • Online demo: http://amazon.ece.utexas.edu/~qasim/research.htm

Application 4: Keyword Queries • Text data coexist with structured data • IR-style database queries [slides courtesy of Vagelis Hristidis]

Example – Complaints Database Schema

Example - Complaints Database Data Complaints Customers Products

Example – Keyword Query [Maxtor Netvista] Complaints Customers Products

Keyword Query Semantics (definition of “document” in databases) • Keywords are: • in same tuple • in same relation • in tuples connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • IR-style score of result tree

Example – Keyword Query [Maxtor Netvista] Complaints Customers Products Results: (1) c3, (2) p2c3, (3) p1c1

Great Interest in Database Community • Top-k Queries • Top-k to range query transformation: [ChaudhuriG99] • Indexing: [ChangBC+00], [TsaparasPK+03] • View: [HristidisKP01], [YiYY+03], [DasGK+06] • Query operators and algorithms: [CareyK97], [NatsevCS+01], [IlyasAE03], [IlyasSA+04], [ZhangHC+06] • Ranking algebra: [ChaudhuriRW05] [LiCI+05] • Ranking aggregate queries: [LiCI06] • Other Types of Queries • Keyword queries: [BhalotiaHN+02], [AgrawalCD02], [HristidisP02] • IR-style TFIDF ranking: [AgrawalCD+03] • Ranking of attributes: [DasHK+06] • Preference queries: [AgrawalW00], [Kießling02] • Skyline queries: [BorzsonyiSK01] • DBRank07

Theme Life Cycles: Bursting Topics(Current Bursting) [slide courtesy of Chengixang Zhai]

RankSQL: a RDBMS with Efficient Support of Ranking Queries • Algebraic Foundation and Optimization Framework SPJ queries (SELECT … FROM … WHERE … ORDER BY …) • Ad-Hoc Ranking Aggregate Queries top k groups instead of tuples (SELECT … FROM … WHERE … GROUP BY… ORDER BY …)

Example: Trip Planning Suggest a hotel to stay and a museum to visit: Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area Order By cheap(h.price) + close(h.addr, “BWI airport”) + related(m.collection,“dinosaur”) Limit 5 membership dimension: Boolean predicates, Boolean function B order dimension: ranking predicates, scoring function R

5 results R sort cheap+close+related B h.area=m.area σh.star=3 scan(m) scan(h) Processing Ranking Queries in Traditional RDBMS Select * From Hotel h, Museum m Where B Order By R Limit 5

5 results R B Problems of Traditional Approach • NaïveMaterialize-then-Sort scheme • Overkill total order of all results; only 5 top results are requested. • Very inefficient • Scan large base tables • Join large intermediate results • Evaluate every ranking on every tuple • Full sorting

Therefore the problem is: Unlike Boolean constructs, ranking is second class. Ranking is processed as a Monolithic component (R), always after the Boolean component (B).

σh.star=3AND h.area=m.area How did we make Boolean “first class”? Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter

h.area=m.area σh.star=3 scan(m) scan(h) Splitting and Interleaving σh.star=3AND h.area=m.area Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area HotelXMuseum (1) materialize-then-filter (2) B is split into joins and selections, which interleave with each other.

Relational Algebra Enables Such Splitting and Interleaving • Algebra: Arabic word, “bring together broken parts”. • Operators: to break apart the Boolean condition (splitting) • selection, projection, join, etc. • Algebraic Laws: to bring them together (interleaving) • pushdown selection: σc(R ΧS) = σc(R) Χ S • split selection: Σc1Λc2 (R) = σc1(σc2 (R)) • Relational algebra is the foundation of query optimization

Rank-Relational Algebra By: • Splitting ranking function • Interleaving ranking with Boolean operations We achieve: • Support ranking as a first-class query type • Integrate rankingwith traditional Boolean constructs • Enable efficient query plans that were unavailable

5 results 5 results R μclose sort cheap+close+related rank-joinh.area=m.area B μrelated σh.star=3 h.area=m.area σh.star=3 rank-scan(h,cheap) scan(m) scan(m) scan(h) Ranking Query Plan split and interleave: reduction of intermediate results, thus processing cost materialize-then-sort: naïve, overkill

Possibly orders of magnitude improvement Implementation in PostgreSQL plan1: traditional materialize-then-sort plan plan2-4: new ranking query plans Observations: an extended plan space with plans of various costs.

μp3 μp2 Scanp1(H) Example • Ranking function split, predicates evaluated incrementally. • Iterator structure. • Output order: how “promising” according to the evaluated predicates. Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 GetNext() Hp1

hotel upper-bound μp3 hotel upper-bound μp2 hotel upper-bound Scanp1(H) Example Hp1, p2, p3 Hp1, p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 Hp1

Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example h2 2.9 Select * From Hotel H Order By p1+p2+p3 Limit 1

Hp1, p2, p3 μp3 Hp1, p2 μp2 Hp1 Scanp1(H) Example Select * From Hotel H Order By p1+p2+p3 Limit 1

In contrast: materialize-then-sort Sortp1+p2+p3 Select * From Hotel H Order By p1+p2+p3 Limit 1 Scan(H)

Data Model: of Rank-Relation • Two Logical Properties • Membershipof the tuples: evaluated Boolean predicates • Orderamong the tuples: evaluated ranking predciates rank-relation Membership(B): h.area=m.area, h.star=3 Order(R): close(h.addr, “BWI airport”) related(m.collection,“dinosaur”) cheap(h.price) B Membership(B): h.star=3 Order(R): cheap(h.price) A rank-relation

Ranking Principle:upper-bound determines the order F=cheap + close + related • Processing in the “promising” order, avoiding unnecessary processing. B tuple h1 tuple h2 • Without further processing h1, we cannot output any result; A

Ranking Operators To achieve splitting and interleaving: • New operator: • μ: evaluate ranking predicates piece by piece. • Extended operators: • rank-selection • rank-join • rank-scan • rank-union, rank-intersection.

parser, rewriter operatorsandalgebraic laws Logical Query Plan optimizer Plan enumerator cost model implementation of operators Physical Query Plan Results executor Impact of Rank-Relational Algebra Ranking Query

Skyline Queries • A real estate example dist P4 P3 Skyline on price & dist P1 P5 P2 price • skyline contains points that are not dominated by any other point. [slide courtesy of Yidong Yuan]

Preference Queries • Semantically more flexible than top-k queries • Instead of enforcing total order, it allows partial order. • Efficient processing is more challenging

Motivation - Preferences SELECT * FROM used_cars PREFERRING (make = ‘Honda’ ELSE make = ‘Ford’) AND price AROUND 40000 SELECT * FROM used_cars WHERE (make = ‘Honda’ OR make = ‘Ford’) AND price = 40000

Preferences Semantics: Strict Partial Order • Preference: strict partial order A > B : “I like A better than B” • Example Car Make: {Honda, Ford, Toyota} Honda > Ford > {Toyota, Pontiac}

Beyond SQL: Structured Data Retrieval by Ranking

Beyond SQL: Structured Data Retrieval by Ranking

Presentation Transcript

Beyond SQL: Structured Data Retrieval by Ranking

Towards Unifying Database Systems and Information Retrieval Systems

DISTRIBUTED INFORMATION RETRIEVAL

Enhanced Data Models for Advanced Database Applications

IRS Data Retrieval

Defining Data Warehouse Concepts and Terminology

A Generative Retrieval Model for Structured Documents

Image Ranking and Retrieval based on Multi-Attribute Queries CVPR2011

Ranking support for keyword search on structured data using relevance model

Information Retrieval ( IR )

Probabilistic Information Retrieval Approach for Ranking of Database Query Results

CS276: Information Retrieval and Web Search

Querying Structured Text

Spatial-match Iconic Image Retrieval with Ranking in Multimedia Databases

A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort

Information Retrieval as Structured Prediction

Information Retrieval

Toward Entity Retrieval over Structured and Text Data

Structure/XML Retrieval

Semi-structured Data