Structured Query Result Differentiation. ZIYANG LIU, Peng Sun, Yi Chen. Arizona State University.
Keyword Search on Structured Data [Figure: keywords are sent to a search engine over structured data, which returns results as relevant data fragments.] • Effective techniques have been developed to help users find relevant results: • Ranking: sort the results in order of estimated relevance • Snippet: provide a summary of each result to help users judge relevance • 50% of keyword searches are information exploration queries, which inherently have multiple relevant results (web search: 50% navigation, 50% information exploration; Broder, SIGIR 02) • Users intend to investigate and compare multiple relevant results. • How can we help users compare relevant results?
Results and Snippets (Huang et al. SIGMOD 09) [Figure: two results for the query “Phoenix, camera, store”, each a store tree (city, name, merchandises, with camera entries carrying category, brand, and megapixel) shown next to its snippet. Result 1 is the store BHPhoto in Phoenix, listing DSLR cameras; result 2 is the store Adorama in Phoenix, listing Compact cameras.] • Snippets are unhelpful in differentiating query results.
Differentiation Feature Sets (DFS) • Feature: (entity, attribute, value) [Figure: the BHPhoto result (Phoenix; cameras: DSLR Canon 12, DSLR Sony 12) and the Adorama result (Phoenix; Compact cameras from HP and Canon, 14 and 12 megapixels), each shown next to its DFS.] • Bank websites usually allow users to compare selected credit cards, but only with a pre-defined feature set.
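The feature triple above can be sketched as a small data structure. This is only an illustration, not the XRed implementation: the class name `Feature`, the `feature_type` property, and the reading of a feature type as the (entity, attribute) pair are assumptions, and the sample values are read off the flattened BHPhoto figure.

```python
from collections import Counter
from typing import NamedTuple


class Feature(NamedTuple):
    """One feature of a result: (entity, attribute, value)."""
    entity: str
    attribute: str
    value: str

    @property
    def feature_type(self):
        # Assumption: a feature's type is its (entity, attribute) pair.
        return (self.entity, self.attribute)


# Features of the BHPhoto result, as far as the slide figure shows them.
bhphoto = [
    Feature("store", "city", "Phoenix"),
    Feature("store", "name", "BHPhoto"),
    Feature("camera", "category", "DSLR"),
    Feature("camera", "brand", "Canon"),
    Feature("camera", "megapixel", "12"),
    Feature("camera", "category", "DSLR"),
    Feature("camera", "brand", "Sony"),
    Feature("camera", "megapixel", "12"),
]

# Occurrence counts of each value, grouped by feature type.
by_type = {}
for f in bhphoto:
    by_type.setdefault(f.feature_type, Counter())[f.value] += 1
```

Grouping by feature type gives the per-type occurrence counts that the later desiderata (ordering by occurrences, preserving ratios) refer to.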
Challenges of Result Differentiation • How to automatically generate DFS that highlight the differences among results? • How to measure the quality of a set of DFSs? • DFSs should obviously maximize the difference among results. How to quantify it? • What are other desirable properties? • Can DFSs be efficiently generated from results?
Contributions • First work on automatically differentiating structured search results • Application domains: online shopping, employee hiring, job/institution hunting, etc. • Identifying 3 desiderata for good DFSs • Quantifying the differentiation power of a set of DFSs • Proving the NP-hardness of DFS generation • Tackling the problem using two local optimality criteria: single-swap and multi-swap optimality • Implemented XRed: XML Result Differentiation • Empirically verified the effectiveness & efficiency of XRed
Roadmap • Desiderata for good DFSs • Problem definition • Local optimality and algorithms • Experiments
Desideratum 1: Being Small • A small DFS is easy for a user to go through and compare with other DFSs. • The size of each DFS, |D|, cannot exceed a user-specified upper bound L: |D| ≤ L
Desideratum 2: Summarizing Query Results • DFSs that do not summarize the results show useless & misleading differences. [Figure: the BHPhoto and Adorama results with their DFSs, annotated “This store sells only a few HP cameras.”]
Desideratum 2: Summarizing Query Results • A DFS is valid only if it summarizes the corresponding result. • Dominance Ordered: features of the same type should be included in order of their number of occurrences. • Distribution Preserved: the ratio of two features in the DFS should be roughly the same as in the result.
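A minimal sketch of how the size bound and the dominance-ordered condition might be checked for one feature type. The function names and the exact reading of the rule (the DFS shows the most frequent values first and skips no value that occurs more often than one it shows) are assumptions. The distribution-preserved check is left out because the slides do not quantify "roughly". The sample counts are the Store 1 brand statistics from the single-swap slide.

```python
from collections import Counter


def within_size_limit(dfs, L):
    """dfs maps a feature type to the list of values shown; |D| <= L."""
    return sum(len(values) for values in dfs.values()) <= L


def is_dominance_ordered(result_counts, dfs_values):
    """Check one feature type.

    result_counts: value -> number of occurrences in the result.
    dfs_values: values shown in the DFS, in display order.
    """
    counts = Counter(result_counts)
    chosen = list(dict.fromkeys(dfs_values))  # collapse repeats, keep order
    if any(counts[v] == 0 for v in chosen):
        return False  # the DFS shows a value the result does not contain
    freq = [counts[v] for v in chosen]
    if any(a < b for a, b in zip(freq, freq[1:])):
        return False  # a rarer value is listed before a more frequent one
    omitted = [c for v, c in counts.items() if v not in chosen]
    # No omitted value may occur more often than the rarest shown value.
    return not chosen or all(c <= freq[-1] for c in omitted)


store1_brand = {"Canon": 103, "Sony": 50, "Nikon": 25, "HP": 22}
ok = is_dominance_ordered(store1_brand, ["Canon", "Sony"])
skips_sony = is_dominance_ordered(store1_brand, ["Canon", "HP"])
```

Under this reading, showing Canon and HP while skipping Sony is rejected, which matches the earlier concern about highlighting a brand the store barely sells.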
Desideratum 3: Differentiating Query Results • Differentiation unit: feature type. • A feature type t in two DFSs D1 and D2 is differentiable if • The order of the features of type t is different. • The ratio of two features of type t is different. • Example pairs: (1) D1: Camera: brand: Canon; D2: Camera: brand: Canon, Camera: brand: HP. (2) D1: Camera: brand: Canon; D2: Camera: brand: HP. (3) D1: Camera: brand: Canon, Camera: brand: HP; D2: Camera: brand: Canon, Camera: brand: Canon, Camera: brand: HP.
Desideratum 3: Differentiating Query Results • Degree of Differentiation (DoD) of two DFSs = number of differentiable feature types. [Figure: an example pair of DFSs with DoD = 3.] • DoD of multiple DFSs = the sum of the DoD of every pair of DFSs.
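The DoD definition can be sketched in a few lines. Several things here are assumptions rather than the authors' definitions: a DFS is modelled as a mapping from feature type to the list of values it shows (repeats encode the ratio), a type counts as differentiable if either condition holds, and order is compared only over the positions both DFSs show. The slides do not state which of the three example pairs are differentiable, so the outcomes below reflect this one reading.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def distinct_in_order(values):
    """Distinct values in first-appearance order."""
    return list(dict.fromkeys(values))


def differentiable(values1, values2):
    """Is one feature type differentiable between two DFSs?"""
    if not values1 or not values2:
        return False
    o1, o2 = distinct_in_order(values1), distinct_in_order(values2)
    n = min(len(o1), len(o2))
    if o1[:n] != o2[:n]:
        return True  # the order of the features differs
    c1, c2 = Counter(values1), Counter(values2)
    shared = [v for v in o1 if v in c2]
    for a, b in combinations(shared, 2):
        if Fraction(c1[a], c1[b]) != Fraction(c2[a], c2[b]):
            return True  # the ratio of two features differs
    return False


def dod_pair(d1, d2):
    """DoD of two DFSs: number of differentiable feature types."""
    return sum(differentiable(d1[t], d2[t]) for t in d1.keys() & d2.keys())


def dod(dfss):
    """DoD of several DFSs: sum of the DoD over every pair."""
    return sum(dod_pair(a, b) for a, b in combinations(dfss, 2))


brand = ("camera", "brand")
pair1 = dod_pair({brand: ["Canon"]}, {brand: ["Canon", "HP"]})
pair2 = dod_pair({brand: ["Canon"]}, {brand: ["HP"]})
pair3 = dod_pair({brand: ["Canon", "HP"]}, {brand: ["Canon", "Canon", "HP"]})
```

With these assumptions the second and third example pairs are differentiable on the brand type and the first is not, since the shorter DFS could simply have run out of room before listing HP.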
DFS Generation Problem • Given a set of results and a size limit L, generate a DFS for each result such that • Their DoD is maximized. • Every DFS is valid (good summary) • Every DFS’s size does not exceed L. • We proved the NP-hardness of this problem by reduction from X3C.
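The problem statement can be written compactly as a constrained optimization. The symbols $R_i$ (the $i$-th result), $D_i$ (its DFS), and $n$ (the number of results) are notation introduced here for illustration; only $L$, $|D|$, and DoD come from the slides.

```latex
\max_{D_1,\dots,D_n} \; \sum_{1 \le i < j \le n} \mathrm{DoD}(D_i, D_j)
\quad \text{subject to} \quad
D_i \text{ is a valid summary of } R_i
\ \text{ and } \
|D_i| \le L
\quad \text{for all } i = 1,\dots,n .
```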
Local Optimality • To tackle this hard problem, instead of achieving global optimality, we propose two local optimality criteria: • Single-swap Optimality • Multi-swap Optimality
Single Swap • A set of DFSs is Single-Swap Optimal if adding / changing a single feature in a single DFS (subject to validity and size limit) cannot increase the DoD. [Example. STORE 1: # of cameras: 200; Category: DSLR: 188, Others: 12; Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22; Megapixel: 12: 160, 13: 15, 14: 20. STORE 2: # of cameras: 150; Category: Compact: 140, Others: 10; Brand: Canon: 80, HP: 70; Megapixel: 12: 105, 13: 5, 14: 19. A single change increases the DoD from 1 to 2, at which point single-swap optimality is achieved.]
Algorithm for Single-Swap Optimality • Start from a randomly generated DFS for each result. • Repeatedly add a feature / change a feature in a DFS. • Stop when the DoD no longer increases. • Does this algorithm terminate in polynomial time? Yes: the maximum possible DoD for a set of DFSs is polynomially bounded, each iteration increases the DoD by at least 1, and each iteration takes polynomial time.
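The loop above can be sketched as plain hill climbing. This is a simplified illustration, not the XRed algorithm: a DFS is modelled as the number of top values shown per feature type (so dominance ordering holds by construction), differentiability is reduced to the order condition only, and the start state is passed in explicitly instead of being random so that the run is reproducible. All function names are hypothetical; the statistics are those of the two stores on the single-swap slide.

```python
from itertools import combinations


def ranked(stats):
    """Values of one feature type, most frequent first."""
    return [v for v, _ in sorted(stats.items(), key=lambda kv: -kv[1])]


def shown(result, dfs, t):
    """Values of type t that the DFS displays (top-k by occurrences)."""
    return ranked(result[t])[:dfs.get(t, 0)]


def dod(results, dfss):
    """Sum over all pairs of the number of types whose shown order differs."""
    total = 0
    for i, j in combinations(range(len(dfss)), 2):
        for t in results[i].keys() & results[j].keys():
            a = shown(results[i], dfss[i], t)
            b = shown(results[j], dfss[j], t)
            n = min(len(a), len(b))
            total += n > 0 and a[:n] != b[:n]
    return total


def moves(result, dfs, L):
    """DFSs reachable by adding one feature or changing one feature."""
    size = sum(dfs.values())
    for t in result:
        if dfs.get(t, 0) < len(result[t]):
            grown = {**dfs, t: dfs.get(t, 0) + 1}
            if size < L:
                yield grown  # add a feature
            for u in dfs:
                if u != t and dfs[u] > 0:
                    yield {**grown, u: dfs[u] - 1}  # change a feature


def single_swap(results, dfss, L):
    """Apply improving single moves until none increases the DoD."""
    dfss = [dict(d) for d in dfss]
    best = dod(results, dfss)
    improved = True
    while improved:
        improved = False
        for i, result in enumerate(results):
            for cand in moves(result, dfss[i], L):
                trial = dfss[:i] + [cand] + dfss[i + 1:]
                score = dod(results, trial)
                if score > best:
                    dfss, best, improved = trial, score, True
                    break
            if improved:
                break
    return dfss, best


store1 = {"category": {"DSLR": 188, "Others": 12},
          "brand": {"Canon": 103, "Sony": 50, "Nikon": 25, "HP": 22},
          "megapixel": {"12": 160, "13": 15, "14": 20}}
store2 = {"category": {"Compact": 140, "Others": 10},
          "brand": {"Canon": 80, "HP": 70},
          "megapixel": {"12": 105, "13": 5, "14": 19}}
start = [{"category": 1, "brand": 2}, {"category": 1, "brand": 1}]
final, score = single_swap([store1, store2], start, L=3)
```

Every accepted move raises the integer DoD by at least 1, and the DoD is bounded by the number of result pairs times the number of feature types, which is the termination argument given on the slide. From an empty start no single move can raise the DoD in this model, which is one reason to begin from non-empty DFSs.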
Multi-Swap Optimality • A set of DFSs is Multi-Swap Optimal if adding / changing any number of features in a single DFS (subject to validity and size limit) cannot increase the DoD. [Example. STORE 1: # of cameras: 200; Category: DSLR: 188, Others: 12; Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22; Megapixel: 12: 160, 13: 15, 14: 20. STORE 2: # of cameras: 150; Category: Compact: 140, Others: 10; Brand: Canon: 80, HP: 70; Megapixel: 12: 105, 13: 5, 14: 19. Changing multiple features in one DFS increases the DoD from 2 to 3.]
Algorithm for Multi-Swap Optimality • Start from a randomly generated DFS for each result. • Repeatedly add / change multiple features in a DFS. • Stop when the DoD no longer increases. • Done naively, this algorithm has exponential time complexity. • We designed a novel dynamic programming algorithm, which takes pseudo-polynomial time.
Evaluation • We have implemented XRed (XML Result Differentiation) and evaluated it empirically. • Data sets • Film (http://infolab.stanford.edu/pub/movies) • Camera Retailer (synthetic) • Result generation: XSeek (http://xseek.asu.edu/) • DFS size limit: 10% of # of feature types • Metrics: • Quality (DoD) • Efficiency • Comparison system: an exponential algorithm that generates the optimal solution.
DFS Quality [Charts for the Film and Camera Retailer data sets; plots not preserved.]
Efficiency [Charts for the Film and Camera Retailer data sets; plots not preserved.] • Result size: 1KB ~ 9KB • # of results: 2 ~ 52
Conclusions • We introduce the problem of automatically differentiating structured query results, which is useful for information exploration queries. • We define a Differentiation Feature Set (DFS) for each result, and identify three desiderata for DFSs. • We formalize the DFS generation problem, and prove its NP-hardness. • We propose two local optimality criteria, single-swap and multi-swap, and design algorithms to efficiently achieve them. • We implemented the XRed system, and verified its effectiveness and efficiency through experiments.
Future Work • Result differentiation is a new area and opens opportunities for new research topics. • Is there a better way of selecting feature types, e.g., by considering users’ interests? • Is there a better way of measuring the quality of DFSs besides DoD? • Are there approximation / randomized algorithms for the DFS generation problem that achieve better quality / efficiency?