Mining Favorable Facets
Raymond Chi-Wing Wong (The Chinese University of Hong Kong), Jian Pei (Simon Fraser University), Ada Wai-Chee Fu (The Chinese University of Hong Kong), Ke Wang (Simon Fraser University)
KDD ’07, August 12-15, 2007, San Jose, California, USA
Outline • Introduction • Skyline • Algorithm • Empirical Study • Conclusion
1. Introduction Suppose we want to look for a vacation package among 3 packages.
• We want a lower price.
• We want a higher hotel-class.
Suppose we compare package a and package b. We know that package a is “better” than package b because
• the price of package a is lower
• the hotel-class of package a is higher
We say that package a “dominates” package b.
1. Introduction Thus, we do not need to consider package b. We know that
• package a has the lowest price
• package c has the highest hotel-class
Packages a and c are not dominated by any other package. Thus, package a and package c are the “best” possible choices. We call package a and package c skyline points.
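The dominance test and the skyline above can be sketched in a few lines of Python. The prices and hotel-classes below are hypothetical values chosen only so that the result matches the slide (a dominates b; a and c are the skyline points); they are not taken from the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical packages: name -> (price, hotel_class).
PACKAGES: Dict[str, Tuple[int, int]] = {
    "a": (1600, 4),
    "b": (2400, 1),
    "c": (3000, 5),
}

def dominates(q: Tuple[int, int], p: Tuple[int, int]) -> bool:
    """q dominates p: q is not worse in any attribute and strictly better in one.

    A lower price is better; a higher hotel-class is better.
    """
    not_worse = q[0] <= p[0] and q[1] >= p[1]
    strictly_better = q[0] < p[0] or q[1] > p[1]
    return not_worse and strictly_better

def skyline(points: Dict[str, Tuple[int, int]]) -> List[str]:
    """Names of the points that no other point dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

print(skyline(PACKAGES))
```

This is the plain quadratic comparison, which is enough to illustrate the definition; it is not meant as an efficient skyline algorithm.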
1. Introduction Suppose we want to look for a vacation package among 6 packages, each of which also has a hotel-group (T, H or M). Different customers may have different preferences on hotel-group. Here “x < y” means that hotel-group x is preferred to hotel-group y.
• Suppose a customer has the preference H < T < M. The skyline points are packages a and c.
• Suppose another customer has the preference H < M < T. The skyline points are packages a, c and e.
In other words, different preferences give different skyline points.
1. Introduction Suppose hotel-group Mozilla wants to promote its own packages (e.g., package f) to potential customers. What preferences make package f a skyline point?
Customer: preference, skyline points
• Alice: T < M, {a, c}
• Bob: no special preference, {a, c, e, f}
• Chris: H < M, {a, c, e}
• David: H < M < T, {a, c, e}
• Emily: H < T < M, {a, c}
• Fred: M < T, {a, c, e, f}
Bob and Fred are the potential customers.
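The customer table can be reproduced with a small sketch of a preference-aware skyline. The six packages below are hypothetical values chosen so that the skylines match the ones listed on the slide; the helper names (`chain`, `dominates`, `skyline`) are illustrative and not from the talk. A preference is modeled as a set of pairs (x, y), each meaning "x < y", that is, hotel-group x is preferred to y.

```python
from typing import Dict, List, Set, Tuple

Package = Tuple[int, int, str]   # (price, hotel_class, hotel_group)
Pref = Set[Tuple[str, str]]      # (x, y) means "x < y": x is preferred to y

# Hypothetical data, picked to reproduce the skylines on the slide.
PACKAGES: Dict[str, Package] = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}

def chain(*groups: str) -> Pref:
    """Expand a chain such as H < M < T into all the pairs it implies."""
    return {(groups[i], groups[j])
            for i in range(len(groups)) for j in range(i + 1, len(groups))}

def dominates(q: Package, p: Package, pref: Pref) -> bool:
    """q is not worse than p in every attribute and strictly better in one."""
    group_better = (q[2], p[2]) in pref
    not_worse = q[0] <= p[0] and q[1] >= p[1] and (q[2] == p[2] or group_better)
    strictly_better = q[0] < p[0] or q[1] > p[1] or group_better
    return not_worse and strictly_better

def skyline(pref: Pref) -> List[str]:
    return sorted(
        name for name, p in PACKAGES.items()
        if not any(dominates(q, p, pref)
                   for other, q in PACKAGES.items() if other != name)
    )

CUSTOMERS: Dict[str, Pref] = {
    "Alice": chain("T", "M"), "Bob": set(), "Chris": chain("H", "M"),
    "David": chain("H", "M", "T"), "Emily": chain("H", "T", "M"),
    "Fred": chain("M", "T"),
}

# Customers for whom package f is a skyline point.
potential = [name for name, pref in CUSTOMERS.items() if "f" in skyline(pref)]
print(potential)
```

Recomputing the skyline once per customer is exactly the cost that the rest of the talk tries to avoid.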
1. Introduction Problem: Given a package, under what preferences (conditions) is this package a skyline point? We call these preferences the favorable facets of the package.
1. Introduction Problem: Given a package, under what preferences (favorable facets) is this package a skyline point? We can solve the problem by a naive method: Lattice Search.
[Figure: lattice of preferences; each node is labeled with its skyline SKY]
• {}: SKY = {a, c, e, f}
• {T < M}: SKY = {a, c}
• {H < M}: SKY = {a, c, e}
• {T < H}, {H < T}, {M < T}, {M < H}: SKY = {a, c, e, f}
• {T < M, H < M}: SKY = {a, c}
• {H < T, H < M}: SKY = {a, c, e}
• {T < H, M < H}: SKY = {a, c, e, f}
• …
1. Introduction Lattice Search: consider package f. Walking the lattice of preferences, the preferences whose skyline contains package f include {}, {T < H}, {M < T}, {H < T}, {M < H} and {T < H, M < H}.
This approach has two disadvantages.
1. Computation is costly: we need to compute all skyline points for each possible preference.
2. It is difficult to interpret the results: there are many preferences which qualify package f as a skyline point.
1. Introduction Lattice Search: consider package f. We find that whenever the preference contains “T < M” or “H < M”, package f is not a skyline point. In the lattice, these two nodes form the border for f. We say that each of “T < M” and “H < M” is a minimal disqualifying condition (MDC) of package f.
3. Algorithm • How to find the MDCs of a point? Problem: Given a package, what are the minimal conditions under which this package is NOT a skyline point?
3. Algorithm Point q is said to quasi-dominate point p if q is NOT worse than p in every attribute with a fixed ordering (here, price and hotel-class). e.g. Package a quasi-dominates package f because
1. Package a has a lower (or better) price than package f
2. Package a has a higher (or better) hotel-class than package f
If package a quasi-dominates package f, we define Raf as the preference on hotel-group under which package a dominates package f. Here Raf = {T < M}.
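A minimal sketch of quasi-dominance and of Rqp, assuming a single nominal attribute (hotel-group) and hypothetical values for packages a and f. The function names `quasi_dominates` and `r` are illustrative only.

```python
from typing import FrozenSet, Tuple

Package = Tuple[int, int, str]            # (price, hotel_class, hotel_group)
Condition = FrozenSet[Tuple[str, str]]    # a pair (x, y) stands for "x < y"

def quasi_dominates(q: Package, p: Package) -> bool:
    """q is not worse than p in price (lower is better) and hotel-class (higher is better)."""
    return q[0] <= p[0] and q[1] >= p[1]

def r(q: Package, p: Package) -> Condition:
    """Rqp for a quasi-dominating q: the hotel-group preference that lets q dominate p.

    With the same hotel-group no preference is needed, so the condition is empty.
    """
    if q[2] == p[2]:
        return frozenset()
    return frozenset({(q[2], p[2])})

# Hypothetical values: a is in hotel-group T, f is in hotel-group M.
a: Package = (1600, 4, "T")
f: Package = (3000, 3, "M")

print(quasi_dominates(a, f), sorted(r(a, f)))
```

With these values, `r(a, f)` is the single pair ("T", "M"), which corresponds to Raf = {T < M} on the slide.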
3. Algorithm Problem: Given a package, what are the minimal conditions under which this package is NOT a skyline point?
• Two Algorithms
• MDC-O: Computing MDC On-the-fly
• Does not store the MDCs of points
• Computes the MDCs of a given point on-the-fly
• MDC-M: A Materialization Method
• Stores the MDCs of all points
• Indexing Method for Speed-up
• R*-tree
3.1 MDC-O: Computing MDC On-the-fly Problem: Given a package, what are the minimal conditions under which this package is NOT a skyline point?
• On-the-fly Algorithm
• Given
• data point p
• Variable
• MDC(p): the set of minimal disqualifying conditions of p
• Algorithm MDC(p)
• For each data point q which quasi-dominates p
• if MDC(p) does not contain Rqp
• insert Rqp into MDC(p)
• Return MDC(p)
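The on-the-fly loop above can be sketched as follows. Several points are assumptions rather than part of the slide: the six packages are hypothetical values, there is a single nominal attribute (hotel-group), no two packages are identical, and the pruning step that drops non-minimal conditions is added here to keep the stored conditions minimal.

```python
from typing import Dict, FrozenSet, Set, Tuple

Package = Tuple[int, int, str]            # (price, hotel_class, hotel_group)
Condition = FrozenSet[Tuple[str, str]]    # a pair (x, y) stands for "x < y"

# Hypothetical data, not taken from the talk.
PACKAGES: Dict[str, Package] = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}

def quasi_dominates(q: Package, p: Package) -> bool:
    return q[0] <= p[0] and q[1] >= p[1]

def r(q: Package, p: Package) -> Condition:
    """Rqp: the hotel-group preference under which q dominates p."""
    return frozenset() if q[2] == p[2] else frozenset({(q[2], p[2])})

def mdc(name: str) -> Set[Condition]:
    """Scan every quasi-dominating point q and collect Rqp, keeping only minimal sets."""
    p = PACKAGES[name]
    result: Set[Condition] = set()
    for other, q in PACKAGES.items():
        if other == name or q == p or not quasi_dominates(q, p):
            continue
        cond = r(q, p)
        if any(c <= cond for c in result):
            continue  # an equal or smaller condition is already stored
        # Assumed pruning step: drop stored conditions that cond makes non-minimal.
        result = {c for c in result if not cond <= c}
        result.add(cond)
    return result

def is_skyline(name: str, pref: Set[Tuple[str, str]]) -> bool:
    """p is a skyline point unless pref contains one of its disqualifying conditions.

    pref is expected to list every implied pair (e.g. H < M < T also lists H < T).
    """
    return not any(c <= pref for c in mdc(name))

print(sorted(sorted(c) for c in mdc("f")))
```

For package f this yields the two conditions {T < M} and {H < M}, matching the MDCs named earlier in the talk. An empty condition, as for package b here, means the point is dominated under every preference.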
3.2 MDC-M: A Materialization Method Problem: Given a package, what are the minimal conditions under which this package is NOT a skyline point?
• Materialization Algorithm
• Variable
• MDC(p): the set of minimal disqualifying conditions of p
• Algorithm
• For each data point p
• For each data point q which quasi-dominates p
• if MDC(p) does not contain Rqp then insert Rqp into MDC(p)
• Store MDC(p)
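A minimal sketch of the materialization idea: compute MDC(p) once for every point, store the results, and answer later queries from the stored table alone. The data, the single nominal attribute, the in-memory dictionary `MDC_TABLE`, and the minimality pruning are all assumptions made for illustration; the slide does not specify a storage layout.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Condition = FrozenSet[Tuple[str, str]]    # a pair (x, y) stands for "x < y"

# Hypothetical data: name -> (price, hotel_class, hotel_group); no duplicates assumed.
PACKAGES: Dict[str, Tuple[int, int, str]] = {
    "a": (1600, 4, "T"), "b": (2400, 1, "T"), "c": (3000, 5, "H"),
    "d": (3600, 4, "H"), "e": (2400, 2, "M"), "f": (3000, 3, "M"),
}

def mdc(name: str) -> Set[Condition]:
    """Minimal disqualifying conditions of one package."""
    price, cls, group = PACKAGES[name]
    result: Set[Condition] = set()
    for other, (q_price, q_cls, q_group) in PACKAGES.items():
        if other == name or not (q_price <= price and q_cls >= cls):
            continue  # q does not quasi-dominate p
        cond = frozenset() if q_group == group else frozenset({(q_group, group)})
        if not any(c <= cond for c in result):
            # Assumed pruning: keep only minimal conditions.
            result = {c for c in result if not cond <= c} | {cond}
    return result

# Materialization: one pass over all points, results kept for later queries.
MDC_TABLE: Dict[str, Set[Condition]] = {name: mdc(name) for name in PACKAGES}

def skyline(pref: Set[Tuple[str, str]]) -> List[str]:
    """Answer a skyline query from the stored MDCs without rescanning the data."""
    return sorted(name for name, conds in MDC_TABLE.items()
                  if not any(c <= pref for c in conds))

print(skyline({("H", "M")}))
```

The trade-off is the usual one for materialization: extra storage and an up-front pass over all pairs of points, in exchange for cheaper queries afterwards.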
4. Empirical Study • Datasets • Synthetic Dataset • Real Dataset (from UCI) • Nursery Dataset • Automobile Dataset • Default Values (Synthetic) • No. of tuples = 500K • No. of numeric dimensions = 3 • No. of categorical dimensions = 1 • No. of values in a nominal dimension = 20
4. Empirical Study
Without indexing:
• MDC-O: Slowest Search Time
• MDC-M: Faster Search Time; Storage of MDC: 8MB
With indexing:
• MDC-O and MDC-M: Fast Search Time
4. Empirical Study • Automobile • Three car models
• A salesperson should NOT promote this car to a customer who prefers Toyota to Honda.
• A salesperson should NOT promote this car to a customer who prefers Toyota to Honda.
• A salesperson should promote this car to ANY customer.
5. Conclusion • Skyline • Favorable Facets • Minimal Disqualifying Condition • Algorithm • On-the-fly • Materialization • Empirical Study