Hierarchies in Data Mining Raghu Ramakrishnan ramakris@yahoo-inc.com Chief Scientist for Audience and Cloud Computing Yahoo!
About this Talk • Common theme—multidimensional view of data: • Reveals patterns that emerge at coarser granularity • Widely recognized, e.g., generalized association rules • Helps handle imprecision • Analyzing imprecise and aggregated data • Helps handle data sparsity • Even with massive datasets, sparsity is a challenge! • Defines candidate space of subsets for exploratory mining • Forecasting query results over “future data” • Using predictive models as summaries • Potentially, space of “mining experiments”?
Star Schema • "FACT" table: SERVICE(pid, timeid, locid, repair) • Dimension tables: TIME(timeid, date, week, year), PRODUCT(pid, pname, category, model), LOCATION(locid, country, region, state)
Dimension Hierarchies • For each dimension, the set of values can be organized in a hierarchy: • PRODUCT: automobile → category → model • TIME: year → quarter → month (and week) → date • LOCATION: country → region → state
Multidimensional Data Model • One fact table D = (X, M) • X = X1, X2, … Dimension attributes • M = M1, M2, … Measure attributes • Domain hierarchy for each dimension attribute: • Collection of domains Hier(Xi) = (Di(1), …, Di(k)) • The extended domain: EXi = ∪1≤k≤t DXi(k) • Value mapping function: γD1→D2(x) • e.g., γmonth→year(12/2005) = 2005 • Values form the value-hierarchy graph • Stored as a dimension-table attribute (e.g., week for a time value) or via conversion functions (e.g., month, quarter)
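The value mapping γ above can be sketched with a small parent-link table. This is a minimal illustration, not the talk's implementation; the TIME hierarchy and the sample dates are assumptions.

```python
# Sketch of a value-mapping function gamma over a dimension hierarchy:
# gamma maps a value at one domain level to its ancestor at a coarser
# level, following child -> parent links (the value-hierarchy graph).
# The date/month/year values below are illustrative assumptions.

PARENT = {
    ("date", "12/15/2005"): ("month", "12/2005"),
    ("date", "12/16/2005"): ("month", "12/2005"),
    ("month", "12/2005"): ("year", "2005"),
}

def gamma(level, value, target_level):
    """Walk up the hierarchy from (level, value) until target_level."""
    while level != target_level:
        level, value = PARENT[(level, value)]
    return value

print(gamma("month", "12/2005", "year"))   # the slide's example: 2005
print(gamma("date", "12/15/2005", "year"))
```

In a star schema these links would live in the dimension tables (or be computed by conversion functions), as the slide notes.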
Multidimensional Data [Figure: facts p1–p4 placed in the cube; Automobile dimension: ALL → Category (Sedan, Truck) → Model (Civic, Camry, F150, Sierra); Location dimension: ALL → Region (East, West) → State (NY, MA, CA, TX)]
Cube Space • Cube space: C = EX1 × EX2 × … × EXd • Region: hyper-rectangle in cube space • c = (v1, v2, …, vd), vi ∈ EXi • E.g., c1 = (NY, Camry); c2 = (West, Sedan) • Region granularity: • gran(c) = (d1, d2, …, dd), di = Domain(c.vi) • E.g., gran(c1) = (State, Model); gran(c2) = (Region, Category) • Region coverage: • coverage(c) = all facts in c • Region set: all regions with the same granularity
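The definitions of granularity and coverage can be sketched directly. A toy assumption throughout: the hierarchies and the fact list mirror the slide's example (c1 = (NY, Camry), c2 = (West, Sedan)) but the specific facts are invented.

```python
# Sketch of region granularity and coverage in cube space.
# DOMAIN gives the level of each dimension value; UP gives the
# child -> parent links used to test whether a fact falls in a region.
DOMAIN = {
    "NY": "State", "MA": "State", "CA": "State", "TX": "State",
    "East": "Region", "West": "Region",
    "Camry": "Model", "Civic": "Model", "F150": "Model", "Sierra": "Model",
    "Sedan": "Category", "Truck": "Category",
}
UP = {
    "NY": "East", "MA": "East", "CA": "West", "TX": "West",
    "Camry": "Sedan", "Civic": "Sedan", "F150": "Truck", "Sierra": "Truck",
}

def gran(region):
    """gran(c) = (Domain(c.v1), ..., Domain(c.vd))."""
    return tuple(DOMAIN[v] for v in region)

def covers(region_value, fact_value):
    """True if region_value equals fact_value or one of its ancestors."""
    while fact_value is not None:
        if fact_value == region_value:
            return True
        fact_value = UP.get(fact_value)
    return False

def coverage(region, facts):
    """coverage(c) = all facts in c."""
    return [f for f in facts
            if all(covers(v, fv) for v, fv in zip(region, f))]

facts = [("NY", "Camry"), ("MA", "Civic"), ("CA", "F150"), ("TX", "Camry")]
print(gran(("West", "Sedan")))            # ('Region', 'Category')
print(coverage(("West", "Sedan"), facts))  # [('TX', 'Camry')]
```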
OLAP Over Imprecise Data with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan In VLDB 05, 06; joint work with IBM Almaden
Imprecise Data [Figure: the same cube as before, plus an imprecise fact p5 recorded at the coarser granularity (MA, Truck): its model, F150 or Sierra, is unknown]
Querying Imprecise Facts • Query: Auto = F150, Loc = MA, SUM(Repair) = ??? • How do we treat p5? [Figure: p5 spans the F150 and Sierra cells of the MA row]
Allocation (1) [Figure: p5 is split across the F150 and Sierra cells of the MA row]
Allocation (2) [Figure: p5 is replaced by two weighted copies, one in the F150 cell and one in the Sierra cell] (Huh? Why 0.5 / 0.5? Hold on to that thought.)
Allocation (3) • Query the extended data model! • Auto = F150, Loc = MA ⇒ SUM(Repair) = 150
Allocation Policies • The procedure for assigning allocation weights is referred to as an allocation policy • Each allocation policy uses different information to assign allocation weights • Key contributions: • An appropriate characterization of the large space of allocation policies (VLDB 05) • Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
Motivating Example • Query: COUNT • We propose desiderata that enable appropriate definition of query semantics for imprecise data [Figure: facts p1–p5 in the (F150, Sierra) × (NY, MA) grid under Truck]
Desideratum I: Consistency • Consistency specifies the relationship between answers to related queries on a fixed data set
Desideratum II: Faithfulness • Faithfulness specifies the relationship between answers to a fixed query on related data sets [Figure: three related data sets placing p1–p5 differently in the (F150, Sierra) × (MA, NY) grid]
Possible Worlds • Imprecise facts lead to many possible worlds [Kripke63, …] [Figure: four possible worlds w1–w4, one per way of completing the imprecise facts into precise cells]
Query Semantics • Given all possible worlds together with their probabilities, queries are easily answered using expected values • But number of possible worlds is exponential! • Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data • Size increase is linear in number of (completions of) imprecise facts • Queries operate over this extended version
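The extended data model can be sketched as a flat table of weighted fact copies; a query then sums weight × measure, which equals the expected value over possible worlds. The repair amounts below are assumptions chosen to reproduce the slide's SUM(Repair) = 150.

```python
# Sketch of querying the extended data model: each imprecise fact is
# replaced by weighted copies in its possible completions, and an
# aggregate query combines weight * measure. p5 is the imprecise
# (MA, Truck) fact from the slides; all repair values are assumed.

# (location, model, repair, allocation weight)
extended = [
    ("MA", "F150",   100, 1.0),  # p3, precise
    ("MA", "Sierra", 200, 1.0),  # p4, precise
    ("MA", "F150",   100, 0.5),  # p5 allocated to F150
    ("MA", "Sierra", 100, 0.5),  # p5 allocated to Sierra
]

def sum_repair(loc, model):
    """Expected SUM(Repair) for one cell of the cube."""
    return sum(w * r for l, m, r, w in extended if l == loc and m == model)

print(sum_repair("MA", "F150"))  # 100 + 0.5*100 = 150.0
```

Note the size of the extended table grows only linearly in the number of completions of the imprecise facts, as the slide states.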
Dealing with Data Sparsity Deepak Agarwal, Andrei Broder, Deepayan Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam Sayyadian Estimating Rates of Rare Events at Multiple Resolutions, KDD 2007
Motivating Application: Content Match Problem • Problem: which ads are good on which pages • Pages: no control; ads: can control • First simplification: (page, ad) completely characterized by a set of high-dimensional features • Naïve approach: experiment with all possible pairs several times and estimate CTR • Of course, this doesn’t work: most (ad, page) pairs have very few impressions, if any, and even fewer clicks • Severe data sparsity
Estimation in the “Tail” • Use an existing, well-understood hierarchy • Categorize ads and webpages to leaves of the hierarchy • CTR estimates of siblings are correlated • The hierarchy allows us to aggregate data • Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions • Similar “coarsening”, different motivation: Mining Generalized Association Rules, Ramakrishnan Srikant, Rakesh Agrawal, VLDB 1995
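One simple way coarse levels can influence fine ones is shrinkage toward the parent's estimate. This is a toy illustration of borrowing strength from the hierarchy, not the KDD 2007 paper's actual model; the smoothing rule, the constant k, and all numbers are assumptions.

```python
# Toy illustration (NOT the paper's model): stabilize a sparse leaf's
# CTR by shrinking it toward its parent's CTR. The constant k acts
# like k pseudo-impressions observed at the parent's rate.

def shrunk_ctr(clicks, impressions, parent_ctr, k=100):
    """Beta-binomial-style smoothing toward the parent estimate."""
    return (clicks + k * parent_ctr) / (impressions + k)

parent_ctr = 0.02  # assumed, pooled over many sibling (page, ad) classes
leaf = shrunk_ctr(clicks=1, impressions=10, parent_ctr=parent_ctr)
print(round(leaf, 4))  # pulled well below the noisy raw rate 1/10 = 0.1
```

With few impressions the estimate stays near the reliable coarse rate; as impressions grow, the leaf's own data dominates.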
Sampling of Webpages • Naïve strategy: sample at random from the set of URLs • Sampling errors in impression volume AND click volume • Instead, we propose: • Crawling all URLs with at least one click, and • a sample of the remaining URLs • Variability is only in impression volume
Imputation of Impression Volume • Region node = (page node, ad node) • Build a region hierarchy: the cross-product of the page hierarchy and the ad hierarchy [Figure: levels Z(0), …, Z(i); leaf regions are formed from page leaves × ad leaves]
Exploiting Taxonomy Structure • Consider the bottom two levels of the taxonomy • Each cell corresponds to a (page, ad)-class pair • Key point: Children under a parent node are alike and expected to have similar CTRs (i.e., form a cohesive block)
Imputation of Impression Volume • For any level Z(i), #impressions in cell (i, j) = nij + mij + xij, where nij = impressions in the clicked pool, mij = impressions in the sampled non-clicked pool, and xij = excess impressions (to be imputed) • Row constraint: each page-class row sums to its total impressions, ∑nij + K·∑mij (known) • Column constraint: each ad-class column sums to the #impressions on ads of that ad class
Imputation of Impression Volume • Block constraint: the cells of each block in Z(i+1) (the children of one parent region) sum to that parent's value in Z(i)
Imputing xij • Iterative Proportional Fitting [Darroch+/1972] • Initialize xij = nij + mij • Top-down: • Scale all xij in every block in Z(i+1) to sum to its parent in Z(i) • Scale all xij in Z(i+1) to sum to the row totals • Scale all xij in Z(i+1) to sum to the column totals • Repeat for every level Z(i) • Bottom-up: similar
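The core IPF step can be sketched on a single level: alternately rescale rows and columns of the xij matrix until both sets of marginal totals hold. This minimal sketch omits the block constraints and the multi-level top-down/bottom-up passes; the matrix and totals are assumed toy values.

```python
# Minimal Iterative Proportional Fitting sketch for one level:
# alternately scale rows to the row totals and columns to the column
# totals. Initial counts (n_ij + m_ij) and the totals are assumptions.

def ipf(x, row_totals, col_totals, iters=50):
    for _ in range(iters):
        # scale each row i to its known total
        for i, rt in enumerate(row_totals):
            s = sum(x[i])
            for j in range(len(x[i])):
                x[i][j] *= rt / s
        # scale each column j to its known total
        for j, ct in enumerate(col_totals):
            s = sum(x[i][j] for i in range(len(x)))
            for i in range(len(x)):
                x[i][j] *= ct / s
    return x

x = [[10.0, 5.0], [3.0, 12.0]]   # initialized to n_ij + m_ij
x = ipf(x, row_totals=[40, 60], col_totals=[30, 70])
print([[round(v, 2) for v in row] for row in x])
```

For a strictly positive matrix and consistent marginals, the alternating scaling converges geometrically; the full algorithm in the paper interleaves these passes across the levels Z(i).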
Imputation: Summary • Given • nij (impressions in clicked pool) • mij (impressions in sampled non-clicked pool) • #impressions on ads of each ad class in the ad hierarchy • We get • Estimated impression volume Ñij = nij + mij + xij in each region (i, j) of every level Z(·)
Dealing with Data Sparsity Deepak Agarwal, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah Real-time Content Optimization through Active User Feedback, NIPS 2008
Yahoo! Home Page Featured Box • It is the top-center part of the Y! Front Page • It has four tabs: Featured, Entertainment, Sports, and Video
Novel Aspects • Classical: arms assumed fixed over time • We gain and lose arms over time • Some theoretical work by Whittle in the ’80s (operations research) • Classical: serving rule updated after each pull • We compute the optimal design in batch mode • Classical: CTR generally assumed stationary • We have highly dynamic, non-stationary CTRs
Bellwether Analysis: Global Aggregates from Local Regions with Beechung Chen, Jude Shavlik, and Pradeep Tamma In VLDB 06
Motivating Example • A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie) • By looking at the features and profits of previous (similar) movies, we predict the expected total profit (1-year US sales) for the new movie • Wait a year and write a query! If you can’t wait, stay awake … • The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods) • Example “region-based” features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc. • Gathering this data has a cost (e.g., marketing expenses, waiting time) • Problem statement: find the most predictive region features that can be obtained within a given “cost budget”
Key Ideas • Large datasets are rarely labeled with the targets that we wish to learn to predict • But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining • We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result • The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case
Motivating Example • A company wants to predict the first year’s worldwide profit for a new item, by using its historical database • Database schema: [shown in figure] • The combination of the underlined attributes forms a key
A Straightforward Approach • Build a regression model to predict item profit • By joining and aggregating tables in the historical database, we can create a training set: item-table features plus the target • An example regression model: Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense • There is much room for accuracy improvement!
Using Regional Features • Example region: [1st week, HK] • Regional features: • Regional Profit: the 1st-week profit in HK • Regional Ad Expense: the 1st-week ad expense in HK • A possibly more accurate model: Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, HK] + β5·AdExpense[1wk, HK] • Problem: which region should we use? • The smallest region that improves the accuracy the most • We give each candidate region a cost • The most “cost-effective” region is the bellwether region
Basic Bellwether Problem • Aggregate over the data records in each region r (e.g., r = [1-2, USA]) to compute features fi,r(DB); the target ti(DB) is the total profit in [1-52, All] • For each region r, build a predictive model hr(x); then choose as bellwether region the r such that: • Coverage(r) ≥ minimum coverage support (fraction of all items in the region) • Cost(r, DB) ≤ cost threshold • Error(hr) is minimized
Experiment on a Mail Order Dataset • Error-vs-budget plot (RMSE: root mean square error) • Bel Err: the error of the bellwether region found using a given budget • Avg Err: the average error of all the cube regions with costs under a given budget • Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget • Bellwether region found: [1-8 month, MD]
Experiment on a Mail Order Dataset • Uniqueness plot • Y-axis: fraction of regions that are as good as the bellwether region, i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region • We have 99% confidence that [1-8 month, MD] is a quite unusual bellwether region
Basic Bellwether Computation • OLAP-style bellwether analysis • Candidate regions: Regions in a data cube • Queries: OLAP-style aggregate queries • E.g., Sum(Profit) over a region • Efficient computation: • Use iceberg cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang SIGMOD 01) • Infeasible regions: Regions with cost > B or coverage < C • Share computation by generating the features and target values for all the feasible regions all together • Exploit distributive and algebraic aggregate functions • Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
Subset-Based Bellwether Prediction • Motivation: different subsets of items may have different bellwether regions • E.g., the bellwether region for laptops may be different from the bellwether region for clothes • Two approaches: bellwether trees and bellwether cubes [Figure: a bellwether tree and a bellwether cube over R&D Expenses and Category]
Characteristics of Bellwether Trees & Cubes • Dataset generation: use a random tree to generate different bellwether regions for different subsets of items • Parameters: noise; concept complexity (# of tree nodes) • Result: bellwether trees & cubes have better accuracy than basic bellwether search • Increased noise ⇒ increased error • Increased complexity ⇒ increased error [Plots shown for 15 nodes and noise level 0.5]
Efficiency Comparison [Figure: our computation techniques vs. naïve computation methods]
Exploratory Mining: Prediction Cubes with Beechung Chen, Lei Chen, and Yi Lin In VLDB 05
The Idea • Build OLAP data cubes in which cell values represent decision/prediction behavior • In effect, build a tree for each cell/region in the cube—observe that this is not the same as a collection of trees used in an ensemble method! • The idea is simple, but it leads to promising data mining tools • Ultimate objective: Exploratory analysis of the entire space of “data mining choices” • Choice of algorithms, data conditioning parameters …
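The cell-value idea above can be sketched with a deliberately trivial per-cell "model". A majority-class predictor stands in for the decision tree trained per cell; the dimensions (location, year) and the records are assumptions, not the paper's data.

```python
# Sketch of a prediction cube: train one model per cube cell and store
# its predictive behavior as the cell value. Here the "model" is just
# a majority-class predictor per (location, year) cell; in the actual
# work each cell holds a full model such as a decision tree.
from collections import Counter, defaultdict

# (location, year, decision label) training records -- assumed
records = [
    ("WI", 2004, "approve"), ("WI", 2004, "approve"), ("WI", 2004, "deny"),
    ("WI", 2005, "deny"),    ("WY", 2004, "approve"),
]

def build_prediction_cube(rows):
    cells = defaultdict(Counter)
    for loc, yr, label in rows:
        cells[(loc, yr)][label] += 1
    # cell value = the cell model's decision behavior (majority class)
    return {cell: cnt.most_common(1)[0][0] for cell, cnt in cells.items()}

cube = build_prediction_cube(records)
print(cube[("WI", 2004)])  # approve
```

Note each cell's model is trained only on that cell's data, which is exactly why this differs from an ensemble: the cube compares decision behavior across regions rather than combining models into one predictor.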
Example (1/7): Regular OLAP • Z: dimensions (Location, Time) • Y: measure • Goal: look for patterns of unusually high numbers of applications