960 likes | 1.36k Views
Scorpion Explaining Away Outliers in Aggregate Queries . eugene wu and sam madden MIT . http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html. Table. Split. Visualize. Aggregate. SELECT sum(cost) FROM expenses GROUPBY country. Expenses. USA. China. Italy.
E N D
ScorpionExplaining Away Outliers in Aggregate Queries eugenewuand sam madden MIT http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
Table Split Visualize Aggregate
SELECT sum(cost) FROM expenses GROUPBY country Expenses USA China Italy
Expenses USA China Italy SELECT sum(cost) FROM expenses GROUPBY country
Expenses USA China Italy SELECT sum(cost) FROM expenses GROUPBY country
Expenses USA China Italy SELECT sum(cost) FROM expenses GROUPBY country
Given Outlier and normal results Understand Why Expenses USA China Italy SELECT sum(cost) FROM expenses GROUPBY country
Given Outlier and normal results What input properties caused the outliers? most caused the outliers? caused outliers but didn’t affect normal outputs? Expenses USA China Italy SELECT sum(cost) FROM expenses GROUPBY country
Provenance Data!
Provenance SELECT SUM(cost) FROM sam’s bank account $$$
SELECT SUM(cost) FROM sam’s bank account Provenance $$$
Provenance SELECT SUM(cost) FROM sam’s bank account $$$
Provenance SELECT SUM(cost) FROM sam’s bank account $$$ Darn! Ya caught me
Provenance http://weknowmemes.com/2012/04/whats-the-point/
SELECT SUM(cost) FROM sam’s bank account Provenance Filter for “most influential”
Provenance Faceting http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance Faceting http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance Faceting Dimensionality :( Dealing with multiple outliers? http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance Faceting
Provenance Faceting Scorpion!
Given Outlier and normal results Understand Why Expenses USA China Italy
Given Outlier and normal results Find Predicates correlated with outliers Expenses USA China Italy Desc = “toilets”
Given Outlier and normal results Removing predicate from inputs “fixes” outliers & maintains normal results Find Predicates correlated with outliers Expenses s.t. USA China Italy Desc = “toilets”
Given Outlier and normal results Find Predicates correlated with outliers Expenses s.t. Removing predicate from inputs “fixes” outliers & maintains normal results USA China Italy Desc = “toilets”
Given Outlier and normal results Removing predicate from inputs “fixes” outliers & maintains normal results Find Predicates correlated with outliers Expenses s.t. USA China Italy Desc = “toilets”
Formalize “influence” as metric Predicate search heuristics Some results
T Desc = “toilet” p(T)
T p(T)
T p(T) T – p(T)
Δoutput p(T)
Δoutput p(T) |p(T)|
Δoutput p(T) |p(T)| Δoutput Influence Metric |p(T)|
Δf(x) Sensitivity Analysis Δx Δoutput Influence Metric |p(T)|
Δoutput |p(T)| ΔOutput “High vs Low” |p(T)| ΔNormal Multiple Outputs
Δoutput |p(T)| ΔOutput “High vs Low” |p(T)| ΔNormal Multiple Outputs ΔoutputV |p(T)|
Δoutput |p(T)| ΔOutput “High vs Low” |p(T)| ΔNormal Multiple Outputs ΔoutputV |p(T)| ΔoutputV |p(T)|c
Δoutput |p(T)| ΔOutput “High vs Low” |p(T)| ΔNormal Multiple Outputs ΔoutputV |p(T)| ΔoutputV |p(T)|c - ΔoutlierV ΔNormal |p(T)|c
Δoutput |p(T)| ΔOutput “High vs Low” |p(T)| ΔNormal Multiple Outputs ΔoutputV |p(T)| ΔoutputV |p(T)|c - ΔoutlierV ΔNormal |p(T)|c - ΔoutlierV max mean ΔNormal |p(T)|c normal outlier