Database Implementation of a Model-Free Classifier
Konstantinos Morfonios, University of Athens
ADBIS 2007
Outline
• Introduction
• Motivation
• LOCUS
• Parallel Execution
• Experimental Evaluation
• Conclusions & Future Work
Introduction
Classification: given a feature vector x = <x1, x2, …, xD>, predict its class ω = f(x), where ω is one of the classes ω1, ω2.
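Introduction
Training set of labeled tuples:
<x1,1, x1,2, …, x1,D, ω1>
<x2,1, x2,2, …, x2,D, ω2>
<x3,1, x3,2, …, x3,D, ω1>
<x4,1, x4,2, …, x4,D, ω1>
. . .
Unlabeled tuples to classify: x1 = <x1, x2, …, xD>, x2 = <x1, x2, …, xD>
Two families of classifiers:
• “Lazy” (e.g. Nearest Neighbors): no model is built; the training set is consulted for each decision
• “Eager” (e.g. Decision Trees): a model is built in advance
  (+) Faster decisions
  (-) Large/complex datasets
  (-) Dynamic datasets
  (-) Dynamic models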
Motivation
• Large/complex datasets: call for a disk-based method
• Dynamic datasets
• Dynamic models
These requirements favor a lazy (model-free) approach; the standard lazy classifier is Nearest Neighbors.
Motivation
Nearest Neighbors suffers from the “curse of dimensionality”:
• Not reliable [Beyer et al., ICDT 1999]
• Not indexable [Shaft et al., ICDT 2005]
Proposed alternative: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
Motivation
LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
• Category? Lazy
• Scaling? Based on simple SQL queries
• Accuracy? Converges to the optimal Bayes classifier
• Other features? Parallelizable
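LOCUS: Example
x = <f1, f2>, with f1 ∈ [0, 20] and f2 ∈ [0, 10]; two classes, ω1 and ω2.
[Figure: scatter plot of the training points; horizontal axis f1, vertical axis f2; one marker per class.]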
LOCUS
Ideally: dense space. Training points exist at the query position itself, so ω(<7, 4>) can be read off directly.
[Figure: dense scatter plot over f1 and f2 with the query point <7, 4> marked.]
LOCUS
Reality: sparse space
• Many features
• Large domains
No training point coincides with the query, so ω(<7, 4>) = ? has no direct answer.
[Figure: sparse scatter plot over f1 and f2 with the query point <7, 4> marked.]
LOCUS
3-NN: the three nearest neighbors of <7, 4> vote ω1: 2, ω2: 1, so ω(<7, 4>) = ω1.
[Figure: scatter plot over f1 and f2 with the three nearest neighbors of the query point.]
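The 3-NN vote on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the training points in `TRAIN` and the helper `knn_classify` are hypothetical, chosen only so that the tally matches the slide (ω1: 2, ω2: 1).

```python
import math
from collections import Counter

# Hypothetical labeled points <f1, f2, class>; NOT the data behind the figure.
TRAIN = [
    (6.0, 5.0, "ω1"),
    (9.0, 3.0, "ω1"),
    (8.0, 6.0, "ω2"),
    (15.0, 9.0, "ω2"),
    (1.0, 1.0, "ω1"),
]


def knn_classify(train, query, k=3):
    """Return (majority class, votes) among the k nearest training points."""
    # Sort by Euclidean distance to the query and keep the k closest rows.
    nearest = sorted(train, key=lambda row: math.dist(row[:2], query))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0], votes


label, votes = knn_classify(TRAIN, (7.0, 4.0))
print(label, dict(votes))
```

Every decision sorts the whole training set by distance, which hints at why the deck looks for a formulation that a database can evaluate instead.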
LOCUS
LOCUS: count the training points of each class in a fixed neighborhood of <7, 4>. Here ω1: 7, ω2: 3, so ω(<7, 4>) = ω1.
[Figure: scatter plot over f1 and f2 with the neighborhood of the query point.]
• Disk-based implementation
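LOCUS
The training set is stored as a relation R(f1, f2, ω). For a query point <x1, x2>, count the tuples of each class inside the box of width 2δ1 and height 2δ2 centered at that point:
SELECT ω, count(*)
FROM R
WHERE f1 >= x1-δ1 AND f1 <= x1+δ1
  AND f2 >= x2-δ2 AND f2 <= x2+δ2
GROUP BY ω
Result for <7, 4>: ω1: 7, ω2: 3, so ω(<7, 4>) = ω1.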
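A runnable version of the query above, as a minimal sketch using Python's built-in `sqlite3`. The table contents, the column name `label` (standing in for ω), the choice δ1 = δ2 = 2, and the helper `locus_classify` are assumptions made for illustration; the data is constructed so that the box around <7, 4> holds 7 tuples of ω1 and 3 of ω2, as on the slide.

```python
import sqlite3


def locus_classify(conn, x, delta):
    """Count classes inside the box x ± delta; return (majority class, counts)."""
    (x1, x2), (d1, d2) = x, delta
    rows = conn.execute(
        """
        SELECT label, COUNT(*)
        FROM R
        WHERE f1 >= ? AND f1 <= ?
          AND f2 >= ? AND f2 <= ?
        GROUP BY label
        """,
        (x1 - d1, x1 + d1, x2 - d2, x2 + d2),
    ).fetchall()
    counts = dict(rows)
    # An empty box gives no evidence; the caller must decide what to do then.
    winner = max(counts, key=counts.get) if counts else None
    return winner, counts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
# Hypothetical training tuples (not the data from the talk).
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        # Inside the box [5, 9] x [2, 6]: seven ω1 tuples ...
        (5, 2, "ω1"), (6, 3, "ω1"), (6, 5, "ω1"), (7, 4.5, "ω1"),
        (8, 3, "ω1"), (8.5, 5, "ω1"), (9, 6, "ω1"),
        # ... and three ω2 tuples.
        (5.5, 5.5, "ω2"), (7.5, 2.5, "ω2"), (8, 4, "ω2"),
        # Outside the box: ignored by the WHERE clause.
        (2, 8, "ω2"), (12, 1, "ω2"), (15, 9, "ω1"), (9.5, 4, "ω2"), (7, 6.5, "ω1"),
    ],
)

winner, counts = locus_classify(conn, x=(7.0, 4.0), delta=(2.0, 2.0))
print(winner, counts)
```

Passing the box bounds as query parameters keeps the statement text constant across query points, so the database can reuse the same prepared statement.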
LOCUS
What if R is large? The query above is a well-known type of aggregate query, so classical optimization techniques apply:
• Indexing
• Materialized views
• Presorting
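Of the techniques listed, indexing is the easiest to try out. The sketch below is an assumption-laden illustration in SQLite, not the setup used in the talk: the synthetic grid data, the class rule, the column name `label`, and the index name `r_f1_f2_label` are all hypothetical. It only shows that adding an index leaves the answer unchanged; whether the planner actually uses the index depends on the engine and the data.

```python
import sqlite3

QUERY = """
SELECT label, COUNT(*)
FROM R
WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ?
GROUP BY label
ORDER BY label
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
# Hypothetical training set: a 21 x 11 grid with a made-up class rule.
rows = [
    (f1, f2, "ω1" if f1 + f2 < 15 else "ω2")
    for f1 in range(21)
    for f2 in range(11)
]
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

box = (5, 9, 2, 6)  # f1 in [5, 9], f2 in [2, 6]
before = conn.execute(QUERY, box).fetchall()

# Composite index covering every column the query touches. The range
# predicate on f1 can narrow the scan, and f2 and label can then be read
# from the index entries instead of the table.
conn.execute("CREATE INDEX r_f1_f2_label ON R (f1, f2, label)")
after = conn.execute(QUERY, box).fetchall()

# Print the plan to see what the engine chose on this installation.
for step in conn.execute("EXPLAIN QUERY PLAN " + QUERY, box):
    print(step)
```

Materialized views and presorting are not shown; SQLite has no native materialized views, so those would need a different engine or a hand-maintained summary table.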
LOCUS
Method reliability? LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper).
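The convergence claim can be read through the following counting estimate. This is only a sketch of the intuition, not the proof from the paper; the symbols B_δ(x) and n_ω(x), and the assumption that the box holds many points while remaining small, are introduced here for illustration.

```latex
% B_\delta(x): the box with half-widths \delta_i centered at x (the WHERE clause)
% n_\omega(x): number of training tuples of class \omega inside B_\delta(x)
\hat{\omega}(x) = \arg\max_{\omega} \, n_\omega(x),
\qquad
\frac{n_\omega(x)}{\sum_{\omega'} n_{\omega'}(x)}
  \approx P\bigl(\omega \mid x' \in B_\delta(x)\bigr)

% Bayes-optimal decision, which the counting rule approximates when the
% box contains many points and is small enough (assumption of this sketch):
\omega^{*}(x) = \arg\max_{\omega} \, P(\omega \mid x)
```

With the tally from the example, the estimates are 7/10 = 0.7 for ω1 and 3/10 = 0.3 for ω2, so ω1 is chosen.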
LOCUS
What if a feature, say f2, is categorical (e.g. sex)? Replace its range predicate with an equality test:
SELECT ω, count(*)
FROM R
WHERE f1 >= x1-δ1 AND f1 <= x1+δ1
  AND f2 = x2
GROUP BY ω
Not a problem, since generally in practice:
• Datasets combine categorical and numeric features
• Categorical features have small domains
Hence, they do not contribute to sparsity.
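One way to handle mixed feature types is to generate the WHERE clause from a description of the schema. The sketch below is an assumption, not the author's implementation: the helpers `build_locus_query` and `build_params`, the `"numeric"`/`"categorical"` tags, the column name `label`, and the sample rows are hypothetical. Column names are interpolated into the SQL text, so they must come from a trusted schema, never from user input.

```python
import sqlite3


def build_locus_query(table, features, label_col="label"):
    """Build a parameterized count query.

    features: list of (column, kind), kind is "numeric" or "categorical".
    Numeric columns get a range predicate, categorical ones an equality test.
    """
    clauses = []
    for col, kind in features:
        if kind == "numeric":
            clauses.append(f"{col} >= ? AND {col} <= ?")
        else:
            clauses.append(f"{col} = ?")
    return (
        f"SELECT {label_col}, COUNT(*) FROM {table} "
        f"WHERE {' AND '.join(clauses)} GROUP BY {label_col}"
    )


def build_params(features, x, deltas):
    """Expand the query point into bind values matching build_locus_query."""
    params = []
    for (col, kind), value in zip(features, x):
        if kind == "numeric":
            params += [value - deltas[col], value + deltas[col]]
        else:
            params.append(value)
    return params


FEATURES = [("f1", "numeric"), ("f2", "categorical")]
sql = build_locus_query("R", FEATURES)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, label TEXT)")
# Hypothetical tuples; f2 plays the role of the categorical feature "sex".
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        (6, "F", "ω1"), (8, "F", "ω1"), (5, "F", "ω1"), (7.5, "F", "ω2"),
        (7, "M", "ω2"), (6.5, "M", "ω2"), (12, "F", "ω2"),
    ],
)

params = build_params(FEATURES, x=(7.0, "F"), deltas={"f1": 2.0})
counts = dict(conn.execute(sql, params).fetchall())
print(sql)
print(counts)
```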
Parallel Execution
R = R1 ∪ R2 ∪ R3 ∪ R4: R is split into four partitions, and the same SELECT runs on each of them.
Parallel Execution
Count is a distributive function, so the partial counts of the partitions can simply be added:
• Partial results: (ω1: 7, ω2: 1), (ω1: 5, ω2: 2), (ω1: 6, ω2: 0), (ω1: 5, ω2: 1)
• Running totals while merging: (ω1: 12, ω2: 3), (ω1: 18, ω2: 3), (ω1: 23, ω2: 4)
• Final result: ω1: 23, ω2: 4
Properties:
• Small network traffic
• Load balancing
• Lightweight operations on the main server
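The merge step can be sketched as follows. This is a single-machine illustration under stated assumptions, not the distributed setup from the talk: each in-memory SQLite database stands in for one node, threads stand in for the network, and the grid data, the column name `label`, and the helpers `count_partition` and `parallel_counts` are hypothetical. Only the partial counts in `SLIDE_PARTIALS` are taken from the slide.

```python
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

QUERY = """
SELECT label, COUNT(*)
FROM R
WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ?
GROUP BY label
"""


def count_partition(rows, box):
    """Run the count query on one partition (stands in for one remote node)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
    conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)
    result = Counter(dict(conn.execute(QUERY, box).fetchall()))
    conn.close()
    return result


def parallel_counts(partitions, box):
    """Query all partitions concurrently, then add up the partial counts."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda part: count_partition(part, box), partitions))
    # COUNT is distributive: the global count is the sum of the partial counts.
    return partials, sum(partials, Counter())


# Hypothetical training set, split round-robin into four partitions.
rows = [
    (f1, f2, "ω1" if f1 + f2 < 15 else "ω2")
    for f1 in range(21)
    for f2 in range(11)
]
partitions = [rows[i::4] for i in range(4)]
box = (5, 9, 2, 6)

partials, total = parallel_counts(partitions, box)
whole = count_partition(rows, box)  # same query on the unpartitioned data

# The partial counts shown on the slide merge the same way.
SLIDE_PARTIALS = [
    Counter({"ω1": 7, "ω2": 1}),
    Counter({"ω1": 5, "ω2": 2}),
    Counter({"ω1": 6, "ω2": 0}),
    Counter({"ω1": 5, "ω2": 1}),
]
slide_total = sum(SLIDE_PARTIALS, Counter())
print(dict(total), dict(slide_total))
```

Each node returns one small row per class, which is why the slide lists small network traffic and lightweight work on the main server.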