Handling Numeric Attributes in Hoeffding Trees
Bernhard Pfahringer, Geoff Holmes and Richard Kirkby

Overview
• Hoeffding trees are excellent for classification tasks on data streams.
• Handling numeric attributes well is crucial to performance in conventional decision trees (for example, C4.5 -> C4.8).
• Does handling numeric attributes matter for streamed data?
• We implement a range of methods and empirically evaluate their accuracy and costs.
Handling Numeric Attributes in Hoeffding trees Pfahringer, Holmes and Kirkby - Machine Learning Group

Data Streams - reminder
• Data is provided from a continuous source:
  • Examples are processed one at a time (each inspected once)
  • Memory is limited (!)
  • Model construction must scale (N log N in the number of examples)
  • The model must be ready to predict at any time
• Because memory is limited, there are implications for any numeric handling method you might construct
• Only consider methods that work as the tree is built

Main assumptions/limitations
• Assume a stationary concept, i.e. no concept drift or change
  • may seem very limiting, but …
• Three-way trade-off:
  • memory
  • speed
  • accuracy
• Used only artificial data sources

Hoeffding Trees
• Introduced by Domingos and Hulten (VFDT)
• “Extension” of decision trees to streams
• HT Algorithm:
  • Initialise tree T to a root node
  • For each example from the stream:
    • Find the leaf L for this example
    • Update the counts in L with the attribute values of the example and compute the split function (e.g. information gain, IG) for each attribute
    • If IG(best attr) - IG(next best attr) > ε then split L on the best attribute

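
The split test in the HT algorithm can be sketched as follows. The slide does not define ε; this sketch assumes the standard Hoeffding bound used by Domingos and Hulten, ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split function (log2 of the number of classes for information gain), δ is the allowed error probability and n is the number of examples seen at the leaf. The function names, the `gains` input and the default δ are illustrative assumptions, not part of the original slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gains, n_seen, n_classes=2, delta=1e-7):
    """Return the index of the attribute to split on, or None.

    gains: information gain of each attribute at this leaf (hypothetical input).
    n_seen: number of examples observed at the leaf.
    """
    if len(gains) < 2:
        return None
    # Rank attributes by gain, best first.
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    best, second = ranked[0], ranked[1]
    # Information gain ranges over [0, log2(number of classes)].
    eps = hoeffding_bound(math.log2(n_classes), delta, n_seen)
    return best if gains[best] - gains[second] > eps else None


if __name__ == "__main__":
    print(should_split([0.30, 0.10, 0.05], n_seen=1000))
    print(should_split([0.30, 0.10, 0.05], n_seen=100))
```

With the same gains, the leaf splits after 1000 examples but not after 100, because ε shrinks as more examples arrive.
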
Active leaf data structure
• For each class value:
  • for each nominal attribute:
    • for each possible value:
      • keep sum of counts/weights
  • for each numeric attribute:
    • keep sufficient statistics to approximate the distribution
    • various possibilities: here assume a normal distribution, so estimate/record n, mean and variance, plus min/max

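
A minimal sketch of such a leaf, assuming the normal-distribution option. The class and field names are hypothetical, and the incremental mean/variance update (a weighted Welford-style recurrence) is one common way to maintain these statistics, not necessarily the one used by the authors.

```python
import math
from collections import defaultdict


class GaussianStats:
    """Running weight, mean, variance, min and max for one (class, attribute) pair."""

    def __init__(self):
        self.weight = 0.0
        self.mean = 0.0
        self._m2 = 0.0  # weighted sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x, w=1.0):
        self.weight += w
        delta = x - self.mean
        self.mean += (w / self.weight) * delta
        self._m2 += w * delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def variance(self):
        # Sample variance; reported as 0 until the total weight exceeds 1.
        return self._m2 / (self.weight - 1.0) if self.weight > 1.0 else 0.0


class ActiveLeaf:
    def __init__(self):
        # nominal[class][attribute][value] -> summed weight
        self.nominal = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
        # numeric[class][attribute] -> GaussianStats
        self.numeric = defaultdict(lambda: defaultdict(GaussianStats))

    def update(self, nominal_values, numeric_values, label, w=1.0):
        for attr, value in nominal_values.items():
            self.nominal[label][attr][value] += w
        for attr, x in numeric_values.items():
            self.numeric[label][attr].update(x, w)
```

Each numeric attribute costs a fixed five numbers per class regardless of how many examples the leaf has seen, which is what makes this representation attractive under a memory limit.
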
Numeric Handling Methods
• VFDT (VFML, Hulten & Domingos, 2003)
  • Summarize the numeric distribution with a histogram of at most N bins (default 1000)
  • Bin boundaries are determined by the first N unique values seen in the stream
  • Issues: sensitive to data order, and a good N must be chosen for each problem
• Exhaustive Binary Tree (BINTREE, Gama et al., 2003)
  • Closest implementation of a batch method
  • Incrementally update a binary tree as data is observed
  • Issues: high memory cost, high cost of split search, sensitivity to data order

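
The order sensitivity of the first-N-unique-values histogram can be illustrated with a small sketch. This is not VFML's actual code: the class name is hypothetical, and the rule for values that arrive after all N bins are taken (count them in the nearest existing bin) is an assumption made for illustration.

```python
import bisect
from collections import defaultdict


class FirstNHistogram:
    """Histogram whose bins are fixed by the first max_bins unique values seen."""

    def __init__(self, max_bins=1000):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.points = []  # sorted bin values
        self.counts = []  # counts[i][class_label] -> weight, parallel to points

    def update(self, x, label, w=1.0):
        i = bisect.bisect_left(self.points, x)
        if i < len(self.points) and self.points[i] == x:
            pass  # value already has its own bin
        elif len(self.points) < self.max_bins:
            # Still room: this unique value becomes a new bin.
            self.points.insert(i, x)
            self.counts.insert(i, defaultdict(float))
        else:
            # Assumption: once full, count the value in the nearest bin.
            if i == len(self.points):
                i -= 1
            elif i > 0 and x - self.points[i - 1] <= self.points[i] - x:
                i -= 1
        self.counts[i][label] += w


if __name__ == "__main__":
    a, b = FirstNHistogram(max_bins=3), FirstNHistogram(max_bins=3)
    for x in [1.0, 2.0, 3.0, 100.0]:
        a.update(x, "pos")
    for x in [100.0, 1.0, 50.0, 2.0, 3.0]:
        b.update(x, "pos")
    print(a.points, b.points)
```

The two streams contain overlapping values but end up with different bin boundaries, so the candidate split points depend on arrival order.
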
Numeric Handling Methods
• Quantile Summaries (GK, Greenwald and Khanna, 2001)
  • Motivation comes from VLDB
  • Maintain a sample of values (quantiles) plus the range of possible ranks each sample can take (tuples)
  • Extremely space efficient
  • Issues: a maximum number of tuples per summary must be set

Numeric Handling Methods
• Gaussian Approximation (GAUSS)
  • Assume values conform to a normal distribution
  • Maintain five numbers per class (e.g. mean, variance, weight, max, min)
  • Note: not sensitive to data order
  • Incrementally updateable
  • Using the per-class max/min information, split the range into N equal parts
  • For each part, use the five numbers per class to compute the approximate class distribution
  • Use the above to compute the IG of that split

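
The split evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names and the tuple layout of `stats` are hypothetical, candidate splits are taken at the interior boundaries of the N equal parts, and the class weight falling below a split is estimated with the normal CDF.

```python
import math


def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution; degenerate case when std is 0."""
    if std <= 0.0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def entropy(weights):
    total = sum(weights)
    if total <= 0.0:
        return 0.0
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0.0)


def best_gaussian_split(stats, n_parts=10):
    """stats: {class: (weight, mean, variance, min, max)} for one numeric attribute.

    Returns (split_value, info_gain), or (None, 0.0) if no split is possible.
    """
    lo = min(s[3] for s in stats.values())
    hi = max(s[4] for s in stats.values())
    totals = [s[0] for s in stats.values()]
    n = sum(totals)
    if hi <= lo or n <= 0.0:
        return None, 0.0
    base = entropy(totals)
    best = (None, 0.0)
    # Interior boundaries of n_parts equal-width parts of [lo, hi].
    for i in range(1, n_parts):
        split = lo + (hi - lo) * i / n_parts
        # Estimated per-class weight on each side of the split.
        left = [w * normal_cdf(split, m, math.sqrt(v)) for w, m, v, _, _ in stats.values()]
        right = [w - l for w, l in zip(totals, left)]
        gain = base - (sum(left) / n) * entropy(left) - (sum(right) / n) * entropy(right)
        if gain > best[1]:
            best = (split, gain)
    return best


if __name__ == "__main__":
    stats = {"a": (100.0, 0.0, 1.0, -3.0, 3.0), "b": (100.0, 10.0, 1.0, 7.0, 13.0)}
    print(best_gaussian_split(stats, n_parts=10))
```

For two well-separated classes the chosen split falls between the two means and the estimated gain is close to 1 bit.
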
Gaussian approximation – 2 class problem
Gaussian approximation – 3 class problem
Gaussian approximation – 4 class problem

Empirical Evaluation
• Use each numeric handling method (8 in total) to build a Hoeffding Tree (HTMC)
• Vary the parameters of some methods (VFML10, VFML100, VFML1000; BT; GK100, GK1000; GAUSS10, GAUSS100)
• Train models for 10 hours, then test on one million (holdout) examples
• Define three application scenarios:
  • Sensor network (100K memory limit)
  • Handheld (32MB)
  • Server (400MB)

Data generators
• Random tree (Domingos & Hulten):
  • (RTS) 10 numeric, 10 nominal with 5 values, 2 classes, leaves start at level 3, max level 5, plus a version with 10% noise added (RTSN)
  • (RTC) 50 numeric, 50 nominal with 5 values, 2 classes, leaves start at level 5, max level 10, plus a version with 10% noise added (RTCN)
• Random RBF (Kirkby):
  • (RRBFS) 10 numeric, 100 centers, 2 classes
  • (RRBFC) 50 numeric, 1000 centers, 2 classes
• Waveform (Aha):
  • (Wave21) 21 noisy numeric; (Wave40) +19 irrelevant numeric; 3 classes
• (GenF1-GenF10) (Agrawal et al.):
  • hypothetical loan applications, 10 different rule(s) over 6 numeric + 3 nominal attributes, 5% noise, 2 classes

Tree Measurements
• Accuracy (% correct)
• Number of training examples processed in 10 hours (in millions)
• Number of active leaves (in hundreds)
• Number of inactive leaves (in hundreds)
• Total nodes (in hundreds)
• Tree depth
• Training speed (% of generation speed)
• Prediction speed (% of generation speed)

Sensor Network (100K memory limit)
Handheld Environment (32MB memory limit)
Server Environment (400MB memory limit)

Overall results - comments
• VFML10 is superior on average in all environments, followed closely by GAUSS10
• GK methods are generally competitive
• BINTREE is only competitive in a server setting
• The default setting of 1000 for VFML is a poor choice
• Crude binning leaves more space free, which leads to faster growth and better trees (more room to grow)
• Higher values for GAUSS lead to very deep trees (deeper than the number of attributes), suggesting repeated splitting (too fine grained)

Remarks – sensor network environment
• The number of training examples is low because learning stops when the last active leaf is deactivated (memory management freezes nodes: low # examples, low probability of splitting)
• Most accurate methods: VFML10, GAUSS10

Remarks – Handheld Environment
• Generates smaller trees (than server) and can therefore process more examples

Remarks – Server Environment

VFML10 vs GAUSS10 – Closer Analysis
• Recall VFML10 is superior on average
• Sensor (avg 87.7 vs 86.2)
  • GAUSS10 superior on 10
  • VFML10 superior on 6 (2 no difference)
• Handheld (avg 91.5 vs 91.4)
  • GAUSS10 superior on 4
  • VFML10 superior on 8 (6 no difference)
• Server (avg 91.4 vs 91.2)
  • GAUSS10 superior on 6
  • VFML10 superior on 6 (6 no difference)

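
As a quick consistency check, the win/tie counts above can be tallied per scenario and overall. The numbers are copied from this slide; the variable and function names are illustrative.

```python
# Win/tie counts copied from the slide (per-dataset comparison in each scenario).
RESULTS = {
    "Sensor": {"GAUSS10": 10, "VFML10": 6, "tie": 2},
    "Handheld": {"GAUSS10": 4, "VFML10": 8, "tie": 6},
    "Server": {"GAUSS10": 6, "VFML10": 6, "tie": 6},
}


def totals(results):
    """Sum wins and ties for each outcome across all scenarios."""
    out = {}
    for counts in results.values():
        for name, k in counts.items():
            out[name] = out.get(name, 0) + k
    return out


if __name__ == "__main__":
    for scenario, counts in RESULTS.items():
        print(scenario, counts, "datasets:", sum(counts.values()))
    print("overall:", totals(RESULTS))
```

Each scenario accounts for 18 datasets, matching the generators listed earlier (4 random tree, 2 random RBF, 2 waveform, 10 GenF). Summed over the three scenarios, each method wins 20 comparisons and 14 are ties.
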
Data order

Conclusion
• We have presented a method for handling numeric attributes in data streams that performs well in empirical studies
• The methods employing the most approximation were superior: they allow greater growth when memory is limited
• In a dataset-by-dataset analysis there is not much to choose between VFML10 and GAUSS10
• Gains made in handling numeric variables come at a cost in training and prediction speed, and the cost is high in some environments

All algorithms available
• https://sourceforge.net/projects/moa-datastream
• All methods and an environment for experimental evaluation of data streams are available from the above URL; the system is called Massive Online Analysis (MOA)