Scaling Decision Tree Induction

Outline • Why do we need scaling? • Cover state of the art methods • Details on my research (which is one of the state of the art methods)

Problems Scaling Decision Trees • Data doesn’t fit in RAM • Numeric attributes require repeated sorting • Noisy datasets lead to very large trees • Large datasets are fundamentally different from smaller ones: can’t store the entire dataset; the underlying phenomenon changes over time

Current State-Of-The-Art • Disk-based methods: SPRINT, SLIQ • Sampling methods: BOAT, VFDT & CVFDT • Data stream methods: VFDT & CVFDT

SPRINT/SLIQ • Shafer, Agrawal, Mehta • In the IBM Intelligent Miner for Data • Learns the same tree as traditional methods but works with data on disk • One scan over the data per level of the induced tree

SPRINT/SLIQ Details • Split the dataset into one file per attribute • (value, record ID) • Pre-sort each numeric attribute’s file • Do one scan over each file, find best split point • Use hash-tables to split the files maintaining sort order • Recur
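The single scan over a pre-sorted attribute file can be sketched as follows. This is a minimal sketch rather than the SPRINT/SLIQ implementation: the Gini index as the split criterion, the in-memory list standing in for the on-disk file, the class labels, and the names `gini` and `best_split` are all assumptions made for illustration.

```python
from collections import Counter

def gini(counts, total):
    """Gini impurity of a class-count table (0.0 for an empty partition)."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_split(attr_list):
    """One pass over a pre-sorted list of (value, label) entries.

    Returns (threshold, weighted_gini) for the best test `value <= threshold`,
    or (None, impurity) if no split improves on the unsplit node.
    """
    n = len(attr_list)
    below = Counter()                                  # class counts seen so far
    above = Counter(label for _, label in attr_list)   # class counts still ahead
    best = (None, gini(above, n))
    for i in range(n - 1):
        value, label = attr_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # a split point must fall between two distinct values
        n_below = i + 1
        score = (n_below * gini(below, n_below)
                 + (n - n_below) * gini(above, n - n_below)) / n
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best

# Hypothetical labels attached to a small sorted attribute list.
entries = sorted([(3, "no"), (5, "no"), (6, "no"),
                  (9, "yes"), (10, "yes"), (12, "yes")])
threshold, score = best_split(entries)
```

Because the list is already sorted, the class counts on each side of a candidate split are updated incrementally, so no re-sorting is needed at any node.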

SPRINT/SLIQ Splitting Example

Test Attrib     To Split      ‘hashtable’
val | rec       val | rec     rec | side
  3 | 3          10 | 1         1 | >
  5 | 2          14 | 6         2 | <
  6 | 5          20 | 2         3 | <
  9 | 1          25 | 4         4 | >
 10 | 4          30 | 3         5 | <
 12 | 6          40 | 5         6 | >

Result of splitting ‘To Split’:
    >               <
val | rec       val | rec
 10 | 1          20 | 2
 14 | 6          30 | 3
 25 | 4          40 | 5
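The hash-table step in this example can be sketched in a few lines. The record values below follow the example; the threshold of 7.5 on the test attribute is an assumption (the slide only shows which records fall on each side), and `split_attribute_lists` is a hypothetical name.

```python
def split_attribute_lists(test_list, other_list, threshold):
    """Split pre-sorted (value, record_id) lists on `test value > threshold`."""
    # Scan the test attribute's list once, recording each record's side.
    side = {rec: (">" if value > threshold else "<") for value, rec in test_list}
    # Scan the other list once; appending in scan order preserves the sort.
    parts = {">": [], "<": []}
    for value, rec in other_list:
        parts[side[rec]].append((value, rec))
    return side, parts

test_attrib = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
to_split = [(10, 1), (14, 6), (20, 2), (25, 4), (30, 3), (40, 5)]
side, parts = split_attribute_lists(test_attrib, to_split, threshold=7.5)
```

Each output list stays sorted because entries are appended in the order they are read, which is what lets the next level of the tree skip sorting.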

BOAT • Gehrke, Ganti, Ramakrishnan, Loh • Learns the same tree as traditional methods but can be as much as 3x faster than SPRINT/SLIQ • When things work out learns more than one level of tree in one scan over the database

BOAT Details • Read a sample of data into memory • Learn N trees via traditional methods on bootstrap samples from this sample • Keep the part of the tree on which all N trees agree exactly • Verify that subtree with a scan over all data • When this fails revert to SPRINT/SLIQ

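BOAT Example • [Figure: three bootstrap trees, each testing x1? (male / female) and then x2? with split points 65, 67, and 61; one tree continues with an x3? test (no / yes)] • [Figure: combined tree testing x1? (male / female), then x2? with the split point between 61 and 67, and the structure below marked ‘?’]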
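One way to picture the agreement step is a recursive comparison of the bootstrap trees. The sketch below is an illustration under assumptions, not BOAT's actual procedure: the `Node` and `CoarseNode` classes, the `agree` function, and the choice of which branch carries the x2 test are all hypothetical. Numeric split points are kept as an interval, as in the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    attr: str                          # attribute tested at this node
    threshold: Optional[float] = None  # None for a categorical test
    left: Optional["Node"] = None      # None marks a leaf
    right: Optional["Node"] = None

@dataclass
class CoarseNode:
    attr: str
    interval: Optional[Tuple[float, float]]  # range containing the split point
    left: Optional["CoarseNode"]
    right: Optional["CoarseNode"]

def agree(trees):
    """Return the structure shared by all bootstrap trees.

    Returns None where any tree has a leaf or the trees disagree on the
    split attribute; numeric thresholds are summarized as (min, max).
    """
    if any(t is None for t in trees):
        return None
    if len({t.attr for t in trees}) != 1:
        return None
    thresholds = [t.threshold for t in trees]
    if any(th is None for th in thresholds):
        interval = None
    else:
        interval = (min(thresholds), max(thresholds))
    return CoarseNode(trees[0].attr, interval,
                      agree([t.left for t in trees]),
                      agree([t.right for t in trees]))

# Three bootstrap trees shaped like the example (branch placement assumed).
trees = [
    Node("x1", right=Node("x2", 65)),
    Node("x1", right=Node("x2", 67, left=Node("x3"))),
    Node("x1", right=Node("x2", 61)),
]
coarse = agree(trees)
```

The scan over the full data then only has to confirm the shared tests and pin down the exact split point inside each interval.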

VFDT/CVFDT • Hulten, Spencer, Domingos • With high probability learns what traditional methods would learn, but much faster • Learns from a data stream instead of a database • CVFDT is an extension to time-changing concepts

Motivation • Why use a data stream model? • High data rate • Essentially infinite data • Data collected in varied circumstances • Need algorithms that are: • Constant time per example & use each example once • Incremental • Anytime • Produce results ‘equivalent’ to traditional methods

Hoeffding Trees • To pick the split attribute for a node, looking at a few examples may be sufficient • Given a stream of examples: • Use the first to pick the split at the root • Sort succeeding ones to the leaves • Pick best attribute there • Continue… • Leaves predict most common class

How Much Data? • Make sure best attribute is better than second • That is: G(best) – G(2nd best) > 0 • Using a statistical result: Hoeffding bound • Collect data till: G(best) – G(2nd best) > ε
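For reference, the Hoeffding bound behind this test is usually stated as below. The symbols R (the range of the split measure G), n (the number of examples seen at the leaf), and X_a, X_b (the best and second-best attributes) are notation assumed here rather than taken from the slide; δ is the confidence parameter passed to the algorithm.

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}},
\qquad
\text{split when}\quad \overline{G}(X_a) - \overline{G}(X_b) > \epsilon
```

With probability at least 1 − δ, the true mean of a variable with range R lies within ε of its mean over n independent observations, so an observed gap larger than ε means X_a is truly the better attribute with that probability.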

Hoeffding Tree Algorithm
Procedure HoeffdingTree(Stream, δ)
  Let HT = Tree with single leaf (root)
  Initialize sufficient statistics at root
  For each example (X, y) in Stream
    Sort (X, y) to leaf using HT
    Update sufficient statistics at leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split leaf on best attribute
      For each branch
        Start new leaf, init sufficient statistics
  Return HT
[Figure: example tree with root test x1? (male / female), a y=0 leaf, and a second test x2? (> 65 / <= 65) with leaves y=0 and y=1]
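The procedure above can be turned into a small runnable sketch. Several choices here are assumptions, not part of the slide: information gain as G, categorical attributes only, a range of log2(number of classes) for the bound, and child leaves created lazily when a value is first seen rather than all at once at split time. The class and function names are hypothetical.

```python
import math
from collections import defaultdict

def hoeffding_bound(value_range, delta, n):
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def entropy(counts):
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)

class Node:
    def __init__(self, attrs):
        self.attrs = attrs                    # attributes still available here
        self.n = 0
        self.class_counts = defaultdict(int)
        # Sufficient statistics: stats[attr][value][label] -> count
        self.stats = {a: defaultdict(lambda: defaultdict(int)) for a in attrs}
        self.split_attr = None                # set once the leaf is split
        self.children = {}

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attrs:
            self.stats[a][x[a]][y] += 1

    def gain(self, a):
        before = entropy(self.class_counts.values())
        after = sum(sum(c.values()) / self.n * entropy(c.values())
                    for c in self.stats[a].values())
        return before - after

class HoeffdingTree:
    def __init__(self, attrs, n_classes, delta):
        self.root = Node(list(attrs))
        self.value_range = math.log2(n_classes)
        self.delta = delta

    def sort(self, x):
        """Route an example to its leaf, creating child leaves lazily."""
        node = self.root
        while node.split_attr is not None:
            value = x[node.split_attr]
            if value not in node.children:
                rest = [a for a in node.attrs if a != node.split_attr]
                node.children[value] = Node(rest)
            node = node.children[value]
        return node

    def learn(self, x, y):
        leaf = self.sort(x)
        leaf.update(x, y)
        if len(leaf.attrs) < 2 or len(leaf.class_counts) < 2:
            return
        gains = sorted((leaf.gain(a), a) for a in leaf.attrs)
        (g_second, _), (g_best, best) = gains[-2], gains[-1]
        eps = hoeffding_bound(self.value_range, self.delta, leaf.n)
        if g_best - g_second > eps:
            leaf.split_attr = best

    def predict(self, x):
        counts = self.sort(x).class_counts
        return max(counts, key=counts.get) if counts else None

# Toy stream: the label equals attribute "a"; attribute "b" is uninformative.
tree = HoeffdingTree(["a", "b"], n_classes=2, delta=1e-7)
for i in range(200):
    tree.learn({"a": i % 2, "b": (i // 2) % 2}, i % 2)
```

On this toy stream the root splits on `a` after only a handful of examples, because the gap in gain between the two attributes is close to the full range of G.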

Properties of Hoeffding Trees • Model may contain incorrect splits: is it still useful? • Bound the difference from the infinite-data tree • Chance that an arbitrary example takes a different path • Intuition: an example on level i of the tree has i chances to go through a mistaken node
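The intuition can be written out as a short derivation. It assumes, as in the published analysis of Hoeffding trees, that an example falls into a leaf at each level with some fixed probability p (the leaf probability) and that each node is a mistaken split with probability at most δ; the symbols p and Δ are notation introduced here, not taken from the slide.

```latex
P(\text{leaf at level } i) = (1-p)^{i-1}\,p,
\qquad
E[\Delta] \;\le\; \sum_{i=1}^{\infty} i\,\delta\,(1-p)^{i-1}\,p \;=\; \frac{\delta}{p}
```

Here Δ is the probability that an arbitrary example takes a different path through the Hoeffding tree than through the infinite-data tree. The factor iδ is a union bound over the i nodes on the path.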

VFDT (Very Fast Decision Tree) • Memory management • Memory dominated by sufficient statistics • Deactivate less promising leaves when needed • Ties: • Wasteful to decide between identical attributes • Check for splits periodically • Pre-pruning (optional) • Only make splits that improve the value of G(.) • Early stop on bad attributes • Bootstrap with traditional learner • Rescan old data when time available
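The tie-breaking and periodic-check refinements can be sketched as below. It assumes a tie threshold τ under which the current best attribute is accepted once ε itself drops below τ, and a check interval `n_min`; the function names and the default `n_min=200` are hypothetical choices for illustration.

```python
import math

def hoeffding_bound(value_range, delta, n):
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, n, delta, tau, value_range=1.0):
    """Split on a clear winner, or break a tie once eps falls below tau."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second > eps) or (eps < tau)

def due_for_check(n, n_min=200):
    """Recompute G only every n_min examples instead of on every example."""
    return n > 0 and n % n_min == 0

# Two attributes with identical gain: no split until eps < tau.
early = should_split(0.5, 0.5, n=3000, delta=1e-7, tau=0.05)
late = should_split(0.5, 0.5, n=3300, delta=1e-7, tau=0.05)
```

Without the τ test, two equally good attributes would keep the leaf collecting examples indefinitely, since their difference never exceeds ε.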

Experiments • Compared VFDT and C4.5 (Quinlan, 1993) • Same memory limit for both (40 MB) • 100k examples for C4.5 • VFDT settings: δ = 10^-7, τ = 5% • Domains: 2 classes, 100 binary attributes • Fifteen synthetic trees 2.2k – 500k leaves • Noise from 0% to 30%

Running Times • Pentium III at 500 MHz running Linux • C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds • VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process • VFDT processes 32k examples per second (excluding I/O)
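The throughput figure follows directly from the timings on this slide; the short check below only restates that arithmetic (the variable names are illustrative).

```python
total_s, read_s, process_s = 6377, 5752, 625
examples = 20_000_000

# Reading plus processing accounts for the whole run.
accounted_for = read_s + process_s == total_s

# Examples per second, excluding I/O.
rate = examples / process_s
```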

Time-Changing Data Streams • Underlying concept often changes over time • Seasonal effects • Economic cycles • Etc. • Many KDD systems assume data is a sample from a stationary distribution • CVFDT -- Extends VFDT for time-changing data streams

Dealing with Time Changing Concepts • Out-of-date data misleads learner and results in larger or less accurate models • Maintain a window of the most recent examples • When new data arrives update the window and reapply the learner • Effective when window size similar to concept drift rate • Extremely inefficient!

Concept-adapting VFDT • Keep up to date with a window of size w • Incrementally incorporate and forget examples • Smoothly change the induced tree • Grow speculative structure • Change structure when more accurate • Incorporates each new example in constant time, instead of the O(w) time needed to relearn on the window

Window (Forgetting Examples) • Keep sufficient statistics at every node • Update with new & old examples • Keep an ID and only forget where needed • Quickly update leaf predictions • Periodically check for any invalid splits • Some portion due to incorrect initial splits • The rest due to changes in the data stream
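Incorporating and forgetting examples amounts to incrementing and decrementing the same counts. The sketch below shows this for a single node only; in the full method the statistics live at every node along an example's path. The class name `WindowedCounts` and its layout are assumptions for illustration.

```python
from collections import deque, defaultdict

class WindowedCounts:
    """Sufficient statistics over the most recent `window_size` examples."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.counts = defaultdict(int)   # (attr, value, label) -> count

    def _apply(self, x, y, sign):
        for attr, value in x.items():
            self.counts[(attr, value, y)] += sign

    def add(self, x, y):
        # Incorporate the new example.
        self.window.append((x, y))
        self._apply(x, y, +1)
        # Forget the oldest example once the window is full.
        if len(self.window) > self.window_size:
            old_x, old_y = self.window.popleft()
            self._apply(old_x, old_y, -1)

w = WindowedCounts(window_size=2)
w.add({"a": 1}, 0)
w.add({"a": 1}, 1)
w.add({"a": 0}, 1)   # pushes the first example out of the window
```

Each update touches a fixed number of counters, so its cost does not depend on the window size.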

Alternate Sub-Trees • When new test looks better grow alternate sub-tree • Replace the old when new is more accurate • This smoothly adjusts to changing concepts • [Figure: tree with tests Gender?, Pets?, College?, and Hair? and true/false leaves, with an alternate sub-tree]
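The replacement rule can be sketched as a small bookkeeping class that scores both sub-trees on incoming labelled examples. This is an illustration under assumptions: sub-trees are stood in for by plain callables, and the name `SubtreeSlot` and the `min_examples` threshold are hypothetical.

```python
class SubtreeSlot:
    """Holds the sub-tree in use plus an optional alternate grown beside it."""

    def __init__(self, current, alternate=None):
        self.current = current          # callable: x -> predicted label
        self.alternate = alternate      # callable or None
        self.reset()

    def reset(self):
        self.seen = self.current_errors = self.alternate_errors = 0

    def observe(self, x, y):
        """Score both sub-trees on a labelled example."""
        if self.alternate is None:
            return
        self.seen += 1
        self.current_errors += int(self.current(x) != y)
        self.alternate_errors += int(self.alternate(x) != y)

    def maybe_replace(self, min_examples=100):
        """Swap in the alternate once it has proved more accurate."""
        if (self.alternate is not None and self.seen >= min_examples
                and self.alternate_errors < self.current_errors):
            self.current, self.alternate = self.alternate, None
            self.reset()
            return True
        return False

# The old sub-tree always predicts 0; the alternate has learned the new concept.
slot = SubtreeSlot(current=lambda x: 0, alternate=lambda x: x % 2)
for i in range(100):
    slot.observe(i, i % 2)
replaced = slot.maybe_replace()
```

Keeping the old sub-tree in service until the alternate is demonstrably better is what makes the adjustment smooth rather than abrupt.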

CVFDT Details • Memory Requirements • When drift present, CVFDT uses fewer nodes than VFDT • Observed good results with relatively few alternate-trees • Update time • O(# attribs * # values * # classes * path length) • Independent of training set and window size!

Other things • Dynamic window size • Drastic changes in the data stream • Drastic changes in the induced model • No apparent changes (learn more detail)

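Synthetic Experiments • Concept based on parallel hyper-planes • The attribute whose axis is best aligned with the hyper-planes is the better split attribute; rotating the hyper-planes changes the structure of the ‘true’ tree • [Figure: regions labeled + and - before and after concept drift]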

Synthetic Experiments (cont.) • Compare CVFDT with VFDT • 5 million training examples • Drift inserted by periodically rotating hyper-planes • About 8% of test points change label each drift • 100,000 examples in window • 5% noise • Results sampled every 10k examples throughout the run and averaged
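A two-dimensional version of the rotating hyper-plane concept shows how drift changes labels. This is a simplified sketch, not the experimental setup: a single line through the center of the unit square, a 15-degree rotation, and the function names are all assumptions.

```python
import math
import random

def label(point, angle, offset=0.5):
    """True if the point lies on the positive side of a line through
    (offset, offset) whose normal makes `angle` radians with the x-axis."""
    nx, ny = math.cos(angle), math.sin(angle)
    return (point[0] - offset) * nx + (point[1] - offset) * ny >= 0

def fraction_changed(points, old_angle, new_angle):
    """Share of points whose label differs between the two concepts."""
    changed = sum(label(p, old_angle) != label(p, new_angle) for p in points)
    return changed / len(points)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(10_000)]

# At angle 0 the concept is axis-aligned: the label is simply x >= 0.5.
drift = fraction_changed(points, 0.0, math.radians(15))
```

Before the rotation a single test on the first attribute separates the classes; afterwards both attributes matter, so the structure of the ‘true’ tree changes.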

Error Rate vs. # Attributes

Tree Size vs. # Attributes

Detailed View of Single Run

Varying Levels of Drift

Details of Adaptation

Comparison With VFDT-window • CVFDT gets most of the accuracy gain • VFDT: 10 min • CVFDT: 46 min • VFDT-window: est. 548 days! • [Figure: plot comparing VFDT-Window, VFDT, and CVFDT]

Application: Web Data • Trace of all web requests from UW campus • 82.8 million requests over one-week period • Goal: to predict which pages to cache • CVFDT does better for first 70% of run • VFDT’s performance improved near end • Data seems to contain drift, but more study is needed

Open Issues • Continuous Attributes • Batch version of VFDT • Very Fast Post Pruning • Extending general method to other algorithms

Summary • Decision trees important, need some more work to scale to today's problems • Disk based methods • About one scan per level of tree • Sampling can produce equivalent trees much faster
