Systematic Data Selection to Mine Concept Drifting Data Streams
Wei Fan, IBM T.J. Watson
About ...
• Data streams: a continuous stream of new data, generated either in real time or periodically, for example:
  • Credit card transactions
  • Stock trades
  • Insurance claim data
  • Phone call records
• Our notation:
[Figure: the data stream arrives in chunks t1, t2, t3, t4, t5; the earlier chunks are the old data, the most recent chunk is the new data.]
Data Stream Mining
• Characteristics: the underlying data characteristics may change over time.
• Main goal of stream mining: make sure that the constructed model is the most accurate and up-to-date.
Data Sufficiency
• Definition: a dataset is considered "sufficient" if adding more data items will not increase the final accuracy of a trained model significantly.
• We normally do not know whether a dataset is sufficient or not.
• Sufficiency detection:
  • Requires an expensive "progressive sampling" experiment: keep adding data and stop when accuracy no longer increases significantly.
  • The outcome depends on both the dataset and the algorithm, so it is difficult to make a general claim.
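To make the progressive-sampling idea concrete, here is a minimal Python sketch. It only illustrates the procedure described above, not code from the paper; the base learner, the sampling schedule, and the tolerance tol are all assumptions:

# Illustrative progressive-sampling check for data sufficiency.
# The base learner, the schedule, and tol are assumptions, not from the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def looks_sufficient(X_train, y_train, X_val, y_val, tol=0.005):
    prev_acc = None
    for frac in np.linspace(0.1, 1.0, 10):          # grow the training set step by step
        n = int(frac * len(X_train))
        model = DecisionTreeClassifier().fit(X_train[:n], y_train[:n])
        acc = model.score(X_val, y_val)
        if prev_acc is not None and acc - prev_acc < tol:
            return True                             # accuracy has leveled off: data looks sufficient
        prev_acc = acc
    return False                                    # still improving at full size: likely insufficient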
Possible changes of data streams
• Possible "concept drift": for the same feature vector, a different class label is generated at some later time, or stochastically, with different probabilities.
• Possible "data insufficiency": the new data alone may not be sufficient to train an accurate model.
• Other possible changes are not addressed in our paper.
• Most important of all: these are only "possibilities". There is no "oracle" out there to tell us the truth, so it is dangerous to make assumptions.
How many combinations?
• Four combinations:
  • Sufficient and no drift
  • Insufficient and no drift
  • Sufficient and drift
  • Insufficient and drift
• Question: does the "most accurate model" remain the same under all four situations?
Case 1: Sufficient and no drift
• Solution one: throw away the old models and data, and re-train a new model from the new data alone; this works by the definition of data sufficiency.
• Solution two: if the old model was trained from "sufficient" data, just keep using the old model.
Case 2: Sufficient and drift
• Solution one: train a new model from the new data alone, by the same "sufficiency" definition.
Case 3: Insufficient and no drift
• Possibility I: if the old model was trained from sufficient data, keep the old model.
• Possibility II: otherwise, combine the new data with the old data and train a new model.
Case 4: Insufficient and drift
• Obviously, the new data alone is not enough, by definition.
• What are our options? Use old data? But how?
See any problems?
• Which old data items can we use?
See more problems?
• We normally never know which of the four cases a real data stream belongs to.
• It may change over time from case to case.
• Normally, no truth is known a priori, or even later.
Solution
• Requirements:
  • The right solution should not be "one size fits all."
  • It should not make any assumptions; any assumption can be wrong.
  • It should be adaptive: let the data speak for itself.
• We prefer model A over model B if A is likely to be more accurate than B on the evolving data stream.
• No assumptions!
An "Un-biased" Selection framework
• Train FN from the new data only.
• Train FN+ from the new data plus selected consistent old data.
• Let FO be the previously most accurate model; update FO using the new data and call the result FO+.
• Use cross-validation to choose among the four candidate models {FN, FN+, FO, FO+} (a sketch follows below).
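The following Python-style sketch shows only the shape of the selection step; train, update, select_consistent, and cross_val_accuracy are hypothetical placeholders for the components named above, not the paper's actual implementation:

# Illustrative sketch of the candidate-model selection; all helpers are hypothetical.
def choose_model(new_data, old_data, FO, train, update, select_consistent, cross_val_accuracy):
    FN  = train(new_data)                                    # model from the new data only
    FNp = train(new_data + select_consistent(old_data, FN))  # new data + consistent old data
    FOp = update(FO, new_data)                               # previous best model, refreshed with new data
    candidates = [FN, FNp, FO, FOp]
    # Cross-validation estimates each candidate's accuracy on the new data chunk;
    # the most accurate candidate becomes the deployed model for this period.
    return max(candidates, key=lambda m: cross_val_accuracy(m, new_data))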
Consistent old data
• Theoretically, if we knew the true models, we could use them to choose consistent data. But we don't.
• Practically, we have to rely on "optimal models."
• Go back to the hyperplane example.
True model and optimal models
• True model: a perfect model that never makes mistakes.
• Such a model is not always possible due to:
  • The stochastic nature of the problem
  • Noise in the training data
  • Insufficient data
• Optimal model: defined over a given loss function.
Optimal Model
• A loss function L(t, y) evaluates performance, where t is the true label and y is the prediction.
• The optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
  • 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
  • Cost-sensitive loss: y* is the label that minimizes the "empirical risk", e.g., if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
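Spelled out as a formula (a standard expected-loss formulation consistent with the bullets above, not copied from the slides):

y^* = \arg\min_{y} \; \mathbb{E}_{t \sim P(t \mid x)} \big[ L(t, y) \big]

With the $1000 and $90 figures used above, the cost-sensitive rule reduces to predicting fraud whenever P(fraud|x) * 1000 > 90, that is, P(fraud|x) > 0.09.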
Random decision trees
• Train multiple trees (details to follow).
• Each tree outputs a posterior probability when classifying an example x.
• The probability outputs of the many trees are averaged as the final probability estimate.
• The loss function and the estimated probability are then used to make the best prediction.
Training
• At each node, an unused feature is chosen at random:
  • A discrete feature is "unused" if it has never been chosen previously on the decision path from the root to the current node.
  • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
Example
[Figure: an example random tree. The root tests Gender (M / F); deeper nodes test Age > 30 and Age > 25; nodes store class counts such as P: 1, N: 9 and P: 100, N: 150.]
Training: Continued
• We stop splitting when one of the following happens:
  • A node becomes empty, or
  • The total height of the tree exceeds a threshold, currently set to the total number of features.
• Each node of the tree keeps the number of examples belonging to each class.
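As an illustration of this training procedure (a sketch under stated assumptions, not the authors' code): the function below picks features at random, reuses continuous features with fresh thresholds, stops on empty nodes or at the height limit, and stores per-class counts in every node. The Node layout and the way a threshold is drawn are assumptions:

# Illustrative random-tree training; Node layout and threshold choice are assumptions.
import random
from collections import Counter

class Node:
    def __init__(self):
        self.feature = None          # feature index tested at this node (None = leaf)
        self.threshold = None        # threshold for continuous features
        self.children = {}           # branch key ("le"/"gt" or category value) -> child Node
        self.counts = Counter()      # class label -> number of training examples seen here

def grow(X, y, discrete, used=frozenset(), depth=0, max_depth=None):
    node = Node()
    node.counts = Counter(y)
    if max_depth is None:
        max_depth = len(X[0]) if X else 0          # height limit = number of features
    if not X or depth >= max_depth:                # stop: empty node or height limit reached
        return node
    candidates = [f for f in range(len(X[0])) if f not in used]  # discrete features used once per path
    if not candidates:
        return node
    f = random.choice(candidates)                  # random pick, no information gain computed
    node.feature = f
    if f in discrete:
        used = used | {f}
        parts = {}
        for xi, yi in zip(X, y):
            parts.setdefault(xi[f], ([], []))
            parts[xi[f]][0].append(xi)
            parts[xi[f]][1].append(yi)
    else:
        node.threshold = random.choice([xi[f] for xi in X])      # a simple illustrative threshold draw
        parts = {"le": ([], []), "gt": ([], [])}
        for xi, yi in zip(X, y):
            key = "le" if xi[f] <= node.threshold else "gt"
            parts[key][0].append(xi)
            parts[key][1].append(yi)
    for key, (Xs, ys) in parts.items():
        node.children[key] = grow(Xs, ys, discrete, used, depth + 1, max_depth)
    return node

Several such trees would be grown independently and their probability outputs averaged, as the next slides describe.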
Classification
• Each tree outputs a class-membership probability, e.g., p(fraud|x) = n_fraud / (n_fraud + n_normal).
• If a leaf node is empty (quite likely when a discrete feature is tested near the bottom of the tree), use the parent node's probability estimate; do not output 0 or NaN.
• The membership probabilities from the multiple random trees are averaged to form the final output.
• A loss function is required to make a decision:
  • 0-1 loss: if p(fraud|x) > 0.5, predict fraud.
  • Cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
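Continuing the same hypothetical Node sketch, classification walks each tree, backs off to the parent's counts when it hits an empty node, averages the per-tree probabilities, and applies the loss function:

# Illustrative prediction with the Node sketch above; back-off and decision rule follow the slide.
def tree_probability(node, x, target, parent_counts=None):
    counts = node.counts if sum(node.counts.values()) > 0 else parent_counts  # empty node: back off
    if node.feature is None:                                  # reached a leaf
        return counts[target] / sum(counts.values())
    if node.threshold is None:                                # discrete split
        child = node.children.get(x[node.feature])
    else:                                                     # continuous split
        child = node.children["le" if x[node.feature] <= node.threshold else "gt"]
    if child is None:                                         # unseen branch: use this node's estimate
        return counts[target] / sum(counts.values())
    return tree_probability(child, x, target, counts)

def predict_fraud(trees, x, cost=1000.0, overhead=90.0):
    p = sum(tree_probability(t, x, "fraud") for t in trees) / len(trees)  # average over the trees
    return p * cost > overhead                                # cost-sensitive decision rule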
N-fold Cross-validation with Random Decision Trees
• The tree structure is independent of the data, since features and thresholds are chosen at random.
• Cross-validation therefore only needs a compensation step when computing the probability: remove the held-out example's own contribution from the class counts along its path.
Key advantage
• n-fold cross-validation comes easily: it has the same cost as testing the model once on the training data.
• Training is efficient since we do not compute information gain.
• It is actually also very accurate.
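One way to picture the compensation step (an illustrative leave-one-out variant built on the same hypothetical Node sketch; the exact procedure in the paper may differ):

# Illustrative leave-one-out scoring by count compensation; not the paper's exact procedure.
def cross_validated_probability(tree, x, true_label, target):
    path, node = [], tree
    while node is not None:                       # one pass down the tree, as in normal testing
        path.append(node)
        node.counts[true_label] -= 1              # compensate: remove x's own contribution
        if node.feature is None:
            break
        if node.threshold is None:
            node = node.children.get(x[node.feature])
        else:
            node = node.children["le" if x[node.feature] <= node.threshold else "gt"]
    leaf = path[-1]
    total = sum(leaf.counts.values())
    p = leaf.counts[target] / total if total > 0 else 0.5     # crude fallback for an emptied node
    for n in path:                                # restore the counts afterwards
        n.counts[true_label] += 1
    return p

Because the tree structure never depends on any particular training example, removing an example only affects the stored counts, which is why n-fold cross-validation costs no more than scoring the training set once.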
Experiments
• A demo is available; please contact me.
• The paper reports experiments on:
  • Synthetic datasets
  • Credit card fraud datasets
  • Donation datasets
Compare
• The new selective framework proposed in this paper.
• Our last year's hard-coded ensemble framework, which uses K weighted ensembles:
  • K = 1: train on the new data only.
  • K = 8: use the new data plus the models from the previous 7 periods; each classifier is weighted by its performance on the new data.
• Settings: both sufficient and insufficient data, always with drift.
Independent study and implementation of random decision trees
• Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
• Edward Greengrass from the DOD, on their own datasets:
  • 100 to 300 features, both categorical and continuous; some features have a lot of values.
  • 2000 to 3000 examples.
  • Both binary and multi-class problems (16 and 25 classes).
Related publications on random trees
• "Is random model better? On its accuracy and efficiency", ICDM 2003.
• "On the optimality of probability estimation by random decision trees", AAAI 2004.
• "Mining concept-drifting data streams using ensemble classifiers", SIGKDD 2003.