Active Mining of Data Streams. Wei Fan, Yi-an Huang, Haixun Wang and Philip S. Yu. Proc. SIAM International Conference on Data Mining 2004. Speaker: Pei-Min Chou. Date: 2005/01/14.
Introduction
• In most real-world problems, labelled data streams are rarely available immediately
• Models are refreshed periodically
• We propose a new concept: demand-driven active data mining
Method
• Step 1: Detect potential changes in the data stream (the "guess")
• Step 2: If the guessed loss or error rate is higher than the tolerable maximum, choose a small number of data records and investigate their true class labels
• Step 3: If the statistically estimated loss is higher than the tolerable maximum, reconstruct the old model
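The three steps can be read as a small control loop. The sketch below is only an illustration of that flow, not the authors' implementation: `guess_loss`, `estimate_loss` and `rebuild_model` are hypothetical callables standing in for the unlabelled guess, the label-based estimate and the model reconstruction.

```python
from typing import Callable, Sequence


def active_mining_step(
    guess_loss: Callable[[Sequence], float],
    estimate_loss: Callable[[Sequence], float],
    rebuild_model: Callable[[], None],
    stream_chunk: Sequence,
    tolerable_max: float,
) -> str:
    """Run one pass of the three-step procedure on a chunk of the stream."""
    # Step 1: guess the loss from unlabelled data only (no true labels needed).
    if guess_loss(stream_chunk) <= tolerable_max:
        return "keep"  # no sign of a harmful change
    # Step 2: the guess is too high, so pay for labels on a small sample
    # and estimate the loss statistically.
    if estimate_loss(stream_chunk) <= tolerable_max:
        return "keep"  # the guess was a false alarm
    # Step 3: the estimated loss is also too high, so reconstruct the model.
    rebuild_model()
    return "rebuild"
```

The point of the ordering is that the expensive step (obtaining true labels) runs only when the cheap guess already signals trouble.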
Definition (1)
• Dc: complete data set
• D: training set
• S: data stream
• dt: decision tree constructed from D
• Tolerable maximum: the exact value is defined by each application
Definition (2)
• nl: number of instances classified by leaf l
• N: size of the data stream (number of instances)
• Statistic at leaf l: $P(l) = n_l / N$
• $\sum_{l} P(l) = 1$ over all leaves of dt
Example
• D: training set
• Dc: complete data set
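Example: decision tree
[Figure: decision tree dt built from D. Internal nodes test "Bank is ICE", "Local is A", "Bank is IBE", "Local is B" and "Price is 100", each with yes/no branches; the leaves are C1 to C6.]
Training records at each leaf, with PD(l) = nl / N and N = 7:
• C1: Billy, Tom (2/7)
• C2: Mary (1/7)
• C3: Ella (1/7)
• C4: John (1/7)
• C5: no records (0)
• C6: Paul, Amy (2/7)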
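As a check on the numbers in this example, the leaf statistic P(l) = nl / N can be computed directly from the record-to-leaf assignment. This is a minimal sketch; the function name `leaf_statistics` is made up for illustration, and the assignment is the one read off the slide.

```python
from collections import Counter
from fractions import Fraction


def leaf_statistics(leaf_of_record, leaves):
    """Return P(l) = n_l / N for every leaf, including empty leaves."""
    counts = Counter(leaf_of_record.values())
    n = len(leaf_of_record)
    return {leaf: Fraction(counts[leaf], n) for leaf in leaves}


LEAVES = ["C1", "C2", "C3", "C4", "C5", "C6"]

# Training set D: which leaf of dt each record falls into.
training = {
    "Billy": "C1", "Tom": "C1", "Mary": "C2", "Ella": "C3",
    "John": "C4", "Paul": "C6", "Amy": "C6",
}

p_d = leaf_statistics(training, LEAVES)
print(p_d["C1"], p_d["C5"], sum(p_d.values()))  # 2/7 0 1
```

Exact fractions are used so the values match the slide and the statistics visibly sum to 1.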
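Observable Statistics (1)
• PS(l): statistic at leaf l in the data stream S
• PD(l): statistic at leaf l in the training set D
• PS: change of the leaf statistics on the data stream, measured from the differences between PS(l) and PD(l)
• A large PS means that a significant change has occurred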
Example (2)
[Figure: the same decision tree dt applied to the new data stream S; leaves C1 to C6.]
Stream records at each leaf, with PS(l) = nl / N and N = 5:
• C1: no records (0)
• C2: Erin, Hebe (2/5)
• C3: no records (0)
• C4: Boss, Sam (2/5)
• C5: JoJo (1/5)
• C6: no records (0)
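The two examples give enough data to compute a change statistic by hand. The slides do not spell out the formula for PS, so the sketch below assumes one plausible form: the sum of absolute differences between PS(l) and PD(l) over all leaves, halved so that the result lies between 0 (identical distributions) and 1 (no overlap). The name `leaf_change` is hypothetical.

```python
from fractions import Fraction as F


def leaf_change(p_d, p_s):
    """Half the summed absolute difference between two leaf distributions.

    Assumption: this normalisation is chosen for illustration only.
    """
    leaves = set(p_d) | set(p_s)
    total = sum(abs(p_s.get(l, F(0)) - p_d.get(l, F(0))) for l in leaves)
    return total / 2


# Leaf statistics from the training example (N = 7) and the stream example (N = 5).
p_d = {"C1": F(2, 7), "C2": F(1, 7), "C3": F(1, 7), "C4": F(1, 7), "C5": F(0), "C6": F(2, 7)}
p_s = {"C1": F(0), "C2": F(2, 5), "C3": F(0), "C4": F(2, 5), "C5": F(1, 5), "C6": F(0)}

print(leaf_change(p_d, p_s))  # 5/7
```

Under this assumption roughly 71% of the probability mass has moved between leaves, which would count as a large change for most tolerable maxima.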
Observable Statistics (2)
• La: validation loss
• Le: sum of the expected loss at every leaf
• LS: potential change in loss due to changes in the data stream
• Difference from PS: LS takes the loss function into account
Example (3)
[Figure: the same decision tree dt applied to the new data stream S; the records Hebe and Erin fall in leaf C2.]
• Leaf C2 receives 30% of the stream, and its majority class has probability 0.7
• Le(C2) = (1 - 0.7) × 30% = 9%
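The arithmetic for leaf C2 generalises to the whole tree by summing over leaves. The sketch below assumes 0-1 loss, so a leaf's expected loss is (1 - majority-class probability) times its share of the stream, as in the example. It also assumes LS is the absolute gap between Le and La, which the slides do not state explicitly; all function names are hypothetical.

```python
def leaf_expected_loss(p_major: float, share: float) -> float:
    """Expected 0-1 loss contributed by one leaf."""
    return (1.0 - p_major) * share


def expected_loss(leaves) -> float:
    """Le: sum of per-leaf expected losses; leaves maps name -> (p_major, share)."""
    return sum(leaf_expected_loss(p, s) for p, s in leaves.values())


def loss_change(l_e: float, l_a: float) -> float:
    """LS, assumed here to be |Le - La|."""
    return abs(l_e - l_a)


# Leaf C2 from the example: majority probability 0.7, 30% of the stream.
c2 = leaf_expected_loss(0.7, 0.30)
print(f"Le(C2) = {c2:.2%}")  # Le(C2) = 9.00%
```

To obtain the full Le, pass every leaf of the tree to `expected_loss`; only C2 is shown here because it is the only leaf whose numbers appear in the example.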
Loss Estimation
• Used when the two statistics rise above the tolerable maximum
• Investigate the true class labels of a selected number of examples
• Assume the losses of the examples are {l1, l2, l3, ..., ln}
• Average loss: $\bar{l} = \sum_{i} l_i / n$
• Standard error: $\sqrt{\sum_{i} (l_i - \bar{l})^2 / (n(n-1))}$
• Investigation cost: obtaining true labels is not free
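The sampling estimate above can be sketched in a few lines. This assumes simple random sampling and the usual sample standard error of the mean; the normal-approximation multiplier `z = 1.96` (roughly a 95% range) and the sample losses are illustrative assumptions, not values from the paper.

```python
import math


def estimate_loss(losses, z=1.96):
    """Return (average loss, standard error, (low, high) range) for sampled losses."""
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two labelled examples")
    mean = sum(losses) / n
    # Unbiased sample variance, then the standard error of the mean.
    variance = sum((x - mean) ** 2 for x in losses) / (n - 1)
    std_err = math.sqrt(variance / n)
    return mean, std_err, (mean - z * std_err, mean + z * std_err)


# Hypothetical 0-1 losses for ten examples whose true labels were investigated.
sample = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
mean, std_err, (low, high) = estimate_loss(sample)
print(f"average loss {mean:.2f}, standard error {std_err:.3f}")
```

Because each label has a cost, n is kept small; the standard error shrinks with the square root of n, so it shows how much precision an extra batch of labels would buy.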
Experiment (1)
• Changes in the observable statistics are a good indicator of change in the data stream
Experiment: Results
• The two statistics are very well correlated with the amount of change
• The statistically estimated loss range is very close to the true value
Conclusion
• The error is estimated without knowing the true class labels
• A statistical sampling method estimates the range of the true loss
• The model is reconstructed whenever the estimated loss is higher than the tolerable maximum