Memory-Constrained Data Mining

Memory-Constrained Data Mining Slobodan Vucetic Assistant ProfessorDepartment of Computer and Information Sciences Center for Information Science and Technology Temple University, Philadelphia

Scientific Data Mining LabDr. Slobodan Vucetic, Assistant ProfessorCIS Department, IST Center, Temple University, Philadelphia, USA Need: (see Nature of March 23, 2006) Amount of data in science every year Shift from computers supporting scientists to playing central role in testing, and even formulation, of scientific hypothesis Lab Mission: Developing an interface between data analysis and applied sciences Working on collaborative projects at the interface between computer science and other disciplines (sciences, engineering, business) Training students to become computational research scientists Research Tasks: Predictive Modeling Pattern Discovery Summarization

Scientific Data Mining Lab: Research Challenges • Spatial and temporal dependency • High dimensional data • Data collection bias • Data and knowledge fusion from multiple sources • Large-scale data • Missing/noisy/unstable attributes …

Scientific Data Mining Lab: Current Projects Data Mining • Resource-Constrained Data Mining (NSF) Earth Science Applications • Estimation of geophysical parameters from satellite data (NSF) Biomedical Applications • Gene expression data analysis(NIH, PA Dept. of Health) • Bioinformatics of protein disorder (PA Dept. of Health) • Bioinformatics core facility (PA Dept. of Health) • Text mining and Information retrieval (NSF) • Spatial modeling of disease and infection spread Spatial and Temporal Knowledge Discovery • Spatial-temporal data reduction (NSF) • Analysis of deregulated electricity markets • Analysis of highway traffic data

MISR: Multi-angle Imaging Spectro-Radiometer 9 view angles at Earth surface 4 Spectral bands 70.5º Da 60.0º 45.6º 26.1º Ca 0.0º Ba 26.1º Aa 45.6º 60.0º An 2800 km 70.5º Af Bf Cf 400-km swath width Df Scientific Data Mining Lab: Multiple-Source Spatial-Temporal Data Analysis Aim: Accurate and efficient estimation of geophysical parameters from MISR and MODIS instruments on Terra satellite and ground based observations (huge data streams) Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., A Data Mining Approach for the Validation of Aerosol Retrievals, IEEE Geoscience and Remote Sensing Letters, 2008.

Scientific Data Mining Lab: Temporal Data Mining Aim: analyze price vs. load dependences by discovering semi-stationary segments in multivariate time series • Result: several pricing regimes existed in California market Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) “Price-Load Relationships in California's Electricity Market," IEEE Trans. on Power Systems.

Scientific Data Mining Lab: Text Mining: Re-Ranking of Articles Retrieved by a Search Engine When topic is difficult to express as a query, often • No relevant articles are found by keyword search • Too many irrelevant articles are returned Biomedical Example: “Apurinic/apyrimidinic endonuclease”: 638citations returned by PubMed “Apurinic/apyrimidinic endonuclease disorder”: 1 citation (irrelevant) returned Result: Large lift of relevant retrievals in top 10 Han, B., Obradovic, Z., Hu, Z.Z., Wu, C. H. and Vucetic, S. (2006) “Substring Selection for Biomedical Document Classification,” Bioinformatics.

Scientific Data Mining Lab: Collaborative filtering Aim: Predict preferences of an active customer given his/her preferences on some items and a database of preferences of other customers • Result: Regression-based collaborative filtering algorithm is superior to the neighbor-based approach. It is two orders of magnitude faster on-line predicting; more accurate; more robust to small number of observed votes. • Vucetic, S., Obradovic, Z., Collaborative Filtering Using a Regression-Based Approach, Knowledge and Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.

Kissinger et al, 1995 Scientific Data Mining Lab: Bioinformatics: Protein Disorder Analysis Aim: Understanding protein disorder and its functions • Results: • Protein disorders are very common (contrary to a 20th century belief) • Fraction of disorder varies a lot by genomes • Different types of disorder exist in proteins • Involved with many important functions Vucetic, S., Brown C., Dunker A.K and Obradovic, Z., Flavors of Protein Disorder, Proteins: Structure, Function and Genetics, Vol. 52, pp. 573-584, 2003.

Scientific Data Mining Lab: Analysis of Highway Traffic Data Aim: understand traffic patters, predict traffic congestion and delays In progress…

Scientific Data Mining Lab: Spatio-Temporal Disease Modelling Aim: predict infection or disease risk, given the information about population movement • Result: movement information is very useful in prediction of the infection risk • Vucetic, S,. Sun, H., Aggregation of Location Attributes for Prediction of Infection Risk, Workshop on Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.

Scientific Data Mining Lab: Resource-Constrained Data Mining Aim: • Efficient knowledge discovery from large data by limited-capacity computing devices Approach: • Integration of data mining and data compression Figure1. left) Noisy checkerboard data – the goal is to discriminate between black and yellow dots and the achievable accuracy is 90%, middle) 100 randomly selected examples and the trained prediction model that has 76% accuracy, right) 100 examples selected by the reservoir algorithm and the trained prediction model that has 88% accuracy

Resource-Constrained Data Mining:Motivation • Data mining objective: • Efficient and accurate algorithms for learning from large data • Performance measures: • Accuracy • Scaling with data size (# examples, #attributes) • Mainstream data mining: • many accurate learning algorithms that scale linearly or even sub-linearly with data size and dimension, in both runtime and space • Caveat: • linear space scaling is often not sufficient  it implies an unbounded growth in memory with data size • Challenge: • how to learn from large, or practically infinite, data sets/streams using limited memory resources

Resource-Constrained Data Mining:Learning Scenario • Examples are observed • sequentially • in a single pass • Data stream examples • independent and identically distributed (IID) • Could store the data summary in • reservoir with fixed memory

Resource-Constrained Data Mining:Approaches • Model-Free: Reservoir Approach • Maintains a random sample of size R from data stream • Add xt with min(1, R/t), remove randomly • Caveat: random sampling often not optimal • Data-Free: Online algorithms • Updates the model as examples are observed • Perceptron: wt+1 = wt + (yt- f(xt))xt, where f(x) = wTx • Caveat: sensitive to data ordering • Hybrid: Data + Model • Implicitly done with Support Vector Machines (SVMs)

Resource-Constrained Data Mining:Objective  Develop a memory-constrained SVM algorithm • What is SVM? • Popular data mining algorithm for classification • The most accurate on many problems • Theoretically and practically appealing • Computationally expensive • Cubic training time cost O(N3) (e.g. neural nets are O(N)) • Quadratic training memory cost O(N2) (e.g. neural nets are O(N)) • Linear prediction cost O(N) (e.g. neural nets are O(1))

Goal: Use x1 and x2 to predict class y  {-1, 1} Assume linear prediction function f(x) = w1x1+w2x2+b sign(f(x)) is final prediction Challenge: What is better, f1(x) or f2(x) What is the best choice for f(x)? Answer: Best f(x) has the most wiggle space  it has largest margin x2 f1(x) f2(x) x1 Resource-Constrained Data Mining:SVM Overview

Resource-Constrained Data Mining:SVM Overview • Maximizing margin is equivalent to: minimize ||w||2 such thatyi f(xi)  1 • What if data are noisy? minimize ||w||2 + Cii such that yi f(xi)  1 -i, i 0 • What if problem is nonlinear? X  (X)

Resource-Constrained Data Mining:SVM Overview • Standard approach  convert to dual problem minimize ||w||2 + Cii such that yi f(xi)  1 -i, i 0 where Qij = yiyj(xi)(xj) = yiyjK(xi, xj) , K is the Kernel function Gaussian kernel: K(xi,xj) = exp(||xi – xj||2/A) i are Lagrange multipliers • Optimization becomes the Quadratic Programming Problem (minimizing convex function with linear constraints) • There is the optimal solution in O(N3) time and O(N2) space • SVM predictor: To predict class of example x, we should compare it with all training examples with i> 0

Resource-Constrained Data Mining:SVM Overview Support vectors 0<i<C Error vectors i=C Reserve vectors i=0 f(x) = -1 f(x) = +1

Resource-Constrained Data Mining:Incremental-Decremental SVM • Standard SVM solution is “batch”, meaning that all training data should be available for learning • Alternative is “online” SVM that can be update when new training data are available • Incremental-Decremental SVM [Cauwenberghs, Poggio, 2000] • For each new example, the update takes • O(Ns2) time, Ns – number of support vectors (0<i<C) • O(NsN) memory. Considering Ns = O(N), memory is O(N2) • Total cost for online training on N examples is • O(N3) time • O(N2) memory • The same as for batch mode

Resource-Constrained Data Mining: Memory-Constrained IDSVM • Idea • Modify IDSVM by upper-bounding number of support vectors • How  Twin Vector Machine (TVM) • Define budget B and a set of pivot vectorsq1…qB • Quantize each example to its nearest pivot, Q(x) = {qk, k = arg minj=1:B ||x-qj||} D = {(xi,yi), i = 1…N} Q(D) = {(Q(xi),yi), i = 1…N} • Training SVM on Q(D) is equivalent to SVM on TV, TV = {TVj, j = 1…B} (Twin Vector Set) TVj = {(qj,+1,nj+}, (qj,-1,nj-)} (Twin Vector) • O(N3)  O(B3) (constant) time; O(N2)  O(B2) (constant) memory • minimize ||w||2 + Cii • such that yi f(xi)  1 - i, • i0, i = 1…N • minimize ||w||2 + Cj(nj+j+ +nj-j-) • such that f(qj)  1 -j,-f(qj)  1 - j-, • j+, j- 0, j = 1…B

Resource-Constrained Data Mining: Online TVM Online-TVM • Input: Data stream D = {(xi,yi), i = 1…N}, budget B, kernel function K, slack parameter C • Output: TVM with parameters 1+,1-,… B+,B-, and b • Initialize TVM = 0, TV =  • for i = 1 to N • if Beneficial(xi) • Update-TV • Update-TVM

Resource-Constrained Data Mining: Online TVM Beneficial • if size(TV) < B or |f(xi)| m1 • return 1 • else • return 0 Online-TVM • Input: Data stream D = {(xi,yi), i = 1…N}, budget B, kernel function K, slack parameter C • Output: TVM with parameters 1+,1-,… B+,B-, and b • Initialize TVM = 0, TV =  • for i = 1 to N • if Beneficial(xi) • Update-TV • Update-TVM m1

Resource-Constrained Data Mining: Online TVM m2 (**)

Resource-Constrained Data Mining: Online TVM Merging Heuristics: • Nearest versus Weighted • Global versus One-Sided • Rejection merging

Resource-Constrained Data Mining:Results Budget B = 100

Resource-Constrained Data Mining:Results

Resource-Constrained Data Mining:Results Budget B = 100

Resource-Constrained Data Mining:Results

Resource-Constrained Data Mining:Conclusions • Memory-Constrained SVM is successful • Significantly higher accuracy than baseline • Close to the optimal approach • Merging heuristics are very important • Future work • Further improvements • Forgetting • Probabilistic merging • Use data compression • Non-IID streams

Thank You! More information: http://www.ist.temple.edu/~vucetic/ Collaboration/assistantship contact: Slobodan Vucetic CIS Department, IST Center, Temple University vucetic@ist.temple.edu

Memory-Constrained Data Mining