This paper explores leveraging cloud services to enhance analytical tasks within relational databases, reducing processing time, improving data security, and streamlining model management. It focuses on providing data mining algorithms as a service, overcoming drawbacks of external tools and privacy issues. Challenges like data volume and redundancy are addressed, and advantages such as workload reduction and simplified model handling are highlighted. The system attributes and components, along with job processing remarks, are detailed, emphasizing the hybrid processing approach and algorithm optimizations for efficient computation. Numerous algorithm optimizations and job scheduling strategies are proposed to accelerate data mining tasks effectively.
Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems • Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi • Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Motivation • Relational databases are a natural repository of data. • Enterprise Systems • But analytical tasks are often done outside the DBMS • Drawbacks • External data mining software • Data exporting • Privacy issues
Our proposal • Provide analytical algorithms as a service in the cloud, exploiting the processing power of DBMSs • DBMSs present both in the cloud and on the client side • No external packages required • Standard SQL queries, UDFs and aggregate UDFs • A set of off-the-shelf algorithms is provided
Challenges • Large volume of data to be transmitted • Matrix computations • Processing power requirements of number crunching • Data redundancy • Minimize I/O • All data in relational format • Avoid exporting tasks
Advantages • Cloud system can: • Reduce workload on the local system • Accelerate analytical processing • Enforce data security • Simplify multiple model management • No data mining software needs to be installed, either on the local system or in the cloud • Everything stored in relational tables
System attributes • Smart local processing: exploit CPU/RAM of local DBMS • Integrated: local DBMS and cloud DBMS are tightly integrated • Fast: one pass over the input table for most algorithms; parallel • Simple: algorithms are called through a stored procedure with default parameters • Relational: relational tables store models and job parameters
System Components • Cloud DBMS • Stored procedures, UDFs • Cloud management server • Handling data mining job requests • Monitoring job progress • Cost estimation for 3 alternative processing modes • Managing jobs • Local DBMS • Stored procedures, UDFs • Web application • User can post jobs using a web interface
Models • PCA • K-Means • Linear Regression • Variable Selection • Naïve Bayes
Remarks • Hybrid Mode: • Sufficient statistics calculated in local DBMS • Takes advantage of local processing power and RAM • Cloud DBMS receives a summarization • Transmitting the entire data set is avoided • Model computation in cloud DBMS • Cloud Mode: • Summarization step • Occurs in cloud • Large data sets: Sampling • Local Mode: • Preferred for small data sets • Summarization/Sampling
Job Scheduler • FIFO job scheduling by default • If the wait time for an individual job exceeds a threshold ψ, the system switches to SJF (shortest job first) • If most jobs take a long time to compute and the waiting time exceeds ψ, the system switches to Round Robin (RR) • As the load decreases, the system backtracks to SJF, then FIFO
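The escalation rule above can be sketched in a few lines of Python. This is only an illustration, not the system's scheduler: the `Job` fields, the `long_cost` cutoff, and the reading of "most jobs" as "more than half of the queue" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    est_cost: float  # hypothetical estimated run time of the job
    waited: float    # time the job has already spent in the queue


def choose_policy(queue, psi, long_cost):
    """Pick FIFO, SJF or RR from the current queue state.

    psi is the wait-time threshold from the slide; long_cost is an
    assumed cutoff above which a job counts as long-running.
    """
    if not queue or max(j.waited for j in queue) <= psi:
        return "FIFO"  # nobody has waited too long: keep the default
    long_jobs = sum(1 for j in queue if j.est_cost >= long_cost)
    if long_jobs > len(queue) / 2:
        return "RR"    # most jobs are long: share the processor in turns
    return "SJF"       # otherwise let short jobs go first


def order_jobs(queue, policy):
    """Return the dispatch order for the chosen policy."""
    if policy == "SJF":
        return sorted(queue, key=lambda j: j.est_cost)
    return list(queue)  # FIFO and RR both start from arrival order
```

Because the policy is recomputed from the current queue each time, the backtracking to SJF and FIFO falls out naturally: once waits drop below ψ, `choose_policy` returns `"FIFO"` again.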
Algorithm Optimizations • Sufficient statistics are exploited to accelerate data mining algorithms • Previous work [1] shows that Linear Regression, PCA, Naïve Bayes and K-means are efficiently computed by using the sufficient statistics n, L, Q • Sufficient statistics can be computed • On samples • On the whole data set
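As a small illustration of why n, L, Q are enough, the sketch below recovers the mean and covariance matrix (the input to PCA) without touching the data a second time. It assumes the usual definitions from the cited prior work, which the slide does not spell out: L is the sum of the points and Q is the sum of their outer products. The function names are made up for the example.

```python
def summarize(points):
    """One pass over the data: n = count, L = sum of x, Q = sum of x x^T."""
    d = len(points[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in points:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Population mean and covariance computed from n, L, Q alone."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


n, L, Q = summarize([[1.0, 2.0], [3.0, 6.0]])
mean, cov = mean_and_covariance(n, L, Q)
print(mean)  # [2.0, 4.0]
print(cov)   # [[1.0, 2.0], [2.0, 4.0]]
```

The summary has d + d² + 1 numbers regardless of n, which is what makes computing it on samples or on the whole data set equally cheap to ship.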
Sufficient Statistics: nLQ/Γ • Considering a data set X with n points • The sufficient statistics are generalized as: n = |X|, Z = [1, X, Y]
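One way to read the Z = [1, X, Y] notation, following the Gamma operator described in the references, is that each point contributes an augmented vector z = (1, x, y) and Γ is the sum of the outer products z zᵀ. That reading is an assumption here, since the slide does not show the formula for Γ. Under it, a minimal Python sketch looks like this (`gamma_matrix` is a hypothetical name):

```python
def gamma_matrix(X, Y):
    """Accumulate Gamma = sum of z z^T with z = (1, x_1..x_d, y), in one pass.

    X is a list of d-dimensional points, Y the matching list of targets.
    """
    d = len(X[0])
    size = d + 2
    G = [[0.0] * size for _ in range(size)]
    for x, y in zip(X, Y):
        z = [1.0] + list(x) + [y]  # augmented point
        for a in range(size):
            for b in range(size):
                G[a][b] += z[a] * z[b]
    return G


G = gamma_matrix([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0])
print(G[0][0])       # 2.0  -> n
print(G[0][1:3])     # [4.0, 6.0]  -> L
print(G[1][1:3])     # [10.0, 14.0]  -> first row of Q
print(G[1][3])       # 23.0  -> sum of x_1 * y
```

With this layout, n, L and Q sit in the top-left blocks of Γ, and the last row and column hold the cross products with Y that regression-style models need.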
Sufficient Statistics: nLQ/Γ • One set of sufficient statistics for each class/cluster is necessary for: • Naïve Bayes • K-means • One matrix Γ is enough for • PCA • Linear Regression • Variable Selection
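To make the per-class case concrete, here is a hedged Python sketch that keeps one (n, L, Q) triple per class label and derives Gaussian Naïve Bayes parameters from it. Storing only the diagonal of Q is a simplification assumed for this example (Naïve Bayes needs per-dimension variances only), and the function names are hypothetical.

```python
def class_statistics(rows):
    """rows: iterable of (label, x). Returns {label: [n, L, Qdiag]}."""
    stats = {}
    for label, x in rows:
        if label not in stats:
            stats[label] = [0, [0.0] * len(x), [0.0] * len(x)]
        s = stats[label]
        s[0] += 1                 # n for this class
        for a, v in enumerate(x):
            s[1][a] += v          # L: per-dimension sum
            s[2][a] += v * v      # diagonal of Q: per-dimension sum of squares
    return stats


def gaussian_parameters(stats):
    """Per-class prior, mean and variance for Gaussian Naive Bayes."""
    total = sum(s[0] for s in stats.values())
    params = {}
    for label, (n, L, Qd) in stats.items():
        mean = [l / n for l in L]
        var = [q / n - m * m for q, m in zip(Qd, mean)]
        params[label] = (n / total, mean, var)
    return params
```

The same per-group bookkeeping applies to K-means, with the cluster assignment playing the role of the class label in each iteration.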
Data transfer comparison • Data set: Physical Activity (n = 2.88M, d = 42) • Full data set: 880.00 MB • nLQ/Γ: 0.02 MB • Roughly 50,000 times smaller!
Optimizations • Sufficient Statistics • Calculated in one parallel scan • Aggregate UDFs • Multithreaded, RAM • Matrix computations in RAM • LAPACK integration • Fast, accurate, stable
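The one-parallel-scan idea relies on the statistics being additive: each partition can be summarized independently and the partial results merged. The Python sketch below mimics the accumulate/merge shape that aggregate UDFs typically have. It is not the system's UDF code; the class and function names are assumptions, and the thread pool only illustrates the structure (CPython threads will not speed up this CPU-bound loop).

```python
from concurrent.futures import ThreadPoolExecutor


class NLQ:
    """Running state for n, L, Q, shaped like an aggregate UDF."""

    def __init__(self, d):
        self.n = 0
        self.L = [0.0] * d
        self.Q = [[0.0] * d for _ in range(d)]

    def accumulate(self, x):
        """Fold one row into the running state."""
        self.n += 1
        for a, xa in enumerate(x):
            self.L[a] += xa
            for b, xb in enumerate(x):
                self.Q[a][b] += xa * xb

    def merge(self, other):
        """Combine the partial state of another partition into this one."""
        self.n += other.n
        for a in range(len(self.L)):
            self.L[a] += other.L[a]
            for b in range(len(self.L)):
                self.Q[a][b] += other.Q[a][b]
        return self


def scan_partition(partition, d):
    """Summarize a single partition with one pass over its rows."""
    state = NLQ(d)
    for x in partition:
        state.accumulate(x)
    return state


def parallel_summarize(partitions, d):
    """Scan partitions concurrently, then merge the partial states."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: scan_partition(p, d), partitions))
    total = NLQ(d)
    for part in partials:
        total.merge(part)
    return total
```

Only the merged state, a handful of numbers for small d, would need to travel from the local DBMS to the cloud in hybrid mode.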
Summary • Sufficient statistics transmitted to cloud • Hybrid processing is best • Job policy: FIFO → SJF → RR • Parallel summarization, parallel scan • Model computation in RAM in the cloud • Complicated number crunching in the cloud • Job and model history in the cloud • All data is in relational tables: they can be queried and stored securely
References • C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010 • C. Ordonez, Y. Zhang, W. Cabrera. The Gamma Operator for Big Data Summarization on an Array DBMS (BigMine 2014). JMLR W&CP 36:88-103, 2014 • C. Ordonez, C. Garcia-Alvarado, V. Baladandayuthapani. Bayesian Variable Selection in Linear Regression in One Pass for Large Data Sets. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015