A Middleware for Developing and Deploying Scalable Remote Mining Services

Leonid Glimcher, Gagan Agrawal
DataGrid Lab
Abundance of Data

Data is generated everywhere:
• Sensors (intrusion detection, satellites)
• Scientific simulations (fluid dynamics, molecular dynamics)
• Business transactions (purchases, market trends)

Analysis is needed to translate data into knowledge, and growing data sizes create problems for data mining.
Grid and Cloud Computing

The Grid provides access to otherwise inaccessible:
• Resources
• Data

Goal of grid computing: to provide a stable foundation for distributed services. The price of such a foundation: standards are needed for distributed service integration (OGSA and WSRF).

Cloud computing: access to resources (data storage and processing) as services available for consumption.
Remote Data Analysis

Remote data analysis is a good fit for the Grid, but the details can be very tedious:
• Data retrieval, movement, and caching
• Parallel data processing
• Resource allocation
• Application configuration

Middleware can be useful to abstract these details away.
Our Approach

Supporting the development of scalable applications that process remote data using middleware: FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid).

(Diagram: middleware user, repository cluster, compute cluster.)
Outline

• Motivation
• Introduction
• Middleware Overview
• Challenges
• Experimental Evaluation
• Related Work
• Conclusion
FREERIDE-G Grid Service

Grid computing standards and infrastructures:
• OGSA and WSRF (the latter merged Grid and Web services)
• Data hosts compliant with repository standards (SRB)
• MPICH-G2: a pre-WS mechanism to support MPI

The Globus Toolkit and MPICH-G2 are the most fitting infrastructure for converting FREERIDE-G into a grid service.
SDSC Storage Resource Broker

A standard for remote data:
• Storage
• Access

Data can be distributed across organizations and heterogeneous storage systems. SRB is used to host data for FREERIDE-G, with access provided through its client API (a hedged sketch follows below).
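As a rough illustration of what client-side access looks like, the sketch below uses calls from the SRB C client library (srbConnect, srbObjOpen, srbObjRead, srbObjClose, clFinish). The exact signatures, the port, and the authentication scheme vary across SRB releases, so treat every detail here as an assumption rather than the actual FREERIDE-G code.

    // Hedged sketch: reading one remote object through the SRB client API.
    // Hostnames, paths, and credentials are placeholders.
    #include <fcntl.h>    // O_RDONLY
    extern "C" {
    #include "srbClient.h"   // SRB client header; name/path may differ by release
    }

    int read_remote_chunk(char* buf, int len) {
        // Connect to the SRB Master (default port is often 5544).
        srbConn* conn = srbConnect("srb-host.example.edu", "5544", "password",
                                   "user", "domain", "ENCRYPT1", NULL);
        if (conn == NULL || clStatus(conn) != CLI_CONNECTION_OK)
            return -1;

        // Open a dataset object inside a collection; an SRB Agent services this.
        int fd = srbObjOpen(conn, "chunk0042.dat", O_RDONLY,
                            "/home/collection/dataset");
        if (fd < 0) { clFinish(conn); return -1; }

        int nread = srbObjRead(conn, fd, buf, len);  // pull bytes from the repository
        srbObjClose(conn, fd);
        clFinish(conn);                              // tear down the connection
        return nread;
    }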
FREERIDE-G Processing Structure

KEY observation: most data mining algorithms follow a canonical loop.

Middleware API:
• Subset of data to be processed
• Reduction object
• Local and global reduction operations
• Iterator

Derived from the precursor system FREERIDE. The canonical loop:

    while (not done) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        ...
    }
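To make the API concrete, here is a minimal C++ sketch of a reduction-object interface mirroring the canonical loop. The names (Instance, ReductionObject, local_reduce, global_reduce, mining_loop) are illustrative assumptions, not the published FREERIDE-G interface.

    #include <cstddef>

    // A data instance: a pointer to its attribute values and their count.
    struct Instance { const double* data; std::size_t dim; };

    class ReductionObject {
    public:
        virtual ~ReductionObject() {}
        // Local reduction: I = process(d); R(I) = R(I) op d
        virtual void local_reduce(const double* d, std::size_t dim) = 0;
        // Global reduction: merge a reduction object updated on another node.
        virtual void global_reduce(const ReductionObject& other) = 0;
    };

    // The middleware owns the canonical loop; the application supplies only
    // the reduction operations and a convergence test.
    template <typename Iterator>
    void mining_loop(Iterator begin, Iterator end, ReductionObject& R,
                     bool (*done)(const ReductionObject&)) {
        do {
            for (Iterator it = begin; it != end; ++it)   // forall instances d
                R.local_reduce(it->data, it->dim);       // R(I) = R(I) op d
            // ... inter-node exchange and R.global_reduce(...) happen here ...
        } while (!done(R));
    }

Because all communication is funneled through the reduction object, the middleware can parallelize the loop and move data without application changes.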
FREERIDE-G Evolution

• FREERIDE: data stored locally
• FREERIDE-G-ADR: ADR responsible for remote data retrieval
• FREERIDE-G-SRB: SRB responsible for remote data retrieval
• FREERIDE-G grid service: adds load balancing and data integration
Evolution

(Diagram: FREERIDE → FREERIDE-G-ADR → FREERIDE-G-SRB → FREERIDE-G-GT, with application and data layered over ADR, SRB, and Globus.)
FREERIDE-G System Architecture

(Architecture diagram.)
Implementation Challenges I

Interaction with the code repository:
• Simplified Wrapper and Interface Generator (SWIG)
• XML descriptors of API functions
• Each API function wrapped in its own class

(Diagram: C++ code fed through SWIG, with XML descriptors, to produce Java bindings. A hypothetical interface sketch follows below.)
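For concreteness, a minimal SWIG interface for one wrapped API header might look like the following. The module and header names are hypothetical, not the project's actual files.

    /* freerideg.i -- hypothetical SWIG interface; module and header
       names are assumptions, not FREERIDE-G's actual files. */
    %module freerideg
    %{
    #include "reduction_object.h"   /* the C++ API being exposed to Java */
    %}
    /* Each class pulled in here gets its own generated Java proxy class. */
    %include "reduction_object.h"

Running swig -c++ -java on such an interface emits the JNI glue code plus one Java class per wrapped C++ class, which matches the "each API function wrapped in its own class" approach above.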
Implementation Challenges II

Integration with MPICH-G2 and the Globus Toolkit (a hedged job-description sketch follows below):
• Supports MPI
• Deployed through pre-WS Globus components
• Hides potential heterogeneity in service startup and management
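As a hedged illustration of what such a deployment looks like, an MPICH-G2 job is typically described by a Globus RSL request and launched through MPICH-G2's mpirun. Every contact string, path, and count below is a placeholder, and the exact attribute and flag names should be checked against the installed Globus/MPICH-G2 versions.

    + ( &(resourceManagerContact="compute.cluster.example.edu")
        (count=8)
        (jobtype=mpi)
        (label="subjob 0")
        (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
        (executable=/home/user/freeride-g/compute_server) )

A request like this would then be submitted with something along the lines of mpirun -globusrsl job.rsl, letting Globus hide the per-site differences in startup and management.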
FREERIDE-G Applications

VortexPro:
• Finds vortices in volumetric fluid/gas flow datasets

Kmeans Clustering:
• Clusters points based on Euclidean distance in an attribute space

EM Clustering:
• Another distance-based clustering algorithm

(A sketch of how k-means maps onto the reduction-object API follows below.)
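To show how an application like k-means fits the processing structure, here is a hedged sketch continuing the hypothetical ReductionObject interface from the earlier API sketch: the reduction object holds per-cluster sums and counts, local_reduce assigns a point to its nearest centroid, and global_reduce merges partial results from other nodes.

    // Sketch: k-means as a reduction object (hypothetical API from the
    // earlier sketch; assumes at least one centroid).
    #include <cfloat>
    #include <vector>

    class KmeansReduction : public ReductionObject {
    public:
        explicit KmeansReduction(const std::vector<std::vector<double> >& centroids)
            : centroids_(centroids),
              sums_(centroids.size(),
                    std::vector<double>(centroids[0].size(), 0.0)),
              counts_(centroids.size(), 0) {}

        // I = nearest centroid; R(I) = R(I) op d (accumulate sum and count).
        void local_reduce(const double* p, std::size_t dim) {
            std::size_t best = 0;
            double best_d = DBL_MAX;
            for (std::size_t k = 0; k < centroids_.size(); ++k) {
                double d = 0.0;
                for (std::size_t j = 0; j < dim; ++j)
                    d += (p[j] - centroids_[k][j]) * (p[j] - centroids_[k][j]);
                if (d < best_d) { best_d = d; best = k; }
            }
            for (std::size_t j = 0; j < dim; ++j) sums_[best][j] += p[j];
            ++counts_[best];
        }

        // Merge partial sums/counts computed on another node.
        void global_reduce(const ReductionObject& other) {
            const KmeansReduction& o = static_cast<const KmeansReduction&>(other);
            for (std::size_t k = 0; k < sums_.size(); ++k) {
                for (std::size_t j = 0; j < sums_[k].size(); ++j)
                    sums_[k][j] += o.sums_[k][j];
                counts_[k] += o.counts_[k];
            }
        }

    private:
        std::vector<std::vector<double> > centroids_;
        std::vector<std::vector<double> > sums_;   // per-cluster attribute sums
        std::vector<long> counts_;                 // per-cluster point counts
    };

After the global reduction, each new centroid is simply sums_[k] divided by counts_[k]; the same structure carries over to EM, with soft memberships in place of the nearest-centroid assignment.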
Experimental Setup

Organizational Grid:
• Data hosted on an Opteron 250 cluster
• Processed on an Opteron 254 cluster
• Clusters connected by two 10 GB optical fibers

Goals:
• Demonstrate the parallel scalability of the applications
• Evaluate the overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
Scalability Evaluation

Implementations are scalable with respect to the numbers of:
• Compute nodes
• Data repository nodes

(Figures: Vortex Detection with a 14.8 GB dataset; Kmeans Clustering with a 6.4 GB dataset.)
Scalability Evaluation II

Sub-linear speedup when scaling up the number of compute nodes is explained by the sequential scheduling of concurrent data reads.

(Figures: Kmeans Clustering and EM Clustering, each with a 25.6 GB dataset.)
Deployment Overhead Evaluation

There is clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment: 18-20% for Kmeans Clustering with a 6.4 GB dataset.
Deployment Overhead Evaluation II

For Vortex Detection with a 14.8 GB dataset, the overhead is 17-20%. The overhead:
• Scales with the total execution time
• Does not affect the overall scalability of the data processing service
Outline

• Motivation
• Introduction
• Middleware Overview
• Challenges
• Experimental Evaluation
• Related Work
• Conclusion
Grid Computing

Standards:
• Open Grid Services Architecture (OGSA)
• Web Services Resource Framework (WSRF)

Implementations:
• Globus Toolkit, WSRF.NET, WSRF::Lite

Other grid systems:
• Condor and Legion
• Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST
Data Grids

Metadata cataloging:
• Artemis project

Remote data retrieval:
• SRB, SEMPLAR
• DataCutter, STORM

Stream processing middleware:
• GATES
• dQUOB

Data integration:
• Automatic wrapper generation
Remote Data in Grid

KnowledgeGrid tool-set:
• Developing data mining workflows (composition)

GridMiner toolkit:
• Composition of distributed, parallel workflows

DiscoveryNet layer:
• Creation, deployment, and management of services

DataMiningGrid framework:
• Distributed knowledge discovery

These projects partially overlap in goals with ours.
Conclusions

FREERIDE-G, middleware for the scalable processing of remote data, is now available as a grid service:
• Compliance with grid and repository standards
• Support for high-end processing
• Eases the use of parallel configurations
• Hides the details of data movement and caching
• Scalable performance with respect to data and compute nodes
• Low deployment overheads compared to the non-grid version
SRB Data Host

SRB Master:
• Connection establishment
• Authentication
• Forks an agent to service I/O

SRB Agent:
• Performs remote I/O
• Services multiple client API requests

MCAT:
• Catalogs data associated with datasets, users, and resources
• Services metadata queries (through the SRB agent)
Compute Node

There are more compute nodes than data hosts. Each node:
• Registers I/O (from the index)
• Connects to the data host

While there are chunks to process:
• Dispatch I/O request(s)
• Poll pending I/O
• Process retrieved chunks

(A sketch of this loop follows below.)
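The sketch below shows how this loop overlaps asynchronous chunk retrieval with processing. Every type and method name here (Chunk, DataHostConnection, dispatch_request, poll_completed) is an assumption standing in for the actual FREERIDE-G calls; ReductionObject is the hypothetical interface from the earlier API sketch.

    #include <cstddef>

    struct Chunk {                       // one retrieved data chunk
        std::size_t n, dim;
        const double* instances;         // n x dim attribute values
    };

    class DataHostConnection {           // stands in for the SRB-backed host link
    public:
        virtual ~DataHostConnection() {}
        virtual bool can_dispatch() const = 0;
        virtual void dispatch_request(int chunk_id) = 0;  // async read request
        virtual bool poll_completed(Chunk* out) = 0;      // non-blocking poll
    };

    void compute_node_loop(DataHostConnection& host, int num_chunks,
                           ReductionObject& R) {
        int next = 0, done = 0;
        while (done < num_chunks) {
            while (next < num_chunks && host.can_dispatch())
                host.dispatch_request(next++);            // dispatch I/O request(s)
            Chunk c;
            while (host.poll_completed(&c)) {             // poll pending I/O
                for (std::size_t i = 0; i < c.n; ++i)     // process retrieved chunk
                    R.local_reduce(c.instances + i * c.dim, c.dim);
                ++done;
            }
        }
    }

Keeping several requests in flight lets data retrieval proceed while earlier chunks are still being reduced, which is what hides the remote-I/O latency.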
FREERIDE-G in Action

(Sequence diagram.) The compute node registers its I/O and establishes a connection with the data host, where the SRB Master authenticates it and forks an SRB Agent, and MCAT services metadata queries. Then, while more chunks remain to process, I/O requests are dispatched, pending I/O is polled, and the retrieved data chunks are analyzed on the compute node.
Conversion Challenges

• Interfacing the C++ middleware with GT 4.0 (Java):
  • SWIG used to generate a Java Native Interface (JNI) wrapper
• Integrating the data mining service through a WSDL interface:
  • Encapsulate the service as a resource
• Middleware deployment using GT 4.0:
  • Use WS-GRAM (Web Services Grid Resource Allocation and Management) for deployment