Explore the merging of supercomputer centers and digital libraries, comparing execution environments, data management, and data distribution strategies to optimize computing processes. Analyze the co-evolution of technology in the context of data-intensive computing.
Data Intensive Computing / Information Based Computing: Digital Libraries and Metacomputing Services
Reagan W. Moore, San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE
Information Based Computing
[Diagram: an application draws on data mining, distributed archives, collection building, and information discovery through a digital library]
Co-evolution of Technology • Supercomputer Centers and Digital Libraries • Both support large scale processing & storage of data • Will the supercomputer centers of the future be digital libraries?
Researchers
Chaitanya Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano, Yannis Papakonstantinou, Arcot Rajasekar, Wayne Schroeder, Michael Wan
Outline • Two views of computing • Execution environment - metacomputing systems • Data management environment - digital library • Analysis for moving data to the process or the process to the data • Data Management Environment • Information Based Computing
Object Based Information Model
[Diagram: an execution environment (metacomputing environment, parallel I/O - MPI, data management for execution) and a publication/services environment (digital libraries, data management for publication, presentation interfaces - Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50) both draw on data resources; constructors turn data sets into objects]
Choice between Environments • Should we provide services for manipulating information? (Move the process to the data) • Should we provide execution environments? (Move the data to the process)
Data Distribution Comparison
Reduce the size of the data from S bytes to s bytes, then analyze.
• Execution rates: r on the data handling platform, R on the supercomputer (r < R)
• Bandwidths linking the systems: B and b
• Operations per bit for analysis: O
• Operations per bit for data transfer: o
Should the data reduction be done before transmission?
Distributing Services
Compare the times for analyzing the data with a size reduction from S to s.
Reduce at the data handling platform, then transmit:
• Read data (data handling platform): S/B
• Reduce data (data handling platform): OS/r
• Transmit data (data handling platform): os/r
• Network transfer: s/b
• Receive data (supercomputer): os/R
Transmit all of the data, then reduce at the supercomputer:
• Read data (data handling platform): S/B
• Transmit data (data handling platform): oS/r
• Network transfer: S/b
• Receive data (supercomputer): oS/R
• Reduce data (supercomputer): OS/R
Comparison of Time
• Processing at the supercomputer (transmit all of the data, then reduce):
T(Super) = S/B + oS/r + S/b + oS/R + OS/R
• Processing at the archive (reduce before transmission):
T(Archive) = S/B + OS/r + os/r + s/b + os/R
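The comparison can be checked numerically. A minimal sketch of the cost model, with T(Super) moving all of the data and T(Archive) reducing it first (the labeling used on the optimization slides); all parameter values are illustrative assumptions, not measurements:

```python
# Sketch of the deck's cost model. All values below are assumed.
def t_super(S, s, O, o, r, R, B, b):
    """Read S, transmit S, then analyze on the supercomputer."""
    return S/B + o*S/r + S/b + o*S/R + O*S/R

def t_archive(S, s, O, o, r, R, B, b):
    """Read S, reduce to s on the data handling platform, transmit s."""
    return S/B + O*S/r + o*s/r + s/b + o*s/R

# Assumed system: 10 GB reduced to 100 MB, archive 10x slower than the
# supercomputer, 10 MB/s network, 100 MB/s storage reads.
params = dict(S=1e10, s=1e8, O=100.0, o=1.0,
              r=1e8, R=1e9, B=1e8, b=1e7)
better = "supercomputer" if t_super(**params) < t_archive(**params) else "archive"
```

With this (assumed) high analysis complexity O = 100, moving all of the data wins, as the complexity analysis below predicts.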
Optimization Parameter Selection
We have an algebraic inequality with eight independent variables:
T(Super) < T(Archive)
S/B + oS/r + S/b + oS/R + OS/R < S/B + OS/r + os/r + s/b + os/R
Which variable provides the simplest optimization criterion?
Scaling Parameters
• Data size reduction ratio: s/S
• Execution slow-down ratio: r/R
• Problem complexity ratio: o/O
• Communication/execution balance: r/(ob)
Note that r/o is the number of bits per second that can be processed. When r/(ob) = 1, the data processing rate is the same as the data transmission rate. Optimal designs have r/(ob) = 1.
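These four ratios are cheap to evaluate for a candidate configuration. A minimal sketch with assumed (hypothetical) system numbers:

```python
# The slide's four scaling parameters for an assumed system.
S, s = 1e10, 1e8          # bytes before / after reduction (assumed)
r, R = 1e8, 1e9           # archive / supercomputer execution rates (assumed)
o, O = 1.0, 100.0         # ops per bit: transfer / analysis (assumed)
b = 1e8                   # network bandwidth, bits/s (assumed)

reduction  = s / S        # data size reduction ratio
slowdown   = r / R        # execution slow-down ratio
complexity = o / O        # problem complexity ratio
balance    = r / (o * b)  # communication/execution balance
# balance == 1 means the platform processes data exactly at line rate,
# the "optimal design" point named on the slide
```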
Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive), when the analysis is sufficiently complex:
O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R)
Note that as the execution ratio r/R approaches 1, the required complexity becomes infinite. Also, as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero.
Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the network is sufficiently fast:
b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)]
Note that the denominator changes sign when O < o (1 + r/R) (1 - s/S) / (1 - r/R). Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the supercomputer is sufficiently fast:
R > r [1 + (o/O) (1 - s/S)] / [1 - (o/O) (1 - s/S) (1 + r/(ob))]
Note that the denominator changes sign when O < o (1 - s/S) [1 + r/(ob)]. Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:
s > S {1 - (O/o) (1 - r/R) / [1 + r/R + r/(ob)]}
Note that the criterion changes sign when O > o [1 + r/R + r/(ob)] / (1 - r/R). When the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to one bit.
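Under assumed parameter values, the complexity threshold can be checked against a direct comparison of the two times; this sketch restates the deck's cost formulas (T(Super) moves all of the data, T(Archive) reduces it first):

```python
# Numerical check, for one assumed configuration, that the complexity
# threshold agrees with directly comparing the two times.
def t_super(S, s, O, o, r, R, B, b):
    return S/B + o*S/r + S/b + o*S/R + O*S/R   # move all of the data

def t_archive(S, s, O, o, r, R, B, b):
    return S/B + O*S/r + o*s/r + s/b + o*s/R   # reduce first

def o_threshold(S, s, o, r, R, b, **_):
    # O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R)
    return o * (1 - s/S) * (1 + r/R + r/(o*b)) / (1 - r/R)

p = dict(S=1e10, s=1e9, o=1.0, r=1e8, R=1e9, B=1e8, b=1e7)  # assumed
Ocrit = o_threshold(**p)
for O in (0.5 * Ocrit, 2.0 * Ocrit):
    moving_all_data_is_faster = t_super(O=O, **p) < t_archive(O=O, **p)
    # Moving all of the data wins exactly when O exceeds the threshold
    assert moving_all_data_is_faster == (O > Ocrit)
```

At O exactly equal to the threshold the two times coincide, which is a useful sanity check on the algebra.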
Is the Future Environment a Metacomputer or a Digital Library? • Sufficiently high complexity • Move data to processing engine • Digital Library execution of remote services • Traditional supercomputer processing of applications • Sufficiently low complexity • Move process to the data source • Metacomputing execution of remote applications • Traditional digital library service
The IBM Digital Library Architecture
[Diagram: an application (DL client) connects to a Library Server (cf. MCAT) providing "federated" search over metadata in DB2 or Oracle with text and image indices, and to an Object Server (cf. SRB) over distributed storage resources - Videocharger, DB2, ADSM, Oracle]
Generalization of Digital Library • Scaling transparency • Support for arbitrary size data sets • Support for arbitrary data type • Location transparency • Access to remote data • Access to heterogeneous (non-uniform) storage systems • Remove restriction of local disk space size • Name service transparency • Support for multiple views (naming conventions) for data • Presentation transparency • Support for alternate representations of data
High Performance Storage • Provide access to tertiary storage - scale size of repository • Disk caches • Tape robots • Manage migration of data between disk and tape • High Performance Storage System - IBM • Provides service classes • Support for parallel I/O • Support for terabyte sized data sets • Provide recoverable name space
State-of-the-art Storage: HPSS • Store Teraflops computer output • Growth - 200 TB data per year • Data access rate - 7 TB/day = 80 MB/sec • 2-week data cache - 10 TB • Scalable control platform • 8-node SP (32 processors) • Support digital libraries • Support for millions of data sets • Integration with database meta-data catalogs
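As a quick check of the access-rate arithmetic on the slide, 7 TB/day works out to roughly 80 MB/sec (decimal units assumed):

```python
# Arithmetic check of the slide's figure, assuming 1 TB = 10**12 bytes.
bytes_per_day = 7 * 10**12
mb_per_sec = bytes_per_day / 86_400 / 10**6
# about 81 MB/s, consistent with the slide's quoted 80 MB/sec
```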
HPSS Archival Storage System
[Diagram: HPSS configuration - seven Silver nodes acting as tape/disk movers (DCE / FTP / HIS, log clients) with 54-108 GB SSA RAID each; a Silver node running the storage/purge, bitfile/migration, name service/PVL, and log daemon services with 160 GB SSA RAID and an 830 GB MaxStrat RAID; High and Wide nodes as disk movers with HiPPI drivers; a high-performance gateway node; TrailBlazer3 and HiPPI switches; 9490 robots with four, seven, and eight tape drives; 3490 and Magstar 3590 tape drives]
HPSS Bandwidths
• SDSC has achieved: [table of measured rates not recoverable]
• Striping is required to achieve the desired I/O rates
Turning Archives into Digital Libraries • Meta-data based access to data sets • Support for application of methods (procedures) to data sets • Support for information discovery • Support for publication of data sets • Research issue - optimization of data distribution between database and archive
DB2/HPSS Integration
• Collaboration with IBM TJ Watson Research Center (Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala)
• Features:
• Prototype; works with DB2 UDB (Version 5)
• DB2 is able to use an HPSS file as a tablespace container
• DB2 handles DCE authentication to HPSS
• Regular as well as long (LOB) data can be stored in HPSS
• Optional disk buffer between DB2 and HPSS
[Diagram: a database table (columns C1-C5) in DB2, with an optional disk buffer between DB2 and the HPSS disk cache]
Generalizing Digital Libraries • SRB - Location transparency • Access to heterogeneous systems • Access to remote systems • MCAT - Name service transparency • Extensible Schema support • MIX - Presentation transparency • Mediation of information with XML • Support for semi-structured data • Access scaling • MPI-I/O access to data sets using parallel I/O
SRB Software Architecture
[Diagram: an application (SRB client) calls the SRB APIs; the SRB consults the MCAT metadata catalog (user authentication, dataset location, access control, type, replication, logging) and accesses storage through drivers for UniTree, HPSS, DB2, Illustra, and Unix file systems]
14 Installed SRB Sites
[Map of sites, including Montana State University, NCSA, Rutgers, and large archives]
SRB / MCAT Features
SRB:
• Support for collections - hierarchy allows grouping of heterogeneous data sets into a single logical collection
• Hierarchical access control, with ticket mechanism
• Replication - optional replication at the time of creation; can choose replica on read
• Proxy operations - supports proxy (remote) move and copy operations
• Monitoring capability
MCAT:
• Supports storing/querying of system- and user-defined "metadata" for data sets and resources
• API for ad hoc querying of metadata
• Ability to extend schemas and define new schemas
• Ability to associate data sets with multiple metadata schemas
• Ability to relate attributes across schemas
• Implemented in Oracle and DB2
MCAT Schema Integration • Publish schema for each collection • Clusters of attributes form a table • Tables implement the schema • Use Tokens to define semantic meaning • Associate Token with each attribute • Use DAG to automate queries • Specify directed linkage between clusters of attributes • Tokens - Clusters - Attributes
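A minimal sketch of the Token idea above: attributes in different published schemas are tagged with shared semantic Tokens, so a query phrased in Tokens can be rewritten into each collection's own attribute names. All schema and attribute names here are invented for illustration:

```python
# Hypothetical Token-to-attribute mapping; names are invented.
TOKEN_MAP = {
    "collection_A": {"author": "creator_name", "title": "doc_title"},
    "collection_B": {"author": "dc_creator", "title": "dc_title"},
}

def resolve(token: str, schema: str) -> str:
    """Map a semantic Token to the attribute name one schema uses."""
    return TOKEN_MAP[schema][token]

# The same Token-level predicate routes to different attributes:
author_attrs = [resolve("author", s) for s in TOKEN_MAP]
```

The directed linkage between attribute clusters would then let a query planner join across collections automatically, as the slide's Tokens - Clusters - Attributes chain suggests.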
Publishing A New Schema
Adding Attributes to the New Schema
Displaying Attributes From Selected Schemas
Security • Integration of SDSC Encryption Authentication system (SEA) with Globus GSI • Kerberos within security domain • Globus for inter-realm authentication • Access control lists per data set • Audit trails of usage • Need support for third-party authentication • User A accesses data under the control of digital library B when the data is stored at site C
MIX: Mediation of Information using XML
[Diagram: BBQ interfaces present active views; XMAS queries flow from the interface to the mediator (which supports "active" views) and on to wrappers as query fragments, with XML data flowing back; each wrapper converts an XMAS query to the local query language and converts data in native format to XML, for sources such as an SQL database, a spreadsheet, HTML files, and a local data repository]
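A toy sketch of the wrapper role described above: accept a query from the mediator, run it against the source in its native form, and return the results as XML. The class, query shape, and data are invented; a real MIX wrapper translates XMAS queries instead:

```python
# Hypothetical wrapper over a non-XML source (e.g. a spreadsheet).
import xml.etree.ElementTree as ET

class TableWrapper:
    """Stand-in wrapper: exposes rows of an in-memory table as XML."""
    def __init__(self, rows):
        self.rows = rows  # list of dicts plays the native data format

    def query(self, field, value):
        # "Translate" the mediator's predicate into a native scan,
        # then convert the matching rows into an XML fragment.
        root = ET.Element("result")
        for row in self.rows:
            if row.get(field) == value:
                item = ET.SubElement(root, "item")
                for k, v in row.items():
                    ET.SubElement(item, k).text = str(v)
        return ET.tostring(root, encoding="unicode")
```

A mediator would fan a query out to several such wrappers and merge the returned XML fragments into a single view.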
Integration of Digital Librarywith Metacomputing Systems • NTON OC-192 network (LLNL - Caltech - SDSC) • HPSS archive • Globus metacomputing system • SRB data handling system • MCAT extensible metadata • MIX semi-structured data mediation using XML • ICE collaboration environment • Feature extraction
Data Intensive and High-Performance Distributed Computing
[Diagram: layered architecture]
• Application toolkits: communication libraries, visualization, grid-enabled libraries
• Domain-specific services layer: resource discovery, resource brokering, scheduling
• Generic services layer (information services): interdomain security, fault detection, end-to-end QoS, resource management, remote data access
• Resources layer: data repositories, network caching, metadata, local resource management
Research Activities • Support for remote execution of data manipulation procedures • Globus - SRB integration • Automated feature extraction • XML based tagging of features • XML query language for storing attributes into the Intelligent Archive • Integration with RIO - parallel I/O transport
Views of Software Infrastructure • Software infrastructure supports user applications • Reason for existence of software is to provide explicit capabilities required by applications • What is the user perspective for building new software systems? • Is the integration of digital library and metacomputing systems the final version?
Software Integration Projects • NSF • Computational Grid - Middleware using distributed state information to support metacomputing services • DOE • Data Visualization Corridor - collaboratively visualize multi-terabyte sized data sets • NASA • Information Power Grid - integrate data repositories with applications and visualization systems • DARPA • Quorum - provide quality of service guarantees
User Requirements - Five Software Environments
• Code development - resource support
• Run-time - parallel tools and libraries
• Distributed run-time - metacomputing environment
• Interaction environments - collaboration, presentation
• Publication / discovery / retrieval - data intensive computing environment
Metacomputing Environment - Data Flow Perspective
Application → Object-Oriented Interface → Distributed Execution Environment → Data Caching System → Data Staging System → Data Handling System → Remote Data Manipulation → Archival Storage System
Publication Environment - Data Flow Perspective
Application → Run-time Access → Data Set Constructor → Digital Library Services → Collection Management Software → Data Handling System → Remote Data Manipulation → Archival Storage System
Run-time Environment - Data Flow Perspective
Application → Parallel I/O Library → Memory Tiling → Data Structures Library → Library Interoperation → Data Caching System → Data Handling System → Archival Storage System
Interaction Environment - Data Flow Perspective
Application → Collaboration Environment → Visualization Environment → Rendering System → Data Formatting System → Data Caching System → Data Manipulation System → Archival Storage System