Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services

Reagan W. Moore, San Diego Supercomputer Center, moore@sdsc.edu, http://www.npaci.edu/DICE

Information Based Computing • Data Mining • Distributed Archives • Application • Collection Building • Information Discovery • Digital Library

Co-evolution of Technology • Supercomputer Centers and Digital Libraries • Both support large scale processing & storage of data • Will the supercomputer centers of the future be digital libraries?

Researchers: Chaitanya Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano, Yannis Papakonstantinou, Arcot Rajasekar, Wayne Schroeder, Michael Wan

Outline • Two views of computing • Execution environment - metacomputing systems • Data Management environment - digital library • Analysis for moving data to the process or the process to the data • Data Management Environment • Information Based Computing

Two Views of Computing [Figure] • Publication / Services Environment (Digital Libraries): Presentation Interface (Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50); Object Based Information Model; Constructors: turning data sets into objects; Data Management for publication; Data Resources • Execution Environment (Metacomputing Environment): Parallel I/O - MPI; Data Management for execution; Data Resources

Choice between Environments • Should we provide services for manipulating information? • Move the process to the data • Should we provide execution environments? • Move the data to the process

Data Distribution Comparison
Reduce size of data from S bytes to s bytes and analyze.
Data path: Data -> (bandwidth B) -> Data Handling Platform -> (bandwidth b) -> Supercomputer
• Execution rate is r on the data handling platform and R on the supercomputer, with r < R
• Bandwidths linking systems are B & b
• Operations per bit for analysis is O
• Operations per bit for data transfer is o
Should the data reduction be done before transmission?

Distributing Services
Compare times for analyzing data with size reduction from S to s.
Reduce on the data handling platform, then transmit:
• Read Data: S/B • Reduce Data: OS/r • Transmit Data: os/r • Network: s/b • Receive Data: os/R
Transmit, then reduce on the supercomputer:
• Read Data: S/B • Transmit Data: oS/r • Network: S/b • Receive Data: oS/R • Reduce Data: OS/R

Comparison of Time
Processing at archive (reduce at rate r, then transmit s): T(Archive) = S/B + OS/r + os/r + s/b + os/R
Processing at supercomputer (transmit S, then reduce at rate R): T(Super) = S/B + oS/r + S/b + oS/R + OS/R

Optimization Parameter Selection
Have an algebraic inequality with eight independent variables (S, s, B, b, r, R, O, o). Moving all of the data to the supercomputer is faster when T(Super) < T(Archive):
S/B + oS/r + S/b + oS/R + OS/R < S/B + OS/r + os/r + s/b + os/R
Which variable provides the simplest optimization criterion?
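The two pipelines compared above can be evaluated numerically. The following is a minimal sketch, assuming S, s, B, and b are expressed in bits and bits/sec so that they match the per-bit operation counts O and o. The function names and the example values are illustrative and do not come from the slides.

```python
def time_reduce_at_archive(S, s, B, b, r, R, O, o):
    """Read S, reduce to s on the data handling platform (rate r), ship s."""
    return S / B + O * S / r + o * s / r + s / b + o * s / R


def time_reduce_at_supercomputer(S, s, B, b, r, R, O, o):
    """Read S, ship all of S, then reduce on the supercomputer (rate R)."""
    return S / B + o * S / r + S / b + o * S / R + O * S / R


# Illustrative values only (not from the slides): 1 TB reduced 1000-fold.
example = dict(S=8e12, s=8e9, B=8e8, b=8e8, r=1e9, R=1e11, O=100, o=1)
faster = min(
    ("archive", time_reduce_at_archive(**example)),
    ("supercomputer", time_reduce_at_supercomputer(**example)),
    key=lambda pair: pair[1],
)
print(faster[0])
```

With these values the analysis is complex enough (O = 100 operations per bit) that shipping everything to the supercomputer wins. Lowering O to 0.5 reverses the outcome.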

Scaling Parameters
• Data size reduction ratio: s/S
• Execution slow down ratio: r/R
• Problem complexity: o/O
• Communication/Execution balance: r/(ob)
Note (r/o) is the number of bits/sec that can be processed. When r/(ob) = 1, the data processing rate is the same as the data transmission rate.
Optimal designs have r/(ob) = 1
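The step from the eight raw variables to these four ratios can be written out. This is a sketch of the algebra, assuming the comparison is "transmit all S and reduce on the supercomputer" (left side) against "reduce on the data handling platform, then transmit s" (right side). The common S/B read term cancels, and multiplying by r/(oS) leaves only the scaling ratios.

```latex
\begin{align*}
\frac{oS}{r} + \frac{S}{b} + \frac{oS}{R} + \frac{OS}{R}
  &< \frac{OS}{r} + \frac{os}{r} + \frac{s}{b} + \frac{os}{R} \\
(S - s)\left[\frac{o}{r} + \frac{1}{b} + \frac{o}{R}\right]
  &< OS\left(\frac{1}{r} - \frac{1}{R}\right) \\
\left(1 - \frac{s}{S}\right)\left[1 + \frac{r}{R} + \frac{r}{ob}\right]
  &< \frac{O}{o}\left(1 - \frac{r}{R}\right)
\end{align*}
```

Each later criterion is this last inequality solved for one quantity: O, b, R, or s.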

Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive), for sufficiently complex analysis:
O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R)
Note: as the execution ratio r/R approaches 1, the required complexity becomes infinite. Also, as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero.
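The complexity criterion is straightforward to evaluate. A minimal sketch follows; the function name and argument names are hypothetical, and the arguments are the dimensionless ratios from the scaling-parameters slide.

```python
def min_complexity(o, s_over_S, r_over_R, r_over_ob):
    """Smallest O (operations per bit) above which moving all data wins."""
    if not 0 <= r_over_R < 1:
        raise ValueError("requires 0 <= r/R < 1")
    return o * (1 - s_over_S) * (1 + r_over_R + r_over_ob) / (1 - r_over_R)


# Illustrative values: 1000-fold reduction, supercomputer 100x faster,
# processing rate 1.25x the transmission rate.
print(min_complexity(o=1, s_over_S=0.001, r_over_R=0.01, r_over_ob=1.25))
```

For these illustrative ratios the threshold is a little over two operations per bit, so even modest analyses would favor the supercomputer.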

Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently fast network:
b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)]
Note: the denominator changes sign when O < o (1 + r/R) (1 - s/S) / (1 - r/R). Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
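A sketch of the bandwidth criterion, with the sign change handled explicitly. The function name is hypothetical. Returning infinity when the denominator is not positive is a modelling choice meaning "no network is fast enough".

```python
import math


def min_bandwidth(r, O, o, s_over_S, r_over_R):
    """Smallest b above which moving all data wins, or inf if none exists."""
    x = 1 - s_over_S
    denom = 1 - r_over_R - (o / O) * (1 + r_over_R) * x
    if denom <= 0:
        # Complexity too small: processing at the archive always wins.
        return math.inf
    return (r / O) * x / denom


# Illustrative values only.
print(min_bandwidth(r=1e9, O=100, o=1, s_over_S=0.001, r_over_R=0.01))
print(min_bandwidth(r=1e9, O=0.5, o=1, s_over_S=0.001, r_over_R=0.01))
```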

Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently fast supercomputer:
R > r [1 + (o/O) (1 - s/S)] / [1 - (o/O) (1 - s/S) (1 + r/(ob))]
Note: the denominator changes sign when O < o (1 - s/S) [1 + r/(ob)]. Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
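The execution-rate criterion can be sketched the same way. The function name is hypothetical, and infinity is used here to signal that no supercomputer speed satisfies the inequality.

```python
import math


def min_supercomputer_rate(r, O, o, s_over_S, r_over_ob):
    """Smallest R above which moving all data wins, or inf if none exists."""
    k = (o / O) * (1 - s_over_S)
    denom = 1 - k * (1 + r_over_ob)
    if denom <= 0:
        # Complexity too small: even an infinitely fast supercomputer loses.
        return math.inf
    return r * (1 + k) / denom


# Illustrative values only.
print(min_supercomputer_rate(r=1e9, O=100, o=1, s_over_S=0.001, r_over_ob=1.25))
```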

Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:
s > S {1 - (O/o) (1 - r/R) / [1 + r/R + r/(ob)]}
Note: the criterion changes sign when O > o [1 + r/R + r/(ob)] / (1 - r/R). When the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to one bit.
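A sketch of the data-reduction criterion expressed as a fraction s/S. The function name is hypothetical. Clamping at zero encodes the note on the slide: once the bound goes negative, any reduction ratio satisfies it.

```python
def min_reduced_fraction(O, o, r_over_R, r_over_ob):
    """Smallest s/S above which moving all data wins (0.0 means always)."""
    bound = 1 - (O / o) * (1 - r_over_R) / (1 + r_over_R + r_over_ob)
    return max(0.0, bound)


# Illustrative values only: a cheap analysis and a complex one.
print(min_reduced_fraction(O=0.5, o=1, r_over_R=0.01, r_over_ob=1.25))
print(min_reduced_fraction(O=100, o=1, r_over_R=0.01, r_over_ob=1.25))
```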

Is the Future Environment a Metacomputer or a Digital Library? • Sufficiently high complexity • Move data to processing engine • Digital Library execution of remote services • Traditional supercomputer processing of applications • Sufficiently low complexity • Move process to the data source • Metacomputing execution of remote applications • Traditional digital library service

The IBM Digital Library Architecture • Application (DL client) • Object Server (SRB) • Library Server (MCAT) • “Federated” search • Videocharger, DB2, ADSM, Oracle • Metadata in DB2 or Oracle • Text and Image indices • Distributed storage resources

Generalization of Digital Library • Scaling transparency • Support for arbitrary size data sets • Support for arbitrary data type • Location transparency • Access to remote data • Access to heterogeneous (non-uniform) storage systems • Remove restriction of local disk space size • Name service transparency • Support for multiple views (naming conventions) for data • Presentation transparency • Support for alternate representations of data

Describing Information Content

State-of-the-art Information Management: Digital Library

High Performance Storage • Provide access to tertiary storage - scale size of repository • Disk caches • Tape robots • Manage migration of data between disk and tape • High Performance Storage System - IBM • Provides service classes • Support for parallel I/O • Support for terabyte sized data sets • Provide recoverable name space

State-of-the-art Storage: HPSS • Store Teraflops computer output • Growth - 200 TB data per year • Data access rate - 7 TB/day = 80 MB/sec • 2-week data cache - 10 TB • Scalable control platform • 8-node SP (32 processors) • Support digital libraries • Support for millions of data sets • Integration with database meta-data catalogs
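The figures on this slide can be cross-checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and assumes the cache is being compared against the stated yearly growth. Neither assumption is stated on the slide.

```python
TB = 1e12  # decimal terabyte (assumption)
MB = 1e6   # decimal megabyte (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def access_rate_mb_per_s(tb_per_day):
    """Convert a daily access volume in TB to a sustained rate in MB/sec."""
    return tb_per_day * TB / SECONDS_PER_DAY / MB


def cache_days(cache_tb, growth_tb_per_year):
    """Days of new data that fit in the cache at the given growth rate."""
    return cache_tb / (growth_tb_per_year / 365)


print(round(access_rate_mb_per_s(7), 1))
print(round(cache_days(10, 200), 2))
```

Under these assumptions 7 TB/day is about 81 MB/sec, in line with the quoted 80 MB/sec, and a 10 TB cache holds roughly 18 days of growth at 200 TB per year, which covers the two-week window.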

HPSS Archival Storage System [Figure: system diagram] • Silver Nodes running tape / disk movers (DCE / FTP / HIS, Log Client) with SSA RAID • Silver Node running Storage / Purge, Bitfile / Migration, Nameservice / PVL, Log Daemon with SSA RAID • High Node and Wide Node disk movers with HiPPI drivers • High Performance Gateway Node • RS6000 Tape Mover, PVR (9490) • 9490 robots with eight, seven, and four tape drives • 3490 tape • Magstar 3590 tape • MaxStrat RAID • TrailBlazer3 Switch and HiPPI Switch • Disk capacities shown: 54 GB, 108 GB, 160 GB, 830 GB

HPSS Bandwidths • SDSC has achieved: • Striping required to achieve desired I/O rates

Turning Archives into Digital Libraries • Meta-data based access to data sets • Support for application of methods (procedures) to data sets • Support for information discovery • Support for publication of data sets • Research issue - optimization of data distribution between database and archive

DB2/HPSS Integration • Collaboration with IBM TJ Watson Research Center • Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala • Features: • Prototype, works with DB2 UDB (Version 5) • DB2 is able to use an HPSS file as a tablespace container • DB2 handles DCE authentication to HPSS • Regular as well as long (LOB) data can be stored in HPSS • Optional disk buffer between DB2 and HPSS • [Figure: database table with columns C1 to C5, stored through DB2 and the DB2 disk buffer into the HPSS disk cache and HPSS]

Generalizing Digital Libraries • SRB - Location transparency • Access to heterogeneous systems • Access to remote systems • MCAT - Name service transparency • Extensible Schema support • MIX - Presentation transparency • Mediation of information with XML • Support for semi-structured data • Access scaling • MPI-I/O access to data sets using parallel I/O

SRB Software Architecture • Application (SRB client) • SRB APIs • SRB • Metadata Catalog (MCAT): User Authentication, Dataset Location, Access Control, Type, Replication, Logging • Storage systems: UniTree, HPSS, DB2, Illustra, Unix
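The catalog roles listed in this architecture (dataset location, access control, replication, logging) can be illustrated with a toy model. This is a hypothetical sketch and not the SRB or MCAT API: the class names, fields, and the "first online replica" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # storage system holding this copy, e.g. "hpss"
    path: str           # physical location on that system
    online: bool = True


@dataclass
class Catalog:
    replicas: dict = field(default_factory=dict)  # logical name -> [Replica]
    acl: dict = field(default_factory=dict)       # logical name -> set of users
    log: list = field(default_factory=list)       # (user, name, outcome) entries

    def register(self, name, replica, readers):
        """Record a physical copy and the users allowed to read it."""
        self.replicas.setdefault(name, []).append(replica)
        self.acl.setdefault(name, set()).update(readers)

    def open(self, user, name):
        """Check access, pick the first online replica, and log the outcome."""
        if user not in self.acl.get(name, set()):
            self.log.append((user, name, "denied"))
            raise PermissionError(name)
        for rep in self.replicas.get(name, []):
            if rep.online:
                self.log.append((user, name, rep.resource))
                return rep
        self.log.append((user, name, "unavailable"))
        raise FileNotFoundError(name)
```

The caller only ever names the logical data set. Which storage system serves the request is decided by the catalog, which is the sense of location transparency used in these slides.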

14 Installed SRB Sites • Montana State University • NCSA • Rutgers • Large Archives

SRB / MCAT Features • Support for collection hierarchy • Allows grouping of heterogeneous data sets into a single logical collection • Hierarchical access control, with ticket mechanism • Replication • Optional replication at the time of creation • Can choose replica on read • Proxy operations • Supports proxy (remote) move and copy operations • Monitoring capability • Supports storing/querying of system- and user-defined “metadata” for data sets and resources • API for ad hoc querying of metadata • Ability to extend schemas and define new schemas • Ability to associate data sets with multiple metadata schemas • Ability to relate attributes across schemas • Implemented in Oracle and DB2

MCAT Schema Integration • Publish schema for each collection • Clusters of attributes form a table • Tables implement the schema • Use Tokens to define semantic meaning • Associate Token with each attribute • Use DAG to automate queries • Specify directed linkage between clusters of attributes • Tokens - Clusters - Attributes
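One way to read "use DAG to automate queries" is that directed links between attribute clusters let the system work out which tables to join. The sketch below illustrates that idea with a breadth-first search. The cluster names, link attributes, and function are hypothetical and are not taken from the MCAT schema.

```python
from collections import deque

# Hypothetical directed links between clusters (tables).
# Each entry maps a cluster to its children and the attribute that joins them.
LINKS = {
    "collection": {"dataset": "collection_id"},
    "dataset": {"replica": "dataset_id"},
    "replica": {},
}


def join_path(start, goal):
    """Return the chain of (parent, child, join attribute) links, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for child, attr in LINKS[node].items():
            if child not in seen:
                seen.add(child)
                queue.append((child, path + [(node, child, attr)]))
    return None


print(join_path("collection", "replica"))
```

Because the links are directed, a query that starts at "replica" finds no path back to "collection" in this toy graph. A real schema would need links in whichever directions its queries run.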

Publishing A New Schema

Adding Attributes to the New Schema

Displaying Attributes From Selected Schemas

Security • Integration of SDSC Encryption Authentication system (SEA) with Globus GSI • Kerberos within security domain • Globus for inter-realm authentication • Access control lists per data set • Audit trails of usage • Need support for third-party authentication • User A accesses data under the control of digital library B when the data is stored at site C

MIX: Mediation of Information using XML • BBQ Interfaces presenting Active View 1 and Active View 2 • Mediator: support for “active” views; receives XMAS queries and returns XML data • Local Data Repository • Wrappers: receive an XMAS query “fragment” and return XML data; convert XMAS query to local query language, and data in native format to XML • Sources: SQL Database, Spreadsheet, HTML files
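The wrapper and mediator roles can be sketched with the Python standard library. This is an illustration of the pattern only: it does not implement XMAS, and the simple field-equals-value filter stands in for a real query. All function names, element names, and sample records are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def wrap_rows(source_name, rows):
    """Wrapper: convert native records (dicts) into an XML fragment."""
    root = ET.Element("source", name=source_name)
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = str(value)
    return root


def wrap_csv(source_name, text):
    """Wrapper for spreadsheet-style data exported as CSV text."""
    return wrap_rows(source_name, csv.DictReader(io.StringIO(text)))


def mediate(fragments, field, value):
    """Mediator: merge wrapper output into one view, keeping matching records."""
    view = ET.Element("view")
    for frag in fragments:
        for rec in frag.findall("record"):
            if rec.findtext(field) == value:
                view.append(rec)
    return view


database = wrap_rows("sql", [{"site": "SDSC", "size": 10}, {"site": "NCSA", "size": 5}])
sheet = wrap_csv("sheet", "site,size\nSDSC,7\n")
view = mediate([database, sheet], "site", "SDSC")
print(ET.tostring(view, encoding="unicode"))
```

The mediator never sees the native formats. Each source is reachable only through a wrapper that speaks XML, which is what lets heterogeneous sources share one view.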

Integration of Digital Library with Metacomputing Systems • NTON OC-192 network (LLNL - Caltech - SDSC) • HPSS archive • Globus metacomputing system • SRB data handling system • MCAT extensible metadata • MIX semi-structured data mediation using XML • ICE collaboration environment • Feature extraction

Data Intensive and High-Performance Distributed Computing • Application Toolkits: Communication Libs., Visualization, Grid-enabled Libs • Domain Specific Services Layer: Resource Discovery, Resource Brokering, Scheduling • Generic Services Layer: Information Services, Interdomain Security, Fault Detection, End-to-End QoS, Resource Management, Remote Data Access • Resources Layer: Data Repositories, Network Caching, Metadata, Local Resource Management

Research Activities • Support for remote execution of data manipulation procedures • Globus - SRB integration • Automated feature extraction • XML based tagging of features • XML query language for storing attributes into the Intelligent Archive • Integration with RIO - parallel I/O transport

Views of Software Infrastructure • Software infrastructure supports user applications • Reason for existence of software is to provide explicit capabilities required by applications • What is the user perspective for building new software systems? • Is the integration of digital library and metacomputing systems the final version?

Software Integration Projects • NSF • Computational Grid - Middleware using distributed state information to support metacomputing services • DOE • Data Visualization Corridor - collaboratively visualize multi-terabyte sized data sets • NASA • Information Power Grid - integrate data repositories with applications and visualization systems • DARPA • Quorum - provide quality of service guarantees

User Requirements - Five Software Environments • Code Development • Resources support • Run-time • Parallel Tools and Libraries • Distributed Run-Time • Metacomputing environment • Interaction Environments • Collaboration, presentation • Publication / Discovery / Retrieval • Data intensive computing environment

Metacomputing Environment Data Flow Perspective • Application • Object Oriented Interface • Distributed Execution Environment • Data Caching System • Data Staging System • Data Handling System • Remote Data Manipulation • Archival Storage System

Publication Environment Data Flow Perspective • Application • Run-time Access • Data Set Constructor • Digital Library Services • Collection Management Software • Data Handling System • Remote Data Manipulation • Archival Storage System

Run-time Environment Data Flow Perspective • Application • Parallel I/O Library • Memory Tiling • Data Structures Library • Library Interoperation • Data Caching System • Data Handling System • Archival Storage System

Interaction Environment Data Flow Perspective • Application • Collaboration Environment • Visualization Environment • Rendering System • Data Formatting System • Data Caching System • Data Manipulation System • Archival Storage System

Taxonomy of User Requirements

Comparison of Environments
