Overview: Explore the potential and efficiency of high-level languages, modular coding frameworks, and distributed computing environments. Learn how memory management, data visualization, and parallel processing can be handed off to the infrastructure so that scientists get accurate results without worrying about the machine. Collaboration is the key to advancing computational capabilities.
CS Buzzwords / "The Grid and the Future of Computing"
Scott A. Klasky, sklasky@pppl.gov
Why?
• Why do you have to program in a language that doesn't let you program in equations?
• Why do you have to care about the machine you are programming on?
• Why do you have to care about which machine your code runs on?
• Why can't you visualize/analyze your data as soon as it is produced?
• Why do you run your codes at NERSC?
  • Silly question for those who use 100's/1000's of processors.
• Why don't the results of your analysis always get stored in a database?
• Why can't the computer do the data analysis for you, and have it ask you the questions?
• Why are people still talking about vector computers?
I just don't have TIME!!! COLLABORATION IS THE KEY!
Scott's view of computing (HYPE)
Why can't we program in high-level languages?
• RNPL (Rapid Numerical Prototyping Language): http://godel.ph.utexas.edu/Members/marsa/rnpl/users_guide/node4.html
• Mathematica / Maple
• Use object-oriented programming to manage memory, state, etc.
  • This is the framework for your code.
  • You write modules in this framework.
  • Use F90/F77/C, … as modules for the code.
  • These modules can be reused for multiple codes and multiple authors (see the sketch below).
• Compute fundamental variables on the main computers, other variables on secondary computers.
  • The Cactus code is a good example (2001 Gordon Bell Prize winner).
• What are the benefits?
  • Let the CS people worry about memory management, data I/O, visualization, security, machine locations, …
• Why should you care about the machine you are running on?
  • All you should care about is running your code and getting your "accurate" results as fast as possible.
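A rough sketch in plain C of the framework-plus-modules idea: the framework owns the state and the main loop, and the numerical kernels plug in as modules. All names here are hypothetical; this is not the API of RNPL, Cactus, or any other real framework.

    /* Hypothetical framework/module sketch; names are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    /* The framework ("flesh") owns the state and calls registered modules. */
    typedef struct {
        int     n;   /* number of grid points */
        double *u;   /* fundamental variable  */
    } State;

    typedef void (*ModuleFn)(State *s);

    #define MAX_MODULES 8
    static ModuleFn modules[MAX_MODULES];
    static int      nmodules = 0;

    static void register_module(ModuleFn fn) { modules[nmodules++] = fn; }

    /* A numerical "module": could equally be written in F77/F90 and wrapped. */
    static void smooth(State *s)
    {
        for (int i = 1; i < s->n - 1; i++)
            s->u[i] = 0.5 * s->u[i] + 0.25 * (s->u[i - 1] + s->u[i + 1]);
    }

    /* An analysis/output "module". */
    static void report(State *s)
    {
        printf("u[n/2] = %f\n", s->u[s->n / 2]);
    }

    int main(void)
    {
        State s = { 64, calloc(64, sizeof(double)) };
        s.u[32] = 1.0;                 /* initial data */

        register_module(smooth);       /* physics module    */
        register_module(report);       /* diagnostic module */

        for (int step = 0; step < 10; step++)   /* framework main loop */
            for (int m = 0; m < nmodules; m++)
                modules[m](&s);

        free(s.u);
        return 0;
    }

The physicist writes smooth(); the framework handles allocation, the driver loop, and (in a real system) parallelism, I/O, and visualization.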
Buzzwords…
• Fortran, HPF, C, C++, Java
• MPI, MPICH-G2, OpenMP
• Python, Perl, Tcl/Tk
• HTML, SGML, XML
• JavaScript, DHTML
• FLTK (Fast Light Toolkit)
• "The Grid", Globus
• Web Services
• Data Mining
• WireGL, Chromium
• Access Grid
• Portals (Discover Portal)
• CCA
• SOAP (Simple Object Access Protocol): a way to create widely distributed, complex computing environments that run over the Internet using existing infrastructure. It is about applications communicating directly with each other over the Internet in a very rich way.
• HTC (High Throughput Computing): deliver large amounts of processing capacity over long periods of time.
  • Condor (http://www.cs.wisc.edu/condor/): the goal is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources (see the example submit file below).
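To illustrate the HTC model, here is a minimal Condor submit description file. The executable and file names are hypothetical; only the submit-file keywords are standard Condor.

    # Illustrative Condor submit description file (names are hypothetical)
    universe   = vanilla
    executable = analyze_shot
    arguments  = shot_$(Process).dat
    output     = shot_$(Process).out
    error      = shot_$(Process).err
    log        = analysis.log
    queue 100

Running condor_submit on this file queues 100 independent jobs, which Condor then matches to idle machines in the pool as they become available.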
Cactus (http://www.cactuscode.org) (Allen, Dramlitsch, Seidel, Shalf, Radke)
• Modular, portable framework for parallel, multidimensional simulations.
• Construct codes by linking:
  • a small core (the "flesh"): management services;
  • selected modules ("thorns"): numerical methods, grids & domain decompositions, visualization and steering, etc.;
  • custom linking/configuration tools.
• Developed for astrophysics, but not astrophysics-specific.
• They have:
  • Cactus Worms
  • Remote monitoring and steering of an application from any web browser
  • Streaming of isosurfaces from a simulation, which can then be viewed on a local machine
  • Remote visualization of 2D slices of any grid function in a simulation, as JPEGs in a web browser
  • Accessible MPI-based parallelism for finite-difference grids
  • Access to a variety of supercomputing architectures and clusters
  • Several parallel I/O layers
  • Fixed and adaptive mesh refinement (under development)
  • Elliptic solvers
  • Parallel interpolators and reductions
  • Metacomputing and distributed computing
Discover Portal
• http://tassl-pc-5.rutgers.edu/discover/main.php
• Discover is a virtual, interactive, and collaborative PSE (problem-solving environment).
• It enables geographically distributed scientists and engineers to collaboratively monitor and control high-performance parallel/distributed applications using web-based portals.
• Its primary objective is to transform high-performance simulation into true research and instructional modalities…
• It brings large distributed simulations to the scientists'/engineers' desktop by providing collaborative web-based portals for interaction and control.
• It provides a 3-tier architecture composed of detachable thin clients at the front end, a network of web servers in the middle, and a control network of sensors, actuators, and interaction agents superimposed on the application at the back end.
MPICH-G2 (http://www.hpclab.niu.edu/mpi/)
• What is MPICH-G2?
  • A grid-enabled implementation of the MPI v1.1 standard.
  • Using Globus services (job startup, security), MPICH-G2 allows you to couple multiple machines, potentially of different architectures, into a single MPI application.
  • MPICH-G2 automatically converts data in messages sent between machines of different architectures, and supports multiprotocol communication by automatically selecting TCP for intermachine messaging and vendor-supplied MPI for intramachine messaging (see the example below).
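A minimal MPI-1.1 program in C (my own example, not taken from the MPICH-G2 documentation). The point is that code like this runs unchanged whether the ranks live on one cluster under plain MPICH or are spread across Globus-managed machines at several sites under MPICH-G2.

    /* Minimal MPI example; runs unchanged under MPICH or MPICH-G2. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);

        /* Each rank may sit on a different machine/architecture;
           MPICH-G2 picks TCP between machines, vendor MPI within one. */
        printf("rank %d of %d on %s\n", rank, size, name);

        /* A trivial reduction across all participating machines. */
        int one = 1, total = 0;
        MPI_Reduce(&one, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total ranks counted: %d\n", total);

        MPI_Finalize();
        return 0;
    }

Under MPICH-G2 such a job is typically launched through mpirun with a Globus RSL description of the machines participating at each site.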
Access Grid: supporting group-to-group interaction across the Grid (http://www.accessgrid.org)
[Diagram: components of an AG node: a display computer, a video capture computer (NTSC to digital video), an audio capture computer (analog audio, echo canceller, mixer, digital audio), and a control computer, all connected over the network.]
Over 70 AG sites (PPPL will be next!)
• Extending the Computational Grid
  • Group-to-group interactions are different from, and more complex than, individual-to-individual interactions.
  • Large-scale scientific and technical collaborations often involve multiple teams working together.
  • The Access Grid concept complements and extends the concept of the Computational Grid.
  • The Access Grid project aims at exploring and supporting this more complex set of requirements and functions.
  • An Access Grid node involves 3-20 people per site.
  • Access Grid nodes are "designed spaces" that support the high-end audio/video technology needed to provide a compelling and productive user experience.
  • The Access Grid consists of large-format multimedia display, presentation, and interaction software environments; interfaces to grid middleware; and interfaces to remote visualization environments.
  • With these resources, the Access Grid supports large-scale distributed meetings, collaborative teamwork sessions, seminars, lectures, tutorials, and training.
• Providing New Capabilities
  • The Alliance Access Grid project has prototyped a number of Access Grid nodes and uses them to conduct remote meetings, site visits, training sessions, and educational events.
  • Capabilities will include:
    • high-quality multichannel digital video and audio,
    • prototypic large-format displays,
    • integrated presentation technologies (PowerPoint slides, MPEG movies, shared OpenGL windows),
    • prototypic recording capabilities,
    • integration with Globus for basic services (directories, security, network resource management),
    • macro-screen management,
    • integration of local desktops into the Grid,
    • multiple session capability.
Access Grid
Chromium
• http://graphics.stanford.edu/~humper/chromium_documentation/
• Chromium is a new system for interactive rendering on clusters of workstations.
• It is a completely extensible architecture, so parallel rendering algorithms can be implemented on clusters with ease.
• We are still using WireGL, but will be switching to Chromium.
• Basically, it lets us run a program that uses OpenGL and have it display on a cluster-driven tiled display wall (see the example below).
• There are parallel APIs!
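A minimal OpenGL/GLUT program (my own illustration, not part of Chromium). Because WireGL/Chromium interpose on the OpenGL library itself, an unmodified program like this can, roughly speaking, be redirected to render across a tiled display wall just by pointing it at their replacement GL library.

    /* Minimal, unmodified OpenGL/GLUT program. */
    #include <GL/glut.h>

    static void display(void)
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glBegin(GL_TRIANGLES);
        glColor3f(1.0f, 0.0f, 0.0f); glVertex2f(-0.5f, -0.5f);
        glColor3f(0.0f, 1.0f, 0.0f); glVertex2f( 0.5f, -0.5f);
        glColor3f(0.0f, 0.0f, 1.0f); glVertex2f( 0.0f,  0.5f);
        glEnd();
        glutSwapBuffers();
    }

    int main(int argc, char **argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
        glutInitWindowSize(512, 512);
        glutCreateWindow("unmodified OpenGL application");
        glutDisplayFunc(display);
        glutMainLoop();
        return 0;
    }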
Common Component Architecture (http://www.acl.lanl.gov/cca/)
• Goal: provide interoperable components and frameworks for rapid construction of complex, high-performance applications (a rough sketch of the port idea follows below).
• CCA is needed because existing component standards (EJB, CORBA, COM) are not designed for large-scale, high-performance computing or parallel components.
• The CCA will leverage existing standards' infrastructure such as name services, event models, builders, security, and tools.
• Allows direct, low-overhead interaction among components in the same address space.
• Allows high-performance parallel components to attach: application/model coupling and more complex, multi-phase parallel algorithms.
• High-performance components being connected.
• Interoperability of components created by different groups.
• Multiple frameworks interacting (ESI & CUMULVS).
• Attaching parallel components together.
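A rough sketch in plain C of the provides/uses-port idea behind component architectures. This is NOT the CCA specification or API; every name in it is hypothetical.

    /* Hypothetical provides/uses-port sketch; NOT the real CCA API. */
    #include <stdio.h>

    /* A "port" is just an interface: a table of function pointers. */
    typedef struct {
        double (*integrate)(double (*f)(double), double a, double b, int n);
    } IntegratorPort;

    /* One component *provides* the port... */
    static double midpoint_rule(double (*f)(double), double a, double b, int n)
    {
        double h = (b - a) / n, sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += f(a + (i + 0.5) * h);
        return sum * h;
    }
    static IntegratorPort midpoint_provider = { midpoint_rule };

    /* ...and another component *uses* it without knowing the implementation. */
    static double square(double x) { return x * x; }

    static void driver_component(const IntegratorPort *integrator)
    {
        /* In a real framework, this port would be handed over at run time. */
        printf("integral of x^2 on [0,1] ~= %f\n",
               integrator->integrate(square, 0.0, 1.0, 1000));
    }

    int main(void)
    {
        /* Here the "framework" is just main() wiring ports to components. */
        driver_component(&midpoint_provider);
        return 0;
    }

Swapping in a different integrator component requires no change to the driver, which is the interoperability the CCA aims at, but at HPC scale and with parallel components.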
Requirements of Component Architectures for High-Performance Computing
• Component characteristics. The CCA will be used primarily for high-performance components of both coarse and fine grain, implemented according to different paradigms such as SPMD-style as well as shared-memory multi-threaded models.
• Heterogeneity. Whenever technically possible, the CCA should be able to combine within one multi-component application components executing on multiple architectures, implemented in different languages, and using different run-time systems. Furthermore, design priorities should be geared towards addressing the software needs most common in HPC environments; for example, interoperability with languages popular in scientific programming such as Fortran, C, and C++ should be given priority.
• Local and remote components. Whenever possible we would like to stage interoperability of both local and remote components and be able to seamlessly change interactions from local to remote. We will address the needs of remote components running over both local area networks and wide area networks; component applications running over the HPC grid should be able to satisfy real-time constraints and interact with diverse supercomputing schedulers.
• Integration. We will try to make the integration of components as smooth as possible. In general it should not be necessary to develop a component specially to integrate with the framework, or to rewrite an existing component substantially.
• High performance. It is essential that the set of standard features agreed on contain mechanisms for supporting high-performance interactions; whenever possible we should be able to avoid extra copies, extra communication, or synchronization, and encourage efficient implementations such as parallel data transfers.
• Openness. The CCA specification should be open, and used with open software. In HPC this flexibility is needed to keep pace with the ever-changing demands of the scientific programming world.
"The Grid" (http://www.globus.org)
• The Grid problem:
  • Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources (from "The Anatomy of the Grid: Enabling Scalable Virtual Organizations").
• Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, assuming the absence of:
  • a central location,
  • central control,
  • omniscience,
  • existing trust relationships.
Elements of the Problem
• Resource sharing
  • Computers, storage, sensors, networks, …
  • Sharing is always conditional: issues of trust, policy, negotiation, payment, …
• Coordinated problem solving
  • Beyond client-server: distributed data analysis, computation, collaboration, …
• Dynamic, multi-institutional virtual organizations
  • Community overlays on classic organizational structures
  • Large or small, static or dynamic
Why Grids?
• A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour.
• 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data.
• Civil engineers collaborate to design, execute, and analyze shake-table experiments.
• Climate scientists visualize, annotate, and analyze terabyte simulation datasets.
• An emergency response team couples real-time data, a weather model, and population data.
• A multidisciplinary analysis in aerospace couples code and data in four companies.
• A home user invokes architectural design functions at an application service provider.
• An application service provider purchases cycles from compute cycle providers.
• Scientists working for a multinational soap company design a new product.
• A community group pools members' PCs to analyze alternative designs for a local road.
Online Access to Scientific Instruments
[Diagram: Advanced Photon Source data flow: real-time collection, wide-area dissemination, desktop & VR clients with shared controls, tomographic reconstruction, and archival storage.]
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago
Data Grids for High Energy Physics (image courtesy Harvey Newman, Caltech)
[Diagram: the LHC tiered data grid. The detector produces ~PBytes/sec; there is a "bunch crossing" every 25 ns, ~100 "triggers" per second, and each triggered event is ~1 MByte. The online system streams ~100 MBytes/sec to the Tier 0 CERN Computer Centre and its offline processor farm (~20 TIPS). Links of ~622 Mbits/sec (or air freight, deprecated) feed Tier 1 regional centres (FermiLab ~4 TIPS; France, Germany, Italy), which connect at ~622 Mbits/sec to Tier 2 centres (~1 TIPS each, e.g. Caltech), all backed by HPSS archives. Institute servers (~0.25 TIPS) with physics data caches (~1 MByte/sec) serve Tier 4 physicist workstations (Pentium II 300 MHz class). 1 TIPS is approximately 25,000 SpecInt95. Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for those channels should be cached by the institute server.]
Broader Context
• "Grid computing" has much in common with major industrial thrusts:
  • business-to-business, peer-to-peer, application service providers, storage service providers, distributed computing, Internet computing, …
• Sharing issues are not adequately addressed by existing technologies:
  • Complicated requirements: "run program X at site Y subject to community policy P, providing access to data at Z according to policy Q."
  • High performance: unique demands of advanced and high-performance systems.
Why Now?
• Moore's Law improvements in computing produce highly functional end systems.
• The Internet and burgeoning wired and wireless networks provide universal connectivity.
• Changing modes of working and problem solving emphasize teamwork and computation.
• Network exponentials produce dramatic changes in geometry and geography.
Network Exponentials
• Network vs. computer performance:
  • Computer speed doubles every 18 months.
  • Network speed doubles every 9 months.
  • Difference = an order of magnitude per 5 years (checked below).
• 1986 to 2000:
  • Computers: x 500
  • Networks: x 340,000
• 2001 to 2010 (projected):
  • Computers: x 60
  • Networks: x 4,000
(Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American, Jan. 2001, by Cleo Vilett; source Vinod Khosla, Kleiner Perkins Caufield & Byers.)
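A quick check of the "order of magnitude per 5 years" figure from the stated doubling times (my own arithmetic, not from the slide): over 60 months,

    \[
      2^{60/18} \approx 10 \;\;\text{(computer speedup)}, \qquad
      2^{60/9} \approx 100 \;\;\text{(network speedup)},
    \]

so the gap between network and computer performance grows by roughly a factor of \(2^{60/9}/2^{60/18} = 2^{60/18} \approx 10\), i.e. one order of magnitude, every 5 years.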
The Globus Project™: Making Grid Computing a Reality
• Close collaboration with real Grid projects in science and industry.
• Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure.
• Development and promotion of standard Grid software APIs and SDKs to enable portability and code sharing.
• The Globus Toolkit™: open-source, reference software base for building Grid infrastructure and applications.
• Global Grid Forum: development of standard protocols and APIs for Grid computing.
One View of Requirements
• Identity & authentication
• Authorization & policy
• Resource discovery
• Resource characterization
• Resource allocation
• (Co-)reservation, workflow
• Distributed algorithms
• Remote data access
• High-speed data transfer
• Performance guarantees
• Monitoring
• Adaptation
• Intrusion detection
• Resource management
• Accounting & payment
• Fault management
• System evolution
• Etc., etc., …
"Three Obstacles to Making Grid Computing Routine"
• New approaches to problem solving
  • Data Grids, distributed computing, peer-to-peer, collaboration grids, …
• Structuring and writing programs (the programming problem)
  • Abstractions, tools
• Enabling resource sharing across distinct institutions (the systems problem)
  • Resource discovery, access, reservation, allocation; authentication, authorization, policy; communication; fault detection and notification; …
Programming & Systems Problems
• The programming problem
  • Facilitate development of sophisticated applications
  • Facilitate code sharing
  • Requires programming environments: APIs, SDKs, tools
• The systems problem
  • Facilitate coordinated use of diverse resources
  • Facilitate infrastructure sharing: e.g., certificate authorities, information services
  • Requires systems: protocols, services
    • E.g., port/service/protocol for accessing information, allocating resources
The Systems Problem: Resource Sharing Mechanisms That …
• Address the security and policy concerns of resource owners and users
• Are flexible enough to deal with many resource types and sharing modalities
• Scale to large numbers of resources, many participants, many program components
• Operate efficiently when dealing with large amounts of data & computation
Aspects of the Systems Problem
• Need for interoperability when different groups want to share resources
  • Diverse components, policies, mechanisms
  • E.g., standard notions of identity, means of communication, resource descriptions
• Need for shared infrastructure services to avoid repeated development and installation
  • E.g., one port/service/protocol for remote access to computing, not one per tool/application
  • E.g., certificate authorities: expensive to run
• A common need for protocols & services
Hence, a Protocol-Oriented View of Grid Architecture, That Emphasizes …
• Development of Grid protocols & services
  • Protocol-mediated access to remote resources
  • New services: e.g., resource brokering
  • "On the Grid" = speak Intergrid protocols
  • Mostly (extensions to) existing protocols
• Development of Grid APIs & SDKs
  • Interfaces to Grid protocols & services
  • Facilitate application development by supplying higher-level abstractions
• The (hugely successful) model is the Internet.
The Data Grid Problem
"Enable a geographically distributed community [of thousands] to perform sophisticated, computationally intensive analyses on petabytes of data."
Major Data Grid Projects
Data-Intensive Issues Include …
• Harness [potentially large numbers of] data, storage, and network resources located in distinct administrative domains
• Respect local and global policies governing what can be used for what
• Schedule resources efficiently, again subject to local and global constraints
• Achieve high performance, with respect to both speed and reliability
• Catalog software and virtual data
Data-Intensive Computing and Grids
• The term "Data Grid" is often used
  • Unfortunate, as it implies a distinct infrastructure, which it isn't; but it is easy to say
• Data-intensive computing shares numerous requirements with collaboration, instrumentation, computation, …
  • Security, resource management, information services, etc.
• Important to exploit commonalities, as it is very unlikely that multiple infrastructures can be maintained
• Fortunately this seems easy to do!
Examples of Desired Data Grid Functionality
• High-speed, reliable access to remote data
• Automated discovery of the "best" copy of data
• Manage replication to improve performance
• Co-schedule compute, storage, and network
• "Transparency" with respect to delivered performance
• Enforce access control on data
• Allow representation of "global" resource allocation policies
Central question: how must Grid architecture be extended to support these functions?
Grid Protocols, Services, Tools: Enabling Sharing in Virtual Organizations
• Protocol-mediated access to resources
  • Mask local heterogeneities
  • Extensible to allow for advanced features
  • Negotiate multi-domain security, policy
  • "Grid-enabled" resources speak the protocols
  • Multiple implementations are possible
• Broad deployment of protocols facilitates the creation of services that provide an integrated view of distributed resources
• Tools use protocols and services to enable specific classes of applications
A Model Architecture for Data Grids
[Diagram: an application hands an attribute specification to a metadata catalog, which resolves it to a logical collection and logical file name; a replica catalog maps that to multiple physical locations (replica locations 1-3, backed by disk caches, a disk array, and a tape library). Replica selection uses performance information and predictions from the Metacomputing Directory Service and the Network Weather Service to pick the replica, and the data then moves over GridFTP control and data channels.]
Globus Toolkit Components
Two major Data Grid components:
1. Data transport and access
  • Common protocol
  • Secure, efficient, flexible, extensible data movement
  • Family of tools supporting this protocol
2. Replica management architecture
  • Simple scheme for managing:
    • multiple copies of files
    • collections of files
APIs and white papers: http://www.globus.org
Motivation for a Common Data Access Protocol
• Existing distributed data storage systems:
  • DPSS, HPSS: focus on high-performance access; utilize parallel data transfer and striping
  • DFS: focus on high-volume usage, dataset replication, local caching
  • SRB: connects heterogeneous data collections, uniform client interface, metadata queries
• Problems:
  • Incompatible (and proprietary) protocols
  • Each requires a custom client
  • Partitions available data sets and storage devices
  • Each protocol has only a subset of the desired functionality
A Common, Secure, Efficient Data Access Protocol
• Common, extensible transfer protocol
  • A common protocol means all systems can interoperate
• Decouples low-level data transfer mechanisms from the storage service
• Advantages:
  • New, specialized storage systems are automatically compatible with existing systems
  • Existing systems gain richer data transfer functionality
• Interface to many storage systems
  • HPSS, DPSS, file systems
  • Plan for SRB integration
A Universal Access/Transport Protocol
• A suite of communication libraries and related tools that support:
  • GSI and Kerberos security
  • Third-party transfers
  • Parameter set/negotiation
  • Partial file access
  • Reliability/restart
  • Large file support
  • Data channel reuse
• All based on a standard, widely deployed protocol
And the Universal Protocol is … GridFTP
• Why FTP?
  • Ubiquity enables interoperation with many commodity tools
  • Already supports many desired features; easily extended to support others
  • Well understood and supported
• We use the term GridFTP to refer to:
  • the transfer protocol, which meets the requirements above, and
  • the family of tools that implement the protocol (example commands below)
• Note GridFTP > FTP
• Note that despite the name, GridFTP is not restricted to file transfer!
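For illustration, two invocations of the Globus Toolkit's globus-url-copy client, which speaks GridFTP (the host names and paths here are hypothetical). The second command is a third-party transfer: the client only brokers the movement, and the data flows directly between the two GridFTP servers.

    # copy a remote file to local disk over GridFTP (gsiftp://)
    globus-url-copy gsiftp://data.siteA.example.org/runs/shot1042.h5 \
                    file:///scratch/shot1042.h5

    # third-party transfer: data moves directly between the two servers
    globus-url-copy gsiftp://data.siteA.example.org/runs/shot1042.h5 \
                    gsiftp://archive.siteB.example.org/archive/shot1042.h5

Authentication in both cases goes through GSI, so a single grid proxy certificate covers both endpoints.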
Summary
[Diagram: a supercomputer and the PPPL "petrel" cluster (racks of CPUs, disks, and a database) feed the PPPL display wall. Web services (data analysis, data mining) run against the data, and the Access Grid runs here as well. Visualization and analysis tools in the picture include Chromium, XPLIT, SCIRun or VTK, AVS/Express, and IDL, with HTTP / Access Grid docking.]