Grid Computing - A Primer
Sridhara Dasu, Department of Physics, U. Wisconsin

• Grid Computing
  • What is the buzz all about?
  • What is the promise?
• My Perspective
  • What is in it for me?
  • How is it working for us?
    • In UW-Madison
    • And, beyond …
• Conclusion
  • Why should you be interested?
  • What are the consequences for you?

Acknowledgements: Condor Team, GLOBUS Team, I. Foster/Argonne, M. Livny/Wisconsin, D. Bradley/Wisconsin
Grid Computing is in the News …
The Opportunity (or Challenge): Computational Cornucopia
• Abundant computation, data, bandwidth
  • In many fields, too much data, not too little
  • Simulations of unprecedented accuracy
  • Ubiquitous internet: distance is not a barrier
• But as a consequence
  • The rate of change accelerates
  • Complex problems demand multidisciplinary, distributed teams and sharing of resources & expertise
  • Without infrastructure, you can't compete
Why Distributed Teams Are Important
• Increasingly challenging & complex problems
  • Particle physics, global change, cosmology, life sciences
  • Manufacturing, mineral exploration
  • Film production, game development, …
• Required expertise & resources are also distributed
  • People
  • Computational capability
  • Data
  • Sensors
The Grid
"Resource sharing & coordinated problem solving in dynamic … virtual organizations"
• Enable integration of distributed services & resources
• Using general-purpose protocols & infrastructure
• To achieve useful qualities of service
"The Anatomy of the Grid", Foster, Kesselman, Tuecke, 2001
http://www.mkp.com/mk/default.asp?isbn=1558609334
What is a Grid?
• The key criteria: a Grid …
  • Coordinates distributed resources …
  • Uses standard, open, general-purpose protocols and interfaces …
  • Delivers non-trivial qualities of service
• What is not a Grid?
  • A cluster, a network-attached storage device, a scientific instrument, a network, etc.
  • Each is an important component of a Grid, but by itself does not constitute a Grid
Why Should You Care?
1) Grid is a promising technology [Vision]
   • It ushers in a virtualized, collaborative, distributed world
2) Grids are being commissioned now [Reality]
   • Grids are built (not bought), but are delivering real benefits in academic and commercial settings
3) An open Grid is to your advantage [Future]
   • Standards are being defined now that will determine the future of this technology
The Power Grid: On-Demand Access to Electricity
Decouple production & consumption, enabling
• On-demand access
• Economies of scale
• Consumer flexibility
• New devices
[Chart: quality and economies of scale improving over time]
But Computing Isn't Really Like Electricity!
• How about "access computing resources like we access Web content"?
  • We have no idea where a website is, or on what computer or operating system it runs
• Two interrelated opportunities
  1) Enhance economy, flexibility, access by virtualizing computing resources
  2) Deliver entirely new capabilities by integrating distributed resources
Virtualization
[Diagram: Applications (delivery) are automatically connected to Application Services (distribution), which are dynamically and intelligently provisioned, with automatic failover, onto Servers (execution); application virtualization sits above infrastructure virtualization]
Source: The Grid: Blueprint for a New Computing Infrastructure (2nd Edition), 2004
Local Clusters to Global Grids
Cluster Grid → Enterprise Grid → Global Grid
Grid Deployment Trends
[Chart: mission criticality vs. collaboration scale; deployments spread from department to enterprise to Internet scale, and from scientific to corporate use]
Transparent Service
[Diagram: the Grid at the intersection of Utility Computing, Autonomic Computing, and Service-Oriented Architecture]
Webster says: Autonomic = acting or occurring involuntarily <autonomic reflexes>
Layers of Grid Architecture
Multidisciplinary Teams: Problem Solving in the 21st Century
• Teams organized around common goals
  • Communities: "virtual organizations"
• With diverse membership & capabilities
  • Heterogeneity is a strength, not a weakness
• And geographic and political distribution
  • No single location or organization possesses all required skills and resources
• Must adapt as a function of the situation
  • Adjust membership, reallocate responsibilities, renegotiate resources
Challenging Technical Requirements
• Dynamic formation and management of virtual organizations
• Discovery & online negotiation of access to services: who, what, why, when, how
• Configuration of applications and systems able to deliver multiple qualities of service
• Autonomic management of distributed infrastructures, services, and applications
• Management of distributed state
• Open, extensible, evolvable infrastructure
The Globus Project™: Making Grid Computing a Reality (since 1996)
• Close collaboration with real Grid projects in science and industry
• The Globus Toolkit®: open-source software base for building Grid infrastructure and applications
• Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure
• Development and promotion of standard Grid software APIs to enable portability and code sharing
• Global Grid Forum: co-founded GGF to foster Grid standardization and community
Globus Toolkit 2: Key Protocols
• The Globus Toolkit v2 (GT2) centers around four key protocols
• Connectivity layer:
  • Security: Grid Security Infrastructure (GSI)
• Resource layer:
  • Resource Management: Grid Resource Allocation Management (GRAM)
  • Information Services: Grid Resource Information Protocol (GRIP)
  • Data Transfer: Grid File Transfer Protocol (GridFTP)
• Also key collective-layer protocols
  • Info Services, Replica Management, etc.
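As a concrete illustration of the data-transfer protocol above, GT2 provides the globus-url-copy client for GridFTP. A minimal sketch, assuming a GSI proxy credential is created first; the host name and paths are placeholders, not from the slides:

    % grid-proxy-init                  # create a short-lived GSI proxy credential
    % globus-url-copy gsiftp://remote.host.edu/data/input.dat file:///tmp/input.dat

Here gsiftp:// is the URL scheme for GridFTP endpoints, while file:// addresses the local disk; the same command can also drive third-party transfers between two remote GridFTP servers.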
Condor: High Throughput Computing & Resource Management (Est. 1986)
UW Condor Project - Miron Livny's group (http://www.cs.wisc.edu/condor)
• Predates Globus
• High throughput computing on commodity resources
• Successful enterprise-level deployments
  • UW Computer Science Condor pool
  • UW Condor pools in other departments
  • INFN/Italy pools
  • Inter-pool flocking
  • …
• Also, some industrial users
  • …
The Layers of Condor
[Diagram: on the submit (client) side, Application → Application Agent → Customer Agent; a central Matchmaker pairs customers with resource owners; on the execute (service) side, Owner Agent → Remote Execution Agent → Local Resource Manager → Resource]
A complete solution for resource management
A Grid Job
• Must be able to run in the background: no interactive input, windows, GUI, etc.
• Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
• Organize data files, input/output
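In Condor's submit description language, this file-based redirection is expressed with the Input, Output, and Error commands. A minimal sketch; the file names are illustrative, not from the slides:

    Input  = my_job.in    # file handed to the job as its STDIN
    Output = my_job.out   # file capturing the job's STDOUT
    Error  = my_job.err   # file capturing the job's STDERR
    Log    = my_job.log   # Condor's own event log for the job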
Condor Universes
• The Standard Universe
  • Checkpoints executable state
  • Job migration to other resources to continue execution
  • Transparent I/O redirection to the user's submit machine
  • Robust against resource preemption for higher-priority tasks, and against resource failures
  • Limitations on applications (e.g., shared libraries, multi-threading)
• The Vanilla Universe
  • Traditional batch jobs with no such limitations
  • External solutions for I/O redirection
  • Not robust against preemption or resource failures
• The Globus Universe (new)
  • Adapted to emerging Grid standards
  • Part of the Globus Toolkit
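To get Standard Universe checkpointing, the executable must first be relinked against Condor's checkpointing and remote-I/O library with condor_compile. A sketch, assuming a simple C program; the names are placeholders:

    % condor_compile gcc -o my_job my_job.c

The relinked binary is then submitted with Universe = standard in the submit description file.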
Condor-G: Globus + Condor
• Globus
  • Middleware deployed across the entire Grid
  • Remote access to computational resources
  • Dependable, robust data transfer
• Condor
  • Job scheduling across multiple resources
  • Strong fault tolerance with checkpointing and migration
  • Layered over Globus as a "personal batch system" for the Grid
[Diagram: the Condor-G stack; a User/Application submits to Condor-G, which drives the Globus Toolkit, which in turn reaches Condor pools and the Grid fabric (processing, storage, communication)]
Creating a Submit Description File
• A plain ASCII text file
• Tells Condor-G about your job:
  • Which executable, grid site, input, output and error files to use, command-line arguments, environment variables, etc.
• Can describe many jobs at once (a "cluster"), each with different input, arguments, output, etc. (see the cluster sketch after the simple example below)
Simple Submit Description File

    # Simple condor_submit input file
    # (Lines beginning with # are comments)
    # NOTE: the words on the left side are not
    # case sensitive, but filenames are!
    Universe        = globus
    GlobusScheduler = host.domain.edu/jobmanager
    Executable      = my_job
    Queue
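A hedged sketch of the multi-job "cluster" case mentioned on the previous slide, using Condor's $(Process) macro so each queued job gets its own files; the file names are illustrative:

    Universe        = globus
    GlobusScheduler = host.domain.edu/jobmanager
    Executable      = my_job
    Input           = input.$(Process)    # input.0, input.1, input.2
    Output          = output.$(Process)   # one output file per job
    Error           = error.$(Process)
    Log             = my_job.log
    Queue 3                               # queue three jobs in one cluster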
Running condor_submit
• You give condor_submit the name of the submit file you have created
• condor_submit parses the file, checks for errors, and creates a "ClassAd" that describes your job(s)
• It sends your job's ClassAd(s) and executable to the Condor-G schedd, which stores the job in its queue
  • Atomic operation, two-phase commit
• View the queue with condor_q
condor_submit sequence
[Diagram: condor_submit hands the job to the Condor-G schedd; Condor-G contacts the Gatekeeper at the Globus resource, which passes the job to the local job scheduler; condor_q queries the Condor-G queue]
Running condor_submit

    % condor_submit my_job.submit-file
    Submitting job(s).
    1 job(s) submitted to cluster 1.

    % condor_q

    -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
     ID      OWNER      SUBMITTED     RUN_TIME   ST PRI SIZE CMD
     1.0     frieda     6/16 06:52   0+00:00:00  I  0   0.0  my_job

    1 jobs; 1 idle, 0 running, 0 held
    %
DAGMan
• Directed Acyclic Graph Manager
• DAGMan allows you to specify the dependencies between your Condor-G jobs, so it can manage them automatically for you
• e.g., "Don't run job B until job A has completed successfully"
What is a DAG?
• A DAG is the data structure used by DAGMan to represent these dependencies
• Each job is a "node" in the DAG
• Each node can have any number of "parent" or "child" nodes - as long as there are no loops!
[Diagram: a diamond-shaped DAG; Job A is the parent of Jobs B and C, which are both parents of Job D]
Defining a DAG
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

• Each node will run the Condor-G job specified by its accompanying Condor submit file
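The DAG is handed to DAGMan with the condor_submit_dag tool, which submits each node's job as its dependencies are satisfied; for the diamond example above:

    % condor_submit_dag diamond.dag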
What about Data?
Data Placement* (DaP) must be an integral part of the end-to-end solution
Stork (another UW Computer Science product)
• Schedules, runs, monitors, and manages Data Placement (DaP) jobs in a heterogeneous Grid environment & ensures that they complete
• What Condor-G is for computational jobs, Stork is for DaP jobs
• Just submit a bunch of DaP jobs and then relax …
• Interoperates with various storage services
* Space management and data transfer
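A hedged sketch of what a Stork DaP job looks like, modeled on published Stork examples rather than on these slides; the ClassAd-style syntax and the URLs are illustrative assumptions:

    [
      dap_type = "transfer";                               // a data-transfer DaP job
      src_url  = "gsiftp://remote.host.edu/data/x.dat";    // source: a GridFTP server (placeholder)
      dest_url = "file:///scratch/x.dat";                  // destination: local disk (placeholder)
    ]

Such a job would be queued with Stork's own submit tool (stork_submit) and monitored much as condor_q monitors computational jobs.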
Full Condor-G Capabilities
[Diagram: Planner(s) sit above DAGMan, which drives both Stork (DaP) and Condor-G (compute); Condor-G talks to a Gatekeeper and StartD for execution, while Stork talks to storage services such as SRM, SRB, NeST, RFT, and GridFTP]
UW "Enterprise Level" Grid
• Condor pool at CS
  • 1000 ~1 GHz Intel CPUs
• Condor pools at various departments
  • 100 ~2.4 GHz Intel CPUs at Physics, etc.
• New: Grid Laboratory of Wisconsin
  • Condor jobs flock from various departments to the CS pool as needed
• Excellent utilization
  • Especially when the Condor Standard Universe is used
  • Preemption, checkpointing, job migration
Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW
Six GLOW sites:
• Computational Genomics, Chemistry
• Amanda, IceCube, Physics/Space Science
• High Energy Physics/CMS, Physics
• Materials by Design, Chemical Engineering
• Radiation Therapy, Medical Physics
• Computer Science
Phase-1 already has ~300 Xeon CPUs
Expect to grow to about 700 CPUs + 100 TB disk
Condor/GLOW Ideas
• Exploit commodity hardware for high throughput computing
  • The base hardware is the same at all sites
  • Local configuration optimization as needed
    • e.g., number of CPU elements vs. storage elements
  • Must meet global requirements
    • It turns out that our initial assessment calls for almost identical configurations at all sites
• Managed locally at 6 sites
• Shared globally across all sites
  • Higher priority for local jobs
The Large Hadron Collider
Building and commissioning the accelerator and detectors, and extracting interesting physics out of this massive data sample, is a big challenge.
Event Filtering Before Archival
Output: 1 MB/event @ 100 Hz, i.e., 100 MB/s, which over ~10^7 live seconds adds up to a petabyte per year
Analysis Teams + Resources
Input: ~10^9 events (petabyte databases)
Complex algorithms developed by collaborating physicists
Output: publications based on ~100s of selected events
Simulation: Early Grid Deployment
• Detailed simulations are necessary
  • Large numbers of background events need to be simulated
  • Dominated by fluctuations of tails
• Computation scale
  • Background events occur on every crossing - 40 MHz
  • Up to 10 minutes on a 1 GHz CPU to simulate a full event
  • 2 × 10^9 s of CPU time to simulate 1 s of LHC operation (4 × 10^7 crossings × ~50 s average per crossing)
  • Requires 1000 CPUs running for 1 month (2 × 10^9 s ÷ 1000 CPUs ≈ 2 × 10^6 s ≈ 23 days)
  • CMS has a large number of detector channels, ~10^8
  • Each event requires 1-10 MB of storage space
    • 32-320 TB needed for 1 s of LHC operation
• Optimizing CPU and data storage
  • Simulate in bins and reuse some data
• Pleasantly parallel application
  • Ideal Grid testbed candidate
  • Used the UW "enterprise level" classic Condor grid successfully
  • With Grid2003, used a nationwide Globus/Condor-G based true grid
Tapping the UW "Enterprise Level" Grid
We tapped resources on the UW campus opportunistically.
We produced more events in 2003 than most other CMS collaborators - because of our UW enterprise-level grid and the Condor standard universe!
The 2004 numbers are through March, and we were also running our new C++ simulation code, which is a factor of 2 slower.
We have typically used less than 50% of available resources and ran for about 30% of the year.
Tapping the Global Grid: Grid3
Cost Savings from Grids
• The cost savings from grids will come in two waves:
  • First from the adoption of clusters
  • Then from the adoption of Enterprise Grids
• Firms using clusters estimate that cost savings will be small at first, but will grow to 15% to 30% of IT costs in 2005-2008
• Firms planning to use Enterprise Grids estimate that they will experience a second wave of benefits, with savings growing to 15% to 30% by 2007-2010
Source: Robert Cohen, "Grid Computing: Projected Impact on North Carolina's Economy & Broadband Use through 2010," Rural Internet Access Authority, September 2003. http://www.e-nc.org
Problems Grids Are Addressing Now
• Low utilization of enterprise resources
• High cost of provisioning for peak demand
• Inadequate resources preventing use of advanced applications
• Lack of information integration
Cyberinfrastructure & VOs: Relevance Far Beyond Science
1) Virtualization of information technology
   • From vertical silos to on-demand access
   • Improve efficiency of delivery, increase flexibility of use
   • E.g., financial services, e-commerce
2) New applications, products, & services enabled by abundant computation & data
   • Media, life sciences, manufacturing, seismic exploration, online gaming, etc.
The Value of Grid Computing: IBM Perspective
• Increased efficiency
• Higher quality of service
• Increased productivity & ROI
• Reduced complexity & cost
• Improved resiliency
Grids: HP Perspective
[Diagram: a value-over-time roadmap; today's grid-enabled systems (Tru64, HP-UX, and Linux clusters; OpenVMS clusters, TruCluster, MC ServiceGuard) evolve through the programmable data center (UDC, with switch fabric, compute, and storage) to the virtual data center, and ultimately to a computing utility or GRID with shared, traded resources]