Grid Computing: from a solid past to a bright future? David Groep, NIKHEF, DataGrid and VL group, 2003-03-14
Grid – more than a hype? Imagine that you could plug your computer into the wall and have direct access to huge computing resources immediately, just as you plug in a lamp to get instant light. … Far from being science-fiction, this is the idea the XXXXXX project is about to make into reality. … from a project brochure in 2001
Grids and their (science) applications • Origins of the grid • What makes a Grid? • Grid implementations today • New standards • Dutch dimensions
Grid – a vision • The GRID: networked data processing centres and “middleware” software as the “glue” of resources • Researchers perform their activities regardless of geographical location, interact with colleagues, share and access data • Scientific instruments and experiments provide huge amounts of data Federico.Carminati@cern.ch
Communities and Apps ENVISAT • 10 instruments on board • 200 Mbps data rate to ground • 400 Tbytes data archived/year • ~100 `standard’ products • 10+ dedicated facilities in Europe • ~700 approved science user projects http://www.esa.int/
Added value for EO • enhance the ability to access high level products • allow reprocessing of large historical archives • data fusion and cross-validation, …
The Need for Grids: LHC • Physics @ CERN • LHC particle accelerator • operational in 2007 • 5-10 Petabyte per year • 150 countries • > 10000 Users • lifetime ~ 20 years • Trigger chain: 40 MHz (40 TB/sec) → level 1 - special hardware → 75 kHz (75 GB/sec) → level 2 - embedded → 5 kHz (5 GB/sec) → level 3 - PCs → 100 Hz (100 MB/sec) → data recording & offline analysis http://www.cern.ch/
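The rates on this slide can be cross-checked with a little arithmetic. The sketch below is only an illustration: it uses the four rate/bandwidth pairs quoted above, and it assumes about 10^7 seconds of data taking per year, a figure that does not appear on the slide.

```python
import math

# (stage, event rate in Hz, data rate in bytes/s) as quoted on the slide,
# using decimal units (1 TB = 1e12 bytes).
STAGES = [
    ("collisions", 40e6, 40e12),      # 40 MHz, 40 TB/sec
    ("after level 1", 75e3, 75e9),    # 75 kHz, 75 GB/sec
    ("after level 2", 5e3, 5e9),      # 5 kHz, 5 GB/sec
    ("after level 3", 100.0, 100e6),  # 100 Hz, 100 MB/sec
]

# ASSUMPTION (not from the slide): seconds of data taking per year.
ASSUMED_LIVE_SECONDS_PER_YEAR = 1e7


def event_size_bytes(rate_hz, bytes_per_second):
    """Average event size implied by a rate/bandwidth pair."""
    return bytes_per_second / rate_hz


def rejection_factor(rate_in_hz, rate_out_hz):
    """How many input events there are per event that is kept."""
    return rate_in_hz / rate_out_hz


def yearly_volume_bytes(bytes_per_second, live_seconds=ASSUMED_LIVE_SECONDS_PER_YEAR):
    """Recorded volume per year under the assumed live time."""
    return bytes_per_second * live_seconds


if __name__ == "__main__":
    for name, rate, bandwidth in STAGES:
        print(f"{name}: {event_size_bytes(rate, bandwidth) / 1e6:.1f} MB per event")
    print("overall rejection:", rejection_factor(STAGES[0][1], STAGES[-1][1]))
    print("recorded per year (PB):", yearly_volume_bytes(STAGES[-1][2]) / 1e15)
```

Every stage implies the same event size of about 1 MB, so the trigger levels reduce the event rate rather than the event size. Under the assumed live time, a 100 MB/sec recording stream gives about 1 Petabyte per year, the same order of magnitude as the 5-10 Petabyte per year quoted on the slide.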
And More … Bio-informatics • For access to data • Large network bandwidth to access computing centers • Support for data bank replicas (easier and faster mirroring) • Distributed data banks • For interpretation of data • Grid-enabled algorithms: BLAST on distributed data banks, distributed data mining
And even more … • financial services, life sciences, strategy evaluation, … • instant immersive teleconferencing • remote experimentation • pre-surgical planning and simulation
Why is the Grid successful? • Applications need large amounts of data or computation • Ever larger, distributed user community • Network grows faster than compute power/storage
Inter-networking systems • Continuous growth (now ~ 180 million hosts) • Many protocols and APIs (~3500 RFCs) • Focus on heterogeneity (and security) http://www.caida.org/ http://www.isc.org/
Remote Service • RPC proved hugely successful within domains • Network Information System (YP) • Network File System • Typical client-server stuff… • CORBA – also intra-domain • Extension of RPC to OO design model • Diversification • Web Services – venturing into the inter-organisational domain • Standard service descriptions and discovery • Common syntax (XML/SOAP)
Grid beginnings - Systems • distributed computing research • Gigabit network test beds • Meta-supercomputing (I-WAY) • Condor ‘flocking’ GUSTO meta-computing test bed in 1999
Grid beginnings - Apps • Solve problems using systems in one ‘domain’ • parameter sweeps on batch clusters • PIAF for (HE) physics analysis • … • Solvers using systems in multiple domains • SETI@home • … • Ready for the next step …
What is the Grid about? Resource sharing and coordinated problem solving in dynamic multi-institutional virtual organisations Virtual Organisation (VO): A set of individuals or organisations, not under single hierarchical control, temporarily joining forces to solve a particular problem at hand, bringing to the collaboration a subset of their resources, sharing those at their discretion and each under their own conditions.
What makes a Grid? A Grid: • coordinates resources not subject to central control … (more than cluster & centralised distributed computing; security, AAA, billing & payment, integrity, procedures) • … using standard, open protocols … (more than single-purpose solutions; requires interoperability, standards body, multiple implementations) • … to deliver non-trivial QoS (sum more than individual components, e.g. single sign-on, transparency) Ian Foster in Grid Today, 2002
Grid Architecture (v1) [Diagram: Grid layers shown next to the Internet Protocol Architecture] Grid layers, top to bottom: • Application • Collective – “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services • Resource – “Sharing single resources”: negotiating access, controlling use • Connectivity – “Talking to things”: communication (Internet protocols) & security • Fabric – “Controlling things locally”: access to, & control of, resources. Internet Protocol Architecture, top to bottom: Application, Transport, Internet, Link
Protocol Layers & Bodies [Diagram: Grid layers and Internet Protocol Architecture as on the previous slide] • Application, Presentation, Session – standards bodies: GGF, W3C, OASIS • Transport, Network – standards body: IETF • Data Link, Physical – standards body: IEEE
Grid Middleware • Globus Project started 1997 • Focus on research only • Used and extended by many other projects • Toolkit `bag-of-services' approach – not a complete architecture • Several middleware projects: • EU DataGrid – production focus • CrossGrid, GridLAB, DataTAG, PPDG, GriPhyN • Condor • In NL: ICES/KIS Virtual Lab, VL-E http://www.globus.org/ http://www.edg.org/ http://www.vl-e.nl/
Grid Protocols Today • Use common Grid Security Infrastructure: • Extensions to TLS for delegation (single sign-on) • Organisation of users in VOs • Currently deployed main services • GRAM (resource allocation): attrib/value pairs over HTTP • GridFTP (bulk file transfer): FTP with GSI and high-throughput extras (striping) • MDS (monitoring and discovery service): LDAP + common resource description schema • Next generation: Grid Services (OGSA)
Grid Security Infrastructure • Requirements: • “Secure” • User identification • Accountability • Site autonomy • Usage control • Single sign-on • Dynamic VOs any time and any place • Mobility (“easyEverything”, airport kiosk, handheld) • Multiple roles for each user • Easy!
Authentication – PKI • Asserting, binding identities • Trust issues on a global scale • EDG: CA Coord. Group • 16 national certification authorities + CrossGrid CAs • policies & procedures → mutual trust • users identified by CA-issued certificates • Part of world-wide GridPMA • Establishing minimum requirements • Includes several US and AP CAs • Scaling still a challenge http://marianne.in2p3.fr/datagrid/ca and http://www.gridpma.org/
Getting People Together: Virtual Organisations • The user community `out there’ is large & highly dynamic • Applying at each individual resource does not scale • Users get together to form Virtual Organisations: • Temporary alliance of stakeholders (users and/or resources) • Various groups and roles • Managed by (legal) contracts • Set up and dissolved at will* • Authentication, Authorization, Accounting (AAA) *currently not yet that fast
Authorization (today) • Virtual Organisation “directories” • Members are listed in a directory • Managed by the VO responsible • Sites extract access lists from directories • Only for VOs they have a “contract” with • Still need OS-local accounts • May also use automated tools (sysadm level) • poolAccounts • slashGrid http://cern.ch/hep-project-grid-scg/
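The flow on this slide (VO directory → site access list → OS-local account) can be sketched in a few lines. This is a minimal illustration, not the behaviour of the poolAccounts tool named above: the directory contents, the identity strings, the account names and the "lease the first free account" policy are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical VO directories: VO name -> list of member identities.
VO_DIRECTORIES: Dict[str, List[str]] = {
    "vo-a": ["/O=example/CN=Alice", "/O=example/CN=Bob"],
    "vo-b": ["/O=example/CN=Carol"],
}


def build_access_list(directories: Dict[str, List[str]],
                      contracted_vos: Iterable[str]) -> Dict[str, str]:
    """Extract {identity: VO} for the VOs this site has a "contract" with."""
    access: Dict[str, str] = {}
    for vo in contracted_vos:
        for subject in directories.get(vo, []):
            access.setdefault(subject, vo)
    return access


class AccountPool:
    """Hands out OS-local accounts from a fixed pool, one per identity."""

    def __init__(self, accounts: Iterable[str]) -> None:
        self._free: List[str] = list(accounts)
        self._leases: Dict[str, str] = {}

    def account_for(self, subject: str) -> Optional[str]:
        # Reuse an existing lease so a user keeps the same local account.
        if subject in self._leases:
            return self._leases[subject]
        if not self._free:
            return None  # pool exhausted
        account = self._free.pop(0)
        self._leases[subject] = account
        return account


def authorize(subject: str, access_list: Dict[str, str],
              pool: AccountPool) -> Optional[str]:
    """Return the local account for an authorized identity, else None."""
    if subject not in access_list:
        return None
    return pool.account_for(subject)


if __name__ == "__main__":
    site_access = build_access_list(VO_DIRECTORIES, ["vo-a"])
    site_pool = AccountPool(["pool001", "pool002"])
    print(authorize("/O=example/CN=Alice", site_access, site_pool))
    print(authorize("/O=example/CN=Carol", site_access, site_pool))
```

The point of the pool is that the site administrator creates a fixed set of local accounts once, instead of one account per user who applies.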
Grid Security in Action • Key elements in Grid Security Infrastructure (GSI) • Proxy • Trusted certificate store • Delegation: full or restricted rights • Access services directly • Establish trust between processes
GSI in Action: “Create Processes at A and B that Communicate & Access Files at C” • User: single sign-on via “grid-id” & generation of proxy cred. (or: retrieval of proxy cred. from online repository) → User Proxy (proxy credential) • User Proxy → remote process creation requests* → GSI-enabled GRAM servers at Site A (Kerberos) and Site B (Unix) • Site A: authorize, map to local id, create process, generate credentials → process on computer with local id, Kerberos ticket, restricted proxy • Site B: ditto → process on computer with local id, restricted proxy • Communication* between the processes at A and B • Remote file access request* → GSI-enabled FTP server at Site C (Kerberos): authorize, map to local id, access file → storage system * With mutual authentication
Large-scale production Grids • Until recently usually “smallish” • O(10) sites, O(20) users • Only one community (VO) Running Production Grids • EU DataGrid (EDG) • Stress testing: up to 2000 jobs at any time • Focus on stability (>99% of jobs complete correctly) • VL-E • NASA IPG • LCG, PPDG/iVDGL Example Grid
EU DataGrid • Middleware research project (2001-2003) • Driving applications: • HE Physics • Earth Observation • Biomedicine • Operational testbed • 25 sites, 50 CEs • 8 VOs • ~ 350 users, growing by ~50/month! http://www.eu-datagrid.org/
EU DataGrid Test Bed 1 • DataGrid TB1: • 14 countries • 21 major sites • CrossGrid: 40 more sites • Submitting Jobs: • Login only once, run everywhere • Cross administrative boundaries in a secure and trusted way • Mutual authorization http://marianne.in2p3.fr/
EDG: 3 Tier Architecture • Client: ‘User Interface’ • Execution Resources: ‘Compute Element’ • Data Server: ‘Storage Element’, Database server [Diagram arrows between the tiers: Request, Request, Result, Data]
Example: GOME Step 8: Visualize Results
GOME processing cycle • ‘Raw’ satellite data from the GOME instrument (Level 1) • ESA – KNMI: processing of raw GOME data to ozone profiles, with Opera and Noprego (Level 2) • IPSL: validate GOME ozone profiles with ground-based measurements (LIDAR data database) • DataGrid: visualization
Situation on a Grid INFORMATION SERVICES
Information Services (IS) HARDWARE – fabric and storage • Cluster information • Storage capacity • Network connections • Today: info-providers publish to the IS hierarchical directory • Next week: R-GMA producer-consumer framework based on RDBMS DATA – files and collections • File replica locations • Today: Replica Catalogue (RC) • In a few months: Replica Location Service SOFTWARE – programs & services • RunTime Environment tags • Service entries (SE, CE, RC) • Today: in IS
Grid job submission • Basic protocol: GRAM • Job submission at individual CE • Status inquiries • Credential delegation • File staging • Job manager (baby-sitter) • Collective services (Workload Management System, WMS) • Resource broker • Job submission service • Logging and Bookkeeping • The EDG WMS tries to optimize the usage of resources • Will re-submit on resource failure • Many WMSs exist ...
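The "re-submit on resource failure" behaviour can be pictured as a loop over candidate Compute Elements. The sketch below is an assumption about the general idea only: the function names, the attempt limit and the try-in-order policy are hypothetical and are not taken from the EDG WMS.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def submit_with_resubmission(
    candidates: Iterable[str],
    submit: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Try candidate CEs in order until one accepts and completes the job.

    `submit` is a hypothetical callable that returns True when the job
    succeeded on the given CE and False on a resource failure.
    Returns (CE that ran the job or None, list of CEs that were tried).
    """
    tried: List[str] = []
    for ce in candidates:
        if len(tried) >= max_attempts:
            break
        tried.append(ce)
        if submit(ce):
            return ce, tried
    return None, tried


if __name__ == "__main__":
    broken = {"ce1.example.org"}  # hypothetical failing resource

    def fake_submit(ce: str) -> bool:
        return ce not in broken

    print(submit_with_resubmission(
        ["ce1.example.org", "ce2.example.org"], fake_submit))
```

A real workload manager would also record each attempt in Logging and Bookkeeping; that part is left out here.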
Job Preparation • Information to be specified • Job characteristics • Requirements and Preferences of the computing system • Software dependencies • Job Data requirements • Specified using a Job Description Language (JDL)
Example JDL File
  Executable = "gridTest";
  StdError = "stderr.log";
  StdOutput = "stdout.log";
  InputSandbox = {"home/joda/test/gridTest"};
  OutputSandbox = {"stderr.log", "stdout.log"};
  InputData = "LF:testbed0-00019";
  ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/ \
      lc=test, rc=WP2 INFN Test, dc=infn, dc=it";
  DataAccessProtocol = "gridftp";
  Requirements = other.Architecture=="INTEL" && \
      other.OpSys=="LINUX" && \
      other.FreeCpus >=4;
  Rank = "other.MaxCpuTime";
This JDL is input to dg-job-submit
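The Requirements and Rank attributes drive matchmaking: Requirements filters the candidate resources and Rank orders the ones that remain. The sketch below mimics this for the example above. The resource records are invented for illustration, and treating a higher Rank value as better is stated here as an assumption rather than as the documented broker behaviour.

```python
from typing import Callable, Dict, List

Resource = Dict[str, object]

# Hypothetical resource descriptions, as an information service might publish them.
RESOURCES: List[Resource] = [
    {"name": "ce-a", "Architecture": "INTEL", "OpSys": "LINUX", "FreeCpus": 8, "MaxCpuTime": 720},
    {"name": "ce-b", "Architecture": "INTEL", "OpSys": "LINUX", "FreeCpus": 2, "MaxCpuTime": 2880},
    {"name": "ce-c", "Architecture": "INTEL", "OpSys": "LINUX", "FreeCpus": 16, "MaxCpuTime": 1440},
    {"name": "ce-d", "Architecture": "SPARC", "OpSys": "SOLARIS", "FreeCpus": 32, "MaxCpuTime": 4320},
]


def requirements(other: Resource) -> bool:
    """The Requirements expression from the example JDL."""
    return (other["Architecture"] == "INTEL"
            and other["OpSys"] == "LINUX"
            and other["FreeCpus"] >= 4)


def rank(other: Resource) -> float:
    """The Rank expression from the example JDL."""
    return other["MaxCpuTime"]


def match(resources: List[Resource],
          requirements: Callable[[Resource], bool],
          rank: Callable[[Resource], float]) -> List[Resource]:
    """Keep resources that satisfy Requirements, best Rank first.

    ASSUMPTION: a higher Rank value means a more preferred resource.
    """
    candidates = [r for r in resources if requirements(r)]
    return sorted(candidates, key=rank, reverse=True)


if __name__ == "__main__":
    for resource in match(RESOURCES, requirements, rank):
        print(resource["name"], resource["MaxCpuTime"])
```

Here ce-b is dropped for having too few free CPUs and ce-d for its architecture, so the job would go to ce-c first and ce-a second.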
Job Submission Scenario [Diagram, repeated on each of the following slides] Components: UI (with JDL), Resource Broker (RB), Replica Catalogue (RC), Information Service (IS), Job Submission Service (JSS), Logging & Bookkeeping (LB), Compute Element (CE), Storage Element (SE)
Example Job Status (one diagram frame per state; labels shown in that frame in parentheses) • submitted (Input Sandbox, Job Submit Event) • waiting • ready • scheduled (BrokerInfo) • running (Input Sandbox, Job Status) • done (Job Status) • outputready (Output Sandbox, Job Status) • cleared (Output Sandbox)
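The job status sequence walked through on these slides can be written down as a small state machine. The sketch below assumes the sequence is strictly linear, as drawn in the scenario; any failure or abort states that the real system may have are not shown on the slides and are left out.

```python
from typing import List, Optional, Sequence

# Job states in the order in which the scenario slides show them.
JOB_STATES: List[str] = [
    "submitted",
    "waiting",
    "ready",
    "scheduled",
    "running",
    "done",
    "outputready",
    "cleared",
]


def next_state(state: str) -> Optional[str]:
    """Return the state that follows `state`, or None after the last one."""
    index = JOB_STATES.index(state)  # raises ValueError for unknown states
    if index + 1 < len(JOB_STATES):
        return JOB_STATES[index + 1]
    return None


def is_valid_history(history: Sequence[str]) -> bool:
    """True if `history` starts at 'submitted' and skips no state."""
    return list(history) == JOB_STATES[:len(history)]


if __name__ == "__main__":
    state: Optional[str] = "submitted"
    while state is not None:
        print(state)
        state = next_state(state)
```

A check such as `is_valid_history` is the kind of consistency test a bookkeeping service could run on logged events, under the linearity assumption above.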
Data Access & Transport • Requirements • Support single sign-on • Transfer large files quickly • Confidentiality/integrity • Integrated with information systems (RC) • Extensions to FTP protocol: GridFTP • GSI, DCAU • Server striping, parallel streams • TCP protocol optimisation not trivial!
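The idea behind parallel streams is to cut a file into blocks and move the blocks concurrently. The sketch below illustrates that idea only, using in-memory bytes and threads. It is not the GridFTP protocol, and the equal-sized contiguous blocks are an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(file_size: int, streams: int) -> List[Tuple[int, int]]:
    """Split `file_size` bytes into at most `streams` (offset, length) blocks."""
    if file_size < 0 or streams < 1:
        raise ValueError("file_size must be >= 0 and streams >= 1")
    base, extra = divmod(file_size, streams)
    ranges: List[Tuple[int, int]] = []
    offset = 0
    for i in range(streams):
        # The first `extra` blocks carry one additional byte.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue
        ranges.append((offset, length))
        offset += length
    return ranges


def parallel_copy(source: bytes, streams: int) -> bytes:
    """Copy `source` block by block, one worker thread per block."""
    target = bytearray(len(source))

    def copy_block(block: Tuple[int, int]) -> None:
        offset, length = block
        # Blocks are disjoint, so workers never write to the same bytes.
        target[offset:offset + length] = source[offset:offset + length]

    with ThreadPoolExecutor(max_workers=streams) as pool:
        list(pool.map(copy_block, split_ranges(len(source), streams)))
    return bytes(target)


if __name__ == "__main__":
    print(split_ranges(10, 3))
    print(parallel_copy(b"abcdefghij", 3))
```

Because each block carries its own offset, the receiver can write blocks in whatever order they arrive, which is what makes several streams usable for a single file.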