
Open Science Grid and Applications





Presentation Transcript


  1. Open Science Grid and Applications. Bockjoo Kim, U of Florida @ KISTI, July 5, 2007

  2. An Overview of OSG 07/05/2007

  3. What is the OSG? • A scientific grid consortium and project • Relies on the commitments of the participants • Shares common goals and vision with other projects • An evolution of Grid3 • Provides benefit to large-scale science in the US

  4. Driving Principles for OSG • Simple and flexible • Built from the bottom up • Coherent but heterogeneous • Performing and persistent • Maximize eventual commonality • Principles apply end-to-end

  5. Virtual Organization in OSG

  6. Timeline (1999-2009, diagram): PPDG (DOE, from 1999), GriPhyN (NSF, from 2000) and iVDGL (NSF, from 2001), together "Trillium", led to Grid3 and then the OSG Consortium; in parallel: LIGO preparation and operation, LHC construction/preparation and operations, the European Grid + Worldwide LHC Computing Grid, and campus and regional grids.

  7. Levels of Participation • Participating in the OSG Consortium • Using the OSG • Sharing Resources on OSG => Either or both with minimal entry threshold • Becoming a Stakeholder • All (large scale) users & providers are stakeholders • Determining the Future of OSG • Council Members determine the future • Taking on Responsibility for OSG Operations • OSG Project is responsible for OSG Operations

  8. OSG Architecture and How to Use OSG

  9. OSG : A Grid of Sites/Facilities • IT Departments at Universities & National Labs make their hardware resources available via OSG interfaces. • CE: (modified) pre-ws GRAM • SE: SRM for large volume, gftp & (N)FS for small volume • Today’s scale: • 20-50 “active” sites (depending on definition of “active”) • ~ 5000 batch slots • ~ 1000TB storage • ~ 10 “active” sites with shared 10Gbps or better connectivity • Expected Scale for End of 2008 • ~50 “active” sites • ~30-50,000 batch slots • Few PB of storage • ~ 25-50% of sites with shared 10Gbps or better connectivity

  10. OSG Components: Compute Element • Globus GRAM interface (pre-WS) which supports many different local batch systems • Priorities and policies: through VO role mapping, batch queue priorities are set according to site policies and priorities • Gateway stack (diagram): OSG Base (OSG 0.6.0), OSG Environment/Publication, OSG Monitoring/Accounting, EGEE Interop on the OSG gateway machine + services, connected to the network & other OSG resources • Sites range from ~20-CPU department computers to 10,000-CPU supercomputers • Jobs run under any local batch system

  11. Disk Areas in an OSG site • Shared filesystem as applications area at site. • Read only from compute cluster. • Role based installation via GRAM. • Batch slot specific local work space. • No persistency beyond batch slot lease. • Not shared across batch slots. • Read & write access (of course). • SRM/gftp controlled data area. • “persistent” data store beyond job boundaries. • Job related stage in/out. • SRM v1.1 today. • SRM v2.2 expected in Q2 2007 (space reservation).

  12. OSG Components: Storage Element • Storage services: access storage through the Storage Resource Manager (SRM) interface and GridFTP • (Typically) VO oriented: allocation of shared storage through agreements between site and VO(s), facilitated by OSG • Interfaces (diagram): gsiftp://mygridftp.nowhere.edu and srm://myse.nowhere.edu (the srm protocol is analogous to https) on the OSG SE gateway, connected to the network & other OSG resources • Backends range from a 20 GByte disk cache to 4 Petabyte robotic tape systems; any shared storage, e.g., dCache

  13. Authentication and Authorization • OSG Responsibilities • X509 based middleware • Accounts may be dynamic/static, shared/FQAN-specific • VO Responsibilities • Instantiate VOMS • Register users & define/manage their roles • Site Responsibilities • Choose security model (what accounts are supported) • Choose VOs to allow • Default accept of all users in VO but individuals or groups within VO can be denied.
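The site-side "default accept, with per-user denial" model above can be sketched as a small decision function. This is an illustration only, not the real GUMS/VOMS logic; the configuration data (`supported`, `banned`) is invented:

```python
def is_authorized(vo, user, supported_vos, banned_by_vo):
    """Default-accept: any member of a supported VO is allowed in,
    unless the site has explicitly denied that individual."""
    if vo not in supported_vos:                 # site chose not to allow this VO
        return False
    if user in banned_by_vo.get(vo, set()):     # per-individual denial within a VO
        return False
    return True

# Hypothetical site configuration.
supported = {"cms", "ligo"}
banned = {"cms": {"mallory"}}

print(is_authorized("cms", "alice", supported, banned))    # True
print(is_authorized("cms", "mallory", supported, banned))  # False (denied individual)
print(is_authorized("dzero", "bob", supported, banned))    # False (VO not supported)
```

The point of the shape: the admission decision is VO-granular first, with individual denials layered on top, matching the slide's "default accept of all users in VO" policy.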

  14. User Management • User obtains a cert from a CA that is vetted by TAGPMA • User registers with a VO and is added to the VO's VOMS. • VO responsible for registration of its VOMS with the OSG GOC. • VO responsible for users signing the AUP. • VO responsible for VOMS operations. • VOMS shared for operations on multiple grids globally by some VOs. • Default OSG VO exists for new communities & single PIs. • Sites decide which VOs to support (striving for default admit) • Site populates GUMS daily from the VOMSes of all VOs • Site chooses uid policy for each VO & role • Dynamic vs static vs group accounts • Users use whatever services the VO provides in their support • VOs generally hide the grid behind a portal • Any and all support is the responsibility of the VO • Helping its users • Responding to complaints from grid sites about its users.
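The "uid policy per VO & role" choice (dynamic vs static vs group accounts) can be sketched as a lookup table. Everything here is hypothetical, the table, names and account patterns are invented; a real site would express this in its GUMS configuration:

```python
import itertools

# Hypothetical uid policy per (VO, role): a fixed static account,
# a shared group account, or a dynamically assigned pool account.
POLICY = {
    ("cms", "production"): ("static", "cmsprod"),
    ("cms", None):         ("pool",   "cms{:03d}"),
    ("osg", None):         ("group",  "osgvo"),
}
_pool = itertools.count(1)   # stand-in for a pool-account allocator

def local_account(vo, role=None):
    """Pick the local account for a mapped grid identity."""
    kind, pattern = POLICY.get((vo, role), POLICY.get((vo, None), (None, None)))
    if kind is None:
        raise PermissionError(f"no mapping for VO {vo!r}")
    if kind == "pool":
        return pattern.format(next(_pool))   # cms001, cms002, ...
    return pattern

print(local_account("cms", "production"))  # cmsprod
print(local_account("cms"))                # cms001
print(local_account("osg"))                # osgvo
```

Role-specific entries take precedence over the VO's default mapping, mirroring the slide's "Site chooses uid policy for each VO & role".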

  15. Resource Management • Many resources are owned by or statically allocated to one user community • The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs) • The remainder of an organization’s available resources can be “used by everyone or anyone else” • An organization can decide against supporting particular VOs. • OSG staff are responsible for monitoring and, if needed, managing this usage • Our challenge is to maximize good (successful) output from the whole system

  16. Applications and Runtime Model • Condor-G client • Pre-WS or WS GRAM as site gateway • Priority through VO role and policy, moderated by site policy • User-specific portion comes with the job • VO-specific portion is preinstalled and published • CPU access policies vary from site to site • Ideal runtime ~ O(hours) • Small enough not to lose too much work to preemption policies. • Large enough to be efficient despite the long scheduling times of grid middleware.
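The "ideal runtime ~ O(hours)" trade-off can be made concrete with a toy model. The numbers below (scheduling overhead per job, preemption probability per hour) are assumed for illustration, not measured OSG values:

```python
def efficiency(runtime_h, sched_overhead_h=0.25, preempt_rate_per_h=0.02):
    """Fraction of wall time spent on useful work, assuming a preempted
    job loses all its work and each slot lease costs sched_overhead_h
    of grid-middleware scheduling latency (illustrative model only)."""
    p_survive = (1 - preempt_rate_per_h) ** runtime_h   # chance of no preemption
    useful = runtime_h * p_survive                      # expected useful hours
    return useful / (runtime_h + sched_overhead_h)

# Very short jobs are dominated by scheduling overhead; very long jobs
# are increasingly likely to be preempted and lose everything.
for h in (0.1, 2, 24):
    print(h, round(efficiency(h), 2))
```

Under these assumptions efficiency peaks for jobs of a few hours, which is the slide's point: short enough to survive preemption, long enough to amortize scheduling latency.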

  17. Simple Workflow • Install application software at site(s) • VO admins install via GRAM. • VO users have read-only access from batch slots. • “Download” data to site(s) • VO admins move data via SRM/gftp. • VO users have read-only access from batch slots. • Submit job(s) to site(s) • VO users submit job(s)/DAGs via Condor-G. • Jobs run in batch slots, writing output to local disk. • Jobs copy output from local disk to the SRM/gftp data area. • Collect output from site(s) • VO users collect output from site(s) via SRM/gftp as part of the DAG.
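The workflow above is a small dependency graph: jobs can only run once both the application and the data are staged, and collection follows the jobs. A minimal sketch, with invented step names standing in for the real GRAM/SRM/Condor-G operations:

```python
# Each step lists the steps it depends on (illustrative names).
STEPS = {
    "install_app": [],               # VO admins install software via GRAM
    "stage_in":    [],               # VO admins move data via SRM/gftp
    "run_jobs":    ["install_app", "stage_in"],   # Condor-G submission
    "collect_out": ["run_jobs"],     # fetch results via SRM/gftp
}

def topo_order(deps):
    """Return the steps in an order that respects every dependency."""
    order, done = [], set()
    def visit(step):
        if step in done:
            return
        for d in deps[step]:
            visit(d)                 # dependencies first
        done.add(step)
        order.append(step)
    for step in deps:
        visit(step)
    return order

print(topo_order(STEPS))  # ['install_app', 'stage_in', 'run_jobs', 'collect_out']
```

This is exactly what a Condor DAGMan submission encodes for the VO: the install and stage-in steps can run in parallel, and the DAG gates job execution and output collection behind them.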

  18. Late Binding (A Strategy) • The grid is a hostile environment: • Scheduling policies are unpredictable • Many sites preempt, and only idle resources are free • Inherent diversity of Linux variants • Not everybody is truthful in their advertisements • Submit “pilot” jobs instead of user jobs • Bind a user job to a pilot only after a batch slot at a site is successfully leased and “sanity checked”. • Re-bind user jobs to a new pilot upon failure.
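The pilot idea can be sketched in a few lines: failures are absorbed by pilots, and a user job is only consumed from the queue once a slot has actually checked out. The failure model below (a random 30% of leases failing their sanity check) is invented for illustration:

```python
import random

random.seed(1)  # deterministic for the example

def lease_slot():
    """Stand-in for a pilot leasing a batch slot and sanity-checking it;
    here an assumed 70% of leases succeed."""
    return random.random() > 0.3

def run_late_bound(user_jobs, max_pilots=50):
    """Bind each user job to a pilot only after its slot checks out;
    a failed pilot is simply discarded and the job stays queued."""
    done, queue = [], list(user_jobs)
    for _ in range(max_pilots):
        if not queue:
            break
        if lease_slot():                # slot leased and sanity-checked
            done.append(queue.pop(0))   # late-bind the next waiting job
        # else: pilot failed; no user job was charged with the failure
    return done

print(run_late_bound(["job%d" % i for i in range(5)]))
```

The key property is visible in the loop: a bad slot costs a pilot, never a user job, which is why late binding copes with unpredictable scheduling and untruthful advertisements.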

  19. OSG Activities

  20. OSG Activity Breakdown • Software (UW) – provide a software stack that meets the needs of OSG sites, OSG VOs and OSG operation while supporting interoperability with other national and international cyberinfrastructures • Integration (UC) – verify, test and evaluate the OSG software • Operation (IU) – coordinate the OSG sites, monitor the facility, and maintain and operate centralized services • Security (FNAL) – define and evaluate procedures and the software stack to prevent unauthorized activities and minimize interruption in service due to security concerns • Troubleshooting (UIOWA) – help sites and VOs identify and resolve unexpected behavior of the OSG software stack • Engagement (RENCI) – identify VOs and sites that can benefit from joining the OSG and “hold their hand” while they become productive members of the OSG community • Resource Management (UF) – manages resources • Facility Management (UW) – overall facility coordination.

  21. OSG Facility Management Activities • Led by Miron Livny (Wisconsin, Condor) • Help sites join the facility and enable effective guaranteed and opportunistic usage of their resources by remote users • Help VOs join the facility and enable effective guaranteed and opportunistic harnessing of remote resources • Identify (through active engagement) new sites and VOs

  22. OSG Software Activities • Package the Virtual Data Toolkit (led by the Wisconsin Condor team) • Requires local building and testing of all components • Tools for incremental installation • Tools for verification of configuration • Tools for functional testing • Integration of the OSG stack • Verification Testbed (VTB) • Integration Testbed (ITB) • Deployment of the OSG stack • Build and deploy Pacman caches

  23. OSG Software Release Process (diagram): input from stakeholders and OSG directors → test on the OSG Validation Testbed → VDT release → OSG Integration Testbed release → OSG Production release

  24. How Much Software? • 15 Linux-like platforms supported • ~45 components built on 8 platforms

  25. OSG Security Activities • Infrastructure is X509-certificate based • Operational security a priority • Exercise incident response • Prepare signed agreements, template policies • Audit, assess and train

  26. Operations & Troubleshooting Activities • Well-established Grid Operations Center at Indiana University • User support is distributed, including osg-general@opensciencegrid.org community support. • Site coordinator supports the team of sites. • Accounting and site validation are required services of sites. • Troubleshooting (U Iowa) looks at targeted end-to-end problems • Partnering with the LBNL troubleshooting work for auditing and forensics.

  27. OSG and Related Grids

  28. Campus Grids • Sharing across compute clusters is a change and a challenge for many Universities. • OSG, TeraGrid, Internet2, Educause working together

  29. OSG and TeraGrid • Complementary and interoperating infrastructures

  30. International Activities • Interoperate with Europe for large physics users. • Deliver the US-based infrastructure for the Worldwide LHC Computing Grid (WLCG) in support of the LHC experiments. • Include off-shore sites when approached. • Help bring common interfaces and best practices to the standards forums.

  31. Applications and Status of Utilization

  32. Particle Physics and Computing • Science driver: event rate = luminosity × cross-section • LHC revolution starting in 2008 • Luminosity × 10 • Cross-section × 150 (e.g. top quark) • Computing challenge • 20 PB in the first year of running • ~100 MSpecInt2000, close to 100,000 cores
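The slide's scaling argument, spelled out: the event rate for a process is luminosity times cross-section, so the quoted ×10 luminosity jump combined with the ×150 cross-section gain (for top quarks at the LHC) multiplies the yield by 1500. The numbers restate the slide; nothing else is assumed:

```python
def rate(luminosity, cross_section):
    """Event rate = luminosity x cross-section (schematic, units elided)."""
    return luminosity * cross_section

lumi_gain, xsec_gain = 10, 150   # the slide's LHC gains
rate_gain = rate(lumi_gain, xsec_gain)
print(rate_gain)  # 1500
```

That three-orders-of-magnitude jump in event yield is what drives the 20 PB first-year data volume and the ~100,000-core compute requirement quoted above.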

  33. CMS Experiment (p-p collision particle physics experiment) • Site map (diagram) spanning OSG and EGEE: CERN, USA@FNAL, Taiwan, UK, Italy, France, Germany, Purdue, Wisconsin, UCSD, Caltech, Florida, UNL, MIT • Data & jobs move locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to global.

  34. CMS Data Analysis

  35. Opportunistic Resource Use • In Nov ‘06 D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events) for science results for the summer conferences in July ‘07. • The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; We also looked into the local storage and I/O needs. • The Council members agreed to contribute resources to meet this request.

  36. D0 Throughput (charts: D0 event throughput; D0 OSG CPU-hours per week)

  37. Lessons Learned from the D0 Case • Consortium members contributed significant opportunistic resources, as promised. • VOs can use a significant number of sites they “don’t own” to achieve a large effective throughput. • Combined teams make large production runs effective. • How does this scale? • How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.

  38. Use Cases from Other Disciplines • Rosetta@Kuhlman lab (protein research): in production across ~15 sites since April • Weather Research Forecast: MPI job running on 1 OSG site; more to come • CHARMM molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease • Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid; runs BLAST. • NanoHUB at Purdue: BioMOCA and nanowire production.

  39. OSG Usage by Numbers • 39 virtual communities • 6 VOs with >1000 jobs max. (5 particle physics & 1 campus grid) • 4 VOs with 500-1000 max. (two outside physics) • 10 VOs with 100-500 max. (campus grids and physics)

  40. Running Jobs During Last Year

  41. Jobs Running at Sites • >1k max: 5 sites • >0.5k max: 10 sites • >100 max: 29 sites • Total: 47 sites • Many small sites, or with mostly local activity.

  42. CMS Transfers on OSG in June ‘06 • Peak of 450 MByte/sec (chart) • All CMS sites exceeded 5 TB per day in June 2006; Caltech, Purdue, UCSD, UFL, UW exceeded 10 TB/day.


  44. Summary • OSG facility utilization is steadily increasing • ~2,000-4,500 jobs running at all times • HEP, astro, nuclear physics, but also bio/eng/med • Constant effort/troubleshooting is being invested to make OSG usable, robust and performant. • Demonstrating its use to other sciences. • Trying to bring campuses into a pervasive distributed infrastructure. • Bringing research into a ubiquitous appreciation of the value of (distributed, opportunistic) computation • Educating people to utilize the resources

  45. Out-of-Bound Slides

  46. Principle: Simple and Flexible • The OSG architecture will follow the principles of symmetry and recursion wherever possible • This principle guides us in our approaches to: • Support for hierarchies of, and property inheritance among, VOs. • Federations and interoperability of grids (grids of grids). • Treatment of policies and resources.

  47. Principle: Coherent but Heterogeneous • The OSG architecture is VO based; most services are instantiated within the context of a VO. • This principle guides us in our approaches to: • Scope of namespaces & action of services. • Definition of, and services in support of, an OSG-wide VO. • No concept of “global” scope. • Support for new and dynamic VOs should be lightweight.

  48. OSG Security Activities (continued) • Trust-relationship diagram between the user, the VO infrastructure, and the site (CE, worker nodes, storage, data): the site trusts that it is the VO (or its agent), that it is the user, that the job is for the VO, and that it is the user’s job.

  49. Principle: Bottom Up / Persistency • All services should function and operate in the local environment when disconnected from the OSG environment. • This principle guides us in our approaches to: • The architecture of services, e.g. services are required to manage their own state, ensure their internal state is consistent, and report their state accurately. • Development and execution of applications in a local context, without an active connection to the distributed services.

  50. Principle: Commonality • OSG will provide baseline services and a reference implementation. • The infrastructure will support incremental upgrades. • The OSG infrastructure should have minimal impact on a site. • Services that must run with superuser privileges will be minimized. • Users are not required to interact directly with resource providers.
