Open Science Grid and Applications. Bockjoo Kim, U of Florida, @ KISTI on July 5, 2007
An Overview of OSG
What is the OSG? • A scientific grid consortium and project • Relies on the commitments of the participants • Shares common goals and vision with other projects • An evolution of Grid3 • Provides benefits to large-scale science in the US
Driving Principles for OSG • Simple and flexible • Built from the bottom up • Coherent but heterogeneous • Performant and persistent • Maximize eventual commonality • Principles apply end-to-end
Virtual Organization in OSG
Timeline (1999-2009): PPDG (DOE), GriPhyN (NSF), and iVDGL (NSF) together formed Trillium, which led to Grid3 and then to the OSG Consortium. In parallel: LIGO preparation and operation, LHC construction, preparation, and operations, the European Grid plus the Worldwide LHC Computing Grid, and campus and regional grids.
Levels of Participation • Participating in the OSG Consortium • Using the OSG • Sharing Resources on OSG => Either or both with minimal entry threshold • Becoming a Stakeholder • All (large scale) users & providers are stakeholders • Determining the Future of OSG • Council Members determine the Future • Taking on Responsibility for OSG Operations • OSG Project is responsible for OSG Operations
OSG Architecture and How to Use OSG
OSG: A Grid of Sites/Facilities • IT departments at universities & national labs make their hardware resources available via OSG interfaces. • CE: (modified) pre-WS GRAM • SE: SRM for large volumes, gftp & (N)FS for small volumes • Today's scale: • 20-50 "active" sites (depending on the definition of "active") • ~5,000 batch slots • ~1,000 TB of storage • ~10 "active" sites with shared 10 Gbps or better connectivity • Expected scale for end of 2008: • ~50 "active" sites • ~30-50,000 batch slots • A few PB of storage • ~25-50% of sites with shared 10 Gbps or better connectivity
OSG Components: Compute Element • Globus GRAM interface (pre-WS) which supports many different local batch systems • Priorities and policies: through VO role mapping and batch queue priority settings according to site policies and priorities • Diagram: an OSG gateway machine + services (OSG base release 0.6.0, OSG environment/publication, OSG monitoring/accounting, EGEE interoperability) connects the network and other OSG resources to local clusters ranging from ~20-CPU department computers to 10,000-CPU supercomputers; jobs run under any local batch system
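A minimal sketch of exercising such a CE directly with the pre-WS GRAM client tools; the CE contact string is a hypothetical placeholder, and a valid grid proxy is assumed to already exist:

```python
# Probe a pre-WS GRAM compute element with the standard Globus client.
import subprocess

CE_CONTACT = "osg-ce.example.edu/jobmanager-condor"  # hypothetical gatekeeper contact string

def probe_ce(contact: str) -> str:
    """Run a trivial payload through GRAM to confirm the gatekeeper and batch system respond."""
    result = subprocess.run(
        ["globus-job-run", contact, "/bin/hostname"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print("Worker node reached via GRAM:", probe_ce(CE_CONTACT))
```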
Disk Areas in an OSG site • Shared filesystem as applications area at site. • Read only from compute cluster. • Role based installation via GRAM. • Batch slot specific local work space. • No persistency beyond batch slot lease. • Not shared across batch slots. • Read & write access (of course). • SRM/gftp controlled data area. • “persistent” data store beyond job boundaries. • Job related stage in/out. • SRM v1.1 today. • SRM v2.2 expected in Q2 2007 (space reservation).
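A minimal sketch of how a batch job might use these three areas; the OSG_APP / OSG_DATA / OSG_WN_TMP variable names follow common OSG site conventions and should be checked against what a given site actually publishes:

```python
# Use the application area, the local work space, and the persistent data area from a batch slot.
import os
import shutil

app_area  = os.environ.get("OSG_APP", "/opt/osg/app")     # shared, read-only VO software area
data_area = os.environ.get("OSG_DATA", "/opt/osg/data")   # SRM/gftp-controlled persistent data area
scratch   = os.environ.get("OSG_WN_TMP", "/tmp")          # batch-slot local work space, no persistency

workdir = os.path.join(scratch, "myjob")
os.makedirs(workdir, exist_ok=True)

# Stage input into the local work space, run the pre-installed VO application there,
# then return the result to the persistent data area for later SRM/gftp pickup.
shutil.copy(os.path.join(data_area, "input.dat"), workdir)
os.system(f"{app_area}/myvo/bin/analyze {workdir}/input.dat > {workdir}/result.out")
shutil.copy(os.path.join(workdir, "result.out"), data_area)
```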
OSG Components: Storage Element • Storage services: access storage through the storage resource manager (SRM) interface and GridFTP • (Typically) VO oriented: allocation of shared storage through agreements between site and VO(s), facilitated by OSG • Diagram: an OSG SE gateway (srm://myse.nowhere.edu, gsiftp://mygridftp.nowhere.edu; the SRM protocol plays a role analogous to HTTPS) fronts any shared storage, e.g., dCache, from a 20 GByte disk cache to 4 Petabyte robotic tape systems, connecting it to the network and other OSG resources
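A minimal sketch of staging a file into such an SE through its GridFTP door; the endpoints reuse the hypothetical hostnames above, and the storage path, port, and client options are assumptions that vary by site and SRM version:

```python
# Move a file into an OSG storage element via GridFTP; the SRM URL is shown for SRM clients.
import subprocess

LOCAL_FILE = "file:///home/me/dataset.root"
GFTP_URL   = "gsiftp://mygridftp.nowhere.edu/data/myvo/dataset.root"      # hypothetical path
SRM_URL    = "srm://myse.nowhere.edu:8443/data/myvo/dataset.root"         # hypothetical port/path

# GridFTP transfer (requires a valid grid proxy):
subprocess.run(["globus-url-copy", LOCAL_FILE, GFTP_URL], check=True)

# An SRM-managed transfer would instead go through an SRM client such as srmcp,
# addressing the same file by its SRM URL.
print("staged", LOCAL_FILE, "->", GFTP_URL, "(SRM view:", SRM_URL + ")")
```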
Authentication and Authorization • OSG Responsibilities • X509 based middleware • Accounts may be dynamic/static, shared/FQAN-specific • VO Responsibilities • Instantiate VOMS • Register users & define/manage their roles • Site Responsibilities • Choose security model (what accounts are supported) • Choose VOs to allow • Default accept of all users in VO but individuals or groups within VO can be denied.
User Management • User obtains a CERT from a CA that is vetted by the TAGPMA • User registers with a VO and is added to the VO's VOMS. • VO responsible for registration of its VOMS with the OSG GOC. • VO responsible for users signing the AUP. • VO responsible for VOMS operations. • VOMS shared for operations on multiple grids globally by some VOs. • Default OSG VO exists for new communities & single PIs. • Sites decide which VOs to support (striving for default admit) • Site populates GUMS daily from the VOMSes of all VOs • Site chooses uid policy for each VO & role • Dynamic vs static vs group accounts • User uses whatever services the VO provides in support of users • VOs generally hide the grid behind a portal • Any and all support is the responsibility of the VO • Helping its users • Responding to complaints from grid sites about its users.
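A minimal sketch of the user-side credential step: creating a VOMS proxy that carries VO membership and a role, then inspecting it. The VO name and role below are placeholders:

```python
# Create and inspect a VOMS proxy for grid work (requires the user's X509 certificate
# and prior registration in the VO's VOMS server).
import subprocess

VO_AND_ROLE = "myvo:/myvo/Role=production"   # hypothetical FQAN request

subprocess.run(["voms-proxy-init", "-voms", VO_AND_ROLE], check=True)  # create the proxy
subprocess.run(["voms-proxy-info", "-all"], check=True)                # show attributes and lifetime
```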
Resource Management • Many resources are owned by or statically allocated to one user community • The institutions that own resources typically have ongoing relationships with (a few) particular user communities (VOs) • The remainder of an organization's available resources can be "used by everyone or anyone else" • An organization can decide against supporting particular VOs. • OSG staff are responsible for monitoring and, if needed, managing this usage • Our challenge is to maximize good - successful - output from the whole system
Applications and Runtime Model • Condor-G client • Pre-WS or WS GRAM as site gateway • Priority through VO role and policy, moderated by site policy • User-specific portion comes with the job • VO-specific portion is preinstalled and published • CPU access policies vary from site to site • Ideal runtime ~ O(hours) • Small enough not to lose too much due to preemption policies. • Large enough to be efficient despite long scheduling times of grid middleware.
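A minimal sketch of a Condor-G submission matching this model; the CE hostname, jobmanager, and file names are placeholders, and "gt2" selects pre-WS GRAM in Condor's grid universe:

```python
# Write a Condor-G submit description targeting a pre-WS GRAM gateway and submit it.
import subprocess
from textwrap import dedent

submit = dedent("""\
    universe      = grid
    grid_resource = gt2 osg-ce.example.edu/jobmanager-condor
    executable    = analyze.sh
    arguments     = run42
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue
    """)

with open("analyze.sub", "w") as f:
    f.write(submit)

subprocess.run(["condor_submit", "analyze.sub"], check=True)
```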
Simple Workflow • Install application software at site(s) • VO admins install via GRAM. • VO users have read-only access from batch slots. • "Download" data to site(s) • VO admins move data via SRM/gftp. • VO users have read-only access from batch slots. • Submit job(s) to site(s) • VO users submit job(s)/DAGs via Condor-G. • Jobs run in batch slots, writing output to local disk. • Jobs copy output from local disk to the SRM/gftp data area. • Collect output from site(s) • VO users collect output from site(s) via SRM/gftp as part of the DAG.
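A minimal sketch of this workflow expressed as a small Condor DAG (stage in, analyze, stage out); the node submit files are assumed to exist and the names are illustrative only:

```python
# Build a three-node DAG and hand it to DAGMan, which enforces ordering and retries failed nodes.
import subprocess
from textwrap import dedent

dag = dedent("""\
    JOB  STAGEIN  stagein.sub
    JOB  ANALYZE  job.sub
    JOB  STAGEOUT stageout.sub
    PARENT STAGEIN CHILD ANALYZE
    PARENT ANALYZE CHILD STAGEOUT
    """)

with open("workflow.dag", "w") as f:
    f.write(dag)

subprocess.run(["condor_submit_dag", "workflow.dag"], check=True)
```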
Late Binding (A Strategy) • The grid is a hostile environment: • Scheduling policies are unpredictable • Many sites preempt, and only idle resources are free • Inherent diversity of Linux variants • Not everybody is truthful in their advertisement • Submit "pilot" jobs instead of user jobs • Bind a user job to a pilot only after a batch slot at a site has been successfully leased and "sanity checked". • Re-bind user jobs to a new pilot upon failure.
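A minimal sketch of the pilot idea: the job that lands in the batch slot is a generic wrapper that sanity-checks the slot and only then pulls real work from a VO task queue. The queue URL and the specific checks are hypothetical:

```python
# Generic pilot wrapper: validate the slot, then late-bind user jobs pulled from a VO queue.
import os
import shutil
import subprocess
import urllib.request

TASK_QUEUE = "https://tasks.myvo.example.org/next"   # hypothetical VO task queue

def slot_is_sane() -> bool:
    """Reject slots that cannot run VO payloads (checks are illustrative)."""
    return shutil.disk_usage("/tmp").free > 2 * 1024**3 and os.path.exists("/bin/bash")

def fetch_payload() -> str:
    """Ask the VO task queue for the next user job (returned here as a shell command)."""
    with urllib.request.urlopen(TASK_QUEUE) as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    if not slot_is_sane():
        raise SystemExit("slot failed sanity check; returning it unused")
    while True:
        cmd = fetch_payload()
        if not cmd:
            break                          # no more user jobs: release the leased slot
        subprocess.run(cmd, shell=True)    # a payload failure does not kill the pilot
```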
OSG Activities
OSG Activity Breakdown • Software (UW) – Provide a software stack that meets the needs of OSG sites, OSG VOs, and OSG operation while supporting interoperability with other national and international cyberinfrastructures • Integration (UC) – Verify, test, and evaluate the OSG software • Operation (IU) – Coordinate the OSG sites, monitor the facility, and maintain and operate centralized services • Security (FNAL) – Define and evaluate procedures and the software stack to prevent unauthorized activities and minimize interruption in service due to security concerns • Troubleshooting (UIOWA) – Help sites and VOs identify and resolve unexpected behavior of the OSG software stack • Engagement (RENCI) – Identify VOs and sites that can benefit from joining the OSG and "hold their hand" while they become productive members of the OSG community • Resource Management (UF) – Manage resources • Facility Management (UW) – Overall facility coordination.
OSG Facility Management Activities • Led by Miron Livny (Wisconsin, Condor) • Help sites join the facility and enable effective guaranteed and opportunistic usage of their resources by remote users • Help VOs join the facility and enable effective guaranteed and opportunistic harnessing of remote resources • Identify (through active engagement) new sites and VOs
OSG Software Activities • Package the Virtual Data Toolkit (led by the Wisconsin Condor team) • Requires local building and testing of all components • Tools for incremental installation • Tools for verification of configuration • Tools for functional testing • Integration of the OSG stack • Verification Testbed (VTB) • Integration Testbed (ITB) • Deployment of the OSG stack • Build and deploy Pacman caches
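A minimal sketch of a Pacman-based install of the CE package set; the cache label and target follow the convention of OSG install guides of this era, but the exact names depend on the release:

```python
# Deploy the OSG software stack (VDT underneath) from a Pacman cache into a fresh directory.
import os
import subprocess

install_dir = "/opt/osg-0.6.0"            # illustrative install location
os.makedirs(install_dir, exist_ok=True)
os.chdir(install_dir)

# Pull the Compute Element package set from the OSG Pacman cache.
subprocess.run(["pacman", "-get", "OSG:ce"], check=True)

# Site-specific configuration and verification steps would follow here.
```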
OSG Software Release Process: input from stakeholders and OSG directors, together with testing on the OSG Validation Testbed, feeds the VDT release; this goes to the OSG Integration Testbed release and then to the OSG production release.
How Much Software? • 15 Linux-like platforms supported • ~45 components built on 8 platforms
OSG Security Activities • Infrastructure is X509-certificate based • Operational security is a priority • Exercise incident response • Prepare signed agreements, template policies • Audit, assess, and train
Operations & Troubleshooting Activities • Well-established Grid Operations Center at Indiana University • User support is distributed, including community support via osg-general@opensciencegrid.org. • A site coordinator supports the team of sites. • Accounting and site validation are required services at sites. • Troubleshooting (U Iowa) looks at targeted end-to-end problems • Partnering with the LBNL troubleshooting work for auditing and forensics.
OSG and Related Grids
Campus Grids • Sharing across compute clusters is a change and a challenge for many Universities. • OSG, TeraGrid, Internet2, Educause working together
OSG and TeraGrid: complementary and interoperating infrastructures
International Activities • Interoperate with Europe for large physics user communities. • Deliver the US-based infrastructure for the Worldwide LHC Computing Grid (WLCG) collaboration in support of the LHC experiments. • Include off-shore sites when approached. • Help bring common interfaces and best practices to the standards forums.
Applications and Status of Utilization
Particle Physics and Computing • Science driver: event rate = luminosity x cross section • LHC revolution starting in 2008 • Luminosity x 10 • Cross section x 150 (e.g., top quark) • Computing challenge • 20 PB in the first year of running • ~100 MSpecInt2000, close to 100,000 cores
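A back-of-envelope check of these numbers; the per-core rating is my assumption (roughly typical of 2007-era CPUs), not a figure from the slide:

```python
# Event-rate gain and core-count estimate implied by the slide's numbers.
luminosity_gain    = 10
cross_section_gain = 150            # e.g. top-quark production at the LHC
print("event-rate gain for top quarks:", luminosity_gain * cross_section_gain)   # ~1500x

total_si2k    = 100e6               # ~100 MSpecInt2000 of computing
si2k_per_core = 1000                # assumed ~1000 SpecInt2000 per core (2007-era)
print("cores needed:", int(total_si2k / si2k_per_core))                          # ~100,000
```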
CMS Experiment (p-p collision particle physics experiment). Diagram: CMS spans OSG and EGEE, from CERN to USA@FNAL, France, Germany, Italy, UK, Taiwan, and US sites (Caltech, Florida, MIT, Purdue, UCSD, UNL, Wisconsin). Data & jobs move locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to global scale.
CMS Data Analysis
Opportunistic Resource Use • In Nov '06, D0 asked to use 1500-2000 CPUs for 2-4 months to re-process an existing dataset (~500 million events) for science results at the summer conferences in July '07. • The Executive Board estimated there were sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs. • The Council members agreed to contribute resources to meet this request.
D0 Throughput. Plots: D0 event throughput and D0 OSG CPU hours per week.
Lessons Learned from the D0 Case • Consortium members contributed significant opportunistic resources as promised. • VOs can use a significant number of sites they "don't own" to achieve a large effective throughput. • Combined teams make large production runs effective. • How does this scale? • How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.
Use Cases by Other Disciplines • Rosetta@Kuhlman lab (protein research): in production across ~15 sites since April • Weather Research and Forecasting (WRF): MPI job running on 1 OSG site; more to come • CHARMM molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease • Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid; runs BLAST. • nanoHUB at Purdue: BioMOCA and nanowire production.
OSG Usage by Numbers • 39 virtual communities • 6 VOs with >1000 jobs at peak (5 particle physics & 1 campus grid) • 4 VOs with 500-1000 at peak (two outside physics) • 10 VOs with 100-500 at peak (campus grids and physics)
Running Jobs During the Last Year
Jobs Running at Sites • >1k max: 5 sites • >0.5k max: 10 sites • >100 max: 29 sites • Total: 47 sites • Many small sites, or sites with mostly local activity.
CMS Transfers on OSG in June '06 • 450 MByte/sec • All CMS sites exceeded 5 TB per day in June 2006. • Caltech, Purdue, UCSD, UFL, UW exceeded 10 TB/day.
Summary • OSG facility utilization is steadily increasing • ~2,000-4,500 jobs running at all times • HEP, astro, and nuclear physics, but also bio/eng/med • Constant effort and troubleshooting are being poured into making OSG usable, robust, and performant. • Show its use to other sciences. • Trying to bring campuses into a pervasive distributed infrastructure. • Bring research communities to a ubiquitous appreciation of the value of (distributed, opportunistic) computation • Educate people to utilize the resources
Backup Slides
Principle: Simple and Flexible The OSG architecture will follow the principles of symmetry and recursion wherever possible. This principle guides us in our approaches to - Support for hierarchies of, and property inheritance among, VOs. - Federations and interoperability of grids (grids of grids). - Treatment of policies and resources.
Principle: Coherent but heterogeneous The OSG architecture is VO based. Most services are instantiated within the context of a VO. This principle guides us in our approaches to - Scope of namespaces & action of services. - Definition of and services in support of an OSG-wide VO. - No concept of “global” scope. - Support for new and dynamic VOs should be light-weight.
OSG Security Activities (continued). Diagram: trust relationships in job submission: the site trusts that the submission comes from the VO (or its agent), that it is the user, and that the job is for the VO; the VO infrastructure trusts that it is the user's job; worker nodes run the jobs and access storage and data under these trust relationships.
Principle: Bottom Up / Persistency All services should function and operate in the local environment when disconnected from the OSG environment. This principle guides us in our approaches to - The architecture of services, e.g., services are required to manage their own state, ensure their internal state is consistent, and report their state accurately. - Development and execution of applications in a local context, without an active connection to the distributed services.
Principle: Commonality OSG will provide baseline services and a reference implementation. The infrastructure will support incremental upgrades. The OSG infrastructure should have minimal impact on a site. Services that must run with superuser privileges will be minimized. Users are not required to interact directly with resource providers.