Distributed Computing Beyond The Grid
5th International Conference "Distributed Computing and Grid Technologies in Science and Education"
Alexei Klimentov, Brookhaven National Laboratory
Grid2012 Conference, Dubna
Main topics
• ATLAS experiment
• Grid and the 2012 discovery
• From distributed computing to the Grid
• MONARC model and computing model evolution
• Progress in networking
• Evolution of the data placement model
• From planned replicas to global data access
• Grids and Clouds
• Cloud computing and virtualization
• From Internet to ….
ATLAS – The ATLAS Collaboration
• "The goal is to understand in the most general; that's usually also the simplest." – A. Eddington
• We use experiments to inquire about what "reality" (nature) does; we intend to fill the gap between experiment and theory
• ATLAS: A Toroidal LHC ApparatuS
• ATLAS is one of the six particle detector experiments at the Large Hadron Collider (LHC) at CERN, and one of the two general-purpose detectors
• The project involves more than 3000 scientists and engineers in ~40 countries
• The ATLAS detector is 44 meters long and 25 meters in diameter and weighs about 7,000 tons. It is about half as big as the Notre Dame cathedral in Paris and weighs as much as the Eiffel Tower or a hundred 747 jets
• The detector has 150 million sensors delivering data
• The collaboration is huge and highly distributed
Proton-Proton Collisions at the LHC
• Collisions every 50 ns = 20 MHz crossing rate
• 1.6 × 10^11 protons per bunch
• At L_pk ~ 0.7 × 10^34 cm^-2 s^-1 ≈ 35 pp interactions per crossing (pile-up)
  → ≈ 10^9 pp interactions per second!
• In each collision ≈ 1600 charged particles are produced
• ATLAS RAW event size: 1.2 MB; ATLAS reconstructed event size: 1.9 MB
• An enormous challenge for the detectors and for data collection/storage/analysis
• Research in High Energy Physics cannot be done without computers → GRID
• Figure: candidate Higgs decay to four electrons recorded by ATLAS in 2012
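The headline rates above follow from simple arithmetic; a short back-of-the-envelope check, using only the round numbers quoted on this slide, is:

```latex
% pp interaction rate: crossing rate times average pile-up
R_{pp} \approx 20\,\mathrm{MHz} \times 35 \approx 7\times10^{8} \approx 10^{9}\ \mathrm{interactions/s}
% hypothetical untriggered data rate at the quoted 1.2 MB RAW event size
R_{\mathrm{data}} \approx 2\times10^{7}\,\mathrm{crossings/s} \times 1.2\,\mathrm{MB} \approx 24\,\mathrm{TB/s}
```

Only a tiny, triggered fraction of events can therefore be recorded, and the remaining storage and analysis problem is what gets handed to the Grid.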
The Complexity of HEP Computing
• Research in High Energy Physics cannot be done without computers:
  • enormous data volume
  • the complexity of the data processing algorithms
  • the statistical nature of data analysis
  • several (re)processing and Monte Carlo simulation campaigns per year
  → requires sufficient computing capacity
• The computing challenge posed by the LHC:
  • very large international collaborations
  • PetaBytes of data to be treated and analyzed
• The volume of data and the need to share them across the collaboration is the key issue for LHC data analysis
• From the start it was clear that no single centre could provide ALL the computing even for one LHC experiment (buildings, power, cooling, money, …)
• ATLAS data at CERN, 2010 – June 2012: 15+ PBytes
• ATLAS computing requirements over time:
  • 1995: 100 TB disk space, 10^7 MIPS (Computing Technical Proposal)
  • 2001: 1,900 TB, 7×10^7 MIPS (LHC Computing Review)
  • 2007: 70,000 TB, 55×10^7 MIPS (Computing Technical Design Report)
  • 2010: LHC start
  • 2011: 83,000 TB, 61×10^7 MIPS
MONARC Model and LHC Grid – hierarchy in data placement
• In 1998 the MONARC project defined the tiered architecture deployed later as the LHC Computing Grid
• A distributed model:
  • integrate existing centres and department clusters, recognising that funding is easier if the equipment is installed at home
  • devolution of control – local physics groups have more influence over how local resources are used and how the service evolves
• A multi-Tier model:
  • enormous data volumes looked after by a few (expensive) computing centres
  • network costs favour regional data access
• A simple model that HEP could develop and get into production, ready for data in 2005
• Around the same time I. Foster and C. Kesselman proposed a general solution to distributed computing – the Grid
WLCG
• Tier-0 (CERN) – 15%:
  • data recording
  • initial data reconstruction
  • data distribution
• Tier-1 (11 centres) – 40%:
  • permanent storage
  • re-processing
  • analysis
  • connected by direct 10 Gb/s network links
• Tier-2 (~200 centres) – 45%:
  • simulation
  • end-user analysis
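Purely as an illustration of the role separation above (this is not WLCG software), the tier roles and approximate shares can be written down as a small routing table that maps an activity to the tiers allowed to run it:

```python
# Illustrative sketch only: tier roles and shares from the slide above,
# plus a helper that looks up which tiers handle a given activity.
WLCG_TIERS = {
    "Tier-0": {"share": 0.15, "roles": {"data recording", "initial reconstruction", "data distribution"}},
    "Tier-1": {"share": 0.40, "roles": {"permanent storage", "re-processing", "analysis"}},
    "Tier-2": {"share": 0.45, "roles": {"simulation", "end-user analysis"}},
}

def tiers_for(activity: str) -> list[str]:
    """Return the tiers whose role list includes the given activity."""
    return [name for name, cfg in WLCG_TIERS.items() if activity in cfg["roles"]]

if __name__ == "__main__":
    print(tiers_for("simulation"))      # ['Tier-2']
    print(tiers_for("re-processing"))   # ['Tier-1']
```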
LHC Grid model over time
• One of the major objectives reached was to enable physicists from all sites, large and small, to access and to analyse LHC data
• Masking the Grid's complexity from users while still delivering full functionality was one of the greatest challenges
• PanDA* (the ATLAS Production and Analysis workload management system) allows users to run analysis in the same way on local clusters and on the Grid
  *) see K.De's talk "Status and Evolution of the ATLAS Workload Management System PanDA" later today
• Evolution of the data placement model – from planned replicas to dynamic data placement:
  • the mantra "jobs go to data" served well in 2010, but in 2011/12 ATLAS moved to a dynamic data placement concept
  • additional copies of the data are made; unused copies are cleaned
• Data storage evolution – role separation and decoupling:
  • separating archive and disk activities
  • separating Tier-0 and analysis storage: analysis storage can evolve more rapidly without posing risks to high-priority Tier-0 tasks
• CERN implementation – EOS:
  • mostly to eliminate CASTOR constraints
  • EOS is a high-performance, highly scalable redundant disk storage system based on the xrootd framework
  • intensively (stress-)tested by ATLAS in 2010/11; in production since 2011
  • Figure: ATLAS EOS space occupancy (TB)
  • see D.Duellmann's talk "Storage Strategy and Cloud Storage Evaluations", Grid2012, July 18
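A minimal sketch of the dynamic data placement idea described above (this is not the ATLAS data management code; the thresholds and data shapes are invented for illustration): popular datasets gain replicas, unused replicas get cleaned.

```python
# Illustrative sketch only -- not ATLAS DDM.  Replicate popular datasets,
# clean up unused replicas, keep at least one primary copy.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    n_replicas: int
    recent_accesses: int   # popularity metric, e.g. jobs that touched it recently

def plan_placement(ds: Dataset, max_replicas: int = 5) -> str:
    """Return a placement action for one dataset based on its popularity."""
    if ds.recent_accesses == 0 and ds.n_replicas > 1:
        return "delete one unused replica"
    if ds.recent_accesses > 100 * ds.n_replicas and ds.n_replicas < max_replicas:
        return "create an additional replica"   # heavily used: spread the load
    return "keep as is"

if __name__ == "__main__":
    # Dataset name is a made-up example.
    print(plan_placement(Dataset("data12_8TeV.example.AOD", 2, 0)))
```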
Progress in Networking
• The network is as important as site infrastructure:
  • a key point for optimizing storage usage and the brokering of jobs to sites
  • the WAN is very stable and its performance is good
  • it allows migration from a hierarchy to a mesh model
  • production and analysis workload management systems will use network status/performance metrics to send jobs/data to sites
• Progress in networking → remote access via the WAN is a reality; it allowed the MONARC model to be relaxed
• LHCOPN – the LHC Optical Private Network
• LHCONE – the LHC Open Network Environment, an initiative for Tier-2 networking:
  • network providers, working jointly with the experiments, have proposed a new network model for supporting the LHC experiments, known as LHCONE
  • the goal of LHCONE is to provide a collection of access locations that are effectively entry points into a network reserved to the LHC Tier-1/2/3 sites
  • LHCONE will complement LHCOPN
Globalized Data Access
• The canonical HEP strategy – "jobs go to data":
  • data are partitioned between sites; some sites are more important (get more important data) than others
  • planned replicas: a dataset (a collection of files produced under the same conditions and the same software) is the unit of replication; dataset sizes reach several TBytes
  • data and replica catalogs are needed to broker jobs
  • an analysis job that requires data from several sites triggers data replication and consolidation at one site, or is split into several jobs running at all those sites
  • a data analysis job must wait for all its data to be present at the site
  • the situation can easily degrade into a complex n-to-m matching problem
• HEP pioneered the concept of a "Data Grid", where data locality is the main feature
• The data popularity concept:
  • used to decide when the number of replicas of a sample needs to be adjusted up or down, and to replicate or clean up accordingly
  • dynamic storage usage; analysis job waiting times decreased
  • but we still incur extra transfers to increase the number of dataset replicas
• Directly accessing data ATLAS-wide (world-wide in the ideal case) could reduce the need for extra replicas and enhance the performance and stability of the system
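A minimal sketch of the "jobs go to data" brokering idea described above (this is not PanDA code; site names, data structures and the slot-based tie-break are invented for illustration): a job goes to a site that already hosts the dataset, and if none does, replication or job splitting is needed instead.

```python
# Illustrative sketch only: broker a job to a site that holds the dataset,
# preferring the site with the most free CPU slots.
def broker(dataset: str,
           replicas: dict[str, set[str]],     # site -> datasets hosted there
           free_slots: dict[str, int]) -> str | None:
    candidates = [site for site, held in replicas.items()
                  if dataset in held and free_slots.get(site, 0) > 0]
    if not candidates:
        return None                           # would trigger replication or job splitting
    return max(candidates, key=lambda s: free_slots[s])

if __name__ == "__main__":
    replicas = {"SITE_A": {"dsA"}, "SITE_B": {"dsA", "dsB"}, "SITE_C": {"dsB"}}
    slots = {"SITE_A": 50, "SITE_B": 200, "SITE_C": 10}
    print(broker("dsA", replicas, slots))     # -> SITE_B
```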
Storage Federation
• "Storage federation":
  • provides new access modes and redundancy
  • jobs access data on shared storage resources via the WAN
  • share storage resources, and move towards file- and (eventually) event-level caching
  • do not replicate a dataset, do not cache a dataset
  • being examined in the Tier-1/Tier-2 context and also for off-Grid Tier-3s
• Federated ATLAS XRootD (FAX) deployment:
  • since September 2011 ~10 sites report to the global federation (US ATLAS Computing project)
  • performance studies were conducted with various caching options
  • adoption if decent WAN performance is achievable
  • subject the current set of sites to regular testing at a significant analysis-job scale
  • monitoring of I/O
  • evaluating file caching, with the target of event-level caching
  • tests being extended from regional to global
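The federation access pattern can be sketched as a simple fall-back: use the file if it is already on local storage, otherwise fetch it over the WAN through a federation redirector. This is only an illustration, assuming the standard xrootd `xrdcp` client is installed; the redirector hostname and paths are placeholders, not real FAX endpoints.

```python
# Illustrative sketch only: the "local first, federation second" idea behind FAX.
import os
import subprocess

REDIRECTOR = "root://fax-redirector.example.org/"   # placeholder, not a real endpoint

def open_or_fetch(lfn: str, local_dir: str = "/data") -> str:
    """Return a local path for a logical file name, staging it via the federation if needed."""
    local_path = os.path.join(local_dir, os.path.basename(lfn))
    if os.path.exists(local_path):
        return local_path                            # already available on local storage
    # Fall back to WAN access: copy the file through the global redirector.
    subprocess.run(["xrdcp", REDIRECTOR + lfn, local_path], check=True)
    return local_path
```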
CERN 2012 Discovery (and the Grid)
• ATLAS Distributed Computing on the Grid: 10 Tier-1s + CERN + ~70 Tier-2s + … (more than 80 production sites)
• ATLAS 2012 dataset transfer time: data are available for physics analysis in ~5 h
• Completed Grid jobs (Apr 1 – Jul 12, 2012): ~830K daily average completed ATLAS Grid jobs (analysis and MC production)
• Data transfer throughput (Apr 1 – Jul 4, 2012), all ATLAS sites: up to 6 GB/s week average
• On 4 July 2012 the ATLAS experiment presented updated results on the search for the Higgs boson. Fabiola Gianotti, ATLAS spokesperson: "We observe in our data clear signs of a new particle, at the level of 5 sigma, in the mass region around 126 GeV… It would have been impossible to release physics results so quickly without outstanding performance of the Grid."
• Available resources were fully used/stressed
• A very effective and flexible computing model and operation → accommodates high trigger rates and pile-up, intense MC simulation, and analysis demands from world-wide users
In the meantime…
• While we were developing the Grid, the rest of the world had other ideas (timeline: 1996, 1998, 2004, 2006, … Amazon EC2)
• "The external world of computing is changing now as fast as it ever has and should open paths to knowledge in physics. HEP needs to be ready for new technical challenges posed both by our research demands and by external developments." – Glen Crawford, HEP Office, DoE, introduction to CHEP 2012
Balancing between stability and innovation
• The ATLAS Distributed Computing infrastructure is working. ATLAS Computing faces challenges ahead as LHC performance ramps up for the remainder of 2012 data taking
• Our experience provides confidence that future challenges will be handled without compromising physics results
• Despite popular opinion, the Grid is serving us very well: ~1500 ATLAS users process PetaBytes of data with billions of jobs
• …but we are starting to hit some limits: database scalability, CPU resources, storage utilization
• We also need to learn lessons and watch what others are doing
• …it is probably time for a check-up, although there is no universal recipe
• See T.Wenaus' talk "Technical Evolution in LHC Computing", Grid2012, July 18
Grids and Clouds: Friends or Foes?
Cloud Computing and Grid
• For approximately one decade Grid computing was hailed by many as "the next big thing"
• Cloud computing is increasingly gaining popularity and has become another buzzword (like Web 2.0)
• We are starting to compute on centralized facilities operated by third-party compute and storage utilities
• The idea is not new: in the early sixties computing pioneers like John McCarthy predicted that "computation may someday be organized as a public utility"
• The Cloud and Grid computing vision is the same: to reduce the cost of computing, increase reliability, and increase flexibility by transforming computers from something that we buy and operate ourselves into something that is operated by a third party
• But things are different than they were 10 years ago:
  • we have experience with LHC data processing and analysis
  • we need to analyze Petabytes of LHC data
  • we found that it is quite expensive to operate commodity clusters
  • and Amazon, Google, Microsoft, … have created real commercial large-scale systems containing hundreds of thousands of computers
• There is a long list of cloud computing definitions: "A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet." – I.Foster et al.
Relationship of Clouds with other domains
• Web 2.0 covers almost the whole spectrum of service-oriented applications, while Cloud computing lies at the large-scale end
• Supercomputing and cluster computing have been more focused on traditional non-service applications
• The Grid computing vision (and definition) is evolving with time, and the fathers of the Grid see it slightly differently than a decade ago
• Grid computing overlaps with all these fields and is generally considered of lesser scale than supercomputers and Clouds
• Foster's definition of Cloud computing overlaps with many existing technologies, such as Grid computing, utility computing and distributed computing in general
• See I.Foster et al., "Cloud Computing and Grid Computing 360-Degree Compared"
Cloud Computing and HEP
• "Clouds" in ATLAS Distributed Computing:
  • integrated with the production system (PanDA)
  • transparent to end-users
• Figures: ATLAS jobs running at an IaaS CA site; monitoring page for ATLAS clouds
• Pioneering work by the Melbourne University group (M.Sevior et al., ~2009): a commercial cloud was used to run Monte Carlo simulation for the BELLE experiment; around the same time the ATLAS Cloud Computing R&D project was started
ATLAS Cloud Computing R&D – efficiency, elasticity
• Goal: how can we integrate cloud resources with our current Grid resources?
• Data processing and workload management:
  • production (PanDA) queues in the cloud: centrally managed, non-trivial deployment but scalable; benefits ATLAS and sites, transparent to users
  • Tier-3 analysis clusters – instant cloud sites: institute managed, low/medium complexity
  • personal analysis queue – one click, run my jobs: user managed, low complexity (almost transparent)
• Data storage:
  • short-term data caching to accelerate the data processing use cases above (transient data)
  • object storage and archival in the cloud
  • integration with the data management system
Helix Nebula: The Science Cloud (2012/14)
• A partnership between European companies and research organizations (CERN, EMBL, ESA)
• Goal: establish a sustainable European cloud computing infrastructure to provide stable computing capacities and services that elastically meet demand
• Pilot phase: Proof of Concept deployments on three commercial cloud providers (ATOS, CloudSigma, T-Systems)
• ATLAS has been chosen as one of the Helix Nebula flagships to make Proof of Concept deployments on the three different commercial cloud providers (ATLAS setup)
ATLAS Cloud Computing and Commercial Cloud Providers
• During this pilot phase ATLAS Distributed Computing integrated Helix Nebula cloud resources into the PanDA workload management system, tested the cloud sites in the same way as any standard Grid resource, and finally ran Monte Carlo simulation jobs on several hundred cores in the cloud
• All deployments in the Proof of Concept phase have been successful and have helped identify the future directions for Helix Nebula:
  • agree on a common cloud model
  • provide a common interface
  • understand a common model for cloud storage
  • involve other organizations or experiments
  • understand costs, SLAs, legal constraints, …
• Figure: ATLAS jobs in Helix clouds
ATLAS Cloud Computing: Achievements
• We have already seen some good achievements:
  • production and analysis queues have been ported to the cloud, enabling users to access extra computing resources on demand from private and commercial cloud providers
  • PanDA submission is transparent for users: a user can access new cloud resources with minimal changes to their analysis workflow on Grid sites
  • orchestrators for dynamic provisioning (i.e. adjusting the size of the cloud according to existing demand) have been implemented, and one is already in use for PanDA queues
  • different storage options were evaluated (EBS, S3), as well as an xrootd storage cluster in the cloud
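To illustrate what "dynamic provisioning" means in practice, here is a minimal sketch of a demand-driven scaling loop. It is not the ATLAS orchestrator: it assumes an EC2-style IaaS API reached through boto3, that the queue depth is obtained elsewhere from the workload management system, and uses placeholder image names and sizing constants.

```python
# Minimal sketch of demand-driven VM provisioning (not the ATLAS orchestrator).
import boto3

JOBS_PER_VM = 8           # illustrative: jobs one worker VM runs concurrently
MAX_VMS = 100             # illustrative cap on the cloud slice

ec2 = boto3.client("ec2")  # credentials/endpoint taken from the environment

def scale(queued_jobs: int, running_vm_ids: list[str]) -> None:
    """Grow or shrink the worker-VM pool to match the current queue depth."""
    wanted = min(MAX_VMS, -(-queued_jobs // JOBS_PER_VM))   # ceiling division
    if wanted > len(running_vm_ids):
        ec2.run_instances(ImageId="ami-PLACEHOLDER",        # worker image (placeholder)
                          InstanceType="m1.large",
                          MinCount=1,
                          MaxCount=wanted - len(running_vm_ids))
    elif wanted < len(running_vm_ids):
        surplus = running_vm_ids[wanted:]                   # assumed to be idle workers
        ec2.terminate_instances(InstanceIds=surplus)
```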
Virtualization
• Virtualization technology offers an opportunity to decouple the infrastructure, operating system and experiment software life-cycles
• The concept is not new in information technology: one can argue that the whole evolution of computing machines has been accompanied by a process of virtualization intended to offer a friendly and functional interface to the underlying hardware and software layers (figure: "RIP CERNVM", June 1996)
• The performance penalty has been a known issue for decades
• Virtualization was reborn in the age of Grids and Clouds: it has become an indispensable ingredient of almost every Cloud, most obviously for abstraction and encapsulation
• Hardware factors also favour virtualization:
  • many more cores per processor
  • the need to share processors between applications
  • AMD and Intel have been introducing hardware support for virtualization
• One interesting application is a virtual Grid, where a pool of virtual machines is deployed on top of physical hardware resources
• Virtualization could also be chosen for HEP data preservation
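As a purely illustrative building block for the "pool of virtual machines on top of physical hardware" idea (not ATLAS or CernVM tooling), one worker VM could be started on a local hypervisor with the libvirt Python bindings; the image path and sizes below are placeholders.

```python
# Illustrative sketch: boot one transient worker VM via libvirt.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>grid-worker-01</name>
  <memory unit='GiB'>2</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/worker.qcow2'/>  <!-- placeholder image -->
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")      # connect to the local hypervisor
domain = conn.createXML(DOMAIN_XML, 0)     # start a transient VM from the XML description
print("started", domain.name())
```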
Cloud Computing: Summary
• Our vision of the future of Grid and Cloud computing is of fully complementary technologies that will coexist and cooperate at different levels of abstraction in e-infrastructures
• We are planning to incorporate virtualization and Cloud computing to enhance ATLAS Distributed Computing
• Data processing:
  • many activities are reaching a point where we can start getting feedback from users; we should determine what we can deliver in production, start focusing and eliminating options, and improve automation and monitoring
  • we are still suffering from a lack of standardization amongst providers
• Cloud storage:
  • this is the hard part
  • looking forward to good progress in caching (xrootd in the cloud)
  • some "free" S3 endpoints are only now coming online, so effective R&D is only starting
• We believe that finally there will be a Grid of Clouds integrated with the LHC Computing Grid as we know it now; right now we have a Grid of Grids (LCG, NorduGrid, OSG)
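For the cloud-storage R&D mentioned above, the S3 access pattern itself is straightforward; a minimal sketch (not ATLAS production code) assuming a generic S3-compatible endpoint via boto3, with placeholder endpoint, bucket and key names and credentials taken from the environment:

```python
# Illustrative sketch: put/get one object against an S3-compatible endpoint.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example-cloud.org")  # placeholder endpoint

def put_file(bucket: str, key: str, path: str) -> None:
    """Upload one local file as an object."""
    with open(path, "rb") as f:
        s3.put_object(Bucket=bucket, Key=key, Body=f)

def get_file(bucket: str, key: str, path: str) -> None:
    """Download one object back to a local file."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    with open(path, "wb") as f:
        f.write(obj["Body"].read())
```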
Outlook
• "It would have been impossible to release physics results so quickly without outstanding performance of the Grid" – Fabiola Gianotti, Jul 4, 2012
• There is no single technology to fit all, and no one-stop solution
• ATLAS Distributed Computing is pioneering a Cloud R&D project and actively evaluating storage federation solutions. ATLAS was the first LHC experiment to implement data popularity and dynamic data placement; we need to go forward to file- and event-level caching. Distributed computing resources will be used more dynamically and flexibly, which will make more efficient use of resources
• Grid and Cloud computing are fully complementary technologies that will coexist and cooperate at different levels of abstraction in e-infrastructures
• The evolution of virtualization will ease the integration of Clouds and the Grid
• HEP data placement is moving to data caching and data access via the WAN
From Internet to Gutenberg
• Hermes, the alleged inventor of writing, presented his invention to the Pharaoh Thamus; he praised his new technique that was supposed to allow human beings to remember what they would otherwise forget.
• The 20th century (TV, radio, telephone, …) brought another culture: people watch the whole world in the form of images, which was expected to bring a decline of literacy.
• The computer screen is an ideal book on which one reads about the world in the form of words and pages. Teenagers who want to program their own home computer must know, or learn, logical procedures and algorithms, and must type words and numbers on a keyboard at great speed. In this sense one can say that the computer made us return to a Gutenberg Galaxy. People who spend their nights carrying on an unending Internet conversation are principally dealing with words.
• From a lecture by Umberto Eco (Italian philosopher and novelist, author of "Il nome della rosa")
Summary
• The last decade stimulated High Energy Physics to organize computing in a widely distributed way
• Active participation in the LHC Grid service gives the institute (not just the physicist) a continuing and key role in the data analysis, which is where the physics discovery happens
• It encourages novel approaches to analysis … and to the provision of computing resources
• One of the major objectives reached was to enable physicists from all sites, large and small, to access and to analyse LHC data
Acknowledgements
Many thanks to my colleagues, F.Barreiro, I.Bird, J.Boyd, R.Brun, P.Buncic, F.Carminati, K.De, D.Duellmann, A.Filipcic, V.Fine, I.Fisk, R.Gardner, J.Iven, S.Jezequel, A.Hanushevsky, L.Robertson, M.Sevior, J.Shiers, D. van der Ster, H. von der Schmitt, I.Ueda, A.Vaniachine, T.Wenaus and many-many others for materials used in this talk.