
The PanDA System


Presentation Transcript


  1. The PanDA System Mark Sosebee University of Texas at Arlington DOSAR III Workshop at OU September 22, 2006

  2. Outline • What is PanDA? • PanDA contributors • PanDA goals and expectations • System design • Summary of PanDA components • Monitoring • Performance • U.S. ATLAS and OSG • Using PanDA

  3. What is PanDA? • PanDA – Production and Distributed Analysis system • Project begun in August 2005 • System developed by the U.S. ATLAS team for MC production and distributed analysis: • Rapid development from scratch • Leveraged DC2/Rome experience • Inspired by DIRAC and other systems • Extensively used for CSC production in the U.S. • Better scalability and usability compared to previous systems • Available now for distributed analysis users • “One-stop shopping” for all ATLAS computing users in the U.S.

  4. PanDA Contributors • Project Coordinators: Torre Wenaus – BNL, Kaushik De – UTA • Lead Developer: Tadashi Maeno – BNL • PanDA team: • Brookhaven National Laboratory (BNL): Wensheng Deng, Alexei Klimentov, Yuri Smirnov, Xin Zhao • University of Texas at Arlington (UTA): Paul Nilsson, Nurcan Ozturk, Sudhamsh Reddy, Mark Sosebee • University of Oklahoma (OU): Karthik Arunachalam, Horst Severini • University of Chicago (UC): Marco Mambelli • Lawrence Berkeley Lab (LBL): Martin Woudstra • DQ2 developers (CERN): Miguel Branco, David Cameron, Pedro Salgado

  5. PanDA Goals • Minimize dependency on unproven external components • PanDA relies only on proven and scalable external components – Apache, Python, MySQL • Backup/failover systems are built into PanDA for all other components (see the sketch below) • Local pilot submission backs up the Condor-G scheduler • GridFTP backs up FTS • Integrated with the ATLAS prodsys • Maximize automation – reduce the operations crew • PanDA is operating with a smaller (half-size) shift team and has already exceeded the DC2/Rome production scale • Expect to scale the production rate by a factor of 10 without significantly increasing operations support • Additionally, provide support for distributed analysis users (hundreds of users in the U.S.) – without a large increase in personnel
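
The failover idea above can be illustrated with a minimal sketch: try the primary service first, and fall back to the backup on failure. All function and exception names below are hypothetical stand-ins, not PanDA's actual API.

    # Illustrative failover pattern: FTS is the primary transfer service,
    # GridFTP the backup. All names here are hypothetical stand-ins.

    class TransferError(Exception):
        """Raised when a transfer attempt fails."""

    def transfer_via_fts(src, dst):
        # Primary path (stubbed to fail, to demonstrate the fallback).
        raise TransferError("FTS unavailable")

    def transfer_via_gridftp(src, dst):
        # Backup path: a direct GridFTP copy (stubbed).
        return "transferred via GridFTP"

    def transfer_with_failover(src, dst):
        try:
            return transfer_via_fts(src, dst)
        except TransferError:
            return transfer_via_gridftp(src, dst)

    print(transfer_with_failover("srm://tier1/file", "srm://tier2/file"))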

  6. PanDA Goals (cont.) • Efficient utilization of available hardware • So far, all available T1 and T2 resources have been saturated during sustained CSC production since early 2006 • The available number of CPUs will increase this fall: PanDA is expected to continue at full saturation • Demonstrate the scalability required for ATLAS in 2008 • Already seen a factor of four higher peak performance compared to DC2/Rome production • No scaling limits expected for another factor of 10-20 • Many techniques are available if scaling limits are reached – based on proven Apache technology, deployment of multiple PanDA servers, or proven MySQL cluster technology

  7. PanDA Goals (cont. II) • Tight integration with the DQ2 data management system • PanDA has been fully integrated with DQ2 from its beginning • PanDA has played a valuable role in testing and hardening DQ2 in a real distributed production environment • Very productive interactions between the PanDA and DQ2 teams • Provide a distributed analysis service for U.S. users • Integrated / uniform service for all U.S. grid sites • A variety of tools supported • A scalable system to support hundreds of users

  8. PanDA Design

  9. PanDA Components • Job Interface – allows injection of jobs into the system • Executor Interface – translation layer for the ATLAS prodsys / prodDB • Task Buffer – keeps track of all active jobs (job state is kept in MySQL) • Brokerage – initiates subscriptions for a block of input files required by jobs (preferentially choosing sites where the data is already available) • Dispatcher – sends the actual job payload to a site, on demand, if all conditions (input data, space, and other requirements) are met • Data Service – interface to the DQ2 data management system • Job Scheduler – sends pilot jobs to remote sites • Pilot Jobs – lightweight execution environment to prepare the CE, request the actual payload, execute it, and clean up • Logging and Monitoring systems – HTTP- and web-based • All communication is through REST-style HTTPS services (via mod_python and Apache servers); a sketch of the pilot/dispatcher exchange follows below
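
The pilot/dispatcher exchange can be sketched roughly as follows. The host name, endpoint path, and parameter name are illustrative assumptions for this sketch, not the documented 2006 interface; the real services sit behind Apache with mod_python.

    # Hedged sketch (modern Python) of a pilot requesting its payload over
    # HTTPS. Host, path, and parameter names are assumptions, not PanDA's
    # documented interface.
    import urllib.parse
    import urllib.request

    PANDA_SERVER = "https://pandaserver.example.org:25443/server/panda"  # hypothetical

    def request_payload(site_name):
        """POST a REST-style job request; the dispatcher replies with the
        job parameters if all conditions at the site are met."""
        data = urllib.parse.urlencode({"siteName": site_name}).encode()
        with urllib.request.urlopen(PANDA_SERVER + "/getJob", data) as resp:
            return resp.read()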

  10. PanDA Monitoring http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query

  11. PanDA Monitoring – Front Page

  12. PanDA Monitoring – Production Dashboard

  13. Job States in the System defined → assigned → activated → running → finished/failed (If input files are not available: defined → waiting. Once the files are available, the chain picks up again at the “assigned” stage.) defined: inserted into the PanDA DB; assigned: the dispatchDBlock is subscribed to the site; waiting: input files are not ready; activated: waiting for a pilot request; running: running on a worker node; finished: finished successfully; failed: failed

  14. Job States in the System (cont.) What triggers a change in job status? defined → assigned / waiting: automatic; assigned → activated: a callback is received, sent by the DQ2 site service when the dispatchDBlock is verified (if a job has no input files, it is activated without a callback); activated → running: the job is picked up by a pilot; waiting → assigned: a callback is received, sent by the DQ2 site service when the destinationDBlock is verified (the rules are summarized as a transition table below)
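
The transitions on the two slides above amount to a small state machine. A sketch as a transition table (an illustration of the rules as stated, not PanDA's actual implementation):

    # Job state machine from slides 13-14, encoded as a transition table.
    TRANSITIONS = {
        "defined":   {"assigned", "waiting"},  # waiting if input files are not ready
        "waiting":   {"assigned"},             # DQ2 callback: destinationDBlock verified
        "assigned":  {"activated"},            # DQ2 callback: dispatchDBlock verified
        "activated": {"running"},              # picked up by a pilot
        "running":   {"finished", "failed"},   # terminal outcomes
    }

    def advance(state, new_state):
        """Validate a job status change against the table above."""
        if new_state not in TRANSITIONS.get(state, set()):
            raise ValueError("illegal transition: %s -> %s" % (state, new_state))
        return new_state

    # Example: the normal chain for a job whose input files are available.
    state = "defined"
    for nxt in ("assigned", "activated", "running", "finished"):
        state = advance(state, nxt)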

  15. OSG Production in 1st Half 2006 • The U.S. provides 24% of managed CSC production through PanDA • Half of the U.S. production was done at the Tier 1, the remainder at Tier 2 sites • PanDA is also used for user analysis (through pathena) • Currently, PanDA has 54 analysis users who have submitted jobs through the grid in the past 6 months • Analysis jobs primarily run at BNL; we are also testing at UTA, and soon all T2 sites will be enabled.

  16. Over 50 million events (all steps) completed so far in 2006

  17. PanDA CPU Usage An average of ~600k SI2k of CPU is used by successful jobs per day on OSG

  18. Schedule • Currently, we average 1M events/week • We expect to ramp up to 2M events/week soon (in a few weeks), once we start full Release 12 production • Goal: a factor of 2 increase every 4 months after that • Achieve 10M events/week by Fall 2007 (see the check below) • Note: this is still below the computing model target • Need to finish 20M events by the end of 2006 • Need an additional 50M in the 1st quarter of 2007 • We are starting to look at storage issues – available storage capacity needs to increase rapidly
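
A quick check of the ramp-up arithmetic, assuming a clean doubling every 4 months from the 2M events/week starting point:

    # Starting at 2M events/week and doubling every 4 months reaches
    # ~10M/week between months 8 and 12, i.e. by Fall 2007 as stated.
    rate = 2_000_000  # events/week
    for month in range(0, 13, 4):
        print("month %2d: %.0fM events/week" % (month, rate / 1e6))
        rate *= 2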

  19. U.S. ATLAS & OSG • OSG includes many activities – ATLAS is but one user • ATLAS-OSG contributes 25% of CSC production • Organized as one large Tier 1 cloud: • BNL Tier 1 • Five regional Tier 2s (most Tier 2s have multiple sites, ~15 total) • Each Tier 2 provides at least one ATLAS-dedicated site • Resources at other, non-ATLAS sites are available through OSG • Management: • ATLAS-OSG is managed by the U.S. ATLAS Computing Group • Clear line of responsibility: one Tier 1 Coordinator (Bruce Gibbard), one Tier 2 Coordinator (Rob Gardner) • Each Tier 2 has a local coordinator – ultimately responsible for the site • In addition, there are ATLAS-OSG software/operations coordinators

  20. Responsibilities • Tier 1 and Tier 2 sites provide services: • Batch systems (with local storage) accessible through a grid gatekeeper • Global storage systems accessible through DQ2 • Networking services • Any other services required by ATLAS (database…) • In general, only access through the grid is guaranteed to ATLAS • Sites are responsible for QoS (site monitoring) • U.S. ATLAS operations teams manage access to sites: • Production operations team – managed production (CSC), regional production (local groups using local resources), distributed analysis • DDM operations team • User support team

  21. Communication • U.S. Resource Allocation Committee (RAC) • Tier1-Tier2 meetings every 2 months (1-2 days) • Rotate among sites • Attended by management, site coordinators, and, most importantly, technical staff from all sites • Site requirements are discussed and decided jointly • At UC – http://indico.cern.ch/conferenceDisplay.py?confId=a062200 • At HU – http://indico.cern.ch/conferenceDisplay.py?confId=4897 • Next meeting: Dec 8 at UTA • Site trouble ticket system • http://www.usatlas.bnl.gov/twiki/bin/view/Support/RTUsersGuide • PanDA mailing list, various Savannah trackers, various HyperNews forums • Regular tutorials (examples: production, DDM, user…)

  22. Ongoing PanDA Development • PanDA core software in production – very few changes in the past 3 months • Recent improvements – mostly for DA and monitoring • Robust local pilot job submission at BNL (Condor-G at most Tier 2s) • DQ2 enhancements to enable non-ATLAS sites – under testing • Multi-tasking pilot running at BNL – to allow additional DA resources • User accounting and quota systems – operational • Improved error handling, robustness against grid failures • Goals for the next 6 months • Increase the number of sites served by PanDA • Try PanDA on an LCG site (1st candidate – WestGrid) • Support for regional or local (U.S.) production – so far PanDA supports managed CSC production and user analysis • Support for additional physics analysis use cases • Support for calibration and other use cases

  23. Using PanDA for Distributed Analysis • Obtain a BNL user account: http://www.acf.bnl.gov/UserInfo/GettingStarted/ • Obtain a personal grid certificate: http://www.doegrids.org/pages/cert-request.html • Join the ATLAS VO: http://www.usatlas.bnl.gov/twiki/bin/view/Support/HowToAtlasVO • Try out the analysis tutorial: https://twiki.cern.ch/twiki/bin/view/Atlas/DAonPanda • Refer to the monitoring pages to see the status of your job(s); a sketch of querying the monitor follows below
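
To check job status programmatically rather than in the browser, something like the sketch below could be used against the monitor URL from slide 10. The query parameter name ("job") is an assumption made for illustration; consult the monitoring pages themselves for the supported queries.

    # Hedged sketch: fetch the monitor page for a single job by its PanDA ID.
    import urllib.request

    MONITOR = "http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query"

    def fetch_job_page(panda_id):
        """Return the (HTML) monitor page for one job; 'job' is an assumed
        query key, used here only for illustration."""
        url = "%s?job=%d" % (MONITOR, panda_id)
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")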

  24. Conclusion • The PanDA system is working well on OSG • 15M events processed (through all steps) in 8 months • Many analysis users are now using PanDA • An experienced PanDA operations team is in place • The development team continues to improve PanDA • Hot off the presses: • New PanDA-OSG joint software development was funded (PI: Torre Wenaus) – will add 2.5 FTE to the software team • Two new Tier 2s selected in the U.S. – SLAC, Great Lakes • Dedicated ATLAS CPUs on OSG are expected to double by the end of the year; storage will increase by an even larger factor

  25. Other URLs o’ Interest… • PanDA system main web page: https://twiki.cern.ch/twiki/bin/view/Atlas/Panda • U.S. ATLAS computing: http://www.usatlas.bnl.gov/USATLAS_TEST/Physics.shtml • The ATLAS computing “workbook”: https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBook • Horst’s OU Grid / Tier 2 page: http://www-hep.nhn.ou.edu/atlas/grid/
