Distributed Computing: A Status Report

This presentation discusses challenges and solutions in distributed computing readiness for data analysis, highlighting production, data flow, software, and capacity projections. Insights and recommendations are shared based on past experiences and evolving models for improved production efficiency.

Presentation Transcript


  1. Distributed Computing: A Status Report. Kaushik De, University of Texas at Arlington. Tier 2/Tier 3 Meeting, SLAC, November 28, 2007

  2. Introduction
     • For this talk: Distributed computing == Production and Distributed Analysis (DA)
     • Question: are we ready for data in ~6 months?
     • What are the biggest challenges to Distributed Computing?
     • Lessons learned from 2 years of continuous production, distributed analysis experience, and data management
     • In this talk, I will concentrate on path to readiness
     • No details on future production plans (FDR etc) or computing model: covered in Jim’s talk
     • Also Rob and Michael’s talk on facilities organization
     • I will concentrate on open issues, leading to discussion
     • Discuss process and functional requirements

  3. Some High Level Issues
     • Can we handle MC production and data flow / data processing from ATLAS simultaneously?
     • Past experience from SC and M* raised many issues – work needed
     • How much do we have to scale up?
     • Expect number of users to increase by factor of ~5?
     • Tier 2 resources will rise by factor of ~4?
     • Software releases/patches by factor of 10? Validation?
     • Missing functionalities?
     • Integrating Tier 3’s into computing model?
     • User analysis at Tier 2’s? Interactive analysis - Proof?
     • Recall Jim’s point – in the U.S. Tier boundaries are not rigid
     • BNL Tier 1 is also Tier 2 (MC prod, DA), and Tier 3 (Proof)
     • Tier 2’s do reprocessing (T1 task) and provide Tier 3 functionalities

  4. Site Hierarchy for Production

  5. Capacity Projections

  6. Production Facilities Issues
     • All Tier 2’s are running in steady production mode
     • But Tier 2’s need to scale up by factor of four in ~6 months
     • Tier 1 needs to scale up by a factor of three
     • Support user analysis at T2’s
     • Queues need to be set up – no road blocks anticipated
     • Need AOD replication - urgently
     • Interactive analysis – beyond Proof of concept!
     • Tier 3 contributions to production
     • Issues emerging with data transfer model (uberftp)
     • Working on solutions (pending SRM v2.2)
     • Tier 3 data distribution for end user analysis
     • We should not forget networking
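The scale-up targets on this slide imply a steep month-over-month growth rate. As a rough illustration only, the sketch below assumes smooth compound growth over exactly six months (the slide says "~6 months" and gives no schedule); the function name is made up for this example.

```python
def monthly_growth(total_factor: float, months: int) -> float:
    """Constant month-over-month factor that compounds to total_factor."""
    return total_factor ** (1.0 / months)

# Targets quoted on the slide; the 6-month horizon is approximate.
tier2 = monthly_growth(4, 6)  # Tier 2: factor of four
tier1 = monthly_growth(3, 6)  # Tier 1: factor of three
print(f"Tier 2: x{tier2:.2f} per month, Tier 1: x{tier1:.2f} per month")
```

Under these assumptions, Tier 2 capacity would have to grow by roughly a quarter every month, and Tier 1 capacity by roughly a fifth.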

  7. Production Software
     • PanDA – now mature product with stable team
     • But software is changing daily now to support non-U.S. sites: development team is stretched to limit
     • Need to rapidly expand/integrate production/support team
     • New architecture supports multiple Panda servers
     • Pathena – working very well
     • Users love it – had to increase CPU’s by factor of three recently
     • User support is becoming urgent issue – need to form new team
     • So far, shift team and developers provide support – will not scale
     • Challenge – scaling up from one to ten clouds
     • Scaling from one to two (adding Canada) was easy
     • Scaling from two to four (adding UK, France) going very slowly
     • Took 2 years to achieve smooth U.S. operations, ~6 months for rest

  8. Data Production/Processing
     • ATLAS managed production (MC, reprocessing)
     • Historically, U.S. has contributed ~25% of MC production
     • Tier 1 and Tier 2’s provide dedicated queues and storage for this
     • Physics groups directly manage task requests (we will have quotas/allocations per group arbitrated by RAC)
     • Detector, calibration, particle ID, test beam, commissioning… groups… will also have allocations
     • Regional U.S. production
     • Same as ATLAS managed production – physics groups define tasks needed by U.S. physicists with special group name (ex. ushiggs)
     • Panda manages quota (currently 20-25% for U.S. production)
     • So far, U.S. physicists have been slow in taking advantage of this (less than 25% of the quota allocated by RAC is being used)

  9. Panda Production Statistics (CSC = Computing System Commissioning)

  10. Panda Central Production (two panels: since 1/1/06 and since 10/1/07)

  11. Data Location Model
     • Tier 1 – main repository of data (MC & Primary)
     • Store complete set of ESD, AOD, AANtuple & TAG’s on disk
     • Fraction of RAW and all U.S. generated RDO data
     • Tier 2 – repository of analysis data
     • Store complete set of AOD, AANtuple & TAG’s on disk
     • Complete set of ESD data divided among 5 Tier 2’s
     • Data distribution to Tier 1 & Tier 2’s is managed
     • Tier 3 – unmanaged data matching local interest
     • Data through locally initiated subscriptions
     • Mostly AANtuple’s, some AOD’s
     • Tier 3’s will be associated with Tier 2 sites?
     • Tier 3 model is still not fully developed – evolving
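The placement rules on this slide can be written down as a small lookup table plus a splitting step. This is a minimal sketch, not the actual distribution logic: the slide does not say how the ESD set is divided among the five Tier 2's, so round-robin assignment is an assumption, and the site and dataset names are hypothetical.

```python
# Data types each tier keeps on disk, as listed on the slide.
# Tier 1 additionally holds a fraction of RAW and all U.S. generated RDO.
TIER_DATA = {
    "Tier 1": ["ESD", "AOD", "AANtuple", "TAG"],  # complete sets
    "Tier 2": ["AOD", "AANtuple", "TAG"],         # complete sets
    "Tier 3": ["AANtuple", "AOD"],                # local subscriptions only
}

def split_esd(datasets, sites):
    """Assign ESD datasets round-robin so the sites jointly hold one full copy."""
    shares = {site: [] for site in sites}
    for i, name in enumerate(datasets):
        shares[sites[i % len(sites)]].append(name)
    return shares

# Hypothetical site and dataset names, for illustration only.
sites = [f"T2_{n}" for n in range(1, 6)]
datasets = [f"esd_{n:03d}" for n in range(12)]
shares = split_esd(datasets, sites)
for site, held in shares.items():
    print(site, held)
```

Round-robin keeps the shares within one dataset of each other; a real scheme would more likely weight the split by each site's disk capacity.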

  12. Storage Management
     • Tier 1 storage systems
     • Disk storage projected to grow by factor of three in ~6 months
     • Additional funding also expected from management reserve
     • During past few months many new issues have emerged
     • Disk/tape dCache pools – the default at BNL
     • Allows unlimited space – write pools automatically push data to tape
     • Does not work well for small files (log files), or volatile user output
     • Solution: new disk only pool was set up recently
     • Remaining issue: need tools to manage space (no longer infinite)
     • Does not work well for computing model – AOD, RDO, Evgen, DPD etc need to be on disk (large fraction of these got pushed to tape)
     • Solution: software to manage ‘pinning’ will be rolled out soon
     • Have not tackled Tier 2 storage issues yet!
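One way to read the two fixes on this slide is as a routing rule for new files. The sketch below is purely illustrative and is not the dCache API or the BNL configuration: the pool labels, the size cutoff, and the function are all hypothetical, and the list of disk-resident data types is incomplete because the slide ends it with "etc".

```python
# Data types the slide says must stay on disk (the slide's list ends in "etc").
DISK_RESIDENT_TYPES = {"AOD", "RDO", "Evgen", "DPD"}
SMALL_FILE_BYTES = 10 * 1024 * 1024  # hypothetical cutoff, not from the talk

def choose_pool(data_type: str, size_bytes: int, volatile: bool = False) -> str:
    """Return a pool label for a new file under the assumed policy."""
    # Small files (e.g. logs) and volatile user output avoid tape entirely.
    if volatile or size_bytes < SMALL_FILE_BYTES:
        return "disk-only"
    # Analysis-critical types go to disk/tape but are pinned on disk.
    if data_type in DISK_RESIDENT_TYPES:
        return "disk/tape (pinned)"
    # Everything else may be pushed to tape by the write pool.
    return "disk/tape"

print(choose_pool("log", 2_000))
print(choose_pool("AOD", 10**9))
print(choose_pool("RAW", 10**9))
```

The disk-only branch is where the "no longer infinite" space problem appears: anything routed there needs explicit cleanup or quota tooling.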

  13. Data Management
     • DQ2 is on critical path
     • Many performance issues have been identified through operations
     • Central server load issues – still problem after Oracle migration
     • Fetcher performance issues – incomplete datasets, QoS
     • Essential features needed soon: hierarchical (container) datasets, lost file flag, tape handling…
     • We expect rapid improvements via new ADC organization
     • Expect higher priority for production and DA needs – since Panda was chosen for ATLAS wide use
     • Panda not using DQ2 for input file transfers – PandaMover
     • Need to integrate PandaMover with DQ2
     • Test and implement LFC in the U.S.
     • Support problems with LRC used in U.S. – diverging from DQ2

  14. Distributed Analysis Challenges
     • DA usage rising rapidly
     • Works very well – except data availability issues
     • But only available at BNL
     • We increased CPU allocation from ~200 to ~700 recently
     • Still not sufficient – ex. ~30k jobs waiting to run right now!
     • We need to bring Tier 2’s rapidly into DA activities
     • Show stopper: availability of AOD files
     • Also need: dedicated analysis queues, moving along well
     • Interactive analysis
     • Primarily expected at Tier 3’s
     • BNL and Wisconsin tests with Proof encouraging
     • Need to scale up and deploy rapidly
     • Many issues to understand: scaling, multi-user, data movement
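The backlog figure on this slide can be put in perspective with a back-of-the-envelope estimate. The sketch assumes one job per CPU at a time, jobs of roughly equal length, and no new submissions while the queue drains; none of these assumptions comes from the talk, and the function name is made up.

```python
import math

def waves_to_drain(waiting_jobs: int, cpus: int) -> int:
    """Passes over all CPUs needed to run every waiting job, one job per CPU."""
    return math.ceil(waiting_jobs / cpus)

# Approximate numbers quoted on the slide.
before = waves_to_drain(30_000, 200)  # old allocation
after = waves_to_drain(30_000, 700)   # new allocation
print(f"{before} waves at ~200 CPUs, {after} waves at ~700 CPUs")
```

Even at the larger allocation, each CPU would have to work through about 43 jobs in sequence, which is the arithmetic behind bringing Tier 2 sites into distributed analysis.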

  15. Process and Requirements
     • Well organized in U.S.: both development and operations
     • Tier 1 and Tier 2 requirements well understood
     • Functional requirements evolving through production operations and facilities integration program
     • Need to adapt quickly as we scale up rapidly
     • Some reorganization of operations will be needed
     • Need user support team
     • Often, issues are beyond U.S. control – software, trf, DQ2 etc: need help from new ADC organization

  16. Summary
     • Distributed computing working well in the U.S.
     • But many challenges still to overcome in short time
     • Expanding Panda ATLAS-wide is big task – but will help ATLAS in the long run
     • DDM and storage issues on critical path
     • Tier 2’s need to expand roles beyond MC production
     • Will everything be ready in ~6 months: still open question
     • Always, new people and new ideas are welcome

  17. Production – Live! http://panda.atlascomp.org/?dash=prod
