Status of DØ Computing at UTA
DoE Site Visit, Nov. 13, 2003
Jae Yu, University of Texas at Arlington

Outline:
• Introduction
• The UTA–DØ Grid team
• DØ Monte Carlo Production
• The DØ Grid Computing: DØRAC, DØSAR, DØGrid
• Software Development Effort
• Impact on Outreach and Education
• Conclusions
Introduction
• UTA has been producing DØ MC events as the US leader
• UTA led the effort to:
  • Start remote computing at DØ
  • Define the remote computing architecture at DØ
  • Implement the remote computing design at DØ in the US
  • Leverage its experience as the ONLY active US DØ MC farm  this is no longer the case
• UTA is the leader in the US DØ Grid effort
• The UTA DØ Grid team has been playing a leadership role in monitoring software development
The UTA–DØGrid Team
• Faculty: Jae Yu, David Levine (CSE)
• Research Associate: HyunWoo Kim
  • SAM-Grid expert
  • Development of the McFarm SAM-Grid job manager
• Software Program Consultant: Drew Meyer
  • Development, improvement, and maintenance of McFarm
• CSE Master's Degree Student: Nirmal Ranganathan
  • Investigation of resource needs in Grid job execution
• EE M.S. Student: Prashant Bhamidipati
  • MC farm operation and McPerM development
• PHY Undergraduate Student: David Jenkins
  • Taking over MC farm operation; development of the monitoring database
• Graduated:
  • Three CSE MS students  all now in industry
  • One CSE undergraduate  now in the MS program at U. of Washington
UTA DØ MC Production
• Two independent farms:
  • Swift farm (HEP): 36 P3 866 MHz CPUs, 250 MB per CPU, 0.6 TB of total disk space
  • CSE farm: 12 P3 866 MHz CPUs
• McFarm as our production control software
• Statistics (11/1/2002 – 11/12/2003):
  • Produced: ~10M events
  • Delivered: ~8M events
What do we want to do with the data?
• We want to analyze the data no matter where we are!!!
• Location- and time-independent analysis
DØ Data Taking Summary
• 30–40M events/month
What do we need for efficient data analysis in a HEP experiment?
• The total expected data size is ~4 PB (4 million GB, or 100 km of 100 GB hard drives)!!!
• Detectors are complicated  many people are needed to construct them and make them work
• The collaboration is large and scattered all over the world
• Allow software development at remote institutions
• Optimized resource management, job scheduling, and monitoring tools
• Efficient and transparent data delivery and sharing
DØ Collaboration
• 650 collaborators
• 78 institutions
• 18 countries
Old Deployment Models
• Started with the Fermilab-centric SAM infrastructure in place…
• …now transitioning to a hierarchically distributed model
DØ Remote Analysis Model (DØRAM)
[Hierarchy diagram: the Central Analysis Center (CAC) connects to Regional Analysis Centers (RACs), which connect to Institutional Analysis Centers (IACs), which connect to Desktop Analysis Stations (DAS); the links are marked as normal or occasional interaction/communication paths.]
What is a DØRAC?
• A large, concentrated computing resource hub
• An institute willing to provide storage and computing services to a few small institutes in the region
• An institute capable of providing increased infrastructure as the data from the experiment grows
• An institute willing to provide support personnel
• Complementary to the central facility
DØ Southern Analysis Region (DØSAR)
[Map of DØSAR sites: KSU, OU/LU, KU, UAZ, Ole Miss, UTA, LTU, Rice, and Mexico/Brazil.]
• The first US region, centered around the UTA RAC
• It is a regional virtual organization (RVO) within the greater DØ VO!!
SAR Institutions
• First Generation IACs:
  • Langston University
  • Louisiana Tech University
  • University of Oklahoma
  • UTA
• Second Generation IACs:
  • Cinvestav, Mexico
  • Universidade Estadual Paulista, Brazil
  • University of Kansas
  • Kansas State University
• Third Generation IACs:
  • Ole Miss, MS
  • Rice University, TX
  • University of Arizona, Tucson, AZ
Goals of the DØ Southern Analysis Region
• Prepare institutions within the region for grid-enabled analyses using the RAC at UTA
• Enable IACs to contribute to the experiment as much as they can, including MC production and data re-processing
• Provide grid-enabled software and computing resources to the DØ collaboration
• Provide regional technical support and help new IACs
• Perform physics data analyses within the region
• Discover and draw in more computing and human resources from external sources
SAR Workshops
• Semiannual workshops to promote healthy regional collaboration and to share expertise
• Two workshops held so far:
  • April 18–19, 2003 at UTA: ~40 participants
  • Sept. 25–26, 2003 at OU: 32 participants
• Each workshop had different goals and outcomes:
  • Established SAR, RAC, and IAC web pages and e-mail lists
  • Identified institutional representatives
  • Enabled three additional IACs for MC production
  • Paired new institutions with existing ones
SAR Strategy
• Set up all IACs with the full DØ software installation (DØRACE Phases 0–IV)
• Install the Condor (or PBS) batch control system on desktop farms or clusters (see the sketch after this list)
• Install the McFarm MC production control software
• Produce MC events on IAC machines
• Install Globus for monitoring-information transfer
• Install SAM-Grid and interface McFarm to it
• Submit jobs through SAM-Grid and monitor them
• Perform analysis at each individual's desk
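To make the batch-system step concrete, here is a minimal sketch of a Condor submit description for a single MC job; the wrapper script run_mcfarm_job.sh and its request/event arguments are hypothetical placeholders, not the actual McFarm interface.

    # Minimal Condor submit description for one MC production job (sketch).
    # run_mcfarm_job.sh and its arguments are hypothetical placeholders.
    universe   = vanilla
    executable = run_mcfarm_job.sh
    arguments  = --request 12345 --events 10000
    output     = mcjob_$(Cluster).out
    error      = mcjob_$(Cluster).err
    log        = mcjob_$(Cluster).log
    queue 1

Submitting such a file with condor_submit and watching it with condor_q would drive one job through the local farm; a production controller like McFarm would automate this per MC request.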
SAR Software Status
• Up to date with DØ releases
• McFarm MC production control
• Condor or PBS as batch control
• Globus v2.x for grid-enabled communication
• Globus and DOE SG certificates obtained and installed (a proxy-creation sketch follows this list)
• SAM-Grid on two of the farms (the UTA IAC farms)
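With the certificates installed, everyday grid operations start from a short-lived proxy credential; a minimal sketch using standard Globus Toolkit 2 commands:

    # Inspect the installed grid certificate and create a short-lived proxy
    grid-cert-info -subject      # print the certificate's distinguished name
    grid-proxy-init -hours 24    # generate a 24-hour proxy credential
    grid-proxy-info              # verify the proxy subject and remaining lifetime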
UTA Software for SAR
• McFarm job control
  • All DØSAR institutions use this product for automated MC production
• Ganglia resource monitoring
  • Covers 7 clusters (332 CPUs), including the Tata Institute, India
• McFarmGraph: MC job status monitoring system using GridFTP (a transfer sketch follows this list)
  • Provides detailed information for each MC request
• McPerM: MC farm performance monitoring
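Under the hood, a McFarmGraph-style status fetch over GridFTP amounts to a transfer like the one below; the host name and file path are hypothetical examples, not the actual McFarmGraph layout.

    # Hypothetical GridFTP fetch of a job-status file from a farm node
    grid-proxy-init -hours 12
    globus-url-copy gsiftp://farm.uta.edu/mcfarm/status/request_12345.log \
                    file:///tmp/request_12345.log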
Ganglia Grid Resource Monitoring
[Screenshot of the Ganglia monitoring display; an annotation marks the 1st SAR workshop.]
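Besides the stock host metrics, Ganglia lets a farm publish custom values; a hedged sketch using the standard gmetric tool, with a made-up metric name and value:

    # Publish a custom farm metric to Ganglia (name and value are hypothetical)
    gmetric --name=mc_events_per_hour --value=4200 --type=uint32 --units=events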
Job Status Monitoring: McFarmGraph
Farm Performance Monitor: McPerM
[Plot of farm performance; an annotation marks the increased productivity.]
UTA RAC and Its Status
• NSF MRI-funded facility
  • Joint proposal of UTA HEP and CSE plus UTSW Medical: 2 HEP, 10 CSE, and 2 UTSW Medical participants
• Core system (high-throughput research system):
  • CPU: 64 P4 Xeon 2.4 GHz (total ~154 GHz)
  • Memory & NIC: 1 GB and one 1 Gbit/sec port per CPU (64 GB total)
  • Storage: 5 TB Fibre Channel served by 3 GFS servers (3 Gbit/sec throughput)
  • Network: Foundry switch with 52 Gbit/sec + 24 100 Mbit/sec ports
• Expansion system (high CPU cycle, large-storage grid system):
  • CPU: 100 P4 Xeon 2.6 GHz (total ~260 GHz)
  • Memory & NIC: 1 GB and one 1 Gbit/sec port per CPU (100 GB total)
  • Storage: 60 TB IDE RAID served by 10 NFS servers
  • Network: 52 Gbit/sec
• The full facility went online on Oct. 31, 2003
  • Software installation in progress
  • Plan to participate in the SC2003 demo next week
Just to Recall, Two Years Ago….
• IDE hard drives cost ~$2.5/GB
• Each IDE RAID array provides ~1.6 TB and is hot-swappable
• Can be configured with up to 10–16 TB in a rack
• A modest server can manage the entire system
• A Gbit network switch provides high-throughput transfer to the outside world
• A flexible and scalable system
  • Needs an efficient monitoring and error-recovery system
  • Communication to resource management
[Diagram: a disk server and multiple IDE RAID arrays connected through a Gbit switch.]
UTA DØRAC
• 84 P4 Xeon 2.4 GHz CPUs = 202 GHz
• 7.5 TB of disk space
• 100 P4 Xeon 2.6 GHz CPUs = 260 GHz
• 64 TB of disk space
• Total CPU: 462 GHz
• Total disk: 73 TB
• Total memory: 168 GB
• Network bandwidth: 54 Gbit/sec
SAR Accomplishments
• Held two workshops; a third is planned
• All first-generation institutions produce MC events using McFarm on desktop PC farms
  • Generated MC events: OU: 300k, LU: 250k, LTU: 150k, UTA: ~1.3M
• Discovered additional resources
• Significant local expertise has been accumulated in running farms and producing MC events
• Produced several documents, including two DØ notes
• Hold regular bi-weekly meetings (via VRVS) to keep up progress
• Working toward data re-processing
SAR Computing Resources
SAR Plans
• Four second-generation IACs have been paired with four first-generation institutions
• Success is defined as:
  • Regular production and delivery of MC events to SAM using McFarm
  • Installing SAM-Grid and performing a simple SAM job
• Add all these new IACs to Ganglia, McFarmGraph, and McPerM
• Discover and integrate more resources for DØ:
  • Integrate OU's OSCER cluster
  • Integrate other institutions' large, university-wide resources
• Move toward grid-enabled regional physics analyses
  • Collaborators need to be educated in using the system
Future Software Projects
• Preparation of the UTA DØRAC equipment for:
  • MC production (DØ is suffering from a shortage of resources)
  • Re-reconstruction
  • SAM-Grid
• McFarm:
  • Integration of re-processing
  • Enhanced monitoring
  • Better error handling
• McFarm interface to SAM-Grid (job_manager):
  • Initial script successfully tested for the SC2003 demo
  • Work with the SAM-Grid team on the monitoring database and the integration of McFarm technology
• Improvement and maintenance of McFarmGraph and McPerM
• Universal graphical user interface to the Grid (PHY PhD student)
SAR Physics Interests
• OU/LU:
  • EWSB/Higgs searches
  • Single top search
  • CP violation / rare decays in heavy flavors
  • SUSY
• LTU:
  • Higgs search
  • b-tagging
• UTA:
  • SUSY
  • Higgs searches
  • Diffractive physics
• Diverse topics, but common samples can be defined
Funding at SAR
• Hardware support:
  • UTA RAC: NSF MRI
  • UTA IAC: DoE + local funds, totally independent of RAC resources
  • Need more hardware to adequately support desktop analyses utilizing RAC resources
• Software support:
  • Mostly UTA local funding  will run out this year!!!
  • Many attempts at different funding sources, but none has worked
• We seriously need help to:
  • Maintain the leadership in DØ remote computing
  • Maintain the leadership in grid computing
  • Realize the DØRAM and expeditious physics analyses
Tevatron Grid Framework: SAM-Grid
• DØ already has the data-delivery part of a Grid system (SAM)
• The project started in 2001 as part of the PPDG collaboration to handle DØ's expanded needs
• The current SAM-Grid team includes Andrew Baranovski, Gabriele Garzoglio, Lee Lueking, Dane Skow, Igor Terekhov, Rod Walker (Imperial College), Jae Yu (UTA), Drew Meyer (UTA), and HyunWoo Kim (UTA), in collaboration with the U. Wisconsin Condor team
• http://www-d0.fnal.gov/computing/grid
• UTA is working on developing an interface from McFarm to SAM-Grid
  • This brings all the SAR institutions, plus any institution running McFarm, into the DØGrid
Fermilab Grid Framework (SAM-Grid)
[Architecture diagram of the SAM-Grid framework; UTA appears as one of the sites.]
UTA–FNAL CSE Master's Student Exchange Program
• Establishing usable Grid software on the DØ time scale requires highly skilled software developers
  • FNAL cannot afford computer professionals
  • The UTA CSE department has 450 MS students  many are highly trained but back at school due to the economy
• Students can participate in cutting-edge Grid computing topics in a real-life setting
• A student's Master's thesis becomes a well-documented record of the work, something that is lacking in many HEP computing projects
• The third generation of students is at FNAL, working on the improvement of SAM-Grid and its implementation, on a two-semester rotation period
• The previous two generations have made a significant impact on SAM-Grid:
  • One of the four previous-generation students is in the PhD program at CSE
  • One is on the Wisconsin Condor team  with the possibility of moving into a PhD program
  • Two are in industry
Impact on Education and Outreach
• The UTA DØ Grid program has:
  • Trained: 12 (10 MS + 1 undergraduate) students
  • Graduated: 5 CSE Masters + 1 undergraduate
• CSE Grid course: many class projects on DØ
• QuarkNet:
  • UTA is one of the founding institutions of the QuarkNet program
  • Initiated the TECOS project
  • Other school-rooftop cosmic-ray projects across the nation need storage and computing resources  a QuarkNet Grid
  • Will be working with QuarkNet on data storage and the eventual use of computing resources by teachers and students
• UTA recently became a member of the Texas grid (HiPCAT):
  • HEP is leading this effort
  • Strongly supported by the university
  • Expect a significant increase in infrastructure, such as bandwidth
Conclusions
• The UTA DØ Grid team has accomplished a tremendous amount
• UTA played a leading role in DØ remote computing:
  • MC production
  • Design of the DØ Grid architecture
  • Implementation of the DØRAM
• The DØ Southern Analysis Region is a great success:
  • Four new institutions (3 US) are now MC production sites
  • Enabled exploitation of available intelligence and resources in an extremely distributed environment
  • Remote expertise is being accumulated
• The UTA DØRAC is up and running  software installation in progress
  • It will soon add significant resources to SAR and to DØ
• The SAM-Grid interface to McFarm is working  one step closer to establishing a globalized grid
• The UTA–FNAL MS student exchange program is very successful
• The UTA DØ Grid computing program has a significant impact on outreach and education
• UTA is the ONLY US DØ institution that has been playing a leading role in the DØ grid  this makes UTA unique
• The local support runs out this year!! UTA needs support to maintain its leadership in, and support for, DØ remote computing