“UCF”: Computing Capabilities at UNM HPC
Timothy L. Thomas, UNM Dept of Physics and Astronomy
I have a 200K SU (150K LL CPU hour) grant from the NRAC of the NSF/NCSA, with which UNM HPC (“AHPCC”) is affiliated.
(Stolen from Andrew…) Peripheral Data vs. Simulation
Simulation: muons from central HIJING (QM02 Project07). Data: centrality by Perp > 60.
(Stolen from Andrew…) Simulated Decay Muons
• QM’02 Project07 PISA files (central HIJING).
• Closest cuts possible from the PISA file to match data (parent pT > 1 GeV/c, parent origin theta 155-161). Investigating the possibility of keeping only muon and parent hits for reconstruction.
• 17,100 total events distributed over Z = ±10, ±20, ±38; more events are available, but they only matter for the smallest error bar.
• [Plot labels: Zeff ~75 cm; “Not in fit”.]
• Selection (applied as sketched below):
  "(IDPART==5 || IDPART==6) && IDPARENT > 6 && IDPARENT < 13 && PTHE_PRI > 155 && PTHE_PRI < 161 && IPLANE == 1 && IARM == 2 && LASTGAP > 2002 && PTOT_PRI*sin(PTHE_PRI*acos(0)/90.) > 1."
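For concreteness, a minimal ROOT/C++ sketch of how a selection like the one quoted above might be applied to a PISA ntuple. The file name (pisa_muons.root) and tree name (T) are placeholders, not the actual QM’02 Project07 file layout; only the cut string itself comes from this slide.

    // select_decay_muons.C -- ROOT macro sketch (file/tree names are hypothetical)
    void select_decay_muons()
    {
        TFile *f = TFile::Open("pisa_muons.root");   // placeholder PISA ntuple file
        TTree *t = (TTree*) f->Get("T");             // placeholder tree name
        TCut muons =
            "(IDPART==5 || IDPART==6) && IDPARENT>6 && IDPARENT<13 && "
            "PTHE_PRI>155 && PTHE_PRI<161 && IPLANE==1 && IARM==2 && "
            "LASTGAP>2002 && PTOT_PRI*sin(PTHE_PRI*acos(0)/90.)>1.";
        // Histogram the parent total momentum for tracks passing the selection
        t->Draw("PTOT_PRI", muons);
    }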
Now at UNM HPC:
• PBS
• Globus 2.2.x
• Condor-G / Condor
• (GDMP)
• …all supported by HPC staff. (A Condor-G submit sketch follows below.)
In progress: a new 1.2 TB RAID 5 disk server, to host:
• AFS cache; PHENIX software
• ARGO file catalog (PostgreSQL)
• Local Objectivity mirror
• Globus 2.2.x (GridFTP and more…)
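To illustrate how these pieces fit together, a Condor-G submit description of that era could route a job through the Globus gatekeeper to PBS roughly as below. This is a sketch only: the gatekeeper host (the LLDIMU front end named elsewhere in these slides) and the executable are assumptions, not actual UNM HPC settings.

    # condor_submit description (sketch; host and executable are hypothetical)
    universe        = globus
    globusscheduler = lldimu.hpc.unm.edu/jobmanager-pbs
    executable      = run_pisa_job.sh
    output          = job.$(Cluster).out
    error           = job.$(Cluster).err
    log             = job.log
    queue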
Pre-QM2002 experience with globus-url-copy…
• Easily saturated UNM bandwidth limitations (as they were at that time).
• PKI infrastructure and sophisticated error handling are a real bonus over bbftp.
• (One bug, known at the time, is being / has been addressed.)
(Plot at left: 10 parallel streams; throughput in KB/sec. An example command follows below.)
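For reference, a transfer like those described here might be launched as follows; the local path and the remote GridFTP endpoint are placeholders, and -p 10 requests the ten parallel streams shown in the plot.

    # endpoints are hypothetical; -p sets the number of parallel TCP streams
    globus-url-copy -p 10 \
        file:///scratch/phenix/simprdf_run01.dat \
        gsiftp://gridftp.somewhere.bnl.gov/phenix/sim/simprdf_run01.dat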
Multi-jet cross-section (theory) calculations, run using Condor(/PVM)… Three years of accumulated CPU time on desktop (MOU) machines at HPCERC and at the University of Wisconsin. Very CPU-intensive calculations: 6- and 9-dimensional Monte Carlo integrations. A typical job runs for a week and produces only about 100 KB of output histograms, such as those displayed here.
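As a purely illustrative sketch of why such jobs are CPU-bound yet produce tiny output: a generic 6-dimensional Monte Carlo integration spends essentially all of its time sampling the integrand and emits only a few summary numbers. This is not the actual multi-jet cross-section code; the integrand below is a trivial placeholder.

    // mc6d.cc -- generic 6-D Monte Carlo integration (illustrative only)
    #include <cstdio>
    #include <cmath>
    #include <random>

    int main()
    {
        const int  dim = 6;
        const long nSamples = 100000000L;      // CPU time scales with this
        std::mt19937_64 rng(12345);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double sum = 0.0, sum2 = 0.0;
        for (long i = 0; i < nSamples; ++i) {
            double f = 1.0;
            for (int d = 0; d < dim; ++d)
                f *= 2.0 * u(rng);             // placeholder integrand (true integral = 1)
            sum  += f;
            sum2 += f * f;
        }
        const double mean = sum / nSamples;
        const double err  = std::sqrt((sum2 / nSamples - mean * mean) / nSamples);
        std::printf("integral ~ %.6f +- %.6f\n", mean, err);   // a few bytes of output
        return 0;
    }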
LLDIMU.HPC.UNM.EDU
RAID op system issues
• Easy re-installation / update of the op sys
• GRUB or LILO? (MBR or /boot?)
• Machine has an IDE CD-ROM (but not a burner)!!!
• Rescue CDs and/or floppies…
• Independence of the RAID array (1.5 hours for the RAID 5 verification step)
• Should install ext3 on the RAID.
• Partitioning of the system disk (an example layout sketch follows below):
  • Independence of /home area
  • Independence of /usr/local area?
  • Jonathan says: Linux can’t do more than a 2 GB swap partition
  • Jonathan says: / /usr/local/ /home/ (me: /home1/ /home2/ …?)
• NFS issues…
  • Synchronize UID/GIDs between the RAID server and LL.
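One possible concrete reading of the partitioning questions above, written as an /etc/fstab sketch; the device names, mount points, and layout are all hypothetical and only meant to show separate /, /usr/local, and /home areas plus ext3 on the RAID array.

    # /etc/fstab sketch (devices and mount points are hypothetical)
    /dev/sda1   /            ext3   defaults   1 1
    /dev/sda2   /usr/local   ext3   defaults   1 2
    /dev/sda3   /home        ext3   defaults   1 2
    /dev/sda5   swap         swap   defaults   0 0   # <= 2 GB, per the note above
    /dev/md0    /raid        ext3   defaults   1 2   # the 1.2 TB RAID 5 array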
RAID op system issues • Compilers and glibc…
RAID op system issues • File systems… • What quotas? • ext3? (Quotas working OK?) • ReiserFS? (Need special kernel modules for this?)
RAID op system issues • Support for the following apps: • RAID software • Globus… • PHENIX application software • Objectivity • gcc 2.95.3 • PostgreSQL • OpenAFS / Kerberos 4
RAID op system issues • Security issues… • IP#: fixed or DHCP? • What services to run or avoid? • NFS… • Tripwire or equivalent… • Kerberos (for OpenAFS)… • Globus… • ipchains firewall rules (see the sketch below); /etc/services; /etc/xinetd config; etc.
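As one hedged example of what the ipchains item might amount to for the Globus services: in Globus 2.x the GRAM gatekeeper listens on TCP 2119 and GridFTP on TCP 2811, so rules along these lines would be needed (the exact policy, interfaces, and the rest of the chain are left open here).

    # ipchains sketch: accept Globus gatekeeper (2119) and GridFTP (2811)
    ipchains -A input -p tcp -d 0.0.0.0/0 2119 -j ACCEPT
    ipchains -A input -p tcp -d 0.0.0.0/0 2811 -j ACCEPT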
RAID op system issues • Application-level issues… • Which framework? Both? • Who maintains the framework, and how can this job be properly divided up among locals? • SHOULD THE RAID ARRAY BE PARTITIONED, a la the PHENIX counting house buffer boxes’ /a and /b file systems?
Resources
• Assume 90 KB/event and 0.1 GB/hour/CPU (see the check below).
• Filtered events can be analyzed, but not ALL PRDF events.
• Many triggers overlap.
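Taking these two numbers at face value, the implied per-CPU event rate is roughly:

    0.1 GB/hour/CPU ÷ 90 KB/event ≈ 1,100 events/hour/CPU ≈ 3.2 s/event,

which is consistent with the ~3 s/event assumed on the next slide.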
Rough calculation of real-data processing (I/O-intensive) capabilities:
• 10 M events, PRDF-to-{DST+x}, both mut and mutoo; assume 3 sec/event (×1.3 for LL), 200 KB/event.
• One pass: 7 days on 50 CPUs (25 boxes), using 56% of LL local network capacity.
• My 200K “SU” (~150K LL CPU hour) allocation allows for 18 of these passes (4.2 months).
• 3 MB/sec Internet2 connection = 1.6 TB / 12 nights (MUIDN_1D1S&NTCN).
• (Presently) LL is most effective for CPU-intensive tasks: simulations can easily fill the 512 CPUs; e.g., QM02 Project 07.
• Caveats: “LLDIMU” is a front-end machine; the LL worker-node environment is different from CAS/RCS nodes (P. Power…).
(A rough arithmetic check of the pass and allocation figures follows below.)
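A rough consistency check of the one-pass and allocation figures, assuming 50 CPUs running around the clock at the nominal 3 s/event (the ×1.3 LL factor would stretch one pass toward nine days):

    10^7 events × 3 s/event ÷ 50 CPUs ≈ 6 × 10^5 s ≈ 7 days per pass
    50 CPUs × 7 d × 24 h/d ≈ 8,400 CPU-hours/pass; 150,000 ÷ 8,400 ≈ 18 passes ≈ 4.2 months
    10^7 events × 200 KB/event = 2 TB handled per pass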
On UNM Grid Activities (T. L. Thomas)
I have a 200K SU (150K LL CPU hour) grant from the NRAC of the NSF/NCSA, with which UNM HPC (“AHPCC”) is affiliated.
. CPU time used: ~33,000 LosLobos hours
. Number of files handled: >2200
. Data moved to BNL: >0.5 TB (globus-url-copy)
  (NOTE: In 2001, did even more (~110,000 hours), as an exercise…
   see http://thomas.phys.unm.edu/tlt/phenix_simulations/ )
. Comments: [from late summer... but still relevant]
. Global storage and I/O (disk, network) management is a headache; too human-intensive.
  --> Throwing more people at the problem (i.e., giving people accounts at more remote sites) is not a particularly efficient way to solve it.
. A file-naming standard is essential (esp. for database issues).
. I have assembled a (still rough; not included here) standard request form for DETAILED information...
  --> This could be turned into an automatic interface... a PORTAL (to use the buzzword).
. PWG contacts need to assemble as detailed a plan as they can, but without the kinds of system details that are probably going to be changed anyway (e.g., "chunk" size hints welcome but may be ignored).
. Use of varied facilities requires flexibility, including an "ATM" approach
  --> The simulation database needs to reflect this complexity.
. Generator config / management needs to be somewhat more sophisticated.
  --> E.g., random seeds, "back-end" generation.
. A big issue (that others may understand better): the relationship and interface between the simulation database and the other PHENIX databases...
. Multiple levels of logs actually helped bookkeeping!
  --> Perhaps 'pseudo-parallelism' is the way to go.
. Emerging reality (one of the main motivations for "Grid" technology): no one has enough computing when it's needed, but everyone has too much when they don't need it, which is much of the time. More than enough computing to get the work done is out there; you don't need your own! BUT: those resources are "out there," and this must be dealt with.
  ==> PHENIX can and should form its own IntraGrid.
Reality Check #1: Perpetual computing person-power shortage; this pertains to both software production and data production, both real and M.C. Given that, M.C. is presently way too much work.
Simple Vision: Transparently distributed processing should allow us to optimize our use of production computing person-power. Observed and projected massive increases in network bandwidth make this a not-so-crazy idea.
Reality Check #2: What? Distributed real-data reco? Get real! (...?)
Fairly Simple Vision: OK, OK: Implement the Simple Vision for M.C. first and see how that goes. If one can process M.C., then one is perhaps 75% of the way to processing real data. (The Objectivity write-back problem is one serious catch.)
(The following slides are from a presentation that I was invited to give to the UNM-led multi-institutional “Internet 2 Day” this past March…)
Internet 2 and the Grid: The Future of Computing for Big Science at UNM
Timothy L. Thomas, UNM Dept of Physics and Astronomy
Grokking the Grid
Grok, v.: To perceive a subject so deeply that one no longer merely knows it, but rather understands it on a fundamental level. Coined by Robert Heinlein in his 1961 novel Stranger in a Strange Land.
(Quotes from a colleague of mine…)
Feb 2002: “This grid stuff is garbage.”
Dec 2002: “Hey, these grid visionaries are serious!”
So what is a “Grid”?
Ensemble of distributed resources acting together to solve a problem: “The Grid is about collaboration, about people working together.”
• Linking people, computing resources, and sensors / instruments
• The idea is decades old, but the enabling technologies are recent.
• Capacity distributed throughout an infrastructure
• Aspects of Grid computing:
  • Pervasive
  • Consistent
  • Dependable
  • Inexpensive
Virtual Organizations (VOs) • Security implications • Ian Foster’s Three Requirements: • VOs that span multiple administrative domains • Participant services based on open standards • Delivery of serious Quality of Service
High Energy Physics Grids
• GriPhyN (NSF): CS research focusing on virtual data and request planning
  • Virtual Data Toolkit: delivery vehicle for GriPhyN products
• iVDGL: International Virtual Data Grid Laboratory (NSF): a testbed for large-scale deployment and validation
• Particle Physics Data Grid (DOE): Grid-enabling six High-Energy/Nuclear Physics experiments
• EU Data Grid (EDG): application areas…
  • Particle physics
  • Earth and planetary sciences: “Earth Observation”
  • “Biology”
• GLUE: Grid Laboratory Uniform Environment: links US grids to EDG grids
<<< Grid Hype >>> (“Grids: Grease or Glue?”)
Natural Grid Applications
• High-energy elementary particle and Nuclear Physics (HENP)
• Distributed image processing
  • Astronomy…
  • Biological/biomedical research; e.g., pathology…
  • Earth and Planetary Sciences
  • Military applications; e.g., space surveillance
• Engineering simulations (NEES Grid)
• Distributed event simulations
  • Military applications; e.g., SF Express
  • Medicine: distributed, immersive patient simulations (Project Touch)
• Biology: complete cell simulations…
Processing Requirements: Two Examples
• Example 1: High-energy Nuclear Physics
  • 10s of petabytes of data per year
  • 10s of teraflops of distributed CPU power
  • Comparable to today’s largest supercomputers…
Biological Databases: Complex Interdependencies
• Domino effect in data publishing
• Efficiently keep many versions
[Figure: flow of data among biological databases: GenBank, EMBL, DDBJ, Swissprot, Transfac, EpoDB, BEAD, GAIA, GERD, TRRD]
(Yong Zhao, University of Chicago)