Louisiana Tech Site Report
DOSAR Workshop VII, April 2, 2009 (www.dosar.org)
Michael S. Bryant, Systems Manager, CAPS/Physics, Louisiana Tech University
DOSAR Workshop VII COMPUTING IN LOUISIANA: Louisiana Tech University and LONI
DOSAR Workshop VII Researchers at Louisiana Tech • High Energy Physics and Grid computing • Dr. Dick Greenwood, Dr. Lee Sawyer, Dr. Markus Wobisch • Michael Bryant (staff) • High Availability and High Performance Computing • Dr. Chokchai (Box) Leangsuksun • Thanadech “Noon” Thanakornworakij (Ph.D. student) • LONI Institute Faculty • Dr. Abdelkader Baggag (Computer Science) • Dr. Dentcho Genov (Physics, Electrical Engineering)
DOSAR Workshop VII High Performance Computing Initiative • HPCI is a campus-wide initiative to promote HPC and enable a local R&D community at Louisiana Tech • Started by Dr. Chokchai (Box) Leangsuksun in 2007 • Provides a computational infrastructure for local researchers that supports: • GPGPU and PS3 computing • Highly parallel and memory-intensive HPC applications • Sponsored by Intel equipment donations and University Research funding
DOSAR Workshop VII Local Resources at Louisiana Tech • The HPCI infrastructure consists of three primary clusters • Intel 32-bit Xeon cluster (Azul) • 38 nodes (76 CPUs), highly available dual headnodes using HA-OSCAR • Sony PlayStation 3 cluster • 25 nodes (8 cores per PS3 = 200 cores, or SPEs) • Intel 64-bit Itanium 2 (IA-64) cluster • 7 nodes (14 CPUs); the Itanium 2 architecture is usually found in high-end HPC applications • Local LONI computing resources • Dell 5 TF Intel Linux cluster (Painter) • 128 nodes (512 CPUs total) • IBM Power5 AIX cluster (Bluedawg) • 13 nodes (104 CPUs)
DOSAR Workshop VII Louisiana Optical Network Initiative "Louisiana is fast becoming a leader in the knowledge economy. Through LONI, researchers have access to one of the most advanced optical networks in the country, along with the most powerful distributed supercomputing resources available to any academic community." - http://www.loni.org • Over 85 teraflops of computational capacity • Around 250 TB of disk storage and 400 TB of tape • A 40 Gb/sec fiber-optic network connected to the National LambdaRail (10 Gb/sec) and Internet2 • Provides 12 high-performance computing clusters around the state (10 of which are online)
DOSAR Workshop VII LONI Computing Resources • 1 x Dell 50 TF Intel Linux cluster (Queen Bee) • 668 compute nodes (5,344 CPUs), RHEL4 • Two 2.33 GHz quad-core Intel Xeon 64-bit processors per node • 8 GB RAM per node (1 GB/core) • 192 TB Lustre storage • 23rd on the Top500 in June 2007 • Half of Queen Bee's computational cycles are contributed to TeraGrid • 6 x Dell 5 TF Intel Linux clusters housed at 6 LONI sites • 128 compute nodes (512 CPUs), RHEL4 • Two dual-core 2.33 GHz Xeon 64-bit processors per node • 4 GB RAM per node (1 GB/core) • 12 TB Lustre storage • 5 x IBM Power5 AIX supercomputers housed at 5 LONI sites • 13 nodes (104 CPUs), AIX 5.3 • Eight 1.9 GHz IBM Power5 processors per node • 16 GB RAM per node (2 GB/processor) • (Pictured: Dell cluster and IBM Power5)
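For context, jobs on the Dell Linux clusters above are scheduled through PBS. A minimal submission script might look like the sketch below; the queue name, application name, and resource limits are illustrative assumptions, not values taken from the LONI documentation.

    #!/bin/bash
    #PBS -q workq                 # assumed default queue name
    #PBS -l nodes=4:ppn=4         # the 5 TF Dell nodes have 4 cores each
    #PBS -l walltime=02:00:00
    #PBS -N sample_mpi_job
    cd "$PBS_O_WORKDIR"
    # my_mpi_app is a placeholder for the user's MPI executable
    mpirun -np 16 -machinefile "$PBS_NODEFILE" ./my_mpi_app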
DOSAR Workshop VII LONI cluster at Louisiana Tech • Painter: Dell Linux cluster • 4.77 teraflops peak performance • Red Hat Enterprise Linux 4 • 10 Gb/sec InfiniBand network interconnect • Located in the new Data Replication Center • Named in honor of Jack Painter, who was instrumental in bringing Tech's first computer (an LGP-30) to campus in the 1950s • Pictures follow…
DOSAR Workshop VII PetaShare (left) and Painter (right)
DOSAR Workshop VII With the lights off…
DOSAR Workshop VII Front and back of Painter
DOSAR Workshop VII ACCESSING RESOURCES ON THE GRID: LONI and the Open Science Grid
DOSAR Workshop VII OSG Compute Elements • LONI_OSG1: the official LONI CE (osg1.loni.org) • Located at LSU • OSG 0.8.0 production site • Managed by LONI staff • Connected to the "Eric" cluster • Opportunistic PBS queue: 64 of Eric's 512 CPUs (the 16 nodes are shared with other PBS queues) • LONI_LTU (not yet active): the LaTech CE (ce1.grid.latech.edu) • Located at Louisiana Tech • OSG 1.0 production site • Managed by LaTech staff • Connected to the "Painter" cluster • Opportunistic PBS queue • Highly available • Successor to LTU_OSG (caps10)
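As a quick sanity check, either gatekeeper can be exercised with a simple Globus test job. The jobmanager names below are the common OSG defaults and may differ on these particular sites.

    grid-proxy-init                                              # create a short-lived proxy certificate
    globus-job-run osg1.loni.org/jobmanager-fork /bin/hostname
    globus-job-run ce1.grid.latech.edu/jobmanager-pbs /usr/bin/id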
DOSAR Workshop VII Current Status of LONI_OSG1 • Installed OSG 0.8.0 at the end of February 2008 • Running DZero jobs steadily • Need to set up local storage to increase SAMGrid job efficiency • PetaShare with BeStMan in gateway mode • Waiting to test at LTU before deploying at LSU • (Chart: weekly MC production)
DOSAR Workshop VII DZero Production in Louisiana • In roughly four years (2004-2008), we produced 10.5 million events • In just one year with LONI resources, we have produced 5.97 million events • Note: caps10 (LTU_OSG) is included, but with minimal impact • (Chart: cumulative MC production)
DOSAR Workshop VII Current Status of LONI_LTU • Installed Debian 5.0 (lenny) on two old compute nodes • Xen 3.2.1 • Virtualization hypervisor, or virtual machine monitor (VMM) • DRBD 8.0 (Distributed Replicated Block Device) • Think "network RAID1" • Allows an active/active setup with a cluster filesystem (GFS or OCFS) • ext3 can only be used in an active/passive setup, unless managed by the Heartbeat CRM (which maintains an active/passive setup) • Heartbeat 2.1 • Part of the Linux-HA project • Node failure detection and fail-over/fail-back software
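A minimal sketch of the DRBD resource definition for the dual-primary (active/active) setup described above, assuming DRBD 8.x syntax; host names, backing devices, and addresses are placeholders.

    resource r0 {
      protocol C;                    # fully synchronous replication
      net {
        allow-two-primaries;         # required for active/active with a cluster filesystem
      }
      startup {
        become-primary-on both;
      }
      on xen-node1 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on xen-node2 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }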
DOSAR Workshop VII Xen High Availability Architecture • (Diagram: two Xen dom0 hosts, each running domUs on DRBD active/active storage, with Heartbeat monitoring the pair)
DOSAR Workshop VII HA Test Results • Tests of Xen live migration of VMs running on a DRBD partition were very successful • The SSH connection was not lost during migration • Ping round-trip times rose, but we observed only 1-2% packet loss • The complete state of the system was moved • Due to the lack of a proper fencing device, a split-brain situation was not tested; this occurs when both nodes think the other has failed and both start the same resources • A fencing mechanism ensures only one node is online by powering off (or rebooting) the failed node • HA clusters without a fencing mechanism are NOT recommended for production use • Plan to test with a real CE soon
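For reference, the live migrations in these tests would have been driven by a command of this form (the domain and target host names are placeholders); Xen 3.x also requires the relocation server to be enabled in xend-config.sxp on the destination dom0.

    # move a running guest from this dom0 to the peer without shutting it down
    xm migrate --live ce-domU xen-node2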
DOSAR Workshop VII Building a Tier3 Grid Services Site: a robust, highly available grid infrastructure
DOSAR Workshop VII Tier3 Grid Services (T3gs) • We are focused on building a robust, highly available grid infrastructure at Louisiana Tech for USATLAS computing and analysis. • OSG Compute Element (grid gateway/head node) • GUMS for grid authentication/authorization • DQ2 Site Services for ATLAS data management • Load balanced MySQL servers for GUMS and DQ2 • Dedicated domain: grid.latech.edu
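One common way to realize the load-balanced MySQL servers mentioned above is master-master (circular) replication; the snippet below is only a sketch of that approach, not necessarily the configuration that will be used here.

    # my.cnf on server A (server B mirrors this with server-id = 2 and offset = 2)
    [mysqld]
    server-id                = 1
    log-bin                  = mysql-bin
    auto_increment_increment = 2     # keep AUTO_INCREMENT values from colliding across the pair
    auto_increment_offset    = 1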
DOSAR Workshop VII New Hardware for T3gs • Our old compute nodes are limited by memory (2 GB), hard disks (ATA/100), and network connectivity (100 Mb/s) • We hope to purchase: • 2 x Dell PowerEdge 2950 (or similar PowerEdge servers) • Used by FermiGrid and many others • External storage • Still investigating (vs. DRBD) • Will store the virtual machines and the OSG installation, which will be exported to the LONI cluster
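Exporting the OSG installation to the LONI cluster would most simply be done over NFS; the slide does not specify the method, so the export path and network below are purely hypothetical.

    # /etc/exports on the storage host (hypothetical path and cluster subnet)
    /opt/osg-wn-client   10.1.0.0/255.255.255.0(ro,sync,no_root_squash)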
DOSAR Workshop VII LOOKING AHEAD: Storage Elements and PetaShare
DOSAR Workshop VII OSG Storage Elements • We have plenty of storage available through PetaShare but no way to access it on the grid • This is the most challenging component because it involves multiple groups (LONI, LSU/CCT, PetaShare, LaTech) • Our plan is to install BeStMan on the Painter I/O and PetaShare server, where the Lustre filesystem for PetaShare is assembled • BeStMan would then run in its gateway mode • Either that, or we develop an iRODS/SRB interface for BeStMan, which may happen later on anyway
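Once BeStMan is running in gateway mode, grid clients would reach the space with standard SRM commands, for example via the BeStMan client tools; the endpoint host and path below are placeholders, since the final deployment is still being worked out.

    # copy a local file to the (hypothetical) SRM endpoint in front of the PetaShare Lustre space
    srm-copy file:////tmp/test.root \
      "srm://painter-io.latech.edu:8443/srm/v2/server?SFN=/petashare/test.root"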
DOSAR Workshop VII PetaShare Storage • PetaShare is "a distributed data archival, analysis and visualization cyberinfrastructure for data-intensive collaborative research." – http://www.petashare.org • Tevfik Kosar (LSU/CCT) is the PI of the NSF grant • Provides 200 TB of disk and 400 TB of tape storage around the state of Louisiana • Employs iRODS (the Integrated Rule-Oriented Data System), the successor to SRB • An initial 10 TB at each OSG site for USATLAS • More details tomorrow in Tevfik Kosar's talk…
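Day-to-day access to PetaShare goes through the iRODS icommands; a quick example session might look like the following (the collection layout and zone are whatever the PetaShare administrators define, and the file name is a placeholder).

    iinit                        # cache iRODS credentials for the PetaShare zone
    iput results.tar.gz          # upload a file into the current collection
    ils -l                       # list the collection contents
    iget results.tar.gz /tmp/    # retrieve the file again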
DOSAR Workshop VII OSG Roadmap • Establish three LONI OSG CEs around the state
DOSAR Workshop VII Closing Remarks • In the last year, we’ve produced nearly 6 million DZero MC events on LONI_OSG1 at LSU. • Expanded our computing resources at LaTech with Painter and HPCI’s compute clusters. • In fact, with Painter alone we have over 18 times more CPUs than last year (28 -> 512). • We look forward to becoming a full Tier3 Grid Services site in early-mid summer.
DOSAR Workshop VII QUESTIONS / COMMENTS?