A Teraflop Linux Cluster for Lattice Gauge Simulations in India. N.D. Hari Dass, Institute of Mathematical Sciences, Chennai.
Indian Lattice Community • IMSc (Chennai): Sharatchandra, Anishetty and Hari Dass. • IISc (Bangalore): Apoorva Patel. • TIFR (Mumbai): Rajiv Gavai, Sourendu Gupta. • SINP (Kolkata): Asit De, Harindranath. • HRI (Allahabad): S. Naik. • SN Bose Centre (Kolkata): Manu Mathur. • The community is small but very active and well recognised. • So far its research has been mostly theoretical, or limited to small-scale simulations, except for international collaborations.
At the International Lattice Symposium held in Bangalore in 2000, the Indian Lattice Community decided to change this situation: • Form the Indian Lattice Gauge Theory Initiative (ILGTI). • Develop suitable infrastructure at different institutions for collective use. • Launch new collaborations that would make the best use of such infrastructure. • At IMSc we have finished integrating a 288-CPU Xeon Linux cluster. • At TIFR a Cray X1 with 16 CPUs has been acquired. • At SINP plans are under way to acquire substantial computing resources.
Compute Nodes and Interconnect • After much deliberation it was decided that the compute nodes would be dual Intel Xeon @ 2.4 GHz. • The motherboard and 1U rack-mountable chassis were developed by Supermicro. • For the interconnect, the choice was the SCI technology developed by Dolphinics of Norway.
Interconnect Technologies [Figure: design space for interconnect technologies, spanning application areas from WAN and LAN down to I/O, memory, cache and processor. Networks (Ethernet, ATM, FibreChannel, SCSI), cluster interconnects (Dolphin SCI, Infiniband, Myrinet, cLan) and busses (PCI, HyperTransport, RapidIO, proprietary busses) are placed according to their distance, bandwidth and latency requirements.]
PCI-SCI Adapter Card • SCI adapters (64 bit, 66 MHz). • PCI/SCI adapter (D336). • Single-slot card with 3 LCs. • EZ-Dock plug-up module. • Supports 3 SCI ring connections (3 dimensions). • Used for WulfKit 3D clusters. • WulfKit product code D236. [Block diagram: PCI bus, PSB, and three LCs driving the SCI links.]
[Figure: theoretical scalability (GBytes/s) with a 66 MHz/64-bit PCI bus. Courtesy of Scali NA.]
System Interconnects • High-performance interconnect: torus topology, IEEE/ANSI std. 1596 SCI, 667 MBytes/s per segment per ring, shared address space. • Maintenance and LAN interconnect: 100 Mbit/s Ethernet, with channel-bonding option. (Courtesy of Scali NA)
Scali MPI Fault Tolerance • A 2D or 3D torus topology offers more routing options. • XYZ routing algorithm: if node 33 fails, the nodes on node 33's ringlets become unavailable, and the cluster is fractured under the current routing setting. [Figure: 4x4 torus of nodes 11-44 illustrating the fracture. Courtesy of Scali NA]
SCAMPI Fault Tolerance (contd.) • Scali's advanced routing algorithm, from the "Turn Model" family of routing algorithms. • All nodes but the failed one can be utilised as one big partition (see the sketch below). [Figure: the same 4x4 torus rerouted around the failed node. Courtesy of Scali NA]
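To illustrate why strict dimension-ordered (XY/XYZ) routing fractures a torus when a node dies, here is a minimal sketch that counts how many source-destination pairs between healthy nodes become unreachable under XY routing on a small 2D torus. This is not Scali's implementation; the 4x4 size, coordinate scheme and failed node are assumptions made purely for illustration.

```c
/* Sketch of dimension-ordered (XY) routing on a 2D torus, showing how a
 * single failed node blocks many routes. Illustrative only, not Scali's
 * routing code; torus size and failed node are arbitrary assumptions. */
#include <stdio.h>
#include <stdbool.h>

#define DIM 4                       /* 4x4 torus, 16 nodes */

typedef struct { int x, y; } Node;

/* One hop towards 'to' along a ring of length DIM, taking the shorter way. */
static int ring_step(int from, int to)
{
    if (from == to) return from;
    int fwd = (to - from + DIM) % DIM;
    return (fwd <= DIM - fwd) ? (from + 1) % DIM : (from - 1 + DIM) % DIM;
}

/* True if the XY route src -> dst passes through the failed node. */
static bool route_blocked(Node src, Node dst, Node failed)
{
    Node cur = src;
    while (cur.x != dst.x) {                      /* X phase first */
        cur.x = ring_step(cur.x, dst.x);
        if (cur.x == failed.x && cur.y == failed.y) return true;
    }
    while (cur.y != dst.y) {                      /* then Y phase */
        cur.y = ring_step(cur.y, dst.y);
        if (cur.x == failed.x && cur.y == failed.y) return true;
    }
    return false;
}

int main(void)
{
    Node failed = {2, 2};                         /* the dead node */
    int blocked = 0, total = 0;

    for (int sx = 0; sx < DIM; sx++) for (int sy = 0; sy < DIM; sy++)
    for (int dx = 0; dx < DIM; dx++) for (int dy = 0; dy < DIM; dy++) {
        if ((sx == dx && sy == dy) ||
            (sx == failed.x && sy == failed.y) ||
            (dx == failed.x && dy == failed.y)) continue;
        Node s = {sx, sy}, d = {dx, dy};
        total++;
        if (route_blocked(s, d, failed)) blocked++;
    }
    printf("%d of %d routes between healthy nodes are blocked\n",
           blocked, total);
    return 0;
}
```

A turn-model-style algorithm, by allowing a restricted set of additional turns, can route around the failed node so that the remaining nodes stay usable as a single partition.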
It was decided to build the cluster in stages. • A 9-node pilot cluster was built as the first stage. • Actual QCD codes as well as extensive benchmarks were run on it.
Kabru Configuration • Number of nodes: 144. • Nodes: dual Intel Xeon @ 2.4 GHz. • Motherboard: Supermicro X5DPA-GG. • Chipset: E7501, 533 MHz FSB. • Memory: 266 MHz ECC DDRAM; 2 GB/node on 120 nodes and 4 GB/node on 24 nodes. • Interconnect: Dolphin 3D SCI. • OS: Red Hat Linux v8.0. • MPI: Scali MPI.
Physical Characteristics • 1U rack-mountable servers. • Cluster housed in six 42U racks. • Each rack holds 24 nodes. • Nodes connected in a 6x6x4 3D torus topology. • The entire system fits in a 400 sq. ft. hall.
Communication Characteristics • With the PCI slot at 33 MHz, the highest sustained bandwidth between nodes is 165 MB/s, at a packet size of 16 MB. • Between processors on the same node it is 864 MB/s, at a packet size of 98 KB. • With the PCI slot at 66 MHz these figures double. • The lowest latency between nodes is 3.8 microseconds; the latency between processors on the same node is 0.7 microseconds.
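Numbers of this kind are typically obtained with a simple MPI ping-pong test between two ranks. The sketch below is a minimal version of such a test, not the benchmark actually run on Kabru; the message sizes and repetition count are arbitrary illustrative choices.

```c
/* Minimal MPI ping-pong sketch for latency/bandwidth between two ranks.
 * Not the benchmark used on Kabru; sizes and repeat counts are
 * illustrative assumptions. Run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    for (long bytes = 1; bytes <= (1L << 24); bytes *= 16) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {           /* send, then wait for the echo */
                MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {    /* echo the message back */
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%10ld bytes  %8.2f us  %8.2f MB/s\n",
                   bytes, t * 1e6, bytes / t / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```

Small messages expose the latency (the one-way time at 1 byte), while the largest messages expose the sustained bandwidth.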
HPL Benchmarks • The best performance, with GOTO BLAS and dgemm from Intel, was 959 GFlops on all 144 nodes (problem size 183000). • Theoretical peak: 1382.4 GFlops. • Efficiency: about 70%. • With 80 nodes the best performance was 537 GFlops. • Between 80 and 144 nodes the scaling is nearly 98.5%.
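As a check, the quoted peak and efficiency follow directly from the hardware figures above, assuming the standard 2 double-precision flops per cycle for these Xeons:

\[
R_{\text{peak}} = 144\ \text{nodes} \times 2\ \text{CPUs} \times 2.4\ \text{GHz} \times 2\ \text{flops/cycle} = 1382.4\ \text{GFlops},
\qquad
\frac{R_{\max}}{R_{\text{peak}}} = \frac{959}{1382.4} \approx 69.4\%.
\]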
MILC Benchmarks • Numerous QCD codes, with and without dynamical quarks, have been run. • We independently developed SSE2 assembly code for the double-precision implementation of the MILC codes. • For the ks_imp_dyn1 codes we obtained 70% scaling going from 2 to 128 nodes with 1 process per node, and 74% going from 1 to 64 nodes with 2 processes per node. • These runs used 32x32x32x48 lattices in single precision.
MILC Benchmarks (contd.) • For 64^4 lattices in single precision the scaling was close to 86%. • For double-precision runs on 32^4 lattices the scaling was close to 80% as the number of nodes was increased from 4 to 64. • For pure-gauge simulations in double precision on 32^4 lattices the scaling was 78.5% going from 2 to 128 nodes.
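The scaling percentages above are presumably strong-scaling parallel efficiencies: for a fixed lattice, the efficiency in going from \(N_1\) to \(N_2\) nodes is

\[
E(N_1 \to N_2) = \frac{N_1\,T(N_1)}{N_2\,T(N_2)},
\]

where \(T(N)\) is the time to solution on \(N\) nodes. On this reading, 70% from 2 to 128 nodes corresponds to a speed-up of about \(0.70 \times 64 \approx 45\).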
Physics Planned on Kabru • Very accurate simulations in pure gauge theory (with Pushan Majumdar) using the Luscher-Weisz multihit algorithm. • A novel parallel code for both Wilson-loop and Polyakov-loop correlators has been developed, and preliminary runs have been carried out on lattices up to 32^4. • 64^4 simulations in double precision require about 200 GB of memory.
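As a rough consistency check (my estimate, not a figure from the talk): a double-precision SU(3) gauge configuration stores 4 link matrices of 18 real numbers per site, so on a 64^4 lattice

\[
4 \times 64^4 \times 18 \times 8\ \text{bytes} \approx 9.7\ \text{GB}
\]

per copy of the gauge field; the 200 GB quoted above would then cover the several field copies and the working storage that the multilevel measurement needs across the nodes.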
Physics on Kabru (contd.) • Using the same multihit algorithm, we have a long-term plan to carry out very accurate measurements of Wilson loops in various representations, as well as of their correlation functions, to gain a better understanding of confinement. • We also plan to study string breaking in the presence of dynamical quarks. • We propose to use scalar quarks to bypass the problems of dynamical fermions. • With Sourendu Gupta (TIFR) we are carrying out preliminary simulations of the sound velocity in finite-temperature QCD.