Center for High-End Computing Systems
Srinidhi Varadarajan, Director
Motivation
• We need a paradigm shift to make supercomputers more usable for mainstream computational scientists.
• A similar shift occurred in computing in the 1970s, when the advent of inexpensive minicomputers in academia spurred a large body of computing research.
• Results from this research flowed back to industry, creating a growth cycle that led to computing becoming a commodity.
• This requires a comprehensive "rethink" of programming languages, runtime systems, operating systems, scheduling, reliability, and operations and management.
• Moving to petascale-class systems significantly complicates this challenge.
• We need a computing environment that can efficiently and usably span the scales from department-sized systems to national resources.
Perspectives
• Most of the "big iron" today is concentrated in DoD, DOE, NASA, and NSF supercomputing centers.
• Their mandate is the national interest: their goal is to provide stable production cycles to computational scientists.
• The future of supercomputing, meaning research into supercomputing itself, is necessarily different from providing stable production cycles.
Vision
• Our goal is to build a world-class research group focused on high-end systems research.
• This involves research in architectures, networks, power optimization, operating systems, compilers and programming models, algorithms, scheduling, and reliability.
• Our faculty hiring in systems is targeted to cover the breadth of these research areas.
• The center is involved in research and development work, including design and prototyping of systems and development of production-quality systems software.
• The goal is to design and build the software infrastructure that makes HPC systems usable by the broad computational science and engineering community.
• Provide support to high-performance computing users on campus. This involves the center in supporting actual applications, which are then profiled to gauge the performance impact of its research.
Structure
• CHECS was set up in the College of Engineering in Sep. 2005.
• Funded by the College of Engineering.
• The center consists of several core research labs with affiliated faculty.
• Affiliated faculty within and outside of CS provide domain expertise.
• Complemented by an industry affiliates program.
Research Labs
• Computing Systems Research Lab (CSRL)
• Distributed Systems and Storage Lab (DSSL)
• Laboratory for Advanced Scientific Computing and Applications (LASCA)
• Parallel Emerging Architectures Research Lab (PEARL)
• Scalable Performance Laboratory (SCAPE)
• Systems, Networking and Renaissance Grokking Lab (SyNeRGY)
Usability
• Flows: thread-based distributed shared memory programming model
• MPI On-Ramp: removing the difficulty of mapping communication design abstractions to MPI code through visual tools and code generation (see the sketch below)
• Operation stacking framework: algorithms and tools for improving the performance of large-scale ensemble computations
• ReSHAPE: improving utilization and throughput on clusters via dynamically resizable parallel computations
• Code Generation on Steroids: enhancing the functionality of automatically generated code through Generative Aspect Oriented Programming
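To make the MPI On-Ramp problem concrete, the sketch below (our illustration, not actual tool output) shows the kind of point-to-point MPI boilerplate that a visual communication design must ultimately be mapped onto; automating this mapping is the tool's job.

    /* A minimal sketch, assuming a two-rank design where rank 0 sends
     * a buffer to rank 1; this is illustrative boilerplate, not code
     * generated by MPI On-Ramp. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double payload[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {
            /* a visual "channel" abstraction maps to a tagged send */
            MPI_Send(payload, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(payload, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %g ... %g\n", payload[0], payload[3]);
        }

        MPI_Finalize();
        return 0;
    }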
Operating Systems
• FlexiCache: improving OS file system performance by developing an interface to support a repertoire of pluggable cache replacement policies in the kernel (see the sketch below)
• Cadus: co-scheduling of real-time threads and garbage collection
• Practical fair-sharing scheduling: finding and automatically adopting policies for stock kernels
• 'MAGNETizing' SystemTap: enabling dynamic, on-the-fly probing and export of kernel information
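The sketch below is a hypothetical illustration, not the actual FlexiCache API: it shows one common shape for a pluggable replacement-policy interface, where each policy supplies an ops table and the cache core invokes the hooks on each access and when it needs an eviction victim.

    #include <stdio.h>

    struct cache_page { int id; };           /* stand-in for a real page */

    /* ops table a replacement policy registers with the cache core;
     * names are illustrative */
    struct replacement_policy_ops {
        const char *name;
        void (*on_access)(struct cache_page *pg);     /* page referenced */
        void (*on_insert)(struct cache_page *pg);     /* page cached */
        struct cache_page *(*pick_victim)(void);      /* page to evict */
    };

    /* a do-nothing example policy; a real one (LRU, ARC, ...) would
     * keep its own bookkeeping behind these hooks */
    static void noop_access(struct cache_page *pg) { (void)pg; }
    static void noop_insert(struct cache_page *pg) { (void)pg; }
    static struct cache_page *noop_victim(void) { return NULL; }

    static const struct replacement_policy_ops example_policy = {
        .name = "noop", .on_access = noop_access,
        .on_insert = noop_insert, .pick_victim = noop_victim,
    };

    int main(void)
    {
        /* the cache core would select a registered policy at run time */
        printf("active policy: %s\n", example_policy.name);
        return 0;
    }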
Power Aware Computing
• High-performance, power-aware computing: frameworks for power, energy, and thermal measurement, analysis, and optimization (see the sketch below)
• Frameworks: PowerPack, MISER
• Supercomputing in small spaces: low-power and power-aware supercomputing
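The sketch below is illustrative only, not a PowerPack or MISER API: the core accounting any such framework needs is turning sampled node power into energy, here via trapezoidal integration of samples taken dt seconds apart, E = sum over i of (p[i] + p[i+1]) / 2 * dt joules.

    #include <stdio.h>

    /* integrate power samples (watts), spaced dt seconds apart,
     * into energy (joules) */
    double energy_joules(const double *p, int n, double dt)
    {
        double e = 0.0;
        for (int i = 0; i + 1 < n; i++)
            e += 0.5 * (p[i] + p[i + 1]) * dt;   /* trapezoid per interval */
        return e;
    }

    int main(void)
    {
        /* hypothetical 1 Hz power samples from a node under load */
        double samples[] = {180.0, 220.0, 240.0, 235.0, 200.0};
        printf("energy over 4 s: %.1f J\n", energy_joules(samples, 5, 1.0));
        return 0;
    }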
Architectures
• Programming Layered Multiprocessors: a unified programming approach for layered shared-memory multiprocessors, with multithreaded or multicore execution components (see the sketch below)
• MELISSES: continuous hardware monitors for power-performance adaptation schemes on layered parallel architectures
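As one way to picture the "layered" structure (not the project's actual programming model), the sketch below uses nested OpenMP parallelism: an outer team per chip and an inner team per core or hardware thread; the team sizes are illustrative.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);                      /* enable the inner layer */

        #pragma omp parallel num_threads(2)     /* outer layer: per chip */
        {
            int chip = omp_get_thread_num();
            #pragma omp parallel num_threads(4) /* inner layer: per core/SMT */
            {
                int ctx = omp_get_thread_num();
                printf("chip %d, context %d\n", chip, ctx);
            }
        }
        return 0;
    }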
Runtime Systems
• Top: a framework for flexible, high-level instrumentation of binaries
• DyniX: a framework for combined static/dynamic analysis of Java code
• déjà vu: transparent checkpointing and recovery for parallel applications (see the sketch below)
• Weaves: runtime system for adaptive compositional codes
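For contrast with déjà vu's transparent approach, the sketch below shows minimal application-level checkpointing, where the program itself saves and restores its state; transparent checkpointing achieves the equivalent without any such code in the application. The file name and state layout are illustrative.

    #include <stdio.h>

    struct state { long iteration; double residual; };

    static void checkpoint(const struct state *s)
    {
        FILE *f = fopen("app.ckpt", "wb");
        if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
    }

    static int restore(struct state *s)
    {
        FILE *f = fopen("app.ckpt", "rb");
        if (!f) return 0;                       /* no prior checkpoint */
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(void)
    {
        struct state s = { 0, 1.0 };
        if (restore(&s))
            printf("resuming at iteration %ld\n", s.iteration);

        for (; s.iteration < 1000000; s.iteration++) {
            s.residual *= 0.999999;             /* stand-in for real work */
            if (s.iteration % 100000 == 0)
                checkpoint(&s);                 /* periodic checkpoint */
        }
        printf("done, residual %g\n", s.residual);
        return 0;
    }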
Networks
• High-performance networking: architecture, protocols, and performance (modeling, evaluation, auto-tuning) in system-area and wide-area networks
• Open Network Emulator: integrated environment for simulation and direct-code execution of network protocols
Numerical Methods
• Surrogate approximation: mathematical construction of functional approximations using sparse data in high dimensions, with ultimate application to multidisciplinary design optimization (MDO); see the sketch below
• Robust design optimization: solving optimization problems with stochastic variables and constraints
• pDIRECT: massively parallel direct search algorithms for global optimization
• Mathematical software for terascale machines: scalable algorithms for polynomial systems of equations, global optimization, MDO, and interpolatory approximation
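The surrogate idea is easiest to see in one dimension. The sketch below uses inverse-distance (Shepard) weighting, a deliberately simple stand-in: the actual work targets sparse data in high dimensions with more sophisticated approximations.

    #include <math.h>
    #include <stdio.h>

    /* predict f(x) from n scattered samples (xs[i], fs[i]) using
     * inverse-distance-squared weights */
    double shepard(double x, const double *xs, const double *fs, int n)
    {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            double d = fabs(x - xs[i]);
            if (d < 1e-12) return fs[i];     /* exact at a sample point */
            double w = 1.0 / (d * d);
            num += w * fs[i];
            den += w;
        }
        return num / den;
    }

    int main(void)
    {
        /* sparse samples of an expensive model, here f(x) = x^2 */
        double xs[] = {0.0, 1.0, 2.0, 3.0};
        double fs[] = {0.0, 1.0, 4.0, 9.0};
        printf("surrogate at 1.5: %.3f\n", shepard(1.5, xs, fs, 4));
        return 0;
    }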
Applications
• mpiBLAST: high-performance bioinformatics
• Stochastic modeling: parameter estimation for stochastic cell cycle models
• Remote sensing: parallel algorithms for remote sensing applications
• WBCSim: a problem solving environment for wood-based composites manufacturing processes
Facilities
• System X: 2200-processor PowerPC cluster with InfiniBand interconnect
• Anantham: 400-processor Opteron cluster with Myrinet interconnect
• Several 8-32 processor research clusters
• 12-processor SGI Altix shared memory system
• 8-processor AMD Opteron shared memory system
• 16-core AMD Opteron shared memory system
• 16-node PlayStation 3 cluster
• Currently building a 2400-core x86 cluster for research in power-aware computing, programming models, and fault tolerance
CHECS Outreach
• Training
  • Summer FDI on parallel computation: 2005, 2006, 2007; average attendance of 15-20 faculty plus graduate students
  • Offered 6-hour short courses on MPI and OpenMP parallel programming to graduate students each semester; average attendance of 15
• Anantham: approximately 500,000 jobs were run in 2007 alone, the vast majority by CoE users, including students working with Andrew Duggleby (ME), Walt O'Brien (ME), and David Cox (ChemE)
• Visitors: Reed, Fowler, Munoz, Dongarra
• Chair of the System X allocation committee
• Developing a senior-level course in Computational Science & Engineering, to be cross-listed with ESM
Industrial Impact
• Developed an industrial affiliates program modeled on MPRG.
• In negotiations with Merrill Lynch to bring them on as the first affiliate.
• Two venture-funded startups originated from CHECS.
• Created the Green500 list, which ranks the most energy-efficient supercomputers.
Recent Achievements
• 5 NSF CAREER awards
• 2 DOE CAREER awards
• 3 IBM Faculty Awards
• Dean's Award for Excellence in Research
• 2 VT Faculty Fellows
• Best Paper Award at PPoPP
• Won the Storage Challenge at Supercomputing 2007
• 2 faculty on the HPCWire list of People to Watch in Supercomputing
• 1 faculty member on the MIT TR100 list of young researchers