High-End Computing Technology: Where Is It Heading? Greg Astfalk; Woon Yung Chung, woon-yung_chung@hp.com
Prologue: this is not a talk about Hewlett-Packard's product offering(s). The context is HPC (high-performance computing), somewhat biased toward scientific computing, though it also applies to commercial computing.
Backdrop: end-users of HPC systems have needs and "wants" from those systems, and the computer industry delivers the systems; a gap exists between the two with respect to programming, processors, architectures, and interconnects/storage. In this talk we (weakly) quantify the gaps in these four areas.
End-users' programming "wants": end-users of HPC machines would ideally like to think and code sequentially, and to have a compiler and run-time system that produces portable and (nearly) optimal parallel code regardless of processor count or architecture type. Yes, I am being a bit facetious, but the idea remains true.
Parallelism methodologies: there exist five methodologies for achieving parallelism: automatic parallelization via compilers; explicit threading (pthreads); message passing (MPI); pragmas/directives (OpenMP); and explicitly parallel languages (UPC, et al.).
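To make the pragma/directive style concrete, here is a minimal OpenMP sketch; it is not from the talk, and the loop, problem size, and compile command are illustrative assumptions.

```c
/* Minimal illustration of the pragma/directive approach (OpenMP).
 * Compile with, e.g., cc -fopenmp sum.c (the flag varies by compiler). */
#include <stdio.h>
#include <omp.h>

#define N 1000000              /* arbitrary problem size, for illustration */

static double a[N];

int main(void)
{
    double sum = 0.0;

    /* Each thread initializes a slice of the array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* The reduction clause tells the compiler/runtime how to combine
     * the per-thread partial sums -- the directive carries the parallelism. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("threads available: %d, sum = %.0f\n",
           omp_get_max_threads(), sum);
    return 0;
}
```

The same loop written with explicit threading or message passing requires considerably more bookkeeping, which is part of the programming gap described in the backdrop.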
Parallel programming: parallel programming is a cerebral effort. If lots of neurons plus MPI constitutes "prime time," then parallel programming has arrived; there are no major technologies on the horizon to change this status quo.
Discontinuities: the ease of parallel programming has not progressed at the same rate at which parallel systems have become available. Performance gains require compiler optimization or PBO (profile-based optimization), and most parallelism requires hand coding; in the real world, many users don't use any compiler optimizations.
Parallel efficiency: be mindful that the bounds on parallel efficiency are, in general, far apart: 50% efficiency on 32 processors is good, 10% efficiency on O(100) processors is excellent, and >2% efficiency on O(1000) processors is heroic. A little communication can "knee over" the efficiency vs. processor-count curve.
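As a rough illustration of that knee, here is a toy model (my assumption, not the speaker's data) in which a fixed communication overhead of just 1% of the serial run time is added to every parallel run:

```c
/* Toy efficiency model (an assumption, not from the talk):
 *   T(p) = T1/p + Tc,   E(p) = T1 / (p * T(p)) = 1 / (1 + p*Tc/T1)
 * where Tc is a communication overhead that does not shrink with p. */
#include <stdio.h>

int main(void)
{
    const double t1 = 100.0;            /* serial run time (arbitrary units) */
    const double tc = 1.0;              /* communication overhead: 1% of T1  */
    const int procs[] = { 1, 32, 100, 1000 };

    for (int i = 0; i < 4; i++) {
        int p = procs[i];
        double tp  = t1 / p + tc;       /* parallel run time */
        double eff = t1 / (p * tp);     /* parallel efficiency */
        printf("p = %4d   efficiency = %5.1f%%\n", p, 100.0 * eff);
    }
    return 0;
}
```

Even that 1% overhead drags efficiency from ~99% on one processor to roughly 76% on 32, 50% on 100, and 9% on 1000, which is the shape of the curve being described.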
Apps with sufficient parallelism: few existing applications can utilize O(1000), or even O(100), processors with any reasonable degree of efficiency; to date this has generally required heroic effort. New algorithms (i.e., data and control decompositions), or nearly complete rewrites, are necessary. Such large-scale parallelism will have "arrived" when MSC/NASTRAN and Oracle exist on such systems and actually utilize the processors.
Latency-tolerant algorithms: latency tolerance will be an increasingly important theme for the future, and hardware will not solve this problem (more on this point later). Developing algorithms that have significant latency tolerance will be necessary; this means thinking "outside the box" about the algorithms, since simple modifications to existing algorithms generally won't suffice.
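One common latency-tolerance technique is overlapping communication with computation. The sketch below is a hypothetical halo exchange, not code from the talk; the stencil and problem size are assumptions. It posts nonblocking MPI messages, computes on data that does not depend on them, and only then waits.

```c
/* Overlap of communication and computation via nonblocking MPI.
 * Build with an MPI wrapper, e.g. mpicc overlap.c; run under mpirun. */
#include <mpi.h>
#include <stdio.h>

#define N 100000                 /* local points per rank, incl. 2 halo cells */

static double u[N], unew[N];

/* Work that needs only locally owned data -- done while messages are in flight. */
static void compute_interior(void)
{
    for (int i = 2; i < N - 2; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

/* Work that needs the halo values -- done only after the Waitall. */
static void compute_boundary(void)
{
    unew[1]     = 0.5 * (u[0]     + u[2]);
    unew[N - 2] = 0.5 * (u[N - 3] + u[N - 1]);
}

int main(int argc, char **argv)
{
    int rank, size, nreq = 0;
    MPI_Request req[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) u[i] = (double)rank;

    /* Post the halo receives and sends first ... */
    if (rank > 0) {
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
    }
    if (rank < size - 1) {
        MPI_Irecv(&u[N - 1], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(&u[N - 2], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
    }

    /* ... hide the network latency behind useful computation ... */
    compute_interior();

    /* ... then wait for the messages and finish only the halo-dependent edges. */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    compute_boundary();

    if (rank == 0) printf("step complete on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```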
Operating systems: development environments will move to NT, but the heavy lifting will remain with Unix. Four Unixes will survive (alphabetically): AIX 5L, HP-UX, Linux, and Solaris. Linux will be important at the lower end but will not significantly encroach on the high end.
End-users' processor/architecture "wants": all things being equal, high-end users would likely want a classic Cray vector supercomputer: no caches, multiple pipes to memory, single-word access, hardware support for gather/scatter, etc. It is true, however, that for some applications contemporary RISC processors perform better.
Processors: the "processor of choice" is now, and will be for some time to come, the RISC processor. RISC processors have caches; caches are good and caches are bad. If your code fits in cache, you aren't supercomputing!
RISC processor performance: a rule of thumb is that a RISC processor, any RISC processor, sustains on average about 10% of its peak performance (e.g., a processor with a 1 Gflop/s peak sustains roughly 100 Mflop/s). The spread around this figure is large: achieved performance varies with architecture, application, algorithm, coding, dataset size, and anything else you can think of.
Semiconductor processes: semiconductor processes change every 2-3 years. Assuming that "technology scaling" applies to subsequent generations, then per generation: frequency increases by ~40%, transistor density increases by ~100%, and energy per transition decreases by ~60%.
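Compounding those per-generation figures gives a feel for what a few process generations buy; the four-generation horizon below is an arbitrary choice for illustration.

```c
/* Back-of-the-envelope compounding of the quoted per-generation scaling:
 * ~+40% frequency, ~+100% density, ~-60% energy per transition. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (int g = 1; g <= 4; g++)        /* 4 generations ~ 8-12 years */
        printf("after %d generation(s): frequency x%.1f, density x%.0f, "
               "energy per transition x%.2f\n",
               g, pow(1.4, g), pow(2.0, g), pow(0.4, g));
    return 0;
}
```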
What to do with the gates: it is not a simple question what the best use of the gates is: larger caches, multiple cores, specialized functional units, etc. The impact of soft errors with decreasing design-rule size will be an important topic: what happens if an alpha particle flips a bit in a register?
Processor futures: you can expect, for the short term, Moore's-law-like gains in processors' peak performance, i.e., a doubling of "performance" every 18-24 months; this does not necessarily apply to application performance. Moore's law will not last forever: 4-5 more turns (maybe?).
Technology disruptions: RISC crossed over CISC in 1996; Itanium will cross over RISC in 2004. [Chart: customer spending ($M), 1995-2003, by processor type (CISC, RISC, IA-32, IA-64); source: IDC, February 2000.]
Present high-end architectures: today's high-end architecture is either an SMP, a ccNUMA, a cluster of SMP nodes, a cluster of ccNUMA nodes, or a Japanese vector system. All of these architectures work; efficiency varies with application type.
Architectural issues: of the choices available the SMP is preferred; however, SMP processor count is limited and the cost of scalability is prohibitive. ccNUMA addresses these limitations but induces its own: disparate latencies; better, but still limited, scalability; RAS limitations. Clusters too have pros and cons: huge latencies, low cost, etc.
Physics: limitations imposed by physics have led us to architectures with a deep memory hierarchy. The algorithmist and programmer must deal with, and exploit, the hierarchy to achieve good performance; this is part of the cerebral effort of parallel programming we mentioned earlier.
Memory hierarchy: typical latencies for today's technology.
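One way to read such latency figures is through the standard average-memory-access-time calculation. The latencies and hit rates below are my own order-of-magnitude assumptions for a ca. 2000 RISC system, not the figures from the slide.

```c
/* Average memory access time (AMAT) for a two-level cache hierarchy:
 *   AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * memory)
 * All numbers are illustrative assumptions, in processor cycles. */
#include <stdio.h>

int main(void)
{
    const double l1 = 2.0, l2 = 12.0, mem = 200.0;  /* assumed latencies  */
    const double h1 = 0.95, h2 = 0.90;              /* assumed hit rates  */

    double amat = l1 + (1.0 - h1) * (l2 + (1.0 - h2) * mem);

    printf("average access time ~ %.1f cycles; a miss all the way to "
           "memory costs ~%.0fx an L1 hit\n", amat, mem / l1);
    return 0;
}
```

Even with high hit rates, the rare trips to memory dominate, which is why the hierarchy must be exploited rather than ignored.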
Balanced system ratios: an "ideal" high-end system should be balanced with respect to its performance metrics. For each peak flop/second: 0.5-1 byte of physical memory, 10-100 bytes of disk capacity, 4-16 bytes/sec of cache bandwidth, 1-3 bytes/sec of memory bandwidth, 0.1-1 bit/sec of interconnect bandwidth, and 0.02-0.2 byte/sec of disk bandwidth.
Balanced system: applying the balanced-system ratios to an unnamed contemporary 16-processor SMP.
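As an illustration of that exercise, the sketch below applies the ratios from the previous slide to a hypothetical 16-processor SMP; the assumed 1.5 Gflop/s peak per processor is mine, not a figure for the unnamed system.

```c
/* Worked application of the balance ratios to a hypothetical 16-way SMP.
 * The per-processor peak is an assumed figure for illustration only. */
#include <stdio.h>

int main(void)
{
    const double peak = 16 * 1.5e9;     /* assumed aggregate peak, flop/s */

    printf("aggregate peak:         %.0f Gflop/s\n", peak / 1e9);
    printf("physical memory:        %.0f - %.0f GB\n",
           0.5 * peak / 1e9, 1.0 * peak / 1e9);
    printf("disk capacity:          %.0f - %.0f GB\n",
           10 * peak / 1e9, 100 * peak / 1e9);
    printf("cache bandwidth:        %.0f - %.0f GB/s\n",
           4 * peak / 1e9, 16 * peak / 1e9);
    printf("memory bandwidth:       %.0f - %.0f GB/s\n",
           1 * peak / 1e9, 3 * peak / 1e9);
    printf("interconnect bandwidth: %.1f - %.0f Gbit/s\n",
           0.1 * peak / 1e9, 1.0 * peak / 1e9);
    printf("disk bandwidth:         %.1f - %.1f GB/s\n",
           0.02 * peak / 1e9, 0.2 * peak / 1e9);
    return 0;
}
```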
Storage: data volumes are growing at an extremely rapid pace; disk capacity sold doubled from 1997 to 1998, and storage is an increasingly large percentage of the total server sale. Disk technology is advancing too slowly: per generation of 1-1.5 years, access time decreases 10%, spindle bandwidth increases 30%, and capacity increases 50%.
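Compounding those per-generation disk figures shows one sense in which this is "too slow": capacity outruns spindle bandwidth, so the time to read an entire disk stretches every generation. The five-generation horizon below is arbitrary.

```c
/* Compounding the quoted per-generation disk trends: capacity ~1.5x,
 * spindle bandwidth ~1.3x, so full-disk read time grows ~15% per generation. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (int g = 1; g <= 5; g++)
        printf("after %d generation(s): capacity x%.1f, bandwidth x%.1f, "
               "full-disk read time x%.2f\n",
               g, pow(1.5, g), pow(1.3, g), pow(1.5 / 1.3, g));
    return 0;
}
```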
Networks: only the standards will be widely deployed: Gigabit Ethernet, gigabyte Ethernet, Fibre Channel (2x and 10x later), SIO, ATM, and DWDM backbones. The "last mile" problem remains with us, and inter-system interconnect for clustering will not keep pace with the demands (for latency and bandwidth).
Vendors' constraints: rule #1 is to be profitable, to return value to the shareholders. You don't control the market size; you can only spend ~10% of your revenue on R&D; don't fab your own silicon (hopefully); you must be more than just a "technical computing" company. To not do this is to fail to meet rule #1 (see above).
Market sizes: according to the industry analysts, the technical market is, depending on where you draw the cut-line, $4-5 billion annually; the bulk of the market is small-ish systems (data from Forest Baskett at SGI).
A perspective: commercial computing is not an enemy. Without the commercial market's revenue, our ability to build HPC-like systems would be limited, and the commercial market benefits from the technology innovation in the HPC market. Is performance "left on the table" in designing a system to serve both the commercial and technical markets? Yes.
Why? Lack of a cold war; the performance of HPC systems has been marginalized. In the mid-'70s, how many applications ran faster on a VAX 11/780 than on the Cray-1? None. How many applications today run faster on a Pentium than on the Cray T90? Some. Current demand for HPC systems is elastic.
Future prognostication: computing in the future will be all about data and moving data. The growth in data volumes is incredible: richer media types (e.g., video) mean more data, distributed collaborations imply moving data, e-whatever requires large, rapid data movement, and more flops mean more data.
Data movement: the scope of data movement encompasses register to functional unit, cache to register, cache to cache, memory to cache, disk to memory, tape to disk, system to system, PDA to client to server, and continent to continent. All of these are going to be important.
Epilogue: for HPC in the future it is going to be RISC processors; SMP and ccNUMA architectures, with SMP processor counts relatively constant; technology trends that are reasonably predictable; MPI, pthreads, and OpenMP for parallelism; latency management will be crucial; and it will be all about data.
Epilogue (cont'd): for the computer industry in the future, the trend is toward "e-everything": e-commerce, apps-on-tap, brokered services, remote data, virtual data centers, visualization; NT for development; vectors are dying. For HPC vendors in the future: there will be fewer of them.
Conclusion: HPC users will need to yield more to what the industry can provide, rather than vice versa; vendors' rule #1 is a cruel master.