1. Supercomputers: New Perspective. Prof. Sin-Min Lee
Department of Computer Science
22. Multiple Processor Organization Single instruction, single data stream - SISD
Single instruction, multiple data stream - SIMD
Multiple instruction, single data stream - MISD
Multiple instruction, multiple data stream- MIMD
23. Single Instruction, Single Data Stream - SISD Single processor
Single instruction stream
Data stored in single memory
Uni-processor
24. Single Instruction, Multiple Data Stream - SIMD Single machine instruction
Controls simultaneous execution
Number of processing elements
Lockstep basis
Each processing element has associated data memory
Each instruction executed on different set of data by different processors
Vector and array processors
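To make the lockstep idea concrete, here is a minimal sketch in C using x86 SSE intrinsics (an assumption; the slides do not name any particular instruction set). A single add instruction operates on four floats at once, which is the essence of SIMD.

/* SIMD sketch: one instruction, four data elements (assumes an x86 CPU with SSE). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction performs FOUR additions in lockstep */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);       /* prints 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}

Vector and array processors apply the same principle to much longer vectors, with each processing element holding its own slice of the data.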
25. Multiple Instruction, Single Data Stream - MISD Sequence of data
Transmitted to set of processors
Each processor executes different instruction sequence
Never been implemented
26. Multiple Instruction, Multiple Data Stream- MIMD Set of processors
Simultaneously execute different instruction sequences
Different sets of data
SMPs, clusters and NUMA systems
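A minimal MIMD sketch in C, assuming POSIX threads (not named in the slides): two threads stand in for two processors, each executing a different instruction sequence on a different data set at the same time.

/* MIMD sketch: different instruction streams, different data (assumes POSIX threads; compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

static void *sum_task(void *arg) {            /* instruction stream 1 */
    int *data = (int *)arg;
    int total = 0;
    for (int i = 0; i < 4; i++) total += data[i];
    printf("sum = %d\n", total);
    return NULL;
}

static void *max_task(void *arg) {            /* instruction stream 2 */
    double *data = (double *)arg;
    double best = data[0];
    for (int i = 1; i < 4; i++) if (data[i] > best) best = data[i];
    printf("max = %.2f\n", best);
    return NULL;
}

int main(void) {
    int ints[4] = {1, 2, 3, 4};
    double reals[4] = {2.5, 9.5, 4.0, 7.25};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, ints);    /* different code,  */
    pthread_create(&t2, NULL, max_task, reals);   /* different data   */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

SMPs, clusters and NUMA systems differ mainly in how such processors communicate, which the following slides classify.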
28. Taxonomy of Parallel Processor Architectures
30. MIMD - Overview General purpose processors
Each can process all instructions necessary
Further classified by method of processor communication
37. Tightly Coupled - SMP Processors share memory
Communicate via that shared memory
Symmetric Multiprocessor (SMP)
Share single memory or pool
Shared bus to access memory
Memory access time to given area of memory is approximately the same for each processor
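A minimal sketch of the shared-memory style of communication, assuming OpenMP (not mentioned in the slides): every thread reads and writes the same array in the single shared memory, so no explicit messages are needed.

/* SMP sketch: threads cooperate through shared memory (assumes OpenMP; compile with -fopenmp). */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Each processor fills part of the SAME array in shared memory. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* Communication is implicit: the reduction combines each thread's partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);   /* N*(N-1) = 999999000000 */
    return 0;
}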
38. Tightly Coupled - NUMA Nonuniform memory access
Access times to different regions of memory may differ
39. Loosely Coupled - Clusters Collection of independent uniprocessors or SMPs
Interconnected to form a cluster
Communication via fixed path or network connections
40. Parallel Organizations - SISD
41. Parallel Organizations - SIMD
42. Parallel Organizations - MIMD Shared Memory
43. Parallel Organizations - MIMD Distributed Memory
46. Symmetric Multiprocessors A symmetric multiprocessor (SMP) has the following characteristics:
Two or more similar processors of comparable capacity
Processors share same memory and I/O
Processors are connected by a bus or other internal connection
Memory access time is approximately the same for each processor
All processors share access to I/O
Either through same channels or different channels giving paths to same devices
All processors can perform the same functions (hence symmetric)
System controlled by integrated operating system
providing interaction between processors
Interaction at job, task, file and data element levels
47. SMP Advantages Performance
If some work can be done in parallel
Availability
Since all processors can perform the same functions, failure of a single processor does not halt the system
Incremental growth
User can enhance performance by adding additional processors
Scaling
Vendors can offer range of products based on number of processors
48. Block Diagram of Tightly Coupled Multiprocessor
49. History Cray Research founded in 1972.
Cray Computer founded in 1988.
1976 First product – Cray-1 (240,000,000 OpS). Seymour Cray personally invented vector register technology.
1985 Cray-2 (1,200,000,000 OpS, a 5-fold increase over the Cray-1). Seymour is credited with immersion-cooling technology.
The Cray-3 substituted revolutionary new gallium arsenide integrated circuits for the traditional silicon ones.
1996 Cray was bought by SGI
In March 2000 the Cray Research name and business was sold by SGI to Tera Inc.
50. Visual Tour
51. Market Segment (from top500)
52. Supercomputer Architecture
53. Current Cray Products The Cray X1 is Cray’s only product with a unique vector CPU
Competitors are: Fujitsu, NEC, HP
Cray XT3 and XD1 use AMD Opteron CPUs (series 100 and series 200 respectively)
You can find full product specifications as well as additional information on current systems at www.cray.com
54. Performance Measurements Performance is measured in teraflops
Linpack is a standard benchmark
Performance is also measured in memory bandwidth & latency, disk performance, interconnects, internal IO, reliability, and others
For example:
My home system, Athlon 750, gives about 34 megaflops (34*10^6 flops)
Current mid-range supercomputers give about 40 teraflops (40*10^12 flops), which is 1,176,470 times faster
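As a toy illustration of how a flops figure is obtained (this is not the real Linpack benchmark, which times the solution of a dense system of linear equations), the sketch below simply counts floating-point operations and divides by elapsed time; the array size and loop are arbitrary choices.

/* Toy flops measurement: operations performed / elapsed seconds (assumes a POSIX system for clock_gettime). */
#include <stdio.h>
#include <time.h>

#define N 5000000

int main(void) {
    static double x[N], y[N];
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        y[i] = y[i] + 3.0 * x[i];          /* 2 floating-point operations per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double flops = 2.0 * N / secs;
    printf("%.2e flops (%.1f megaflops), check y[0] = %.1f\n",
           flops, flops / 1e6, y[0]);      /* printing y keeps the loop from being optimized away */
    return 0;
}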
55. Scalable Architecture in XT-3
56. Is Cray a good deal? Typical cost is approximately $30 million and above
Useful lifetime – 6 years
Most customers use supercomputers at 90% - 98% load
Clustered supercomputers and machines built around common desktop components (AMD/Intel CPUs, memory chips, motherboards, etc.) are significantly cheaper
57. Future Cray’s “Red Storm” system at Sandia National Laboratories runs the Linux OS
Current Cost $90 million
Uses 11,648 AMD Opteron CPUs
Current operational speed – 41.5 teraflops
Uses unique SeaStar chip, which passes messages between thousands of CPUs
Upgrades are scheduled to be completed by the end of 2005 using dual-core Opteron
Expected to reach 100 teraflops by the end of 2005
64. We currently have two Cray T3E supercomputers, which are used to run the daily weather forecasts and to run large-scale climate studies. These are both Massively Parallel Processor (MPP) systems, which is to say that they contain a large number of processors (CPUs), each with its own separate portion of volatile RAM which serves as main memory. The individual portions of memory are relatively small, 128 MB per processor on one T3E and 256 MB on the other, but because of the number of processors involved this gives total memory sizes of 118 GB on the T3E-900 and 168 GB on the T3E-1200E respectively.
In order to run programs with very high memory requirements, it is necessary for the programmer to break down the forecast or climate data into smaller sections and distribute it across a number of processors. Each processor can access its own local memory with normal load/store operations, but data held on remote processors must be accessed using special software routines, such as the Message Passing Interface (MPI) or Cray's SHMEM system.
The T3Es run an operating system called Unicos/mk. This is based on the Unix operating system, but it has been extensively modified to allow it to run across a large number of processors simultaneously. Unicos/mk is best thought of as an interactive system which has the capacity to run batch work, rather than the other way round. The batch facilities are provided by an additional piece of Cray software called the Network Queueing System (NQS).
For reasons of efficiency, the majority of the workload on the Crays, including the weather forecasts, is run in batch mode. This allows us to ensure that the supercomputers are run at full capacity over weekends and holiday periods, as well as allowing us to allocate extra computing power to our scientists when it is not required to produce the daily forecasts.
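A minimal sketch of the message-passing style described above, using standard MPI calls (this is illustrative, not Met Office code): each process owns its own block of data in local memory, and a value held on a remote process has to be requested explicitly.

/* MPI sketch: remote data must be sent and received, not loaded directly. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 100.0 * rank;           /* each rank's value in its own local memory */

    if (size >= 2) {
        if (rank == 0) {
            double remote;
            /* Rank 0 cannot load rank 1's memory directly; it must receive it. */
            MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %.1f from rank 1\n", remote);
        } else if (rank == 1) {
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;                              /* run with e.g. mpirun -np 2 ./a.out */
}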
66. We have a decision to make. We can rewrite our codes for each computer system we buy in order to make them run really fast and efficiently on that system; this costs us a lot of money in employing people to rewrite our code. Alternatively, we can write code that may not be as fast and efficient, but that runs well on many different systems. This is closer to the approach we actually take. In reality it is a bit of both, but as far as possible we aim for code that is highly portable to different computers.
Slide 32: Again a very complex idea to explain! Can't think off the top of my head of a good analogy to use here. I'll let you know if I can think of something...
Slide 34: Similar comments to slide 32.
67. Weather forecasting
105. Teacher’s note.
You could do this as a class exercise to show how parallel processing works.
Ask 12 pupils (if your class is smaller than 26 you can make this group smaller; as long as you have one person in this group you can demonstrate the concept). Call them Group A and ask each of them to write out the seven times table.
Ask another 12 pupils, call them Group B, to write out one line each of the seven times table. So Ann does 1*7,
Branwen does 2*7, Sabeen does 3*7, Lucy does 4*7, and so on.
Then have one pupil time the first person in Group A to finish writing out the tables.
Have another pupil time how long it takes for everybody in Group B to have written down their answers.
Group B should be much faster.
This is the idea behind Parallel Processing.
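The classroom exercise translates almost directly into code. Below is a small sketch, assuming OpenMP purely for illustration: the first loop is Group A (one worker writes every line in turn), the second is Group B (each line is handed to a different worker).

/* Parallel-processing demo: serial "Group A" versus parallel "Group B" (assumes OpenMP; compile with -fopenmp). */
#include <stdio.h>

int main(void) {
    int serial[12], parallel[12];

    /* Group A: one worker writes out all twelve lines, one after another. */
    for (int i = 1; i <= 12; i++)
        serial[i - 1] = 7 * i;

    /* Group B: up to twelve workers each write out one line at the same time. */
    #pragma omp parallel for
    for (int i = 1; i <= 12; i++)
        parallel[i - 1] = 7 * i;

    /* Both groups produce the same table; Group B just finishes sooner when enough workers are available. */
    for (int i = 1; i <= 12; i++)
        printf("%2d x 7 = %2d (serial: %2d)\n", i, parallel[i - 1], serial[i - 1]);
    return 0;
}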
108. References http://research.microsoft.com/users/gbell/craytalk/sld066.htm
http://inventors.about.com/library/inventors/blsupercomputer.htm
http://americanhistory.si.edu/csr/comphist/cray.htm
http://web.mit.edu/invent/iow/cray.html
www.top500.org
http://www.spikynorman.dsl.pipex.com/CrayWWWStuff/
http://news.zdnet.co.uk/hardware/emergingtech/0,39020357,39162182,00.htm