Tuesday, September 04, 2006 I hear and I forget, I see and I remember, I do and I understand. -Chinese Proverb
Today • Course Overview. • Why Parallel Computing? • Evolution of Parallel Systems.
CS 524: High Performance Computing • Course URL http://suraj.lums.edu.pk/~cs524a06 • Folder on indus \\indus\Common\cs524a06 • Website – check regularly: course announcements, office hours, slides, resources, policies … • Course Outline
Several programming exercises will be given throughout the course. Assignments will use popular programming models for shared memory and message passing, such as OpenMP and MPI. • The development environment will be C/C++ on UNIX.
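As a flavor of what a first message-passing exercise might look like, here is a minimal MPI "hello world" in C (a sketch only; it assumes an MPI implementation such as MPICH or Open MPI is installed, and the file name and process count below are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes  */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                          /* shut the runtime down      */
        return 0;
    }

Compiled with mpicc hello.c and launched with mpirun -np 4 ./a.out, each of the four processes prints its own rank.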
Pre-requisites • Computer Organization & Assembly Language (CS 223) • Data Structures & Algorithms (CS 213) • Senior level standing. • Operating Systems?
Hunger For More Power! • Endless quest for more and more computing power. • However much computing power there is, it is never enough.
Why this need for greater computational power? • Science, engineering, business, entertainment, etc. all provide the impetus. • Scientists – observe, theorize, test through experimentation. • Engineers – design, test prototypes, build.
HPC offers a new way to do science: computation is used to approximate physical systems. Advantages include: • Playing with simulation parameters to study emergent trends • Possible replay of a particular simulation event • Studying systems where no exact theories exist
Why Turn to Simulation? When the problem is too . . . • Complex • Large • Expensive • Dangerous
Why this need for greater computational power? • Less expensive to carry out computer simulations. • Able to simulate phenomena that could not be studied through experimentation, e.g. the evolution of the universe.
Why this need for greater computational power? • Problems such as: • Weather prediction • Aeronautics (airflow analysis, structural mechanics, engine efficiency, etc.) • Simulating the world economy • Pharmaceuticals (molecular modeling) • Understanding drug-receptor interactions in the brain • Automotive crash simulation are all computationally intensive. • The more knowledge we acquire, the more complex our questions become.
Why this need for greater computational power? • In 1995, the first full-length computer-animated motion picture, Toy Story, was produced on a parallel system composed of hundreds of Sun workstations. • Decreased cost • Decreased time (several months on several hundred processors)
Why this need for greater computational power? • Commercial computing has also come to rely on parallel architectures. • Computer system speed and capacity translate into the scale of business that can be supported. • OLTP (online transaction processing) benchmarks represent the relation between performance and scale of business. • They rate the performance of a system in terms of its throughput in transactions per minute.
Why this need for greater computational power? • Vendors supplying database hardware or software offer multiprocessor systems that provide performance substantially greater than uniprocessor products.
One solution in the past: make the clock run faster. • The advance of VLSI technology allowed clock rates to increase and a larger number of components to fit on a chip. • However, there are limits… Electrical signals cannot propagate faster than the speed of light: 30 cm/nsec in vacuum and 20 cm/nsec in copper wire or optical fiber.
At these propagation speeds: • 10-GHz clock – signal path length of 2 cm in total • 100-GHz clock – 2 mm • A 1-THz (1000 GHz) computer would have to be smaller than 100 microns if the signal has to travel from one end to the other and back within a single clock cycle.
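A quick check of these figures, taking the propagation speed in copper or fiber as 20 cm/ns (the distance covered per cycle is the speed divided by the clock frequency):

$$d = \frac{v}{f}: \qquad \frac{20\,\mathrm{cm/ns}}{10\,\mathrm{GHz}} = 2\,\mathrm{cm}, \qquad \frac{20\,\mathrm{cm/ns}}{100\,\mathrm{GHz}} = 2\,\mathrm{mm}, \qquad \frac{20\,\mathrm{cm/ns}}{1\,\mathrm{THz}} = 200\,\mathrm{\mu m}$$

so at 1 THz a signal covers only about 200 microns per cycle, or 100 microns out and back.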
Another fundamental problem: • Heat dissipation • The faster a computer runs, the more heat it generates. • High-end Pentium systems: the CPU cooling system is bigger than the CPU itself.
Evolution of Parallel Architecture • New dimension added to design space: Number of processors. • Driven by demand for performance at acceptable cost.
Evolution of Parallel Architecture • Advances in hardware capability enable new application functionality, which places a greater demand on the architecture. • This cycle drives the ongoing design, engineering and manufacturing effort.
Evolution of Parallel Architecture • Microprocessor performance has been improving at a rate of about 50% per year. • A parallel machine of a hundred processors can be viewed as giving applications the computing power that a single processor will offer in roughly 10 years' time. • 1000 processors: a 20-year horizon • The advantages of using small, inexpensive, mass-produced processors as building blocks for computer systems are clear.
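A rough check of those horizons, assuming single-processor performance keeps compounding at 50% per year:

$$1.5^{n} = 100 \;\Rightarrow\; n = \frac{\log 100}{\log 1.5} \approx 11\ \text{years}, \qquad 1.5^{n} = 1000 \;\Rightarrow\; n \approx 17\ \text{years},$$

which is in the same ballpark as the 10- and 20-year horizons quoted above.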
Technology trends • With technological advances, transistors, gates, etc. have been getting smaller and faster. • More can fit in the same area. • Processors are getting faster by making more effective use of an ever larger volume of computing resources. • Possibilities: • Place more of the computer system on the chip, including memory and I/O (a building block for parallel architectures: system-on-a-chip). • Or place multiple processors on the chip (parallel architecture in the single-chip regime).
Microprocessor Design Trends • Technology determines what is possible. • Architecture translates the potential of technology into performance. • Parallelism is fundamental to conventional computer architecture. • Current architectural trends are leading to multiprocessor designs.
Bit-level Parallelism • From 1970 to 1986, advancements in bit-level parallelism • 4-bit, 8-bit, 16-bit and so on • Doubling the data path width reduces the number of cycles required to perform an operation, as the sketch below illustrates.
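For instance (a minimal sketch, not tied to any particular processor): a 32-bit addition carried out on a 16-bit data path needs two passes plus carry handling, while a 32-bit data path does the same work in a single operation.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: a 32-bit add on a 16-bit data path takes two steps
     * (low halves, then high halves plus the carry); a 32-bit data
     * path performs it as one operation. */
    static uint32_t add32_via_16bit_path(uint32_t a, uint32_t b)
    {
        uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);      /* step 1: low halves    */
        uint32_t hi = (a >> 16) + (b >> 16) + (lo >> 16); /* step 2: highs + carry */
        return (hi << 16) | (lo & 0xFFFFu);
    }

    int main(void)
    {
        uint32_t a = 0x0001FFFFu, b = 0x00000001u;
        printf("%08x == %08x\n", add32_via_16bit_path(a, b), a + b); /* both 00020000 */
        return 0;
    }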
Instruction-level Parallelism Mid 1980s to mid 1990s • Performing portions of several machine instructions concurrently. • Pipelining (also a kind of parallelism) • Fetching multiple instructions at a time and issuing them in parallel to distinct functional units (superscalar), as in the sketch below.
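A small illustration of what superscalar hardware can and cannot exploit (the function names are purely illustrative): operations with no data dependences can be issued to separate functional units in the same cycle, while a dependent chain serializes.

    #include <stdio.h>

    /* Four independent multiplies: a wide-issue core can overlap them. */
    static double ilp_friendly(const double *x)
    {
        double a = x[0] * 2.0;
        double b = x[1] * 3.0;
        double c = x[2] * 4.0;
        double d = x[3] * 5.0;
        return (a + b) + (c + d);
    }

    /* Each step needs the previous result, so the chain executes
     * serially no matter how many functional units exist. */
    static double ilp_hostile(const double *x)
    {
        double s = x[0];
        s = s * 2.0;
        s = s * 3.0;
        s = s * 4.0;
        s = s * 5.0;
        return s;
    }

    int main(void)
    {
        double x[4] = {1.0, 1.0, 1.0, 1.0};
        printf("%.1f %.1f\n", ilp_friendly(x), ilp_hostile(x)); /* 14.0 120.0 */
        return 0;
    }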
Instruction-level Parallelism However… • Instruction-level parallelism is worthwhile only if the processor can be supplied with instructions and data fast enough. • The gap between processor cycle time and memory cycle time has grown wider. • To satisfy increasing bandwidth requirements, larger and larger caches are placed on chip with the processor. • Limits: cache misses and control transfers.
In the mid-1970s, the introduction of vector processors marked the beginning of modern supercomputing. • They perform operations on sequences of data elements rather than on individual scalar data. • They offered an advantage of at least one order of magnitude over conventional systems of that time.
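The classic example of the kind of loop a vector processor executes as operations over whole sequences of elements is DAXPY (y = a·x + y); this is a generic C sketch, not code from any particular vector machine:

    #include <stddef.h>
    #include <stdio.h>

    /* DAXPY kernel: y = a*x + y. A vector processor (or a vectorizing
     * compiler) executes each chunk of iterations as a single operation
     * over a sequence of elements rather than one scalar at a time. */
    static void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        double x[4] = {1, 2, 3, 4}, y[4] = {10, 10, 10, 10};
        daxpy(4, 2.0, x, y);
        printf("%.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]); /* 12.0 14.0 16.0 18.0 */
        return 0;
    }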
In the late 1980s a new generation of systems came on the market. These were microprocessor-based supercomputers that initially provided about 100 processors, increasing to roughly 1000 by 1990. • These aggregations of processors are known as massively parallel processors (MPPs).
Factors behind emergence of MPPs • Increase in performance of standard microprocessors • Cost advantage • Usage of “off-the-shelf” microprocessors instead of custom processors • Fostered by government programs for scalable parallel computing using distributed memory.
MPPs were claimed to equal or surpass the performance of vector multiprocessors. • Top500 • Lists the sites that have the 500 most powerful installed computer systems. • LINPACK benchmark • The most widely used metric of performance on numerical applications • A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems
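To give a feel for the computation LINPACK times (the real benchmark uses highly tuned Fortran/BLAS routines; this toy C version only illustrates the idea), here is a dense solve of Ax = b by Gaussian elimination with partial pivoting:

    #include <stdio.h>
    #include <math.h>

    #define N 3

    static void solve(double A[N][N], double b[N], double x[N])
    {
        for (int k = 0; k < N; k++) {
            /* partial pivoting: bring the largest remaining pivot to row k */
            int p = k;
            for (int i = k + 1; i < N; i++)
                if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
            for (int j = 0; j < N; j++) { double s = A[k][j]; A[k][j] = A[p][j]; A[p][j] = s; }
            double t = b[k]; b[k] = b[p]; b[p] = t;

            /* eliminate column k below the pivot */
            for (int i = k + 1; i < N; i++) {
                double m = A[i][k] / A[k][k];
                for (int j = k; j < N; j++) A[i][j] -= m * A[k][j];
                b[i] -= m * b[k];
            }
        }
        /* back substitution */
        for (int i = N - 1; i >= 0; i--) {
            x[i] = b[i];
            for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
            x[i] /= A[i][i];
        }
    }

    int main(void)
    {
        double A[N][N] = {{2, 1, 1}, {4, -6, 0}, {-2, 7, 2}};
        double b[N] = {5, -2, 9}, x[N];
        solve(A, b, x);
        printf("x = %.2f %.2f %.2f\n", x[0], x[1], x[2]); /* 1.00 1.00 2.00 */
        return 0;
    }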
Top500 (Updated twice a year since June 1993) • In the first Top500 list there were already 156 MPP and SIMD systems present (around 1/3rd)
Some memory related issues • Time to access memory has not kept pace with CPU clock speeds. • SRAM • Each bit is stored in a latch made up of transistors • Faster than DRAM, but less dense and requires greater power • DRAM • Each bit of memory is stored as a charge on a capacitor • A 1 GHz CPU will execute about 60 instructions in the time a typical 60 ns DRAM takes to return a single byte.
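To make that gap concrete (assuming, for simplicity, one instruction per clock cycle):

$$t_{\mathrm{cycle}} = \frac{1}{1\,\mathrm{GHz}} = 1\,\mathrm{ns}, \qquad \frac{60\,\mathrm{ns}}{1\,\mathrm{ns\ per\ instruction}} = 60\ \text{instructions per DRAM access}.$$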
Some memory related issues • Hierarchy • Cache memories • Temporal locality • Cache lines (64, 128, 256 bytes)
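A small sketch of why cache lines and locality matter (the array size and strides below are illustrative): a row-major C array traversed row by row reuses every fetched cache line, while the same array traversed column by column touches a new line on almost every access and typically runs far slower.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024

    /* Cache-friendly: unit stride, so each fetched cache line is fully used. */
    static double sum_row_major(double (*a)[N])
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Cache-hostile: stride of N doubles, a new cache line on almost every access. */
    static double sum_column_major(double (*a)[N])
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        double (*a)[N] = malloc(sizeof(double[N][N]));
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        printf("%.0f %.0f\n", sum_row_major(a), sum_column_major(a)); /* same sums, very different times */
        free(a);
        return 0;
    }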
Parallel Architectures: Memory Parallelism • One way to increase performance is to replicate computers. • The major choice is between shared memory and distributed memory.
Memory Parallelism • In the mid-1980s, when the 32-bit microprocessor was first introduced, computers containing multiple microprocessors sharing a common memory became prevalent. • In most of these designs all processors plug into a common bus. • However, only a small number of processors can be supported by a bus.
UMA bus-based SMP architecture • If the bus is busy when a CPU wants to read or write memory, the CPU waits for the bus to become idle. • Bus contention is manageable only for a small number of processors. • Beyond that, the system is limited by the bandwidth of the bus and most of the CPUs are idle most of the time.
UMA bus-based SMP architecture • One way to alleviate this problem is to add a cache to each CPU. • If most reads can be satisfied from the cache there is less bus traffic, and the system can support more CPUs. • Even so, a single bus limits a UMA multiprocessor to about 16-32 CPUs.
SMP • SMP (symmetric multiprocessor) • A shared-memory multiprocessor where the cost of accessing a memory location is the same for all processors.
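On such a machine the natural programming model is the shared-memory one mentioned earlier. A minimal OpenMP sketch (compile with e.g. gcc -fopenmp; the array and its size are illustrative): all threads read the same shared array, and the reduction clause combines their partial sums.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        /* a[] lives in the single shared address space; each thread sums
         * a chunk of it, and the reduction combines the partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }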