Presentation Transcript


  1. Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms. Lecture 1 (Overview)

  2. Course Objectives
  • A hands-on opportunity to learn:
    • Multi-core architectures; and
    • Programming multi-core systems
  • Emphasis on programming:
    • Using the multi-threading paradigm
    • Understanding the complexities
    • Applying it to generic computing/networking problems
    • Implementing on a popular embedded multi-core platform

  3. Grading Policy and Reference Books
  • Grading policy:
    • Lectures (40%)
    • Labs (50%)
    • Quizzes (daily) (10%)
  • Reference material:
    • Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
    • David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
    • Class notes

  4. Course Outline
  • Introduction
  • Parallel architectures and terminology
  • Context for current interest: multi-core processors
  • Programming paradigms for multi-core
  • Octeon processor architecture
  • Multi-threading on multi-core processors
  • Applications for multi-core processors
  • Application layer computing on multi-core
  • Performance measurement and tuning

  5. An Introduction to Parallel Computing in the context of Multi-Core Architectures

  6. Developing Software for Multi-Core: A Paradigm Shift
  • Application developers have typically been oblivious to the underlying hardware architecture:
    • Sequential program
    • Automatic/guaranteed performance benefit with each processor upgrade
    • No work required of the programmer
  • No "free lunch" with multi-core systems:
    • Multiple cores in modern processors
    • Parallel programs are needed to exploit this parallelism (see the sketch below)
    • Parallel computing is now part of the mainstream
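  To make the paradigm shift concrete, here is a minimal POSIX-threads sketch (ours, not from the course material) of the sequential-to-parallel rewrite: a plain array sum split so that each thread sums a disjoint slice. The array size, thread count, and all names are illustrative only.

    /* Sequential: for (i = 0; i < N; i++) sum += a[i];
     * Parallel: one thread per slice, partial sums combined at the join. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 4

    static double a[N];

    typedef struct { int lo, hi; double partial; } Slice;

    static void *sum_slice(void *arg) {
        Slice *s = arg;
        s->partial = 0.0;
        for (int i = s->lo; i < s->hi; i++)
            s->partial += a[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        Slice slice[NTHREADS];

        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Fork: each thread works on its own slice of the array. */
        for (int t = 0; t < NTHREADS; t++) {
            slice[t].lo = t * (N / NTHREADS);
            slice[t].hi = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, sum_slice, &slice[t]);
        }

        /* Join: wait for all threads, then combine partial results. */
        double sum = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            sum += slice[t].partial;
        }
        printf("sum = %f\n", sum);
        return 0;
    }

  Compile with "cc -std=c99 -pthread". Note that even this trivial example already needs explicit decomposition and synchronization, which is exactly the extra work the slide attributes to the end of the "free lunch".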

  7. Parallel Computing for the Mainstream: Old vs. New Programming Paradigms
  • Known tools and techniques:
    • High performance computing and communication (HPCC)
    • Wealth of existing knowledge about parallel algorithms, programming paradigms, languages and compilers, and scientific/engineering applications
  • Multi-threading for multi-core:
    • Common in desktop and enterprise applications
    • Exploits the parallelism of multi-core, with its challenges
  • New realizations of old paradigms:
    • Parallel computing on the PlayStation 3
    • Parallel computing on GPUs
    • Cluster computing for large-volume data

  8. Dealing with the Challenge of Multi-Core Programming with Hands-On Experience
  • Our objective is two-fold:
    • Overview the known paradigms for background
    • Learn using state-of-the-art implementations
  • Choice of platform for hands-on experience:
    • Cavium Networks' Octeon processor based systems
    • Multiple cores (1 to 16)
    • Suitable for embedded products
    • Commonly used in networking products
    • Standard Linux based development environment

  9. Agenda for Today
  • Parallel architectures and terminology
    • Processor technology trends
    • Architecture trends
    • Taxonomy
  • Why multi-core architectures?
    • Traditional parallel computing
    • Transition to multi-core architectures
  • Programming paradigms
    • Traditional
    • Recent additions
  • Introduction to Octeon processor based systems

  10. Architecture and Terminology
  Background of parallel architectures and commonly used terminology

  11. Architectures and Terminology
  • Objectives of this section:
    • Understand processor technology trends
    • Realize that parallel architectures evolve based on technology and architecture trends
    • Learn the terminology used in parallel computing:
      • Von Neumann architecture
      • Flynn's taxonomy
      • Bell's taxonomy
      • Other commonly used terminology

  12. Processor Technology Trends
  Processor technology evolution, Moore's law, ILP, and current trends

  13. Processor Technology Evolution
  • Increasing number of transistors on a chip
    • Moore's law: the number of transistors on a chip is expected to double every 18 months
    • Chip densities are reaching their physical limits; technological breakthroughs have kept Moore's law alive
  • Increasing clock rates during the 1990s
    • Faster and smaller transistors, gates, and circuits on a chip
    • Clock rates of microprocessors increased by ~30% per year
  • Benchmark results (e.g., the SPEC suite) indicate performance improvement with technology

  14. Moore's Law
  • Gordon Moore, co-founder of Intel
  • 1965: observed that, since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year, and predicted that this trend would continue for the foreseeable future
  • 1975: revised the prediction, with circuit complexity doubling every 18 months (see the formula below)
  • This was simply a prediction, based on little data; however, it has defined the processor industry
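  As a rough quantitative illustration (our notation, not from the slides), the 1975 formulation can be written with t in years and N0 the transistor count at t = 0 as:

      N(t) = N0 * 2^(t / 1.5)

  Over one decade this predicts a factor of 2^(10/1.5) ≈ 2^6.7 ≈ 100x growth in transistor count.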

  15. Moore's Original Law (2)
  Source: ftp://download.intel.com/research/silicon/moorespaper.pdf

  16. Moore's Original Issues
  • Design cost: still valid
  • Power dissipation: still valid
  • What to do with all the functionality possible
  Source: ftp://download.intel.com/research/silicon/moorespaper.pdf

  17. Moore's Law and Intel Processors
  From: http://www.intel.com/technology/silicon/mooreslaw/pix/mooreslaw_chart.gif

  18. Good News: Moore's Law isn't Done Yet
  Source: Webinar by Dr. Tim Mattson, Intel Corp.

  19. Bad News: Single Thread Performance is Falling Off

  20. Worse News: Power Trend (normalized to the i486)
  Source: Webinar by Dr. Tim Mattson, Intel Corp.

  21. Addressing Power Issues
  Source: Webinar by Dr. Tim Mattson, Intel Corp.

  22. Architecture Optimized for Power: A Big Step in the Right Direction
  Source: Webinar by Dr. Tim Mattson, Intel Corp.

  23. Long Term Solution: Multi-Core
  Source: Webinar by Dr. Tim Mattson, Intel Corp.

  24. Summary of Technology Trends
  • Moore's law is still relevant, but we need to deal with related issues:
    • Design complexity
    • Power consumption
  • Uniprocessor performance is slowing down
  • Multiple processor cores resolve these issues:
    • Parallelism at the hardware level
    • The end user is exposed to it
    • Added complexities related to programming such systems

  25. Taxonomy for Parallel Architectures
  Von Neumann, Flynn's, and Bell's taxonomies and other common terminology

  26. Von Neumann Architecture Evolution
  [Diagram: the Von Neumann architecture evolves from scalar sequential execution, through lookahead and I/E overlap, to functional parallelism (multiple functional units, pipelining); pipelining leads to implicit and explicit vector machines (memory-to-memory, register-to-register), which lead to SIMD (associative processors, processor arrays) and MIMD (multicomputers, multiprocessors), and on to massively parallel processors.]

  27. Pipelining and Parallelism
  • Instructions are prefetched to overlap execution
  • Functional parallelism is supported by:
    • Multiple functional units
    • Pipelining
  • Pipelining takes several forms:
    • Pipelined instruction execution
    • Pipelined arithmetic computations
    • Pipelined memory access operations
  • Pipelining is attractive for performing identical operations repeatedly over vector data (see the loop sketch below)
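  The last bullet is easiest to see in code. Below is an illustrative C loop (ours, not from the slides): the same multiply-add repeats over a long vector with no dependence between iterations, so a pipelined arithmetic unit can start a new iteration's operation each cycle rather than waiting for the previous one to complete.

    /* Illustrative loop: identical, independent multiply-adds over a
     * vector keep a pipelined floating-point unit continuously busy. */
    void saxpy(int n, float alpha, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];   /* independent across iterations */
    }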

  28. Flynn's Classification
  • Michael Flynn (1972) classified architectures based on instruction and data streams
  • Single Instruction stream over a Single Data stream (SISD):
    • Conventional sequential machines
  • Single Instruction stream over Multiple Data streams (SIMD):
    • Vector computers equipped with scalar and vector hardware (contrasted with SISD in the sketch after slide 29)

  29. Flynn's Classification (2)
  • Multiple Instruction streams over a Single Data stream (MISD):
    • The same data flows through a linear array of processors
    • Also known as systolic arrays, used for pipelined execution of algorithms
  • Multiple Instruction streams over Multiple Data streams (MIMD):
    • A suitable model for general purpose parallel architectures
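  A minimal C sketch (ours, using the GCC/Clang vector-extension syntax; the v4sf type name is our own) contrasting the SISD and SIMD styles on the same element-wise addition:

    /* SISD vs. SIMD on the same computation. In the SIMD version, one
     * instruction applies the add to four data elements at once. */
    typedef float v4sf __attribute__((vector_size(16)));

    void add_sisd(int n, const float *a, const float *b, float *c) {
        for (int i = 0; i < n; i++)       /* one instruction, one datum */
            c[i] = a[i] + b[i];
    }

    void add_simd(int n, const v4sf *a, const v4sf *b, v4sf *c) {
        for (int i = 0; i < n / 4; i++)   /* one instruction, four data */
            c[i] = a[i] + b[i];
    }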

  30. Bell's Taxonomy for MIMD
  • Multicomputers: multiple address spaces
    • System consists of multiple computers, called nodes
    • Nodes are interconnected by a message-passing network
    • Each node has its own processor, memory, NIC, and I/O devices
  • Multiprocessors: shared address space
    • Further classified based on how memory is accessed:
      • Uniform Memory Access (UMA)
      • Non-Uniform Memory Access (NUMA)
      • Cache-Only Memory Access (COMA)
      • Cache-Coherent Non-Uniform Memory Access (cc-NUMA)

  31. Multicomputer Generations
  • First generation (1983-87):
    • Processor boards connected in a hypercube architecture
    • Software-controlled message switching
    • Examples: Caltech Cosmic Cube, Intel iPSC/1
  • Second generation (1988-92):
    • Mesh-connected architecture
    • Hardware message routing
    • Software environment for medium-grain distributed computing
    • Example: Intel Paragon
  • Third generation (1993-97):
    • Fine-grain multicomputers
    • Examples: MIT J-Machine and Caltech Mosaic

  32. Multiprocessor Examples
  • Distributed memory (scalable):
    • Dynamic binding of addresses to processors (KSR)
    • Static binding, caching (Alliant, DASH)
    • Static program binding (BBN, Cedar)
  • Central memory (not scalable):
    • Cross-point or multi-stage interconnect (Cray, Fujitsu, Hitachi, IBM, NEC, Tera)
    • Simple multi-bus (DEC, Encore, NCR, Sequent, SGI, Sun)

  33. Supercomputers
  • Supercomputers use vector processing and data parallelism
  • Classified into two categories:
    • Vector supercomputers
    • SIMD supercomputers
  • SIMD machines offer massive data parallelism:
    • An instruction is broadcast to a large number of PEs
    • Examples: Illiac IV (64 PEs), MasPar MP-1 (16,384 PEs), and CM-2 (65,536 PEs)

  34. Vector Supercomputers
  • Machines with powerful vector processors
  • If a decoded instruction is a vector operation, it is sent to the vector unit
  • Register-to-register architecture: Fujitsu VP2000 series
  • Memory-to-memory architecture: Cyber 205
  • Pipelined vector supercomputers: Cray Y-MP

  35. Dataflow Architectures
  • Represent computation as a graph of essential dependences
  • A logical processor at each node is activated by the availability of its operands
  • Messages (tokens) carrying the tag of the next instruction are sent to the next processor
  • Tags are compared with others in a matching store; a match fires execution

  36. Dataflow Architectures (2)

  37. Systolic Architectures
  • Replace the single processor with an array of regular processing elements
  • Orchestrate data flow for high throughput with less memory access

  38. Systolic Architectures (2)
  • Different from pipelining:
    • Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory
  • Different from SIMD: each PE may do something different
  • Initial motivation: VLSI enables inexpensive special-purpose chips
  • Represent algorithms directly as chips connected in a regular pattern

  39. ´ ´ ´ ´ y ( i ) = w 1 x ( i ) + w 2 x ( i + 1) + w 3 x ( i + 2) + w 4 x ( i + 3) x 8 x 6 x 4 x 2 x 3 x 1 x 7 x 5 w 4 w 3 w 2 w 1 y 3 y 2 y 1 x in x out x out = x x x = x in ´ y out = y in + w x in w y in y out Systolic Arrays (Cont’d) • Example: Systolic array for 1-D convolution • Practical realizations (e.g. iWARP) use quite general processors • Enable variety of algorithms on same hardware • But dedicated interconnect channels • Data transfer directly from register to register across channel • Specialized, and same problems as SIMD • General purpose systems work well for same algorithms (locality etc.) 1-8

  40. Cluster of Computers
  • Started as a poor man's parallel system:
    • Inexpensive PCs
    • Inexpensive switched Ethernet
    • A run-time system to support message passing (see the MPI sketch below)
  • Low performance for HPCC applications:
    • High network I/O latency
    • Low bandwidth
  • Suitable for high throughput applications:
    • Data center applications
    • Virtualized resources
    • Independent threads or processes
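  A minimal message-passing sketch (ours) in MPI, the kind of paradigm such run-time systems support: each non-zero rank sends one integer to rank 0, which receives and sums them. The example is illustrative only; a real code would typically use a collective such as MPI_Reduce instead of a receive loop.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Rank 0 collects one message from every other node. */
            int total = 0, msg;
            for (int src = 1; src < size; src++) {
                MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += msg;
            }
            printf("sum of ranks = %d\n", total);
        } else {
            /* Every other rank sends its rank number to rank 0. */
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

  Compile with mpicc and launch with mpirun; note that, unlike the shared-memory examples earlier, all communication here is explicit, which is why latency and bandwidth dominate cluster performance.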

  41. Summary of Taxonomy
  • Multiple taxonomies:
    • Based on functional parallelism: Von Neumann and Flynn's taxonomies
    • Based on programming paradigm: Bell's taxonomy
  • Parallel architecture types:
    • Multi-computers (distributed address space)
    • Multi-processors (shared address space)
    • Multi-core
    • Multi-threaded
    • Others: vector, dataflow, systolic, and cluster

  42. Why Multi-Core Architectures?
  Based on technology and architecture trends

  43. Multi-Core Architectures
  • Traditional architectures:
    • Sequential: Moore's law meant ever-increasing clock frequencies
    • Parallel: diminishing returns from instruction-level parallelism (ILP)
  • Transition to multi-core:
    • Architecture: similar to SMPs
    • Programming: typically a shared address space (SAS)
  • Challenges of the transition:
    • Performance requires efficient parallelization (see the counter sketch below)
    • Selecting a suitable programming paradigm
    • Performance tuning
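  The parallelization challenge shows up even in trivial shared-address-space code. A sketch (ours, not from the slides): two threads incrementing a shared counter silently lose updates unless the increment is protected, and the lock itself then becomes a serialization point that performance tuning must address.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* without this, updates race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expect 2000000)\n", counter);
        return 0;
    }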

  44. Traditional Parallel Architectures
  Definition and development tracks

  45. Defining a Parallel Architecture
  • A sequential architecture is characterized by:
    • A single processor
    • A single control flow path
  • A parallel architecture has:
    • Multiple processors with an interconnection network
    • Multiple control flow paths
    • Communication and synchronization
  • A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast

  46. Broad Issues in Parallel Architectures
  • Resource allocation:
    • How large a collection?
    • How powerful are the elements?
    • How much memory?
  • Data access, communication, and synchronization:
    • How do the elements cooperate and communicate?
    • How are data transmitted between processors?
    • What are the abstractions and primitives for cooperation?
  • Performance and scalability:
    • How does it all translate into performance?
    • How does it scale?

  47. General Context: Multiprocessors
  • A multiprocessor is any computer with several processors
  • SIMD: single instruction, multiple data
    • Example: modern graphics cards
  • MIMD: multiple instructions, multiple data
    • Example: the Lemieux cluster, Pittsburgh Supercomputing Center

  48. Architecture Development Tracks
  • Multiple-processor tracks:
    • Shared-memory track
    • Message-passing track
  • Multivector and SIMD tracks
  • Multithreaded and dataflow tracks
  • Multi-core track

  49. Shared-Memory Track
  • Starts with the C.mmp system developed at CMU in 1972:
    • A UMA multiprocessor with 16 PDP-11/40 processors
    • Connected to 16 shared memory modules via a crossbar switch
    • A pioneering multiprocessor OS (Hydra) development effort
  • Later systems:
    • Illinois Cedar (1987)
    • IBM RP3 (1985)
    • BBN Butterfly (1989)
    • NYU Ultracomputer (1983)
    • Stanford DASH (1992)
    • Fujitsu VPP500 (1992)
    • KSR1 (1990)

  50. Message-Passing Track
  • The Cosmic Cube (1981) pioneered message-passing computers
  • Medium-grain multicomputers:
    • Intel iPSC (1983)
    • Intel Paragon (1992)
    • nCUBE-2 (1990)
  • Fine-grain multicomputers:
    • Mosaic (1992)
    • MIT J-Machine (1992)
