Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms. Lecture 1 (Overview)
Course Objectives • A hands-on opportunity to learn: • Multi-core architectures; and • Programming multi-core systems • Emphasis on programming: • Using the multi-threading paradigm • Understand the complexities • Apply to generic computing/networking problems • Implement on a popular embedded multi-core platform
Grading Policy and Reference Books • Grading Policy • Lectures (40%) • Labs (50%) • Quizzes (daily) (10%) • Reference material: • Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006 • David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998 • Class notes
Course Outline • Introduction • Parallel architectures and terminology • Context for current interest: multi-core processors • Programming paradigms for multi-core • Octeon processor architecture • Multi-threading on multi-core processors • Applications for multi-core processors • Application layer computing on multi-core • Performance measurement and tuning
An Introduction to Parallel Computing in the context of Multi-Core Architectures
Developing Software for Multi-Core: A Paradigm Shift • Application developers are typically oblivious to the underlying hardware architecture • Sequential program • Automatic/guaranteed performance benefit with processor upgrade • No work for the programmer • No “free lunch” with multi-core systems • Multiple cores in modern processors • Parallel programs needed to exploit parallelism • Parallel computing is now part of the mainstream
Parallel Computing for the Mainstream: Old vs. New Programming Paradigms • Known tools and techniques: • High performance computing and communication (HPCC) • Wealth of existing knowledge about parallel algorithms, programming paradigms, languages and compilers, and scientific/engineering applications • Multi-threading for multi-core • Common in desktop and enterprise applications • Exploits parallelism of multi-core with its challenges • New realizations of old paradigms: • Parallel computing on PlayStation 3 • Parallel computing on GPUs • Cluster computing for large volume data
Dealing with the Challenge of Multi-Core Programming with Hands-On Experience • Our objective is two-fold • Review the known paradigms for background • Learn by using state-of-the-art implementations • Choice of platform for hands-on experience • Cavium Networks’ Octeon processor based system • Multiple cores (1 to 16) • Suitable for embedded products • Commonly used in networking products • Standard Linux based development environment
Agenda for Today • Parallel architectures and terminology • Processor technology trends • Architecture trends • Taxonomy • Why multi-core architectures? • Traditional parallel computing • Transition to multi-core architectures • Programming paradigms • Traditional • Recent additions • Introduction to Octeon processor based systems
Architecture and Terminology: Background of parallel architectures and commonly used terminology
Architectures and Terminology • Objectives of this section: • Understand the processor technology trends • Realize that parallel architectures evolve based on technology and architecture trends • Terminology used in parallel computing • Von Neumann • Flynn’s taxonomy • Bell’s taxonomy • Other commonly used terminology
Processor Technology Trends: Processor technology evolution, Moore’s law, ILP, and current trends
Processor Technology Evolution • Increasing number of transistors on a chip • Moore’s law: number of transistors on a chip is expected to double every 18 months • Chip densities are reaching their physical limits • Technological breakthroughs have kept Moore’s law alive • Increasing clock rates during the 1990s • Faster and smaller transistors, gates, and circuits on a chip • Clock rates of microprocessors increased by ~30% per year • Benchmark (e.g., SPEC suite) results indicate performance improvement with technology
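To make the two growth rates on this slide concrete, here is a small arithmetic sketch. It assumes idealized, perfectly regular growth (a doubling every 18 months for transistor count and a steady 30% per year for clock rate); the function names are illustrative and not part of the course material.

```python
def growth_factor(years, doubling_period_years):
    """Growth multiple after `years` for a quantity that doubles every period."""
    return 2 ** (years / doubling_period_years)


def compound_growth(years, annual_rate):
    """Growth multiple after `years` at a fixed annual rate (0.30 means 30%)."""
    return (1 + annual_rate) ** years


if __name__ == "__main__":
    # Transistor count under an 18-month doubling, over one decade.
    print(f"transistors: x{growth_factor(10, 1.5):.1f}")
    # Clock rate at ~30% per year, over the same decade.
    print(f"clock rate:  x{compound_growth(10, 0.30):.1f}")
```

Under these assumptions, a decade gives roughly a 100-fold increase in transistor count but only about a 14-fold increase in clock rate, which is one way to see why extra transistors eventually went into more cores rather than faster ones.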
Moore’s Law • Gordon Moore, co-founder of Intel • 1965: observed that, since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year, and predicted that this trend would continue for the foreseeable future • 1975: revised to a doubling roughly every two years (the widely quoted 18-month figure is a later popularization) • This was simply a prediction • Based on little data • However, it has defined the processor industry
Moore’s Original Law (2) Source: ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Original Issues • Design cost: still valid • Power dissipation: still valid • What to do with all the functionality possible Source: ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Law and Intel Processors From: http://www.intel.com/technology/silicon/mooreslaw/pix/mooreslaw_chart.gif
Good News: Moore’s Law isn’t done yet Source: Webinar by Dr. Tim Mattson, Intel Corp.
Worse News: Power (normalized to i486) Trend Source: Webinar by Dr. Tim Mattson, Intel Corp.
Addressing Power Issues Source: Webinar by Dr. Tim Mattson, Intel Corp.
Architecture Optimized for Power: a big step in the right direction Source: Webinar by Dr. Tim Mattson, Intel Corp.
Long term solution: Multi-Core Source: Webinar by Dr. Tim Mattson, Intel Corp.
Summary of Technology Trends • Moore’s law is still relevant • Need to deal with related issues • Design complexity • Power consumption • Uniprocessor performance is slowing down • Multiple processor cores resolve these issues • Parallelism at hardware level • End user is exposed to it • Added complexities related to programming such systems
Taxonomy for Parallel Architectures: Von Neumann, Flynn, and Bell’s taxonomies and other common terminology
Von Neumann Architecture Evolution [Figure: evolution tree. Node labels: Von Neumann architecture; Scalar; Sequential; Lookahead; I/E overlap; Functional parallelism; Multiple functional units; Pipeline; Implicit vector; Explicit vector; Memory-to-memory; Register-to-register; SIMD; MIMD; Associative processor; Processor array; Multicomputer; Multiprocessor; Massively Parallel Processors]
Pipelining and Parallelism • Instruction prefetch to overlap execution • Functional parallelism supported by: • Multiple functional units • Pipelining • Pipelining takes several forms: • Pipelined instruction execution • Pipelined arithmetic computations • Pipelined memory access operations • Pipelining is attractive for performing identical operations repeatedly over vector data strings
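The appeal of pipelining for long runs of identical operations can be seen with a simple cycle count. The sketch below assumes an idealized pipeline in which every stage takes one cycle and there are no stalls or hazards; the function names are illustrative only.

```python
def unpipelined_cycles(num_ops, num_stages):
    """Each operation passes through all stages before the next one starts."""
    return num_ops * num_stages


def pipelined_cycles(num_ops, num_stages):
    """The first result appears after num_stages cycles, then one per cycle."""
    if num_ops == 0:
        return 0
    return num_stages + (num_ops - 1)


def speedup(num_ops, num_stages):
    """Ideal speedup of the pipelined over the unpipelined execution."""
    return unpipelined_cycles(num_ops, num_stages) / pipelined_cycles(num_ops, num_stages)


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(f"{n:>6} ops on a 5-stage pipeline: speedup {speedup(n, 5):.2f}")
```

With these assumptions the speedup approaches the number of stages as the operation count grows, which is why long vectors suit pipelined hardware better than short, irregular sequences.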
Flynn’s Classification • Michael Flynn classified architectures in 1972 based on instruction and data streams • Single Instruction stream over a Single Data stream (SISD) • Conventional sequential machines • Single Instruction stream over Multiple Data streams (SIMD) • Vector computers are equipped with scalar and vector hardware
Flynn’s Classification (2) • Multiple Instruction streams over Single Data stream (MISD) • Same data flowing through a linear array of processors • Also known as systolic arrays for pipelined execution of algorithms • Multiple Instruction streams over Multiple Data streams (MIMD) • Suitable model for general purpose parallel architectures
Bell’s Taxonomy for MIMD • Multicomputers • Multiple address spaces • System consists of multiple computers, called nodes • Nodes are interconnected by a message-passing network • Each node has its own processor, memory, NIC, and I/O devices • Multiprocessors • Shared address space • Further classified based on how memory is accessed • Uniform Memory Access (UMA) • Non-Uniform Memory Access (NUMA) • Cache-Only Memory Architecture (COMA) • Cache-Coherent Non-Uniform Memory Access (cc-NUMA)
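The difference between the two MIMD classes shows up in how workers combine results. The following is a minimal sketch, not a model of any real machine: it uses Python threads, which always share one address space, so the second function only imitates the message-passing style by restricting workers to sending results over a queue. All names here are illustrative assumptions.

```python
import queue
import threading


def shared_address_space_sum(data, num_workers=4):
    """Multiprocessor style: workers update one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization protects the shared location
            total += partial

    threads = [threading.Thread(target=worker, args=(data[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(data, num_workers=4):
    """Multicomputer style: each node keeps private data and sends one message."""
    mailbox = queue.Queue()

    def node(chunk):
        mailbox.put(sum(chunk))  # the only interaction is an explicit send

    threads = [threading.Thread(target=node, args=(data[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The receiver collects exactly one message per node.
    return sum(mailbox.get() for _ in range(num_workers))


if __name__ == "__main__":
    values = list(range(1000))
    print(shared_address_space_sum(values), message_passing_sum(values))
```

In the shared address space version the programmer must reason about synchronization on shared data; in the message-passing version the data partitioning and communication are explicit instead.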
Multicomputer Generations • First generation (1983-87) • Processor boards connected in hypercube architecture • Software-controlled message switching • Examples: Caltech Cosmic Cube, Intel iPSC/1 • Second generation (1988-1992) • Mesh connected architecture • Hardware message routing • Software environment for medium grain distributed computing • Example: Intel Paragon • Third generation (1993-1997) • Fine grain multicomputers • Examples: MIT J-Machine and Caltech Mosaic
Multiprocessor Examples • Distributed memory (scalable) • Dynamic binding of address to processors (KSR) • Static binding, caching (Alliant, DASH) • Static program binding (BBN, Cedar) • Central memory (not scalable) • Cross-point or multi-stage (Cray, Fujitsu, Hitachi, IBM, NEC, Tera) • Simple multi bus (DEC, Encore, NCR, Sequent, SGI, Sun)
Supercomputers • Supercomputers use vector processing and data parallelism • Classified into two categories • Vector supercomputers • SIMD supercomputers • SIMD machines with massive data parallelism • Instruction is broadcast to a large number of PEs • Examples: Illiac IV (64 PEs), MasPar MP-1 (16,384 PEs), and CM-2 (65,536 PEs)
Vector Supercomputers • Machines with powerful vector processors • If decoded instruction is a vector operation, it is sent to vector unit • Register-register architecture: • Fujitsu VP2000 series • Memory-to-memory architecture: • Cyber 205 • Pipelined vector supercomputers: • Cray Y-MP
Dataflow Architectures • Represent computation as a graph of essential dependences • Logical processor at each node, activated by availability of operands • Messages (tokens) carrying tag of next instruction sent to next processor • Tag compared with others in matching store; match fires execution
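The firing rule on this slide (a node executes once all of its operands have arrived) can be sketched in a few lines. This is a toy interpreter under simplifying assumptions: each value is produced once, tokens are identified by the name of the producing node, and the `waiting` table stands in for the matching store. The function and graph names are hypothetical.

```python
import operator
from collections import deque


def run_dataflow(nodes, initial_tokens):
    """Execute a dataflow graph.

    nodes maps a node name to (function, [operand names]).
    initial_tokens maps input names to values.
    Returns the computed values and the order in which nodes fired.
    """
    # Operands each node is still waiting for (a stand-in for the matching store).
    waiting = {name: set(inputs) for name, (_, inputs) in nodes.items()}
    # For each token name, the nodes that consume it.
    consumers = {}
    for name, (_, inputs) in nodes.items():
        for src in set(inputs):
            consumers.setdefault(src, []).append(name)

    values = {}
    fired = []
    tokens = deque(initial_tokens.items())
    while tokens:
        name, value = tokens.popleft()
        values[name] = value
        for consumer in consumers.get(name, []):
            waiting[consumer].discard(name)
            if not waiting[consumer]:
                # All operands are present, so the node fires.
                fn, inputs = nodes[consumer]
                result = fn(*(values[i] for i in inputs))
                fired.append(consumer)
                tokens.append((consumer, result))
    return values, fired


if __name__ == "__main__":
    # Graph for (a + b) * (c - d); no program counter decides the order.
    graph = {
        "sum": (operator.add, ["a", "b"]),
        "diff": (operator.sub, ["c", "d"]),
        "prod": (operator.mul, ["sum", "diff"]),
    }
    values, fired = run_dataflow(graph, {"a": 1, "b": 2, "c": 7, "d": 3})
    print(values["prod"], fired)
```

Note that "sum" and "diff" have no dependence on each other, so a real dataflow machine could fire them in parallel; this sequential sketch only demonstrates the operand-availability rule.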
Systolic Architectures • Replace single processor with array of regular processing elements • Orchestrate data flow for high throughput with less memory access
Systolic Architectures (2) • Different from pipelining • Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory • Different from SIMD: each PE may do something different • Initial motivation: VLSI enables inexpensive special-purpose chips • Represent algorithms directly by chips connected in regular pattern
Systolic Arrays (Cont’d) • Example: systolic array for 1-D convolution, $y(i) = w_1 x(i) + w_2 x(i+1) + w_3 x(i+2) + w_4 x(i+3)$ • [Figure: four cells holding the weights $w_4, w_3, w_2, w_1$, with inputs $x_1 \dots x_8$ and outputs $y_1, y_2, y_3$] • Each cell holds a weight $w$ and computes $x_{out} = x$; $x = x_{in}$; $y_{out} = y_{in} + w \times x_{in}$ • Practical realizations (e.g. iWARP) use quite general processors • Enable variety of algorithms on same hardware • But dedicated interconnect channels • Data transfer directly from register to register across channel • Specialized, and same problems as SIMD • General purpose systems work well for same algorithms (locality etc.)
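The 1-D convolution on this slide can be checked numerically with a short sketch. It applies the per-cell update (add the cell's weight times the incoming x to the incoming partial sum) across a chain of four cells to produce each output. It deliberately does not model the cycle-by-cycle timing or the one-step delay register in each cell, and the names `cell` and `convolve_1d` are illustrative.

```python
def cell(w, x_in, y_in):
    """One processing element: pass x along and accumulate y_in + w * x_in."""
    return x_in, y_in + w * x_in


def convolve_1d(weights, xs):
    """Compute y(i) = w1*x(i) + w2*x(i+1) + ... by chaining the cell update."""
    outputs = []
    for i in range(len(xs) - len(weights) + 1):
        y = 0
        for k, w in enumerate(weights):
            # The partial sum visits one cell per weight.
            _, y = cell(w, xs[i + k], y)
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    weights = [1, 2, 3, 4]           # w1..w4 (example values)
    xs = [1, 2, 3, 4, 5, 6, 7, 8]    # x1..x8 (example values)
    print(convolve_1d(weights, xs))
```

In a hardware array all cells perform their update in the same clock step on different data, so one output emerges per step once the array is full; the loop above only reproduces the arithmetic.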
Cluster of Computers • Started as a poor man’s parallel system • Inexpensive PCs • Inexpensive switched Ethernet • Run-time system to support message-passing • Low performance for HPCC applications • High network I/O latency • Low bandwidth • Suitable for high throughput applications • Data center applications • Virtualized resources • Independent threads or processes
Summary of Taxonomy • Multiple taxonomies • Based on functional parallelism • Von Neumann and Flynn’s taxonomies • Based on programming paradigm • Bell’s taxonomy • Parallel architecture types • Multi-computers (distributed address space) • Multi-processors (shared address space) • Multi-core • Multi-threaded • Others: vector, data flow, systolic, and cluster
Why Multi-Core Architectures: Based on technology and architecture trends
Multi-Core Architectures • Traditional architectures • Sequential: Moore’s law = increasing clock frequency • Parallel: diminishing returns from ILP • Transition to multi-core • Architecture similar to SMPs • Programming typically uses a shared address space (SAS) • Challenges to transition • Performance = efficient parallelization • Selecting a suitable programming paradigm • Performance tuning
Traditional Parallel Architectures: Definition and development tracks
Defining a Parallel Architecture • A sequential architecture is characterized by: • Single processor • Single control flow path • Parallel architecture: • Multiple processors with interconnection network • Multiple control flow paths • Communication and synchronization • A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast
Broad Issues in Parallel Architectures • Resource allocation: • how large a collection? • how powerful are the elements? • how much memory? • Data access, communication and synchronization • how do the elements cooperate and communicate? • how are data transmitted between processors? • what are the abstractions and primitives for cooperation? • Performance and scalability • how does it all translate into performance? • how does it scale?
General Context: Multiprocessors • A multiprocessor is any computer with several processors • SIMD • Single instruction, multiple data • Modern graphics cards • MIMD • Multiple instructions, multiple data [Image: Lemieux cluster, Pittsburgh Supercomputing Center]
Architecture Development Tracks • Multiple-Processor Tracks • Shared-memory track • Message-passing track • Multivector and SIMD Tracks • Multithreaded and Dataflow Tracks • Multi-core track
Shared-Memory Track • Starts with the C.mmp system developed at CMU in 1972 • UMA multiprocessor with 16 PDP-11/40 processors • Connected to 16 shared memory modules via crossbar switch • Pioneering multiprocessor OS (Hydra) development effort • Illinois Cedar (1987) • IBM RP3 (1985) • BBN Butterfly (1989) • NYU Ultracomputer (1983) • Stanford DASH (1992) • Fujitsu VPP500 (1992) • KSR1 (1990)
Message-Passing Track • The Cosmic Cube (1981) pioneered message-passing computers • Intel iPSCs (1983) • Intel Paragon (1992) • Medium-grain multicomputers • nCUBE-2 (1990) • Mosaic (1992) • MIT J-Machine (1992) • Fine-grain multicomputers