Introduction to Embedded Systems

Rabie A. Ramadan
rabieramadan@gmail.com
http://www.rabieramadan.org/classes/2014/embedded/

Topics
• Embedded microprocessor market
• Categories of CPUs
• RISC, DSP, and multimedia processors
• CPU mechanisms

Demand for Embedded Processors
• Embedded processors account for over 97% of total processors sold
• Sales are expected to increase by roughly 15% each year

Evaluating Processors
Performance
• Latency: the time required to execute an instruction from start to finish
• Throughput: the rate at which instructions are finished
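
The two performance measures above can be illustrated with a small sketch. It assumes a hypothetical trace of (start, finish) cycle numbers, one pair per instruction; the function name and the trace format are illustrative and do not come from the slides.

```python
def latency_and_throughput(trace):
    """Return (per-instruction latencies, instructions finished per cycle).

    `trace` is a list of (start_cycle, finish_cycle) pairs, one per
    instruction. This format is an assumption made for illustration.
    """
    # Latency: time from start to finish of each individual instruction.
    latencies = [finish - start for start, finish in trace]
    # Throughput: instructions finished divided by the total elapsed time.
    elapsed = max(finish for _, finish in trace) - min(start for start, _ in trace)
    return latencies, len(trace) / elapsed


# Four overlapped instructions: each one takes 5 cycles, but a new one
# starts every cycle, so latency and throughput tell different stories.
example_trace = [(0, 5), (1, 6), (2, 7), (3, 8)]
```

Overlapping the instructions leaves each latency unchanged while the throughput rises, which is why the two figures are reported separately.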

Evaluating Processors
At the program level, computer architects also speak of average performance or peak performance. Peak performance is often calculated assuming that instruction throughput proceeds at its maximum rate and all processor resources are fully utilized.

Evaluating Processors
Embedded system designers often talk about program performance in terms of worst-case (or sometimes best-case) performance. This is not simply a characteristic of the processor; it is determined for a particular program running on a given processor.

Evaluating Processors
Cost
• The purchase price of the processor
• In VLSI design, cost is often measured in terms of the silicon area required to implement a processor, which is closely related to chip cost

Evaluating Processors
Energy and power
• In modern processors, energy and power consumption must be measured for a particular program and data for accurate results
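
As a rough illustration of measuring energy for one particular program run, the sketch below integrates a sampled power trace. The sampling approach, function names, and numbers are assumptions made for illustration; the slides do not prescribe a measurement method.

```python
def energy_joules(power_samples_watts, sample_period_s):
    """Approximate energy as the sum of power * time over fixed-length samples."""
    return sum(p * sample_period_s for p in power_samples_watts)


def average_power_watts(power_samples_watts):
    """Average power over the run; differs per program and input data."""
    return sum(power_samples_watts) / len(power_samples_watts)


# Hypothetical trace: three power samples taken 0.5 s apart during one run.
samples = [1.0, 2.0, 3.0]
```

A different program, or the same program on different data, would produce a different trace, so both figures have to be re-measured per workload.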

Evaluating Processors
Predictability
• An important characteristic for embedded systems
• When designing real-time systems, we want to be able to predict execution time
• More difficult to measure

Evaluating Processors
Security
• An important characteristic of all processors, including embedded processors
• Security is inherently unmeasurable: the fact that we do not know of a successful attack on a system does not mean that such an attack cannot exist

Basic Computer Architecture
Von Neumann architecture
[Figure: a memory holding instructions and data, an input unit, an output unit, and a processor containing the ALU, the control unit (CU), and registers.]

Levels of Parallelism
• Bit-level parallelism: within arithmetic logic circuits
• Instruction-level parallelism: multiple instructions execute per clock cycle
• Memory system parallelism: overlap of memory operations with computation
• Operating system parallelism: more than one processor; multiple jobs run in parallel; loop level; procedure level

Levels of Parallelism
Bit-level parallelism
• Within arithmetic logic circuits

Levels of Parallelism
Instruction-level parallelism (ILP)
• Multiple instructions execute per clock cycle
• Pipelining (instruction, data)
• Multiple issue: very long instruction word (VLIW)
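
The benefit of pipelining can be estimated with the standard idealized cycle count. This is a minimal sketch that assumes a k-stage pipeline with no stalls or hazards and one cycle per stage; the function names are illustrative.

```python
def unpipelined_cycles(n_instructions, stages):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages):
    """Ideal pipeline: `stages` cycles to fill, then one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def ideal_speedup(n_instructions, stages):
    """Speedup of the ideal pipeline over the unpipelined machine."""
    return unpipelined_cycles(n_instructions, stages) / pipelined_cycles(n_instructions, stages)
```

For long instruction sequences the speedup approaches the number of stages; real pipelines fall short of this because of hazards and stalls, which this sketch ignores.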

Levels of Parallelism
Memory system parallelism
• Overlap of memory operations with computation

Levels of Parallelism
Operating system parallelism
• There is more than one processor
• Multiple jobs run in parallel
• Loop level
• Procedure level

Flynn’s Taxonomy
• Single Instruction stream - Single Data stream (SISD)
• Single Instruction stream - Multiple Data stream (SIMD)
• Multiple Instruction stream - Single Data stream (MISD)
• Multiple Instruction stream - Multiple Data stream (MIMD)

Single Instruction stream - Single Data stream (SISD)
Von Neumann architecture
[Figure: a memory supplying instructions and data to a processor made up of a control unit (CU) and an ALU.]

Single Instruction stream - Multiple Data stream (SIMD)
• Instructions of the program are broadcast to more than one processor
• Each processor executes the same instruction synchronously, but using different data
• Used for applications that operate upon arrays of data
[Figure: a control unit (CU) takes instructions from memory and broadcasts each one to several processing elements (PEs), each working on its own data.]
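
The SIMD idea can be mimicked in plain Python as a sketch: one operation is broadcast to several lanes, and each lane applies it to its own data element. The function name is hypothetical, and a real SIMD machine does this in hardware in a single step rather than in a loop.

```python
import operator


def simd_step(op, lanes_a, lanes_b):
    """Apply the same instruction `op` to every lane, each with different data."""
    if len(lanes_a) != len(lanes_b):
        raise ValueError("all operands must have one element per processing element")
    # Conceptually every processing element executes `op` at the same time.
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]


# One broadcast ADD over four processing elements.
result = simd_step(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
```

This is why SIMD suits array workloads: the control cost of fetching and decoding the instruction is paid once for all data elements.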

Multiple Instruction stream - Multiple Data stream (MIMD)
• Each processor has a separate program
• An instruction stream is generated for each program on each processor
• Each instruction operates upon different data

Multiple Instruction stream - Multiple Data stream (MIMD)
• Shared memory
• Distributed memory

Shared vs Distributed Memory
Distributed memory
• Each processor has its own local memory
• Message passing is used to exchange data between processors
Shared memory
• Single address space
• All processes have access to the pool of shared memory
[Figure: one diagram shows four processors (P) connected by a bus to a single memory; the other shows four processors, each with its own memory (M), connected by a network.]

Distributed Memory
• Processors cannot directly access another processor’s memory
• Each node has a network interface (NI) for communication and synchronization
[Figure: four nodes, each with a processor (P), a memory (M), and a network interface (NI), connected by a network.]
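
A toy model of distributed memory, assuming each node owns a private dictionary as its local memory and a queue that stands in for its network interface. The `Node` class and its methods are hypothetical names used only for this sketch.

```python
from collections import deque


class Node:
    """One processor with private local memory and a simple message inbox."""

    def __init__(self, name):
        self.name = name
        self.memory = {}      # local memory: other nodes never touch this directly
        self.inbox = deque()  # stands in for the network interface (NI)

    def send(self, dest, key):
        """Copy one local value into a message and deliver it to `dest`."""
        dest.inbox.append((key, self.memory[key]))

    def receive(self):
        """Take the oldest message and store its value in local memory."""
        key, value = self.inbox.popleft()
        self.memory[key] = value
        return key


p0, p1 = Node("P0"), Node("P1")
p0.memory["x"] = 42
```

The only way `p1` can obtain `x` is for `p0` to send it and for `p1` to receive it, which is the message-passing discipline the slide describes.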

Distributed Memory
• Each processor executes different instructions asynchronously, using different data
[Figure: four nodes, each with its own memory (M) supplying instructions and data to a control unit (CU) and processing element (PE); the nodes exchange data over a network.]

Shared Memory
• Each processor executes different instructions asynchronously, using different data
[Figure: four processors, each with a control unit (CU) and processing element (PE), all fetching instructions and data from one shared memory.]

Shared Memory
Uniform memory access (UMA)
• Each processor has uniform access to memory (symmetric multiprocessor, SMP)
Non-uniform memory access (NUMA)
• Time for memory access depends on the location of data
• Local access is faster than non-local access
• Easier to scale than SMPs
[Figure: UMA shown as four processors (P) on one bus with one memory; NUMA shown as two such bus-and-memory groups joined by a network.]
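
The effect of data placement under NUMA can be sketched as a weighted average of local and remote access times. The latencies and the local-access fraction used here are hypothetical numbers chosen for illustration, not measurements.

```python
import math


def numa_average_access_ns(local_fraction, local_ns, remote_ns):
    """Average access time when `local_fraction` of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Assumed figures: 100 ns local, 300 ns remote, 90% of accesses local.
avg = numa_average_access_ns(0.9, 100.0, 300.0)
```

Under UMA both latencies are equal, so the placement of data has no effect on the average.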

Distributed Shared Memory
• Makes the main memory of a cluster of computers look as if it is a single memory with a single address space
• Shared memory programming techniques can be used

Hybrid Multicore Systems
• Many general-purpose processors
• GPU (Graphics Processing Unit)
• GPGPU (General-Purpose GPU)
• The trend is:
• A board composed of multiple many-core chips sharing memory
• A rack composed of multiple boards
• A room full of these racks

Other axes of comparison
• RISC vs. CISC: instruction set style
• Instruction issue width
• Static vs. dynamic scheduling for multiple-issue machines
• Scalar vs. vector processing
• Single-threaded vs. multithreaded
• A single CPU can fit into multiple categories

RISC vs. CISC
Complex Instruction Set Computer (CISC)
• “High level” instruction set
• A single instruction executes several “low level” operations, for example a load, an arithmetic operation, and a memory store
• Examples: VAX, Intel x86, IBM 360/370

Features of CISC
• Small number of general-purpose registers
• Instructions take multiple clocks to execute
• Few lines of code per operation

RISC vs. CISC
Reduced Instruction Set Computer (RISC)
• A CPU design that recognizes only a limited number of instructions
• Simple instructions
• Instructions are executed quickly
• Examples: MIPS, DEC Alpha, Sun SPARC, IBM 801

Features of RISC
• “Reduced” instruction set
• Executes a series of simple instructions instead of one complex instruction
• Most instructions are designed to execute within one clock cycle
• Incorporates a large number of general-purpose registers for arithmetic operations, to avoid storing variables on a stack in memory
• Simple instructions pipeline well, and pipelining gives speed
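
The contrast between the two styles can be sketched with a toy machine: one hypothetical CISC-style memory-to-memory add, and the equivalent RISC-style sequence of load, load, add, store through registers. The instruction names, register names, and tuple encoding are assumptions made for illustration and do not model any real instruction set.

```python
def cisc_add_mem(memory, dst, src1, src2):
    """One complex instruction: load both operands, add, and store the result."""
    memory[dst] = memory[src1] + memory[src2]
    return 1  # instructions executed


def risc_add_mem(memory, regs, dst, src1, src2):
    """The same work as a series of simple register-based instructions."""
    program = [
        ("load", "r1", src1),        # r1 <- memory[src1]
        ("load", "r2", src2),        # r2 <- memory[src2]
        ("add", "r3", "r1", "r2"),   # r3 <- r1 + r2
        ("store", "r3", dst),        # memory[dst] <- r3
    ]
    for instr in program:
        op = instr[0]
        if op == "load":
            regs[instr[1]] = memory[instr[2]]
        elif op == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        elif op == "store":
            memory[instr[2]] = regs[instr[1]]
    return len(program)  # instructions executed
```

Both versions leave memory in the same state; the RISC version uses more instructions, but each one is simple enough to pipeline easily.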

Single issue versus Multiple issue
• Instruction issue width is an important aspect of processor performance
• Processors that can issue more than one instruction per cycle generally execute programs faster
• They do so at the cost of increased power consumption and higher cost

Static versus dynamic scheduling
• Static scheduling: the instruction schedule is determined when the program is written
• Dynamic scheduling: the hardware determines which instructions are issued at runtime
• Superscalar is a common technique for dynamic instruction issue; Tomasulo's algorithm is a well-known dynamic scheduling scheme

Embedded vs. general-purpose processors
• Embedded processors may be customized for a category of applications
• Customization may be narrow or broad
• We may judge embedded processors using different metrics: code size, energy efficiency, memory system performance, predictability

Embedded RISC processors
• RISC processors often have simple, highly pipelinable instructions
• Pipelines of embedded RISC processors have grown over time:
• ARM7 has a 3-stage pipeline
• ARM9 has a 5-stage pipeline
• ARM11 has an 8-stage pipeline

RISC processor families
• ARM: ARM7 has in-order execution and no memory management or branch prediction; ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction
• MIPS: MIPS32 4K has a 5-stage pipeline; the 4KE family has a DSP extension; 4KS is designed for security
• PowerPC: the PowerPC 400 series includes several embedded processors; Motorola and IBM offer superscalar versions of the PowerPC

Embedded DSP Processors
• Embedded DSP processors are optimized to perform DSP algorithms such as speech coding, filtering, convolution, fast Fourier transforms, and discrete cosine transforms

Embedded DSP Processors: example
• The AT&T DSP-16 was the first DSP
• It had an onboard multiplier and provided a multiply-accumulate instruction, dest = src1*src2 + src3, a common operation in digital signal processing
• It was based on a Harvard architecture, with separate data and instruction memories
• Data accesses could rely on consistent bandwidth from the memory, which is particularly important for sampled-data systems
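
To show why multiply-accumulate is so common in signal processing, the sketch below builds a direct-form FIR filter out of repeated `dest = src1*src2 + src3` steps. The `mac` and `fir` helpers are illustrative names; this models the arithmetic only, not any particular DSP.

```python
def mac(src1, src2, src3):
    """Multiply-accumulate: dest = src1*src2 + src3."""
    return src1 * src2 + src3


def fir(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:  # samples before the start of the input count as zero
                acc = mac(c, samples[n - k], acc)  # one MAC per filter tap
        output.append(acc)
    return output


# Two-tap filter with both coefficients equal to 1: each output is the sum
# of the current sample and the previous one.
filtered = fir([1, 2, 3, 4], [1, 1])
```

Every output sample costs one MAC per filter tap, so a single-instruction MAC directly sets the filter's throughput.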

Parallelism extraction
Static
• Use the compiler to analyze the program
• Simpler CPU
• Cannot depend on data values
• Example: very long instruction word (VLIW)
Dynamic
• Use hardware to identify opportunities
• More complex CPU
• Can make use of data values
• Example: superscalar

Very Long Instruction Word (VLIW)
• VLIW processors are in widespread use in embedded systems
• They provide instruction-level parallelism with relatively low hardware overhead
• The execution unit includes a pool of function units connected to a large register file
• The execution unit reads a packet of instructions; each instruction in the packet can control one of the function units in the machine

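Simple VLIW architecture
• A large register file feeds multiple function units
[Figure: an execution box (E box) in which a register file feeds two ALUs, two load/store units, and one further function unit (FU), driven by the instruction packet Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP.]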
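
The packet from the figure can be run on a toy model. This sketch assumes that the first register named is the destination (so `Add r1,r2,r3` means r1 = r2 + r3), that `St r8,baz` stores r8 to `baz`, and that all slots read the register file before any slot writes it; the slides do not specify these details.

```python
def execute_packet(packet, regs, memory):
    """Execute one VLIW packet; every slot drives its own function unit."""
    reg_writes, mem_writes = {}, {}
    # Phase 1: all slots read operands from the same register-file snapshot.
    for slot in packet:
        op = slot[0]
        if op == "add":
            _, rd, ra, rb = slot
            reg_writes[rd] = regs[ra] + regs[rb]
        elif op == "sub":
            _, rd, ra, rb = slot
            reg_writes[rd] = regs[ra] - regs[rb]
        elif op == "ld":
            _, rd, addr = slot
            reg_writes[rd] = memory[addr]
        elif op == "st":
            _, rs, addr = slot
            mem_writes[addr] = regs[rs]
        elif op != "nop":
            raise ValueError(f"unknown operation: {op}")
    # Phase 2: commit all results together at the end of the cycle.
    regs.update(reg_writes)
    memory.update(mem_writes)


packet = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r5", "r6"),
    ("ld", "r7", "foo"),
    ("st", "r8", "baz"),
    ("nop",),
]
```

The compiler, not the hardware, is responsible for making sure the slots of a packet do not conflict, which is what keeps the hardware overhead low.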

Clustered VLIW architecture
• The register file and function units are divided into clusters
[Figure: two clusters, each with its own register file and execution unit, connected by a cluster bus.]

Very Long Instruction Word (VLIW)
Example 1: the Trimedia family of processors
• Designed for use in video systems
• Video algorithms often perform similar operations on several pixels at a time

Very Long Instruction Word (VLIW)
Example 2: Texas Instruments C6x VLIW DSP

Very Long Instruction Word (VLIW)
Example 2: Texas Instruments C6x VLIW DSP
• Onboard program RAM and data RAM, as well as standard devices and DMA
• The processor core includes two clusters, each with the same configuration
• Each register file holds 16 words
• Each data path has eight function units: two load units, two store units, two data address units, and two register file cross paths

Superscalar Processors
• Superscalar processors issue more than one instruction per clock cycle
• Unlike VLIW processors, they check for resource conflicts on the fly to determine which combinations of instructions can be issued at each step
• Superscalar processors are not as common in the embedded world, but they are used to some extent in embedded processors
• The embedded Pentium is a two-issue, in-order processor
• Some PowerPCs are superscalar
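
The on-the-fly conflict check can be sketched for a hypothetical two-issue, in-order machine. This model assumes one function unit of each kind and represents an instruction as a (unit, destination, sources) tuple; both are simplifications made for illustration and do not describe any real processor.

```python
def can_dual_issue(first, second):
    """Decide at run time whether `second` may issue alongside `first`."""
    unit1, dest1, _ = first
    unit2, dest2, srcs2 = second
    if unit1 == unit2:
        return False  # resource conflict: both need the same function unit
    if dest1 is not None and dest1 in srcs2:
        return False  # `second` reads a result that `first` has not produced yet
    if dest1 is not None and dest1 == dest2:
        return False  # both write the same register
    return True


def issue_in_order(program):
    """Group instruction indices into issue cycles, at most two per cycle."""
    cycles, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and can_dual_issue(program[i], program[i + 1]):
            cycles.append([i, i + 1])
            i += 2
        else:
            cycles.append([i])
            i += 1
    return cycles


program = [
    ("alu", "r1", ("r2", "r3")),
    ("mem", "r4", ("r5",)),
    ("alu", "r6", ("r1",)),
    ("mem", "r7", ("r6",)),  # depends on the previous instruction's result
]
```

A VLIW compiler would make the same pairing decisions ahead of time; here the hardware repeats the check every cycle, which costs logic and power.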
