Ch. 8: Multiprocessor architectures with a single bus
[Figure: an embedded CPU with I$ and D$, processors P_1 to P_5 and an SDRAM, all connected to a single bus]
• single external memory for cost reasons
• P1…5: programmable DSPs, ADSPs or ASPs (with local programs)
Embedded MM Systems on Silicon-8, J. van Meerbergen
Example: TCP chip (TV controller)
• PR3930 + peripherals
• Gfx, SDRAM controller
• Serial interconnect bus
• I2C, UART, timers
• PI bus architecture
• 80 mm²
• 352 pins
• 0.35 micron process
• 48 MHz (96 MHz for gfx)
Advantages and Disadvantages
• Advantages
  • task level parallelism
  • efficient solutions
  • processors can be optimised for specific tasks
  • reuse of IP blocks (Intellectual Property)
  • standard bus interfaces (PI bus)
  • simple solution (KISS heuristic)
  • off-chip memory in an optimised memory process
• Disadvantage
  • bandwidth to external memory
  • bus + memory interface are a central bottleneck
Outline
• memories: trend towards SDRAM
• internal busses: PI bus as an example
• communication protocol
• example: H263 application
[Figure: DRAM block diagram. An address buffer feeds the row address decoder, whose row drivers drive the word lines (rows), and the column address decoder and data multiplexer. The data/bit lines BL (columns) end in sense amplifiers. A clock generator takes RAS, CAS and W/R; data passes through the data I/O block.]
Access sequence:
• precharge BL
• decode row
• discharge BL
• sense amp
• column select
• data out
[Figure: DRAM timing diagrams showing RAS, CAS, address and data out]
(a) random access mode: every access supplies a row and a column address; the diagram marks the random access time and the access cycle time.
(b) page mode: one row address is followed by column addresses c1, c2, c3, which return d1, d2, d3; the diagram marks the column address cycle time.
(c) static column mode: one row address r is followed by column addresses c1 to c7, which return d1 to d7.
[Figure: timing diagram showing RAS, CAS, address and data out]
Synchronous DRAM (SDRAM)
• introduce a clock (asynchronous => synchronous)
• pipelining
• CAS latency 1/2/3 for 33/66/99 MHz
• burst length = 1/2/4/8/full page
[Figure: read timing over clock cycles 1 to 15, with rows clk, cmnd, bank, addr and data. Commands ACT, READ and PRE carry ROW and COL addresses; data da1, db1 to db4 and dc1 appear on the data bus, which is Hi-Z in between.]
Synchronous DRAM (SDRAM)
[Figure: write timing over clock cycles 1 to 15, with rows clk, cmnd, bank, addr and data. Commands ACT, WRIT and PRE carry ROW and COL addresses; data da1, db1 to db4 and dc1 to dc3 appear on the data bus, which is Hi-Z in between.]
• Final goal = 1 access each clock cycle
• banking
• burst length large enough (e.g. 8)
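To illustrate why banking and a sufficiently long burst both matter for reaching one access per clock cycle, here is a rough sketch. The model and the value of `overhead_cycles` are assumptions made for illustration, not figures from the slides: each access is taken to cost a fixed overhead (activate, CAS latency, precharge), which banking is assumed to overlap with the burst in progress.

```python
def bus_utilisation(burst_length, overhead_cycles, overlap_banks):
    """Fraction of clock cycles that carry data, under a deliberately simple model.

    overhead_cycles is a hypothetical per-access cost (activate, CAS latency,
    precharge); it is not a figure from the slides.
    """
    if overlap_banks:
        # With banking, the overhead of the next access (in another bank)
        # is assumed to overlap with the current burst.
        period = max(burst_length, overhead_cycles)
    else:
        # Without banking, overhead and burst are strictly sequential.
        period = burst_length + overhead_cycles
    return burst_length / period


if __name__ == "__main__":
    for burst in (1, 2, 4, 8):
        print(burst,
              round(bus_utilisation(burst, 6, overlap_banks=False), 2),
              round(bus_utilisation(burst, 6, overlap_banks=True), 2))
```

In this model the data bus is fully used only when banks overlap and the burst is at least as long as the overhead, which is consistent with the two measures listed on the slide.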
Synchronous DRAM (SDRAM)
[Figure: read timing over clock cycles 1 to 15, with rows clk, cmnd, bank, addr and data. Commands ACT, READ and PRE carry ROW and COL addresses; data da1, db1 to db8 and dc1 appear on the data bus; the CAS latency is marked.]
Synchronous DRAM (SDRAM)
[Figure: read timing over clock cycles 1 to 15, with rows clk, cmnd, bank, addr and data. Commands ACT, READ and PRE carry ROW and COL addresses; data da1, db1 to db8 and dc1 to dc3 appear on the data bus.]
[Figure: layout of one row across two banks: columns 0 to 7 in Bank A, columns 8 to 15 in Bank B]
PI Bus (Peripheral Interconnect)
• example of an on-chip bus
• goals: low cost (parametrisable), medium performance, simple protocol
• originally developed within a European OMI project
• further developments: PI, AMBA (ARM)
PI Bus Features
• (single edge) clock synchronous operation
• separate, scalable address and data busses
• multiple bus masters
• flexible bus ownership arbitration scheme; default grant mechanism
• chaining of bus operations
• wide range of bus operations: byte, halfword, word, 2|4|8|16 word block
• pipelining of bus operations
• peak transfer rate during chained block transfers: 1 data object / clock cycle (e.g. 200 Mbyte/s @ 50 MHz, 32 bit)
• bus deadlock prevention (timeout)
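The peak transfer rate quoted above follows directly from one data object per clock cycle. A minimal check of that arithmetic (the function name is only illustrative):

```python
def peak_transfer_rate_mbyte_s(clock_mhz, data_width_bits):
    """Peak rate in Mbyte/s, assuming one data object is transferred every clock cycle."""
    bytes_per_object = data_width_bits / 8
    return clock_mhz * bytes_per_object


if __name__ == "__main__":
    # 50 MHz clock, 32-bit data bus, as in the slide's example
    print(peak_transfer_rate_mbyte_s(50, 32))
```

This rate applies only during chained block transfers; address cycles and wait states lower the sustained rate.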
Components (1)
• Bus Master Agent(s)
  • A bus agent that initiates communication via PI-Bus. It issues bus operations once bus ownership has been granted.
• Bus Slave Agent(s)
  • A bus agent that responds to bus operations on PI-Bus when it is selected as the target of a bus operation.
• Bus Cache Agent(s)
  • optional
  • A bus agent including a cache memory that keeps track of PI-Bus communication between a bus master and a bus slave. It takes appropriate measures to ensure consistency of its local cache contents with the memory addressed by the bus operation.
The same block can be master, slave and cache.
Minimum configuration: 1 slave and 1 master
Components (2)
• Bus Control Unit
  • Bus control components that are required for PI-Bus functionality:
  • arbitration and assignment of bus ownership to bus masters (system specific!)
  • selection of bus slaves (system specific!)
  • generation of “bus error” on an address that is not mapped to a bus slave
  • generation of “timeout” on deadlock
  • initiation of memory coherency protocol (system specific!)
Components in a PI-Bus System
[Figure: masters, slaves and caches attached to a PI-Bus with one Bus Control Unit (BCU)]
• “Unlimited” number of bus masters, bus slaves and bus caches
• BCU is system specific
• Master, slave and cache functionality can be combined
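[Figure: PI-Bus signals between bus master, bus slave and Bus Control Unit (BCU). Master and slave share the data bus D[m:0]; the master drives OPC[4:0], READ, A[n:2] and LOCK; the slave returns ACK[2:0]; the master sends REQx to the BCU and receives GNTx; the BCU drives SELy to the slave and generates TOUT; CLK and RESETN go to all three blocks.]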
1. Phase 1: the master sends REQx to the BCU; the BCU returns GNTx to the master.
2. Phase 2: SELy goes to the slave; the master sends A[n:2], OPC[4:0] and READ.
3. Phase 3: data transfer; the slave sends ACK[2:0].
Bus Operation
[Figure: timing over clock cycles 1 to 6, with signal rows CLK; OPC, LOCK, A, READ; SEL; Dread; ACK, Dwrite. An address cycle is followed by two data cycles: ACK=WAT in the first, ACK=RDY in the second. Legend: driven; undriven, previous logic state weakly held.]
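The three phases and the ACK=WAT / ACK=RDY sequence can be traced with a small sketch. This is a simplified illustration, not the PI-Bus specification: it treats arbitration as its own step, uses only the two ACK values that appear in the timing diagram, and ignores pipelining, LOCK and timeouts. The function name and the trace strings are hypothetical.

```python
def bus_operation(wait_states):
    """Return a step-by-step trace of one simplified PI-Bus-style operation.

    wait_states: number of data cycles in which the slave answers ACK=WAT
    before it completes the transfer with ACK=RDY (assumed parameter).
    """
    trace = [
        "arbitration: master asserts REQx, BCU answers GNTx",
        "address cycle: BCU asserts SELy, master drives A, OPC, READ",
    ]
    # The slave stretches the data phase with wait acknowledgements.
    trace += ["data cycle: ACK=WAT"] * wait_states
    # The final data cycle completes the transfer.
    trace.append("data cycle: ACK=RDY")
    return trace


if __name__ == "__main__":
    for step in bus_operation(wait_states=1):
        print(step)
```

With `wait_states=1` the trace ends in one address cycle followed by two data cycles, matching the single bus operation shown in the timing diagram.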
2 Bus Operations
[Figure: timing over clock cycles 1 to 6, with signal rows CLK; OPC, LOCK, A, READ; SEL; ACK, D. Op1 is issued with LOCK=1 and Op2 with LOCK=0. Cycle sequence: address cycle, address/data cycle, data cycle, address/data cycle. Acknowledges: ACK=WAT, ACK=RDY, ACK=RDY.]
Complexity (0.35 micron):
• slave = 0.5 Kgates = 0.05 sq. mm
• master = 2 Kgates = 0.2 sq. mm
Clock speed: limited by coupling capacitance and increased resistance
PI-Bus Hierarchy
[Figure: two PI-Bus segments, each with its own Bus Control Unit, joined by a bridge. Attached blocks: processors, memory (interfaces), peripherals, system functions.]
PI bus interrupt architecture
[Figure: interrupt sources IS_1, IS_2 and IS_3 each have their own request and acknowledge lines (request_1 to request_3, acknowledge_1 to acknowledge_3) to the interrupt controller (IC). The IC sends an interrupt request to the processor, which reads the interrupt vector (readint_vector) over the PI-bus.]
[Figure: the same interrupt architecture with a request chain and an acknowledge chain running through IS_1, IS_2 and IS_3 to the interrupt controller (IC). The IC sends an interrupt request to the processor, which reads the interrupt vector (readint_vector) over the PI-bus.]
Difference between on-chip and off-chip busses
Communication protocol
[Figure: CPU, processors C_1, C_2 ... C_20 and memory, with communication steps numbered 1 to 4]
Processors communicate via SDRAM under control of the CPU
• from CPU to processor: memory mapped communication
• from processor to CPU: interrupt
Communication protocol: discussion
• Advantage: simple solution
• Disadvantages:
  • CPU overloaded with synchronisation requests (the rate must typically be kept below 1 kHz)
  • consequence: the grain size of tasks must be sufficiently large (line = 16 kHz, stripe = 2 kHz, frame = 50 Hz)
  • halves the bandwidth (every item is written to SDRAM and read back from it)
  • mapping problems: central resource limits scalability, especially with real-time constraints
• Alternative: polling instead of interrupt (= active wait)
  • Disadvantage: keeps CPU busy
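The grain-size argument can be checked against the 1 kHz guideline. The sketch below assumes that each grain size produces one synchronisation request per line, stripe or frame at the rates quoted above; the names are illustrative.

```python
# Synchronisation rates per grain size, taken from the slide (in Hz).
SYNC_RATES_HZ = {"line": 16_000, "stripe": 2_000, "frame": 50}

# Typical upper bound on the CPU's synchronisation rate, from the slide (in Hz).
LIMIT_HZ = 1_000


def acceptable_grains(rates=SYNC_RATES_HZ, limit_hz=LIMIT_HZ):
    """Return the grain sizes whose synchronisation rate stays below the limit."""
    return [grain for grain, rate in rates.items() if rate < limit_hz]


if __name__ == "__main__":
    print(acceptable_grains())
```

Under this reading only frame-level synchronisation stays below the limit for a single task, which is why the task grain must be large.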
Given:
• 225 MHz bus + TM core
• 64 KB I$, cache line size = 64 B, latency = 70 cycles
• ISR trashes 10% of the I$
• pSOS send + task switch + receive = 6000 cycles
• 20 video tasks, frame rate 60 Hz
Question: how much CPU time (in %) is used for synchronisation?
Latency for 1 task switch:
• pSOS send + task switch + receive = 6000 cycles
• cache trashing: 10% of 64 KB = 6.4 KB = 100 cache lines; 100 * 70 cycles = 7000 cycles
• sum = 13 000 cycles = 58 us
Total: 20 video tasks * 60 Hz = 1200 switches per sec; 1200 * 58 us = 69 msec per second = 0.069 = 7% of CPU load
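The calculation can be reproduced in a few lines. The constants are the figures from the slide, including its rounding of 10% of the I$ to 100 cache lines; the function and variable names are illustrative.

```python
CLOCK_HZ = 225e6          # 225 MHz bus + TM core
RTOS_CYCLES = 6000        # pSOS send + task switch + receive
TRASHED_LINES = 100       # 10% of a 64 KB I$ with 64 B lines, as rounded on the slide
MISS_LATENCY = 70         # cycles to refill one cache line


def sync_load(tasks=20, frame_rate_hz=60):
    """Return (cycles per switch, seconds per switch, fraction of CPU time)."""
    cycles = RTOS_CYCLES + TRASHED_LINES * MISS_LATENCY
    seconds = cycles / CLOCK_HZ
    switches_per_second = tasks * frame_rate_hz
    return cycles, seconds, switches_per_second * seconds


if __name__ == "__main__":
    cycles, seconds, load = sync_load()
    print(cycles, round(seconds * 1e6), "us", round(load * 100), "%")
```

The load scales linearly with the number of tasks and with the synchronisation rate, so a finer grain (stripe or line instead of frame) would multiply it accordingly.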
Example: H.263 video encoder
[Figure: encoder block diagram. Forward path: in, subtractor, DCT, Q, VLC, out. Feedback path: IQ, IDCT, adder, frame store, motion est. and Pred.]
• SQCIF, 96*128 px
• 10 Hz
• 100 mW
[Figure: PR3940 with I$, D$ and memory]
10 Hz => 140 MHz CPU
[Figure: embedded CPU with I$ and D$, SDRAM, video in, video out and three blocks: Encode (predict, DCT, Q), Decode (IQ, IDCT, recon) and SAD]
20 Mips, 80 mW