The Trimedia CPU64 VLIW Media Processor

The Trimedia CPU64 VLIWMedia Processor Kees Vissers Philips Research Visiting Industrial Fellow vissers@eecs.berkeley.edu

Outline • Introduction • Application Domain • Processor Architecture • Retargetable compiler • Design Space Exploration • Results

SDRAM Serial I/O video-in PCI bridge video-out timers I2C I/O audio-out I$ VLIW cpu audio-in D$ Design Problem

Application Domain • High volume consumer electronics productsfuture TV, home theatre, set-top box, etc. • Embedded core • Media processing:audio, video, graphics, communication

Media Processing CPU Design considerations: • Cost (embedded in high volume products) • Performance for the application domain • Ease of use (programming model)  Benchmark suite needed

Application: Natural Motion D/2 n-1 -D/2 n-1/2 picture number n

Machine Description BenchmarkApplications Compiler Simulator Performance Numbers Project Approach: Y-chart

Benchmark Suite Characteristics • Each application: typical for a class of applications within application domain • The set covers a significant area of the application domain • Each benchmark is sufficiently well tuned to the architecture to measure its performance

The Benchmark Suite

Processing Characteristics • Signal ratesaudio, video (at block and pixel rate) • Basic data typesbyte (8), half-words (16), words (32), float (32) • Data access patternssample stream, bitstream, random access • Data independent and dependent load • Control processing and signal processing

Algorithm Reference C TM1000 optimized CPU64 optimized knowledge simulation IC exploration video video Application Development

Initial Design Considerations • Goal: 6-8 times TM1000 performance • Standard ANSI-C, reuse of code • Utilize instruction and data parallelism • Limited complexity (embedded core) • Compatibility through recompilation • VLIW architecture

Instruction Set Considerations • A machine operation must be sufficiently generic within the application domain • Sufficiently powerful operations • Limited number of operations • Consistency and orthogonality

7 7 TM1000 cycles x 10 CPU64 cycles x 10 Results: Natural Motion Averagegain: 2.7x(cycles) 15 15 UPC 10 10 SEG SEG instructions 5 5 MED UPC nops SAD SUB cache stalls MED SUB SAD 0 0 0 20 40 60 80 100% 0 20 40 60 80 100%

Natural Motion Dynamic Load 150 cpu load (%) tm1000 @ 125MHz 100 90% 50 cpu64 @ 300MHz 16% 0 10 20 30 40 50 60 70 80 90 100 110 Field number

Summary • Application domain: • Media processing for CE industry • Benchmark suite: • Targeted to application domain • Optimized for class of processors • Future products: • Gradual shift to software implementations of signal processing functions

Application Target • Processor core, to be embedded in different ICs and products • Real time processing of media streams • Cost-sensitive consumer electronics market

SDRAM Serial I/O video-in PCI bridge video-out timers I2C I/O audio-out I$ VLIW cpu audio-in D$ Embedded Application

Performance Target Relative to TriMedia TM1000 (5-slot VLIW,100MHz, 32-bit datapath, media operations): • 6x to 8x performance increase,to process more or higher-resolution video streams • Not more than double transistor count

Efficiency Good performance/silicon cost ratio by: • Optimize CPU architecture and media benchmark source code towards each other • Solve resource conflicts at compile time

Outline • Introduction • VLIW architecture • Instruction set • IDCT example • Results

Architectural Speedup Not simply increasing the VLIW issue width: • Diminishing gain of compiler-generated ILP • Increasing implementation complexity(area, timing)

Architectural Speedup • Extended SIMD: uniform 64-bit design,vectors of 1-, 2-, and 4-byte elements(data throughput per cycle) • New, extensive, media instruction set(functionality per cycle) • Improved cache control(prefetch, alloc)

32-bit peripheral bus 64-bit memory bus multi-port 128 words x 64 bits register file bypass network datacache16KB mmu mmu FU FU FU FU FU exceptions PC VLIW instruction decode and launch instruction cache 32 KB mmu CPU64 Architecture

Code Example int a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; RISC instructions: VLIW instructions (5 issueslots): 1 x = *a 1 x = *a y = *b i + = 1 a += 1 b += 12 y = *b 2 g = i<n nop nop nop nop3 i += 1 3 jmpf g 7 jmpt g 1 nop nop nop4 g = i<n 4 z = x+y nop nop nop nop5 z = x+y 5 *c = z c+=1 nop nop nop6 *c = z 6 nop nop nop nop nop7 jmpf g 128 jmpt g 19 a += 110 b += 111 c += 1

VLIW challenges • Achieve high Instruction Level Parallelism • Compiler needs to know the pipeline for instructions • Code size • Design of a multi-ported register file. • Design of the forwarding network

Instruction Level Parallelism • Loop unrolling, software pipelining • Guarded execution • Avoid branches • if conversion, • hand rewrite source!, valid in Embedded applications, done for TI, Philips and Apple software (Adobe fotoshop). • Speculate branches

Compiler • Detect data dependency at compile time: • examples: c[i]=a[i]+b[i]; potential dependency d[i]=a[i]+c[j]; c[i] might be c[j] c[1]=a[i]+b[i]; no dependency d[i]=a[i]+c[2]; c[1] is never c[2]

Compiler • Schedule the instructions in the proper slot at the proper time, e.g.: • load and stores only in slot 1,2 • adds in all slots • floating point multiply only in slot 3 • add take 1 cycle, mult takes 2 cycles • 3 branch delay slots

Code size • Large number of registers cost code size • add r1 r3 r127 • Large number of media instructions: code size • Use hardware compression techniques to remove nops • Variable length instruction format! • Vector instructions: SIMD, great for data processing performance

Code Example char a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; : VLIW instructions (5 issueslots): 1 x = *a y = *b i + = 1 a += 1 b += 1 2 g = i<n nop nop nop nop 3 jmpf g 7 jmpt g 1 nop nop nop 4 z = x+y nop nop nop nop 5 *c = z c+=1 nop nop nop 6 nop nop nop nop nop

Code Example Vector vec64sb a[n/8], b[n/8], c[n/8]; int i; for (i=0; i</8n; i++) c[i]=sb_add(a[i],b[i]); : VLIW instructions and SIMD (5 issueslots): 1 x = ld64b a y =ld64b b i + = 1 a += 8 b += 8 2 g = i<fraction n nop nop nop nop 3 jmpf g 7 jmpt g 1 nop nop nop 4 z = add64sb x y nop nop nop nop 5 st64b c z c+=1 nop nop nop 6 nop nop nop nop nop

x7 x6 x5 x4 x3 x2 x1 x0 y7 y6 y5 y4 y3 y2 y1 y0 - - - - - - - - |.| |.| |.| |.| |.| |.| |.| |.| S 0 z Vector Instruction Example Sum of absolute differences (“ub_me”)

Application Code Example int calc_sad(vec64ub *prv, vec64ub *cur, int s) { vec64ub left, right; int i; int sad = 0; for (i=0; i<8; i++) { left = prv[i*s]; right = cur[i*s]; sad += ub_me(left, right); } return(sad); }

Architecture VLIW+SIMD • Issues a 5-slot instruction every cycle • Each slot supports a selection of FUs • All FUs support vectorized (SIMD) data • Double-slot FU allows powerful multi-argument, multi-result operations • All FUs are pipelined, latency 1 to 4(except floating point divide and sqrt)

Instruction Format Per operation, per slot: • Up to 2 arguments, up to 1 result • Optionally an extra register for guarding (conditional execution) • Immediate argument size can be 32 bit Instructions have compressed variable length format, decompressed during decode

Operations Intends to cover combinations of: • 1-, 2-, 4-byte elements in an 8-byte vector • Signed or unsigned type • Clipped versus wrap-around arithmetic • Set of basic functions(ld, st, add, mul, compare, shift, ...)

Instruction Set

1 1 1 1 + + + + Example: msp-multiply Each argument: 4-way 16-bit int Multiply to internal double precision integer Round lower half into upper half Return upper half

a b c d e f g h i j k l m n o p a e i m b f j n Example: Transpose 2-slot ‘super-op’:takes 4 arguments, shuffles bytes,produces 2 results (2-dimensional filtering)

+ + + + + + + + + + + + + + + + Example: 4-way average add with extended precision 1 1 round (MPEG 2-dimensional half-pixel average)

Branches • Branch operations have 3 branch delay slots • Compiler & scheduler try to fill these(profiling, loop unrolling, inlining, guarding) • Branch units are properly pipelined • Up to 3 branch ops in 1 instruction • No branch prediction hardware • Branches are preferred moments for interrupt servicing

2D-IDCT Example • IDCT code was tested early in the project, also for studying the programming model • Mapped through our experimental C compiler and instruction scheduler • Operates entirely on (vectors of) 16-bit data elements • Simulation showed IEEE 1180 accuracy compliancy

OK OK OK OK 2D-IDCT Instructions Execution time in cycles, excluding data cache misses VLIW: CPU64, TM1000, TI 320C60 others superscalar = accuracy compliant

Status Now in construction at Philips Semiconductors, Sunnyvale • 7M transistors (target) • 300Mhz clock (target) • 0.18 technology • 1.8 volt

Conclusion Processor Architecture • The 5-slot 64-bit VLIW provides a powerful architecture for media processing • Performance gain of 6x to 8x on TM1000 • Rich and regular media instruction set • Powerful multi-argument, multi-result ‘super-op’ naturally fits VLIW architecture

App 1 Exe 1A Exe 1B App 2 Exe 2A Exe 2B App 3 Exe 3A Exe 3B Traditional Compilation Compiler A Machine A Compiler B Machine B

App 1 Exe 1A Exe 1B App 2 Exe 2A Exe 2B App 3 Exe 3A Exe 3B Retargetable Compilation Machine ADescription Machine A Compiler Machine B Machine BDescription

The Trimedia CPU64 VLIW Media Processor