Ch. 5: Application domain specific processors (ADSP or ASIP)
Embedded MM Systems on Silicon-5 J. van Meerbergen
[Figure: flexibility versus efficiency for DSP implementations: programmable CPU, programmable DSP, application domain specific processor, application specific processor]
Application domain specific processors (ADSP or ASIP)
[Figure: mapping from application domain to implementation for a general-purpose (GP) processor and for an ADSP]
• takes a well defined application domain as a starting point
• exploits characteristics of the domain (computation kernels)
• still programmable within the domain
• e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc.
GP versus ADSP
• performance: clock speed + ILP (GP) vs. ILP + tuning to domain (ADSP)
• flexible dev. (new apps.) (GP) vs. cost effective (high volume) (ADSP)
• problems: specification; design time and effort; manual design, large effort => synthesized cores
www.adelantetech.com
Outline
• design process
• retargetable code generation (problem statement)
• ADSP/VLIW architectures (Mistral 2 / A|RT designer)
• instructive demo (Adelante)
• application examples
• low power aspects (Mistral 2 / A|RT designer)
• discussion
• conclusion
Design process
• inputs: processor model (e.g. VLIW with shared RFs), application(s), instance parameters
• 3 phases
  1. exploration
  2. hw design (layout) + processing
  3. design appl. sw
• SW (code generation) gives estimations: cycles/alg, occupation
• HW design gives estimations: nsec/cycle, area, power/instr
• fast, accurate and early feedback
[Figure: flow chart of the exploration loop with the decision points "OK?" and "more appl.?", ending in "go to phase 2"]
Problem statement
• A compiler is retargetable if it can generate code for a ‘new’ processor architecture specified in a machine description file.
• A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP.
  e.g. a := b + c | instr = xxxx0101
• GRTPs contain all inter-RT-conflict information.
• Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
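The GRTP definition can be made concrete with a small data structure. The sketch below is only an illustration: the class name `GRTP`, the helpers `compatible` and `merge`, and the two extra patterns are hypothetical, and it assumes that two register transfers conflict exactly when their guards demand different values for the same instruction bit ('x' meaning don't care).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern plus the instruction bits that select it."""
    rtp: str    # e.g. "a := b + c"
    guard: str  # one character per instruction bit: '0', '1' or 'x' (don't care)


def compatible(g1: GRTP, g2: GRTP) -> bool:
    """True if no bit position requires '0' in one guard and '1' in the other."""
    if len(g1.guard) != len(g2.guard):
        raise ValueError("guards must have the same width")
    return all(a == b or "x" in (a, b) for a, b in zip(g1.guard, g2.guard))


def merge(g1: GRTP, g2: GRTP) -> str:
    """Instruction bits that enable both (compatible) transfers at once."""
    if not compatible(g1, g2):
        raise ValueError("conflicting register transfers")
    return "".join(b if a == "x" else a for a, b in zip(g1.guard, g2.guard))


# The first pattern is the one from the slide; the other two are hypothetical.
add = GRTP("a := b + c", "xxxx0101")
move = GRTP("d := e", "01xxxxxx")
sub = GRTP("a := b - c", "xxxx0110")

print(compatible(add, move))  # the guards use disjoint bits
print(compatible(add, sub))   # the guards disagree on the low bits
print(merge(add, move))
```

A code generator could use such a check to decide which register transfers may be packed into the same instruction word.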
Problem statement
[Figure: code generation flow. Algorithm spec -> FE -> CDFG; processor spec (instance) -> ISE -> GRTP; CDFG and GRTP -> code generation -> machine code. Annotation: "in ch 4 this is part of the code generator"]
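Example: Simple processor [Leupers]
[Figure: simple processor with input port (Inp), RAM, register (REG), output port (outp), program counter (PC, +1) and instruction memory (IM). The instruction word I.(20:0) provides the control fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0)]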
Example: Simple processor [Leupers]
ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
• Differences with VLIW processors of ch. 4
  1. parallel FUs
    • ASUs = complex application-specific FUs (beyond subword parallelism)
    • e.g. biquad, median, DCT etc.
    • larger grain size, more heterogeneous, more pipelines
  2. register files
    • many register files (>5 vs. 1 or 2)
    • limited number of ports (3 vs. 15)
    • limited size (<16 vs. 128)
  3. issue slots
    • all in parallel vs. 5
[Figure: VLIW datapath with register files RF1 to RF8, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control]
ASIP/VLIW architectures
• Additional characteristics of the A|RT designer template
  • interconnect network: busses + input multiplexers
    • mux control is part of the instruction
    • control can change every clock cycle
    • network can be incomplete
    • busses can be merged
  • memories are modeled as FUs
    • separate data in and data out
    • 2 inputs (data in and address) and 1 output
  • each FU can generate one or more flags
  • instruction format (per issue slot)
[Figure: instruction fields per issue slot: read address RF 1, write address RF 1, mux 1, read address RF 2, write address RF 2, mux 2, FU control, output drivers]
ASIP/VLIW architectures: example
[Figure: example datapath with RF1, RF2, RF3, RF4, an ALU and a MAC on bus1 and bus2. Instruction word fields: mux 2, mux 3, read RF1, write RF1, read RF2, write RF2, ALU instr., read RF3, write RF3, read RF4, write RF4, MAC instr.; bit positions 19, 10, 9 and 0 are marked]
ASIP/VLIW architectures: example
ASIP/VLIW architectures: design flow
[Figure: flow chart. Algorithm spec -> datapath synthesis -> RTs -> controller synthesis -> estimations (area, power, timing) -> OK? If no, change pragmas and repeat]
• example pragmas
  assign ( a+b, ALU, fu_alu1)
  assign ( a+_, ALU, fu_alu2)
  assign ( _+_, ALU, fu_alu3)
• example RT
  RF1 : x = RF2 : y, RF3 : z | ALU = ADD Inmux = bus2
• VLIW makes relatively simple code selection possible
ASIP/VLIW architectures: list scheduling
[Figure: list scheduling of a data-flow graph of multiplications (*) and additions (+), numbered 0 to 10, on the resources IPB, MULT, ALU and OPB. Per step: candidate list, conflict and priority computation, scheduled operation]
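The scheduling steps named on this slide (candidate list, priority computation, conflict check, scheduled operation) can be sketched as follows. This is a generic textbook list scheduler, not the algorithm of any particular tool; it assumes unit latency, uses the longest path to a sink as priority, and the operation names and the small graph are hypothetical.

```python
def list_schedule(ops, deps, units):
    """ops: {name: unit type}; deps: {name: set of predecessor names};
    units: {unit type: number of instances}. Returns {name: cycle}.
    Assumes an acyclic graph and unit latency for every operation."""
    for unit in ops.values():
        if units.get(unit, 0) < 1:
            raise ValueError(f"no unit of type {unit!r} available")

    succs = {n: [] for n in ops}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)

    # Priority = length of the longest path from the operation to a sink.
    prio = {}

    def height(n):
        if n not in prio:
            prio[n] = 1 + max((height(s) for s in succs[n]), default=0)
        return prio[n]

    for n in ops:
        height(n)

    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        # Candidates: unscheduled operations whose inputs are already computed.
        ready = [n for n in ops if n not in schedule
                 and all(p in schedule and schedule[p] < cycle
                         for p in deps.get(n, ()))]
        ready.sort(key=lambda n: (-prio[n], n))
        free = dict(units)
        for n in ready:
            # Conflict check: is a unit of the right type still free this cycle?
            if free[ops[n]] > 0:
                free[ops[n]] -= 1
                schedule[n] = cycle
        cycle += 1
    return schedule


# Hypothetical graph: a1 = m1 + m2, a2 = a1 + m3, with one MULT and one ALU.
ops = {"m1": "MULT", "m2": "MULT", "m3": "MULT", "a1": "ALU", "a2": "ALU"}
deps = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
schedule = list_schedule(ops, deps, {"MULT": 1, "ALU": 1})
print(schedule)
```

With a single multiplier the three multiplications are serialized, and the lower-priority m3 is delayed until the cycle in which a1 occupies the ALU.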
ASIP/VLIW architectures: feedback
Application examples: adaptive filter
Minimizes the difference between x and e (reference signal)
[Figure: block diagram with input x, filter (coefficients c0, c1, ..., c63), output y, subtractor (-), signals r and e, and a control unit]
• Many applications are possible
  • echo cancelling for TV
    • e = flyback signal (known without echoes)
  • automatic equalization of cables in data transmission
  • acoustic echo cancelling
Application examples: adaptive filter
[Figure: acoustic example. Labels: speech, x, speaker, filter (c0, c1, ..., c63), y, microphone, r (speech + noise), subtractor (-), control unit, e, noise]
Application examples: adaptive filter
[Figure: hearing aid example. Labels: noise (e.g. radio), x, filter (c0, c1, ..., c63), y, r (speech + noise), subtractor (-), control unit, e, speech]
Application examples: adaptive filter
[Figure: 64-tap filter structure. A delay line (Z-1) provides x[n], x[n-1], ..., x[n-i], ..., x[n-63]; each tap i has a coefficient ci, an adaptation block Ai and a partial sum Si[n] (S0[n] to S63[n]). Further signals: ê[n], r[n], e[n], the step size mu and t[n]]
Application examples: adaptive filter
[Figure: adaptation block Ai with inputs x[n-i] and t[n], a multiplier (*), an adder (+) and a delay (Z-1) that turns Ci[n] into Ci[n-1]]
Application examples: adaptive filter
[Figure: signal-flow graph of one tap with signals sum[i], t, r, x@i, c[i]@1, w and sum[i+1], two multipliers (*) and two adders (+)]
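The tapped delay line, the per-tap coefficient update and the step size mu on these slides resemble a least-mean-squares (LMS) adaptive filter. The sketch below assumes that reading, i.e. estimate = sum of c[i]*x[n-i], error = r[n] - estimate, t[n] = mu*error and c[i] += t[n]*x[n-i]. The function names and the demo system are hypothetical, and the demo uses 4 taps instead of the 64 on the slides to keep it short.

```python
import random


def lms_step(c, x_hist, r, mu):
    """One sample of an LMS filter.
    c: coefficients; x_hist: [x[n], x[n-1], ...] (same length as c);
    r: reference sample; mu: step size.
    Returns (estimate, error, updated coefficients)."""
    est = sum(ci * xi for ci, xi in zip(c, x_hist))       # filter output
    err = r - est                                         # difference to reference
    t = mu * err                                          # scaled error
    new_c = [ci + t * xi for ci, xi in zip(c, x_hist)]    # per-tap update
    return est, err, new_c


def run_lms(x, r, taps=64, mu=0.01):
    """Run the filter over two equally long sample sequences."""
    c = [0.0] * taps
    hist = [0.0] * taps
    errors = []
    for xn, rn in zip(x, r):
        hist = [xn] + hist[:-1]                           # shift the delay line
        _, err, c = lms_step(c, hist, rn, mu)
        errors.append(err)
    return c, errors


# Demo: identify a hypothetical system r[n] = 0.5*x[n] - 0.25*x[n-1].
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
r = [0.5 * x[n] - 0.25 * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]
coeffs, errors = run_lms(x, r, taps=4, mu=0.1)
print([round(c, 3) for c in coeffs], abs(errors[-1]))
```

Each sample costs one multiply-accumulate for the estimate and one for the update per tap, which is the kind of kernel the implementations on the following slides map onto MULT, ALU, ACU and memory resources.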
Application examples: adaptive filter implementation 1
[Figure: datapath with RAM, ALU, MULT, ACU and ROM on bus1 and bus2]
• 266 clock cycles
• 1.1 mm2
Application examples: adaptive filter implementation 2
[Figure: datapath with RAM, ALU, ACU and ROM on bus1 and bus2]
• 2250 clock cycles
• 0.7 mm2
Application examples: adaptive filter implementation 3
[Figure: datapath with RAM1, ACU1, ALU, MULT, RAM2, ROM and ACU2]
• 202 clock cycles
• 1.4 mm2
[Figure: plot of clock cycles (axis ticks 1000, 2000) versus area in mm2 (axis ticks 1, 2)]
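The three implementations trade clock cycles against area. A quick way to see whether any of them is strictly worse than another is a Pareto check; the sketch below uses the cycle and area figures from the three slides, while the function names are hypothetical.

```python
# (clock cycles, area in mm2) as stated on the implementation slides
impls = {
    "implementation 1": (266, 1.1),
    "implementation 2": (2250, 0.7),
    "implementation 3": (202, 1.4),
}


def dominates(a, b):
    """a dominates b if it is no worse in every metric and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Names of the points that are not dominated by any other point."""
    return sorted(name for name, p in points.items()
                  if not any(dominates(q, p)
                             for other, q in points.items() if other != name))


print(pareto_front(impls))
```

None of the three dominates another: each one is either the smallest, the fastest, or a compromise between the two.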
Low power aspects
• Estimation: area + speed, power
[Figure: Mistral2 with the implementation independent design database, the estimation database and the architecture]
GSM Viterbi decoder: default solution
  EXU        ACTIV   AREA    POWER
  alu_1      96%     3469    46196
  romctrl_1  48%     39      259
  acu_1      26%     327     1209
  ipb_1      5%      131     105
  opb_1      23%     1804    5801
  ctrl               9821    135035
  total              15591   188605
  cycle count: 13750
• controller responsible for 70% of power consumption
• maximum resource-sharing
• heavy decision-making: "main" loop with 16 metrics-computations per iteration
• EXU numbers include registers for local storage
GSM Viterbi decoder: no loop-folding
  EXU        ACTIV   AREA    POWER
  alu_1      92%     3411    45073
  romctrl_1  45%     39      255
  acu_1      25%     294     1087
  ipb_1      5%      107     86
  opb_1      22%     1661    5340
  ctrl               4919    70087
  total              10431   121928
  cycle count: 14247
• area down by 33%
• power down by 35%
• next step: reduce number of program steps with second ALU
GSM Viterbi decoder: 2 ALUs
  EXU        ACTIV   AREA    POWER
  alu_1      69%     1797    12248
  alu_2      65%     1393    8916
  romctrl_1  67%     39      255
  acu_1      37%     294     1087
  ipb_1      8%      149     119
  opb_1      33%     2136    6871
  ctrl               8957    87235
  total              14766   116731
  cycle count: 9739
• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce ASU to reduce ALU load
GSM Viterbi decoder: 1 x ACS-ASU
  func ACS ( M1, M2, d ) MS, MS8 =
  begin
    MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
    MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
  end;
  EXU        ACTIV   AREA    POWER
  alu_1      20%     261     105
  acs_asu_1  83%     2382    3816
  or_asu_1   10%     611     122
  romctrl_1  16%     65      21
  acu_1      36%     294     205
  ipb_1      20%     107     43
  opb_1      11%     163     35
  ctrl               1864    3597
  total              5747    7944
  cycle count: 1930
• cycle count down 5X
• power down 20X!
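The ACS (add-compare-select) function that this ASU implements in hardware selects, for each of two output metrics, the larger of two candidate sums. A minimal Python rendering of the same function, with lower-case names chosen here for illustration:

```python
def acs(m1, m2, d):
    """Add-compare-select as written on the slide: each output is the
    larger of two candidate metrics."""
    ms = m1 + d if m1 + d > m2 - d else m2 - d
    ms8 = m1 - d if m1 - d > m2 + d else m2 + d
    return ms, ms8


print(acs(10, 7, 2))
print(acs(3, 8, 1))
```

On a plain ALU this takes four additions or subtractions plus two compare-and-select steps per call, which suggests why moving it into a dedicated unit shortens the main loop so much.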
GSM Viterbi decoder: 4 x ACS-ASU
  EXU          ACTIV   AREA    POWER
  alu_1        94%     243     97
  acs_asu_1    95%     1041    420
  acs_asu_2    95%     1041    420
  acs_asu_3    95%     1041    420
  acs_asu_4    95%     1041    420
  split_asu_1  47%     90      18
  or_asu_1     47%     592     118
  romctrl_1    28%     48      6
  acu_1        98%     212     85
  ipb_1        23%     60      6
  opb_1        50%     369     80
  ctrl                 1306    555
  total                7084    2645
  cycle count: 425
• cycle count down another 5X
• area up 23%
• power down another 3X!
GSM Viterbi example: summary
[Figure: Mistral2 with the implementation independent design database]
• 72x !
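The improvement factors quoted on the Viterbi slides can be recomputed from the table totals. The sketch below copies the totals from the five slides and assumes that the unlabeled number on each slide (13750, 14247, 9739, 1930, 425) is the cycle count; the variable and function names are hypothetical.

```python
# (variant, cycles, total area, total power) taken from the five slides
variants = [
    ("default", 13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs", 9739, 14766, 116731),
    ("1 x ACS-ASU", 1930, 5747, 7944),
    ("4 x ACS-ASU", 425, 7084, 2645),
]


def factor(before, after):
    """How many times smaller 'after' is than 'before'."""
    return before / after


# Step-by-step factors between consecutive variants.
steps = []
for (_, c0, a0, p0), (name, c1, a1, p1) in zip(variants, variants[1:]):
    steps.append((name, factor(c0, c1), factor(a0, a1), factor(p0, p1)))

for name, fc, fa, fp in steps:
    print(f"{name:16s} cycles /{fc:5.2f}  area /{fa:5.2f}  power /{fp:5.2f}")

overall_cycles = factor(variants[0][1], variants[-1][1])
overall_power = factor(variants[0][3], variants[-1][3])
print(f"overall: cycles /{overall_cycles:.1f}, power /{overall_power:.1f}")
```

With these totals the overall power ratio comes out at roughly 71, in line with the 72x on the summary slide.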
Discussion: phase 3
[Figure: flow chart. Exploration phase: processor model and application(s) -> SW (code generation) and HW design -> decision points "OK?" and "more appl.?" -> freeze processor model. Afterwards: application(s) -> SW (code generation) -> "OK?"]
• Application software development: constraint driven compilation
Discussion: problems with VLIWs
code size and instruction bandwidth
• code compaction = reduce code size after scheduling
  • possible compaction ratio?
  • e.g. $p_0 = 0.9$ and $p_1 = 0.1$
  • information content (entropy) $= -\sum_i p_i \log_2 p_i \approx 0.47$ bit per instruction bit
  • maximum compression factor $1/0.47 \approx 2$
• control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
• architecture
  • reduce number of control bits for operand addresses
  • e.g. 128 reg (TM) -> 7 bits per address, 28 bits/issue slot for addresses only
  • => use stacks and fifos
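The two numerical examples on this slide can be checked in a few lines. The sketch assumes the instruction bits are independent with the stated probabilities (which is what the entropy bound presumes) and that register addresses are plain binary indices; the variable names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Compaction bound for instruction bits with p0 = 0.9 and p1 = 0.1
h = entropy([0.9, 0.1])
max_compression = 1 / h
print(f"entropy = {h:.2f} bit, maximum compression factor = {max_compression:.1f}")

# Operand address bits for a 128-entry register file
addr_bits = math.ceil(math.log2(128))
addresses_per_slot = 28 // addr_bits
print(f"{addr_bits} bits per address, {addresses_per_slot} addresses in 28 bits")
```

The 28 bits per issue slot therefore correspond to four 7-bit register addresses, and no lossless code compaction can do much better than a factor of 2 under this bit distribution.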
[Figure: VLIW datapath with register files RF1 to RF4, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control]
Discussion: clustered VLIW architectures
[Figure: clustered datapath with register files RF1 to RF4 and functional units FU1 to FU4]
Conclusions
• ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between HW and SW.
• Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.