Explore the implementation of Block Structured Architecture (BSA) in processor design to address complexity and fetch problems, with advantages in predication and intra-block communication. Learn about BSA's fixed-length blocks, challenges, and statistical modeling.
Application Domains for Fixed-Length Block Structured Architectures
ACSAC-2001, Gold Coast, January 30, 2001
Outline
• Introduction
• Block Structured Architecture
• Methodology
• Results
• Conclusions
Introduction
• Out-of-order architecture
  • dynamically schedules independent instructions
• Higher ILP through
  • a more powerful processor core
  • fast instruction delivery
• But this increases the hardware complexity significantly!
Hardware complexity
• Processor core
  • instruction window: O(n²)
  • bypass logic: long wires [Palacharla et al. 1996]
  • register file: many ports [Farkas et al. 1995]
• Fetching
  • fetch bandwidth
  • multiple branches
  • cache access
Solutions
• Processor core: decentralization
  • trace processor [Rotenberg et al. '97]
  • multiscalar architecture [Sohi et al. '95]
  • clusters (Alpha 21264)
• Fetching: bigger units of work
  • trace in trace processors
  • task in multiscalar architecture
  • block in block-structured ISA [Melvin and Patt '95; Hao et al. '96]
Basic idea of BSA
• Fixed-Length Block Structured Architecture (BSA) addresses two problems through appropriate microarchitectural and implementation design decisions:
  • the processor core problem
  • the fetching problem
• BSA is a feasible architectural paradigm for future processors
Block Structured Architecture: overcoming the fetch problem
[Figure: a BSA-block composed of several basic blocks, with paths guarded by the predicates (p1)/(~p1) and (p2)/(~p2)]
• The BSA-block is the atomic unit of work
  • no control flow
  • predication
  • static register renaming
  • data-flow execution
  • fixed-length
• Advantages:
  • predication: elimination of unbiased branches
  • intra-block communication: fewer register file ports required
  • fixed-length BSA-blocks: easier fetching
• Disadvantages:
  • BSA-block not always filled
  • higher memory bandwidths
  • bigger instruction caches
  • BSA-block compression
Block Structured Architecture: overcoming the processor core problem
[Figure: block diagram with the labels instruction cache, fetch unit, branch predictor, four block engines, FU1, FU2, instruction window, data cache, and register file]
• fixed-length BSA-block
• speculative execution
• fast intra-block communication
• slow inter-block communication
Decentralization (1)
• Out-of-order architectures with higher levels of ILP
  • complex design
  • wiring delay will dominate in future technologies
• Scaling out-of-order architectures to higher levels of ILP for future technologies is infeasible
• Decentralization: small, and thus very fast, block engines communicating through longer, and thus slower, interconnects
Decentralization (2)
• Lower IPC
  • slower interconnections (1 cycle latency)
  • bad virtual instruction window utilization due to coarser granularity
• Higher clock frequency F
  • decentralization
• Performance = IPC × F
  • higher performance for large virtual window sizes
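The trade-off between a lower IPC and a higher clock frequency can be made concrete with a small sketch. The IPC and frequency values below are hypothetical placeholders chosen for illustration; they are not results from the talk.

```python
def performance(ipc: float, freq_ghz: float) -> float:
    """Performance = IPC x F, here in instructions per nanosecond (F in GHz)."""
    return ipc * freq_ghz


def breakeven_frequency_gain(ipc_centralized: float, ipc_decentralized: float) -> float:
    """Clock-frequency ratio a decentralized design needs to match a centralized one."""
    return ipc_centralized / ipc_decentralized


# Hypothetical numbers for illustration only.
central = performance(ipc=4.0, freq_ghz=1.0)
decentral = performance(ipc=3.2, freq_ghz=1.5)
gain_needed = breakeven_frequency_gain(4.0, 3.2)
print(central, decentral, gain_needed)
```

With these assumed values, the decentralized design loses 20% of the IPC but comes out ahead once its clock is more than 1.25 times faster.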
Statistical Modeling
[Figure: flow diagram of the methodology, with steps numbered 1 to 6]
• Flow: benchmark trace (e.g. SPECint) → statistical profiler (extraction of distributions) → statistical profile (distributions) → synthetic trace generator → synthetic trace → trace-driven simulator → IPC
• Additional inputs: microarchitectural parameters, BSA-block size b
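A minimal sketch of the profile-then-generate idea, assuming the profile holds only an instruction-mix distribution. The actual profiler extracts more distributions than this; the function names and the toy trace are hypothetical.

```python
import random
from collections import Counter


def build_profile(trace):
    """Turn a trace of instruction types into a probability distribution."""
    counts = Counter(trace)
    total = len(trace)
    return {kind: n / total for kind, n in counts.items()}


def generate_synthetic_trace(profile, length, seed=0):
    """Sample a synthetic trace whose instruction mix follows the profile."""
    rng = random.Random(seed)
    kinds = list(profile)
    weights = [profile[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=length)


# Hypothetical toy trace; a real one would come from an instrumented benchmark.
trace = ["load", "alu", "alu", "branch", "store",
         "alu", "load", "alu", "alu", "branch"]
profile = build_profile(trace)
synthetic = generate_synthetic_trace(profile, length=1000)
print(profile)
```

The synthetic trace can be made much shorter or longer than the original, since it only has to reproduce the measured distributions.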
Synthetic BSA-trace Generation
• Generate control flow
  • determine basic block size
  • add basic block to most likely execution path
  • until b instructions in BSA-block
• Generate data flow
  • instruction type
  • number of operands
  • age of register operands
• Determine actually executed control flow path
[Figure: a BSA-block drawn as a tree of basic blocks numbered 1 to 5, labeled with path probabilities (0.65, 0.35; 0.25, 0.40, 0.20, 0.15; 0.20, 0.05, 0.20, 0.20); the basic blocks actually executed are marked]
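The control-flow steps above can be sketched as a greedy tree growth: always extend the most likely path, stop when the next basic block no longer fits in b instructions, then draw the executed path. This is an illustrative sketch, not the authors' generator; the size and branch-probability samplers are hypothetical stand-ins for the profiled distributions, and data-flow generation is left out.

```python
import heapq
import random


def form_bsa_block(b, sample_block_size, sample_taken_prob, rng):
    """Grow a tree of basic blocks along the most likely paths, up to b instructions."""
    nodes = []
    # heapq is a min-heap, so path probabilities are stored negated.
    # The counter is a unique tie-breaker, so parents are never compared.
    frontier = [(-1.0, 0, None)]
    counter = 1
    used = 0
    while frontier:
        neg_prob, _, parent = heapq.heappop(frontier)
        size = sample_block_size(rng)
        if used + size > b:
            break  # next block does not fit; remaining slots stay empty
        index = len(nodes)
        nodes.append({"size": size, "prob": -neg_prob,
                      "parent": parent, "children": []})
        if parent is not None:
            nodes[parent]["children"].append(index)
        used += size
        p = sample_taken_prob(rng)
        for branch_prob in (p, 1.0 - p):
            heapq.heappush(frontier, (neg_prob * branch_prob, counter, index))
            counter += 1
    return nodes


def pick_executed_path(nodes, rng):
    """Walk from the root, following each successor with its branch probability."""
    path = [0]
    while True:
        node = nodes[path[-1]]
        u = rng.random() * node["prob"]
        for child in node["children"]:
            if u < nodes[child]["prob"]:
                path.append(child)
                break
            u -= nodes[child]["prob"]
        else:
            return path  # control leaves the BSA-block here


rng = random.Random(42)
nodes = form_bsa_block(
    b=16,
    sample_block_size=lambda r: r.randint(2, 8),       # hypothetical distribution
    sample_taken_prob=lambda r: r.uniform(0.5, 0.95),  # hypothetical distribution
    rng=rng,
)
path = pick_executed_path(nodes, rng)
filled = sum(n["size"] for n in nodes)
on_path = sum(nodes[i]["size"] for i in path)
print(f"filled {filled}/16 slots, {on_path} instructions on the executed path")
```

Instructions that sit in the block but off the executed path are predicated away, which is one way to read the "BSA-block not always filled" cost noted earlier.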
Benchmarks
• SPECint95: integer
• SPECfp95: floating-point
• MediaBench: signal and multimedia processing
• MPEG-4 like algorithms
• Program characteristics measured through instrumentation (ATOM) on the Alpha architecture
Instruction Mix
• Load/store instructions
  • SPECint95: 40.6%
  • SPECfp95: 37.7%
  • multimedia: 29.2%
• Branch instructions
  • SPECint95: 14.0%
  • SPECfp95: 3.6%
  • multimedia: 8.5%
• Some multimedia applications have floating-point instructions
Control intensity
• Good measure: "number of instructions between 2 mispredicted branches" = (number of instructions between 2 branches) / (branch misprediction rate)
• SPECint95: 80.1 = 7.3 / 9.1%
• SPECfp95: 415.3 = 25.0 / 6.0%
• multimedia: 156.9 = 14.3 / 9.1%
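The measure above is a single division, sketched here with the figures from the slide. The function and variable names are illustrative.

```python
def instructions_between_mispredictions(instr_between_branches, mispredict_rate):
    """Average number of instructions between two mispredicted branches."""
    return instr_between_branches / mispredict_rate


# (instructions between 2 branches, branch misprediction rate) from the slide
suites = {
    "SPECint95": (7.3, 0.091),
    "SPECfp95": (25.0, 0.060),
    "multimedia": (14.3, 0.091),
}

results = {
    name: instructions_between_mispredictions(ipb, rate)
    for name, (ipb, rate) in suites.items()
}
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

Recomputing from the rounded inputs lands within about half a percent of the slide's values (80.1, 415.3, 156.9); the small gap is presumably rounding in the published figures.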
BSA-block formation: number of useful instructions
[Figure: fraction of useful instructions (y-axis, 50% to 100%) versus BSA-block size (x-axis: 16, 32, 64, 128), with curves for avg media, avg SPECint95, and avg SPECfp95]
BSA-block formation: predictability of multi-way branch
[Figure: multi-way branch predictability (y-axis, 0% to 100%) per benchmark, grouped into multimedia, integer, and floating-point, for 16-, 32-, and 64-instruction blocks]
• 16-instruction block: 90% in most cases
• 32-instruction block: low for several integer applications
• 64-instruction block: only for floating-point applications
Conclusions
• Multimedia applications are less control-intensive than integer applications
  • due to larger basic block size under comparable branch predictability
• Multimedia applications are more control-intensive than floating-point applications
  • due to smaller basic block size and lower branch predictability
• 16 instructions per BSA-block is appropriate
  • larger blocks result in higher (multi-way) branch misprediction rates