Università di Catania, Dipartimento di Ingegneria Informatica e delle Telecomunicazioni. Exploring Design Space of VLIW Architectures. Giuseppe Ascia, Vincenzo Catania, Maurizio Palesi and Davide Patti. DIIT - University of Catania, Italy.
Outline • Introduction • VLIW in past & future • Design Exploration Framework • ILP oriented compilation • Genetic Design Space Exploration • Conclusions
Instruction Level Parallelism • High-performance processors in the 1980s: maximize ILP • Issue more than one instruction in a given clock cycle • Who decides which instructions can be executed in parallel? • Two different philosophies: • Superscalar • Very Long Instruction Word (VLIW)
ILP philosophy: Superscalar • Hides the process of finding ILP • ILP is discovered dynamically at run time by the control hardware of the processor [Figure: Foo.c, compiler, sequential instruction stream (Op1 Op2 Op3 Op4 Op5 …), HW grouping operations at run time (Op1,Op2 | Op3 | Op4,Op5 …)]
ILP philosophy: VLIW • Hardware resources are architecturally visible to the compiler • The compiler creates a sequence of Very Long Instructions that defines the plan of execution • The HW simply executes the plan [Figure: Foo.c and the hardware resources configuration feed the compiler, which emits the plan of execution (Op1,Op2 | Op3 | Op4,Op5); the HW executes it at run time]
VLIW past & future • Decline of VLIWs for general purpose systems: • Couldn’t be integrated in a single chip • Binary compatibility issues between implementations • Rediscovery of VLIW in embedded systems: • No more integrability issues • Binary incompatibility not relevant • Advantages of VLIW: • Simplified hardware • The architecture can be optimized ad hoc to achieve ILP
Reference architecture (HPL-PD) [Block diagram components: L1 Instruction Cache, L1 Data Cache, L2 Unified Cache, Prefetch Cache, Prefetch Unit, Fetch Unit, Instruction Queue, Decode and Control Logic • Register files: Predicate Registers, Branch Registers, General Purpose Registers, Floating Point Registers, Control Registers • Functional units: Branch Unit, Integer Unit, Floating Point Unit, Load/Store Unit]
Configuration Space • Three main parameter categories: • VLIW core: • Number of registers in each register file (from 16 to 256) • Number of instances of functional units of each type (from 1 to 6) • Memory hierarchy: • Size, block size and associativity for each of the caches (L1 Instruction, L1 Data, L2) • Compiler: • Conservative compilation strategy (basic blocks) • Aggressive ILP-oriented compilation strategy (hyperblocks) • Total space size: 1.47 × 10^13 configurations!
Required Tools • High level estimation models • Design Space Exploration strategy [Figure labels: Application.c, Configuration, Compiler, Simulator, Estimator, Performances, Power, …, Exploration Algorithm, Pareto configurations]
An Open Platform: EPIC Explorer • Interfaces to the Trimaran framework, which provides a VLIW compiler and a simulator for dynamic statistics • Estimator component implementing high level models • Explorer component implementing multi-objective design space exploration algorithms
The Exploration Data Flow [Figure labels: Foo.c, IMPACT, ELCOR, Emulib, foo.exe, Execution statistics, Estimator (Processor, Memory), Cycles, Energy, Power, Explorer, System configuration]
Energy estimation • Subdivide the architecture into Functional Block Units (FBUs) • Instruction decode logic, integer units, floating point units, register files • For each FBU (figures from ST Microelectronics LX): • Active power: average power dissipated when the FBU is used • Inactive power: average power dissipated when the FBU is not used • From the execution statistics, we know how many cycles each FBU has been active/inactive • $E_{FBU} = (P_{active} \cdot cycles_{active} + P_{inactive} \cdot cycles_{inactive}) \cdot T_{clock}$ • Fair degree of accuracy (about 25%) • Suitable for investigating relative power savings between designs
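The per-FBU energy formula above can be sketched in a few lines of Python. This is only an illustration: the `FbuStats` type, the power figures, the cycle counts and the clock period are all hypothetical values, and summing the per-FBU energies into a total is an assumption about how the estimator aggregates them.

```python
from dataclasses import dataclass


@dataclass
class FbuStats:
    """Power figures (watts) and cycle counts for one FBU; values used below are hypothetical."""
    p_active: float
    p_inactive: float
    cycles_active: int
    cycles_inactive: int


def fbu_energy(s: FbuStats, t_clock: float) -> float:
    """E_FBU = (P_active*cycles_active + P_inactive*cycles_inactive) * T_clock, in joules."""
    return (s.p_active * s.cycles_active + s.p_inactive * s.cycles_inactive) * t_clock


def total_energy(fbus: dict, t_clock: float) -> float:
    """Assumption: total energy is the sum of the per-FBU energies."""
    return sum(fbu_energy(s, t_clock) for s in fbus.values())


# Illustrative numbers only, not taken from the ST LX characterization.
fbus = {
    "integer_unit": FbuStats(0.20, 0.02, 800, 200),
    "register_file": FbuStats(0.10, 0.01, 900, 100),
}
T_CLOCK = 5e-9  # seconds, i.e. a 200 MHz clock (hypothetical)
print(total_energy(fbus, T_CLOCK))
```

Because the model is linear in the cycle counts, comparing two configurations reduces to comparing their activity statistics, which fits the stated goal of ranking designs by relative savings.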
Reference Application Set • Chosen from the MediaBench suite
Exploration Methodology • Preliminary analysis of compilation • Impact of ILP-oriented code transformations • Predict the right compilation strategy: • Basic blocks (conservative) • Hyperblocks (aggressive, ILP-oriented) • Multi-objective Design Space Exploration • Extract Pareto Set
Preliminary Analysis (1/3) • For each objective, an unpaired two-sample t-test estimates the average effect of hyperblock formation • Random subsets of n configurations are drawn from the configuration space • Each subset is compiled with (H) and without (N) hyperblock formation, giving configurations C_H, C_N and objective values O_H, O_N • T-test: is the mean effect on the objective significant with respect to the chosen critical difference?
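As a rough illustration of the test on this slide, the sketch below computes the statistic of an unpaired two-sample t-test on two lists of objective values. It uses Welch's unequal-variance variant, which is a choice made here and not something the slides specify; the function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_h, sample_n):
    """Return (t, df) for an unpaired two-sample t-test with unequal variances.

    sample_h: objective values measured with hyperblock formation (H)
    sample_n: objective values measured without it (N)
    """
    nh, nn = len(sample_h), len(sample_n)
    vh, vn = variance(sample_h), variance(sample_n)  # unbiased sample variances
    se2 = vh / nh + vn / nn                          # squared standard error of the mean difference
    t = (mean(sample_h) - mean(sample_n)) / sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((vh / nh) ** 2 / (nh - 1) + (vn / nn) ** 2 / (nn - 1))
    return t, df


# Hypothetical objective values (e.g. execution cycles in millions) for two random subsets.
o_h = [10, 12, 11, 13]
o_n = [15, 17, 16, 18]
t, df = welch_t(o_h, o_n)
print(t, df)
```

The resulting |t| is then compared with the critical value for the chosen significance level and degrees of freedom; for example, with 6 degrees of freedom the two-sided 5% critical value is about 2.447.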
Preliminary Analysis (2/3) • Example of a metric for critical difference in means: d > 50% M
Preliminary Analysis (3/3) • ILP-oriented compilation impact (positive, negative)
DSE: Genetic Mapping [Figure: the chromosome is made of genes for the configuration parameters: Mem Cache (Size, BSize, Assoc), VLIW core (Func units, Register Files), Bus ctrl]
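One plausible way to realize such a mapping is to store, for each gene, an index into the list of allowed values of one parameter. The sketch below assumes this index-based encoding; the parameter names and value lists in `DOMAINS` are hypothetical and much smaller than the real configuration space.

```python
# Hypothetical parameter domains; the real ranges are those of the configuration space.
DOMAINS = {
    "l1d_size_kb": [1, 2, 4, 8, 16, 32],
    "l1d_block_bytes": [16, 32, 64],
    "l1d_assoc": [1, 2, 4],
    "integer_units": [1, 2, 3, 4, 5, 6],
    "gpr_count": [16, 32, 64, 128, 256],
}
GENES = list(DOMAINS)  # fixed gene order within the chromosome


def decode(chromosome):
    """Map a list of gene indices to a concrete architecture configuration."""
    return {name: DOMAINS[name][idx] for name, idx in zip(GENES, chromosome)}


def encode(config):
    """Inverse of decode: map a configuration back to gene indices."""
    return [DOMAINS[name].index(config[name]) for name in GENES]


def space_size():
    """Number of distinct configurations: product of the domain sizes."""
    size = 1
    for values in DOMAINS.values():
        size *= len(values)
    return size


print(decode([0, 1, 2, 5, 4]))
print(space_size())
```

With index genes, any chromosome whose genes stay within their domain sizes decodes to a valid configuration, so crossover and mutation never produce illegal parameter values.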
DSE: Genetic Iteration [Figure labels: Current Population, Individual, Architecture configuration, Simulation, Estimation, Performance, Power, Fitness Evaluation, Selected?, Crossover, Mutation, Descendant, New Architecture configuration]
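A minimal sketch of one such iteration is given below, under several assumptions not stated on the slide: chromosomes are lists of integer gene indices, selection is a binary tournament, crossover is single-point, and a scalar `fitness` callable (higher is better) stands in for the simulation, estimation and fitness evaluation steps. The default probabilities mirror the values reported in the experimental setup.

```python
import random


def crossover(a, b, rng):
    """Single-point crossover of two equal-length chromosomes (length >= 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, gene_sizes, p_mut, rng):
    """Redraw each gene uniformly from its domain with probability p_mut."""
    return [rng.randrange(size) if rng.random() < p_mut else gene
            for gene, size in zip(chrom, gene_sizes)]


def next_generation(population, fitness, gene_sizes, rng, p_cross=0.8, p_mut=0.1):
    """Build a new population of the same size from the current one."""
    def pick():
        # Binary tournament selection (an assumption of this sketch).
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    children = []
    while len(children) < len(population):
        p1, p2 = pick(), pick()
        if rng.random() < p_cross:
            c1, c2 = crossover(p1, p2, rng)
        else:
            c1, c2 = p1[:], p2[:]
        children.append(mutate(c1, gene_sizes, p_mut, rng))
        children.append(mutate(c2, gene_sizes, p_mut, rng))
    return children[:len(population)]


rng = random.Random(0)
gene_sizes = [6, 3, 3, 6, 5]  # hypothetical domain sizes, one per gene
population = [[rng.randrange(s) for s in gene_sizes] for _ in range(30)]
for _ in range(50):
    # Placeholder fitness: in the real flow this comes from simulation and estimation.
    population = next_generation(population, lambda c: -sum(c), gene_sizes, rng)
print(len(population))
```

In the multi-objective setting the scalar placeholder would be replaced by a Pareto-based fitness computed over the whole population.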
DSE: Experimental Results • Parameters: • Initial population: 30 individuals • Crossover probability: 0.8 • Mutation probability: 0.1 • Generations: 50 • Example of two different scenarios: • G721 encode: the exploration should also cover the compilation strategy • Gsm-encode: hyperblock formation is predicted to be the better choice
Conclusions • Open platform for VLIW space exploration • Estimates Power, Energy and Performance • Preliminary Analysis of ILP-oriented compilation • Genetic multi-objective design space exploration • Future developments • Clustered VLIW • Network-on-chip multiprocessors • Open source: http://epic-explorer.sourceforge.net
Appendix • Bus Power Estimation • Implemented Algorithms • Multiobjective Fitness assignment • How Many Generations?
Power Estimation (buses) • Bus line transitions are computed from the list of data/address memory accesses • $P_{bus} = 0.5 \cdot V_{dd}^2 \cdot \alpha \cdot f \cdot C_l$ • $V_{dd}$: supply voltage • $\alpha$: switching activity • $f$: clock frequency • $C_l$: capacitance of a bus line
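The following sketch shows how the switching activity could be derived from a trace of values driven on the bus and plugged into the formula. It assumes one transfer per clock cycle and takes the switching activity as the average number of toggling lines per transfer, so the result refers to the whole bus; both are assumptions of this example, as are the function names and all numeric values.

```python
def count_transitions(words):
    """Total number of bit flips between consecutive words driven on the bus."""
    return sum(bin(prev ^ cur).count("1") for prev, cur in zip(words, words[1:]))


def switching_activity(words):
    """Average number of toggling bus lines per transfer (assumed definition of alpha)."""
    if len(words) < 2:
        return 0.0
    return count_transitions(words) / (len(words) - 1)


def bus_power(words, vdd, freq, c_line):
    """P_bus = 0.5 * Vdd^2 * alpha * f * C_l, in watts."""
    return 0.5 * vdd ** 2 * switching_activity(words) * freq * c_line


# Hypothetical 4-bit address trace and electrical parameters.
trace = [0b0000, 0b1111, 0b1010, 0b1010]
print(count_transitions(trace))
print(bus_power(trace, vdd=1.8, freq=100e6, c_line=1e-12))
```

XOR-ing consecutive words marks exactly the lines that change state, so counting the set bits gives the number of transitions without tracking each line separately.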
Design Space Exploration • Implemented algorithms: • Exhaustive: intuitive, simple and … unfeasible • Dependency analysis (dep), Givargis et al. [TVLSI’02] • GA-based DSE (ga), Palesi et al. [CODES’01] • Sensitivity Analysis, Fornaciari et al. [DAES’02] • Pareto-based Sensitivity Analysis (pbsa), Palesi et al. [VLSI-SOC’01]
Multiobjective Fitness assignment • Strength Pareto Approach [Zitzler, Thiele] • An external set P*, containing the nondominated configurations of P, is extracted from the current population P • Fitness of P* element j: $f_j = n/(N+1)$ • N = total size of P • n = number of P configurations dominated by j • Fitness of P element i: $f_i = 1/S$ • S is the sum of the fitness values of the P* elements that dominate i
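The assignment can be sketched as follows, implementing the formulas exactly as written on this slide. The example assumes that all objectives are minimized and represents each configuration by its tuple of objective values; the function names and data are hypothetical. Note that the original SPEA formulation by Zitzler and Thiele assigns 1 + S to the dominated individuals, so the 1/S form here follows the slide rather than that paper.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective and strictly better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(population):
    """External set P*: the nondominated points of the population P."""
    return [p for p in population if not any(dominates(q, p) for q in population)]


def strength_fitness(population):
    """Return (fitness of P* elements, fitness of the remaining P elements)."""
    external = pareto_set(population)
    n_total = len(population)
    # f_j = n / (N + 1), with n the number of population points dominated by j
    f_ext = {j: sum(dominates(j, i) for i in population) / (n_total + 1) for j in external}
    f_pop = {}
    for i in population:
        if i in f_ext:
            continue
        s = sum(f for j, f in f_ext.items() if dominates(j, i))
        f_pop[i] = 1.0 / s  # as stated on the slide; s > 0 because i is dominated
    return f_ext, f_pop


# Hypothetical (delay, power) pairs.
points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (6, 6)]
f_ext, f_pop = strength_fitness(points)
print(f_ext)
print(f_pop)
```

Every point outside P* is dominated by at least one member of P*, and that member has n ≥ 1, which is why the division by S is always defined.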
How Many Generations? • Fixed number of generations • Autostop criterion • Based on convergence [Figure: plot with axes delay and power]