Widening Resources: A Cost-effective Technique for Aggressive ILP Architectures
David López, Josep Llosa, Mateo Valero and Eduard Ayguadé
Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya
Micro-31

Goals
• Modify resources to exploit ILP in
  • VLIW architectures
  • numerical code
  • innermost loops
• A study of performance and cost
• Technological projection

Outline
• Replication and Widening
• Performance
  • maximum ILP achievable
  • effects of spill code
• Design considerations
• Performance under a technological limit
• Conclusions

Basic architecture
• 1 bus between the register file (RF) and the first-level cache
• 1 general-purpose floating-point functional unit (FPU)
[Diagram labels: Bus, Register File, FPU]

Basic architecture
• 2 operations can be issued per cycle:
  • 1 memory
  • 1 FPU
[Diagram labels: Bus, Register File, FPU; VLIW: FPU, others, memory]

Replication
• 2 buses + 2 FPU
• 4 operations can be issued per cycle:
  • 2 memory (independent)
  • 2 FPU (independent)
[Diagram labels: Bus, Bus, Register File, FPU, FPU; VLIW: FPU, others, memory]

An alternative: widening
• The bus, the FPU and the RF are widened
• 4 operations can be issued per cycle:
  • 2 memory (at consecutive memory addresses)
  • 2 FPU (the same operation)
• Less versatile
[Diagram labels: Bus, Register File, FPU; VLIW: FPU, others, memory]

Software pipelined loops
• Loop performance is limited by
  • recurrences
  • resources
• Software pipelining overlaps the execution of several consecutive iterations
• With a perfect schedule, at least one resource is occupied 100% of the time (unless the loop is recurrence-bound)

How does widening work?
• 3 memory operations: loads A and B, and store C
  • A and B have stride 1 with themselves in the next iteration
• 3 floating-point operations
  • + has a recurrence with itself in the next iteration
  • let’s assume a latency of 2 cycles
[Dependence graph: nodes A, B, D, *, *, + and C]

Execution of several iterations
• 1 bus + 1 FPU: 3 cycles / iteration
• 2 buses + 2 FPU: 1.5 cycles / iteration from resources, but 2 cycles are required due to the recurrence, so 2 cycles / iteration
[Diagram: iterations 1 to 4, each with Load A, Load B, *, *, +, Store C]

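The arithmetic on this slide can be reproduced with a small sketch. It assumes the usual software-pipelining bound, in which the cycles per iteration are the larger of a resource bound and a recurrence bound, and that every operation occupies its resource for one cycle. The function name and its parameters are illustrative and do not come from the original slides.

```python
from fractions import Fraction


def cycles_per_iteration(mem_ops, fp_ops, buses, fpus, recurrence_cycles):
    """Lower bound on cycles per iteration of a software pipelined loop.

    Assumption: each operation keeps its bus or FPU busy for one cycle.
    """
    # Resource bound: the busiest resource class limits the throughput.
    resource_bound = max(Fraction(mem_ops, buses), Fraction(fp_ops, fpus))
    # Recurrence bound: a loop-carried dependence cannot be overlapped away.
    return max(resource_bound, Fraction(recurrence_cycles))


# Example loop: 3 memory ops, 3 FP ops, the add recurs with a 2-cycle latency.
print(cycles_per_iteration(3, 3, buses=1, fpus=1, recurrence_cycles=2))  # 3
print(cycles_per_iteration(3, 3, buses=2, fpus=2, recurrence_cycles=2))  # 2
```

With two buses and two FPUs the resource bound alone would be 3/2, so the recurrence is what sets the final figure.
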
“Compactable” operations (width 2) (López et al. ICS97)

Operation        Compactable   Reason
Load A, Load B   yes           No dependency and stride 1
*, *             yes           No dependency
+                no            Dependency
Store C          no            No dependency, but no stride 1

• Result: 2 cycles / iteration
[Diagram: iterations 1 to 4]

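The 2 cycles / iteration figure can be checked with a rough model of a widened machine. The sketch assumes one bus and one FPU, both `width` words wide, where a compactable operation serves `width` consecutive iterations with a single wide operation and a non-compactable one needs a slot per iteration. The function name and the data layout are hypothetical.

```python
from fractions import Fraction


def widened_cycles_per_iteration(ops, width, recurrence_cycles):
    """ops: list of (resource, compactable) pairs for one loop iteration."""
    slots = {"bus": 0, "fpu": 0}
    for resource, compactable in ops:
        # One wide slot covers `width` iterations if the op is compactable;
        # otherwise each of the `width` iterations needs its own slot.
        slots[resource] += 1 if compactable else width
    # The slots above were counted for a group of `width` iterations.
    resource_bound = Fraction(max(slots.values()), width)
    return max(resource_bound, Fraction(recurrence_cycles))


loop = [
    ("bus", True),   # load A: no dependency, stride 1
    ("bus", True),   # load B: no dependency, stride 1
    ("fpu", True),   # first multiply: no dependency
    ("fpu", True),   # second multiply: no dependency
    ("fpu", False),  # add: loop-carried dependency
    ("bus", False),  # store C: no dependency, but not stride 1
]
print(widened_cycles_per_iteration(loop, width=2, recurrence_cycles=2))  # 2
```

Setting `width=1` in the same model gives 3 cycles per iteration, the figure quoted for 1 bus + 1 FPU.
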
Limits on ILP
• Baseline configuration (1w1): 1 bus and 2 FPU
• Configurations XwY:
  • X: degree of replication: X buses, 2*X FPU
  • Y: degree of widening (width of the resources)
• Characteristics of the architecture:
  • a store is served in 1 cycle; division (19 cycles) and SQRT (27 cycles) are not pipelined
  • the remaining operations are fully pipelined with a latency of 4 cycles

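The XwY naming can be made concrete with a short sketch. The bus and FPU counts follow the definitions on this slide; the peak issue rate assumes that each resource of width Y performs Y operations per cycle, as in the earlier widening slide. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    x: int  # degree of replication
    y: int  # degree of widening

    @property
    def buses(self) -> int:
        return self.x

    @property
    def fpus(self) -> int:
        return 2 * self.x

    @property
    def peak_ops_per_cycle(self) -> int:
        # Assumption: a resource of width y performs y operations per cycle.
        return (self.buses + self.fpus) * self.y

    def __str__(self) -> str:
        return f"{self.x}w{self.y}"


for cfg in (Config(1, 1), Config(2, 1), Config(1, 2), Config(2, 2)):
    print(cfg, cfg.buses, cfg.fpus, cfg.peak_ops_per_cycle)
```

Under this assumption, configurations with the same product X*Y share the same peak issue rate.
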
Performance: Replication
• Workbench: 1180 loops that account for 78% of the execution time of the Perfect Club

Performance: Replication vs Widening

Performance: Replication vs Widening vs Combined

Scheduling and register assignment
• Loops have been software pipelined using HRMS (Llosa et al. MICRO-28), a register-pressure-sensitive heuristic.
• Register allocation has been performed using the wands-only strategy and end-fit with adjacency ordering (Rau et al. PLDI-92).
• When a loop requires more registers than are available, spill code is added.

Register pressure
• Reducing the cycles required per iteration can increase the register requirements.
• Widening is also applied to the register file
  • more storage capacity (and less register pressure)
  • not cheating! If there are no compactable operations, we do not benefit from this additional capacity

Effects of adding spill code
• Baseline: 1w1 with a 256-register RF

Area cost
• Cost of the FPU: widening and replication have the same cost
• The area of the RF grows as the square of the number of ports

Register file access time
• Based on the CACTI model (Wilton & Jouppi, J. of Solid-State Circuits 96) for cache memory.
• Normalised to configuration 1w1 with a 32-register RF and a technology of λ=0.05.
• Widening the RF is cheaper than adding ports
• Increasing the number of registers is cheaper than adding ports

Effect of the RF size
• Configuration 1w1

Effect of the studied techniques

Cost of widening and replication
• Area:
  • replication: quadratic increase
  • widening: linear increase
• Cycle time:
  • replication increases the cycle time more than widening does
  • the RF can be partitioned into several copies, reducing the access time but increasing the area

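As a rough illustration of the two area trends, the sketch below assumes that the number of RF ports grows linearly with the replication degree X, that the bits per register grow linearly with the widening degree Y, and that the number of registers stays constant. The function name and the normalisation to 1w1 are hypothetical and are not the cost model used in the study.

```python
def relative_rf_area(x, y):
    """RF area relative to 1w1 under the simplifying assumptions above."""
    port_factor = x ** 2  # area grows as the square of the number of ports
    width_factor = y      # wider registers add area linearly
    return port_factor * width_factor


# Four ways of reaching the same product X*Y = 8.
for x, y in [(8, 1), (4, 2), (2, 4), (1, 8)]:
    print(f"{x}w{y}: {relative_rf_area(x, y)}x the 1w1 RF area")
```

The sketch ignores RF partitioning, which trades extra area for a shorter access time.
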
Performance/cost trade-off
• Configurations XwY(Z:n) where:
  • X is the replication degree
  • Y is the widening degree
  • Z is the RF size (32, 64, 128 or 256)
  • n is the number of blocks into which the RF has been partitioned

Configurations that can be implemented
• We use the SIA (Semiconductor Industry Association) predictions:

Year                  1998    2001     2004     2007     2010
λ (µm)                0.25    0.18     0.13     0.10     0.07
Size (mm²)            300     360      430      520      620
λ² per chip (×10⁶)    4800    11,111   25,443   52,000   126,530

• FPU + RF area cost must be smaller than 20% of the total chip area available

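The λ² per chip row follows from the other two rows: the chip size converted to µm² and divided by λ². The sketch below recomputes it and the 20% budget for the FPU and RF; the variable and function names are illustrative, and the printed values agree with the slide only up to rounding.

```python
# (year, feature size λ in µm, chip size in mm²) as listed on the slide.
SIA_PREDICTIONS = [
    (1998, 0.25, 300),
    (2001, 0.18, 360),
    (2004, 0.13, 430),
    (2007, 0.10, 520),
    (2010, 0.07, 620),
]

AREA_FRACTION = 0.20  # share of the chip allowed for FPU + RF


def lambda_squared_per_chip(lam_um, size_mm2):
    """Chip area expressed in units of λ² (1 mm² = 1e6 µm²)."""
    return size_mm2 * 1e6 / lam_um ** 2


for year, lam, size in SIA_PREDICTIONS:
    total = lambda_squared_per_chip(lam, size)
    budget = AREA_FRACTION * total
    print(f"{year}: {total / 1e6:,.0f} x10^6 λ² per chip, "
          f"{budget / 1e6:,.0f} x10^6 λ² for FPU + RF")
```
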
FPU latency
• We compare configurations adapting the latency of the FPU to the processor cycle time
• A configuration with a relative cycle time Tc belongs to the z-cycles model, where z=4/Tc

Effect of the RF size
• The same configuration, changing the RF size
• A big RF needs less spill code, but pays a large penalty in access time

Effect of the studied techniques
[Panels: Only Replication; Only Widening]

XwY where X*Y=8
• The same RF size and peak performance
• Combining small degrees of replication and widening results in the best performance

Top five configurations (i): λ=0.18
• The five configurations that achieve the best performance for λ=0.18 are shown.
• Blue ones: those with the best performance/cost
• In all the technology generations, the best ones use widening

Top five configurations (ii): λ=0.13 and λ=0.10

Conclusions
• Study of two techniques to extract ILP: replication and widening
• Study of aggressive configurations in optimal conditions:
  • replication achieves the best performance
  • widening costs less
• Study of the cost of both techniques

Conclusions
• Applying small degrees of replication and widening results in the best performance under a technology limit
• Widening has more storage capacity
  • less spill code
• Replication has more area requirements
  • some configurations become not implementable
• RF access time is shorter using widening than using replication

RF area cost
[Diagram labels: Read data line, Write select line, Read select line, Write data line]

Unrolling the loop
[Dependence graphs: original loop (A, B, D, *, *, +, C) and unrolled loop (A0, A1, B0, B1, D, *0, *1, +0, +1, C0, C1)]

Compacting
[Dependence graphs: unrolled loop and compacted loop with wide operations A0,1, B0,1 and *0,1; +0, +1, C0 and C1 stay separate]

Execution of a compacted loop
[Diagram: operations A0,1, B0,1, *0,1, *0,1, +0, +1, C0 and C1 on the Bus, Register File and FPU]

Limits
• A loop is bounded by recurrences and resources.
• Assume the basic architecture (1 bus and 1 FPU) with a latency of 2 cycles
[Dependence graph of the unrolled loop: A0, A1, B0, B1, D, *0, *1, +0, +1, C0, C1]

Limits: resources and recurrences
[Dependence graph of the unrolled loop]

Reducing the resource limits
[Dependence graph of the unrolled loop]

Effect of replication and widening
• 1w1: 3 cycles/it
• 2w1: 2 cycles/it
• 1w2: 2 cycles/it

Taxonomy of loops
[Chart labels: Recurrences, Compactable, Non-compactable, Don’t care]

Top five configurations: λ=0.25
