This presentation explores the use of short vector extensions in ILP processors to accelerate loops in multimedia and DSP codes. It compares vectorization techniques and their impact on scalar resources, and argues that performance is best when both scalar and vector resources are exploited. It also presents case studies and evaluates approaches to selective vectorization and to software pipelining vectorized loops.
Exploiting Vector Parallelism in Software Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Multimedia Extensions
• Short vector extensions in ILP processors
  • AltiVec, 3DNow!, SSE, etc.
• Accelerate loops in multimedia & DSP codes
• New designs have floating point support
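For concreteness, here is a minimal sketch of the kind of short-vector loop these extensions target, written with SSE intrinsics. The saxpy-style computation, the function name, and the choice of SSE are assumptions for illustration only, not taken from the talk.

  #include <xmmintrin.h>

  /* Illustrative only: a saxpy-style loop using SSE, one of the short
     vector extensions named above.  Four floats are processed per
     vector operation; a scalar epilogue handles the remainder. */
  void saxpy_sse(float *restrict y, const float *restrict x, float a, int n)
  {
      __m128 va = _mm_set1_ps(a);             /* broadcast the scalar a */
      int i = 0;
      for (; i + 4 <= n; i += 4) {
          __m128 vx = _mm_loadu_ps(&x[i]);
          __m128 vy = _mm_loadu_ps(&y[i]);
          vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
          _mm_storeu_ps(&y[i], vy);
      }
      for (; i < n; i++)                      /* scalar epilogue */
          y[i] += a * x[i];
  }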
Multimedia Extensions
[Image courtesy of International Business Machines Corporation. Unauthorized use not permitted.]
• Vector resources do not overwhelm the scalar resources
  • Scalar: 2 FP ops / cycle
  • Vector: 4 FP ops / cycle
• Full vectorization may underutilize scalar resources
• ILP techniques do not target vector resources
• Need both
Modulo Scheduling
for (i=0; i<N; i++) {
  s = s + X[i] * Y[i];
}
[Dataflow graph: LOAD, LOAD → MULT → ADD]
II = 2 mod sched
Traditional Vectorization
for (i=0; i<N; i+=2) {
  S[i:i+1] = X[i:i+1] * Y[i:i+1];
}
for (i=0; i<N; i++) {
  s = s + S[i];
}
II = 2 mod sched | II = 2 + 1 = 3 traditional
Vectorization without Distribution
for (i=0; i<N; i+=2) {
  S = X[i:i+1] * Y[i:i+1];
  s = s + S0;
  s = s + S1;
}
II = 2 mod sched | II = 3 traditional | II = 1.5 no distrib
Selective Vectorization
for (i=0; i<N; i+=2) {
  S = X[i:i+1] * (Y[i] : Y[i+1]);   // vector load of X; the two Y loads stay scalar
  s = s + S0;
  s = s + S1;
}
II = 2 mod sched | II = 3 traditional | II = 1.5 no distrib | II = 1 selective
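A minimal sketch of the selective schedule above, using SSE2 double-precision intrinsics (vector length 2): the X load and the multiply are vector operations, while the Y loads and both additions stay scalar. The function name and the use of SSE2 intrinsics are assumptions for illustration; this is not the talk's code generator.

  #include <emmintrin.h>

  /* Sketch of selective vectorization of the dot product: vector load
     of X and a vector multiply, but scalar loads of Y and scalar
     additions into s. */
  double dot_selective(const double *X, const double *Y, int N)
  {
      double s = 0.0;
      int i = 0;
      for (; i + 2 <= N; i += 2) {
          __m128d vx = _mm_loadu_pd(&X[i]);              /* vector load X[i:i+1]   */
          __m128d vy = _mm_set_pd(Y[i + 1], Y[i]);       /* two scalar loads of Y  */
          __m128d vp = _mm_mul_pd(vx, vy);               /* vector multiply        */
          s += _mm_cvtsd_f64(vp);                        /* scalar add, element 0  */
          s += _mm_cvtsd_f64(_mm_unpackhi_pd(vp, vp));   /* scalar add, element 1  */
      }
      if (i < N)                                         /* scalar epilogue        */
          s += X[i] * Y[i];
      return s;
  }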
Complications
• Complex scheduling requirements
  • Particularly in statically scheduled machines
• Memory alignment
• Example assumes no communication cost
  • In reality, explicit operations are required
  • Often through memory
    • Reserves critical resources
    • Potential long latency
• Performance improvement still possible
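One of the explicit operations alluded to above is software realignment of misaligned vector data. A minimal sketch, assuming SSE2 purely for illustration (AltiVec would use vperm with a permute mask):

  #include <emmintrin.h>

  /* Software realignment sketch: lo is an aligned load covering elements
     [k-1, k] and hi an aligned load covering [k+1, k+2]; the shuffle
     produces the misaligned pair [k, k+1]. */
  static __m128d realign_pair(__m128d lo, __m128d hi)
  {
      return _mm_shuffle_pd(lo, hi, 1);   /* take lo[1] and hi[0] */
  }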
Tomcatv (SpecFP 95): 1.7x speedup over modulo scheduling
Selective Vectorization
• Balance computation among resources
  • Minimize II when loop is modulo scheduled
• Carefully manage communication
  • Incorporate alignment information
  • Software pipelining hides latency
• Adapt a 2-cluster partitioning heuristic
  • [Fiduccia & Mattheyses '82]
  • [Kernighan & Lin '70]
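The sketch below gives the flavor of the 2-partition improvement idea in greatly simplified form: each vectorizable operation is tentatively moved between the scalar and vector partitions, and a move is kept when a resource-based II estimate improves. The Op fields, the toy machine model (2 scalar issue slots and 1 vector issue slot per cycle), and the greedy acceptance rule are all assumptions for illustration; the actual heuristic follows Fiduccia-Mattheyses, which also accepts temporarily worsening moves.

  #include <math.h>
  #include <stdbool.h>

  typedef struct { bool vectorizable; bool vector; } Op;

  /* Crude II estimate over the 2x-unrolled body: an unvectorized op
     issues twice (two scalar slots), a vectorized op issues once
     (one vector slot).  Divide by 2 to get II per original iteration. */
  static double estimated_ii(const Op *ops, int n)
  {
      double scalar = 0.0, vector = 0.0;
      for (int i = 0; i < n; i++) {
          if (ops[i].vector) vector += 1.0;
          else               scalar += 2.0;
      }
      double ii_scalar = scalar / 2.0;   /* 2 scalar issue slots per cycle */
      double ii_vector = vector / 1.0;   /* 1 vector issue slot per cycle  */
      return fmax(ii_scalar, ii_vector) / 2.0;
  }

  /* Greedy improvement pass: try moving each vectorizable op to the
     other partition and keep the move only if the estimate improves. */
  void selective_vectorize(Op *ops, int n)
  {
      double best = estimated_ii(ops, n);
      for (bool improved = true; improved; ) {
          improved = false;
          for (int i = 0; i < n; i++) {
              if (!ops[i].vectorizable) continue;
              ops[i].vector = !ops[i].vector;          /* trial move */
              double ii = estimated_ii(ops, n);
              if (ii + 1e-9 < best) { best = ii; improved = true; }
              else ops[i].vector = !ops[i].vector;     /* undo the move */
          }
      }
  }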
Selective Vectorization
[Partitioning diagram: the dataflow graph (LOAD, LOAD, MULT, ADD) with each operation assigned to the scalar or vector partition, annotated with a cost]
Cost Function
• Projected II due to resources (ResMII)
  • Bin-packing approach [Rau MICRO '94]
  • With some modifications
• Can ignore operation latency
  • Software pipelining hides latency
  • Vectorizable ops not on dependence cycles
Example (loop-carried dependence, distance 4):
for (i=0; i<N; i++) {
  X[i+4] = X[i];
}
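A simplified sketch of the resource-constrained bound (ResMII) that the cost function projects: for each resource class, divide the number of operations that need it by the number of available units, and take the maximum. Rau's bin-packing formulation also handles operations with alternative resource choices; here each operation is assumed to need exactly one class.

  /* uses[r]  = number of ops requiring resource class r
     units[r] = number of units of class r available per cycle */
  int res_mii(const int *uses, const int *units, int nclasses)
  {
      int ii = 1;                                          /* II is at least 1   */
      for (int r = 0; r < nclasses; r++) {
          int bound = (uses[r] + units[r] - 1) / units[r]; /* ceil(uses / units) */
          if (bound > ii)
              ii = bound;
      }
      return ii;
  }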
Evaluation
Compiler flow: C or Fortran → SUIF front-end (dependence analysis, dataflow optimization) → SUIF to Trimaran → selective vectorization → modulo scheduling → binary → simulation
• SUIF front-end
  • Dependence analysis
  • Dataflow optimization
• Trimaran back-end
  • Modulo scheduler
  • Register allocator
  • VLIW simulator
  • Added vector ops
Evaluation
• Operands communicated through memory
• Software responsible for realignment
Evaluation
• SpecFP 92, 95, 2000
  • Easier to extract dependence information
  • Detectable data parallelism
  • 64-bit data means vector length of 2
  • Considered amenable to vectorization & SWP
• Apply selective vectorization to DO loops
  • No control flow, no function calls
• Fully simulate with training sets
Selective Vectorization
[Speedup chart over modulo scheduling; benchmarks include tomcatv, mgrid, su2cor, swim]
Communication Support
• Transfer through memory
• Register to register copy
  • Uses fewer issue slots
  • Frees memory resources
• Shared register file
  • Vector elements addressable in scalar ops
  • Requires no extra issue slots
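An illustrative sketch of the first two transfer styles, moving one element of a vector register into a scalar computation. SSE2 intrinsics are assumed purely for illustration; the third option, a shared register file, has no intrinsic analog since the hardware would let scalar operations address vector elements directly.

  #include <emmintrin.h>

  double low_element_via_memory(__m128d v)
  {
      double buf[2];
      _mm_storeu_pd(buf, v);     /* store the vector ...                    */
      return buf[0];             /* ... then reload a scalar: extra memory
                                    traffic and issue slots                 */
  }

  double low_element_via_register(__m128d v)
  {
      return _mm_cvtsd_f64(v);   /* direct register-to-register transfer    */
  }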
Through Memory
[Speedup chart; benchmarks include tomcatv, mgrid, su2cor, swim]
Reg-to-Reg Transfer Support: 1.2x improvement
[Speedup chart; benchmarks include tomcatv, mgrid, su2cor, swim]
Shared Register File: 1.28x improvement
[Speedup chart; benchmarks include tomcatv, mgrid, su2cor, swim]
Related Work
• Traditional vectorization
  • Allen & Kennedy, Wolfe
• Software pipelining
  • Rau's iterative modulo scheduling
• Clustered VLIW
  • [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
  • Partitioning among clusters is similar
  • Ours is also an instruction selection problem
  • No dedicated communication resources
Conclusion
• Targeting all FUs improves performance
  • Selective vectorization
• Vectorization is better in the backend
  • Cost analysis more accurate
• Software pipeline vectorized loops
  • Good idea anyway
  • Facilitates selective vectorization
  • Hides communication and alignment latency