Design of a Parallel Vector Access Unit for SDRAM Memory Systems
Impulse Group
Department of Computer Science
University of Utah
Presented by Binu K. Mathew
Motivation
• Current microprocessors are very powerful, but ...
• Irregular applications still perform poorly
• Vectorizable loop
  • e.g.: for (i = 0; i < L * S; i += S) y[i] += a[i] * x[i];
• Poor cache utilization
• Cache pollution
• Poor bus utilization
• Access pattern may be predictable
• Memory system enhancements for vectors
• Vectors are back again
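The loop on the slide above can be written out as a complete function to make the access pattern explicit. This is a minimal sketch: the function name `strided_multiply_add`, the parameter names, and the `double` element type are assumptions for illustration and do not come from the slides. Each of the three arrays is walked as a vector of `n_elems` elements spaced `stride` elements apart, so with a large stride each cache line fetched tends to contribute only one useful element.

```c
#include <assert.h>
#include <stddef.h>

/* Strided multiply-accumulate, following the loop on the slide:
 *   for (i = 0; i < L * S; i += S) y[i] += a[i] * x[i];
 * n_elems plays the role of L and stride the role of S.
 * The double element type is an assumption; the slide does not give one. */
void strided_multiply_add(double *y, const double *a, const double *x,
                          size_t n_elems, size_t stride)
{
    assert(stride > 0); /* a zero stride would never terminate */
    for (size_t i = 0; i < n_elems * stride; i += stride) {
        y[i] += a[i] * x[i];
    }
}
```

Called with `n_elems = 4` and `stride = 2` on arrays of eight elements, the function updates only indices 0, 2, 4 and 6 of `y`.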
A Vector Memory Controller
• Handle both strided and normal accesses
• Fast scatter/gather
• Efficient cache-line fills
• New fast Parallel Vector Access Algorithm
• Scheduling heuristics
• Prototype implementation
The Serial Vector Access Problem
Vector = < Base Address, Stride, Length >
[Figure: memory banks 0 to 7]

The Serial Vector Access Problem
V = < 1024, 1, 16 >, cache-line fill
Addresses: 1024, 1025, 1026, 1027, ...
[Figure: banks 0 to 7 of a cache-line interleaved serial memory system]

The Serial Vector Access Problem
V = < 1024, 32, 16 >, strided access
Addresses: 1024, 1056, 1088, 1120, ...
[Figure: banks 0 to 7 of a cache-line interleaved serial memory system]
Parallel Vector Access
• Serial vector access: low throughput
• Exploit bank parallelism
• Exploit internal parallelism
• History of Parallel Vector Access
  • CVMS: Corbal, Espasa, Valero
    • Two to 15 cycle algorithm
    • Interconnect and crossbar
  • Module stride: Steven Moyer
• Our PVA Algorithm
  • Two to three cycles, merge on bus
  • Scalable
  • Word and block interleave
The PVA Problem
V = < 1024, 1, 16 >, cache-line fill
Addresses: 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039
[Figure: banks 0 to 7]
Bank Access Sequence: 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7

The PVA Problem : Stride 2
V = < 1024, 2, 16 >
Addresses: 1024 1026 1028 1030 1032 1034 1036 1038 1040 1042 1044 1046 1048 1050 1052 1054
[Figure: banks 0 to 7]
Bank Access Sequence: 0, 2, 4, 6, 0, 2, 4, 6, ...

The PVA Problem : Stride 3
V = < 1024, 3, 16 >
Addresses: 1024 1027 1030 1033 1036 1039 1042 1045 1048 1051 1054 1057 1060 1063 1066 1069
[Figure: banks 0 to 7]
Bank Access Sequence: 0, 3, 6, 1, 4, 7, 2, 5, 0, ...
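The bank access sequences on the three slides above are consistent with word interleaving across eight banks, where the bank number is the word address modulo the bank count. A minimal sketch under that assumption follows; the helper name `bank_sequence` and the constant `NUM_BANKS` are illustrative and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: word-interleaved memory, so bank = word address mod bank count.
 * The slides draw eight banks (0..7); NUM_BANKS is an illustrative name. */
enum { NUM_BANKS = 8 };

/* Fill out[0..length-1] with the bank hit by each element of the
 * vector <base, stride, length>. Hypothetical helper for illustration. */
void bank_sequence(unsigned base, unsigned stride, size_t length, unsigned *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < length; i++) {
        out[i] = (base + (unsigned)i * stride) % NUM_BANKS;
    }
}
```

For V = < 1024, 3, 16 > this yields 0, 3, 6, 1, 4, 7, 2, 5, 0, ..., matching the stride 3 slide. For stride 2 only the even banks are ever hit.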
Our PVA Solution
• Functions
  • FirstHit(V, b): Compute the first element of vector V that hits bank b
    Operations: table lookup, multiply or shift and add
  • NextHit(V.S): Compute the index increment to the next element that hits the same bank
    Operations: trivial PLA
• Bank Controller Algorithm
  • Compute i = FirstHit(V, b)
  • If there is no hit, continue
  • Until the end of the vector is reached, do:
    • Schedule an access to memory location V.B + i * V.S
    • i = i + NextHit(V.S)
• Scheduling heuristics
  • Early row open
  • Reordering and interleaving requests
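The bank controller algorithm above can be sketched in software as follows. This is not the hardware method from the slides: the slides compute FirstHit with a table lookup plus a multiply or shift and add, and NextHit with a small PLA, whereas this sketch swaps in a short linear search and a gcd. It also assumes word interleaving, so an address maps to bank `address % num_banks`. The names `Vector`, `first_hit`, `next_hit`, and `bank_accesses` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Vector descriptor <Base Address, Stride, Length>, as on the slides. */
typedef struct {
    unsigned base;   /* V.B: word address of element 0 */
    unsigned stride; /* V.S: distance between elements, in words */
    unsigned length; /* number of elements */
} Vector;

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* NextHit: index increment between consecutive elements that hit the
 * same bank. Assumption: word interleaving, which gives
 * num_banks / gcd(stride, num_banks). */
unsigned next_hit(unsigned stride, unsigned num_banks)
{
    assert(num_banks > 0);
    return num_banks / gcd(stride, num_banks);
}

/* FirstHit: index of the first element that hits `bank`, or -1 if the
 * vector never hits it. Linear search stands in for the table lookup. */
int first_hit(const Vector *v, unsigned bank, unsigned num_banks)
{
    unsigned delta = next_hit(v->stride, num_banks);
    for (unsigned i = 0; i < delta && i < v->length; i++) {
        if ((v->base + i * v->stride) % num_banks == bank) {
            return (int)i;
        }
    }
    return -1;
}

/* Bank controller loop: store the addresses this bank must access in
 * out (at most out_cap of them) and return how many were stored. */
size_t bank_accesses(const Vector *v, unsigned bank, unsigned num_banks,
                     unsigned *out, size_t out_cap)
{
    int first = first_hit(v, bank, num_banks);
    if (first < 0) {
        return 0; /* no hit: this bank ignores the vector */
    }
    unsigned delta = next_hit(v->stride, num_banks);
    size_t n = 0;
    for (unsigned i = (unsigned)first; i < v->length && n < out_cap; i += delta) {
        out[n++] = v->base + i * v->stride; /* V.B + i * V.S */
    }
    return n;
}
```

For V = < 1024, 2, 16 > with eight banks, `next_hit` returns 4, the even banks 0, 2, 4 and 6 get first hits at indices 0, 1, 2 and 3, and the odd banks get no hit. Because each bank evaluates these functions on its own, all banks can work on the same vector in parallel.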
Our PVA Solution : Stride 2
V = < 1024, 2, 16 >
Addresses: 1024 1026 1028 1030 1032 1034 1036 1038 1040 1042 1044 1046 1048 1050 1052 1054
[Figure: banks 0 to 7]
FirstHit, banks 0 to 7: Hit 0, No Hit, Hit 1, No Hit, Hit 2, No Hit, Hit 3, No Hit
NextHit: δ = 4 for each of the four banks that are hit
Hardware Prototype
• Verilog model: approximately 3600 lines of code
• Timing estimates with FPGA, gate-level simulation
• Hardware cost per bank controller
  • Approximately 11000 gates
  • 2K bytes on-chip RAM
• Target
  • CPU: R10000
  • L2 cache-line size: 128 bytes
  • System bus: 64 bits wide, split transaction
  • Four outstanding memory requests
  • Memory: 256 Mbit Micron SDRAM at 100 MHz
    • 32 bits x 16 banks
    • Word interleaved
PVA Implementation
[Block diagram: Register File, Access Scheduler, Vector Contexts, SDRAM Interface, Request FIFO, FirstHit Predict, FirstHit Calculate, Vector Bus, Staging Unit, SDRAM Bus]
Performance Evaluation
• Kernels
• 240 data points
• Compared PVA with:
  • Cache-line interleaved SDRAM: 64 bits wide, burst length of 16
  • Scatter/gather serial SDRAM: 32 bits wide, 16 banks, overlapped RAS and precharge
  • Parallel Vector Access SRAM: 32 bits wide, 16 banks, single-cycle latency, pipelined SRAM
Results : Strided Access
[Chart; labeled values: 10.86, 32.69]
Future Work
• Full program simulation
• Integration with virtual memory
• Parallel access techniques for other patterns
  • Indirect vectors
  • FFT
• Impulse ASIC
Summary
• New and improved PVA Algorithm
  • Up to five times improvement over older method
  • Speedups in the range 1.0 to 32.8
  • Technique for block interleaving
  • Scalable
• Hardware prototype designed
  • Moderate hardware complexity