Design of a Parallel Vector Access Unit for SDRAM Memory Systems

Impulse Group, Department of Computer Science, University of Utah
Presented by Binu K. Mathew

Motivation
• Current microprocessors are very powerful, but ...
• Irregular applications still perform poorly
• Vectorizable loop
  • e.g.: for (i = 0; i < L * S; i += S) y[i] += a[i] * x[i];
• Poor cache utilization
• Cache pollution
• Poor bus utilization
• Access pattern may be predictable
• Memory system enhancements for vectors
• Vectors are back again

A Vector Memory Controller
• Handle both strided and normal accesses
• Fast scatter/gather
• Efficient cache-line fills
• New fast Parallel Vector Access Algorithm
• Scheduling heuristics
• Prototype implementation

The Serial Vector Access Problem
Vector = <Base Address, Stride, Length>
[Figure: memory banks numbered 0 to 7]
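The vector tuple above can be written down as a small descriptor. The following is a minimal sketch, assuming addresses and strides are counted in words; the names `vec_desc` and `vec_elem_addr` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor for V = <Base Address, Stride, Length>. */
typedef struct {
    uint32_t base;    /* address of element 0 (assumed to be a word address) */
    uint32_t stride;  /* distance between consecutive elements, in words */
    uint32_t length;  /* number of elements in the vector */
} vec_desc;

/* Address of element i of the vector: base + i * stride. */
static uint32_t vec_elem_addr(const vec_desc *v, uint32_t i)
{
    assert(i < v->length);  /* the index must lie inside the vector */
    return v->base + i * v->stride;
}
```

For V = <1024, 32, 16>, elements 0 through 3 fall at 1024, 1056, 1088 and 1120.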

The Serial Vector Access Problem
V = <1024, 1, 16>, cache-line fill
Element addresses: 1024, 1025, 1026, 1027, ...
[Figure: cache-line interleaved serial memory system, banks 0 to 7]

The Serial Vector Access Problem
V = <1024, 32, 16>, strided access
Element addresses: 1024, 1056, 1088, 1120, ...
[Figure: cache-line interleaved serial memory system, banks 0 to 7]
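To see why the strided case is costly on a cache-line interleaved system, it helps to count how many cache lines a vector touches. The sketch below is illustrative only: it assumes word addresses, a positive stride, and a caller-supplied line size. The helper names and the 32-word line used in the note after the code (a 128-byte line of 4-byte words) are assumptions, not figures taken from these slides.

```c
#include <assert.h>
#include <stdint.h>

/* Bank holding a word address when whole cache lines rotate across banks.
   words_per_line is an assumption supplied by the caller. */
static uint32_t line_interleaved_bank(uint32_t word_addr,
                                      uint32_t words_per_line,
                                      uint32_t banks)
{
    assert(words_per_line > 0 && banks > 0);
    return (word_addr / words_per_line) % banks;
}

/* Count the cache lines touched by <base, stride, length>.
   Assumes a positive stride, so addresses only increase and a line
   that has been left is never revisited. */
static uint32_t lines_touched(uint32_t base, uint32_t stride, uint32_t length,
                              uint32_t words_per_line)
{
    assert(words_per_line > 0);
    uint32_t count = 0;
    uint32_t prev_line = 0;
    for (uint32_t i = 0; i < length; i++) {
        uint32_t line = (base + i * stride) / words_per_line;
        if (i == 0 || line != prev_line) {
            count++;            /* first element, or a new line was entered */
            prev_line = line;
        }
    }
    return count;
}
```

Under the 32-word assumption, V = <1024, 1, 16> stays within one line, while V = <1024, 32, 16> touches 16 different lines to deliver 16 useful words.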

Parallel Vector Access
• Serial vector access: low throughput
• Exploit bank parallelism
• Exploit internal parallelism
• History of Parallel Vector Access
  • CVMS: Corbal, Espasa, Valero
    • Two to 15 cycle algorithm
    • Interconnect and crossbar
  • Module stride: Steven Moyer
• Our PVA Algorithm
  • Two to three cycles, Merge on Bus
  • Scalable
  • Word and Block interleave

Memory Organization

The PVA Problem
V = <1024, 1, 16>, cache-line fill
Element addresses: 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039
Bank Access Sequence: 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
[Figure: banks 0 to 7]

The PVA Problem : Stride 2
V = <1024, 2, 16>
Element addresses: 1024 1026 1028 1030 1032 1034 1036 1038 1040 1042 1044 1046 1048 1050 1052 1054
Bank Access Sequence: 0, 2, 4, 6, 0, 2, 4, 6, ...
[Figure: banks 0 to 7]

The PVA Problem : Stride 3
V = <1024, 3, 16>
Element addresses: 1024 1027 1030 1033 1036 1039 1042 1045 1048 1051 1054 1057 1060 1063 1066 1069
Bank Access Sequence: 0, 3, 6, 1, 4, 7, 2, 5, 0, ...
[Figure: banks 0 to 7]
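The bank access sequences on these slides are consistent with a word-interleaved mapping in which the bank is the word address modulo the number of banks. A minimal sketch under that assumption follows; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Word-interleaved mapping: consecutive word addresses rotate across banks. */
static uint32_t word_interleaved_bank(uint32_t word_addr, uint32_t banks)
{
    assert(banks > 0);
    return word_addr % banks;
}

/* Fill out[0..length-1] with the bank hit by each element of
   the vector <base, stride, length>. */
static void bank_sequence(uint32_t base, uint32_t stride, uint32_t length,
                          uint32_t banks, uint32_t *out)
{
    for (uint32_t i = 0; i < length; i++)
        out[i] = word_interleaved_bank(base + i * stride, banks);
}
```

With 8 banks, stride 2 from base 1024 visits only the even banks, while stride 3 visits all eight banks before repeating.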

Our PVA Solution
• Functions
  • FirstHit(V, b): Compute the first vector element of V that hits bank b
    Operations: table lookup, multiply or shift and add
  • NextHit(V.S): Compute the incremental index of the next element
    Operations: trivial PLA
• Bank Controller Algorithm
  • Compute i = FirstHit(V, b)
  • If there is no hit, continue
  • Until the end of the vector is reached, do:
    • Schedule access to memory location V.B + i * V.S
    • i = i + NextHit(V.S)
• Scheduling heuristics
  • Early row open
  • Reordering and interleaving requests
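The bank controller algorithm above can be sketched in software as follows. This is an illustration, not the hardware design: it assumes word interleaving (bank = address mod number of banks), replaces the table lookup with a plain linear search, and computes NextHit as banks / gcd(stride, banks). All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NO_HIT UINT32_MAX  /* sentinel: no element of V maps to this bank */

static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) { uint32_t t = a % b; a = b; b = t; }
    return a;
}

/* FirstHit(V, b): smallest index i whose address maps to bank, or NO_HIT.
   Linear search stand-in for the table lookup named on the slide; a first
   hit, if any, always occurs within the first `banks` elements. */
static uint32_t first_hit(uint32_t base, uint32_t stride, uint32_t length,
                          uint32_t banks, uint32_t bank)
{
    assert(banks > 0 && bank < banks);
    for (uint32_t i = 0; i < length && i < banks; i++)
        if ((base + i * stride) % banks == bank)
            return i;
    return NO_HIT;
}

/* NextHit(S): index distance between consecutive hits on the same bank. */
static uint32_t next_hit(uint32_t stride, uint32_t banks)
{
    assert(banks > 0);
    return banks / gcd_u32(stride % banks, banks);
}

/* One bank controller: record the addresses this bank would schedule.
   Returns how many addresses were written to addrs. */
static uint32_t bank_controller(uint32_t base, uint32_t stride, uint32_t length,
                                uint32_t banks, uint32_t bank,
                                uint32_t *addrs, uint32_t max_addrs)
{
    uint32_t n = 0;
    uint32_t i = first_hit(base, stride, length, banks, bank);
    if (i == NO_HIT)
        return 0;                        /* no hit: this bank stays idle */
    uint32_t delta = next_hit(stride, banks);
    while (i < length && n < max_addrs) {
        addrs[n++] = base + i * stride;  /* schedule access to V.B + i * V.S */
        i += delta;                      /* i = i + NextHit(V.S) */
    }
    return n;
}
```

Because each bank derives its own element indices from V alone, the banks can work on the same vector request independently of one another.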

Our PVA Solution : Stride 2
V = <1024, 2, 16>
Element addresses: 1024 1026 1028 1030 1032 1034 1036 1038 1040 1042 1044 1046 1048 1050 1052 1054

Bank      0       1       2       3       4       5       6       7
FirstHit  Hit, 0  No Hit  Hit, 1  No Hit  Hit, 2  No Hit  Hit, 3  No Hit
NextHit   δ=4             δ=4             δ=4             δ=4
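The hit pattern in this example can also be derived arithmetically rather than by search. The sketch below assumes word interleaving (bank = address mod number of banks) and uses a standard modular-arithmetic argument: a bank is hit only if gcd(stride, banks) divides its offset from the base bank, and the first index follows from a modular inverse. It is a worked illustration with hypothetical function names, not the table-lookup hardware described in the talk, and the caller still has to check that the returned index is below the vector length.

```c
#include <assert.h>
#include <stdint.h>

#define NO_HIT UINT32_MAX  /* sentinel: the bank is never hit by this vector */

static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) { uint32_t t = a % b; a = b; b = t; }
    return a;
}

/* Smallest x in [0, m) with (s * x) % m == 1 % m; s and m must be coprime.
   Brute force is adequate because m is at most the number of banks. */
static uint32_t mod_inverse(uint32_t s, uint32_t m)
{
    for (uint32_t x = 0; x < m; x++)
        if ((s * x) % m == 1u % m)
            return x;
    return NO_HIT;  /* not reached when gcd(s, m) == 1 */
}

/* Index of the first element of <base, stride, ...> that maps to bank,
   ignoring the vector length, or NO_HIT if no element ever does. */
static uint32_t first_hit_index(uint32_t base, uint32_t stride,
                                uint32_t banks, uint32_t bank)
{
    assert(banks > 0 && bank < banks);
    uint32_t s = stride % banks;
    uint32_t g = gcd_u32(s, banks);
    uint32_t offset = (bank + banks - base % banks) % banks;
    if (offset % g != 0)
        return NO_HIT;             /* bank unreachable with this stride */
    uint32_t m = banks / g;        /* hits on one bank recur every m elements */
    return ((offset / g) * mod_inverse(s / g, m)) % m;
}
```

For V = <1024, 2, 16> on 8 banks, this yields indices 0, 1, 2 and 3 for banks 0, 2, 4 and 6, and no hit for the odd banks, matching the table above; the recurrence m equals the δ=4 shown there.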


Hardware Prototype
• Verilog model: approximately 3600 lines of code
• Timing estimates with FPGA, gate level simulation
• Hardware cost per bank controller
  • Approximately 11000 gates
  • 2K bytes on-chip RAM
• Target
  • CPU: R10000
  • L2 cache-line size: 128 bytes
  • System bus: 64 bits wide, split transaction
  • Four outstanding memory requests
  • Memory: 256 Mbit Micron SDRAM at 100 MHz
    • 32 bits x 16 banks
    • Word interleaved

PVA Implementation
[Block diagram components: Request FIFO, Register File, Vector Contexts, FirstHit Predict, FirstHit Calculate, Access Scheduler, Staging Unit, SDRAM Interface, Vector Bus, SDRAM Bus]

Performance Evaluation
• Kernels
• 240 data points
• Compared PVA with:
  • Cache-line interleaved SDRAM: 64 bits wide, burst length of 16
  • Scatter/Gather serial SDRAM: 32 bits wide, 16 banks, overlapped RAS and precharge
  • Parallel Vector Access SRAM: 32 bits wide, 16 banks, single cycle latency, pipelined SRAM

Results : Cache-line fills

Results : Strided Access
[Chart not reproduced; labeled values: 10.86, 32.69]

Results : SRAM Comparison

Future Work
• Full program simulation
• Integration with virtual memory
• Parallel access techniques for other patterns
  • Indirect vectors
  • FFT
• Impulse ASIC

Summary
• New and improved PVA Algorithm
  • Up to five times improvement over older method
  • Speedups in the range 1.0 to 32.8
  • Technique for block interleaving
  • Scalable
• Hardware prototype designed
  • Moderate hardware complexity
