Memory Controller V2 Carlos González
MCV2 - Overview
• Per-client interface: (readrequest or writedata) from clients, (state + readdata) to clients.
• Direct link with clients (two signals per client).
• Global Request Buffer (RB): allocates entries and splits/distributes requests. By default it is shared by all clients (special mode: pseudo "split" per ROP, simulated using counters).
• Splitter/Distributor (Ch0 … Ch7): splitting/distribution is done when allocating a request in the RB.
• Per-bank structure (B0 … B7 per channel, select oldest first): allows out-of-order processing among banks (not required by the baseline scheduler).
• Channel Schedulers protocol: ChannelRequest, ChannelReply, Channel State (per bank).
• ChannelScheduler 0 … ChannelScheduler 7. Scheduler variants: 1R+1W Fifo, nRW Fifos, nR+nW Fifos, nRW Lookahead buffers.
• GDDR3/4 protocol: ChipRequest and ChipReply to/from GDDR3Chip0 … GDDR3Chip7.
Simulate a split RB with counters
• Only possible for ROP clients, and only when ROPs are assigned to specific channels.
• TextureUnits and other clients keep using the RB as shared.
• Splitting the RB for these clients requires a more complex solution than counters, very likely a real RB split.
ROPs distributed among channels
• ropCounter[nROPS]
• Each ropCounter[i] counts the transactions currently allocated by ROP i.
• If ropCounter[i] reaches RBsize/nROPS, the MC sends an MT_NONE token to ROP i (no more transactions accepted).
• Units other than ROPs only take RBsize into account to decide whether more transactions are accepted.
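The counter scheme above can be sketched as follows. This is a minimal illustration, not the simulator's code: the class name, the `MT_ACCEPT` token, the `other_count` field, and the extra check that the shared RB is not already full are all assumptions; only `ropCounter`, `RBsize/nROPS`, and `MT_NONE` come from the slide.

```python
MT_NONE = "MT_NONE"      # token named in the slide: no more transactions accepted
MT_ACCEPT = "MT_ACCEPT"  # hypothetical name for the "can accept" state


class SplitRBCounters:
    """Simulates a per-ROP split of a shared Request Buffer using counters."""

    def __init__(self, rb_size, n_rops):
        self.rb_size = rb_size
        self.quota = rb_size // n_rops   # RBsize / nROPS
        self.rop_counter = [0] * n_rops  # ropCounter[nROPS]
        self.other_count = 0             # entries held by non-ROP clients (assumed)

    def used(self):
        return sum(self.rop_counter) + self.other_count

    def rop_state(self, i):
        # A ROP is stopped when it reaches its quota. The "RB full" check is an
        # assumption: other clients share the same physical buffer.
        if self.rop_counter[i] >= self.quota or self.used() >= self.rb_size:
            return MT_NONE
        return MT_ACCEPT

    def other_state(self):
        # Non-ROP units only look at the total RB size.
        return MT_NONE if self.used() >= self.rb_size else MT_ACCEPT

    def allocate_rop(self, i):
        if self.rop_state(i) == MT_NONE:
            return False
        self.rop_counter[i] += 1
        return True

    def release_rop(self, i):
        if self.rop_counter[i] == 0:
            raise ValueError("ROP has no allocated transactions")
        self.rop_counter[i] -= 1
```

With `rb_size=8` and `n_rops=4`, each ROP may hold at most two entries, while a texture unit would keep allocating until all eight entries are in use.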
GPU Architecture overview
[Figure: block diagram. Pipeline stages: VFetch, PA, Clipping, Triangle Setup, Rasterization, HZ. Scheduler (frags/vtxs) with four uShader + TU units. Distributor feeding C-ROP0 … C-ROP3 and Z-ROP0 … Z-ROP3. MC XBAR connecting to MC0 … MC3, with GDDR0 … GDDR7 behind them, two chips per MC.]
GPU Architecture overview
[Figure: same block diagram with eight uShader + TU units. Pipeline stages: VFetch, PA, Clipping, TSetup, Rast, HZ. Scheduler (frags/vtxs), Distributor, C-ROP0 … C-ROP3, Z-ROP0 … Z-ROP3, MC XBAR, MC0 … MC3, GDDR0 … GDDR7.]
Basic RB connection to channels
[Figure: RequestBuffer i exchanges traffic with ROP i (C-ROP0, Z-ROP0) and with the crossbar clients. Requests are reserved and enqueued into one FIFO per channel (CH0, CH1), with payloads held in the Data Buffers. Each channel scheduler (CS0, CS1, that is ChannelScheduler i0 and i1) takes the next CT from its FIFO and drives its own chip (GDDR0, GDDR1, that is GDDR3 Chip i0 and i1).]
RB connection to channels for schedulers with independent queues per bank
[Figure: RequestBuffer i exchanges traffic with ROP i (C-ROP0, Z-ROP0) and with the crossbar clients. It holds the in-flight MTs and their payload. For each channel (CH0, CH1) there is one oldest-first FIFO per bank (B0 … B7), and a "Select (oldest first)" stage passes the next CT to the channel scheduler (CS0, CS1, that is Channel Scheduler i0 and i1). Requests are reserved and enqueued through the Data Buffers. Each scheduler drives its own chip (GDDR0, GDDR1, that is GDDR3 Chip i0 and i1).]
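The per-bank queues with oldest-first selection can be sketched as below. This is only an illustration of the idea in the figure: the class, the sequence-number tagging, and the `bank_ready` predicate (standing in for the per-bank channel state) are hypothetical, not taken from the simulator.

```python
from collections import deque


class PerBankQueues:
    """One FIFO per bank; selection picks the oldest head among ready banks."""

    def __init__(self, n_banks=8):
        self.queues = [deque() for _ in range(n_banks)]
        self.next_seq = 0  # arrival order, used as the "age" of a transaction

    def enqueue(self, bank, payload):
        self.queues[bank].append((self.next_seq, payload))
        self.next_seq += 1

    def select_oldest(self, bank_ready=lambda bank: True):
        # Only queue heads compete, so order inside a bank is preserved while
        # different banks can be served out of order.
        candidates = [
            (queue[0][0], bank)
            for bank, queue in enumerate(self.queues)
            if queue and bank_ready(bank)
        ]
        if not candidates:
            return None
        _, bank = min(candidates)
        _, payload = self.queues[bank].popleft()
        return bank, payload
```

If the oldest transaction targets a bank that is not ready, a younger transaction for another bank is selected instead, which is the out-of-order behaviour among banks that a single FIFO per channel cannot provide.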
MC Architecture
[Figure: clients ZStencil0 … ZStencil3 and Color0 … Color3, plus TXT and Others (low traffic) through the Crossbar, connect to MC0 (RB0) … MC3 (RB3). Below the MCs are channel schedulers CS0 … CS7 and DRAM0 … DRAM7.]
1R+1W_Fifo Scheduler
[Figure: the next CT arrives from the Data Buffer into one queue of read transactions and one queue of write transactions. Example entries: b=5 r=9 c=0 sz=96 and b=3 r=4 c=8 sz=96. A CAM and per-queue checks (ready?, dep?...) feed a Select stage (based on operating mode logic & GDDR state). Remaining blocks: DataWrite Buffer, DataRead Buffer, GDDR CMD generator, GDDR state. Connections: to GDDR data pins, to GDDR command pins, from GDDR data pins, to/from Data Buffer.]
Generic per bank scheduler
[Figure: the next CT arrives from the Data Buffers into per-bank structures BS0 … BS7, followed by a Channel Transaction Selector + PAM logic (configurable). Remaining blocks: DataWrite Buffer, DataRead Buffer, GDDR CMD generator, GDDR state. Connections: to GDDR data pins, to GDDR command pins, from GDDR data pins, to/from Data Buffer.]
nRWBank
[Figure: Channel Transactions arrive from the write buffers; read data goes to the read buffers. Bank queues (BQ1 … BQ7 labelled) hold mixed W and R entries, for example "W r? r=9 c=0 sz=96" and "R r? r=9 c=0 sz=96". They feed a Channel Transaction Selector + PAM logic (configurable). Remaining blocks: DataWrite Buffer, DataRead Buffer, GDDR CMD generator, GDDR state. Connections: to GDDR data pins, to GDDR command pins, from GDDR data pins.]
nR+nWBank
[Figure: Channel Transactions arrive from the write buffers; read data goes to the read buffers. Each bank queue (BQ1 … BQ7 labelled) has a separate WQ and RQ plus CAM logic. They feed a Channel Transaction Selector + PAM logic (configurable). Remaining blocks: DataWrite Buffer, DataRead Buffer, GDDR CMD generator, GDDR state. Connections: to GDDR data pins, to GDDR command pins, from GDDR data pins.]
Time Diagram comparing FIFO (FIFO_4) vs. Baseline (1R+1W_4)
[Figure: two GDDR timing diagrams with rows CK / CK#, COMMAND, ADDRESS, RDQS, WRQS, DQ.
• FIFO_4: five interleaved RD and WR commands to (bank, col) a … e, with tick marks from T0 to T41. Annotations: tWRT = 6, CL = 8, WL = 3 (each shown twice).
• 1R+1W_4: commands WR a, WR c, RD b, RD d, RD e, with tick marks from T0 to T23. Annotations: tWRT = 6, WL = 3, CL = 8.]
Time Diagram comparing FIFO (FIFO_4) vs. Baseline (1R+1W_4)
[Figure: the same comparison with the data bus detailed.
• FIFO_4: five interleaved RD and WR commands to bank X, col a … e, with tick marks from T0 to T41. DQ carries DIa, DOb, DIc, DOd, DOe. Annotations: tRtW = 2, tWRT = 6, CL = 8, WL = 3.
• 1R+1W_4: commands WR a, WR c, RD b, RD d, RD e, with tick marks from T0 to T25. Annotations: tWRT = 6, CL = 8, WL = 3.]
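A rough way to see why grouping writes and reads helps is to count data-bus turnarounds. The sketch below is not a cycle-accurate model: it only charges tWRT for each write-to-read switch and tRtW for each read-to-write switch, using the values annotated in the diagrams. The FIFO arrival order (WR a, RD b, WR c, RD d, RD e) is an assumption inferred from the DQ labels, and the function name is hypothetical.

```python
T_WRT = 6  # write-to-read turnaround, as annotated in the diagram
T_RTW = 2  # read-to-write turnaround, as annotated in the diagram


def turnaround_penalty(ops):
    """Return (number of direction switches, summed turnaround cycles)."""
    switches = 0
    penalty = 0
    for prev, cur in zip(ops, ops[1:]):
        if prev == cur:
            continue  # same direction, no bus turnaround
        switches += 1
        penalty += T_WRT if prev == "WR" else T_RTW
    return switches, penalty


# Assumed arrival order, issued as-is by the FIFO scheduler.
fifo_order = ["WR", "RD", "WR", "RD", "RD"]
# Order issued by the 1R+1W baseline: writes first, then reads.
grouped_order = ["WR", "WR", "RD", "RD", "RD"]

print(turnaround_penalty(fifo_order))
print(turnaround_penalty(grouped_order))
```

Under these assumptions the FIFO order switches direction three times and the grouped order once. The diagrams also include CL and WL latencies, which this count ignores.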
ACT - CAS - PRE
[Figure: two GDDR timing diagrams with rows CK / CK#, COMMAND, ADDRESS, RDQS, WRQS, DQ, and tick marks from T0 to T54.
• First diagram: commands RD, PRE, ACT, RD, PRE, ACT, ACT. DQ carries DOa, DOb. Annotations: tRRD = 9, RP = 9, tRCD = 13, CL = 8.
• Second diagram: commands RD, PRE, ACT, RD, PRE, RD, ACT, ACT on banks A and B. DQ carries DOa, DOb, DOc. Annotations: tRRD = 9, RP = 9, tRCD = 13, CL = 8.]
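The cost of the PRE, ACT, RD sequence can be estimated from the parameters annotated in the diagrams. The sketch below is a simplified first-order model: it ignores tRRD, tRAS, and command-bus contention, and the function name and the three-case split are illustrative assumptions rather than the simulator's logic.

```python
T_RP = 9    # precharge time (RP = 9 in the diagram)
T_RCD = 13  # ACT to RD/WR delay
CL = 8      # CAS (read) latency


def read_latency(open_row, target_row):
    """Cycles from the first command until read data appears on DQ.

    open_row is the row currently held in the bank's row buffer,
    or None if the bank is precharged (no row open).
    """
    if open_row == target_row:
        return CL                 # row hit: RD only
    if open_row is None:
        return T_RCD + CL         # bank closed: ACT, then RD
    return T_RP + T_RCD + CL      # row conflict: PRE, ACT, then RD


print(read_latency(open_row=5, target_row=5))
print(read_latency(open_row=None, target_row=5))
print(read_latency(open_row=4, target_row=5))
```

With these values a row conflict costs 30 cycles against 8 for a row hit, which is why overlapping PRE and ACT on one bank with accesses to another bank matters.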
GDDR3 DRAM architecture
[Figure: GDDR3 simplified design. Control Logic with inputs CKE, CK, CK#, CS#, RAS#, CAS#, WE# and a CMD Decode block. Address Register (address bits) and Bank Control Logic. Eight banks (Bank 0 … Bank 7), each with a Row Address Latch & Decoder, a Memory Array, and sense amplifiers acting as the row buffer (Row buffer 0 shown for Bank 0). Column Decoder and Data (DDR) path.]
Shared VS Distributed
[Figure: GPU block diagram with a shared MC XBAR. Pipeline stages: VFetch, PA, Clipping, TSetup, Rast, HZ. Scheduler (frags/vtxs) with eight uShader + TU units, Distributor, C-ROP0 … C-ROP3, Z-ROP0 … Z-ROP3, MC0 … MC3, GDDR0 … GDDR7.]
Shared VS Distributed
[Figure: the same GPU block diagram with the MC XBAR replaced by an interconnection network (RING, MESH…). Pipeline stages: VFetch, PA, Clipping, TSetup, Rast, HZ. Scheduler (frags/vtxs) with eight uShader + TU units, Distributor, C-ROP0 … C-ROP3, Z-ROP0 … Z-ROP3, MC0 … MC3, GDDR0 … GDDR7.]