
High Speed Cache


Presentation Transcript


  1. Note to future viewers of these slides: all yellow text boxes accompanied by arrows were added after the presentation, at Prof. Calhoun’s request, to make the slides easier to follow without the oral presentation. High Speed Cache. For: PICo Board Proposal. By: Team XOR

  2. Outline
      • Architecture
          • How we set up our memory
      • Decoding
          • Various techniques we implemented and tested to optimize decoding
      • Layout
          • Layout techniques we implemented to minimize area
      • Simulations
          • Functionality of our SRAM cache
      • Presentation of Metric

  3. Problem
      • Need a high-speed cache that also uses minimal area and energy
     Approach
      • Row decoding is the worst-case path → decode as fast as possible
      • Use a compact layout to minimize total area and decrease parasitics

  4. Block diagram of our entire memory. We walked through how the memory works (a write followed by a read), briefly described each component shown, and explained where the address bits come from and what each bit is used for. Most components are typical for a memory; the exception is the TXGate Control, which we introduced here and cover later in the presentation.

  5. BL/BLB/PRECH Generator

  6. [Waveform plot. Signals: WL127, WL0, WRITE, DATA, PRECH, SAE, Q2, Q2B, BL, BLB, out0. Annotations: fight between bitcell data Q and BL; reading a 1; reading a 0.]

  7. Addition of TXGates to disconnect the BL/BLB drivers from BL/BLB, allowing BL/BLB to float during a read. [Schematic labels: To Block, To Block]

  8. [Waveform plot. Signals: SAE, PRECH, Q2, Q2B, Q1, Q1B, BL, BLB, out0. Annotations: only a small bump in Q with the addition of TXGateControl; reading a 0; reading a 1.]

  9. Bit Cell Ratio Testing. [Schematic panels: Testing Pull-up Ratio; Cell Ratio. Labels: WPULLUP, WPULLDOWN, WPASSGATE, Vm.] Want: Q as close to VT as possible, but the device sizes get too large, so we looked for a point between VT and VM. WPULLUP = 180 nm; WPULLDOWN = 240 nm; WPASSGATE = 200 nm
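
The widths on this slide can be turned into the two ratios the slide names. This is a minimal sketch that assumes the common textbook definitions (cell ratio = pull-down width over pass-gate width, pull-up ratio = pull-up width over pass-gate width) and equal channel lengths for all devices; the slides do not state which definitions the team used, and the function names are hypothetical.

```python
# Transistor widths from the slide, in nanometres.
W_PULLUP_NM = 180
W_PULLDOWN_NM = 240
W_PASSGATE_NM = 200


def cell_ratio(w_pulldown: float, w_passgate: float) -> float:
    """Assumed definition: pull-down width / pass-gate width (equal lengths)."""
    return w_pulldown / w_passgate


def pullup_ratio(w_pullup: float, w_passgate: float) -> float:
    """Assumed definition: pull-up width / pass-gate width (equal lengths)."""
    return w_pullup / w_passgate


cr = cell_ratio(W_PULLDOWN_NM, W_PASSGATE_NM)
pr = pullup_ratio(W_PULLUP_NM, W_PASSGATE_NM)
print(f"cell ratio = {cr:.2f}, pull-up ratio = {pr:.2f}")
```

Under these assumptions a larger cell ratio favors read stability and a smaller pull-up ratio favors writability, which is the tension the slide describes.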

  10. Thin Cell Advantages: Smallest possible area of 6T Bit Cell, Can be mirrored (saves area by reducing the distance between n-wells and p-wells)

  11. 2-by-2 Array of Thin Cells: Layout. Advantages: WLs are horizontal, VDD/VSS/BL/BLB are vertical, Mirrored Thin Cells save area and make it easy to add N/P Taps, Easy to cascade to other 2-by-2 Arrays

  12. Peripheral Logic
      • Problem: Needed to generate signals such as precharge, prechargebar, etc. from the given inputs
      • We created the following signals:
          • Local Write Signals
          • Local Read Signals
          • Precharge
          • Prechargebar
          • Txcontrol
          • Txcontrolbar
          • SenseAmp Enable (already localized)

  13. Decoding
      • Need: High Speed
          • Decoding the proper row location is on the critical path
      • Considered numerous options
          • Static
          • Dynamic

  14. Decoding
      • Based on our architecture, need to decode 10 bits into the proper Word Line
          • 8 blocks → 3 block select bits
          • 128 rows/block → 7 row select bits
      • 3-level decoder
          • Predecoders
          • AND combinations of predecoded bits to generate global word line
          • Local word line generation by ANDing global word line with block select
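
The address split and the final gating step on this slide can be sketched behaviorally. This is only an illustration: it assumes the block select bits are the upper three address bits (9 to 7) and the row select bits are the lower seven (6 to 0), as the next slide suggests, and the function names are hypothetical.

```python
BLOCK_BITS = 3  # 8 blocks
ROW_BITS = 7    # 128 rows per block


def split_address(addr: int) -> tuple[int, int]:
    """Split a 10-bit address into (block, row); bit positions are assumed."""
    if not 0 <= addr < (1 << (BLOCK_BITS + ROW_BITS)):
        raise ValueError("address must fit in 10 bits")
    block = addr >> ROW_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return block, row


def local_word_lines(addr: int) -> list[list[int]]:
    """Return an 8 x 128 matrix of local word lines for one address.

    A local word line is the AND of its global word line (one-hot over
    rows) with its block select (one-hot over blocks).
    """
    block, row = split_address(addr)
    global_wl = [int(r == row) for r in range(1 << ROW_BITS)]
    block_sel = [int(b == block) for b in range(1 << BLOCK_BITS)]
    return [[g & s for g in global_wl] for s in block_sel]


lwl = local_word_lines(0b101_0000100)  # block 5, row 4
print(sum(map(sum, lwl)))              # number of asserted local word lines
```

Exactly one of the 1024 local word lines is asserted for any valid address, which is the property the 3-level decoder has to preserve.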

  15. Decoding
      • The critical path requires decoding bits 6 to 0 into the proper row (0 through 127)
      • Thus, we chose to implement this part as a dynamic decoder
      • The 3 to 8 block decode occurs in parallel and finishes much sooner (it only needs to generate 8 signals), so we can conserve power without affecting delay by using a 1-hot static decoder
      [Figure: address bits 9 8 7 6 5 4 3 2 1 0 feeding the Static 3 to 8 Block Decoder, the Dynamic 3 to 8 Predecoder, and the Dynamic 4 to 16 Predecoder]

  16. Decoding
      • Dynamic decoding of the 7 row select bits
      • 2 predecoders:
          • 4 to 16 (upper bits)
          • 3 to 8 (lower bits)
      • Because the predecoders are asymmetric, we had to design the 3 to 8 predecoder to have the same delay as the 4 to 16 predecoder to reduce glitching power
      • Used DRCMOS and skewing techniques
      • NOR-style predecoders (logical effort does not grow with the number of inputs) with complemented inputs

  17. Decoding
      • Comparison of Dynamic 4 to 16 Predecoders
      • Non-skewed
          • 3.140 mW average power, 141.1 ps delay to global WL
      • Skewed (2x bigger pmos)
          • 3.175 mW average power, 126.7 ps delay to global WL
      • Skewed (min widths)
          • 2.577 mW average power, 55.73 ps delay to global WL

  18. Decoding
      • Static vs. Best Dynamic Decoder
      • Static (2-input NAND)
          • 7.025 mW average power, 100.0 ps delay
          • E-D product = 702.5 mW·ps
          • E-D² product = 70,250 mW·ps²
      • Dynamic (DRCMOS, skewed)
          • 13.1 mW average power, 55.73 ps delay
          • E-D product = 730.1 mW·ps
          • E-D² product = 40,686 mW·ps²
      • Thus, we reduced the E-D² product by ~42%
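
The products on this slide can be reproduced from the quoted power and delay figures. The sketch below follows the slide's own arithmetic (average power times delay, then times delay again), even though the slide labels the results as E-D and E-D² products; the function name is hypothetical.

```python
def products(power_mw: float, delay_ps: float) -> tuple[float, float]:
    """Return (power*delay, power*delay^2) in mW*ps and mW*ps^2."""
    pd = power_mw * delay_ps
    pd2 = pd * delay_ps
    return pd, pd2


static_pd, static_pd2 = products(7.025, 100.0)    # static, 2-input NAND
dynamic_pd, dynamic_pd2 = products(13.1, 55.73)   # dynamic, DRCMOS, skewed

# Fractional reduction of the delay-squared product, dynamic vs. static.
reduction = 1 - dynamic_pd2 / static_pd2

print(f"static : {static_pd:.1f} mW*ps, {static_pd2:.0f} mW*ps^2")
print(f"dynamic: {dynamic_pd:.1f} mW*ps, {dynamic_pd2:.0f} mW*ps^2")
print(f"reduction: {reduction:.1%}")
```

The dynamic decoder is slightly worse on the first product and clearly better on the second, so the ~42% figure only holds when delay is weighted quadratically.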

  19. Decoding
      • Combining predecoder outputs
          • Static combinations
          • Each of the 16 MSB outputs is ANDed with each of the 8 LSB outputs to create 128 Global Word Lines
      [Figure: outputs of the 4 to 16 and 3 to 8 predecoders combined into WL7, WL6, WL5, ..., WL0]
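
The AND combination of predecoder outputs can be modeled in a few lines. This is a behavioral sketch, not the circuit: it assumes the 4 to 16 predecoder sees the upper four row bits and the 3 to 8 predecoder the lower three, and that global word line k corresponds to row k; the function names are hypothetical.

```python
def one_hot(value: int, width: int) -> list[int]:
    """Return a one-hot list of the given width with bit `value` set."""
    return [int(i == value) for i in range(width)]


def global_word_lines(row: int) -> list[int]:
    """Combine both predecoders into 128 one-hot global word lines."""
    if not 0 <= row < 128:
        raise ValueError("row must be in 0..127")
    msb = one_hot(row >> 3, 16)     # 4 to 16 predecoder (upper 4 row bits)
    lsb = one_hot(row & 0b111, 8)   # 3 to 8 predecoder (lower 3 row bits)
    # 16 x 8 two-input ANDs; the line at index 8*m + l is msb[m] AND lsb[l].
    return [m & l for m in msb for l in lsb]


gwl = global_word_lines(42)
print(gwl.index(1), sum(gwl))
```

Splitting the decode this way needs 128 two-input ANDs after the predecoders instead of 128 seven-input gates, which is the usual motivation for predecoding.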

  20. Decoding
      • Local Word Line Generation
      • Must take into account parasitics associated with long metal
      [Figure: Parasitic modeling of decoder wires. Labels: GWL4, R, C/2, C/2, Block Select 0, Block Select 1, LWL4 (BLOCK0), LWL4 (BLOCK1)]
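
The R with C/2 on each side in the figure is a pi model of the wire. One common first-order way to estimate the delay of such a segment is the Elmore delay; the sketch below uses it for illustration only. The slides do not say the team used this formula (they simulated with the parasitic models), and every numeric value here is hypothetical rather than taken from the design.

```python
def elmore_delay_pi(r_driver: float, r_wire: float,
                    c_wire: float, c_load: float) -> float:
    """Elmore delay (seconds) of a driver plus a pi-model wire plus a load.

    The near-end C/2 is charged through the driver resistance only; the
    far-end C/2 and the load are charged through driver + wire resistance.
    """
    near = r_driver * (c_wire / 2)
    far = (r_driver + r_wire) * (c_wire / 2 + c_load)
    return near + far


# Hypothetical values, chosen only to exercise the formula.
delay = elmore_delay_pi(r_driver=1e3, r_wire=200.0,
                        c_wire=50e-15, c_load=10e-15)
print(f"{delay * 1e12:.1f} ps")
```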

  21. Decoding
      • 3-level decoder optimization requires sweeping the number of buffers on the decode path
      • Potential locations of buffers
          • Immediately after predecoders
          • Before the decode wires (parasitic models)
          • After generating the local WLs
      • We buffered before the decode wires
      • Without buffering: delay = 126.86 ps; power = 5.655 mW
      • With buffering: delay = 100.46 ps; power = 5.523 mW
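
The effect of buffering can be quantified from the two measurements on this slide. A small sketch, using only the quoted numbers; the helper name is hypothetical.

```python
def fractional_saving(before: float, after: float) -> float:
    """Return the fraction by which `after` is lower than `before`."""
    return (before - after) / before


delay_saving = fractional_saving(126.86, 100.46)  # ps, without vs. with buffers
power_saving = fractional_saving(5.655, 5.523)    # mW, without vs. with buffers

print(f"delay reduced by {delay_saving:.1%}")
print(f"power reduced by {power_saving:.1%}")
```

Both numbers improve with buffering in this measurement, so the buffered design is better regardless of how delay and power are weighted against each other.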

  22. Decoding
      • Wanted to use “Source-coupled” NAND gates to generate the local word lines
      • Ran into charge sharing problems
      [Figure: Schematic of Source-Coupled NAND. Labels: Global WL, Block Select, Local WL]

  23. Decoding. [Waveform plot.] Notice: 0.25 V output! This is incorrectly de-asserted!

  24. Decoding
      • Problems with DRCMOS
      • Tried using schematics similar to those found in the literature, which resulted in oscillating predecode output signals
          • Fix: Removed some stages from the literature schematic
      • Also ran into strange glitching of the inputs to the NOR-style predecoder
          • Fix: Usually none… could drive the inputs harder, but the power losses were found to be an acceptable tradeoff for speed.
          • Note: We did observe that slowing down the inputs to the predecoder could reduce these glitches

  25. Decoding Notice: Glitching of input signals! Notice: 3 pulse oscillation of predecode output

  26. Decoding Nice Local WL!

  27. Decoder

  28. Our Architecture

  29. TX Gates (4x Min Size): Disconnect BL drivers, avoiding a fight between bitcells and BL drivers. Advantages: Outputs nicely spread apart, Select lines are all tied together and come in from the side, Inputs from top, Outputs on bottom, Easily N-Tapped and P-Tapped from left or right side, Easily mirrored.

  30. Buffers (First Inv = Min Size, Second Inv = 4x Min) Advantages: Thin enough that 2 buffers fit within the width of a Thin Cell, Easy to source VDD and VSS, Inputs and outputs connect easily (top and bottom of diagram), Easily N-Tapped and P-Tapped

  31. Precharge/BL/BLB Generator Advantages: Pitch-Matched (made as thin as possible while fitting in with the rest of the circuit), BL/BLB are on the outsides running vertically, Easily P-Tapped, N-Tapped, Thin, PRECH, WRITE, and Data Signals

  32. Word Select (1to2 DEMUX) – Sends Data to Column Advantages: Data comes from top, Address from side (design decision), N-wells together, P-wells together (easy to add N-taps and P-taps), easily mirrored

  33. Sense Amp Advantages: Wanted it as short as possible, but its width had to be smaller than the width of a Thin Cell, Similar to Thin Cell by using Cross-Coupled Inverter layout, Easily mirrored, Easily P-Tapped, N-Tapped

  34. 1 SRAM Block. [Layout image. Labeled components: 1:2 DEMUX to send data between columns; BL/BLB/PRECH Generator; BL/BLB drivers; TXGates to disconnect BL/BLB from their drivers; 8192 bitcells; Sense Amps with OUT going into a min.-sized buffer.] We discussed the components laid out in this image. We should have annotated the picture to make it easier to understand without the verbal presentation. Advantages: Symmetrical, Pitch-Matched (all separate components fit together nicely), 1 VDD/VSS source for Bit Cell Array, Can be mirrored, Most inputs come in from the left and top sides

  35. Block Pair: three 2:1 MUXes. One per SRAM block selects the correct column bit (i.e., word 0 or word 1) within that block; the third selects the word from the correct block

  36. Block diagram of the three 2:1 MUXes in the Block Pair on the previous slide

  37. Block Pair Merger

  38. Block Pair

  39. How we laid out our devices

  40. Layout Diagram (Connections Removed to Reduce Clutter). [Labels: 8 SRAM Blocks; 7:128 Row Decoder.] Blue lines are metal sending data chosen from a “block pair” to another 2:1 MUX in between 2 block pairs.

  41. [Simulation waveforms: 1 write followed by 5 reads. Signals: CLK, WRITE, READ, GlobalWL, BlockSelect, localWL, PRECH, TXGC, Q, QB, Dout10, Dout17. Annotations: TXGates only on during write or PRECH; bumps in Q due to reading a 0.]

  42. Metric Breakdown
      • Metric: 8.827 × 10^-29 J·s²·mm²·W
      • 1 bitcell area: 1.0034 µm²
      • Total area: 0.135815 mm²
      • Total Energy: 2.519 nJ
      • Read Delay: 1.485 ns
      • Write Delay: 0.262 ns
      • Total Delay: 1.485 ns
      • Idle Power: 117 mW
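
The slide does not spell out the metric formula, but its units (J, s², mm², W) and value are consistent with total energy × total delay² × total area × idle power. The sketch below checks that reading against the quoted numbers; the formula is an inference, not something the slides state. The array-area estimate at the end additionally assumes 8 blocks of 8192 bitcells each, as described on the layout slides.

```python
# Values quoted on the slide, converted to base units where needed.
energy_j = 2.519e-9        # total energy, 2.519 nJ
delay_s = 1.485e-9         # total delay, 1.485 ns (equal to the read delay)
area_mm2 = 0.135815        # total area
idle_power_w = 117e-3      # idle power, 117 mW

# Inferred formula: energy * delay^2 * area * idle power.
metric = energy_j * delay_s**2 * area_mm2 * idle_power_w
print(f"metric = {metric:.3e} J*s^2*mm^2*W")

# Share of the total area taken by bitcells (assumes 8 blocks x 8192 cells).
bitcell_area_um2 = 1.0034
array_area_mm2 = 8 * 8192 * bitcell_area_um2 * 1e-6  # um^2 -> mm^2
array_fraction = array_area_mm2 / area_mm2
print(f"bitcell array is about {array_fraction:.0%} of total area")
```

Because delay enters squared, the read delay dominates the metric, which is consistent with the deck's focus on the row decode path.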

  43. Questions?
