
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk


Presentation Transcript


  1. Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk
  Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
  University of Texas at Dallas, Computer Architecture and Memory Systems Lab

  2. Takeaway
  • Observations:
  • Employing more and more flash chips is not a promising solution by itself
  • Flash chip utilization is unbalanced and parallelism stays low
  • Challenges:
  • The degree of parallelism and utilization depends highly on incoming I/O request patterns
  • Our approach:
  • Sprinkle I/O requests based on the internal resource layout rather than the order imposed by the storage queue
  • Commit more memory requests to a specific internal flash resource

  3. Revisiting NAND Flash Performance
  • Flash Interface (ONFI 3.0):
  • SDR: 50 MB/s
  • NV-DDR: 200 MB/s
  • NV-DDR2: 533 MB/s
  • ONFI 4.0: 800 MB/s
  • Memory Cell Performance (excluding data movement):
  • READ: 20 us ~ 115 us (≈ 70 ~ 200 MB/s)
  • WRITE: 200 us ~ 5 ms (≈ 1.6 ~ 20 MB/s)
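
As a back-of-the-envelope check, the cell-level throughput ranges on this slide follow directly from the latency figures, assuming 4 KB and 8 KB page sizes (the page sizes are our assumption; the slide lists only latencies):

```python
# Sanity check: cell throughput = page size / array latency.
# Page sizes (4 KB and 8 KB) are assumptions; the slide lists only latencies.
KB = 1024
MB = 1024 * 1024

def throughput_mb_s(page_bytes, latency_s):
    return page_bytes / latency_s / MB

# WRITE: 200 us (fast page) .. 5 ms (slow page)
print(throughput_mb_s(4 * KB, 200e-6))   # ~20 MB/s  (fast-page upper bound)
print(throughput_mb_s(8 * KB, 5e-3))     # ~1.6 MB/s (slow-page lower bound)

# READ: 20 us .. 115 us
print(throughput_mb_s(4 * KB, 20e-6))    # ~200 MB/s
print(throughput_mb_s(8 * KB, 115e-6))   # ~70 MB/s
```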

  4. Revisiting NAND Flash Performance
  • PCI Express (single lane):
  • 2.x: 500 MB/s
  • 3.0: 985 MB/s
  • 4.0: 1969 MB/s
  • PCIe 4.0 (16 lanes): 31.51 GB/s
  • Compare: ONFI 4.0 flash channel: 800 MB/s

  5. Revisiting NAND Flash Performance
  [Figure: performance disparity, even under an ideal situation: 200 MB/s vs. 800 MB/s (ONFI 4.0) vs. 31 GB/s (PCIe 4.0 x16)]
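
A rough calculation from the slide's own numbers shows how wide the gap is, and equivalently how many flash resources an SSD must keep busy to hide it:

```python
# Sketch of the disparity using the figures quoted above.
cell_write = 1.6            # MB/s, slow-page program (worst case)
onfi_channel = 800.0        # MB/s, one ONFI 4.0 channel
pcie_x16 = 31.51 * 1024     # MB/s, PCIe 4.0 x16 (31.51 GB/s)

print(f"chips to saturate one channel: {onfi_channel / cell_write:.0f}")   # ~500
print(f"channels to saturate host link: {pcie_x16 / onfi_channel:.0f}")    # ~40
print(f"end-to-end disparity: {pcie_x16 / cell_write:.0f}x")               # ~20,000x
```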

  6. How can we reduce the performance disparity?

  7. Internal Parallelism
  [Figure: a single host-level I/O request spread across the SSD's internal resources]

  8. Unfortunately, the performance of many-chip SSDs does not improve significantly as the amount of internal resources increases

  9. Many-chip SSD Performance
  [Figure: throughput vs. number of chips; performance stagnates as chips are added]

  10. Utilization and Idleness
  [Figure: as the chip count grows, idleness keeps growing while utilization sharply goes down]

  11. I/O Service Routine in a Many-chip SSD
  • Memory requests: data size equals the atomic flash I/O unit size (see the sketch below)
  • Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
  • A flash transaction must be composed before entering the execution stage
  • Goal: out-of-order scheduling to exploit system- and flash-level parallelism
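
To make the first bullet concrete, here is a minimal sketch of how a host-level I/O request might be split into page-granular memory requests; the 4 KB page size, names, and fields are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # atomic flash I/O unit; 4 KB is an assumption

@dataclass
class MemRequest:
    io_id: int       # host-level I/O request this piece belongs to
    page_addr: int   # logical page number
    is_write: bool

def split_io_request(io_id, byte_offset, byte_len, is_write):
    """Split one host-level I/O request into page-granular memory requests."""
    first = byte_offset // PAGE_SIZE
    last = (byte_offset + byte_len - 1) // PAGE_SIZE
    return [MemRequest(io_id, p, is_write) for p in range(first, last + 1)]

# Example: a 16 KB write at offset 6 KB touches pages 1..5 (5 memory requests).
print(len(split_io_request(io_id=7, byte_offset=6144, byte_len=16384, is_write=True)))
```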

  12. Challenge Examples
  • Virtual Address Scheduler (VAS)
  • Physical Address Scheduler (PAS)

  13. Virtual Address Scheduler (VAS)
  [Figure: requests 1~5 committed in virtual-address order across chips C0~C8 (physical offset per chip); the FIFO head stalls on busy CHIP 3 (C3), causing tail collisions while other chips sit idle]

  14. Physical Address Scheduler (PAS)
  [Figure: the same requests committed through per-chip physical queues; busy chips pipeline back-to-back requests, but collisions and tail collisions remain on hot chips. The two commit policies are contrasted in the sketch below]
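
The two commit policies can be contrasted in a few lines. This is an illustrative sketch, not the paper's implementation; the static striping function and queue shapes are assumptions:

```python
from collections import deque

NUM_CHIPS = 9  # C0..C8 in the figures

def target_chip(page_addr):
    return page_addr % NUM_CHIPS  # simple static striping (an assumption)

def vas_commit(fifo, chip_busy):
    """VAS: one FIFO in virtual-address order. If the head request's chip
    is busy (a tail collision), everything behind it stalls and other
    chips sit idle."""
    if fifo and not chip_busy[target_chip(fifo[0])]:
        return fifo.popleft()
    return None  # head-of-line blocking: idle chips stay idle

def pas_commit(chip_queues, chip_busy):
    """PAS: per-chip queues let idle chips be fed and busy chips pipeline,
    but each queue still commits in the host-imposed order, so bursty
    patterns keep colliding on a few hot chips."""
    for chip, q in enumerate(chip_queues):
        if q and not chip_busy[chip]:
            return q.popleft()
    return None
```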

  15. Observations
  • # of chips < # of memory requests
  • The total number of chips is much smaller than the total number of memory requests coming from different I/O requests
  • Many requests head to the same chip, but to different internal resources
  • Multiple memory requests can be built into a high-FLP transaction if we could change the commit order

  16. Insights
  • Stalled memory requests can be served immediately if the scheduler composes requests beyond the boundaries of individual I/O requests and commits them regardless of their arrival order
  • The scheduler gains more flexibility in building high-FLP flash transactions if it can commit requests targeting different flash internal resources

  17. Sprinkler
  • Relaxing the parallelism dependency: schedule and build memory requests based on the internal resource layout
  • Improving transactional locality: supply many memory requests to the underlying flash controllers

  18. RIOS: Resource-driven I/O Scheduling
  • Relaxing the parallelism dependency
  • Schedule and build memory requests based on the internal resource layout
  [Figure: memory requests 1~11 sprinkled across chips C0~C8 by target resource rather than arrival order]

  19. RIOS: Resource-driven I/O Scheduling
  • RIOS: out-of-order scheduling
  • Fine-granule out-of-order execution
  • Maximizing utilization (see the sketch below)
  [Figure: requests 1~11 committed out of order across chips C0~C8]
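
A minimal sketch of the RIOS idea, under the same illustrative assumptions as above: scan the whole pending pool, group requests by the physical chip they target, and feed every idle chip regardless of arrival order:

```python
from collections import defaultdict, deque

def rios_commit(pending, chip_busy, target_chip):
    """RIOS sketch: ignore queue order; group all pending memory requests
    by target chip and commit one to every idle chip (fine-granule
    out-of-order execution)."""
    by_chip = defaultdict(deque)
    for req in pending:
        by_chip[target_chip(req)].append(req)
    committed = []
    for chip, reqs in by_chip.items():
        if not chip_busy[chip]:
            committed.append(reqs.popleft())
    for req in committed:
        pending.remove(req)   # committed requests leave the pending pool
    return committed
```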

  20. FARO: FLP-Aware Request Over-commitment
  • High Flash-Level Parallelism (FLP): bring as many requests as possible to the flash controllers, allowing them to coalesce many memory requests into a single flash transaction
  • Consideration: careless over-commitment of memory requests can introduce more resource contention

  21. FARO: FLP-Aware Request Over-commitment
  • Overlap depth: the number of memory requests heading to different planes and dies, but the same chip
  • Connectivity: the maximum number of memory requests that belong to the same I/O request
  [Figure: two over-commitments for chip C3, one with overlap depth 4 and connectivity 2, the other with overlap depth 4 and connectivity 1, contrasting RIOS alone with FARO; see the sketch below]
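
Both metrics have direct definitions, so they are easy to compute. Below is a hedged sketch: the dataclass fields and the greedy selection policy are assumptions for illustration; the paper's FARO may weigh the metrics differently:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class MemRequest:
    io_id: int    # host I/O request it came from
    chip: int     # physical target after address translation
    die: int
    plane: int

def overlap_depth(reqs, chip):
    """Same-chip requests heading to distinct (die, plane) pairs."""
    return len({(r.die, r.plane) for r in reqs if r.chip == chip})

def connectivity(reqs, chip):
    """Max number of same-chip requests sharing one host I/O request."""
    counts = Counter(r.io_id for r in reqs if r.chip == chip)
    return max(counts.values(), default=0)

def faro_pick(reqs, chip):
    """Greedy sketch: over-commit one request per (die, plane) of `chip`,
    preferring the dominant io_id so connectivity stays high."""
    cand = [r for r in reqs if r.chip == chip]
    ranked = Counter(r.io_id for r in cand).most_common(1)
    best_io = ranked[0][0] if ranked else None
    seen, picked = set(), []
    for r in sorted(cand, key=lambda r: r.io_id != best_io):
        if (r.die, r.plane) not in seen:
            seen.add((r.die, r.plane))
            picked.append(r)
    return picked
```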

  22. Sprinkler
  [Figure: requests 1~5 sprinkled across chips C0~C8 with transactions pipelined on each chip; a combined RIOS + FARO sketch follows]
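
Putting the two pieces together, one scheduling tick of SPK3 (Sprinkler) might look like the following; `faro_pick` is the sketch from the previous slide, and `controllers[chip].issue` is a hypothetical controller API:

```python
def sprinkler_tick(pending, chip_busy, controllers):
    """Sketch of RIOS + FARO: sprinkle pending memory requests across
    idle chips out of order (RIOS), then coalesce each chip's
    over-committed requests into one high-FLP transaction (FARO)."""
    for chip, busy in enumerate(chip_busy):
        if busy:
            continue
        txn = faro_pick(pending, chip)   # from the FARO sketch above
        if not txn:
            continue
        for r in txn:
            pending.remove(r)            # committed out of arrival order
        controllers[chip].issue(txn)     # hypothetical API
```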

  23. Evaluations
  • Simulation:
  • NFS (NANDFlashSim), http://nfs.camelab.org
  • 64 ~ 1024 flash chips (dual die, four planes); our SSD simulator simultaneously executes 1024 NFS instances
  • Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
  • Workloads:
  • Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
  • High transactional-locality workloads: cfs2, msnfs2~3
  • Schedulers:
  • VAS: Virtual Address Scheduler, using FIFO
  • PAS: Physical Address Scheduler, using extra queues
  • SPK1: Sprinkler, using only FARO
  • SPK2: Sprinkler, using only RIOS
  • SPK3: Sprinkler, using both FARO and RIOS

  24. Throughput
  [Bandwidth] Compared to VAS: 42 MB/s ~ 300 MB/s improvement; compared to PAS: 1.8 times better performance
  [IOPS] 4x improvement

  25. I/O and Queuing Latency
  [Avg. Latency] For large request sizes, SPK2 is worse than SPK1, and SPK1 is worse than PAS: SPK1 by itself cannot secure enough memory requests and still has the parallelism dependency
  [Queue Stall Time] SPK3 (Sprinkler) reduces device-level latency and queue pending time by at least 59% and 86%, respectively

  26. Idleness Evaluation
  [Inter-chip Idleness] SPK1 shows worse inter-chip idleness reduction than PAS
  [Intra-chip Idleness] SPK1 shows better intra-chip idleness reduction than PAS
  • Considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46% idleness reduction)

  27. Conclusion and Related Work
  • Conclusion:
  • Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
  • Sprinkler offers at least 56.6% shorter latency and 1.8x ~ 2.2x better bandwidth than a modern SSD controller
  • Related work:
  • Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
  • Physical address scheduling [ISCA'12, TC'11]

  28. Parallelism Breakdown
  [Figures: VAS, SPK1 (FARO-only), SPK2 (RIOS-only), SPK3 (Sprinkler)]

  29. # of Transactions
  [Figures: 64-chip and 1024-chip configurations]

  30. Time Series Analysis

  31. GC (Garbage Collection)

  32. Sensitivity Test
  [Figures: 64-chip, 256-chip, and 1024-chip configurations]
