Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk
Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
University of Texas at Dallas, Computer Architecture and Memory Systems Lab
Takeaway
• Observations:
  • Employing more and more flash chips is not a promising solution
  • Unbalanced flash chip utilization and low parallelism
• Challenges:
  • The degree of parallelism and utilization depends highly on incoming I/O request patterns
• Our approach:
  • Sprinkles I/O requests based on the internal resource layout rather than the order imposed by the storage queue
  • Commits more memory requests to a specific internal flash resource
Revisiting NAND Flash Performance
• Flash interface (ONFI 3.0): SDR 50 MB/sec, NV-DDR 200 MB/sec, NV-DDR2 533 MB/sec; ONFI 4.0: 800 MB/sec
• Memory cell performance (excluding data movement): READ 20 us ~ 115 us, WRITE 200 us ~ 5 ms
• Resulting cell throughput: WRITE 1.6 ~ 20 MB/sec, READ 70 ~ 200 MB/sec
Revisiting NAND Flash Performance
• PCI Express (single lane): 2.x 500 MB/sec, 3.0 985 MB/sec, 4.0 1969 MB/sec
• PCIe 4.0 (16 lanes): 31.51 GB/sec
• ONFI 4.0: 800 MB/sec
Revisiting NAND Flash Performance
• 200 MB/s (flash interface) vs. 800 MB/s (ONFI 4.0) vs. 31 GB/s (PCIe 4.0 x16): a performance disparity remains even under an ideal situation
Internal Parallelism
[Diagram: a single host-level I/O request is split across the SSD's internal resources]
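As a rough illustration of the internal parallelism above, a single host-level I/O request is striped across channels, chips, dies, and planes. The geometry constants and the mapping function below are illustrative assumptions, not the talk's actual layout:

```python
# Hypothetical striping of logical pages over internal flash resources.
# Geometry is an assumed example, not the SSD configuration from the talk.
CHANNELS, CHIPS_PER_CH, DIES_PER_CHIP, PLANES_PER_DIE = 4, 4, 2, 4

def stripe(lpn: int) -> tuple:
    """Map a logical page number to (channel, chip, die, plane)."""
    ch = lpn % CHANNELS
    chip = (lpn // CHANNELS) % CHIPS_PER_CH
    die = (lpn // (CHANNELS * CHIPS_PER_CH)) % DIES_PER_CHIP
    plane = (lpn // (CHANNELS * CHIPS_PER_CH * DIES_PER_CHIP)) % PLANES_PER_DIE
    return ch, chip, die, plane

# A 16-page host request lands on 16 distinct (channel, chip) pairs,
# which is what lets a single request exploit many resources at once.
targets = [stripe(p) for p in range(16)]
```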
Unfortunately, the performance of many-chip SSDs is not significantly improved as the amount of internal resources increases
Many-chip SSD Performance
[Figure: performance stagnates as chips are added]
Utilization and Idleness
[Figure: idleness keeps growing while utilization sharply goes down]
I/O Service Routine in a Many-chip SSD
• Memory requests: data size is the same as the atomic flash I/O unit size
• Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
• A flash transaction must be composed before entering the execution stage
• Out-of-order scheduling; system- and flash-level parallelism
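The first step above, chopping a host-level I/O request into memory requests whose size equals the atomic flash I/O unit, can be sketched as follows; `PAGE` and the request representation are assumptions for illustration:

```python
# Minimal sketch: split a host request into page-sized memory requests.
PAGE = 4096  # assumed atomic flash I/O unit in bytes

def to_memory_requests(offset: int, length: int):
    """Split the byte range [offset, offset+length) into one memory
    request per flash page it touches."""
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    return [{"page": p, "size": PAGE} for p in range(first, last + 1)]

# An unaligned 10,000-byte request starting at byte 6144 touches
# three flash pages, so it becomes three memory requests.
reqs = to_memory_requests(offset=6144, length=10000)
```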
Challenge Examples • Virtual Address Scheduler • Physical Address Scheduler
Virtual Address Scheduler (VAS)
[Diagram: memory requests 1–5 are committed in queue order to a 3x3 chip array (C0–C8) by physical offset; back-to-back requests to the same chip (e.g., C3) cause tail collisions and leave other chips idle]
Physical Address Scheduler (PAS)
[Diagram: the same requests scheduled by physical address enable pipelining on chip C3, but repeated hits on one chip still produce a chain of tail collisions]
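The tail-collision behavior contrasted in the two slides above can be shown with a toy model: a strict in-order commit (as in VAS) stalls the whole queue behind a busy chip, while an out-of-order commit can bypass it. This is an illustrative model, not the talk's simulator:

```python
from collections import deque

def in_order_commits(queue, busy):
    """Commit from the queue head only; the first request whose target
    chip is busy blocks everything behind it (a tail collision)."""
    done, q = [], deque(queue)
    while q and q[0] not in busy:
        done.append(q.popleft())
    return done

def out_of_order_commits(queue, busy):
    """Commit any request whose target chip is currently free."""
    return [chip for chip in queue if chip not in busy]

queue = ["C3", "C3", "C0", "C1", "C4"]  # target chips, head first
busy = {"C3"}                           # C3 is mid-operation
```

With C3 busy, the in-order scheduler commits nothing while the out-of-order one still serves C0, C1, and C4.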
Observations
• # of chips < # of memory requests: the total number of chips is far smaller than the total number of memory requests coming from different I/O requests
• Many requests head to the same chip, but to different internal resources
• Multiple memory requests can be built into a high-FLP transaction if we could change the commit order
Insights
• Stalled memory requests can be served immediately if the scheduler composes requests beyond the boundary of individual I/O requests and commits them regardless of their order
• The scheduler gains more flexibility in building a flash transaction with high FLP if it commits requests targeting different flash internal resources
Sprinkler
• Relaxing the parallelism dependency: schedule and build memory requests based on the internal resource layout
• Improving transactional locality: supply many memory requests to the underlying flash controllers
RIOS: Resource-driven I/O Scheduling
• Relaxing the parallelism dependency
• Schedule and build memory requests based on the internal resource layout
[Diagram: memory requests 1–11 are sprinkled across the chip array C0–C8 based on resource layout]
RIOS: Resource-driven I/O Scheduling
• Out-of-order scheduling
• Fine-granularity out-of-order execution
• Maximizing utilization
[Diagram: requests 1–11 committed out of order across chips C0–C8]
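A minimal sketch of the RIOS idea above, under assumed data structures: pending memory requests are binned per target chip rather than held in one arrival-ordered storage queue, so every idle chip can be fed independently. All names here are hypothetical:

```python
from collections import defaultdict

class RIOSSketch:
    """Toy resource-driven scheduler: one pending list per flash chip."""

    def __init__(self):
        self.per_chip = defaultdict(list)  # chip id -> pending requests

    def enqueue(self, chip, req):
        """Bin an incoming memory request by its target chip."""
        self.per_chip[chip].append(req)

    def commit(self, idle_chips):
        """Feed one request to every idle chip, regardless of the
        order in which requests arrived at the device."""
        batch = {}
        for chip in idle_chips:
            if self.per_chip[chip]:
                batch[chip] = self.per_chip[chip].pop(0)
        return batch

sched = RIOSSketch()
sched.enqueue("C0", "r1")
sched.enqueue("C0", "r2")
sched.enqueue("C1", "r3")
```

Even though r1 and r2 arrived first, C1 is served in the same commit round instead of waiting behind them.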
FARO: FLP-Aware Request Over-commitment
• High flash-level parallelism (FLP): bring as many requests as possible to the flash controllers, allowing them to coalesce many memory requests into a single flash transaction
• Consideration: careless over-commitment of memory requests can introduce more resource contention
FARO: FLP-Aware Request Over-commitment
• Overlap depth: the number of memory requests heading to different planes and dies, but the same chip
• Connectivity: the maximum number of memory requests that belong to the same I/O request
[Diagram: two over-commitment choices on chip C3 after RIOS, both with overlap depth 4 but with connectivity 2 vs. 1; FARO prefers the higher-connectivity set]
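The two FARO metrics defined above can be computed over pending memory requests as follows; the `(io_request_id, chip, die, plane)` tuple representation is an assumption for illustration:

```python
from collections import Counter

def overlap_depth(reqs, chip):
    """Count pending requests heading to the same chip but to
    distinct (die, plane) resources."""
    targets = {(die, plane) for rid, c, die, plane in reqs if c == chip}
    return len(targets)

def connectivity(reqs, chip):
    """Maximum number of requests on this chip that belong to the
    same host-level I/O request."""
    counts = Counter(rid for rid, c, die, plane in reqs if c == chip)
    return max(counts.values(), default=0)

# Two I/O requests (A and B) each contribute two memory requests to C3,
# hitting all four (die, plane) combinations of a 2-die, 2-plane chip.
pending = [("A", "C3", 0, 0), ("A", "C3", 0, 1),
           ("B", "C3", 1, 0), ("B", "C3", 1, 1)]
```

This matches the diagram's better case: overlap depth 4 with connectivity 2.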
Sprinkler
[Diagram: memory requests 1–5 pipelined across the chip array C0–C8]
Evaluations
• Simulation:
  • NFS (NANDFlashSim), http://nfs.camelab.org
  • 64 ~ 1024 flash chips, each with dual dies and four planes (our SSD simulator simultaneously executes 1024 NFS instances)
  • Intrinsic latency variation (write: 200 us for a fast page ~ 2.2 ms for a slow page; read: 20 us)
• Workloads:
  • Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
  • High transactional-locality workloads: cfs2, msnfs2~3
• Schedulers:
  • VAS: Virtual Address Scheduler, using FIFO
  • PAS: Physical Address Scheduler, using extra queues
  • SPK1: Sprinkler, using only FARO
  • SPK2: Sprinkler, using only RIOS
  • SPK3: Sprinkler, using both FARO and RIOS
Throughput
• Compared to VAS: 42 MB/s ~ 300 MB/s improvement [Bandwidth]; 4x improvement [IOPS]
• Compared to PAS: 1.8x better performance; 300 MB/s improvement
I/O and Queuing Latency
• SPK1 by itself cannot secure enough memory requests and still suffers from parallelism dependency; with large request sizes, SPK2 is worse than SPK1, and SPK1 is worse than PAS [Avg. Latency]
• SPK3 (Sprinkler) reduces device-level latency and queue pending time by at least 59% and 86%, respectively [Queue Stall Time]
Idleness Evaluation
• SPK1 shows worse inter-chip idleness reduction than PAS [Inter-chip Idleness]
• SPK1 shows better intra-chip idleness reduction than PAS [Intra-chip Idleness]
• Considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (by around 46%)
Conclusion and Related Work
• Conclusion:
  • Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
  • Sprinkler offers at least 56.6% shorter latency and 1.8 ~ 2.2x better bandwidth than a modern SSD controller
• Related work:
  • Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
  • Physical address scheduling [ISCA'12, TC'11]
Parallelism Breakdown
[Figure panels: VAS, SPK1 (FARO-only), SPK2 (RIOS-only), SPK3 (Sprinkler)]
# of Transactions
[Figure panels: 64 chips, 1024 chips]
Sensitivity Test
[Figure panels: 64 chips, 256 chips, 1024 chips]