
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk


Presentation Transcript


  1. Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk
  Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
  University of Texas at Dallas, Computer Architecture and Memory Systems Lab

  2. Takeaway
  • Observations:
  • Employing more and more flash chips is not a promising solution by itself
  • Flash chip utilization is unbalanced and parallelism stays low
  • Challenges:
  • The degree of parallelism and utilization depends highly on incoming I/O request patterns
  • Our approach:
  • Sprinkle I/O requests based on the internal resource layout rather than the order imposed by the storage queue
  • Commit more memory requests to a specific internal flash resource

  3. Revisiting NAND Flash Performance
  • Flash Interface (ONFI 3.0):
  • SDR: 50 MB/s
  • NV-DDR: 200 MB/s
  • NV-DDR2: 533 MB/s
  • ONFI 4.0: 800 MB/s
  • Memory Cell Performance (excluding data movement):
  • READ: 20 us ~ 115 us (≈ 70 ~ 200 MB/s)
  • WRITE: 200 us ~ 5 ms (≈ 1.6 ~ 20 MB/s)
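
As a back-of-the-envelope check, the cell-level throughput ranges on this slide follow directly from the latency figures, assuming 4 KB and 8 KB page sizes (the page sizes are our assumption; the slide lists only latencies):

```python
# Sanity check: cell throughput = page size / array latency.
# Page sizes (4 KB and 8 KB) are assumptions; the slide lists only latencies.
KB = 1024
MB = 1024 * 1024

def throughput_mb_s(page_bytes, latency_s):
    return page_bytes / latency_s / MB

# WRITE: 200 us (fast page) .. 5 ms (slow page)
print(throughput_mb_s(4 * KB, 200e-6))   # ~20 MB/s  (fast-page upper bound)
print(throughput_mb_s(8 * KB, 5e-3))     # ~1.6 MB/s (slow-page lower bound)

# READ: 20 us .. 115 us
print(throughput_mb_s(4 * KB, 20e-6))    # ~200 MB/s
print(throughput_mb_s(8 * KB, 115e-6))   # ~70 MB/s
```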

  4. Revisiting NAND Flash Performance
  • PCI Express (single lane):
  • 2.x: 500 MB/s
  • 3.0: 985 MB/s
  • 4.0: 1969 MB/s
  • PCIe 4.0 (16 lanes): 31.51 GB/s
  • Compare: ONFI 4.0 flash channel: 800 MB/s

  5. Revisiting NAND Flash Performance
  [Figure: performance disparity, even under an ideal situation: 200 MB/s vs. 800 MB/s (ONFI 4.0) vs. 31 GB/s (PCIe 4.0 x16)]
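
A rough calculation from the slide's own numbers shows how wide the gap is, and equivalently how many flash resources an SSD must keep busy to hide it:

```python
# Sketch of the disparity using the figures quoted above.
cell_write = 1.6            # MB/s, slow-page program (worst case)
onfi_channel = 800.0        # MB/s, one ONFI 4.0 channel
pcie_x16 = 31.51 * 1024     # MB/s, PCIe 4.0 x16 (31.51 GB/s)

print(f"chips to saturate one channel: {onfi_channel / cell_write:.0f}")   # ~500
print(f"channels to saturate host link: {pcie_x16 / onfi_channel:.0f}")    # ~40
print(f"end-to-end disparity: {pcie_x16 / cell_write:.0f}x")               # ~20,000x
```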

  6. How can we reduce the performance disparity?

  7. Internal Parallelism
  [Figure: a single host-level I/O request spread across the SSD's internal resources]

  8. Unfortunately, the performance of many-chip SSDs does not improve significantly as the amount of internal resources increases

  9. Many-chip SSD Performance
  [Figure: throughput vs. number of chips; performance stagnates as chips are added]

  10. Utilization and Idleness
  [Figure: as the chip count grows, idleness keeps growing while utilization sharply goes down]

  11. I/O Service Routine in a Many-chip SSD
  • Memory requests: data size equals the atomic flash I/O unit size (see the sketch below)
  • Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
  • A flash transaction must be composed before entering the execution stage
  • Goal: out-of-order scheduling to exploit system- and flash-level parallelism
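
To make the first bullet concrete, here is a minimal sketch of how a host-level I/O request might be split into page-granular memory requests; the 4 KB page size, names, and fields are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # atomic flash I/O unit; 4 KB is an assumption

@dataclass
class MemRequest:
    io_id: int       # host-level I/O request this piece belongs to
    page_addr: int   # logical page number
    is_write: bool

def split_io_request(io_id, byte_offset, byte_len, is_write):
    """Split one host-level I/O request into page-granular memory requests."""
    first = byte_offset // PAGE_SIZE
    last = (byte_offset + byte_len - 1) // PAGE_SIZE
    return [MemRequest(io_id, p, is_write) for p in range(first, last + 1)]

# Example: a 16 KB write at offset 6 KB touches pages 1..5 (5 memory requests).
print(len(split_io_request(io_id=7, byte_offset=6144, byte_len=16384, is_write=True)))
```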

  12. Challenge Examples
  • Virtual Address Scheduler (VAS)
  • Physical Address Scheduler (PAS)

  13. Virtual Address Scheduler (VAS)
  [Figure: requests 1~5 committed in virtual-address order across chips C0~C8 (physical offset per chip); the FIFO head stalls on busy CHIP 3 (C3), causing tail collisions while other chips sit idle]

  14. Physical Address Scheduler (PAS)
  [Figure: the same requests committed through per-chip physical queues; busy chips pipeline back-to-back requests, but collisions and tail collisions remain on hot chips. The two commit policies are contrasted in the sketch below]
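
The two commit policies can be contrasted in a few lines. This is an illustrative sketch, not the paper's implementation; the static striping function and queue shapes are assumptions:

```python
from collections import deque

NUM_CHIPS = 9  # C0..C8 in the figures

def target_chip(page_addr):
    return page_addr % NUM_CHIPS  # simple static striping (an assumption)

def vas_commit(fifo, chip_busy):
    """VAS: one FIFO in virtual-address order. If the head request's chip
    is busy (a tail collision), everything behind it stalls and other
    chips sit idle."""
    if fifo and not chip_busy[target_chip(fifo[0])]:
        return fifo.popleft()
    return None  # head-of-line blocking: idle chips stay idle

def pas_commit(chip_queues, chip_busy):
    """PAS: per-chip queues let idle chips be fed and busy chips pipeline,
    but each queue still commits in the host-imposed order, so bursty
    patterns keep colliding on a few hot chips."""
    for chip, q in enumerate(chip_queues):
        if q and not chip_busy[chip]:
            return q.popleft()
    return None
```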

  15. Observations
  • # of chips < # of memory requests
  • The total number of chips is much smaller than the total number of memory requests coming from different I/O requests
  • Many requests head to the same chip, but to different internal resources
  • Multiple memory requests can be built into a high-FLP transaction if we could change the commit order

  16. Insights
  • Stalled memory requests can be served immediately if the scheduler composes requests beyond the boundaries of individual I/O requests and commits them regardless of their arrival order
  • The scheduler gains more flexibility in building high-FLP flash transactions if it can commit requests targeting different flash internal resources

  17. Sprinkler
  • Relaxing the parallelism dependency: schedule and build memory requests based on the internal resource layout
  • Improving transactional locality: supply many memory requests to the underlying flash controllers

  18. RIOS: Resource-driven I/O Scheduling
  • Relaxing the parallelism dependency
  • Schedule and build memory requests based on the internal resource layout
  [Figure: memory requests 1~11 sprinkled across chips C0~C8 by target resource rather than arrival order]

  19. RIOS: Resource-driven I/O Scheduling
  • RIOS: out-of-order scheduling
  • Fine-granule out-of-order execution
  • Maximizing utilization (see the sketch below)
  [Figure: requests 1~11 committed out of order across chips C0~C8]
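
A minimal sketch of the RIOS idea, under the same illustrative assumptions as above: scan the whole pending pool, group requests by the physical chip they target, and feed every idle chip regardless of arrival order:

```python
from collections import defaultdict, deque

def rios_commit(pending, chip_busy, target_chip):
    """RIOS sketch: ignore queue order; group all pending memory requests
    by target chip and commit one to every idle chip (fine-granule
    out-of-order execution)."""
    by_chip = defaultdict(deque)
    for req in pending:
        by_chip[target_chip(req)].append(req)
    committed = []
    for chip, reqs in by_chip.items():
        if not chip_busy[chip]:
            committed.append(reqs.popleft())
    for req in committed:
        pending.remove(req)   # committed requests leave the pending pool
    return committed
```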

  20. FARO: FLP-Aware Request Over-commitment
  • High Flash-Level Parallelism (FLP): bring as many requests as possible to the flash controllers, allowing them to coalesce many memory requests into a single flash transaction
  • Consideration: careless over-commitment of memory requests can introduce more resource contention

  21. FARO: FLP-Aware Request Over-commitment
  • Overlap depth: the number of memory requests heading to different planes and dies, but the same chip
  • Connectivity: the maximum number of memory requests that belong to the same I/O request
  [Figure: two over-commitments for chip C3, one with overlap depth 4 and connectivity 2, the other with overlap depth 4 and connectivity 1, contrasting RIOS alone with FARO; see the sketch below]
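
Both metrics have direct definitions, so they are easy to compute. Below is a hedged sketch: the dataclass fields and the greedy selection policy are assumptions for illustration; the paper's FARO may weigh the metrics differently:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class MemRequest:
    io_id: int    # host I/O request it came from
    chip: int     # physical target after address translation
    die: int
    plane: int

def overlap_depth(reqs, chip):
    """Same-chip requests heading to distinct (die, plane) pairs."""
    return len({(r.die, r.plane) for r in reqs if r.chip == chip})

def connectivity(reqs, chip):
    """Max number of same-chip requests sharing one host I/O request."""
    counts = Counter(r.io_id for r in reqs if r.chip == chip)
    return max(counts.values(), default=0)

def faro_pick(reqs, chip):
    """Greedy sketch: over-commit one request per (die, plane) of `chip`,
    preferring the dominant io_id so connectivity stays high."""
    cand = [r for r in reqs if r.chip == chip]
    ranked = Counter(r.io_id for r in cand).most_common(1)
    best_io = ranked[0][0] if ranked else None
    seen, picked = set(), []
    for r in sorted(cand, key=lambda r: r.io_id != best_io):
        if (r.die, r.plane) not in seen:
            seen.add((r.die, r.plane))
            picked.append(r)
    return picked
```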

  22. Sprinkler
  [Figure: requests 1~5 sprinkled across chips C0~C8 with transactions pipelined on each chip; a combined RIOS + FARO sketch follows]
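
Putting the two pieces together, one scheduling tick of SPK3 (Sprinkler) might look like the following; `faro_pick` is the sketch from the previous slide, and `controllers[chip].issue` is a hypothetical controller API:

```python
def sprinkler_tick(pending, chip_busy, controllers):
    """Sketch of RIOS + FARO: sprinkle pending memory requests across
    idle chips out of order (RIOS), then coalesce each chip's
    over-committed requests into one high-FLP transaction (FARO)."""
    for chip, busy in enumerate(chip_busy):
        if busy:
            continue
        txn = faro_pick(pending, chip)   # from the FARO sketch above
        if not txn:
            continue
        for r in txn:
            pending.remove(r)            # committed out of arrival order
        controllers[chip].issue(txn)     # hypothetical API
```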

  23. Evaluations
  • Simulation:
  • NFS (NANDFlashSim), http://nfs.camelab.org
  • 64 ~ 1024 flash chips (dual die, four planes); our SSD simulator simultaneously executes 1024 NFS instances
  • Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
  • Workloads:
  • Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
  • High transactional-locality workloads: cfs2, msnfs2~3
  • Schedulers:
  • VAS: Virtual Address Scheduler, using FIFO
  • PAS: Physical Address Scheduler, using extra queues
  • SPK1: Sprinkler, using only FARO
  • SPK2: Sprinkler, using only RIOS
  • SPK3: Sprinkler, using both FARO and RIOS

  24. Throughput
  [Bandwidth] Compared to VAS: 42 MB/s ~ 300 MB/s improvement; compared to PAS: 1.8 times better performance
  [IOPS] 4x improvement

  25. I/O and Queuing Latency
  [Avg. Latency] For large request sizes, SPK2 is worse than SPK1, and SPK1 is worse than PAS: SPK1 by itself cannot secure enough memory requests and still has the parallelism dependency
  [Queue Stall Time] SPK3 (Sprinkler) reduces device-level latency and queue pending time by at least 59% and 86%, respectively

  26. Idleness Evaluation
  [Inter-chip Idleness] SPK1 shows worse inter-chip idleness reduction than PAS
  [Intra-chip Idleness] SPK1 shows better intra-chip idleness reduction than PAS
  • Considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46% idleness reduction)

  27. Conclusion and Related Work
  • Conclusion:
  • Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
  • Sprinkler offers at least 56.6% shorter latency and 1.8x ~ 2.2x better bandwidth than a modern SSD controller
  • Related work:
  • Balancing timing constraints, fairness, and different dimensions of physical parallelism in DRAM-based memory controllers [HPCA'10, MICRO'10 Y. Kim, MICRO'07, PACT'07]
  • Physical address scheduling [ISCA'12, TC'11]

  28. Parallelism Breakdown
  [Figures: VAS, SPK1 (FARO-only), SPK2 (RIOS-only), SPK3 (Sprinkler)]

  29. # of Transactions
  [Figures: 64-chip and 1024-chip configurations]

  30. Time Series Analysis

  31. GC (Garbage Collection)

  32. Sensitivity Test
  [Figures: 64-chip, 256-chip, and 1024-chip configurations]
