Challenges in Getting Flash Drives Closer to CPU Myoungsoo Jung (UT-Dallas) Mahmut Kandemir (PSU) The University of Texas at Dallas
Take-away • Leveraging the PCIe bus as a storage interface • ≠ conventional memory system interconnects • ≠ thin storage interfaces • Requires a new SSD architecture and storage stack • Motivation: there are not many studies focusing on the system characteristics of these emerging PCIe SSD platforms • Contributions: we quantitatively analyze the challenges PCIe SSDs face in getting flash memory closer to the CPU • Memory consumption • Computation resource requirements • Performance as a shared storage system • Latency impact of their storage-level queuing mechanisms
Bandwidth Trend • Bandwidth improvement (150 MB/s ~ 600 MB/s)
Bandwidth Trend • SSDs have improved their bandwidth by 4x • SSDs begin to blur the distinction between block-semantic and memory-semantic devices
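As a rough sanity check on the 4x figure (our own arithmetic, assuming the standard SATA line rates with 8b/10b encoding, which is not spelled out on the slide):

$$
\frac{1.5\,\text{Gb/s}\times\frac{8}{10}}{8\,\text{bit/B}} = 150\,\text{MB/s},
\qquad
\frac{6\,\text{Gb/s}\times\frac{8}{10}}{8\,\text{bit/B}} = 600\,\text{MB/s},
\qquad
\frac{600\,\text{MB/s}}{150\,\text{MB/s}} = 4\times
$$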
Flash Storage Migration • [Diagram: CPU cores connected to flash devices, with the storage interface as the bottleneck] • The PCIe interface is by far one of the easiest ways to integrate flash memory into the processor-memory complex • The idea is to take SSDs out of the I/O controller hub and locate them as close to the CPU side as possible
Flash Integration • Bridge-based PCIe SSD (BSSD) • From-scratch PCIe SSD (FSSD)
Bridge-based PCIe SSD (BSSD) • Built by placing multiple traditional SAS/SATA SSD controllers behind a bridge controller • The bridge controller exposes the aggregated SAS/SATA SSD performance to the host • (RC = Root Complex, CTRL = Controller, EP = Endpoint, HBA = Host Block Adapter)
Bridge-based PCIe SSD (BSSD) • Pros: high compatibility, fast development process • Cons: redundant control logic, computational overheads, encoding/decoding overheads
From-scratch PCIe SSD (FSSD) • An FSSD is built from the bottom up by directly interconnecting the NAND flash interface and the external PCIe link • PCIe endpoints (EPs) have upstream and downstream buffers, which control in-bound and out-bound I/O requests • The PCIe EPs and switch are implemented as native PCIe controllers forming a point-to-point PCIe link network
From-scratch PCIe SSD (FSSD) • Pros: highly scalable, exposes the underlying flash performance • Cons: protocol design/implementation effort, tailoring of SW/HW, resource competition with the host
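To make the EP buffering concrete, here is a minimal C sketch of an endpoint modeled as two bounded FIFO buffers: an upstream buffer for in-bound requests from the host and a downstream buffer for out-bound completions. The struct layout, queue depth, and names are illustrative assumptions, not the actual FSSD controller design.

```c
/* Toy model of an FSSD PCIe endpoint: two bounded FIFO buffers.
 * Sizes, fields, and names are assumptions for illustration only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EP_QUEUE_DEPTH 64            /* assumed buffer depth */

typedef struct {                     /* one in-flight I/O request */
    uint64_t lba;
    uint32_t sectors;
    bool     is_write;
} io_req_t;

typedef struct {                     /* bounded FIFO ring buffer */
    io_req_t slots[EP_QUEUE_DEPTH];
    unsigned head, tail, count;
} ep_queue_t;

static bool ep_enqueue(ep_queue_t *q, io_req_t req) {
    if (q->count == EP_QUEUE_DEPTH)  /* buffer full: back-pressure */
        return false;
    q->slots[q->tail] = req;
    q->tail = (q->tail + 1) % EP_QUEUE_DEPTH;
    q->count++;
    return true;
}

static bool ep_dequeue(ep_queue_t *q, io_req_t *out) {
    if (q->count == 0)
        return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % EP_QUEUE_DEPTH;
    q->count--;
    return true;
}

typedef struct {                     /* endpoint = upstream + downstream buffers */
    ep_queue_t upstream;             /* in-bound requests from the root complex */
    ep_queue_t downstream;           /* out-bound completions toward the host   */
} pcie_endpoint_t;

int main(void) {
    pcie_endpoint_t ep = {0};
    io_req_t req = { .lba = 4096, .sectors = 8, .is_write = false };

    ep_enqueue(&ep.upstream, req);   /* host pushes a read request */
    io_req_t done;
    if (ep_dequeue(&ep.upstream, &done)) {   /* device drains and completes it */
        ep_enqueue(&ep.downstream, done);
        printf("completed LBA %llu\n", (unsigned long long)done.lba);
    }
    return 0;
}
```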
Flash Software Stack • Host side: file system / database → host block storage layer → HBA device driver, communicating over a logical block I/O interface • Storage side: host interface layer (NVMHC), flash software (FTL), and hardware abstraction layer • The FTL provides the buffer cache, address mapping, and wear-leveling
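As a concrete illustration of the FTL's address-mapping role, below is a minimal page-level mapping sketch in C. The geometry, table layout, and function names are illustrative assumptions, not the FTL used in the evaluated SSDs, and a real FTL would also implement the buffer cache, wear-leveling, and garbage collection mentioned above.

```c
/* Minimal page-level FTL address-mapping sketch (illustrative only). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGES_PER_BLOCK 128          /* assumed flash geometry */
#define NUM_BLOCKS      1024
#define INVALID_PPN     UINT32_MAX

typedef struct {
    uint32_t *l2p;                   /* logical page -> physical page */
    uint32_t  next_free_ppn;         /* naive log-structured allocator, no GC */
    uint32_t  total_pages;
} ftl_t;

static void ftl_init(ftl_t *ftl) {
    ftl->total_pages = PAGES_PER_BLOCK * NUM_BLOCKS;
    ftl->l2p = malloc(sizeof(uint32_t) * ftl->total_pages);
    for (uint32_t i = 0; i < ftl->total_pages; i++)
        ftl->l2p[i] = INVALID_PPN;   /* nothing mapped yet */
    ftl->next_free_ppn = 0;
}

/* Out-of-place write: every write goes to a fresh physical page and
 * only the mapping table is updated (the old page becomes stale). */
static uint32_t ftl_write(ftl_t *ftl, uint32_t lpn) {
    uint32_t ppn = ftl->next_free_ppn++;
    ftl->l2p[lpn] = ppn;
    return ppn;
}

static uint32_t ftl_read(const ftl_t *ftl, uint32_t lpn) {
    return ftl->l2p[lpn];            /* INVALID_PPN if never written */
}

int main(void) {
    ftl_t ftl;
    ftl_init(&ftl);
    ftl_write(&ftl, 42);             /* map logical page 42 */
    printf("LPN 42 -> PPN %u\n", ftl_read(&ftl, 42));
    free(ftl.l2p);
    return 0;
}
```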
Experimental Setup • Host configuration: quad-core i7 Sandy Bridge @ 3.4GHz, 16GB memory (4x 4GB DDR3-1333 DIMM) • Extra external HDD (for logging the footprints) • Note: most performance values observed with FSSD are about 40% better than with BSSD
Tool • Synthesized micro-benchmark workloads with Iometer • Modified Iometer: • Time-series evaluation: a script that generates log data every second • Memory-usage evaluation: a module added to Iometer that calls the system API GlobalMemoryStatusEx()
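For reference, a minimal sketch of how physical-memory consumption can be sampled with GlobalMemoryStatusEx(), the Win32 call named on the slide. The one-second sampling loop and the derived "used memory" metric are our illustrative assumptions, not the authors' actual Iometer patch.

```c
/* Minimal Win32 sketch: sample physical memory usage once per second.
 * The derived "used" metric and output format are assumptions. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    for (;;) {
        MEMORYSTATUSEX ms;
        ms.dwLength = sizeof(ms);            /* required before the call */
        if (GlobalMemoryStatusEx(&ms)) {
            ULONGLONG used = ms.ullTotalPhys - ms.ullAvailPhys;
            printf("%lu%% load, %.2f GB used\n",
                   ms.dwMemoryLoad,
                   (double)used / (1024.0 * 1024.0 * 1024.0));
        }
        Sleep(1000);                         /* 1-second sampling period */
    }
    return 0;
}
```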
Memory Usage (Overall) • Physical memory consumption for reads and writes, request sizes of 1~512 sectors • BSSD stays at about 0.6 GB in both cases • FSSD consumes 2.5x more memory space overall, and 3x~16x more across the request sizes
Memory Usage (BSSD) • Memory consumption over time • The host submits I/Os whenever the device is available (128 entries) • BSSD requires only 0.6GB of memory space regardless of I/O type and size
Memory Usage (FSSD) • Memory requirements start around 2GB • As the I/O process progresses, memory usage keeps increasing in a logarithmic fashion and reaches 10GB • A 10GB memory footprint just to manage the underlying SSD may not be acceptable in many applications
CPU Usage (BSSD) • Host-level CPU usage, time series • BSSD consumes 15%~30% of total CPU cycles for handling I/O requests
CPU Usage (FSSD) • FSSD requires much higher CPU usage (50%~90%) • I/O service with queue-mode operation requires 50% more CPU cycles, consuming 60% of the cycles on the host-side CPU • CPU usage over 60% for I/O processing alone can degrade overall system performance
FSSD performance (multi-threads) • Latency: worse than with four workers by 118% and worse than with a single worker by 289% • Throughput: 2.2x better than with a single worker • FSSD offers very stable and predictable performance
FSSD resource usage (multi-threads) • Memory consumption: requires 134% more memory space • CPU usage: requires 201% more computation resources • The advantage decreases because of the high memory requirements and CPU usage
BSSD resource usage (multi-threads) • Memory consumption: similar memory requirements (less than 0.66GB) irrespective of the number of threads • CPU usage: similar CPU usage (less than 30%) irrespective of the number of threads
BSSD performance (multi-threads) • Latency: worse than with four workers by 289% and worse than with a single worker by 708% • Throughput: no differences with varying numbers of workers • A write-cliff occurs (garbage collection impact)
Latency Impact on a Queuing Method • FSSD: queued requests are worse than a legacy request by 99x~106x • BSSD: queued requests are worse than a legacy request by 86x~184x
Summary • Design trade-off between performance and resource utilization • All-flash arrays • Data-center/HPC local-node SSDs • Software stack optimization • Co-operative approaches • Unified/direct file systems • Garbage collection schedulers • Queue control • We are constructing an environment for automated SSD evaluation at camelab.org