DFTL: A flash translation layer employing demand-based selective caching of page-level address mappings A. Gupta, Y. Kim, B. Urgaonkar, Penn State ASPLOS 2009 Shimin Chen, Big Data Reading Group
Introduction • Goal: improve the performance of flash-based devices for workloads with random writes • New proposal: DFTL (Demand-based FTL) • FTL: flash translation layer • The FTL maintains a mapping table: virtual → physical address
Outline • Introduction • Background on FTL • Design of DFTL • Experimental Results • Summary
Basics of Flash Memory • OOB (out-of-band) area: • ECC • Logical page number • State: erased/valid/invalid
Flash Translation Layer • Maintains the mapping: • Virtual address (exposed to the upper level) → physical address (on flash) • Uses a small, fast SRAM to store this mapping • Hides erase operations from the upper level • Avoiding in-place updates • Updating a clean page • Performing garbage collection and erasure • Note: • The OOB area holds the physical → virtual mapping • The FTL's virtual → physical mapping can therefore be rebuilt (at restart)
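Below is a minimal sketch (my illustration, not from the paper) of how the virtual → physical table could be rebuilt at restart from per-page OOB data; the flash is simulated as a simple list where `oob[ppn] = (lpn, state)`.

```python
# Minimal sketch: rebuild the logical -> physical mapping by scanning OOB areas.
ERASED, VALID, INVALID = range(3)

def rebuild_mapping(oob):
    """Return the logical -> physical page table reconstructed from OOB data."""
    mapping = {}
    for ppn, (lpn, state) in enumerate(oob):
        if state == VALID:            # only valid pages hold live data
            mapping[lpn] = ppn
    return mapping

# Logical page 7 was rewritten: its old copy at ppn 0 is invalid,
# and the live copy sits at ppn 2.
oob = [(7, INVALID), (3, VALID), (7, VALID)]
assert rebuild_mapping(oob) == {3: 1, 7: 2}
```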
Page-Level FTL • Keep page to page mapping table • Pro: can map any logical page to any physical page • Efficient flash page utilization • Con: mapping table is large • E.g., 16GB flash, 2KB flash page, requires 32MB SRAM • As flash size increases, SRAM size must scale • Too expensive!
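A quick back-of-the-envelope check of the 32 MB figure, assuming 4-byte mapping entries (the entry size is my assumption; the slide does not state it):

```python
# Page-level mapping table size for 16 GB of flash with 2 KB pages,
# assuming 4 bytes per LPN -> PPN entry.
flash_bytes = 16 * 2**30
page_bytes = 2 * 2**10
entry_bytes = 4

num_pages = flash_bytes // page_bytes      # 8,388,608 logical pages
table_bytes = num_pages * entry_bytes      # 33,554,432 bytes
print(num_pages, table_bytes // 2**20)     # -> 8388608 32 (MB of SRAM)
```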
Block-Level FTL • Keep block to block mapping • Pro: small • Mapping table size reduced by a factor of (block size / page size), ~64 times • Con: a page's offset within its block is fixed • Garbage collection overheads grow
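The ~64x reduction follows directly from the block geometry used later in the talk (128 KB blocks, 2 KB pages):

```python
# Block-level mapping keeps one entry per block instead of one per page,
# so the table shrinks by (block size / page size).
block_bytes = 128 * 2**10
page_bytes = 2 * 2**10
print(block_bytes // page_bytes)   # -> 64
```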
Hybrid FTLs (a generic description) • LPN: Logical Page Number • Data blocks: block-level mapping • Log/update blocks: page-level mapping
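A minimal lookup sketch for such a hybrid scheme (my illustration, not any particular FTL): updates are resolved through the log area's page-level map, and everything else through the block-level map with a fixed in-block offset.

```python
# Hybrid FTL lookup sketch: page-level map for log blocks,
# block-level map (fixed page offset) for data blocks.
PAGES_PER_BLOCK = 64

def hybrid_lookup(lpn, log_page_map, data_block_map):
    """Return the physical page number holding the latest copy of `lpn`."""
    if lpn in log_page_map:                      # freshest copy is in a log block
        return log_page_map[lpn]
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)   # otherwise use the block-level map
    return data_block_map[lbn] * PAGES_PER_BLOCK + offset

# Logical block 0 -> physical block 5; logical page 3 was updated and its
# latest copy lives at physical page 1000 in the log area.
data_block_map = {0: 5}
log_page_map = {3: 1000}
assert hybrid_lookup(3, log_page_map, data_block_map) == 1000
assert hybrid_lookup(4, log_page_map, data_block_map) == 5 * 64 + 4
```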
Operations in Hybrid FTLs • Update on data blocks: write to log blocks • Log region is small (e.g., 3% of total flash size) • Garbage collection (gc) • When no free log blocks are available, invoke gc to merge log blocks with data blocks
Full Merge can be Recursive, thus Expensive • Often results from random writes • A log block filled by random writes holds pages from many data blocks, so merging it forces a merge with each of those data blocks
Outline • Introduction • Background on FTL • Design of DFTL • Experimental Results • Summary
DFTL Idea • Avoid expensive full merges entirely • Do not use log blocks at all • Idea: • Use page-level mapping • Keep the full mapping on flash to reduce SRAM use • Exploit temporal locality in workloads • Dynamically load / unload page-level mappings into SRAM
DFTL Architecture (figure: global mapping table)
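As a rough sketch of the bookkeeping this architecture implies (my reading of the slides, with illustrative sizes): the full page-level map lives in translation pages on flash, a small global translation directory (GTD) in SRAM records where each translation page currently sits, and the cached mapping table (CMT) in SRAM holds the hot LPN → PPN entries.

```python
# Illustrative DFTL bookkeeping structures (sizes are assumptions, not the paper's).
ENTRIES_PER_TRANSLATION_PAGE = 512   # 2 KB page / 4-byte entries

cmt = {}                 # SRAM: cached mapping table, LPN -> PPN (LRU-managed)
gtd = {}                 # SRAM: global translation directory,
                         #       translation-page number -> its PPN on flash
translation_pages = {}   # flash: translation-page PPN -> {LPN: data PPN}

def translation_page_no(lpn):
    """Which translation page holds the mapping entry for this logical page."""
    return lpn // ENTRIES_PER_TRANSLATION_PAGE
```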
DFTL Address Translation • Case 1: the requested LPN hits in the cached mapping table (CMT) • Done: retrieve the mapping directly
DFTL Address Translation • Case 2: a miss in the cached mapping table (CMT), and the CMT is not full • Look up the GTD (global translation directory) • Read the translation page • Fill in the CMT entry • Go to case 1
DFTL Address Translation • Case 3: a miss in the cached mapping table (CMT), and the CMT is full • Select a CMT entry to evict (~LRU) • Write back the entry if dirty • Go to case 2
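A minimal sketch tying the three cases together (my illustration; the CMT is an LRU-ordered dict and translation pages are simulated in memory, so the GTD here maps directly to page contents rather than to a flash address — helper names are not from the paper):

```python
from collections import OrderedDict

ENTRIES_PER_TP = 512
CMT_CAPACITY = 4                    # tiny capacity, just for illustration

cmt = OrderedDict()                 # SRAM cache: LPN -> [PPN, dirty]
gtd = {}                            # translation-page number -> simulated page
                                    # contents ({LPN: PPN})

def read_translation_page(tp_no):
    """Simulated flash read of the translation page tracked by the GTD."""
    return gtd.setdefault(tp_no, {})

def write_back(lpn, ppn):
    """Simulated read-modify-write of the translation page holding `lpn`."""
    read_translation_page(lpn // ENTRIES_PER_TP)[lpn] = ppn

def translate(lpn):
    if lpn in cmt:                              # case 1: CMT hit
        cmt.move_to_end(lpn)
        return cmt[lpn][0]
    if len(cmt) >= CMT_CAPACITY:                # case 3: CMT full, evict ~LRU
        victim_lpn, (victim_ppn, dirty) = cmt.popitem(last=False)
        if dirty:
            write_back(victim_lpn, victim_ppn)  # 1 extra read + 1 write on flash
    tp = read_translation_page(lpn // ENTRIES_PER_TP)   # case 2: GTD -> page read
    cmt[lpn] = [tp[lpn], False]                 # fill the CMT entry (assumes the
    return cmt[lpn][0]                          # LPN is already mapped), as case 1
```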
Address Translation Cost • Worst-case cost (case 3): • 2 translation page reads • 1 translation page write • Temporal locality helps: • More hits, fewer misses, fewer evictions • The CMT often holds multiple mappings from the same translation page, enabling batch updates
Data Read • Address translation: LPN → PPN • Read the data page at PPN
Writes • Current data block: an updated data page is appended to the current data block • Current translation block: an updated translation page is appended to the current translation block • When the number of free blocks drops below GC_threshold, garbage collection is invoked
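A write-path sketch along the lines of these bullets (illustrative; GC_THRESHOLD and the field names are my own, and CMT capacity handling is omitted — see the translation sketch above): the new page is programmed into the next free slot of the current data block, and the mapping change stays in the CMT marked dirty until eviction or GC writes it back.

```python
PAGES_PER_BLOCK = 64
GC_THRESHOLD = 2                    # illustrative value, not from the paper

def write_page(lpn, data, ftl):
    """ftl: dict with 'cmt', 'current_block', 'current_block_no', 'free_blocks'."""
    block = ftl["current_block"]
    block.append(data)                                   # append-only programming
    ppn = ftl["current_block_no"] * PAGES_PER_BLOCK + len(block) - 1
    ftl["cmt"][lpn] = [ppn, True]                        # dirty CMT entry: the
                                                         # translation page is only
                                                         # rewritten lazily
    if len(block) == PAGES_PER_BLOCK:                    # block full: switch to a
        ftl["current_block_no"] += 1                     # fresh block from the pool
        ftl["current_block"] = []
        ftl["free_blocks"] -= 1
    if ftl["free_blocks"] < GC_THRESHOLD:
        ftl["needs_gc"] = True                           # garbage collection kicks
                                                         # in (next slides)
```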
Garbage Collection • Select a victim block [15] Kawaguchi et al. 1995
Garbage Collection • If the selected victim block is a translation block • Copy valid pages to a free translation block • Update the GTD (global translation directory) • If the selected victim block is a data block • Copy valid pages to a free data block • Update the page-level translation for each copied data page • Possibly update the CMT entry (if cached, done) • Otherwise locate the translation page, update it, and change the GTD • Batch-update opportunities if multiple page-level translations are in the same translation page
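A garbage-collection sketch covering the two victim cases (my illustration; victim selection and the real flash primitives are abstracted away, and helper names are not from the paper). The `batched` dict is where the batch-update opportunity from the last bullet shows up: all moved pages covered by one translation page cost a single rewrite.

```python
ENTRIES_PER_TP = 512

def garbage_collect(victim, ftl):
    """victim: {'kind': 'translation' | 'data', 'valid': [(id, old_ppn), ...]},
       where id is a translation-page number or an LPN, respectively."""
    if victim["kind"] == "translation":
        for tp_no, old_ppn in victim["valid"]:
            new_ppn = copy_page(ftl, old_ppn)        # move the valid translation page
            ftl["gtd"][tp_no] = new_ppn              # GTD now points at the copy
    else:
        batched = {}                                 # translation-page no -> updates
        for lpn, old_ppn in victim["valid"]:
            new_ppn = copy_page(ftl, old_ppn)        # move the valid data page
            if lpn in ftl["cmt"]:
                ftl["cmt"][lpn] = [new_ppn, True]    # cached: done, written back later
            else:
                batched.setdefault(lpn // ENTRIES_PER_TP, {})[lpn] = new_ppn
        for tp_no, updates in batched.items():       # one rewrite per translation page,
            page = dict(ftl["translation_pages"].get(ftl["gtd"].get(tp_no), {}))
            page.update(updates)                     # however many LPNs it covers
            new_ppn = copy_page(ftl, None)
            ftl["translation_pages"][new_ppn] = page
            ftl["gtd"][tp_no] = new_ppn
    erase_victim_block(ftl)

def copy_page(ftl, _old_ppn):
    """Simulated page copy into the current block; returns the new PPN."""
    ftl["next_free_ppn"] = ftl.get("next_free_ppn", 0) + 1
    return ftl["next_free_ppn"]

def erase_victim_block(ftl):
    """Simulated erase: the victim block returns to the free pool."""
    ftl["free_blocks"] = ftl.get("free_blocks", 0) + 1
```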
Benefits • Page-level mapping: • No expensive full merge operations • Better random write performance as a result • But random writes are still worse than sequential: • More CMT misses, more translation page writes • Data pages in a block are more scattered • GC costs are higher: fewer opportunities for batch updates
Outline • Introduction • Background on FTL • Design of DFTL • Experimental Results • Summary
FTL Schemes Implemented • FlashSim simulator • The authors enhanced DiskSim • Block-based FTL • A state-of-the-art hybrid FTL (FAST FTL) • DFTL • An idealized page-based FTL
Experimental Setup • Model: 32 GB flash memory, 2 KB pages, 128 KB blocks • Timing parameters are given in Table 1
Block Erases (figure) • Baseline: idealized page-level FTL
Extra Read/Write Operations (figure) • 63% CMT hit ratio for the Financial trace
CDF (figure) • Address translation overhead shows up
CDF (figure) • FAST has a long tail
Summary • Demand-based page-level FTL • Two-level page table: • (Flash) Translation pages: LPN → PPN entries • (SRAM) Global translation directory: locations of the translation pages • Mapping cache (CMT) in SRAM