Enhancing recovery and ramp-up performance of DBMS when using an SSD buffer-pool extension Wang Jiangtao 2013-10-18
Outline • Introduction • SSD-based extension buffer • Enhancing recovery by SSD • Two related works • Enhancing recovery using…[DaMoN2011] • Fast peak-to-peak ….[ICDE2013] • Summary
Evolution of HDD • Hard disk drive (HDD) • Access rates have been flat for ~13 years • Disk density growth projection bleak • Capacity growth is now about to flatten significantly • Power savings not realized
Solid State Drive • Solid State Drive (SSD) • A semiconductor device • Free of mechanical components • 3D NAND flash memory • Technical merits • High IOPS (>50,000) • High bandwidth (>500 MB/s) • Low power: 0.06 W (idle) ~ 2.4 W (active) • Shock resistance
Integrating SSD and HDD • Background • Performance depends heavily on memory, I/O bandwidth, and access latency (e.g., web servers) • SSD at disk-scale capacity is not going to be a reality • Price ($/GB): RAM >> SSD > Disk • Read performance >> write performance (SSD) • Only a small amount of data is hot! • Cost-effectiveness is the primary factor for large data centers • ……
Outline • Introduction • SSD-based extension buffer • Enhancing recovery by SSD • Two related works • Enhancing recovery using…[DaMoN2011] • Fast peak-to-peak ….[ICDE2013] • Summary
SSD as cache-buffer • Basic Framework • B. Debnath, et al. FlashStore: High Throughput Persistent Key-Value Store. VLDB 2010 • J. Do, et al. Turbocharging DBMS Buffer Pool Using SSDs. SIGMOD 2011 • W.H. Kang, et al. Flash-based Extended Cache for Higher Throughput and Faster Recovery. VLDB 2012 • J. Do, et al. Fast Peak-to-Peak Behavior with SSD Buffer Pool. ICDE 2013
Applications in Industry • Intel (Differentiated Storage Services) • Intel's SSD caching solution keeps temporary mirror copies of frequently needed files in the SSD cache. • Apple (Fusion Drive) • A Fusion Drive combines an SSD and a hard disk. • Frequently used apps, documents, photos, and other files are kept on flash; • all writes go to the SSD, and rarely used content is migrated to the hard disk. • Oracle Exadata (Database Machine) • Integrates scalable servers and storage, InfiniBand networking, intelligent storage, PCI flash, smart in-memory caching, and hybrid columnar compression into a unified hardware/software data-management platform. • The Smart Flash Cache addresses the random disk I/O bottleneck by transparently caching frequently accessed hot data in high-speed solid-state storage.
Outline • Introduction • SSD-based extension buffer • Enhancing recovery by SSD • Two related works • Enhancing recovery using…[DaMoN2011] • Fast peak-to-peak ….[ICDE2013] • Summary
Recovery for SSD-based cache system • Problem definition • A small amount of SSD can absorb a large fraction of the random I/O. • A long time is needed when restarting the DBMS after a shutdown or a crash. • Little emphasis has been placed on exploiting the persistence of SSDs.
Recovery for SSD-based cache system • Challenge • How to improve the performance of recovery without negatively impacting peak performance. • How to ensure the correctness of DBMS when executing recovery algorithm.
Outline • Introduction • SSD-based extension buffer • Enhancing recovery by SSD • Two related works • Enhancing recovery using…[DaMoN2011] • Fast peak-to-peak ….[ICDE2013] • Summary
Motivation • Recovery is itself a random I/O intensive process. • The pages that need to be read and written during recovery may be scattered over various parts of the disk. • Preserve the state of the SSD buffer pool so that it can be used during crash recovery. • Provide a warm buffer pool restart
TAC (VLDB2010) • TAC • Write-through • Temperature-based data prefetch • M. Canim, et al. SSD Bufferpool Extensions for Database Systems. VLDB 2010
Implementing Recovery • Metadata persistence • Store some SSD buffer-pool metadata on the persistent SSD storage • Mapping information synchronization • When a new page is admitted to the SSD buffer pool and an old page is evicted, the slot table must be updated. • When a dirty page is evicted from the RAM-resident buffer pool, no modification of the slot table is required.
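The write-through admission protocol above can be sketched as follows. This is a minimal illustration (the class and function names are my own, not from the paper), showing why the slot is invalidated before the SSD write and the mapping is published only afterwards, so a crash mid-write never leaves a stale slot pointing at torn data:

```python
# Hypothetical sketch of slot-table maintenance for a write-through SSD
# cache. The slot table maps SSD frame -> pageID; None marks a free slot.

class SlotTable:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def admit(self, slot_id, page_id, ssd_write):
        self.slots[slot_id] = None      # 1. invalidate before the SSD write
        ssd_write(slot_id, page_id)     # 2. copy the page into the SSD frame
        self.slots[slot_id] = page_id   # 3. publish the mapping afterwards

table = SlotTable(4)
writes = []
table.admit(2, page_id=99, ssd_write=lambda s, p: writes.append((s, p)))
```

If the system crashes between steps 2 and 3, the slot reads as invalid on restart; the page is simply re-fetched from disk, which is always safe (at the cost of missing some valid data that was in fact on the SSD).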
Recovery for TAC • Correctness for TAC • A slot is initially invalidated • The slot is updated only after the SSD write has finished • As a result, some valid data on the SSD may be missed after a crash
Experiment Results • Experiment setup • 500 warehouses (TPC-C) • The RAM was kept at 2.0% of the database size. • (Figures: impact of metadata writes; impact on logging)
Experiment Results • (Figures: crash performance; restart performance)
Summary for recovery in TAC • The experiments are thorough and the analysis is insightful. • However, the metadata file is small (23 MB), and the size ratio between SSD and RAM is 3 (3.6 GB / 1.2 GB) • So the cost of synchronization is relatively low
Motivation • With an SSD buffer-pool, a DBMS still treats the disks as the permanent "home" of data. • Such schemes have a long "peak-to-peak interval" when restarting a DBMS. • We need a fast mechanism to reduce the restart and ramp-up time
Background: Two SSD buffer-pool extension designs • DW • Write-through • LC • Write-back • J. Do, et al. Turbocharging DBMS Buffer Pool Using SSDs. SIGMOD 2011
Background: Two SSD buffer-pool extension designs • Data structure • SSD buffer table • J. Do, et al. Turbocharging DBMS Buffer Pool Using SSDs. SIGMOD 2011
Background: Recovery in SQL Server 2012 • Data structures • Transaction log • Update log records (pageID, prepageLSN, …), BUF_WRITE log records, ….. • Dirty page table • Stores information about dirty pages in main memory • (pageID, recLSN, lastLSN, …) • Transaction table • Stores information about active transactions • (beginLSN, endLSN, …) • …… • Checkpoint
Background: Recovery in SQL Server • Recovery • Analysis phase • Build dirty page table • Build transaction table • Build lock table • …… • Redo phase • Undo phase
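The analysis phase above can be sketched as a single scan over the log. This is a simplified ARIES-style illustration (the record format and field names are assumed), not SQL Server's actual implementation:

```python
# Minimal sketch of the analysis phase: one forward log scan rebuilds a
# dirty page table (pageID -> recLSN) and a transaction table.

def analysis(log):
    dirty_pages = {}   # pageID -> recLSN (LSN that first dirtied the page)
    transactions = {}  # txID -> lastLSN of that transaction
    for lsn, tx, kind, page in log:
        if kind == "UPDATE":
            transactions[tx] = lsn
            dirty_pages.setdefault(page, lsn)  # keep the earliest recLSN
        elif kind == "COMMIT":
            transactions.pop(tx, None)         # tx no longer needs undo
    return dirty_pages, transactions

log = [
    (10, "T1", "UPDATE", "P5"),
    (20, "T2", "UPDATE", "P5"),
    (30, "T1", "COMMIT", None),
]
dp, tt = analysis(log)
```

The redo phase then starts from the smallest recLSN in the dirty page table, and the undo phase rolls back the transactions still present in the transaction table.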
Restart design • Some pitfalls in using the SSD after a restart • Different versions of data on the SSD and on disk • In DW, FC modification is delayed until both the SSD write and the disk write have completed. • In LC, a BUF_WRITE log record is generated after the lazy cleaner finishes copying a dirty SSD page to disk. • In LC, oldestDirtyLSN is the oldest recLSN among the dirty pages in RAM and in the SSD buffer pool.
MMR design • Main idea • Stores the mapping table on the SSD • Synchronously updates the mapping table • Hardening the FC fields • State, pageID, lastUseTime, nextToLastUseTime • When to harden • When a clean SSD frame is about to be replaced, flush the state change. • Minimize the number of flushes • Recovering the SSD buffer table • Recover the state of each FC (FREE, CLEAN, or DIRTY) • Rebuild the data structures • Recover the recLSN of each FC after the analysis phase
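The MMR restart step might look like the following sketch (field names and structure are illustrative, not the paper's): the hardened frame-control (FC) array is read back from the SSD, and the in-memory SSD buffer table is rebuilt from the non-free frames:

```python
# Illustrative MMR restart: rebuild the SSD buffer table (pageID -> frame)
# from the hardened FC array, keeping CLEAN and DIRTY frames only.

def rebuild_ssd_table(hardened_fcs):
    ssd_table = {}
    for frame_id, fc in enumerate(hardened_fcs):
        if fc["state"] in ("CLEAN", "DIRTY"):
            ssd_table[fc["pageID"]] = frame_id
    return ssd_table

fcs = [
    {"state": "FREE",  "pageID": None},
    {"state": "CLEAN", "pageID": 7},
    {"state": "DIRTY", "pageID": 9},
]
table = rebuild_ssd_table(fcs)
```

Because MMR hardens synchronously, the recovered table can be trusted directly; the recLSN of each DIRTY frame is filled in after the analysis phase completes.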
LBR design • Main idea • Checkpoint the SSD buffer table during a DBMS checkpoint • Log updates to the SSD buffer table through SSD log records • Define the protocol for checkpointing, logging, and recovery • Hardening the FC fields • State, pageID, lastUseTime, nextToLastUseTime • SSD log records • SSD_CHKPT: • hardens the states of every 64 FCs • SSD_WRITE_INVALIDATE: • written before a clean SSD page is overwritten when no free SSD frame is available • SSD_POST_WRITE: • written after a page is written to the SSD • SSD_LAZY_CLEANED: • written after a dirty SSD page is cleaned
LBR design • When to harden • Only the SSD_PRE_WRITE_INVALIDATE log record must be flushed to disk before the thread that generated it can continue. • Group Writing Optimization • Recovery • SSD_CHKPT: • if an FC is DIRTY, recover its recLSN field and update the SSD hash table • SSD_WRITE_INVALIDATE: • invalidate the corresponding FC • SSD_POST_WRITE: • the same processing as for an SSD_CHKPT log record • SSD_LAZY_CLEANED: • change the FC state from LAZYCLEANING to CLEAN
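The LBR recovery rules above amount to a replay loop over the SSD log, starting from the most recent SSD_CHKPT. A hedged sketch (the record shapes are assumed, not the paper's exact format):

```python
# Sketch of LBR-style replay: start from the FC states captured at the
# last SSD checkpoint and apply later SSD log records in order.

def replay(checkpoint_fcs, ssd_log):
    fcs = {f: dict(fc) for f, fc in checkpoint_fcs.items()}  # copy
    for rec in ssd_log:
        fc = fcs[rec["frame"]]
        if rec["type"] == "SSD_WRITE_INVALIDATE":
            fc["state"], fc["pageID"] = "FREE", None
        elif rec["type"] == "SSD_POST_WRITE":
            fc["state"], fc["pageID"] = "DIRTY", rec["pageID"]
        elif rec["type"] == "SSD_LAZY_CLEANED":
            fc["state"] = "CLEAN"
    return fcs

fcs0 = {0: {"state": "CLEAN", "pageID": 3}}
log = [
    {"type": "SSD_WRITE_INVALIDATE", "frame": 0},
    {"type": "SSD_POST_WRITE", "frame": 0, "pageID": 8},
    {"type": "SSD_LAZY_CLEANED", "frame": 0},
]
out = replay(fcs0, log)
```

This is why the log-based scheme needs a larger space budget than metadata files: the per-frame history between checkpoints is kept explicitly as log records.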
LVR design • Main idea • Asynchronously harden the SSD buffer table • Deal with invalid SSD buffer-table records recovered from the most recent flush • Ensure two properties • The database must remain consistent if the design chooses to reuse a page in the SSD buffer pool upon restart: • detect when the pageID of an FC differs from the actual SSD page • The database must remain consistent if the design chooses to discard a page in the SSD buffer pool upon restart, even if the SSD page is newer than the disk version: • oldestDirtyLSN
LVR design • Hardening the FC fields • State, pageID, lastUseTime, nextToLastUseTime, blank, beforeHardeningLSN • The FC flusher thread • repeatedly scans the SSD buffer table in chunks and hardens the FCs
LVR design • Checkpoint • make sure that the FC flusher thread finishes a complete pass of hardening the SSD buffer table during a checkpoint. • Recovering from shutdown
LVR design • Checkpoint • make sure that the FC flusher thread finishes a complete pass of hardening the SSD buffer table during a checkpoint. • Recovering from crash
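Because LVR hardens FCs asynchronously, a recovered FC may be stale. One safe way to handle this, sketched under my own simplifying assumptions (validating each FC against the page identity actually stored on the SSD and discarding on mismatch), follows the first consistency property above:

```python
# Illustrative LVR-style validity check: a recovered FC is reused only if
# its mapping matches the page actually present in the SSD frame;
# otherwise the frame is dropped, which is always safe because the disk
# still holds a consistent copy of the page.

def validate(recovered_fcs, read_page_id_on_ssd):
    valid = {}
    for frame_id, fc in recovered_fcs.items():
        if fc["state"] == "FREE":
            continue
        if read_page_id_on_ssd(frame_id) == fc["pageID"]:
            valid[frame_id] = fc          # mapping provably current: reuse
        # mismatch: discard the frame rather than risk serving stale data
    return valid

ssd_contents = {0: 7, 1: 5}               # frame -> pageID actually on SSD
fcs = {
    0: {"state": "CLEAN", "pageID": 7},   # matches -> reused
    1: {"state": "CLEAN", "pageID": 9},   # stale mapping -> discarded
}
kept = validate(fcs, ssd_contents.get)
```

Discarding is the conservative choice: it trades some lost warm-cache content for correctness, which is exactly the trade-off the asynchronous design accepts in exchange for cheap hardening.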
Experiment results • Experiment setup • 24 GB RAM, 140 GB SSD, 200 GB database • SQL Server 2012 • Dirty fraction: 20% • Throughput after restart • (Figures: TPC-E; TPC-C)
Experiment results • TPC-C evaluation • Peak-to-peak interval • (Figures: restarting from a shutdown; restarting from a crash)
Experiment results • TPC-E evaluation • Peak-to-peak interval • (Figures: restarting from a shutdown; restarting from a crash)
Outline • Introduction • SSD-based extension buffer • Enhancing recovery by SSD • Two related works • Enhancing recovery using…[DaMoN2011] • Fast peak-to-peak ….[ICDE2013] • Summary
Summary • Basic requirements • Ensure the consistency and correctness of the DBMS • Minimize the cost of hardening mapping information • Design different recovery algorithms for different caching policies • Various pitfalls • Logs vs. metadata files • Log-based schemes require more space • Higher complexity when designing the recovery algorithm
Summary • Emerging memory technology • Harden metadata to PCM synchronously • Scan PCM and rebuild the mapping table for the SSD • Design principles • Finer-grained access granularity • Minimize PCM writes • Design an index to reduce the performance loss • (Figure: memory hierarchy — CPU with L1/L2 cache, DRAM; SSD holds data, PCM holds metadata)
Summary • Asynchronous hardening • The mapping file is created on the SSD • Each flash page is responsible for one SSD data area • Only harden the data areas that have been updated • Reduces the number of I/Os • Quickly find the destination FC during recovery
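The area-based hardening idea above can be sketched with a per-area dirty bitmap (the area size and all names here are assumptions, not from the slides):

```python
# Sketch of area-based hardening: the SSD's FCs are divided into
# fixed-size data areas, a dirty mark is kept per area, and a hardening
# pass rewrites only the areas whose mappings changed since the last
# pass -- one I/O per updated area instead of a full-table flush.

AREA_SIZE = 64  # FCs per data area (illustrative granularity)

class AreaHardener:
    def __init__(self, num_fcs):
        self.dirty = set()  # indices of areas that need a flush
        self.num_areas = (num_fcs + AREA_SIZE - 1) // AREA_SIZE

    def mark_update(self, fc_index):
        self.dirty.add(fc_index // AREA_SIZE)

    def harden(self, write_area):
        for area in sorted(self.dirty):
            write_area(area)  # flush only the updated areas
        self.dirty.clear()

h = AreaHardener(256)
h.mark_update(3)    # dirties area 0
h.mark_update(70)   # dirties area 1
flushed = []
h.harden(flushed.append)
```

Since each flash page maps to a fixed data area, recovery can also locate the FCs for a given area directly, without scanning the whole mapping file.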
Summary • Lower the cost of scanning the SSD table • Add checkpoints for mapping-information updates • A log records the most recent checkpoint • Only scan the metadata updates since the related checkpoint