Flash Memory Database Systems and In-Page Logging
Bongki Moon
Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.
bkmoon@cs.arizona.edu
In collaboration with Sang-Won Lee (SKKU) and Chanik Park (Samsung)
Magnetic Disk vs. Flash SSD
• Champion for 50 years: Seagate ST340016A (40 GB, 7200 RPM)
• New challengers: Intel X25-M Flash SSD (80 GB, 2.5 inch); Samsung Flash SSD (128 GB, 2.5/1.8 inch)
Past Trend of Disk
From 1983 to 2003 [Patterson, CACM 47(10) 2004]:
• Capacity increased about 2500 times (0.03 GB → 73.4 GB)
• Bandwidth improved 143.3 times (0.6 MB/s → 86 MB/s)
• Latency improved 8.5 times (48.3 ms → 5.7 ms)
I/O Crisis in OLTP Systems
• I/O becomes the bottleneck in OLTP systems, which process a large number of small random I/O operations
• Common practice to close the gap
• Use a large disk farm to exploit I/O parallelism: tens or hundreds of disk drives per processor core (e.g., IBM Power 596 Server: 172 15k-RPM disks per processor core)
• Adopt short-stroking to reduce disk latency: use only the outer tracks of disk platters
• Other concerns are raised too: wasted capacity of disk drives, and increased energy consumption
• Then, what happens 18 months later? To keep up with Moore's law and balance CPU and I/O (Amdahl's law), must the number of spindles be doubled again?
Flash News in the Market
• Sun Oracle Exadata Storage Server [Sep 2009]: each Exadata cell comes with 384 GB of flash cache
• MySpace dumped disk drives [Oct 2009]: went all flash, cutting power consumption by 99%
• Google Chrome ditched disk drives [Nov 2009]: SSD is the key to a 7-second boot time
• Gordon at UCSD/SDSC [Nov 2009]: 64 TB RAM, 256 TB flash, 4 PB disks
• IBM hooked up with Fusion-io [Dec 2009]: SSD storage appliance for the System X server line
Flash for Database, Really?
• Immediate benefit for some DB operations
• Reduce commit-time delay by fast logging
• Reduce read time for multi-versioned data
• Reduce query processing time (sort, hash)
• What about the big fat tables?
• Random scattered I/O is very common in OLTP: can a flash SSD, with its slow random writes, handle this?
Transactional Log
[Architecture diagram: SQL queries go through the system buffer cache to four storage areas: the transaction (redo) log, temporary table space, rollback segments, and database table space. This slide highlights the transaction log.]
Commit-time Delay by Logging
[Diagram: transactions T1 … Tn force-write through the log buffer to the log device, while dirty pages go to the database.]
• Write-ahead logging (WAL): a committing transaction force-writes its log records, which makes the latency hard to hide
• With a separate disk for logging: no seek delay, but half a revolution of the spindle on average: 4.2 msec (7200 RPM), 2.0 msec (15k RPM)
• With a flash SSD: about 0.4 msec
• Commit-time delay remains a significant overhead; group commit helps, but the delay doesn't go away altogether
• How much commit-time delay? On average, 8.2 msec (HDD) vs. 1.3 msec (SSD): a 6-fold reduction (TPC-B benchmark with 20 concurrent users)
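To make the latency numbers concrete, here is a minimal back-of-the-envelope sketch in Python. The formula is standard rotational-delay arithmetic; the 0.4 msec SSD figure is quoted from the slide, not computed:

```python
# Average commit-time log write delay on a dedicated log disk:
# no seek, but half a revolution of the spindle on average.

def half_rotation_ms(rpm: int) -> float:
    """Average rotational delay in milliseconds: half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0

print(f"7200 RPM log disk: {half_rotation_ms(7200):.1f} ms")    # ~4.2 ms
print(f"15k RPM log disk:  {half_rotation_ms(15000):.1f} ms")   # ~2.0 ms
print("Flash SSD force-write: ~0.4 ms (measured, per the slide)")
```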
Rollback Segments
[Architecture diagram repeated from the transactional-log slide; this slide highlights the rollback segments.]
MVCC Rollback Segments
• Multi-version concurrency control (MVCC): an alternative to traditional lock-based CC
• Supports read consistency and snapshot isolation
• Used by Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL
• Rollback segments
• Each transaction is assigned to a rollback segment
• When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion)
• To fetch the correct version of an object, check whether it has been updated by other transactions (see the sketch after the MVCC read slide)
MVCC Write Pattern
• Write requests from the TPC-C workload
• Concurrent transactions generate multiple streams of append-only traffic in parallel (approximately 1 MB apart)
• An HDD moves its disk arm very frequently
• An SSD suffers no penalty here, despite its no-in-place-update limitation
MVCC Read Performance
[Diagram: a reader follows old versions of object A (200, 100, 50) across rollback segments (1) and (2) to find the version consistent with its snapshot.]
• To support MV read consistency, I/O activity increases
• A long chain of old versions may have to be traversed for each access to a frequently updated object
• Read requests are scattered randomly: old versions of an object may be stored in several rollback segments
• With an SSD, a 10-fold read-time reduction was not surprising
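The last three slides condense into a small, runnable sketch. This is our own simplification, assuming transaction IDs double as commit timestamps and a snapshot sees every transaction with a smaller ID; real engines keep before-images in on-disk rollback segments, where each hop down the chain can be a random read:

```python
# Sketch of MVCC reads over append-only rollback segments.
# Each update appends the OLD value (a before-image) to the version
# chain and installs the new value in place; a reader walks the chain
# back to the newest version visible to its snapshot.

class Version:
    def __init__(self, value, txn_id, prev):
        self.value = value    # before-image saved at update time
        self.txn_id = txn_id  # transaction that wrote this value
        self.prev = prev      # older version, possibly in another segment

class VersionedObject:
    def __init__(self, value, txn_id):
        self.value, self.txn_id, self.chain = value, txn_id, None

    def update(self, new_value, txn_id):
        # Append-only: push the current value onto the rollback chain,
        # then update in place.
        self.chain = Version(self.value, self.txn_id, self.chain)
        self.value, self.txn_id = new_value, txn_id

    def read(self, snapshot):
        # Newest version written by a transaction visible to `snapshot`.
        if self.txn_id <= snapshot:
            return self.value
        v = self.chain
        while v is not None and v.txn_id > snapshot:
            v = v.prev        # on disk, each hop may be a random read
        return v.value if v is not None else None

a = VersionedObject(50, txn_id=0)  # T0 wrote A = 50
a.update(100, txn_id=1)            # T1: A = 100 (50 goes to the chain)
a.update(200, txn_id=2)            # T2: A = 200 (100 goes to the chain)
print(a.read(snapshot=0))          # 50: a T0-era reader traverses 2 hops
print(a.read(snapshot=1))          # 100
```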
Database Table Space
[Architecture diagram repeated from the transactional-log slide; this slide highlights the database table space.]
Workload in Table Space
• TPC-C workload
• Exhibits little locality and sequentiality
• Mix of small/medium/large read-write and read-only (join) requests
• Highly skewed: 84% (75%) of accesses go to 20% of tuples (pages)
• Write caching is not as effective as read caching
• The physical read/write ratio is much lower than the logical read/write ratio
• All bad news for flash memory SSDs, due to the no-in-place-update constraint and asymmetric read/write speeds
• Motivates the In-Page Logging (IPL) approach [SIGMOD'07]
In-Page Logging (IPL)
Key ideas of the IPL approach:
• Changes are written to a log instead of being applied in place: avoids frequent write and erase operations
• Log records are co-located with their data pages
• No need to write them sequentially to a separate log region
• Reads current data more efficiently than sequential logging
• The DBMS buffer and storage managers work together
Design of the IPL
Logging on a per-page basis, in both memory and flash:
• An in-memory log sector (512 B) can be associated with a buffer frame (8 KB data page) in memory, allocated on demand when a page becomes dirty
• An in-flash log segment is allocated in each erase unit (128 KB): 15 data pages (8 KB each) plus an 8 KB log area of 16 sectors
• The log area is shared by all the data pages in its erase unit
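The geometry above is easy to sanity-check; a few lines of Python with the constants taken straight from the slide:

```python
# Layout arithmetic for one IPL erase unit, using the slide's constants.
ERASE_UNIT = 128 * 1024   # 128 KB erase unit
PAGE       = 8 * 1024     # 8 KB data page
SECTOR     = 512          # 512 B log sector
DATA_PAGES = 15           # data pages stored per erase unit

log_area = ERASE_UNIT - DATA_PAGES * PAGE
assert log_area == 8 * 1024        # 8 KB left over for the log area
assert log_area // SECTOR == 16    # 16 log sectors shared by 15 pages
print(f"log area: {log_area} B = {log_area // SECTOR} sectors")
```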
IPL Write
• Data pages in memory are updated in place, and physiological log records are written to the page's in-memory log sector
• The in-memory log sector is written to the in-flash log segment when
• The data page is evicted from the buffer pool, or
• The log sector becomes full
• When a dirty page is evicted, its content is not written to flash memory: the previous version remains intact
• Data pages and their log records are physically co-located in the same erase unit (see the combined sketch after the merge slide)
IPL Read
• When a page Pi is read from flash, the current version is computed on the fly
• Read from flash: the original copy of Pi, plus all log records belonging to Pi (I/O overhead)
• Apply the physiological log records to the copy read from flash to reconstruct the current in-memory copy (CPU overhead)
IPL Merge
• When all free log sectors in an erase unit are consumed
• The log records are applied to the corresponding data pages
• The 15 up-to-date data pages are copied into a new erase unit (Bnew)
• The old erase unit (Bold) can then be erased and released
• Each merge consumes, erases, and releases only one erase unit
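Putting the write, read, and merge slides together, here is a minimal runnable sketch of the IPL mechanism. Pages are modeled as dicts, a physiological log record as a dict of field changes, and flash as plain Python objects; the class and method names are our illustration, not the SIGMOD'07 implementation:

```python
# Minimal sketch of In-Page Logging (IPL): write, read, merge.
# One erase unit = 15 data-page slots + 16 log sectors; a log record
# is a dict of field changes, and applying it stands in for the
# "physiological action". All names are illustrative simplifications.

LOG_SECTORS_PER_UNIT = 16

def apply_log(page, delta):
    """Apply one physiological log record to an in-memory page image."""
    return {**page, **delta}

class EraseUnit:
    def __init__(self, pages):
        self.pages = pages  # {page_id: stale page image "on flash"}
        self.log = []       # flushed log sectors: (page_id, [records])

    def free_log_sectors(self):
        return LOG_SECTORS_PER_UNIT - len(self.log)

class IPLStore:
    def __init__(self, unit):
        self.unit = unit
        self.mem_log = {}   # page_id -> in-memory log sector (records)

    def write(self, page_id, delta):
        # IPL write: the buffer-pool copy is updated in place (not
        # modeled) and a log record goes to the in-memory log sector.
        self.mem_log.setdefault(page_id, []).append(delta)

    def evict(self, page_id):
        # On eviction (or a full sector): flush ONLY the log sector.
        # The stale page image on flash stays intact.
        records = self.mem_log.pop(page_id, [])
        if records:
            if self.unit.free_log_sectors() == 0:
                self.merge()
            self.unit.log.append((page_id, records))

    def _stable(self, page_id):
        # Stale flash image plus all flushed log records for this page.
        page = self.unit.pages[page_id]
        for pid, records in self.unit.log:
            if pid == page_id:
                for delta in records:
                    page = apply_log(page, delta)
        return page

    def read(self, page_id):
        # IPL read: rebuild the current version on the fly (CPU cost)
        # after fetching the page and its log records (I/O cost).
        page = self._stable(page_id)
        for delta in self.mem_log.get(page_id, []):  # unflushed changes
            page = apply_log(page, delta)
        return page

    def merge(self):
        # IPL merge: apply flushed log records to their pages, copy the
        # up-to-date pages into a new erase unit (Bnew), and release
        # the old unit (Bold) for erasure.
        current = {pid: self._stable(pid) for pid in self.unit.pages}
        self.unit = EraseUnit(current)

store = IPLStore(EraseUnit({1: {"balance": 50}}))
store.write(1, {"balance": 100})   # update in place + in-memory log
store.evict(1)                     # flushes one log sector only
print(store.read(1))               # {'balance': 100}, rebuilt on the fly
```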
Industry Response
• Common in enterprise-class SSDs
• Multi-channel, inter-command parallelism: throughput rather than raw bandwidth; handles write-followed-by-read patterns
• Command queuing (SATA-II NCQ)
• Large RAM buffer (with super-capacitor backup), even up to 1 MB per GB: write-back caching, controller data (mapping, wear leveling)
• Fat provisioning (up to ~20% of capacity held in reserve)
• Impressive improvement
EC-SSD Architecture
• Parallel/interleaved operations: 8 channels, 2 packages/channel, 4 chips/package
• Two-plane page write, block erase, and copy-back operations
[Diagram: host interface (SATA-II), ARM9 main controller with ECC, flash controller, DRAM (128 MB), and 8 channels of NAND flash packages.]
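As a toy illustration of why that layout helps, the sketch below stripes consecutive logical pages across the 8 channels so that a burst of page writes can proceed in parallel. The round-robin mapping policy is our own invention for illustration; the controller's actual firmware mapping is proprietary:

```python
# Toy logical-to-physical mapping for an SSD with 8 channels,
# 2 packages per channel, and 4 chips per package (per the slide).
# Round-robin striping puts consecutive logical pages on different
# channels, exposing parallelism to a multi-channel controller.

CHANNELS, PACKAGES, CHIPS = 8, 2, 4

def map_page(lpn: int):
    """Map a logical page number to (channel, package, chip, local_page)."""
    channel = lpn % CHANNELS
    package = (lpn // CHANNELS) % PACKAGES
    chip = (lpn // (CHANNELS * PACKAGES)) % CHIPS
    local_page = lpn // (CHANNELS * PACKAGES * CHIPS)
    return channel, package, chip, local_page

for lpn in range(9):
    print(lpn, map_page(lpn))  # pages 0-7 land on 8 distinct channels
```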
Concluding Remarks
• Recent advances cope with random I/O better
• Write IOPS 100x higher than early SSD prototypes
• TPS 1.3~2x higher than RAID0 with 8 HDDs for the read-write TPC-C workload, with much less energy consumed
• Writes still lag behind
• IOPS(Disk) < IOPS(SSD-Write) << IOPS(SSD-Read)
• IOPS(SSD-Read) / IOPS(SSD-Write) = 4 ~ 17
• A lot more issues to investigate: flash-aware buffer replacement, I/O scheduling, energy, fluctuation in performance, tiered storage architecture, virtualization, and much more
Questions?
• For more of our flash memory work
• In-Page Logging [SIGMOD'07]
• Logging, Sort/Hash, Rollback [SIGMOD'08]
• SSD Architecture and TPC-C [SIGMOD'09]
• In-Page Logging for Indexes [CIKM'09]
• Even more? www.cs.arizona.edu/~bkmoon