SQL Server, Storage and You - Part III: Solid State Storage
Contact Information Wesley Brown wes@planetarydb.com Twitter @WesBrownSQL Blog http://www.sqlserverio.com
Today’s Topic Covers… • NAND Flash Structure • MLC and SLC Compared • NAND Flash Read Properties • NAND Flash Write Properties • Wear-Leveling • Garbage Collection • Write Amplification • TRIM • Error Detection and Correction • Reliability • Form Factor • Performance Characteristics • Determining What’s Right for You • Not All SSDs Are Created Equal
Types Of Flash • Two main flavors: NAND and NOR. • NOR • Operates like RAM. • NOR is parallel at the cell level. • NOR reads slightly faster than NAND. • Can execute directly from NOR without a copy to RAM. • NAND • NAND operates like a block device, a.k.a. a hard disk. • NAND is serial at the cell level. • NAND writes significantly faster than NOR. • NAND erases much faster than NOR: 4 ms vs. 5 s.
Structure of NAND • Serial array of transistors. • Each transistor holds 1 bit (or more). • Arrays grouped into pages. • 4096 bytes in size. • Contains “spare” area for ECC and other ops. • Pages grouped into blocks. • 64 to 128 pages. • Smallest erasable unit. • Blocks grouped into chips. • As big as 16 gigabytes. • Chips grouped onto devices. • Usually in a parallel arrangement.
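The geometry on this slide is easy to turn into arithmetic. The sketch below works out block size and blocks per chip from the slide's figures. It assumes "16 Gigabytes" means 16 GiB and ignores the spare area; the function names are made up for illustration.

```python
# Figures from the slide; the spare area in each page is ignored here.
PAGE_BYTES = 4096
PAGES_PER_BLOCK_RANGE = (64, 128)
CHIP_BYTES = 16 * 1024**3  # assumption: "16 Gigabytes" read as 16 GiB


def block_bytes(pages_per_block: int, page_bytes: int = PAGE_BYTES) -> int:
    """Size of one erase block, the smallest unit that can be erased."""
    return pages_per_block * page_bytes


def blocks_per_chip(pages_per_block: int, chip_bytes: int = CHIP_BYTES) -> int:
    """How many erase blocks fit on one chip."""
    return chip_bytes // block_bytes(pages_per_block)


for pages in PAGES_PER_BLOCK_RANGE:
    kib = block_bytes(pages) // 1024
    print(f"{pages} pages/block -> {kib} KiB per block, "
          f"{blocks_per_chip(pages)} blocks per chip")
```

With these numbers a block is 256 KiB to 512 KiB, which is the amount of flash touched every time a single 4 KiB page has to be rewritten in place.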
MLC vs. SLC, FIGHT! • MLC (Multi-Level Cell) • Higher capacity (two bits per cell). • Low P/E cycle count, ~3K to ~10K. • Cheaper per gigabyte. • High ECC needs. • SLC (Single-Level Cell) • Fast read speed. • ~25 µs vs. ~50 µs per page. • Fast write speed. • ~220 µs vs. ~900 µs per page. • High P/E cycle count, ~100K to ~300K. • Tend to be conservative numbers. • Minimal ECC requirements. • 1 bit per 512 bytes vs. ~12 bits per 512 bytes. • Expensive. • Up to 5x the cost of MLC.
Reading NAND Flash • It isn’t RAM. • Slower access times. • Nanoseconds for RAM vs. microseconds for a NAND page read. • No write in place. • It isn’t a hard disk. • Much faster access times. • Microseconds vs. milliseconds. • No moving parts.
Writing to NAND • Program/Erase (P/E) cycle. • In the erased state all bits are 1. • Programmed bits are 0. • Programmed a page at a time. • One-pass programming. • Erased a block at a time (128 pages). • Must erase the entire block to program a single page again. • Finite life cycle, ~10K P/E cycles for MLC and ~100K for SLC. • A block that fails to erase may still be readable.
Data written in pages and erased in blocks. Blocks are becoming larger as NAND Flash die sizes shrink.
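The program/erase rules above can be modeled in a few lines. This is a toy model, not how any real controller is written: the class name `NandBlock` and its methods are hypothetical, and it only enforces the two rules from the slide (program a page once, erase the whole block to program it again).

```python
class NandBlock:
    """Toy model of one NAND erase block (hypothetical, for illustration)."""

    def __init__(self, pages: int = 128, page_bytes: int = 4096):
        self.page_bytes = page_bytes
        self.pages = [None] * pages  # None means the page is erased
        self.erase_count = 0         # P/E cycles used so far

    def program(self, page_no: int, data: bytes) -> None:
        """Program one page. A page cannot be programmed twice without an erase."""
        if self.pages[page_no] is not None:
            raise RuntimeError("page already programmed; erase the whole block first")
        if len(data) > self.page_bytes:
            raise ValueError("data larger than a page")
        # Unwritten bytes stay in the erased state (all bits 1 = 0xFF).
        self.pages[page_no] = bytes(data).ljust(self.page_bytes, b"\xff")

    def read(self, page_no: int) -> bytes:
        """Read one page. Erased pages read back as all 1 bits."""
        data = self.pages[page_no]
        return b"\xff" * self.page_bytes if data is None else data

    def erase(self) -> None:
        """Erase every page in the block and spend one P/E cycle."""
        self.pages = [None] * len(self.pages)
        self.erase_count += 1


block = NandBlock(pages=4, page_bytes=8)
block.program(0, b"abc")
try:
    block.program(0, b"xyz")   # rewriting in place is not allowed
except RuntimeError as err:
    print("rejected:", err)
block.erase()                  # the only way to make page 0 writable again
block.program(0, b"xyz")
```

The point of the model is the `erase_count` field: every in-place update costs a whole-block erase, and that counter is what the finite life cycle is measured against.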
Feeding And Care of NAND • Wear-Leveling • Spreads writes across blocks. • Ideally, write to every block before erasing any. • Data grouped into two patterns. • Static, written once and read many times. • Dynamic, written often and read infrequently. • If you only wear-level data in motion, you quickly burn out the blocks that hold it. • If you wear-level static data, you incur extra I/O.
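The trade-off between the two wear-leveling approaches can be shown with a deliberately crude simulation. It assumes a fixed set of blocks holds static data, always picks the least-erased eligible block, and counts one relocation every time a static block is chosen. The function `simulate` and its parameters are invented for this sketch; real controllers are far more involved.

```python
def simulate(total_blocks: int, static_blocks: int, rewrites: int,
             level_static: bool) -> tuple[int, int]:
    """Return (worst erase count on any block, extra relocations of static data)."""
    erase_counts = [0] * total_blocks
    # Blocks 0..static_blocks-1 hold static data. They are only eligible
    # as write targets when static wear-leveling is turned on.
    if level_static:
        eligible = list(range(total_blocks))
    else:
        eligible = list(range(static_blocks, total_blocks))
    extra_copies = 0
    for _ in range(rewrites):
        # Always reuse the least-worn eligible block.
        target = min(eligible, key=lambda b: erase_counts[b])
        if target < static_blocks:
            extra_copies += 1  # static data has to be moved out first: extra I/O
        erase_counts[target] += 1
    return max(erase_counts), extra_copies


print(simulate(8, 6, 800, level_static=False))  # wear piles up on 2 blocks
print(simulate(8, 6, 800, level_static=True))   # wear is even, at the cost of copies
```

In this toy setup, leveling only the dynamic data puts 400 erases on each of two blocks, while leveling everything holds every block to 100 erases and pays for it with 600 relocations.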
Keeping Things Fast • Background Garbage Collection • Defers P/E cycle. • Pages marked as dirty, erased later. • Requires spare area. • Incurs additional I/O. • Can be put under pressure by frequent small writes.
No Free Lunches • Write Amplification • Ripples in a pond. • Device moves blocks around. • Incoming I/O is greater than the free space the device has. • Every write causes additional writes. • Small writes can be a real problem. • OLTP workloads are a good example. • TRIM can help.
Four new pages and four replacement pages written. Original pages are now marked invalid.
Garbage collection comes along and moves all valid pages to a new block and erases the other block.
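Write amplification is usually expressed as the ratio of pages the flash actually writes to pages the host asked to write. The sketch below applies that ratio to the scenario in the two captions above, assuming the eight pages the host wrote (four new, four replacements) are exactly the valid pages that garbage collection later relocates. The function name is hypothetical.

```python
def write_amplification(host_pages: int, gc_relocated_pages: int) -> float:
    """Pages written to flash divided by pages the host asked to write."""
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return (host_pages + gc_relocated_pages) / host_pages


# Host writes 4 new pages + 4 replacement pages = 8 pages.
# Garbage collection then copies the 8 still-valid pages to a new block.
print(write_amplification(host_pages=8, gc_relocated_pages=8))  # 2.0
```

A factor of 2.0 means every page the host writes costs two page programs on the flash, which counts directly against the P/E budget.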
Keeping Things Fast • TRIM • Supported out of the box on Windows 7 and Windows Server 2008 R2. • Some manufacturers are shipping a TRIM service that works with their driver. • Acts like spare area for garbage collection. • OS and file system tell the drive a block is empty. • Filling the file system defeats TRIM. • File fragmentation can hurt TRIM. • Grow your files manually! • Don’t run disk defrag!
Detecting Errors and Correcting Them • Many things cause errors on flash! • Write Disturb • Data cells NOT being written to are corrupted. • Fixed with normal erase. • Read Disturb • Repeated reads on the same page affect other pages in the block. • Fixed with normal erase. • Charge Loss/Gain • Transistors may gain or lose charge over time. • Affects flash devices at rest or rarely accessed data. • Fixed with normal erase. • All of these issues are generally dealt with very well using standard ECC techniques.
As cells are programmed, other cells may experience a voltage change.
As cells are read, other cells in the same block can suffer a voltage change.
If flash is at rest or rarely read, cells can suffer charge loss.
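To make "corrected by ECC" concrete, here is a minimal single-bit-correcting code, Hamming(7,4). It is an illustration of the idea only: it protects 4 data bits rather than a 512-byte sector, and it is not a claim about what any particular flash controller implements.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position). Position is 1-based, 0 if no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                       # simulate one disturbed cell
data, fixed_at = hamming74_decode(stored)
print(data, "corrected position", fixed_at)
```

The read still returns the right data, but the controller had to do extra work to get it, and a second flipped bit in the same codeword would be beyond what this code can correct.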
Pure Speed • Not all drives are benchmarked the same. • Short-stroking • Only using a small portion of the drive. • Allows for lots of spare capacity via TRIM. • Huge queue depths. • Increases latency. • Can be unrealistic. • Odd block transfer sizes. • Random IO testing. • Some use 512 bytes while others use 4 KB. • Sequential IO testing. • Most use 128 KB. • Some use 64 KB to better fit into large buffers. • Some use 1 MB and high queue depths.
How Fast Is It Again? • Read the numbers carefully. • Random IO benchmarks usually use 4 KB. • SQL Server works on 8 KB. • Sequential IO benchmarks usually use 128 KB. • SQL Server works on 64 KB to 128 MB. • Queue depths are set high. • SQL Server is usually configured for a low queue depth.
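Two small conversions help when reading a spec sheet against these points: IOPS times block size gives bandwidth, and Little's law (outstanding I/Os = IOPS × latency) shows how much of a headline IOPS figure comes from queue depth. The numbers below are hypothetical, not measurements of any drive, and the function names are made up for this sketch.

```python
def throughput_mb_s(iops: float, block_bytes: int) -> float:
    """Bandwidth in MB/s implied by an IOPS figure at a given transfer size."""
    return iops * block_bytes / 1_000_000


def iops_for_bandwidth(mb_s: float, block_bytes: int) -> float:
    """IOPS needed to move the same bandwidth at a different transfer size."""
    return mb_s * 1_000_000 / block_bytes


def iops_from_queue_depth(queue_depth: int, latency_us: float) -> float:
    """Little's law: IOPS = outstanding I/Os / latency."""
    return queue_depth * 1_000_000 / latency_us


# A hypothetical drive advertised at 40,000 random IOPS with 4 KB transfers.
bandwidth = throughput_mb_s(40_000, 4096)
print(f"{bandwidth:.2f} MB/s at 4 KB")
# The same bandwidth spent on SQL Server's 8 KB pages is half the IOPS.
print(f"{iops_for_bandwidth(bandwidth, 8192):.0f} IOPS at 8 KB")
# Same assumed 500 µs latency, very different IOPS depending on queue depth.
print(iops_from_queue_depth(32, 500), "IOPS at queue depth 32")
print(iops_from_queue_depth(2, 500), "IOPS at queue depth 2")
```

Latency rarely stays constant as queue depth changes, so treat the last two lines as a bound on what queue depth alone can explain rather than a prediction.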
Is It Reliable Enough? • SLC is ready “Out of the box.” • Requires much less infrastructure on disk to support robust write environments. • MLC needs some help. • Requires lots of spare area and smarter controllers to handle extra ECC. • eMLC has all management functions built onto the chip. • Both configured similarly. • RAID of chips. • TRIM, GC and Wear-Leveling
He’s Dead, Jim. • Longevity differences between devices can be huge. • Consumer grade drives are consumable. • Aren’t rated for full drive writes. • Desktop drives are usually tested on a fraction of drive capacity! • Aren’t rated for continuous writes. • It may say three year life span. • It could be much shorter, so look at the total writes rating.
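A back-of-the-envelope estimate shows why total writes matter more than the stated life span. The sketch assumes perfect wear-leveling, a constant write amplification factor, and a steady daily write volume. All inputs in the example are hypothetical apart from the ~3K P/E cycle figure for MLC quoted earlier in the deck.

```python
def lifetime_years(capacity_gb: float, pe_cycles: int,
                   write_amplification: float, host_gb_per_day: float) -> float:
    """Rough drive life in years, assuming ideal wear-leveling."""
    # Total data the flash can absorb, scaled down by write amplification
    # to get the amount of host data that can be written before wear-out.
    total_host_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_gb / host_gb_per_day / 365


# Hypothetical: 200 GB MLC drive, 3,000 P/E cycles, write amplification of 3,
# and a workload that writes 500 GB per day.
print(f"{lifetime_years(200, 3000, 3.0, 500):.2f} years")
```

Under these assumptions the drive lasts a little over one year, well short of a three year rating, and halving either the daily writes or the write amplification doubles the result.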
You Say SATA, I Say SAS… • SAS is the king of your heavy workloads. • Command Queuing • SAS supports a queue depth of up to 2^16 (65,536), usually capped at 64. • SATA supports up to 32. • Error recovery and detection. • SMART isn’t. • SCSI command set is better. • Duplex • SAS is full duplex and dual ported per drive. • SATA is half duplex and single ported. • Multi-path IO • Native to SAS at the drive level. • Available to SATA via expanders.
The Shape Of Things. • Flash comes in lots of form factors. • Standard 2.5” and 3.5” drives. • Fibre attached. • Texas Memory Systems RamSan-620 • Violin Memory • PCIe add-in cards. • Few “native” cards. • Fusion-io • Texas Memory Systems RamSan-20 • Bundled solutions. • LSI SSS6200 • OCZ Z-Drive • OCZ RevoDrive • PCIe to disk. • 2.5” form factor and plugs. • Skips SAS/SATA for direct PCIe lanes.
Understand Your Workloads! • You MUST understand your workloads. • Monitor virtual file stats • http://sqlserverio.com/2011/02/08/gather-virtual-file-statistics-using-t-sql-tsql2sday-15/ • Track random vs. sequential • Track size of transfers • Capture IO Patterns • http://sqlserverio.com/2010/06/15/fundamentals-of-storage-systems-capturing-io-patterns/ • Benchmark! • http://sqlserverio.com/2010/06/15/fundamentals-of-storage-testing-io-systems/
I’m Not As Fast As I Used To Be • From new • Best possible performance. • Drive will never be this fast again. • Previous writes affect future reads. • Large sequential writes are nice for GC. • Small random writes slow GC down. • Wait for GC to catch up when benching a drive. • Give the GC time to settle in when going from small random to large sequential or vice versa. • Steady state is what we are after. • Performance slows over time. • Cells wear out. • Causes multiple attempts to read or write. • ECC saves you, but the IO is still spent.
It’s a Sony on the inside, trust me. • Not all drives are equal. • Understand drives are tuned for workloads. • Desktop drives don’t favor 100% random writes… • Enterprise drives are expected to get punished. • Fix it with firmware. • Most drives will have edge cases. • OCZ and Intel suffered poor performance after drive use over time. • Be wary of updates that erase your drive. • Gives you a temporary performance boost.
Takeaways • Flash read performance is great, sequential or random. • Flash write performance is complicated, and can be a problem if you don’t manage it. • Flash wears out over time. • Not nearly the issue it used to be, but you must understand your write patterns. • Plan for over-provisioning and TRIM support. • It can have a huge impact on how much storage you actually buy. • Flash can be error prone. • Be aware that writes and reads can cause data corruption.