Lecture 11: Storage Organization. Reading: Chapter 10.
Lecture 11: Storage Organization • Why a Memory Hierarchy? • Storage Media Options and Their Characteristics • RAID • Buffer Management
Classification of Physical Storage Media • Speed with which data can be accessed • Locating data + transferring data • Cost per unit of data • Reliability • data loss on power failure or system crash • physical failure of the storage device • Storage can be differentiated into: • volatile storage: loses contents when power is switched off • non-volatile storage: • Contents persist even when power is switched off. • Includes secondary and tertiary storage, as well as battery-backed main memory.
Why a Storage Hierarchy? • Higher levels are faster; lower levels are cheaper. • "There is No Such Thing as a Big Fast Memory"
Key Barriers to Big Fast Memory • Transmission time increases with distance • Even light travels in time proportional to distance • Locating time increases with size • Needle in a haystack problem • Impossible to have everything close by • Keep frequently needed stuff closer
Big Idea: Keep Frequently Accessed Data Closer • (Diagram: requests for data travel down the hierarchy from the CPU through the cache, main memory, and magnetic disk to tape/optical disk; responses travel back up.)
DBMS and Storage • A DBMS will try to keep frequently used data in memory • Go to disk only when you have to (primary vs. secondary storage performance) • If you need to go to tape or optical disk, bring lots of data, e.g. a few dozen or hundred MBs (secondary vs. tertiary storage performance) • Most DBMSs use tertiary storage only for datasets that are big and used by one or a few seldom-run queries • Ex: satellite images, old data from backups • Magnetic disk (or simply disk) is the predominant storage medium for a DBMS • Two quantities used in the next example: the hit ratio (probability of finding a random data item at level l) and the average time to access a data item at level l
Memory Hierarchy Access Time Example • A cache achieves a 90% hit ratio and a transfer rate of 100 GB/s. Upon a cache miss, the data must be fetched into the cache from lower levels of the memory hierarchy, achieving an average access time of 10 ms/GB. Assume locating data in the cache takes negligible time. • Calculate the average access time to read 1 TB of data.
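A minimal sketch of one way to work this out, assuming the 10 ms/GB miss penalty is paid on top of the cache transfer time and that 1 TB = 1024 GB (neither convention is stated on the slide):

```python
# Hedged sketch of the cache access-time example above.
# Assumptions (not stated on the slide): the 10 ms/GB miss penalty is paid
# in addition to the cache transfer time, and 1 TB = 1024 GB.

hit_ratio = 0.90
cache_rate_gb_per_s = 100                         # 100 GB/s
t_cache_ms_per_gb = 1000 / cache_rate_gb_per_s    # = 10 ms per GB from the cache
t_miss_ms_per_gb = 10                             # fetch from lower levels: 10 ms/GB

# Expected time to access one GB:
#   hit : read it from the cache
#   miss: fetch it into the cache, then read it from the cache
t_avg_ms_per_gb = (hit_ratio * t_cache_ms_per_gb
                   + (1 - hit_ratio) * (t_miss_ms_per_gb + t_cache_ms_per_gb))

total_gb = 1024                                   # 1 TB
total_s = t_avg_ms_per_gb * total_gb / 1000
print(f"average {t_avg_ms_per_gb:.1f} ms/GB, ~{total_s:.1f} s for 1 TB")
# -> average 11.0 ms/GB, ~11.3 s for 1 TB under these assumptions
```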
Storage Hierarchy (Cont.) • primary storage: Fastest media but volatile (cache, main memory). • secondary storage: next level in hierarchy, non-volatile, moderately fast access time • also called on-line storage • E.g. flash memory, magnetic disks • tertiary storage: lowest level in hierarchy, non-volatile, slow access time • also called off-line storage • E.g. magnetic tape, optical storage
Cache Memory • Cache: • fastest and most costly form of storage • volatile; managed by the computer system hardware • usually on the same chip as the CPU • common to use multiple cache levels https://www.pugetsystems.com/labs/articles/Specs-Explained-CPU-137/
Main Memory (RAM) • fast access (10s to 100s of nanoseconds; 1 nanosecond = 10⁻⁹ seconds) • generally too small (or too expensive) to store the entire database • capacities of up to a few gigabytes widely used currently • Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly a factor of 2 every 2 to 3 years) • Volatile: contents of main memory are usually lost if a power failure or system crash occurs.
Flash Memory (SSDs) • Data survives power failure • Data can be written at a location only once, but the location can be erased and written to again • Can support only a limited number (10K – 1M) of write/erase cycles • Erasing has to be done on an entire bank of memory • Reads are roughly as fast as main memory • But writes are slow (a few microseconds), and erase is slower • Widely used in embedded devices such as digital cameras, phones, and USB keys Image from: http://www.guru-store.com/blog/tutoriales/todo-lo-que-debes-saber-sobre-las-unidades-de-estado-solido-ssd/
Flash Storage • NOR flash vs. NAND flash • NAND flash • used widely for storage, since it is much cheaper than NOR flash • requires page-at-a-time read (page: 512 bytes to 4 KB) • transfer rate around 20 MB/sec • solid state disks: use multiple flash storage devices to provide higher transfer rates of 100 to 200 MB/sec • erase is very slow (1 to 2 milliseconds) • an erase block contains multiple pages • remapping of logical page addresses to physical page addresses avoids waiting for erase • a translation table tracks the mapping • also stored in a label field of the flash page • remapping carried out by the flash translation layer • after 100,000 to 1,000,000 erases, an erase block becomes unreliable and cannot be used • wear leveling
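To make the remapping idea concrete, here is a toy sketch of a flash translation layer's mapping table; the class name, sizes, and policy are invented for illustration and do not describe any particular SSD firmware:

```python
# Toy flash translation layer (FTL) sketch: logical pages are remapped to
# fresh physical pages on every write, so no write has to wait for an erase.
# All names, sizes, and policies here are invented for illustration.

class ToyFTL:
    def __init__(self, num_physical_pages):
        self.free_pages = list(range(num_physical_pages))  # fresh, erased pages
        self.translation = {}    # logical page -> physical page (the translation table)
        self.dirty = set()       # stale physical pages, erased later in bulk

    def write(self, logical_page, data):
        old = self.translation.get(logical_page)
        if old is not None:
            self.dirty.add(old)              # old copy becomes garbage
        phys = self.free_pages.pop(0)        # take the next erased page
        self.translation[logical_page] = phys
        # (a real FTL would also program `data` into the physical page and
        #  record the mapping in the page's label field)
        return phys

ftl = ToyFTL(num_physical_pages=8)
print(ftl.write(3, b"v1"))   # logical page 3 -> physical page 0
print(ftl.write(3, b"v2"))   # rewrite: remapped to physical page 1, no erase needed
```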
Garbage Collection in SSDs • Image from Wikipedia.org
SSD vs HDD Price Comparison http://www.enterprisestorageforum.com/storage-hardware/ssd-vs.-hdd-pricing-seven-myths-that-need-correcting-5.html
Magnetic Hard Disk Mechanism NOTE: Diagram is schematic, and simplifies the structure of actual disk drives
Magnetic Disks • Data is stored on a spinning disk, and read/written magnetically • Primary medium for the long-term storage of data; typically stores the entire database • Data must be moved from disk to main memory for access, and written back for storage • Much slower access than main memory (more on this later) • direct access: possible to read data on disk in any order, unlike magnetic tape • Capacities range up to roughly a few TB as of 2015 • Much larger capacity, and much lower cost per byte, than main memory/flash memory • Growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2 years) • Survives power failures and system crashes • disk failure can destroy data, but is rare
Magnetic Disks • Read-write head • Positioned very close to the platter surface (almost touching it) • Reads or writes magnetically encoded information • Surface of platter divided into circular tracks • Over 50K-100K tracks per platter on typical hard disks • Each track is divided into sectors • A sector is the smallest unit of data that can be read or written • Sector size typically 512 bytes • Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks) • To read/write a sector • disk arm swings to position the head on the right track • platter spins continually; data is read/written as the sector passes under the head • Head-disk assemblies • multiple disk platters on a single spindle (1 to 5 usually) • one head per platter, mounted on a common arm • Cylinder i consists of the i-th track of all the platters
Magnetic Disks (Cont.) • Earlier generation disks were susceptible to head-crashes • Surface of earlier generation disks had metal-oxide coatings which would disintegrate on head crash and damage all data on disk • Current generation disks are less susceptible to such disastrous failures, although individual sectors may get corrupted • Disk controller – interfaces between the computer system and the disk drive hardware. • accepts high-level commands to read or write a sector • initiates actions such as moving the disk arm to the right track and actually reading or writing the data • Computes and attaches checksums to each sector to verify that data is read back correctly • If data is corrupted, with very high probability stored checksum won’t match recomputed checksum • Ensures successful writing by reading back sector after writing it • Performs remapping of bad sectors
Disk Subsystem • Multiple disks connected to a computer system through a controller • Controller functionality (checksum, bad sector remapping) often carried out by individual disks; reduces load on controller • Disk interface standard families • ATA (AT Attachment) range of standards • SATA (Serial ATA) • SCSI (Small Computer System Interface) range of standards • SAS (Serial Attached SCSI) • Several variants of each standard (different speeds and capabilities)
Disk Subsystem • Disks usually connected directly to computer system • In Storage Area Networks (SAN), a large number of disks are connected by a high-speed network to a number of servers • In Network Attached Storage (NAS) networked storage provides a file system interface using networked file system protocol, instead of providing a disk system interface
A Few Notes on Disks • Too expensive and inefficient to bring a few bytes from disk • You often bring one or more disk blocks • 1 block is the minimal amount of I/O done on disk • For a read or write operation (rw operation) • 1 I/O = 1 block, for read and write • 1 block consists of one or more disk sectors • The implementation of the file system layer in the DBMS defines this • A few common block sizes • 512 bytes • 1 KB • 4 KB
Cost of Performing I/O from Disk • To access a single block, the cost is defined as • Cost = seek time + rotational delay + transfer time • Seek time • Time for the arm to get to the right track (or cylinder) • Mechanical movement cost • Rotational delay • Time before the target block gets underneath the read head • ½ of the time for 1 disk rotation, on average • Mechanical movement cost • Transfer time • Time to move the bytes from the disk into RAM • Electronics cost
Example • Let a disk have: • Average seek time of 11 ms • Rotational delay of 6 ms • Transfer rate of 10 MB/sec • Block size of 1 KB • What is the cost of one I/O? Recall that 1 KB = 1024 bytes
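A quick sketch of the arithmetic for this example, assuming 1 MB = 1024 KB as the slide's reminder suggests:

```python
# Cost of one I/O = seek time + rotational delay + transfer time.
seek_ms = 11.0
rotational_delay_ms = 6.0
transfer_rate_kb_per_ms = 10 * 1024 / 1000   # 10 MB/s, assuming 1 MB = 1024 KB
block_kb = 1.0

transfer_ms = block_kb / transfer_rate_kb_per_ms      # ~0.098 ms
cost_ms = seek_ms + rotational_delay_ms + transfer_ms
print(f"transfer ~{transfer_ms:.3f} ms, total ~{cost_ms:.2f} ms")
# -> transfer ~0.098 ms, total ~17.10 ms
```

Note how the mechanical costs (seek plus rotational delay) dominate; the transfer itself is a tiny fraction of the total.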
Random I/O vs. Sequential I/O • Let B1, B2, …, Bn be a sequence of n blocks to be read/written from disk • Random I/O • Blocks are read/written from different regions of the disk • In the worst case, one seek and one rotational delay per block • Cost = n * (seek + delay + transfer time) • Sequential I/O • Blocks are read/written one after the other from the same cylinder or adjacent cylinders (just a few seeks and delays) • In the best case, after we reach B1, all other blocks follow without incurring additional seek or delay cost • Cost = seek + delay + n * (transfer time) • DBMSs want to do sequential I/O!
Example • A disk has a seek time of 8 ms, rotational delay of 4 ms, 12 MB/sec transfer rate, 4 double-sided platters, 512-byte sectors, blocks of 2 sectors, 50 sectors per track, and 2000 tracks per surface: • How much capacity does this disk have? • How much space does a cylinder have?
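A sketch of the capacity arithmetic, assuming "double-sided" means two recording surfaces per platter (8 surfaces in total):

```python
# Capacity = platters x surfaces/platter x tracks/surface x sectors/track x bytes/sector
platters, surfaces_per_platter = 4, 2      # double-sided platters assumed
tracks_per_surface = 2000
sectors_per_track = 50
sector_bytes = 512

surfaces = platters * surfaces_per_platter                  # 8 surfaces
capacity = surfaces * tracks_per_surface * sectors_per_track * sector_bytes
cylinder = surfaces * sectors_per_track * sector_bytes      # one track from every surface

print(capacity, capacity / 2**20)    # 409600000 bytes ~= 390.6 MiB
print(cylinder, cylinder / 2**10)    # 204800 bytes  = 200 KiB
```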
Example (Continued…) • A disk has a seek time of 8 ms, rotational delay of 4 ms, 12 MB/sec transfer rate, 4 double-sided platters, 512-byte sectors, blocks of 2 sectors, 50 sectors per track, and 2000 tracks per surface: • How long would it take to read 20,000 blocks with random I/O? • How long to read 20,000 blocks with ideal sequential I/O?
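A sketch of the random vs. sequential comparison for this disk, assuming a 1 KB block (2 × 512-byte sectors) and 1 MB = 1024 KB:

```python
# Random vs. sequential I/O for 20,000 one-KB blocks on the disk above.
seek_ms, delay_ms = 8.0, 4.0
transfer_ms = 1 / (12 * 1024 / 1000)       # 1 KB at 12 MB/s ~= 0.081 ms
n = 20_000

random_ms = n * (seek_ms + delay_ms + transfer_ms)       # seek + delay paid per block
sequential_ms = seek_ms + delay_ms + n * transfer_ms     # seek + delay paid once

print(f"random:     ~{random_ms / 1000:.1f} s")      # ~241.6 s
print(f"sequential: ~{sequential_ms / 1000:.2f} s")  # ~1.64 s
```

Under these assumptions sequential I/O is roughly two orders of magnitude faster, which is exactly why a DBMS tries to lay data out for sequential access.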
Optical Storage • non-volatile; data is read optically from a spinning disk using a laser • CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms • Blu-ray disks: 27 GB to 54 GB • Write-once, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R) • Multiple-write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM) • Reads and writes are slower than with magnetic disk • Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, available for storing large volumes of data
Optical Disks • Compact disk-read only memory (CD-ROM) • Removable disks, 640 MB per disk • Seek time about 100 msec (optical read head is heavier and slower) • Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to magnetic disks • Digital Video Disk (DVD) • DVD-5 holds 4.7 GB , and DVD-9 holds 8.5 GB • DVD-10 and DVD-18 are double sided formats with capacities of 9.4 GB and 17 GB • Blu-ray DVD: 27 GB (54 GB for double sided disk) • Slow seek time, for same reasons as CD-ROM • Record once versions (CD-R and DVD-R) are popular • data can only be written once, and cannot be erased. • high capacity and long lifetime; used for archival storage • Multi-write versions (CD-RW, DVD-RW, DVD+RW and DVD-RAM) also available
Tape Storage • non-volatile; used primarily for backup (to recover from disk failure) and for archival data • sequential access: much slower than disk • very high capacity (40 to 300 GB tapes available) • tape can be removed from the drive, so storage costs are much cheaper than disk, but drives are expensive • Tape jukeboxes available for storing massive amounts of data • hundreds of terabytes (1 terabyte = 10¹² bytes) to even multiple petabytes (1 petabyte = 10¹⁵ bytes)
Magnetic Tapes • Hold large volumes of data and provide high transfer rates • A few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT (Digital Linear Tape) format, 100 GB+ with Ultrium format, and 330 GB with Ampex helical scan format • Transfer rates from a few to tens of MB/s • Tapes are cheap, but the cost of drives is very high • Very slow access time in comparison to magnetic and optical disks • limited to sequential access • Some formats (Accelis) provide faster seek (tens of seconds) at the cost of lower capacity • Used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information from one system to another • Tape jukeboxes used for very large capacity storage • multiple petabytes (10¹⁵ bytes)
Performance in a DBMS • The performance of a DBMS depends on: • CPU usage • I/O usage • Network usage • We shall concentrate on I/O • Disk I/O performance can be defined in terms of: • Resource usage time: time using the disk • Response time: wall-clock time to complete the query • Number of I/Os: number of times an I/O operation is performed • Parallel I/O: bringing data from various disks simultaneously • Response time ≠ resource usage time • With parallel I/O, usually response time << resource usage time
Disks as Performance Bottlenecks • Microprocessor speed increases 50% per year • Disk performance improvements • Access time decreases 10% per year • Transfer rate increases 20% per year • A disk crash results in data loss • Solution: disk array • Have several disks behave as a single, large, and very fast disk • Parallel I/O • Add some redundancy to recover from a failure somewhere in the array
Disk Array Organization • Several disks are grouped into a single logical unit. • (Diagram: the CPU and memory sit on the system bus; a controller connects the system bus to the disk array over the controller bus.)
Disk Striping • Disk striping is a mechanism to divide the data in a file into segments that are scattered over the disks of the disk array. • The minimum size of a segment is 1 bit, in which case each data block must be read from several disks to extract the appropriate bits. • The drawback of this approach is the overhead of managing data at the level of bits. • A better approach is to use a striping unit of 1 disk block. • Sequential I/O can then run in parallel, since blocks can be fetched in parallel from the disks.
Disk Striping – Block Sized • Disk striping can be used to partition the data in a file into equal-sized, block-sized segments that are distributed over the disk array. • (Diagram: a file's disk blocks are spread across the disks of the array, reached through the disk array controller bus.)
Data Allocation • Data is partitioned into equal-sized segments • Striping unit • Each segment is stored on a different disk of the array • Typically, a round-robin algorithm is used • If we have n disks, then block i is stored at disk • i mod n • Example: array of 5 disks, and a 1 MB file with a 4 KB striping unit (see the sketch below) • Disk 0 gets blocks: 0, 5, 10, 15, 20, … • Disk 1 gets blocks: 1, 6, 11, 16, 21, … • Disk 2 gets blocks: 2, 7, 12, 17, 22, … • Etc.
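A minimal sketch of the round-robin mapping just described; the helper function is illustrative, and the last line reproduces the block-to-disk assignment used in the parallel-I/O example on the next slide:

```python
# Round-robin striping: block i of a file goes to disk (i mod n).
def disk_for_block(block: int, num_disks: int) -> int:
    return block % num_disks

num_disks = 5
# Blocks handled by each disk (first 25 blocks of the 1 MB file, 4 KB striping unit):
for d in range(num_disks):
    blocks = [b for b in range(25) if disk_for_block(b, num_disks) == d]
    print(f"disk {d}: blocks {blocks}")

# Blocks requested in the parallel-I/O example on the next slide:
print({b: disk_for_block(b, num_disks) for b in (0, 11, 22, 23)})
# -> {0: 0, 11: 1, 22: 2, 23: 3}: four different disks, so all four reads can run in parallel
```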
Benefits of Striping • With striping we can access data blocks in parallel! • Issue a request to the proper disks to get the blocks • For example, suppose we have a 5-disk array with a 4 KB striping unit and 4 KB disk blocks. Let F be a 1 MB file. If we need to access blocks 0, 11, 22, 23, then we ask: • Disk 0 for block 0 at time t0 • Disk 1 for block 11 at time t0 • Disk 2 for block 22 at time t0 • Disk 3 for block 23 at time t0 • All these requests are issued by the DBMS and are serviced concurrently by the disk array!
Single Disk Time Line • (Diagram: on a single disk, the read requests for blocks 0, 11, 22, and 23 issued at t0 are serviced one after another, completing at t1; the elapsed clock time is the sum of the four disk service times.)
Striping Time Line • (Diagram: with striping, the four read requests issued at t0 are serviced in parallel by different disks, all completing by t1; the elapsed clock time is a single disk service time.)
Access Time Estimates • Access time: seek time + rotational delay + transfer time • Disk used independently or in an array: IBM Deskstar 14GPX 14.4 GB disk • Seek time: 9.1 milliseconds (msec) • Rotational delay: 4.15 msec • Transfer rate: 13 MB/sec • How does striping compare with a single disk? • Scenario: 1-disk-block (4 KB) striping unit, access to blocks 0, 11, 22, and 23. The disk array has 5 disks. • Editorial note: looks like an exam problem!
Single Disk Access Time • Total time = sum of the times to read each requested block • Time for block 0: 9.1 msec + 4.15 msec + 4 KB / (13 MB/sec) * (1 MB / 1024 KB) * (1000 msec / 1 sec) = 9.1 msec + 4.15 msec + 0.3 msec = 13.55 msec • Time for block 11: 13.55 msec (same calculation) • Time for block 22: 13.55 msec (same calculation) • Time for block 23: 13.55 msec (same calculation) • Total time: 4 * 13.55 msec = 54.20 msec
Striping Access Time • Total time: the maximum time to complete any read request • Following the same calculation as on the previous slide: • Time for block 0: 13.55 msec • Time for block 11: 13.55 msec • Time for block 22: 13.55 msec • Time for block 23: 13.55 msec • Total time: max{13.55 msec, 13.55 msec, 13.55 msec, 13.55 msec} = 13.55 msec • In this case, striping gives us a 4-to-1 (4 times) performance improvement because of parallel I/O.
The Problem with Striping • Striping has the advantage of speeding up disk access. • But the use of a disk array decreases the reliability of the storage system, because more disks mean more possible points of failure. • Mean time to failure (MTTF) • Mean time until a disk fails and loses its data • MTTF is inversely proportional to the number of components used by the system. • The more components we have, the more likely one of them will fail!
MTTF in a Disk Array • Suppose we have a single disk with an MTTF of 50,000 hrs (5.7 years). • Then, if we build an array with 50 disks, the array has an MTTF of 50,000 / 50 = 1,000 hrs, or about 42 days, because any disk can fail at any given time with equal probability. • Disk failures are more common when disks are new (bad disks from the factory) or old (wear due to usage). • Moral of the story: more does not necessarily mean better!
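The array MTTF arithmetic from this slide as a quick sketch, under the slide's simplifying assumption that any of the identical, independent disks is equally likely to fail:

```python
# Array MTTF under the slide's simplification: MTTF_array = MTTF_disk / n.
mttf_disk_hours = 50_000
n_disks = 50

mttf_array_hours = mttf_disk_hours / n_disks
print(mttf_array_hours, round(mttf_array_hours / 24, 1))   # 1000.0 hours ~= 41.7 days
```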
Increasing MTTF with Redundancy • We can increase the MTTF of a disk array by storing some redundant information in the array. • This information can be used to recover from a disk failure. • It should be carefully chosen so that it can be used to reconstruct the original data after a failure. • What to store as redundant information? • A full data block? • A parity bit for a set of bit locations across the disks (see the sketch below)
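A sketch of how XOR parity across a stripe supports reconstruction: the parity block is the bitwise XOR of the data blocks, so XORing the parity with the surviving blocks recovers a lost one (the RAID levels on the next slide build on this idea). The block contents below are invented for illustration:

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Bitwise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byte_group) for byte_group in zip(*blocks))

# One stripe spread across three data disks (contents invented for illustration):
d0, d1, d2 = b"\x0f\x0f", b"\xf0\x0f", b"\x33\x55"
parity = xor_blocks(d0, d1, d2)          # stored on the parity disk

# Suppose disk 1 fails: rebuild its block from the parity and the surviving blocks.
rebuilt = xor_blocks(parity, d0, d2)
assert rebuilt == d1                     # the lost block is recovered exactly
print(parity.hex(), rebuilt.hex())
```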
RAID Levels • Schemes to provide redundancy at lower cost by using disk striping combined with parity bits • Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics • RAID Level 0: Block striping; non-redundant. • Used in high-performance applications where data loss is not critical. • RAID Level 1: Mirrored disks with block striping • Offers best write performance. • Popular for applications such as storing log files in a database system.