Data Storage
Outline • The memory hierarchy • Moore’s law • Disks • Access Times • I/O model of computation • Sorting on disk • Optimize disk access
The memory hierarchy • Where is data stored in a computer system? • Capacities • Speeds • Costs
Cache • On the same chip as the microprocessor • Capacity: megabytes • Cache I/O takes nanoseconds (10^-9 s), i.e. it runs at CPU speed • About 100 nanoseconds to exchange data between cache and main memory
Main memory • Capacity: 100 MB to 10 GB or more • Fast random access • 10-100 nanoseconds per memory access
Virtual memory • Program space: the virtual memory address space • On a 32-bit machine there are 2^32 = 4G addresses in total • When the address space in use is larger than actual main memory, data is stored on disk • Main-memory database systems • manage data through virtual memory, relying on the paging mechanism of the OS
Secondary storage: disk • Compared to memory: slower (by about 10^5; one disk I/O takes 10-30 ms), more capacious (by about 10^2; 100 GB or more), cheaper • Magnetic or optical • Supports sequential access (fast) and random access (slower) • Related concepts: virtual memory (pages), file systems (files) • Disk read: moving a block from disk to main memory • Disk write: moving a block from main memory to disk
Buffering files for disk I/O • Files on disk are read through buffers in main memory
Tertiary storage • Capacity: terabytes (1 TB = 1024 GB) of data • Compared to disk • Higher read/write time (by about 10^3) • More capacious (by about 10^3) • Cheaper (cost per byte) • Supports only sequential access • Typical tertiary storage devices • Ad-hoc tape storage • Optical-disk jukeboxes
Memory Hierarchy • [Figure: access time and price ($/MB) for each level: processor cache (access time about 1 ns), RAM, disks, tapes / optical disks, with multiplicative factors between levels]
Volatile vs nonvolatile • Volatile: Data is lost when power is off • Volatile • cache • Main memory • Non Volatile • Magnetic disks • Tapes • CDROM
Exponential Growth: Moore’s law • Every 18 months: • Processor speed doubles • Cost of storage per bit halves • 2x / 18 months ~ 100x / 10 years • http://www.intel.com/research/silicon/moorespaper.pdf
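The "2x / 18 months ~ 100x / 10 years" figure can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name `growth_factor` is made up for this example.

```python
def growth_factor(years, doubling_months=18):
    """Growth factor after `years` if a quantity doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months   # 10 years -> about 6.67 doublings
    return 2 ** doublings


# Ten years of doubling every 18 months gives roughly a hundredfold increase.
print(round(growth_factor(10)))
```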
Consequences of “Moore’s law” • Storage access becomes ‘slower’ • The gap between data access time and computing time grows • Data flood: storage becomes ‘smaller’ • The gap between the data capacity required and the capacity actually available grows
Storage access becomes ‘slower’ • Storage access time improves only slowly relative to processor speed • “Latency” becomes progressively larger: the time to move data between levels of the memory hierarchy vs. the time to compute
Data Flood • Disk sales double every nine months • Because the volume of stored data increases: data warehouses, Internet logs, Web archives, sky surveys • Because media price drops much faster than areal density • Graph courtesy of Joe Hellerstein • Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf
Mechanics of disks
Disk surfaces • Block: the logical unit of data transfer between disk and main memory • One block = one or more sectors
Disk controller • Controls the movement of the head assembly • Selects surfaces and sectors • Transfers data • One disk controller can control multiple disks
Disk storage characteristics • Rotation speed of the disk assembly • E.g., 5400 RPM; can be higher or lower • Number of platters per unit • E.g., 5 platters, 10 surfaces • Number of tracks per surface • E.g., 20,000 tracks • Number of bytes per track • E.g., a million bytes
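Multiplying the example characteristics together gives the capacity of such a disk. The sketch below uses the slide's example values; the function name `disk_capacity_bytes` is hypothetical, and gigabytes are taken as 10^9 bytes.

```python
def disk_capacity_bytes(surfaces, tracks_per_surface, bytes_per_track):
    """Total capacity = surfaces x tracks per surface x bytes per track."""
    return surfaces * tracks_per_surface * bytes_per_track


capacity = disk_capacity_bytes(surfaces=10,
                               tracks_per_surface=20_000,
                               bytes_per_track=1_000_000)
print(capacity / 1e9, "GB")   # 200.0 GB
```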
Disk access characteristics • How is a block on disk accessed? • Step 1: move the heads to the proper cylinder -> seek time • Step 2: rotate the disk to the sectors containing the block -> rotation time • Step 3: transfer the data -> transfer time
Time = Seek Time + Rotational Delay + Transfer Time + Other
Disk Access Time • [Figure: a request “I want block X” is sent to the disk; how long until block X is in memory?]
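Seek time • Roughly proportional to the distance traveled • [Figure: seek time (ms) versus cylinders traveled, from 1 to N; the time grows from x for a one-cylinder move to about 3x or 5x for N cylinders]
Average Random Seek Time • $$S = \frac{\sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \mathrm{SEEKTIME}(i \to j)}{N(N-1)}$$ • “Typical” S: 10 ms to 40 ms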
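The average over all ordered pairs of distinct cylinders can be evaluated directly once a seek-time function is given. The sketch below assumes a purely hypothetical linear model (a fixed startup cost plus a per-cylinder cost); the constants are illustrative and do not describe any real drive.

```python
def average_seek_time(n_cylinders, seek_time):
    """Average of seek_time(i, j) over all ordered pairs i != j."""
    total = 0.0
    for i in range(1, n_cylinders + 1):
        for j in range(1, n_cylinders + 1):
            if i != j:
                total += seek_time(i, j)
    return total / (n_cylinders * (n_cylinders - 1))


def linear_seek_ms(i, j, startup_ms=1.0, ms_per_cylinder=0.01):
    """Hypothetical model: startup cost plus a cost per cylinder traveled."""
    return startup_ms + ms_per_cylinder * abs(i - j)


# With a linear model the average travel distance is (N + 1) / 3 cylinders,
# i.e. about one third of the way across the disk.
print(round(average_seek_time(200, linear_seek_ms), 2))   # 1.67
```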
Rotational Delay • [Figure: disk surface showing the current head position (“Head Here”) and the target block (“Block I Want”)]
Average Rotational Delay • R = 1/2 revolution • “Typical” R = 8.33 ms (3600 RPM)
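The 8.33 ms figure follows from the rotation speed: on average the disk must turn half a revolution before the block arrives under the head. A small sketch (the function name is made up for this example):

```python
def average_rotational_delay_ms(rpm):
    """Half of one revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


print(round(average_rotational_delay_ms(3600), 2))   # 8.33
print(round(average_rotational_delay_ms(5400), 2))   # 5.56
```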
Transfer Rate: t • “Typical” t: 1 to 3 MB/second • Transfer time: block size / t
Other Delays • CPU time to issue I/O • Contention for controller • Contention for bus, memory • “Typical” value: 0
So far: random block access • What about reading the “next” block?
If we do things right (e.g., double buffering, staggered blocks, …) • Time to get block = (block size / t) + negligible • The negligible part covers: skipping the gap, switching track, and once in a while moving to the next cylinder
Rule of Thumb • Random I/O: expensive • Sequential I/O: much less • Ex: 1 KB block • Random I/O: 20 ms • Sequential I/O: 1 ms
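The rule-of-thumb numbers can be roughly reproduced from the components of the access-time formula. The sketch below assumes S = 10 ms, R = 8.33 ms, t = 1 MB/second and "other" = 0, all taken from the "typical" values on the earlier slides; treating 1 MB as 10^6 bytes is an additional assumption, and the function names are hypothetical.

```python
def transfer_time_ms(block_bytes, rate_mb_per_s):
    """Transfer time = block size / t, with 1 MB taken as 10**6 bytes."""
    return block_bytes / (rate_mb_per_s * 1_000_000) * 1000


def random_access_ms(seek_ms, rotational_ms, block_bytes, rate_mb_per_s, other_ms=0.0):
    """Time = seek time + rotational delay + transfer time + other."""
    return seek_ms + rotational_ms + transfer_time_ms(block_bytes, rate_mb_per_s) + other_ms


def sequential_access_ms(block_bytes, rate_mb_per_s):
    """Reading the next block: transfer time only, the rest is negligible."""
    return transfer_time_ms(block_bytes, rate_mb_per_s)


block = 1024  # 1 KB block
print(round(random_access_ms(10, 8.33, block, 1), 2))   # 19.35, close to 20 ms
print(round(sequential_access_ms(block, 1), 2))         # 1.02, close to 1 ms
```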
Cost of writing: similar to reading … unless we want to verify that the block was written correctly! • Then we need to add a (full) rotation + block size / t, since the head cannot go back
To modify a block: • (a) Read the block • (b) Modify it in memory • (c) Write the block • [(d) Verify?]
Computation model for DBMS • Traditional RAM model • Assumes that data is in main memory and that accessing any item of data takes as much time as accessing any other • Is the RAM model suitable for a DBMS? • Assumptions of a DBMS: • Data does not fit into main memory • Secondary and even tertiary storage must be supported
I/O model of computation • The time taken to perform a disk I/O is much larger than the time to manipulate data in main memory • Quantity to minimize: number of block accesses (I/Os)
Merge-sort • A main-memory sorting algorithm • Complexity • T(n) = 2T(n/2) + an => T(n) = O(n log n) • Procedure • Basis: for a one-element list, do nothing • Induction: • Divide the list equally into two sublists • Sort the two sublists • Merge the two sorted sublists
Merging two sorted lists • Takes time linear in the total size of the two lists
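The basis, the induction step and the linear-time merge can be sketched as follows. This is a minimal in-memory illustration, not a tuned implementation; the function names are chosen for this example.

```python
def merge(left, right):
    """Merge two sorted lists in time linear in their total size."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])    # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Basis: a list of at most one element is already sorted."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2      # induction: split, sort both halves, merge
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


print(merge_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]
```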
Two-phase, Multiway Merge-sort (TPMMS) • Two phases • P1: sort main-memory-sized pieces of the data, producing a number of sorted sublists • P2: merge all the sorted sublists into a single sorted list
Phase 2 • 1. Find the smallest key among the first remaining elements of all sublists • 2. Move the smallest element to the first available position of the output buffer • 3. If the output block is full, write it to disk and reinitialize the output buffer • 4. If the input buffer from which the smallest element was taken is exhausted, read the next block of that sublist into the buffer
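The two phases can be simulated in memory to make the buffer handling concrete. In this sketch Python lists stand in for the disk, sizes are measured in records rather than bytes, and the parameter names `memory_records` and `block_records` are hypothetical. A heap is used to find the smallest key in step 1, which is a substitution for the linear scan the slide implies.

```python
import heapq
from collections import deque


def tpmms(records, memory_records, block_records):
    """Simulated two-phase, multiway merge sort (lists stand in for the disk)."""
    # Phase 1: sort main-memory-sized pieces into sorted sublists.
    sublists = [sorted(records[i:i + memory_records])
                for i in range(0, len(records), memory_records)]

    # One buffer is reserved for output, so only M/B - 1 sublists can be merged.
    if len(sublists) > memory_records // block_records - 1:
        raise ValueError("too many sublists for one merge phase")

    positions = [0] * len(sublists)             # next unread record per sublist
    input_buffers = [deque() for _ in sublists]

    def read_next_block(s):
        """Simulated disk read: load the next block of sublist s."""
        start = positions[s]
        block = sublists[s][start:start + block_records]
        positions[s] = start + len(block)
        input_buffers[s].extend(block)

    # The heap holds the first remaining element of every sublist.
    heap = []
    for s in range(len(sublists)):
        read_next_block(s)
        if input_buffers[s]:
            heapq.heappush(heap, (input_buffers[s].popleft(), s))

    disk_output, output_buffer = [], []
    while heap:
        key, s = heapq.heappop(heap)            # step 1: smallest key
        output_buffer.append(key)               # step 2: move to output buffer
        if len(output_buffer) == block_records:
            disk_output.extend(output_buffer)   # step 3: write the full block
            output_buffer = []
        if not input_buffers[s]:
            read_next_block(s)                  # step 4: refill exhausted buffer
        if input_buffers[s]:
            heapq.heappush(heap, (input_buffers[s].popleft(), s))

    disk_output.extend(output_buffer)           # flush the last partial block
    return disk_output


data = [27, 3, 14, 9, 1, 30, 8, 22, 5, 17, 12, 2]
print(tpmms(data, memory_records=4, block_records=1) == sorted(data))   # True
```

With 4 records of memory and 1 record per block, 3 input buffers are available, so at most 3 sublists of 4 records (12 records in total) can be sorted in two phases; larger inputs raise `ValueError` in this sketch.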
I/O cost of TPMMS • Given a relation with R blocks, the number of I/Os of TPMMS is 4R
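One way to see where 4R comes from, assuming every block is read once and written once in each of the two phases (so the final write of the sorted result is counted):

```latex
\text{I/Os}
  = \underbrace{R}_{\text{phase 1 reads}}
  + \underbrace{R}_{\text{phase 1 writes}}
  + \underbrace{R}_{\text{phase 2 reads}}
  + \underbrace{R}_{\text{phase 2 writes}}
  = 4R
```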
Upper bound of TPMMS • Block size: B bytes; main memory size: M bytes; records take R bytes (R here is the record size, not the block count of the previous slide) • Total number of records we can sort = the maximal number of sublists × the size of each sublist • The number of buffers in main memory: M/B • one for output • M/B - 1 for input = maximal number of sorted sublists • Each sublist contains at most M/R records • Total number of records we can sort is (M/R)(M/B - 1), approximately M^2/(RB) • Let M = 100 MB (about 10^8 bytes), B = 2^14, R = 160: M^2/(RB) = 4.2 billion records, about 2/3 TB
Multiway Merging of Larger Relations • Use TPMMS to sort groups of M^2/(RB) records, turning them into sorted sublists • In a third phase, merge up to (M/B) - 1 of these lists in a final multiway merge • Capacity: M^3/(RB^2) • In the above example, this supports 27 trillion records, about 4.3 PB
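The capacity figures for two and three phases can be checked numerically. The sketch below assumes M = 100 × 2^20 bytes (100 MB) and 160-byte records, which are the values that reproduce the quoted 4.2 billion and 27 trillion records; it uses the approximation M/B instead of M/B - 1, and the function name is hypothetical.

```python
def max_sortable_records(memory_bytes, block_bytes, record_bytes, phases=2):
    """Approximate capacity M^k / (R * B^(k-1)) for a k-phase merge sort."""
    # Phase 1 builds sublists of M/R records; each later phase merges about M/B lists.
    sublist_records = memory_bytes / record_bytes
    fan_in = memory_bytes / block_bytes
    return sublist_records * fan_in ** (phases - 1)


M = 100 * 2**20   # assumed: 100 MB of main memory, about 10^8 bytes
B = 2**14         # block size in bytes
R = 160           # assumed: record size in bytes

for phases in (2, 3):
    n = max_sortable_records(M, B, R, phases)
    # Two phases: about 4.2 billion records; three phases: about 27 trillion.
    print(phases, f"{n:.2e} records", f"{n * R:.2e} bytes")
```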