CS 162 Section Lecture 8
Life Cycle of an I/O Request
What happens when you issue a read() or write() request? The request passes through these layers:
• User Program
• Kernel I/O Subsystem
• Device Driver Top Half
• Device Driver Bottom Half
• Device Hardware
Key question: when should you return from the read()/write() call?
Interface Timing
• Blocking Interface: “Wait”
  • When requesting data (e.g., the read() system call), put the process to sleep until the data is ready
  • When writing data (e.g., the write() system call), put the process to sleep until the device is ready for the data
• Non-blocking Interface: “Don’t Wait”
  • Returns quickly from a read or write request with a count of the bytes successfully transferred to the kernel
  • A read may return nothing; a write may write nothing
• Asynchronous Interface: “Tell Me Later”
  • When requesting data, take a pointer to the user’s buffer and return immediately; later the kernel fills the buffer and notifies the user
  • When sending data, take a pointer to the user’s buffer and return immediately; later the kernel takes the data and notifies the user
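A minimal sketch of the first two modes (assuming a POSIX system; this example is not from the slides): the same read() issued blocking and then again with O_NONBLOCK set.

```c
/* Sketch: contrast a blocking read with a non-blocking read on stdin. */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

int main(void) {
    char buf[512];

    /* Blocking: read() puts the process to sleep until data is available. */
    ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
    printf("blocking read returned %zd bytes\n", n);

    /* Non-blocking: set O_NONBLOCK; read() returns immediately,
     * possibly with -1/EAGAIN if no data is ready yet. */
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);
    n = read(STDIN_FILENO, buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("non-blocking read: no data ready yet\n");
    else
        printf("non-blocking read returned %zd bytes\n", n);
    return 0;
}
```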
Magnetic Disk Characteristics
(Diagram: platter, track, sector, cylinder, and head; a request flows from the software queue in the device driver to the hardware controller and then to the media, whose access time is Seek + Rotation + Transfer.)
• Cylinder: all the tracks under the heads at a given arm position, across all surfaces
• Reading/writing data is a three-stage process:
  • Seek time: position the head/arm over the proper track (into the proper cylinder)
  • Rotational latency: wait for the desired sector to rotate under the read/write head
  • Transfer time: transfer a block of bits (a sector) under the read/write head
• Disk Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Transfer Time
• Highest bandwidth: transfer a large group of blocks sequentially from one track
We have a disk with the following parameters:
• 1 TB capacity
• 7200 RPM, data transfer rate of 40 MB/s (40 × 10^6 bytes/sec)
• Average seek time of 6 ms
• ATA controller with 2 ms controller initiation time
• A block size of 4 KB (4096 bytes)
What is the average time to read a random block from the disk? (A worked calculation follows below.)
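One way to work this out, plugging the slide’s parameters into the latency formula from the previous slide (queuing time taken as zero for a single isolated request):

```c
/* Average random-read time:
 * Latency = Controller + Seek + average Rotation + Transfer. */
#include <stdio.h>

int main(void) {
    double controller_ms = 2.0;                          /* ATA controller initiation */
    double seek_ms       = 6.0;                          /* average seek */
    double rotation_ms   = 0.5 * (60.0 / 7200.0) * 1e3;  /* half a revolution at 7200 RPM ~ 4.17 ms */
    double xfer_ms       = (4096.0 / 40e6) * 1e3;        /* 4096 bytes at 40 MB/s ~ 0.10 ms */

    printf("average read time = %.2f ms\n",
           controller_ms + seek_ms + rotation_ms + xfer_ms);  /* ~12.27 ms */
    return 0;
}
```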
SSD
• No penalty for random access
• Rule of thumb: writes are ~10x more expensive than reads, and erases are ~10x more expensive than writes (a read takes roughly 25 μs)
• Limited drive lifespan
• The controller maintains a pool of empty pages by coalescing used sectors (read, erase, write), and also reserves some percentage of capacity
• The controller uses ECC and performs wear leveling
• The OS may provide TRIM information about “deleted” sectors (normally only the file system knows about unallocated blocks, not the disk drive)
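To make the coalescing and wear-leveling idea concrete, here is a toy sketch (hypothetical structures, not any real controller’s firmware) of a flash translation layer that redirects every logical write to a fresh physical page instead of overwriting in place:

```c
/* Toy flash translation layer: writes never overwrite a physical page in
 * place; the old copy is marked invalid so it can be erased later (and the
 * wear spread across pages) by a garbage-collection pass. */
#include <stdint.h>

#define NUM_PAGES 1024

static uint32_t logical_to_physical[NUM_PAGES]; /* current mapping */
static uint8_t  physical_valid[NUM_PAGES];      /* does this physical page hold live data? */
static uint32_t next_free;                      /* naive allocator over the erased pool */

void ftl_write(uint32_t logical_page) {
    uint32_t old_phys = logical_to_physical[logical_page];
    physical_valid[old_phys] = 0;                    /* old copy becomes garbage to reclaim */

    uint32_t new_phys = next_free++ % NUM_PAGES;     /* a real FTL picks from an erased-block pool */
    logical_to_physical[logical_page] = new_phys;
    physical_valid[new_phys] = 1;                    /* data would be programmed into new_phys here */
}
```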
File System
• Transforms blocks into files and directories
• Optimizes for access and usage patterns
• Maximizes sequential access, while allowing efficient random access
File System Caching
• Optimizations for sequential access:
  • Try to store consecutive blocks of a file near each other
  • Store the inode near its data blocks
  • Try to locate a directory near the inodes it points to
• Buffer cache used to increase file system performance
  • Read-ahead prefetching and delayed writes
• Key idea: exploit locality by caching data in memory
  • Name translations: mapping from paths → inodes
  • Disk blocks: mapping from block address → disk contents
• Buffer Cache: memory used to cache kernel resources, including disk blocks and name translations
  • Can contain “dirty” blocks (blocks not yet on disk)
  • Size: adjust the boundary dynamically so that the disk access rates for paging and file access are balanced
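A minimal sketch of the caching idea (invented names, not any particular kernel’s API): look a block up in an in-memory table first, go to disk only on a miss, and prefetch the next block while we are there:

```c
/* Toy buffer cache keyed by block number; on a miss we read from disk
 * and also prefetch the following block (read-ahead). */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  4096
#define CACHE_SLOTS 256

struct cache_entry {
    uint64_t block_no;
    int      valid;
    int      dirty;                 /* written in memory but not yet on disk */
    char     data[BLOCK_SIZE];
};

static struct cache_entry cache[CACHE_SLOTS];

/* Placeholder for the real device read. */
static void disk_read(uint64_t block_no, char *buf) { (void)block_no; memset(buf, 0, BLOCK_SIZE); }

static struct cache_entry *slot_for(uint64_t block_no) {
    return &cache[block_no % CACHE_SLOTS];   /* direct-mapped for simplicity */
}

/* Return a cached copy of the block, filling the cache on a miss. */
char *cache_get(uint64_t block_no) {
    struct cache_entry *e = slot_for(block_no);
    if (!e->valid || e->block_no != block_no) {
        disk_read(block_no, e->data);                       /* miss: go to disk */
        e->block_no = block_no; e->valid = 1; e->dirty = 0;

        struct cache_entry *next = slot_for(block_no + 1);  /* read-ahead prefetch */
        if (!next->valid || next->block_no != block_no + 1) {
            disk_read(block_no + 1, next->data);
            next->block_no = block_no + 1; next->valid = 1; next->dirty = 0;
        }
    }
    return e->data;
}
```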
File System Caching (cont’d)
• Delayed writes: writes to files are not immediately sent out to disk
  • Instead, write() copies data from the user-space buffer to a kernel buffer (in the cache)
  • Enabled by the presence of the buffer cache: written file blocks can stay in the cache for a while
  • If some other application tries to read the data before it is written to disk, the file system will read it from the cache
  • Flushed to disk periodically (e.g., in UNIX, every 30 seconds)
• Advantages:
  • The disk scheduler can efficiently order lots of requests
  • The disk allocation algorithm can be run with the correct size value for a file
  • Some files need never get written to disk! (e.g., temporary scratch files written to /tmp often don’t exist for 30 seconds)
• Disadvantages:
  • What if the system crashes before the file has been written out?
  • Worse yet, what if the system crashes before a directory file has been written out? (We lose the pointer to the inode!)
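A sketch of the delayed-write path under the same toy cache as above (hypothetical names): write() only touches the in-memory copy and sets a dirty bit, and a separate periodic pass pushes dirty blocks to disk:

```c
/* Delayed writes on a toy buffer cache: the write path only updates memory;
 * flush_dirty_blocks() would run periodically (e.g., every 30 seconds). */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  4096
#define CACHE_SLOTS 256

struct cache_entry { uint64_t block_no; int valid; int dirty; char data[BLOCK_SIZE]; };
static struct cache_entry cache[CACHE_SLOTS];

static void disk_write(uint64_t block_no, const char *buf) { (void)block_no; (void)buf; }

/* "write()" path: copy user data into the cached block and mark it dirty. */
void cache_write(uint64_t block_no, const void *user_buf, size_t len) {
    struct cache_entry *e = &cache[block_no % CACHE_SLOTS];
    e->block_no = block_no; e->valid = 1;
    memcpy(e->data, user_buf, len < BLOCK_SIZE ? len : BLOCK_SIZE);
    e->dirty = 1;                        /* in memory, not yet on disk */
}

/* Periodic flusher: write back everything that is still dirty. */
void flush_dirty_blocks(void) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].dirty) {
            disk_write(cache[i].block_no, cache[i].data);
            cache[i].dirty = 0;
        }
    }
}
```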
Log-Structured and Journaled File Systems
• Better reliability through use of a log
  • All changes are treated as transactions
  • A transaction is committed once it is written to the log
    • Data is forced to disk for reliability
    • The process can be accelerated with NVRAM
  • Although the file system may not be updated immediately, the data is preserved in the log
• Difference between “log-structured” and “journaled”:
  • In a log-structured file system, the data stays in log form
  • In a journaled file system, the log is used only for recovery
• For a journaled system:
  • The log is used to asynchronously update the file system
    • Log entries are removed after they are used
  • After a crash:
    • Remaining committed transactions in the log are performed (“redo”)
    • Modifications are done in a way that can survive crashes
• Examples of journaled file systems: ext3 (Linux), XFS (Unix), HFS+ (Mac), NTFS (Windows), etc.
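The journaling idea can be sketched roughly as follows (hypothetical structures, not ext3’s actual on-disk format): append the change and a commit record to the log and force it to disk, apply the change to its real location later, and on recovery redo only transactions whose commit record made it into the log:

```c
/* Rough write-ahead journaling sketch. */
#include <stdint.h>
#include <string.h>

enum rec_type { REC_DATA, REC_COMMIT };

struct log_record {
    enum rec_type type;
    uint64_t txn_id;
    uint64_t block_no;      /* where the data eventually belongs */
    char     payload[64];   /* new contents (truncated for the sketch) */
};

#define LOG_CAP 1024
static struct log_record log_area[LOG_CAP];
static int log_len;

static void force_log_to_disk(void) { /* fsync of the log region would go here */ }
static void apply_to_filesystem(const struct log_record *r) { (void)r; /* in-place update */ }

/* Commit one transaction: durable as soon as the COMMIT record is on disk. */
void journal_commit(uint64_t txn_id, uint64_t block_no, const char payload[64]) {
    struct log_record data = { REC_DATA, txn_id, block_no, {0} };
    memcpy(data.payload, payload, 64);
    log_area[log_len++] = data;
    struct log_record commit = { REC_COMMIT, txn_id, 0, {0} };
    log_area[log_len++] = commit;
    force_log_to_disk();            /* after this returns, the change survives a crash */
}

/* Recovery: redo only transactions whose COMMIT record is present. */
void journal_recover(void) {
    for (int i = 0; i < log_len; i++) {
        if (log_area[i].type != REC_DATA) continue;
        for (int j = i + 1; j < log_len; j++) {
            if (log_area[j].type == REC_COMMIT && log_area[j].txn_id == log_area[i].txn_id) {
                apply_to_filesystem(&log_area[i]);
                break;
            }
        }
    }
}
```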
Key Value Store
• Very large scale storage systems
• Two operations:
  • put(key, value)
  • value = get(key)
• Challenges:
  • Fault tolerance → replication
  • Scalability → serve get()’s in parallel; replicate/cache hot tuples
  • Consistency → quorum consensus to improve put() performance
Key Value Store (cont’d)
• Also called a Distributed Hash Table (DHT)
• Main idea: partition the set of key-value pairs across many machines
(Diagram: key-value pairs spread across a set of machines.)
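As a tiny illustration of the partitioning idea (hypothetical helper names; consistent hashing as used by Chord is discussed next), a key can be mapped to the machine responsible for it by hashing it into the ID space:

```c
/* Partitioning sketch: hash the key and pick the machine whose range covers
 * it. Naive modulo placement here; real systems use consistent hashing so
 * that adding or removing machines moves few keys. */
#include <stdint.h>
#include <stdio.h>

#define NUM_MACHINES 4

/* FNV-1a, a simple well-known string hash. */
static uint64_t hash_key(const char *key) {
    uint64_t h = 0xcbf29ce484222325ULL;              /* FNV-1a 64-bit offset basis */
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 0x100000001b3ULL;                       /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Which machine stores this key? */
static int machine_for(const char *key) {
    return (int)(hash_key(key) % NUM_MACHINES);
}

int main(void) {
    printf("\"alice\" -> machine %d\n", machine_for("alice"));
    printf("\"bob\"   -> machine %d\n", machine_for("bob"));
    return 0;
}
```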
Chord Lookup
• Each node maintains a pointer to its successor
• Route a (key, value) packet to the node responsible for its ID using successor pointers
• E.g., node 4 looks up the node responsible for key 37
(Ring diagram: nodes 4, 8, 15, 20, 32, 35, 44, 58; lookup(37) issued at node 4 is forwarded along successor pointers until it reaches node 44, which is responsible for key 37.)
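A minimal sketch of successor-pointer lookup on that ring (hypothetical node list; real Chord nodes also keep a finger table, touched on in the next slide):

```c
/* Successor-pointer lookup on a tiny Chord-style ring: forward the query
 * to the successor until reaching the first node whose ID >= the key,
 * wrapping around the ring if necessary. */
#include <stdio.h>

#define RING_NODES 8
static const int ring[RING_NODES] = {4, 8, 15, 20, 32, 35, 44, 58}; /* sorted node IDs */

/* Return the node responsible for `key`. */
static int lookup(int start_idx, int key) {
    int idx = start_idx;
    for (int hops = 0; hops < RING_NODES; hops++) {
        if (ring[idx] >= key) return ring[idx];
        idx = (idx + 1) % RING_NODES;   /* follow the successor pointer */
    }
    return ring[0];                     /* wrapped all the way: smallest ID owns the key */
}

int main(void) {
    /* Node 4 (index 0) looking up key 37 should land on node 44. */
    printf("lookup(37) from node 4 -> node %d\n", lookup(0, 37));
    return 0;
}
```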
Chord
• Highly scalable distributed lookup protocol
• Each node needs to know about only O(log M) other nodes, where M is the total number of nodes
• Guarantees that a tuple is found in O(log M) steps
• Highly resilient: works with high probability even if half of the nodes fail
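The O(log M) routing state comes from Chord’s finger table: node n keeps a pointer to the successor of n + 2^i for each i, so each hop roughly halves the remaining distance to the target. A small sketch of how those finger targets are computed (hypothetical 6-bit ID space, chosen to match the example ring):

```c
/* Finger-table targets for a Chord node in an m-bit identifier space:
 * finger[i] points at successor(n + 2^i mod 2^m), for i = 0 .. m-1.
 * Only m entries per node, which is where the O(log M) routing state
 * and O(log M) lookup hops come from. */
#include <stdio.h>

#define M_BITS 6                      /* 6-bit ID space: IDs 0..63 */

int main(void) {
    int n = 4;                        /* the example node from the previous slide */
    for (int i = 0; i < M_BITS; i++) {
        int target = (n + (1 << i)) % (1 << M_BITS);
        printf("finger[%d] of node %d -> successor(%d)\n", i, n, target);
    }
    return 0;
}
```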