THE HP AUTORAID HIERARCHICAL STORAGE SYSTEM • J. Wilkes, R. Golding, C. Staelin, T. Sullivan • HP Laboratories, Palo Alto, CA
INTRODUCTION • must protect data against disk failures: they are too frequent and too hard to repair • possible solutions: • for small numbers of disks: mirroring • for larger numbers of disks: RAID
RAID • Typical RAID organizations: • Level 3: bit- or byte-interleaved with a dedicated parity disk • Level 5: block-interleaved with parity blocks distributed across all disks
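As a refresher on the redundancy behind these RAID levels, here is a minimal, illustrative sketch (not taken from the paper): the parity block of a stripe is the XOR of its data blocks, and any single lost block can be rebuilt by XOR-ing the surviving blocks with the parity.

```python
# Illustrative RAID parity sketch (not AutoRAID code): the parity block of a
# stripe is the XOR of its data blocks; a single lost block is rebuilt by
# XOR-ing the surviving data blocks with the parity block.

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def make_parity(data_blocks):
    return xor_blocks(data_blocks)

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single missing data block of a stripe."""
    return xor_blocks(surviving_blocks + [parity_block])

if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
    parity = make_parity(stripe)           # stored on a fourth disk
    assert reconstruct([stripe[0], stripe[2]], parity) == stripe[1]
```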
LIMITATIONS OF RAID (I) • Each RAID level performs well for a narrow range of workloads • Too many parameters to configure: data- and parity-layout, stripe depth, stripe width, cache sizes, write-back policies, ...
LIMITATIONS OF RAID (II) • Changing from one layout to another or adding capacity requires unloading and reloading the data • Spare disks remain unused until a failure occurs
A BETTER SOLUTION • A managed storage hierarchy: • mirror active data • store less active data in RAID 5 • This requires locality of reference: • the active subset must be fairly stable: found to be true in several studies
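The policy behind the hierarchy can be sketched roughly as below. This is only an illustration with hypothetical names (HierarchyPolicy, on_write and the capacity limit are assumptions); the real AutoRAID migration machinery is far richer.

```python
# Illustrative two-tier placement policy (hypothetical interface, not the
# actual AutoRAID algorithm): write-active blocks are mirrored, and the least
# recently written blocks are demoted to RAID 5 when mirror space runs out.
import time

class HierarchyPolicy:
    def __init__(self, mirror_capacity_blocks):
        self.capacity = mirror_capacity_blocks
        self.last_write = {}      # block -> timestamp of its last write
        self.mirrored = set()     # blocks currently held in mirrored storage
        self.raid5 = set()        # blocks currently held in RAID 5

    def on_write(self, block):
        """Keep write-active blocks in the mirrored storage class."""
        self.last_write[block] = time.time()
        self.raid5.discard(block)         # promote if it was in RAID 5
        self.mirrored.add(block)
        self._enforce_capacity()

    def _enforce_capacity(self):
        """Demote the least recently written blocks to RAID 5."""
        while len(self.mirrored) > self.capacity:
            victim = min(self.mirrored, key=lambda b: self.last_write[b])
            self.mirrored.discard(victim)
            self.raid5.add(victim)
```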
IMPLEMENTATION LEVEL • The storage hierarchy could be implemented: • Manually: can use the most knowledge but cannot adapt quickly • In the file system: offers the best balance of knowledge and implementation freedom but is specific to a particular file system • Through a smart array controller: easiest to deploy (the HP AutoRAID approach)
MAJOR FEATURES (I) • Mapping of host block addresses to physical disk locations • Mirroring of write-active data • Adaptation to changes in the amount of data stored: • Starts using RAID 5 when the array becomes full • Adaptation to workload changes: • Newly active data are promoted to mirrored storage, cooler data are demoted to RAID 5 • Hot-pluggable disks, fans, power supplies and controllers
MAJOR FEATURES (II) • On-line storage capacity expansion: the added space lets the system shift data back to mirrored storage • Can mix and match disk capacities • Controlled fail-over: can have dual controllers (primary/standby) • Active hot spares: spare space is used for additional mirroring • Simple administration and setup: appears to the host as one or more logical units • Log-structured RAID 5 writes
RELATED WORK (I) • Storage Technology Corporation Iceberg: • also uses redirection but based on RAID 6 • handles variable size records • emphasis on very high reliability
RELATED WORK (II) • Floating parity scheme from IBM Almaden: • Relocates parity blocks and uses distributed sparing • Work on log-structured file systems and cleaning policies at U.C. Berkeley
RELATED WORK (III) • The large literature on hierarchical storage systems • Schemes for compressing inactive data • Use of non-volatile memory (NVRAM) to optimize writes: • Allows reliable delayed writes
OVERVIEW • [Controller hardware diagram: processor, RAM and control logic; parity logic; DRAM read cache and NVRAM write cache; SCSI controllers; 2x10 MB/s internal buses; 20 MB/s link to the host computer]
PHYSICAL DATA LAYOUT • Data space on the disks is broken up into large Physical EXtents (PEXes): • Typical size is 1 MB • PEXes can be combined to form Physical Extent Groups (PEGs) containing at least three PEXes on three different disks • PEGs can be assigned to the mirrored storage class or to the RAID 5 storage class • Segments are the units of contiguous space on a disk (128 KB in the prototype)
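A minimal sketch of this bookkeeping is shown below; the field names and Python representation are assumptions for illustration, not the controller's real structures.

```python
# Sketch of the physical layout bookkeeping (illustrative only; field names
# are assumed, not taken from the AutoRAID implementation).
from dataclasses import dataclass, field
from typing import List

PEX_SIZE = 1 * 1024 * 1024      # typical physical extent size: 1 MB
SEGMENT_SIZE = 128 * 1024       # contiguous segment on a disk: 128 KB

@dataclass
class PEX:
    disk_id: int                # disk holding this physical extent
    start_offset: int           # byte offset of the extent on that disk

@dataclass
class PEG:
    storage_class: str          # "mirrored", "raid5", or unassigned
    pexes: List[PEX] = field(default_factory=list)

    def is_valid(self):
        # A PEG must contain at least three PEXes on three different disks.
        return (len(self.pexes) >= 3 and
                len({p.disk_id for p in self.pexes}) >= 3)
```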
LOGICAL DATA LAYOUT • Logical allocation and migration unit is the Relocation Block (RB) • Size in prototype was 64 KB: • Smaller RB’s require more mapping information but larger RB’s increase migration costs after small updates • Each PEG holds a fixed number of RB’s
MAPPING STRUCTURES • Map addresses from virtual volumes to PEGs, PEXes and physical disk addresses • Optimized for quickly finding the physical address of an RB given its logical address: • Each logical unit has a virtual device table listing all RBs in the logical unit and pointing to their PEGs • Each PEG has a PEG table listing all RBs in the PEG and the PEXes used to store them
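The two-step lookup can be sketched as below; the table layouts and names are assumptions for illustration (the real tables are far more compact), but the path — virtual device table, then PEG table, then PEX — follows the description above.

```python
# Sketch of the logical-to-physical lookup path (illustrative; table layouts
# and names are assumed, not the controller's actual data structures).
from collections import namedtuple

RB_SIZE = 64 * 1024                      # relocation block size in the prototype
PEX = namedtuple("PEX", "disk_id start_offset")

class PEGTable:
    """Lists the RBs held in one PEG and the PEXes that store them."""
    def __init__(self):
        self.rb_locations = {}           # rb_index -> (PEX, offset within the PEX)

class VirtualDeviceTable:
    """One per logical unit: lists its RBs and points to their PEGs."""
    def __init__(self):
        self.rb_to_peg = {}              # rb_index -> PEGTable

def resolve(vdt, logical_byte_addr):
    """Map a host byte address to (disk_id, physical byte offset)."""
    rb_index = logical_byte_addr // RB_SIZE
    peg = vdt.rb_to_peg[rb_index]                  # step 1: virtual device table
    pex, rb_offset = peg.rb_locations[rb_index]    # step 2: PEG table
    return pex.disk_id, pex.start_offset + rb_offset + logical_byte_addr % RB_SIZE
```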
NORMAL OPERATIONS (I) • Requests are sent to the controller in SCSI Command Descriptor Blocks (CDBs): • Up to 32 CDBs can be active simultaneously and 2048 more queued • Long requests are broken into 64 KB pieces
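Splitting a long request is straightforward; the sketch below is illustrative only (alignment and queueing details of the real controller are ignored).

```python
# Illustrative splitting of a long host request into 64 KB pieces
# (alignment and queueing details of the real controller are omitted).
def split_request(offset, length, chunk=64 * 1024):
    """Yield (offset, length) pieces no larger than 64 KB."""
    end = offset + length
    while offset < end:
        piece = min(chunk, end - offset)
        yield offset, piece
        offset += piece
```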
NORMAL OPERATIONS (II) • Read requests: • First check whether the data are already in the read cache or in the non-volatile write cache • Otherwise allocate cache space and issue one or more requests to the back-end storage classes • Write requests return as soon as the data are updated in the non-volatile write cache: • The cache uses a delayed-write policy
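The front-end behaviour can be sketched roughly as below; the class and method names are hypothetical, and flushing, space management and error handling are omitted.

```python
# Rough sketch of the front-end cache behaviour (hypothetical interface;
# flushing, space management and error handling are omitted).
class FrontEnd:
    def __init__(self, backend):
        self.read_cache = {}        # volatile DRAM read cache
        self.write_cache = {}       # non-volatile (NVRAM) write cache
        self.backend = backend      # mirrored / RAID 5 storage classes

    def read(self, block):
        if block in self.write_cache:        # newest (dirty) data wins
            return self.write_cache[block]
        if block in self.read_cache:
            return self.read_cache[block]
        data = self.backend.read(block)      # miss: go to back-end storage
        self.read_cache[block] = data
        return data

    def write(self, block, data):
        # Returns as soon as the data sit in NVRAM; the back-end write happens
        # later, when the delayed-write policy flushes the cache.
        self.write_cache[block] = data
```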
NORMAL OPERATIONS (III) • Flushing data from the cache can involve: • A back-end write to a mirrored storage class • Promotion from RAID 5 to mirrored storage before the write • Mirrored reads and writes are straightforward
NORMAL OPERATIONS (IV) • RAID 5 reads are straightforward • RAID 5 writes can be done: • On a per-RB basis: requires two reads and two writes • As batched, log-structured writes: more complex but cheaper
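The "two reads and two writes" cost is the classic RAID 5 read-modify-write parity update, sketched below for illustration (disks are modelled as plain dictionaries).

```python
# Illustrative RAID 5 small-write (read-modify-write) parity update showing
# where the two reads and two writes come from; disks are plain dicts here.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(data_disk, parity_disk, addr, new_data):
    old_data = data_disk[addr]                # read 1: old data block
    old_parity = parity_disk[addr]            # read 2: old parity block
    new_parity = xor(xor(old_parity, old_data), new_data)
    data_disk[addr] = new_data                # write 1: new data block
    parity_disk[addr] = new_parity            # write 2: new parity block
```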
BACKGROUND OPERATIONS • Triggered when the array has been idle for some time • Include: • Compaction of empty RB slots • Migration between storage classes (using an approximate LRU algorithm) • Load balancing between disks
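An idle-triggered loop over these three activities might look like the sketch below; the threshold, method names and scheduling are all assumptions, since the paper only names the activities.

```python
# Hypothetical idle-triggered background loop (threshold, method names and
# scheduling are assumptions; the paper only names the three activities).
import time

IDLE_THRESHOLD = 5.0                # seconds without host I/O (assumed value)

def background_loop(array):
    while True:
        if time.time() - array.last_host_io >= IDLE_THRESHOLD:
            array.compact_empty_rb_slots()       # reclaim holes in PEGs
            array.migrate_by_approximate_lru()   # demote cold RBs to RAID 5
            array.balance_load_across_disks()    # even out disk utilization
        time.sleep(1.0)
```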
MONITORING • System also includes: • An I/O logging tool and • A management tool for analyzing the array performance
PERFORMANCE RESULTS (I) • HP AutoRAID configuration with: • 16 MB of controller data cache • Twelve 2.0 GB Seagate Barracuda disks (7200 rpm) • Compared with: • A Data General RAID array with 64 MB of front-end cache • Eleven individual disk drives implementing disk striping but without any redundancy
PERFORMANCE RESULTS (II) • Results of an OLTP database workload: • AutoRAID was better than the RAID array and comparable to the set of non-redundant drives • But the whole database was stored in mirrored storage! • Microbenchmarks: • AutoRAID is always better than the RAID array but has lower I/O rates than the set of drives
SIMULATION RESULTS (I) • Increasing the disk speed improves the throughput: • Especially if density remains constant • Transfer rates matter more than rotational latency • 64KB seems to be a good size for the Relocation Blocks: • Around the size of a disk track
SIMULATION RESULTS (II) • Best heuristic for selecting the mirrored copy to read is shortest queue • Allowing write cache overwrites has a HUGE impact on performance • RBs demoted to RAID 5 should use existing holes when the system is not too heavily loaded
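The shortest-queue heuristic amounts to reading from whichever disk holding a copy currently has the fewest outstanding requests; a tiny illustrative sketch (the queue_depth attribute is an assumed interface):

```python
# Illustrative shortest-queue choice of mirrored copy to read
# (the queue_depth attribute is a hypothetical, assumed interface).
def pick_mirror_copy(disks_holding_copy):
    """Read from the replica on the disk with the fewest queued requests."""
    return min(disks_holding_copy, key=lambda disk: disk.queue_depth)
```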
SUMMARY (I) • The system is very easy to set up • Dynamic adaptation is a big win but it will not work for all workloads • Software is what makes AutoRAID, not the hardware • Being auto-adaptive makes AutoRAID hard to benchmark
SUMMARY (II) • Future work includes: • System tuning, especially: • Idle-period detection • Front-end cache management algorithms • Developing better techniques for synthesizing traces