Harnessing Petabytes of Online Storage Effectively
2005/09/27
Jun Nitta (jun.nitta.wg@hitachi.com)
Hitachi, Ltd.
Contents
1. Introduction: where are we today?
2. Configuring mass online storage
3. Defining distribution of intelligence
4. Miscellaneous topics
5. Summary: beyond 10 petabytes
1 Introduction: where are we today?
1-1 Looking into the latest specifications of HDDs…
disk size | capacity / disks | rotational speed (seek / latency) | interface (sustained data rate) | data buffer | target use
3.5'' | 147GB / 5 | 15,000rpm (3.7ms / 2.0ms) | 4Gb/s FC-AL (n/a–93.3MB/s) | 16MB | high-performance OLTP
3.5'' | 300GB / 5 | 10,025rpm (4.7ms / 3.0ms) | 2Gb/s FC-AL (46.8–89.3MB/s) | 16MB | most other applications
3.5'' | 500GB / 5 | 7,200rpm (8.5ms / 4.2ms) | 3Gb/s SATA-II (31–64.8MB/s) | 16MB | large-volume archives
2.5'' | 100GB / 2 | 7,200rpm (10ms / 4.2ms) | 1.5Gb/s SATA (n/a) | 8MB | small form factor
1.0'' | 8GB / 1 | 3,600rpm (12ms / 8.3ms) | CE-ATA (5.1–10.0MB/s) | 128KB | portable audio player?
* based on HGST catalogues as of Sep. 2005
1-2 … and storage subsystems (RAID controllers)
roughly: 1 rack = 200 disks (3.5") = 100TB (with 500GB drives)
Four subsystem classes, from enterprise down through midrange to workgroup:
HDDs | 1,152 (5 cabinets) | 240 | 225 | 105
raw capacity | 332TB (FC) | 72TB (FC) | 88.5TB (SATA) | 40.5TB (SATA)
LUNs | 16,384 | 16,384 | 2,048 | 512
FC ports | 192 | 48 | 4 | 4
cache | 128GB | 64GB | 8GB | 4GB
* based on HDS catalogues as of Sep. 2005
1-3 Sheer number of HDDs matters practically
HDDs | capacity | practicality | major inhibitor besides $$$
O(10^0) | 500GB | a piece of cake (even possible personally) | none
O(10^1) | 5TB | today's enterprise mainstream | storage management
O(10^2) | 50TB | practical limit for most datacenters | storage management
O(10^3) | 500TB | challenging but still feasible | storage management
O(10^4) | 5PB | getting impractical | power & cooling
O(10^5) | 50PB | almost prohibitive | power & cooling
O(10^6) | 500PB | — | disk failure*
* MTBF of a high-end FC HDD is 10^6 hours by catalogue spec (= 114 years; actual numbers may vary by an order of magnitude); see the sketch below
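A quick back-of-the-envelope check of the footnote, as a minimal Python sketch; the MTBF figure is the catalogue number quoted above, and everything else is simple arithmetic:

```python
# Back-of-the-envelope failure arithmetic behind the MTBF footnote:
# with an MTBF of 1e6 hours per drive, the expected failure rate of a
# population of n drives is roughly n / MTBF failures per hour.
MTBF_HOURS = 1e6  # catalogue spec for a high-end FC HDD

for exponent in range(0, 7):
    drives = 10 ** exponent
    failures_per_day = drives / MTBF_HOURS * 24
    print(f"O(10^{exponent}) drives: ~{failures_per_day:.4f} expected failures/day")

# At O(10^6) drives this works out to roughly one failure per hour,
# which is why disk failure itself becomes the dominant inhibitor.
```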
2 Configuring mass online storage: array of nodes or disks?
2-1 Two alternatives to configure online storage
[diagram: a diskless server farm (CPU + memory) connected over a storage network (FC or IP) to a storage farm of HDDs]
array-of-nodes (stack of self-contained boxes)
• Very cost effective for some kinds of applications
  - secondary data management (especially search)
  - can utilize the cheapest components
array-of-disks (separate from servers)
• Versatile for various mixes of applications: OLTP, ERP, DWH, email, …
• Cost is steadily going down
2-2 Rationale for array-of-disks model
• It is reasonable to have external storage subsystems
  - Disks can be shared among clusters of servers
  - Spare disks can be shared within a storage subsystem
• It is reasonable to separate the mechanical components
  - The HDD is the only mechanical component besides the cooling fans
  - It makes hot-swap mechanisms much easier to implement
[diagram: HDDs 1–11 organized into RAID-5 (4D+1P) group 1, RAID-5 (4D+1P) group 2, and a shared hot-spare disk; parity sketch below]
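As a minimal sketch of the RAID-5 (4D+1P) groups in the diagram, the following Python fragment shows the XOR parity that lets a failed member be rebuilt onto the shared hot-spare; block sizes and names are illustrative only, not any controller's implementation:

```python
# Minimal RAID-5 (4D+1P) sketch: the parity block is the XOR of the four
# data blocks, so any single lost block can be reconstructed from the
# remaining four (which is what gets rebuilt onto a shared hot-spare).

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data = [bytes([i] * 8) for i in range(1, 5)]   # four 8-byte data blocks
parity = xor_blocks(data)                      # the "P" in 4D+1P

# Simulate losing data block 2 and rebuilding it onto a spare:
surviving = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(surviving)
assert rebuilt == data[2]
```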
2-3 Additional discussion for array-of-disks model
• It makes data management easier*
  - Various data protection techniques can be employed, including third-party backup and D2D replication
• For the array-of-nodes configuration, replication between nodes is almost the only viable solution for data protection (conventional backup is difficult to employ effectively)
[diagram: application server and backup server attached to a RAID subsystem and a tape library]
* Actually, backup is one of the most compelling reasons to consolidate scattered storage into an external RAID box
2-4 But does this dichotomy have a meaning?
• Nonetheless we need a storage "controller" for array-of-disks
• "Controller" is just another name for a special-purpose server whose restricted operating environment some users prefer
• The two configurations differ essentially in the CPU-to-HDD ratio, which is determined by the intelligence a storage farm requires
basic building block | petabytes configuration
each HDD already has a CPU and memory (the device controller) | O(10^3–10^4) clustered disks
general-purpose server with a couple of disks | O(10^3–10^4) clustered nodes
special-purpose controller with a lot of disks | O(10^0–10^2) clustered subsystems
Which is most promising?
3 Defining distribution of intelligence: protocol and interface
3-1 Distribution of intelligence among farms
[diagram: server farm and storage farm connected by the storage network (FC or IP), with server-side intelligence on one end and storage-side intelligence on the other]
• 3 reasons some functions are better placed on the storage side
  - It is naturally implemented using CPU and memory near the HDDs
  - It requires operations with durable state
  - It makes multiple servers share data objects
• 3 reasons some functions are better placed on the server side
  - It is better implemented using CPU and memory near the applications
  - It requires more powerful and economical CPU / memory
  - It handles multiple controllers
3-2 Alternative way to place intelligence
[diagram: between the network core and each farm lies a boundary region holding network-edge intelligence on the server side and on the storage side — is this part of the network or of a farm?]
• Some intelligence could be placed on the network
• But a closer look reveals that most of those "intelligent network components" are not genuine network-core components
• Rather, they are placed on the boundary between the network and the servers / storage, which is not a clear-cut edge but a blurred region
3-3 Placement of functions: an example
• Here is an example of an intelligence distribution scheme, assuming the array-of-disks configuration
storage side intelligence
• basic RAID control / LUN management
• remote filesystem
• local replication including snapshots (copy-on-write; sketched below)
• volume migration transparent to servers
server side intelligence
• local filesystem
• volume migration among multiple controllers
• multi-path management (load balancing & failover)
• content search / indexing
intelligence on both sides
• block aggregation (a.k.a. logical volume management)
• remote replication
• backup
• data encryption
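A minimal sketch of the copy-on-write snapshot idea listed under storage-side intelligence; the class and method names are invented for illustration and do not correspond to any product API:

```python
# Minimal block-level copy-on-write snapshot sketch: nothing is copied when
# the snapshot is taken; the old contents of a block are preserved only when
# that block is first overwritten afterwards.

class CowVolume:
    def __init__(self, nblocks):
        self.blocks = {i: b"\0" * 512 for i in range(nblocks)}
        self.snapshots = []          # each snapshot: {block_no: old_data}

    def take_snapshot(self):
        self.snapshots.append({})    # empty map: no data copied up front
        return len(self.snapshots) - 1

    def write(self, block_no, data):
        # Copy the old block into every snapshot that has not saved it yet.
        for snap in self.snapshots:
            snap.setdefault(block_no, self.blocks[block_no])
        self.blocks[block_no] = data

    def read_snapshot(self, snap_id, block_no):
        # Snapshot view: preserved old block if overwritten since, else current.
        return self.snapshots[snap_id].get(block_no, self.blocks[block_no])

vol = CowVolume(4)
snap = vol.take_snapshot()
vol.write(0, b"new data".ljust(512, b"\0"))
assert vol.read_snapshot(snap, 0) == b"\0" * 512   # snapshot still sees old data
```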
3-4 Which interface & protocol should we adopt?
• There are 3 well-established I/O interfaces: block, file, SQL
• None of them is optimal for today's server/storage-farm environment, though file may be the most promising thanks to its balanced features
• But I/O interfaces are very conservative and resistant to change
• Thus multi-interface / multi-protocol support is a practical solution
interface | block | file | SQL
protocol (transport) | SCSI-3 (FC or IP) | NFS / CIFS-SMB (TCP/IP) | proprietary (mostly TCP/IP)
strength | low latency; strong standard protocol | broad application; strong standard protocol | high level enough to encapsulate physical properties
weakness | layers away from the application; not network-friendly | performance and scalability (especially for DBMS) | limited application; no standard protocol
4 Miscellaneous topics for managing petabytes of online storage
4-1 Virtualization: simply too many mappings
[diagram: volume mappings stacked at the server level (HBA / device driver, OS / LVM, DBMS), the switch level, and the controller level — each layer performs its own RAID / block aggregation and exports LUs, so the AP-recognizable volume sits on several nested mappings down to the HDDs]
• "Virtualization" itself is a powerful technology to hide complexity if used properly
• But the current situation is too confusing
• Operating systems and DBMSs should be aware that a storage volume is a logical network resource
  - It can even expand and shrink dynamically (see the sketch below)
  - There may be more than 100,000 volumes on the network (most OSes can recognize only about 1,000 volumes)
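One of those mappings, reduced to a minimal sketch: a logical volume whose address space is stitched together from extents on physical LUs, which also shows how a volume can grow dynamically. The extent size and LU names are illustrative assumptions:

```python
# Minimal logical-volume mapping sketch: a logical address is translated to
# (physical LU, physical offset) through an extent map, one of the many
# mapping layers the slide warns about.

EXTENT = 1024 * 1024  # 1 MiB extents (illustrative choice)

class LogicalVolume:
    def __init__(self):
        self.extent_map = []          # index = logical extent number

    def append_extent(self, physical_lu, physical_extent):
        # Growing the map is how a volume "expands dynamically".
        self.extent_map.append((physical_lu, physical_extent))

    def map_address(self, logical_byte):
        lu, phys_extent = self.extent_map[logical_byte // EXTENT]
        return lu, phys_extent * EXTENT + logical_byte % EXTENT

lv = LogicalVolume()
lv.append_extent("LU-0007", 42)
lv.append_extent("LU-0019", 3)
print(lv.map_address(EXTENT + 100))   # -> ('LU-0019', 3*EXTENT + 100)
```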
4-2 Data protection: disk plays the protagonist
• You have to go to disk at least for the first step to make backup workable for >10TB of data
• Eventually those data may go to tape (D2D2T)
typical backup scenario for a large amount of data (see the sketch below):
1) make the application quiescent
2) take a snapshot (copy-on-write on the RAID controller)
3) resume the application
4) mount the snapshot on the backup server
5) back up to a VTL or replicate to disks
[diagram: AP/DBMS on server1 and a backup agent on server2, coordinated by a data protection manager; the controller holds the primary volume and a consistent copy-on-write snapshot, and the VTL provides MT emulation on RAID disks]
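The five steps above, expressed as a minimal orchestration sketch; the app, controller and backup_server objects are hypothetical stand-ins, not a real data protection manager's API:

```python
# Minimal sketch of the five-step backup scenario; every method call here is
# a hypothetical interface standing in for the application, the RAID
# controller and the backup server.

def protect(app, controller, backup_server, volume):
    app.quiesce()                               # 1) make quiescent (flush, hold writes)
    snap = controller.snapshot(volume)          # 2) take a copy-on-write snapshot
    app.resume()                                # 3) resume normal I/O immediately
    mount_point = backup_server.mount(snap)     # 4) mount the snapshot on server2
    backup_server.copy_to_vtl(mount_point)      # 5) stream it to the VTL (or D2D copy)
    controller.delete_snapshot(snap)            # release the snapshot when done
```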
4-3 Data migration: latent cost of online storage
• Since data always outlives its container, you will have to migrate data from one subsystem to another several times
• Non-disruptiveness to the upper layers is desirable, which requires some form of address mapping
• Durable address mapping for storage is not well standardized at either the block or the file level (cf. URL -[DNS]-> IP address -> MAC address); see the sketch below
[diagram: the data movement from an old controller to a new one is invariant; it can be hidden by a mapping at the server level (more flexible, but the path is scattered & long), at the switch level, or at the storage level (SCSI LUN, less flexible, but localized & short) — each layer adds yet another mapping]
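A minimal sketch of the durable-name indirection the slide argues for, by analogy with DNS: servers keep addressing data by a stable name while migration only updates the mapping underneath. The catalog structure and identifiers are illustrative assumptions:

```python
# Minimal durable-address-mapping sketch: servers resolve a stable name to
# the current location, so migrating the data only flips the mapping.

catalog = {"volume:payroll": ("old-controller", "LUN-12")}

def resolve(durable_name):
    """Server-visible name -> current (subsystem, LUN) location."""
    return catalog[durable_name]

def migrate(durable_name, new_location):
    # Copy the data to the new subsystem (omitted), then update the mapping.
    catalog[durable_name] = new_location

migrate("volume:payroll", ("new-controller", "LUN-03"))
print(resolve("volume:payroll"))   # servers keep using the same durable name
```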
4-4 Security: as always, it matters
• And of course there are a lot of security concerns storage subsystems have to take care of
  - management port security: user authentication, access control, data-in-flight protection
  - data port security: device authentication, access control, data-in-flight protection
  - other subsystem security: data-at-rest protection, audit logging
• Data-at-rest protection is much more challenging than data-in-flight protection because of long-term key management (see the sketch below)
[diagram: the storage administrator reaches the subsystem through a management server on the management ports; application servers attach through the data ports; the primary site replicates to a secondary site]
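A minimal sketch of why long-term key management is the hard part of data-at-rest protection: per-volume data keys are wrapped by a master key that must remain available (and rotatable) for the whole life of the data. It uses the third-party Python 'cryptography' package purely for illustration; the key-store layout is an assumption:

```python
# Minimal key-wrapping sketch for data-at-rest protection: ciphertext is only
# as durable as access to the keys, so the master (key-encryption) key must
# be managed for as long as the data lives.

from cryptography.fernet import Fernet

master = Fernet(Fernet.generate_key())         # long-lived key-encryption key

def new_volume_key():
    data_key = Fernet.generate_key()
    return data_key, master.encrypt(data_key)  # persist only the wrapped form

data_key, wrapped = new_volume_key()
ciphertext = Fernet(data_key).encrypt(b"data at rest")

# Years later: unwrap the data key with the (still managed) master key.
recovered_key = master.decrypt(wrapped)
assert Fernet(recovered_key).decrypt(ciphertext) == b"data at rest"
```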
4-5 Storage resource management: spreadsheet?
• Even basic discovery-and-reporting is still a pain in the neck for most administrators
  - The most widely used management tool today is a spreadsheet
  - But can administrators keep using it for a PB environment?
• The SNIA SMI-S standard looks good because of its set-oriented query capability (SNMP is already broken for storage management); a sketch of the set-oriented idea follows
• Yet most commercial tools are not proven at PB scale
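A minimal sketch of what a set-oriented query buys over a hand-maintained spreadsheet: once discovery returns resources as structured records, capacity reporting becomes a query over the set. The records below are made up for illustration:

```python
# Minimal set-oriented reporting sketch: aggregate used/total capacity per
# subsystem from discovered volume records instead of a spreadsheet.

from collections import defaultdict

volumes = [
    {"subsystem": "array-01", "pool": "FC",   "capacity_gb": 500,  "used_gb": 320},
    {"subsystem": "array-01", "pool": "SATA", "capacity_gb": 2000, "used_gb": 1500},
    {"subsystem": "array-02", "pool": "SATA", "capacity_gb": 4000, "used_gb": 900},
]

usage = defaultdict(lambda: [0, 0])            # subsystem -> [capacity, used]
for v in volumes:
    usage[v["subsystem"]][0] += v["capacity_gb"]
    usage[v["subsystem"]][1] += v["used_gb"]

for subsystem, (cap, used) in sorted(usage.items()):
    print(f"{subsystem}: {used}/{cap} GB used ({100 * used / cap:.0f}%)")
```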
4-6 Applications: will they use a DBMS?
• What kinds of applications will use petabytes of online storage?
  - email / IM, voice, video archives, …
  - stream data from sensor networks (including RFID)
  - geoscience, bioscience, medical, …
• How will those data be managed?
  - Most bulk data may not be stored in RDBMSs but in filesystems (with a global namespace); see the sketch below
  - An XML native store may engulf a lot of data (structured and semi-structured) once well established
[diagram: today's typical PB system — application servers with a front-end application cache; contents servers running a file server, contents manager and HSM with staging disks and an MT library holding O(1PB), e.g. 100MB × 10^7 files; a DBMS holding the O(10GB–100GB) metadata DB]
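A minimal sketch of the pattern in the figure — bulk objects as plain files plus a metadata database — with sqlite3 standing in for the metadata DBMS; the schema and paths are illustrative assumptions. It also checks the 100MB × 10^7 files arithmetic:

```python
# Minimal "files for bulk data, DBMS for metadata" sketch: the objects live
# in a (global) filesystem, while a small database records where they are.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE objects (
                  object_id TEXT PRIMARY KEY,
                  path      TEXT,     -- location in the (global) filesystem
                  size_mb   INTEGER,
                  media     TEXT      -- 'disk', or 'tape' after HSM migration
              )""")
db.execute("INSERT INTO objects VALUES "
           "('clip-000001', '/archive/2005/clip-000001.mpg', 100, 'disk')")

# 10^7 such 100 MB objects is the O(1PB) figure quoted in the diagram:
print(f"{100 * 10**7 / 10**6:,.0f} TB of contents")   # -> 1,000 TB = 1 PB
```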
5 Summary: beyond 10 petabytes
5-1 Beyond 10 petabytes of data
• The continuing capacity growth of HDDs will bring >10PB of online storage within the reach of most IT organizations in 5 years
  - HDDs with perpendicular magnetic recording technology are emerging
  - The declining $/GB trend shows no sign of stopping
• The server farm – network – storage farm configuration will continue to dominate enterprise data centers
  - It is the most cost-effective and flexible way to configure online storage for a variety of applications
• The protocol and interface between servers and storage should evolve to be more network-conscious
  - But the old guard will not die in the foreseeable future
• An XML data store may come to play a significant role in addition to filesystems and RDBMSs
  - Who knows!
Harnessing Petabytes of Online Storage Effectively
2005/09/27
Jun Nitta
Hitachi, Ltd.