Learn how to effectively manage the challenges of petabyte data storage, including data classification, content management, application and database characteristics, and more.
Managing the Unimaginable: A Practical Approach to Petabyte Data Storage Randy Cochran, Infrastructure Architect, IBM Corporation, hcochran@us.ibm.com TOD -1366 - Information on Demand Infrastructure
Data Storage is Getting Out-of-Hand Are storage demands starting to overpower you?
Most Research Firms Agree • “It is projected that just four years from now, the world’s information base will be doubling in size every 11 hours.” (“The toxic terabyte; How data-dumping threatens business efficiency”, Paul Coles, Tony Cox, Chris Mackey, and Simon Richardson, IBM Global Technology Services white paper, July 2006) • “Our two-year terabyte CAGR of 52% is 3ppt (percentage points) below rolling four quarter results of 55%.” ("Enterprise Hardware: 2007-08 storage forecast & views from CIOs", Richard Farmer and Neal Austria, Merrill Lynch Industry Overview, 03 January 2007) • “With a 2006–2011 CAGR nearing 60%, there is no lack in demand for storage…” ("Worldwide Disk Storage Systems 2007–2011 Forecast: Mature, But Still Growing and Changing", Research Report # IDC206662, Natalya Yezhkova, Electronics.ca Publications, May 2007) • “According to TheInfoPro… the average installed capacity in Fortune 1000 organizations has jumped from 198 TB in early 2005 to 680 TB in October 2006. … TIP found that capacity is doubling every 10 months.” (InfoStor Magazine, Kevin Komiega, October 19, 2006)
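The quotes above mix two ways of stating growth: a compound annual growth rate (CAGR) and a doubling period. A minimal sketch for converting between the two, assuming smooth compound growth (the function names are illustrative and not from the presentation):

```python
import math


def annual_growth_from_doubling(months_to_double: float) -> float:
    """Annual growth rate (as a fraction) implied by a doubling period in months."""
    return 2 ** (12.0 / months_to_double) - 1.0


def doubling_months_from_cagr(cagr: float) -> float:
    """Months needed for capacity to double at a given CAGR (as a fraction)."""
    return 12.0 * math.log(2.0) / math.log(1.0 + cagr)


if __name__ == "__main__":
    # TheInfoPro figure: capacity doubling every 10 months.
    print(f"10-month doubling = {annual_growth_from_doubling(10):.0%} per year")
    # IDC figure: CAGR nearing 60%.
    print(f"60% CAGR doubles in {doubling_months_from_cagr(0.60):.1f} months")
```

Under these assumptions, a 10-month doubling period corresponds to roughly 130% annual growth, and a 60% CAGR doubles capacity in about 18 months. The cited sources are therefore describing noticeably different growth rates, not the same trend in different units.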
What’s Driving Petabyte Level Storage? The “Perfect Storm” • Disaster recovery plans • General increase in demand • Declining storage media costs • New digital data technologies • A desire for greater storage efficiency • More regulatory requirements • Storage technical skills scarcity • Better protection from litigation • A growing understanding of retained data’s business value • Proliferation of sophisticated applications According to IDC, between 2006 and 2010 information added annually to the digital universe will increase more than sixfold from 161 to 988 exabytes.
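The IDC projection can be restated as an implied annual growth rate. A small sketch, assuming four compounding years between 2006 and 2010 (the helper name is illustrative):

```python
def implied_cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # 161 exabytes in 2006 growing to 988 exabytes in 2010.
    print(f"Implied CAGR: {implied_cagr(161, 988, 4):.1%}")
```

This works out to roughly 57% per year under the stated assumption.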
Just How Big is a Petabyte? Petabyte storage has been around for years – online Petabyte storage has not. “Ninety-two percent of new information is stored on magnetic media, primarily hard disks.” (“How Much Information 2003”, UC Berkeley's School of Information Management and Systems)
How Big is That in Human Terms? According to Britannica.com, the U.S. Library of Congress contains approximately 18 million books, 2.5 million recordings, 12 million photographs, 4.5 million maps, and more than 54 million manuscripts.
Why is Petabyte Storage a Challenge? • Areas Impacted by Petabyte Storage: • Content and File Management • Application & Database Characteristics • Storage Management • Architectural Design Strategy • Performance and Capacity • SAN Fabric Design • Backup and Recovery Methods • Security System Complexity • Compliance with Regulatory Requirements • Operational Policies and Processes • Maintenance Requirements
Management Starts With Data Classification • Data Classification Assumptions • Not all data is created equal • The business value of data changes over time • Performance can be improved by re-allocating data to an optimized storage configuration • The value of most business data is not fixed; it is expected to change over time • Understanding the business value of data is crucial in designing an effective data management strategy Which data has a greater value to the business - a client’s purchase record, or a memo about last year’s phone system upgrade?
Data Classification Example There are no universally accepted standard definitions for Tier Levels.
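Because tier definitions vary by organization, any classification rule has to be written down explicitly. The sketch below is one hypothetical rule set: the tier names are borrowed from the tiered storage architecture slide in this deck, while the `business_value` score, the 365-day cutoff, and the thresholds are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    business_value: int      # hypothetical score: 1 (low) to 5 (high)
    days_since_access: int   # days since the data was last read or written


def classify(ds: DataSet) -> str:
    """Map a data set to a tier using illustrative, site-specific thresholds."""
    if ds.days_since_access > 365:
        # Stale data goes to near-line or off-line storage regardless of score.
        return "Reference / Historical"
    if ds.business_value >= 5:
        return "Business Critical"
    if ds.business_value >= 3:
        return "Business Important"
    return "Business Standard"


if __name__ == "__main__":
    for ds in (DataSet("client purchase records", 5, 2),
               DataSet("phone system upgrade memo", 1, 400)):
        print(f"{ds.name}: {classify(ds)}")
```

A real policy would likely weigh more inputs (regulatory holds, recovery objectives, owner), but even a simple rule like this makes the classification repeatable and auditable.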
Control Your File Content • Implement file aging • Set data retention periods • Eliminate low value data • Clean out old backup files • Eliminate outdated information • Deploy de-duplication technology • Reduce storage of low value data • Locate and purge corrupt files • Crack down on unauthorized storage usage • Periodically review log files and archive or delete obsolete information
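File aging starts with knowing which files are old. A minimal sketch, assuming that "age" means time since last modification and that the goal is a report for review and not automatic deletion (the function name and the 90-day threshold are illustrative):

```python
import os
import time
from typing import Iterator, Optional, Tuple


def find_aged_files(root: str, max_age_days: float,
                    now: Optional[float] = None) -> Iterator[Tuple[str, float]]:
    """Yield (path, age_in_days) for files not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86,400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path, (now - mtime) / 86400


if __name__ == "__main__":
    # Report only; what to archive or delete remains a policy decision.
    for path, age in find_aged_files(".", max_age_days=90):
        print(f"{age:8.1f} days  {path}")
```

Modification time is used here because access times are often disabled or unreliable on production file systems; a site that tracks access time could substitute `os.path.getatime`.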
Know Your Application and Database Needs • Know your application’s needs • User expectations • Workload complexity • Read or write intensity • Sequential files usage • IOPS dependence • Stripe size optimization • Throughput requirements • Service prioritization • Growth expectations • Don’t allow databases to “lock up” vast amounts of storage
Applications Will Drive Storage Requirements • Application characteristics will drive storage decisions • Value to the business • Number of users • Usage patterns • Steady • Bursty • Cyclical • Variable • 7x24 or 9x5 access • Domestic or global access • Distributed or self-contained • High or low security data • Architectural constraints • Significant performance gains (or losses) can be achieved by matching requirements to storage characteristics
Large Storage Systems Must Be Managed • Information Lifecycle Management (ILM) • Hierarchical Storage Management (HSM) • Storage Resource Management (SRM) • Storage Virtualization "Enterprises can achieve better and more targeted utilization of resources by first establishing the value of their information assets and then using storage management software to execute the policies that define how resources are utilized." Noemi Greyzdorf, research manager, Storage Software, IDC
Information Lifecycle Management “(ILM is) the process of managing business data throughout its lifecycle from conception until disposition across different storage media, within the constraints of the business process.” (courtesy of Veritas Corporation, Nov. 2004) ILM is not a commercial product, but a complete set of products and processes for managing data from its initial inception to its final disposition.
Information Lifecycle Management • Information has business value • Its value changes over time • It ages at different rates • It has a finite life-cycle • As data ages its performance needs change • Some information is subject to different security requirements, due to government regulatory or legal enforcements • Outdated information has different disposal criteria • A combination of processes and technologies that determine how information flows through a corporate environment • Encompasses management of information from its creation until it becomes obsolete and is destroyed
“Best Practices” for ILM Implementations • Know exactly where information is stored • Be able to retrieve information quickly and efficiently • Limit access to only those who need to view data • Create policies for managing and maintaining data • Do not destroy important documents • Avoid keeping multiple copies of the same data • Retain information only until it is no longer useful • Destroy outdated files on a regular basis • Document all processes and keep them up-to-date
Hierarchical Storage Management “HSM is a policy-based data storage management system that automatically moves data between high-cost and low-cost storage media, without requiring the knowledge or involvement of the user.” (courtesy of http://searchstorage.techtarget.com) IBM has been involved in providing HSM solutions for over 30 years and offers a wide variety of products with automated data movement capabilities.
Hierarchical Storage Management • HSM Concepts • Only 10%-15% of most data is actively accessed • The business value of data changes over time • Between 80% and 90% of all stored data is inactive • High performance storage (FC disk) is expensive • Lower performance media (tape, optical platters, and SATA disk) are comparatively inexpensive 10% 20% 70% Archive
Hierarchical Storage Management • HSM Concepts (cont.) • Enterprise class storage is not required for all data • Policies can be set to establish the proper frequency for transitioning aging data to less expensive media • HSM allows optimal utilization of expensive disk storage • Low cost, high density disks consume fewer resources • Overall storage system performance may improve $$$$ $$$ $$ $
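The policy idea behind HSM can be sketched as a table of age thresholds that decides where data should live. The thresholds (30 and 180 days) and the three tier labels below are assumptions chosen for illustration; they do not describe how any of the IBM products in this deck implement migration.

```python
# Hypothetical policy: (maximum days since last access, tier), fastest tier first.
HSM_POLICY = [
    (30, "FC disk"),
    (180, "SATA disk"),
]
ARCHIVE_TIER = "tape"  # anything older than the last threshold


def target_tier(days_since_access: int) -> str:
    """Return the tier the policy assigns to data of the given idle age."""
    for max_days, tier in HSM_POLICY:
        if days_since_access <= max_days:
            return tier
    return ARCHIVE_TIER


def plan_migrations(files: dict) -> list:
    """files maps name -> (current_tier, days_since_access).

    Returns a list of (name, from_tier, to_tier) for files on the wrong tier.
    """
    moves = []
    for name, (current, idle_days) in files.items():
        wanted = target_tier(idle_days)
        if wanted != current:
            moves.append((name, current, wanted))
    return moves


if __name__ == "__main__":
    inventory = {
        "orders.db": ("FC disk", 1),
        "q1_report.pdf": ("FC disk", 95),
        "2004_audit.zip": ("SATA disk", 700),
    }
    for name, src, dst in plan_migrations(inventory):
        print(f"move {name}: {src} -> {dst}")
```

Separating the plan from the data movement keeps the policy easy to review before anything is migrated, and the same function also recalls data to a faster tier when it becomes active again.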
IBM Products with HSM Capabilities • General Parallel File System (GPFS) • IBM Content Manager for Multiplatforms • Tivoli Storage Manager HSM for Windows • Tivoli Storage Manager for Space Management (AIX) • SAN File System (SFS) • DFSMShsm (Mainframe) • High Performance Storage System (HPSS)
Storage Resource Management “Storage Resource Management (SRM) is the process of optimizing the efficiency and speed with which the available drive space is utilized in a storage area network (SAN). Functions of an SRM program include data storage, data collection, data backup, data recovery, SAN performance analysis, storage virtualization, storage provisioning, forecasting of future needs, maintenance of activity logs, user authentication, protection from hackers and worms, and management of network expansion. An SRM solution may be offered as a stand-alone product, or as part of an integrated program suite.” (Definition Courtesy of http://searchstorage.techtarget.com) IBM’s primary tool for Storage Resource Management is its TotalStorage Productivity Center suite of tools for disk, data, fabric, and replication.
Storage Virtualization • Virtualization “The act of integrating one or more (back end) services or functions with additional (front end) functionality for the purpose of providing useful abstractions. Typically virtualization hides some of the back end complexity, or adds or integrates new functionality with existing back end services. Virtualization can be nested or applied to multiple layers of a system.” (Definition Courtesy of http://www.snia.org/education/dictionary) Virtualization allows most of the complexity of a storage infrastructure to be hidden from the user.
Virtualization Makes Storage One Large Pool • Virtualization Characteristics • Makes storage configuration details invisible to the user • Improves overall manageability of the system • Aggregates isolated storage “islands” into a unified view • Facilitates greater flexibility and scalability • Optimizes utilization of storage capacity • Provides the ability to move data on-the-fly • Improves storage subsystem flexibility • Allows rapid re-allocation of storage resources • Improves performance by providing another layer of caching • May provide additional functionality for the SAN
Key Architectural Design Considerations • Resource Consumption • Storage Economics • RAID Allocation • Performance Objectives • Other Design Issues The integrity of the architectural design will determine the overall performance, stability, economic efficiency, manageability and future scalability of the system.
Power Consumption vs. Storage Capacity These disks all have very similar power consumption requirements, even though the largest one features 28 times the capacity of the smallest one. In addition, each disk will require approximately 0.4-0.6 watts of electrical power to cool each BTU of heat produced. ** National retail price of electricity per kWh from “Power, Cooling, Space Efficient Storage”, page 2, ESG white paper, Enterprise Strategy Group, July 2007.
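The power argument can be made concrete with a rough cost model. This is a sketch under stated assumptions: the drive wattage, drive sizes, and electricity price in the example are placeholders (the slide's own table values are not reproduced here), the cooling factor is read as watts of cooling power per BTU/hr of heat, and 3.412 BTU/hr per watt is the standard conversion.

```python
import math

HOURS_PER_YEAR = 24 * 365      # 8,760 hours
BTU_PER_HR_PER_WATT = 3.412    # 1 watt of load produces about 3.412 BTU/hr of heat


def drives_needed(capacity_gb: float, drive_gb: float) -> int:
    """Number of drives required to hold capacity_gb (ignores RAID and spares)."""
    return math.ceil(capacity_gb / drive_gb)


def annual_power_cost(drive_watts: float, drive_count: int, price_per_kwh: float,
                      cooling_watts_per_btu: float = 0.5) -> float:
    """Yearly electricity cost for the drives plus the cooling they require."""
    load_watts = drive_watts * drive_count
    heat_btu_per_hr = load_watts * BTU_PER_HR_PER_WATT
    cooling_watts = heat_btu_per_hr * cooling_watts_per_btu
    kwh_per_year = (load_watts + cooling_watts) * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * price_per_kwh


if __name__ == "__main__":
    # Placeholder inputs: 100 TB, 15 W per drive, $0.10 per kWh.
    for drive_gb in (146, 500):
        count = drives_needed(100_000, drive_gb)
        cost = annual_power_cost(15.0, count, 0.10)
        print(f"{drive_gb} GB drives: {count} drives, about ${cost:,.0f} per year")
```

Because per-drive wattage is nearly constant across capacities, the cost in this model scales almost directly with drive count, which is the point the slide makes.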
Comparing Storage Subsystem Power Costs Significant power savings may be realized by redistributing data to the appropriate type and size of disk drive.
Comparing Storage Subsystem Cooling Costs Additional power savings may be realized from the reduced cooling requirements provided by high capacity, lower wattage disk drives.
Comparing Storage Floor-Space Cost The DS4800 and DS4200 storage subsystems include the required number of disk expansion trays mounted in standard equipment racks.
How Do the Costs Add Up? • Traditional Approach: Everything on DS8300s • Tiered Storage Approach: DS8300, DS4800, and DS4200s with SATA Disk • Savings: $614,935 / yr.
A Look at Older Disk Subsystem Efficiency Storing 100 TB of data on more modern storage subsystems results in 50% less power consumption, a 53% reduction in BTUs per hr., and a reduction in required floor space of 38%. In addition, a DS8300 system has over 7x the throughput of the ESS800.
Why is Tiered Storage Important? • Maps data’s business value to disk characteristics • Places data on storage appropriate to its usage • Incorporates lower cost disks • Reduces resource usage (power, cooling, etc.) • Matches user access needs to storage characteristics • Capitalizes on higher capacity disk drive technology • Increases overall performance of the system
A Typical Tiered Storage Architecture • DS8300: Business Critical (High Performance / Very High Availability) • DS4800: Business Important (Good Performance / High Availability) • DS4200s with SATA Disk: Business Standard (Average Performance / Standard Availability) • TS3500 Tape Library: Reference / Historical (Near-line or Off-line) Normally a tiered storage strategy is based on data’s business value.
The Cost Impact of Adding Disk Trays Note: Calculations based on 146 GB, 10K RPM Drives
Tiered Storage Design Pros and Cons • Advantages • Lower initial purchase price • Higher capacity per square foot • Reduced power consumption • Decreased requirement for cooling • Increased equipment flexibility • Potentially a higher performance solution • Disadvantages • Inherently a more complex architecture • Greater up-front effort to design and implement • Requires advanced storage design skills and knowledge
RAID Selection Decision Drivers • Application or Database characteristics • Read/write mix • Dependency on IOPS • RAID Performance characteristics • Appropriate RAID level • Number of disks per array • Stripe size • Available bandwidth • Configuration rules and recommendations • Loss from data parity and hot sparing • Disk failure probability • RAID parity rebuild times
Loss from Mirroring, Striping, and Sparing RAID10 = Mirror plus Stripe RAID1 = Mirror Only
Loss from RAID5 Parity and Sparing Note: The second tray has one 2+P array to allow for one spare drive per two trays. Note: The second tray has one 6+P array to allow for one spare drive per two trays. Note: Each tray has one spare drive per tray.
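The capacity lost to mirroring, parity, and hot spares follows directly from the array layout. A minimal sketch, assuming equal-size drives; the 146 GB drive size comes from the disk tray slide in this deck, while the drive counts in the example are illustrative and not taken from the slide tables.

```python
def raid5_usable(drive_gb: float, data_disks: int, arrays: int, spares: int = 0):
    """Return (usable_gb, raw_gb, overhead_fraction) for N+P RAID5 arrays.

    Each array holds data_disks data drives plus one drive's worth of parity.
    """
    total_drives = arrays * (data_disks + 1) + spares
    raw_gb = total_drives * drive_gb
    usable_gb = arrays * data_disks * drive_gb
    return usable_gb, raw_gb, 1 - usable_gb / raw_gb


def raid10_usable(drive_gb: float, drives: int, spares: int = 0):
    """Return (usable_gb, raw_gb, overhead_fraction) for mirrored-and-striped drives."""
    mirrored = drives - spares
    if mirrored <= 0 or mirrored % 2:
        raise ValueError("RAID10 needs a positive, even number of non-spare drives")
    raw_gb = drives * drive_gb
    usable_gb = (mirrored // 2) * drive_gb  # half of each mirrored pair is a copy
    return usable_gb, raw_gb, 1 - usable_gb / raw_gb


if __name__ == "__main__":
    # Two 6+P arrays plus one hot spare: 15 drives in total.
    usable, raw, lost = raid5_usable(146, data_disks=6, arrays=2, spares=1)
    print(f"RAID5 6+P: {usable} GB usable of {raw} GB raw ({lost:.0%} lost)")
    # Sixteen drives as RAID10 with two hot spares.
    usable, raw, lost = raid10_usable(146, drives=16, spares=2)
    print(f"RAID10:    {usable} GB usable of {raw} GB raw ({lost:.0%} lost)")
```

Wider RAID5 arrays lose a smaller fraction to parity but take longer to rebuild, which is why array width appears among the RAID selection decision drivers.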
Other Architectural Considerations • Compatibility • High availability • Architectural robustness • Flexibility and scalability • Stability of the technology • Vendor’s financial standing • Well defined product line roadmap • Support for industry standards
Storage Subsystem Performance Drivers • Business objectives and user expectations • Applications and database characteristics • Server characteristics • SAN fabric characteristics • Storage controller characteristics • Caching characteristics • Configuration characteristics • Disk latency characteristics "We can't solve problems by using the same kind of thinking we used when we created them." Albert Einstein
Storage Performance Enhancers • Data Mover – Reassigning data transfer tasks to a specialized “engine” reduces the workload on the host processing system. • Search Engines – Systems dedicated to executing searches in vast amounts of stored data to satisfy specific requests. • Directory Services – Store and organize information about resources and data objects. • High Speed Interconnections – “Behind the scenes” networks dedicated to the transfer of large amounts of data. • Autonomic Computing – Systems that are able to reconfigure themselves under varying and possibly unpredictable conditions.