This seminar discusses the problem of client-side caching in data delivery and explores the use of storage virtualization, specifically FreeLoader Desktop Storage Cache, as a solution. It addresses the challenges of wide-area data movement, latency tolerance, and limited storage options, and proposes a virtual cache approach using desktop storage scavenging. The seminar also compares FreeLoader with other storage systems and explores client access pattern aware striping for optimizing cache access.
Optimizing End-User Data Delivery Using Storage Virtualization
Sudharshan Vazhkudai, Oak Ridge National Laboratory
Ohio State University Systems Group Seminar
October 20th, 2006, Columbus, Ohio
Outline • Problem space: Client-side caching • Storage Virtualization: • FreeLoader Desktop Storage Cache • A Virtual cache: Prefix caching • End on a funny note!!
Problem Domain • Data Deluge • Experimental facilities: SNS, LHC (PBs/yr) • Observatories: sky surveys, world-wide telescopes • Simulations from NLCF end-stations • Internet archives: NIH GenBank (serves 100 gigabases of sequence data) • Typical user access traits on large scientific data • Download remote datasets using favorite tools • FTP, GridFTP, hsi, wget • Shared interest among groups of researchers • A bioinformatics group collectively analyzes and visualizes a sequence database for a few days: locality of interest! • Original datasets are often discarded after interest dissipates
So, what’s the problem with this story? • Wide-area data movement is full of pitfalls • Server bottlenecks, BW/latency fluctuations • GridFTP-like tuned tools not widely available • Popular Internet repositories still served through modest transfer tools! • User applications are often latency intolerant • e.g., real-time viz rendering of a TerraServer map from Microsoft on ORNL’s tiled display! • Why can’t we address this with the current storage landscape? • Shared storage: Limited quotas • Dedicated storage: SAN storage is a non-trivial expense! (4 TB disk array ~ $40K) • Local storage: Usually not enough for such large datasets • Archive in mass storage for future accesses: High latency • Upshot • Retrieval rates significantly lower than local I/O or LAN throughput
Is there a silver lining at all? (Desktop Traits) • Desktop capabilities better than ever before • The ratio of used to available storage is low in academic and industry settings, i.e., plenty of free desktop space • Increasing numbers of workstations online most of the time • At ORNL-CSMD, ~600 machines are estimated to be online at any given time • At NCSU, > 90% availability across 500 machines • Well-connected, secure LAN settings • A high-speed LAN connection can stream data faster than local disk I/O
Storage Virtualization? • Can we use novel storage abstractions to provide: • More storage than locally available • Better performance than local or remote I/O • A seamless architecture for accessing and storing transient data
Desktop Storage Scavenging as a means to virtualize I/O access • FreeLoader • Imagine Condor for storage • Harness the collective storage potential of desktop workstations, much like harnessing idle CPU cycles • Increased throughput due to striping • Split large datasets into pieces (morsels) and stripe them across desktops • Scientific data trends • Usually write-once-read-many • Remote copy held elsewhere • Primarily sequential accesses • Data trends + LAN/desktop traits + user access patterns make collaborative caches built on storage scavenging a viable alternative!
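A minimal sketch of the striping idea: split a dataset into fixed-size morsels and place them round-robin across donor desktops. The morsel size and donor names here are illustrative assumptions, not FreeLoader's actual configuration.

```python
# Round-robin striping of a dataset into fixed-size "morsels" across donors.
# Morsel size and donor names are illustrative assumptions.

MORSEL_SIZE = 1 << 20  # 1 MB morsels (assumed)

def stripe_dataset(dataset_size, donors):
    """Return a mapping donor -> list of (morsel_index, offset, length)."""
    placement = {d: [] for d in donors}
    num_morsels = (dataset_size + MORSEL_SIZE - 1) // MORSEL_SIZE
    for i in range(num_morsels):
        offset = i * MORSEL_SIZE
        length = min(MORSEL_SIZE, dataset_size - offset)
        donor = donors[i % len(donors)]  # round-robin placement
        placement[donor].append((i, offset, length))
    return placement

# Example: a 10 MB dataset striped over three desktop donors.
print(stripe_dataset(10 * (1 << 20), ["desk01", "desk02", "desk03"]))
```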
Old wine in a new bottle…? • Key strategies derived from “best practices” across a broad range of storage paradigms… • Desktop Storage Scavenging from P2P systems • Striping, parallel I/O from parallel file systems • Caching from cooperative Web caching • And, applied to scientific data management for • Access locality, aggregating I/O, network bandwidth and data sharing • Posing new challenges and opportunities: heterogeneity, striping, volatility, donor impact, cache management and availability
FreeLoader Architecture • Lightweight UDP • Scavenger device: metadata bitmaps, morsel organization • Morsel service layer • Monitoring and Impact control • Global free space management • Metadata management • Soft-state registrations • Data placement • Cache management • Profiling
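One way to picture the scavenger's per-dataset metadata bitmap mentioned above: a bit per morsel recording whether that morsel is held locally. This is only a sketch of the idea, not FreeLoader's actual metadata format.

```python
# Illustrative per-dataset morsel bitmap, as a scavenger might keep it:
# one bit per morsel, set when the morsel is stored locally.
# This sketches the idea only; it is not FreeLoader's on-disk format.

class MorselBitmap:
    def __init__(self, num_morsels):
        self.bits = bytearray((num_morsels + 7) // 8)

    def mark_present(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def is_present(self, i):
        return bool(self.bits[i // 8] & (1 << (i % 8)))

bm = MorselBitmap(1024)
bm.mark_present(42)
assert bm.is_present(42) and not bm.is_present(43)
```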
Testbed and experiment setup: FreeLoader installed in a user’s HPC setting, with GridFTP access to NFS and PVFS, hsi access to HPSS (cold data from tapes, hot data from disk caches), and wget access to an Internet archive.
Optimizing access to the cache: Client Access-Pattern Aware Striping • The uploading client is likely to access the dataset most frequently • So, let’s optimize data placement for that client! • Overlap network I/O with local I/O • What is the optimal local:remote data ratio? • Model
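A minimal sketch of one way to reason about the local:remote split, assuming local disk reads and network reads from remote donors are fully overlapped and the split is balanced when both streams finish together. The rates below are made-up, not measured FreeLoader numbers, and this is not necessarily the seminar's exact model.

```python
# Balance overlapped local and remote retrieval times:
#     f / R_local = (1 - f) / R_net   =>   f = R_local / (R_local + R_net)
# Rates below are illustrative assumptions.

def local_fraction(r_local_mbps, r_net_mbps):
    """Fraction of a dataset to place on the uploading client's own disk."""
    return r_local_mbps / (r_local_mbps + r_net_mbps)

# Example: 60 MB/s local disk, 90 MB/s aggregate network throughput.
print(f"local fraction = {local_fraction(60.0, 90.0):.2f}")  # 0.40
```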
Philosophizing… • What the scavenged storage “is not”: • Not a file system, not a replacement to high-end storage • Not intended for wide-area resource integration • What it “is”: • Low-cost, best-effort storage cache for scientific data sources • Intended to facilitate • Transient access to large, read-only datasets • Data sharing within administrative domain • To be used in conjunction with higher-end storage systems
Towards a “virtual cache” • Scientific data caches typically host complete datasets • Not always feasible in our environment since: • Desktop workstations can fail or space contributions can be withdrawn, leaving partial datasets • Not enough space in the cache to host a new dataset in its entirety • Cache evictions can leave partial copies of datasets • Can we host partial copies of datasets and yet serve client accesses to the entire dataset? • Analogy: buffer cache is to disk as FreeLoader is to the remote data source
The Prefix Caching Problem: Impedance Matching on Steroids!! • HTTP prefix caching • Multimedia, streaming data delivery • BitTorrent P2P system: leechers can download and yet serve • Benefits • Bootstrapping the download process • Store more datasets • Allows for efficient cache management • Ah, those scientific data trends again (how convenient…) • Immutable data, remote source copy, primarily sequential accesses • Challenges • Clients should be oblivious to a dataset being only partially available • Performance hit? • How much of the prefix of a dataset to cache? • So that client accesses can progress seamlessly • Online patching issues • Mismatch between client access and remote patching I/O • Wide-area download vagaries
Virtual Cache Architecture • Capability-based resource aggregation • Persistent-storage and BW-only donors • Client serving: parallel get • Remote patching using URIs • Better cache management • Stripe a dataset entirely when space is available • When eviction is needed, stripe only a prefix of the dataset • Victims chosen by LRU: • Evict chunks from the tail until only a prefix remains • Entire datasets are evicted only after all such tails are evicted
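A sketch of the tail-first eviction idea: walk datasets in LRU order and drop chunks from the tail of each, preserving a prefix; whole datasets would only be removed in a later pass once tails are gone. The chunk size and prefix bound are assumptions for illustration.

```python
# Tail-first LRU eviction sketch; sizes and bounds are assumed values.
from collections import OrderedDict

CHUNK = 1 << 20  # 1 MB chunks (assumed)

def evict(datasets, need_bytes, prefix_chunks):
    """datasets: OrderedDict name -> list of chunk ids, least recently used first.
    Frees at least need_bytes by trimming tails down to prefix_chunks chunks."""
    freed = 0
    for name, chunks in datasets.items():            # LRU datasets first
        while freed < need_bytes and len(chunks) > prefix_chunks:
            chunks.pop()                             # drop a chunk from the tail
            freed += CHUNK
        if freed >= need_bytes:
            break
    return freed

# Example: free 3 MB from two cached datasets, keeping 2-chunk prefixes.
cache = OrderedDict(old=list(range(4)), recent=list(range(6)))
print(evict(cache, 3 * CHUNK, prefix_chunks=2), cache)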
Prefix Size Prediction • Goal: eliminate client-perceived delay in data access • What is the optimal prefix size to hide the cost of suffix patching? • Prefix size depends on: • Dataset size, S • In-cache data access rate by the client, R_client • Suffix patching rate, R_patch • Initial latency in suffix patching, L • The client access rate dictates the time available to patch: S/R_client = L + (S – S_prefix)/R_patch • Thus, S_prefix = S(1 – R_patch/R_client) + L·R_patch
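A short worked example plugging numbers into the slide's formula; the rates and latency below are made-up values for illustration only.

```python
# S_prefix = S * (1 - R_patch / R_client) + L * R_patch
# Illustrative numbers, not results from the seminar.

def prefix_size(S, r_client, r_patch, latency):
    """Smallest prefix (same units as S) that hides the cost of suffix patching."""
    return max(0.0, min(S, S * (1.0 - r_patch / r_client) + latency * r_patch))

# Example: 10 GB dataset, client reads the cache at 100 MB/s, the suffix
# patches in at 40 MB/s, with 2 s of patching start-up latency.
S = 10 * 1024.0  # MB
print(prefix_size(S, 100.0, 40.0, 2.0), "MB")  # ~6224 MB, i.e., roughly a 61% prefix
```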
Collective Download • Why? • Wide-area transfer reasons: • Storage systems and protocols for HEC are tuned for bulk transfers (GridFTP, HSI) • Wide-area transfer pitfalls: high latency, connection-establishment cost • Client’s local-area cache access reasons: • Client accesses to the cache use a smaller stripe size (e.g., 1 MB chunks in FreeLoader) • Finer granularity for better client access rates • Can we borrow from collective I/O in parallel I/O?
Collective Download Implementation • Patching nodes perform bulk remote I/O; ~256 MB per request • Reducing repeated authentication costs per dataset • Automated interactive session with “Expect” for single sign-on • FreeLoader patching framework instrumented with Expect • Protocol needs to allow sessions (GridFTP, HSI) • Need to reconcile the mismatch between the client access stripe size and the bulk remote I/O request size • Shuffling • The p patching nodes redistribute the downloaded chunks among themselves according to the client’s striping policy • Redistribution enables round-robin client access • Each patching node redistributes (p – 1)/p of the data it downloads • Shuffling is done in memory, which accommodates BW-only donors • Thus, client serving, collective download and shuffling are all overlapped
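A sketch of the shuffle step under the assumptions stated on the slide: each patching node fetches a large contiguous region via bulk remote I/O, then forwards each client-sized chunk to the node that owns it under round-robin striping, keeping only its own 1/p share. The routing rule and exact sizes are illustrative assumptions.

```python
# Shuffle planning sketch; sizes follow the slide (~256 MB bulk requests,
# 1 MB client-facing chunks), the round-robin ownership rule is assumed.

CHUNK = 1 << 20          # client-facing stripe size (1 MB)
BULK = 256 * (1 << 20)   # bulk remote-I/O request size (~256 MB)

def shuffle_plan(region_offset, region_len, p, my_rank):
    """For one downloaded region, decide which chunks to keep locally and
    which to forward; returns (kept_chunk_ids, {dest_rank: [chunk_ids]})."""
    keep, forward = [], {}
    first = region_offset // CHUNK
    last = (region_offset + region_len - 1) // CHUNK
    for chunk_id in range(first, last + 1):
        owner = chunk_id % p                 # round-robin striping policy
        if owner == my_rank:
            keep.append(chunk_id)
        else:
            forward.setdefault(owner, []).append(chunk_id)
    return keep, forward

# Example: node 1 of 4 downloaded a 256 MB region starting at offset 0.
kept, fwd = shuffle_plan(0, BULK, p=4, my_rank=1)
print(len(kept), "chunks kept,", sum(len(v) for v in fwd.values()), "forwarded")
# 64 kept, 192 forwarded: each node ships (p - 1)/p of what it downloads.
```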
Testbed and Experiment setup • UberFTP stateful client to GridFTP servers at TeraGrid-PSC and TeraGrid-ORNL • HSI access to HPSS • Cold data from tapes • FreeLoader patching framework deployed in this setting
Impact of Prefix Caching on Cache Hit Rate • Tera-ORNL sees improvements along the 0.2 and 0.4 prefix-ratio curves (308% and 176% for 20% and 40% prefix ratios, respectively) • Tera-PSC sees up to a 76% improvement in hit rate with an 80% prefix ratio
Let me philosophize again… • Novel storage abstractions as a means to: • Provide performance impedance matching • Overlap remote I/O, cache I/O and local I/O into a seamless “data pathway” (an intermediate data cache exploits this area) • Provide rich resource aggregation models • Provide a low-cost, best-effort architecture for “transient” data • A combination of best practices from parallel I/O, P2P scavenging, cooperative caching, and HTTP multimedia streaming, brought to bear on “scientific data caching”
Let me advertise… • http://www.csm.ornl.gov/~vazhkuda/Storage.html • Email: vazhkudaiss@ornl.gov • Collaborator: Xiaosong Ma (NCSU) • Funding: DOE ORNL LDRD (Terascale & Petascale initiatives) • Interested in joining our team? • Full time positions and summer internships available
More slides • Some performance numbers • Impact studies