FACIT Tools For Distributed Collections. A breakout session at the NDIIPP Partners meeting, July 09, 2008. Terry Moore, University of Tennessee; Scott Smith/Justin Mathena, National Geospatial Digital Archive (UCSB); Santiago de Ledesma, ACCRE, Vanderbilt.
Overview • FACIT project • Basic idea • Application context: NGDA • FACIT Technology • Logistical Networking “inside” • LoDN • L-Store • FACIT technology and the problem of long-term preservation of bits
What is FACIT • FACIT – Federated Archive Cyberinfrastructure Testbed • Goal of FACIT: Create a testbed to experiment with a different approach to federated resource sharing for access and preservation • FACIT partners: • National Geospatial Digital Archive (NGDA: UCSB and Stanford); the NGDA is an NDIIPP partner • Logistical Networking (UTK) – network storage tech • REDDnet (Vanderbilt) – NSF-funded infrastructure using LN for data-intensive collaboration
NGDA Overview • National Geospatial Digital Archive • Focus: long-term archiving (100 year problem) • Emphasis on geospatial data • Policy level archive - not architecture specific • Based on 20+ years of experience @ UCSB
Pertinent Details • Preservation through Simplicity • Key component of architecture is the Data Model: all other parts considered disposable • Objects maintain self-descriptive 'manifests' • Archive organization and object structure both based on file systems • Data Model allows easy tie-in to L.N.
NGDA and Logistical Networking • Logistical Networking as an abstracted storage layer • Increase download speeds • Logistical Networking used as a Tool for custodianship • Logistical Networking as content transfer solution • Logistical Networking as a backup for at-risk data (temporary stewardship)
The Future of NGDA and LN • Initial trials have met with mixed success • Lots of mixed size objects in test sets (30,000) • Upload set data size of ~1TB • “Moderate” Download Speeds to LC over WAN • Roughly 1 day per TB download • The near future • Middleware to bridge search client & LN cloud • Adjustments to handle mixed data sets
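As a rough check on the download figure above, one day per terabyte can be converted into a sustained transfer rate. This is a minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a full 24-hour day; the function name is illustrative only.

```python
# Rough conversion of "1 day per TB" into a sustained transfer rate.
# Assumptions: decimal units (1 TB = 10**12 bytes), 24-hour day.

SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(terabytes: float, days: float) -> float:
    """Average rate in megabytes per second (1 MB = 10**6 bytes)."""
    total_bytes = terabytes * 10**12
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


rate = sustained_rate_mb_per_s(1, 1)
print(f"{rate:.1f} MB/s, about {rate * 8:.0f} Mb/s")
```

Under these assumptions the result is about 11.6 MB/s, or roughly 93 Mb/s.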
Basic elements of the LN stack • IBP: highly generic, “best effort” protocol for using storage • Generic -> doesn’t restrict applications • “Best effort” -> low burden on providers • Easy to port and deploy • exNode: metadata container for bit-level structure • Modeled on Unix inode • Bit-level structure, control keys, … • XML encoded
Sample exNodes • Partial exNode encoding • Crossing administrative domains, sharing resources [Figure: exNodes A, B, and C, with an axis marked 0, 100, 200, 300, mapped across the network to REDDnet depots at Tennessee, Vanderbilt, UCSB, and Stanford]
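The exNode idea (an inode-like container that maps byte ranges of a file onto storage allocations at depots, encoded as XML) can be sketched as follows. This is an illustration under stated assumptions, not the real exNode schema: the class names, the XML element and attribute names, and the `read_key` strings are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Mapping:
    offset: int    # first byte of the extent within the file
    length: int    # number of bytes in the extent
    depot: str     # hypothetical depot label, e.g. "UCSB"
    read_key: str  # stand-in for a control key


@dataclass
class ExNode:
    name: str
    mappings: list = field(default_factory=list)

    def replicas_at(self, position: int) -> list:
        """Return the depots that hold the byte at `position`."""
        return sorted({m.depot for m in self.mappings
                       if m.offset <= position < m.offset + m.length})

    def to_xml(self) -> str:
        """Serialize using made-up element names (not the real schema)."""
        root = ET.Element("exnode", name=self.name)
        for m in self.mappings:
            ET.SubElement(root, "mapping", offset=str(m.offset),
                          length=str(m.length), depot=m.depot,
                          read_key=m.read_key)
        return ET.tostring(root, encoding="unicode")


# A 300-byte file whose extents are spread over several sites.
node = ExNode("A", [
    Mapping(0, 100, "Tennessee", "k1"),
    Mapping(0, 100, "UCSB", "k2"),
    Mapping(100, 100, "Vanderbilt", "k3"),
    Mapping(200, 100, "Stanford", "k4"),
])
print(node.replicas_at(50))
print(node.to_xml())
```

Because the file's layout lives entirely in the metadata, one file can span depots in different administrative domains without the depots knowing about each other.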
New federation members? • Add new depots • Copy the data • Rewrite the exNodes [Figure: REDDnet depots at Tennessee, Vanderbilt, UCSB, and Stanford joined over the network by a new member, LoC]
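The three steps for admitting a new member (add depots, copy the data, rewrite the exNodes) can be sketched with in-memory stand-ins. Everything here is hypothetical: depots are plain dictionaries, the exNode is a list of extents, and `add_member` is an illustrative name rather than a tool from the LN software.

```python
# In-memory stand-ins: depot name -> {allocation key -> bytes}.
depots = {
    "UCSB": {"a1": b"geo"},
    "Tennessee": {"b1": b"data"},
}

# exNode stand-in: extents in file order, each with its replicas.
exnode = [
    {"offset": 0, "length": 3, "replicas": [("UCSB", "a1")]},
    {"offset": 3, "length": 4, "replicas": [("Tennessee", "b1")]},
]


def add_member(depots, exnode, new_depot):
    """Give a new member a full replica and record it in the exNode."""
    depots.setdefault(new_depot, {})                 # 1. add the new depot
    for i, extent in enumerate(exnode):
        src_depot, src_key = extent["replicas"][0]
        new_key = f"{new_depot.lower()}-{i}"
        depots[new_depot][new_key] = depots[src_depot][src_key]  # 2. copy
        extent["replicas"].append((new_depot, new_key))          # 3. rewrite


add_member(depots, exnode, "LoC")
for extent in exnode:
    print(extent["offset"], extent["replicas"])
```

In this sketch existing replicas are left untouched, so readers holding the old exNode keep working while the rewritten one also points at the new member.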
Basic elements of the LN stack (repeated) • LoDN and L-Store build on the storage protocol and metadata container described earlier
LoDN - Network File Manager • Stores files into the Logistical Network using Java upload/download tools • Manages exNode maintenance and replication • Provides account (single user or group) as well as “world” permissions
What is L-Store? • Goal of L-Store: Use LN to provide a generic, high-performance, wide-area-capable storage virtualization service • Provides a file system interface to (globally) distributed IBP depots (e.g. currently uses WebDAV and CIFS) • Flexible role-based AuthZ (work in progress)
L-Store and Logistical Networking • L-Store adds a name space on top of the exNode layer • Allows for LN operations on the name space • Uses LN’s parallelism for high performance and reliability, e.g. parallel transfers to improve performance (3 GB/s during the SC06 demo) [Figure: transfer-rate plot labeled 3 GB/s and 30 Mins]
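The combination of parallel transfers and replica fallback can be sketched with a thread pool. This is a minimal illustration that assumes in-memory depots; `DEPOTS`, `EXTENTS`, `fetch_extent`, and `parallel_download` are hypothetical names, and a real client would issue network reads instead of dictionary lookups.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for depots: depot name -> {allocation key -> bytes}.
DEPOTS = {
    "Tennessee": {"t1": b"test"},
    "Vanderbilt": {"v1": b"test", "v2": b"bed"},
    "UCSB": {"u0": b"FACIT "},
}

# Extents in file order; each lists replicas to try in turn.
EXTENTS = [
    [("Stanford", "s0"), ("UCSB", "u0")],  # first replica is unavailable
    [("Tennessee", "t1"), ("Vanderbilt", "v1")],
    [("Vanderbilt", "v2")],
]


def fetch_extent(replicas):
    """Return the extent's bytes from the first replica that answers."""
    for depot, key in replicas:
        try:
            return DEPOTS[depot][key]
        except KeyError:
            continue  # depot or allocation missing: try the next replica
    raise IOError("no replica available for extent")


def parallel_download(extents, workers=4):
    """Fetch all extents concurrently and reassemble them in file order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(fetch_extent, extents))


print(parallel_download(EXTENTS))
```

`pool.map` returns results in input order, so the extents can be joined directly even though they are fetched concurrently.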
L-Store scalability • L-Store uses a Distributed Hash Table to store all its “structural” metadata (i.e. metadata about how the bits are stored) • DHTs provide a highly scalable way of storing metadata • Metadata and data can be scaled independently
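The slide does not say which DHT design L-Store uses. As an illustration of why hash-based placement scales, here is a sketch of consistent hashing, a common DHT building block. The class, the node names (`md1` and so on), and the key names are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping metadata keys to metadata nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point after the key."""
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[i][1]


keys = [f"file{i}" for i in range(1000)]
small = HashRing(["md1", "md2", "md3"])
big = HashRing(["md1", "md2", "md3", "md4"])
moved = [k for k in keys if small.node_for(k) != big.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding md4")
```

Adding a node relocates only the keys that the new node takes over, which is why metadata capacity can grow without reshuffling everything.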
Storage Management • Nevoa Networks (Brazilian company based on LN) provides management of remote/distributed storage via StorCore • Provides resource discovery for L-Store • Allows depots to be grouped into logical units • Can create dynamic logical units based on queries
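A query-defined logical unit can be pictured as a filter that is re-evaluated over the current depot inventory. StorCore's actual query interface is not described in these slides, so the `Depot` fields, the `logical_unit` function, and the sample inventory below are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Depot:
    name: str
    site: str
    free_tb: float
    online: bool


def logical_unit(depots, query):
    """Names of the depots that currently satisfy the query predicate."""
    return [d.name for d in depots if query(d)]


inventory = [
    Depot("depot-1", "Vanderbilt", 4.0, True),
    Depot("depot-2", "Vanderbilt", 0.5, True),
    Depot("depot-3", "UCSB", 6.0, False),
    Depot("depot-4", "Tennessee", 8.0, True),
]

# A dynamic unit: every online depot with at least 1 TB free.
print(logical_unit(inventory, lambda d: d.online and d.free_tb >= 1.0))
```

Since membership is computed from the query rather than stored, the unit changes as depots come online, go offline, or fill up.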
L-Store and FACIT FACIT drives L-Store development: • L-Sync: An rsync-like tool that uses L-Store as intermediate storage. • Extended metadata attributes. • A flexible policy framework.
Questions about • NGDA? • Logistical Networking? • LoDN? • L-Store?
Discussion: Preservation’s storage problem • Long-term preservation is a relay: Repeated migrations across storage media/systems, archive systems, institutions • Begin with the bits: Storage technology changes every 3-5 yrs • During some periods of time data will be in “steady state” • But during a century, there will be 20-30 handoffs! • How can we create a “handoff process” that can be sustained for a century or more? Can we create a “technical” process or will a social process have to do? • Complicating factor: We’re drowning in data
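The handoff estimate follows from dividing the preservation horizon by the technology refresh cycle. A minimal sketch of that arithmetic, assuming exactly one migration per refresh cycle (the function name is illustrative):

```python
# Count storage handoffs over a preservation horizon, assuming one
# migration per technology refresh cycle.

def handoffs(horizon_years: int, refresh_years: int) -> int:
    return horizon_years // refresh_years


for refresh in (3, 4, 5):
    print(f"refresh every {refresh} yrs: {handoffs(100, refresh)} handoffs")
```

For a 100-year horizon this gives 33, 25, and 20 handoffs, in line with the 20-30 figure on the slide.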
Framing The Issue Globally • World data expected to total 2 zettabytes by 2011 (IDC Whitepaper) “As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.”
What does experience show? SDSC’s archive shows exponential growth w/ a consistent doubling period of ~15 months
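To see what a 15-month doubling period implies, the growth factor after a given number of years is 2 raised to the number of doubling periods elapsed. This sketch assumes the doubling period stays constant, which is an extrapolation and not a claim from the SDSC data.

```python
# Growth implied by a fixed doubling period (assumed constant over time).

def growth_factor(years: float, doubling_months: float = 15) -> float:
    return 2 ** (years * 12 / doubling_months)


for years in (5, 10):
    print(f"after {years} years: {growth_factor(years):,.0f}x")
```

Under that assumption an archive grows 16-fold in 5 years and 256-fold in 10, so each handoff in the relay moves far more data than the one before it.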
If preservation is a relay, then … • The key preservation problem at the bit layer is … • Choice 1: steady state data storage • Choice 2: copying data to different systems Impression: De facto choice is #1 • When you have to hand off data, is it sufficient to have • Choice 1: A social solution • Choice 2: A technical solution Impression: De facto choice is #1 Contention: Neither of these de facto choices is adequate