UT Research Data Repository Chris Jordan UT Research Cyberinfrastructure Storage Committee Chair
Outline • UTRC Introduction/Current Status • Research Data Requirements • Current TACC storage infrastructure (Corral) • New UTRC capabilities • External services and partnerships • Research and UTRC future
UT Research Cyberinfrastructure • Collaborative effort initiated by Dr. Ken Shine, Vice Chancellor for Health • Jay Boisseau (TACC), Brian Herman (UTHSCSA) co-chairs • Assessment of research CI needs across system campuses • Data Storage emerged as highest priority/biggest unmet need
UTRC Proposal • Approved by UT Regents November 2010 • Expand Lonestar 4 for HPC needs • Establish dedicated 10 Gbps research network to all campuses • Develop replicated, 5PB Research Data Repository
Storage Committee Activities • Proposed iterative approach with pilot deployment in late 2011 • 1st half of 2011 spent on requirements and architecture development • Released RFP in June • Vendor selected in August • Installation in October • Initial users ~December
Sidebar: Why “The Cloud” is not the answer • Cloud storage costs = $1000s/TB/year • Often not as reliable as advertised (Google, Amazon have both had major issues) • Restrictive interfaces, lack of high-performance access • Issues with institutional control, security integration, etc
Pilot UTRDR Deployment • 5PB Raw storage in each of two installations • Main installation at TACC added to existing data infrastructure • Mirror installation at Arlington for replication • High level of redundancy within each installation • Redundancy extends from power supplies to storage controllers and servers
Research Data Requirements • Persistent Storage is just the beginning • High reliability/availability is key • Complex, evolving security needs • Importance of Collaboration • Data Applications and Services • Data Management and Analysis • Also, it has to be cheap (or free)
Research Data Security • HIPAA Compliance is a major goal of the UTRDR effort • But HIPAA is just the beginning • Intellectual property and research confidentiality issues are more fine-grained • Long-term issues of availability/usability • Tiers of access, change over time
Example Application Areas • Biology • Biodiversity (natural history collections) • Phylogenetics • Health Sciences • Medical Imaging • High-throughput sequencing • Social Sciences • Economic and social analysis
TACC Corral Architecture • Emphasis on large-scale storage, highly flexible service infrastructure • Fast networks and heterogeneous systems = malleable service and storage platform • Allows integration of UTRC hardware into an existing infrastructure • Near-transparent migration for existing users • Expansion improves reliability and availability
Corral Hardware and Services • 1.2 Petabytes DataDirect SATA Disk • 16 Dell Servers • ~300 TB of heterogeneous disks and servers • High-Performance Parallel File System, multiple databases, iRODS data management, replication to tape archive • Multiple levels of access control • Supports almost any imaginable data need
iRODS at TACC • Distributed/Replicated data management • Corral, Ranch, and offsite storage systems • Extensible metadata support • Policy/Rule-based automation and enforcement • Used for sophisticated data management needs • Provides wide variety of interfaces
Current Corral Usage • >30 Data Allocations & Collections • 350 Users at TACC and UT • >500 External users accessing collections • >500TB Research and Reference Data • Data of all types and disciplines: • Plant specimens and ‘omics, MRI, GIS, Simulations, Fish and Pottery, Economics and Medicine
Added Capabilities w/ UTRDR • Synchronous replication • Very high availability (weather, comet strikes) • Tiers of storage and data management • Huge performance boost (>80GB/sec) • Accessibility from all UT System campuses • HIPAA Compliance
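To make "synchronous replication" concrete, here is a minimal sketch, assuming a toy in-memory model rather than the actual UTRDR storage software. The `Site` and `SyncReplicatedStore` classes are hypothetical names introduced for illustration. The point is only that a write is acknowledged after both installations hold the data, so either site can serve reads if the other is lost.

```python
_MISSING = object()  # sentinel: "key had no previous value"


class Site:
    """One storage installation, modeled as an in-memory key/value store."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.online = True

    def store(self, key, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.objects[key] = data


class SyncReplicatedStore:
    """Acknowledge a write only once both sites have stored it."""

    def __init__(self, primary, mirror):
        self.primary = primary
        self.mirror = mirror

    def write(self, key, data):
        previous = self.primary.objects.get(key, _MISSING)
        self.primary.store(key, data)
        try:
            self.mirror.store(key, data)
        except ConnectionError:
            # Mirror failed: restore the primary so the sites never diverge.
            if previous is _MISSING:
                del self.primary.objects[key]
            else:
                self.primary.objects[key] = previous
            raise
        return True  # both copies exist at this point

    def read(self, key):
        # Serve from whichever site is reachable and holds the object.
        for site in (self.primary, self.mirror):
            if site.online and key in site.objects:
                return site.objects[key]
        raise KeyError(key)


tacc, arlington = Site("TACC"), Site("Arlington")
store = SyncReplicatedStore(tacc, arlington)
store.write("scan-001", b"mri-bytes")
tacc.online = False            # simulate losing the main installation
data = store.read("scan-001")  # still served from the mirror
```

The trade-off in this sketch is that a write fails outright when the mirror is unreachable; an asynchronous design would accept the write and catch up later, at the cost of a window in which only one copy exists.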
UTRDR Pilot Access • Accelerated access for early adopters • Allows us to shake out bugs, assess readiness for production • Helps to develop requirements present and future • Research network performance assessment • Expect to open to all UT System researchers early 2012
UTRDR Long-term sustainability • After pilot phase, storage will be free to all PIs up to some small limit (5TB?) • Additional storage will be available for cost-recovery fee per TB • Currently only trying to recoup costs on an annual basis • Long-term preservation costs are TBD but are of major interest
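The pricing structure on this slide (a free allocation, then a per-TB cost-recovery fee) can be sketched as simple arithmetic. The 5 TB free limit is the tentative figure from the slide, and the per-TB rate below is a purely hypothetical placeholder, since no rate is given. The function name is also an assumption made for illustration.

```python
def annual_storage_fee(tb_used, free_tb=5.0, rate_per_tb=100.0):
    """Annual fee for one PI: usage above the free allocation is billed per TB.

    free_tb:     tentative free limit from the slide ("5TB?")
    rate_per_tb: hypothetical placeholder, not a published UTRDR rate
    """
    if tb_used < 0:
        raise ValueError("tb_used must be non-negative")
    billable_tb = max(0.0, tb_used - free_tb)
    return billable_tb * rate_per_tb


# A PI within the free allocation pays nothing; one above it pays
# only for the excess.
small_project = annual_storage_fee(3)
large_project = annual_storage_fee(25, rate_per_tb=100.0)
```

A fixed per-TB figure of this kind is what lets a researcher write a predictable storage line item into a grant budget.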
Fee-based Research Storage • 2 Major types of service: • Simple storage (iSCSI, SCP/FTP) based on per-TB/year costs • Application services (databases, web applications, data management, etc) • Provides fixed, relatively low costs that can be written into grant proposals • Can include both disk and tape + offsite storage • Long-term model for UTRDR
Existing/Upcoming Partnerships • University of Alaska • UC Berkeley • University of North Texas Libraries • Texas Digital Library • University of Florida • Indiana University • NSF XSEDE – 15 Institutions
UTRC Plan 2012-2013 • Initial production in early 2012 • Design assessment and adjustment based on initial experiences • Expansion proposal mid-2012 • Significant expansion likely late 2012/early 2013 • Ongoing assessment and design adjustments integral to the process
TACC Storage Research • Data upload and ingest processes • Storage reliability and management • Data Integrity/Long-term planning • Automated data management applications • Wide-area storage and replication efforts in the NSF XSEDE project
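One common building block for data integrity work is a fixity check: record a checksum at ingest, then periodically recompute it for every replica and flag mismatches. The sketch below is a generic illustration using Python's standard `hashlib`, not a description of TACC's actual tooling. The function names are hypothetical, and the replica labels simply reuse system names mentioned earlier in the deck.

```python
import hashlib


def sha256_digest(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt_replicas(expected_digest, replicas):
    """Return the names of replicas whose content no longer matches.

    replicas maps a replica name to the bytes currently stored there.
    """
    return sorted(
        name
        for name, data in replicas.items()
        if sha256_digest(data) != expected_digest
    )


original = b"specimen-image-bytes"
recorded = sha256_digest(original)  # stored as metadata at ingest time

replicas = {
    "corral": original,
    "ranch": original,
    "offsite": b"specimen-image-bytez",  # simulated silent corruption
}
damaged = find_corrupt_replicas(recorded, replicas)
```

In a real system the replicas would be read from storage in streamed chunks rather than held in memory, and a flagged copy would be repaired from one that still verifies.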
Acknowledgements • Dr. Ken Shine – UT System • Dr. Patricia Hurn – UT System • Jay Boisseau and Brian Herman • Jerry York – UTHSCSA • UTRC Storage Committee • Brian Grimm, Kevin Granhold, Huapei Chen, Wayne Mueller, Bill Sanns • And many, many others