Tranche
A distributed repository for large, immutable files.
James A. Hill, Philip Andrews
University of Michigan, Ann Arbor
August 28, 2009

Outline
• Introduction
• Technical Overview
  - Description
  - Challenges
  - Future Plans
• ProteomeCommons.org Tranche Repository
• Tranche and caGrid

Introduction: The Proteomics Data Challenge
• Proteomics projects generate large amounts of data at high cost.
• Reuse of data sets was nearly nonexistent.
  - File formats were proprietary.
  - Virtually no data sets were publicly available.
• Proteomics technologies evolve rapidly.
  - Data types and relationships change quickly.
• No infrastructure existed for sharing or publicly disseminating raw proteomics data sets.
• Documentation for peer review was limited or nonexistent.
• The Paris Guidelines were developed by major proteomics journals and domain experts to address data documentation concerns.

Introduction: Initial Goals
• Lower the barriers to putting proteomics data in the public domain
• Consistent with dissemination requirements of funding agencies
• Address data release requirements of journals
• Provide a tool for sharing large data sets
• Explore a data repository model whose costs are shared by the community

Technical Overview: Summary
• Distributed
• Files are immutable
• No structure, file locations
• Integrity
• Provenance
• Security
  - Upload
  - Encryption
  - SSL communication option
• Versioning
  - In place of mutability
• Licensing
  - All data is uploaded with a license file that states what can and cannot be done with the data.
• Self-repairing (potentially self-moderating)
• Distributed Server Model

Technical Overview: Tranche Hash
• “Location” of data on the repository
• Recalculate for validation
• Four sections
  - MD5 hash
  - SHA-1 hash
  - SHA-256 hash
  - Data length
• Hash span: a range of hashes
• Full hash span: the range of all possible hashes, 0 to 2^608

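A minimal sketch of how a four-section hash like this could be built and rechecked. The three digest sizes (128, 160, and 256 bits) are fixed by the algorithms; the 8-byte big-endian length field and the order of the sections are assumptions chosen so the total comes to 608 bits, and the function names are hypothetical rather than Tranche's API.

```python
import hashlib


def tranche_style_hash(data: bytes) -> bytes:
    """Concatenate MD5, SHA-1, SHA-256 and the data length.

    Assumed layout: 16 + 20 + 32 digest bytes, then an 8-byte
    big-endian length, for 76 bytes (608 bits) in total.
    """
    return (
        hashlib.md5(data).digest()
        + hashlib.sha1(data).digest()
        + hashlib.sha256(data).digest()
        + len(data).to_bytes(8, "big")
    )


def is_valid(data: bytes, expected: bytes) -> bool:
    """Recalculate the hash of downloaded data and compare it."""
    return tranche_style_hash(data) == expected


chunk = b"example spectrum data"
h = tranche_style_hash(chunk)
print(len(h) * 8)          # 608
print(is_valid(chunk, h))  # True
```

Because the hash doubles as the "location" of the data, a client that recalculates it after download checks integrity without needing any separate checksum file.
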
Technical Overview: Network Architecture
• Servers are assigned hash spans
• Chunk location is determined by server hash spans
• Max. 3 checks (3 servers with the hash in their hash spans)
• The network must have x full hash spans between servers, where x is the number of desired chunk replications
• In a world of 10 hashes: [figure not reproduced]

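The "world of 10 hashes" idea can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's actual placement logic: the server names and span assignments are made up, spans are treated as inclusive integer ranges, and the replication level is taken to be the smallest number of servers covering any single hash.

```python
from typing import Dict, List, Tuple

WORLD_SIZE = 10  # toy hash space: hashes 0..9

# Hypothetical servers, each with inclusive (start, end) hash spans.
SPANS: Dict[str, List[Tuple[int, int]]] = {
    "server-a": [(0, 9)],
    "server-b": [(0, 4)],
    "server-c": [(5, 9)],
}


def servers_for(spans: Dict[str, List[Tuple[int, int]]], h: int) -> List[str]:
    """Return the servers whose hash spans contain hash h."""
    return sorted(
        name
        for name, ranges in spans.items()
        if any(lo <= h <= hi for lo, hi in ranges)
    )


def replication_level(spans: Dict[str, List[Tuple[int, int]]], world: int) -> int:
    """Number of full hash spans the servers provide between them."""
    return min(len(servers_for(spans, h)) for h in range(world))


print(servers_for(SPANS, 3))                 # ['server-a', 'server-b']
print(servers_for(SPANS, 7))                 # ['server-a', 'server-c']
print(replication_level(SPANS, WORLD_SIZE))  # 2
```

Here server-b and server-c together make up one full hash span and server-a provides a second, so every chunk has two homes. Removing server-c would drop hashes 5 to 9 to a single copy.
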
Technical Overview: Uploading
• File encoding and chunking occur on the client.
• Encryption
  - Upload with a passphrase
  - Privately share the passphrase for dissemination
• Compression (GZIP)
  - High-throughput data can achieve very high compression rates
• Users must have X.509 certificates signed by a trusted certificate

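A rough sketch of client-side compression and chunking, using only the Python standard library. The chunk size and the order of operations (compress the whole file, then split it) are assumptions; the slides state neither. Passphrase encryption is left out because the standard library has no symmetric cipher, so this covers only the GZIP and chunking steps.

```python
import gzip
from typing import List

CHUNK_SIZE = 1024 * 1024  # assumed chunk size; not stated in the slides


def encode_and_chunk(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """GZIP-compress the data on the client, then split it into chunks."""
    compressed = gzip.compress(data)
    return [
        compressed[i:i + chunk_size]
        for i in range(0, len(compressed), chunk_size)
    ]


def reassemble(chunks: List[bytes]) -> bytes:
    """Join the chunks in order and decompress back to the original."""
    return gzip.decompress(b"".join(chunks))


# Highly repetitive text stands in for high-throughput instrument output.
data = b"m/z intensity\n" * 100_000
chunks = encode_and_chunk(data, chunk_size=1024)
print(reassemble(chunks) == data)                # True
print(sum(len(c) for c in chunks) < len(data))   # True
```

Repetitive tabular output compresses far better than the toy ratio a typical binary file would see, which is the point the slide makes about high-throughput data.
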
Technical Overview: Implementation
• Open source (Apache 2 License)
• Java
• GUI
  - Java Swing library
  - Launched using Java Web Start
• ProteomeCommons.org
  - Apache Tomcat
  - MySQL database

Technical Overview: Challenges
• Knowing and searching what is on the repository
  - No “master” (which also means no bottleneck or single point of failure)
  - Tracking by an external system
• Data access controls
  - Passphrase access controls are clumsy
  - Does not meet HIPAA regulations

Technical Overview: Future Plans
• Currently implementing a major redesign of the Tranche architecture based on two years of observation and experimentation.
  - Scalability
  - Speed
  - Availability
• Additional functionality:
  - Server self-modification of hash spans (to replace failed servers, or on reaching maximum storage capacity locally)
  - Customizable, machine-readable licensing

ProteomeCommons.org Tranche Repository
• The only public repository
• Began operating October 2006
• Network
  - 15 servers at 8 institutions
  - ~80 TB raw disk space, ~20 TB used disk space
  - 7,856 data sets (10.1 TB unique data)
• Stats as of August 24, 2009. Graph created July 16, 2009.

ProteomeCommons.org Tranche Repository: Participants
• 375 users registered to upload data (August 24, 2009)
• Participating data providers
  - CPTC Centers (NCI) (Clinical Proteomic Technologies for Cancer)
  - MMRF Centers
  - ABRF (Association of Biomolecular Resource Facilities)
  - HUPO (Human Proteome Organization)
  - CSAR (Community Structure-Activity Resource)
  - Mouse Models for Cancer
  - PGP (Personal Genome Project)
• Participating data aggregators/analyzers
  - TheGPMdb
  - PeptideAtlas
  - PRIDE
• Graph created July 16, 2009.

Tranche and caGrid: Why?
• Why use Tranche?
• High-throughput data has a high cost
  - Procurement
  - Storage
  - Movement
  - Uniqueness
• Comparison with other file sharing technologies like BitTorrent
  - Design features: provenance, integrity, etc.
  - Availability: file sharing guarantees data is available only as long as clients are on the network and have the desired files.
  - Long-term availability: file sharing is not the same as file storage.
  - Lower community storage cost: “silo” file sharing and storage systems can waste storage space by over-replication (if they were to keep data long-term).

Tranche and caGrid: Proposed Uses
• ProteomeCommons.org Tranche Annotations (PCTA)
  - Initial use case
  - Can be done today
  - Search the caBIG data service for data sets; the Tranche hash is stored for reference.
  - Download the data sets from ProteomeCommons.org: use the graphical user interface, use command-line clients for automation, or use the API to create custom scripts.
• Drop a generic Tranche client into caGrid
  - Already compatible (both written in Java)
  - A generic client would be able to converse with any Tranche repository. This ability does not yet exist.
  - Would need to reference Tranche data by a specially designed URL scheme to differentiate between repositories.

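To make the last point concrete, here is one purely hypothetical shape such a repository-qualified reference could take. The slides say this ability does not yet exist, so the `tranche://` scheme, the use of the host part to name the repository, and the function name are all assumptions made for illustration, and the hash in the example is a shortened placeholder rather than a real Tranche hash.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_tranche_url(url: str) -> Tuple[str, str]:
    """Split a hypothetical tranche://<repository>/<hash> reference.

    Returns (repository, data_hash). The scheme is an assumption,
    not an existing Tranche feature.
    """
    parts = urlsplit(url)
    if parts.scheme != "tranche":
        raise ValueError("not a tranche URL: " + url)
    repository = parts.netloc
    data_hash = parts.path.lstrip("/")
    if not repository or not data_hash:
        raise ValueError("missing repository or hash: " + url)
    return repository, data_hash


print(parse_tranche_url("tranche://proteomecommons.org/ab12cd"))
# ('proteomecommons.org', 'ab12cd')
```

Putting the repository in the host position keeps the hash itself unchanged, so the same data stored in two repositories would differ only in that one component.
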
Tranche and caGrid: More Proposed Uses
• Create generic data service software for tracking and managing data, and for providing queryable metadata describing the contents of the repository
  - Use ProteomeCommons.org as a prototype
  - Without a security overhaul, the metadata could not contain any patient identifiers.
• Make the repository and/or web portal HIPAA compliant
  - May be necessary for some users
  - Could likely be worked around
  - Would take the most time

Acknowledgements
• Principal Investigator: Phil Andrews
• ProteomeCommons/Tranche Development Team: James (Augie) Hill, Bryan Smith, Mark Gjukich, Panagiotis Papoulias, Jayson Falkner (emeritus)
• caBIG mentors: Dong Fu, Baris Suzek
• caBIG staff: Brian Davis, Michael Keller, Natasha Sefcovic
• Funding: NCI CPTAC; NCRR (P41 RR018627); State of Michigan (CTA)

Links
• www.proteomecommons.org
• www.trancheproject.org