
Storage Solutions for Bioinformatics

Li Yan, Director of FlexLab, Bioinformatics core technology laboratory, Science and Technology Division, BGI-Shenzhen. liyan3@genomics.cn http://www.genomics.cn/FlexLab/index.html


Presentation Transcript


  1. Storage Solutions for Bioinformatics Li Yan Director of FlexLab, Bioinformatics core technology laboratory liyan3@genomics.cn http://www.genomics.cn/FlexLab/index.html Science and Technology Division, BGI-Shenzhen

  2. OUTLINE • Background • Hardware Infrastructure of Data Storage • Data Management • Data Storage Architecture In BGI • Distributed Computing on Storage Server

  3. Background: Fast Growing Big Data

  4. Sequencing, sequencing and sequencing

  5. Background

  6. Fast growing big data
     • From small genomes to large complex genomes
       • E. coli genome: 4.9 Mbp
       • Caenorhabditis elegans genome: 100 Mbp
       • Human genome: 3 Gbp
       • Wheat genome: 16 Gbp
       • Salamander: 45 Gbp
     • From one sample to populations
       • Human genome: 3 billion DNA subunits (A, T, C, G)
       • 80~100X sequencing: 600 GB of raw data for an individual study
       • 1000 Genomes Project: 600 TB of raw data for a population study
     • From first-generation sequencing to second-generation sequencing
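
The 600 GB and 600 TB figures on this slide can be reproduced with simple arithmetic. The sketch below is only a plausibility check: the value of 2 bytes per sequenced base (roughly one character for the base and one for its quality score) and the count of 1000 samples are assumptions made here, not numbers stated on the slide.

```python
def raw_data_bytes(genome_bases: int, coverage: int, bytes_per_base: int = 2) -> int:
    """Estimate the raw sequencing data volume in bytes.

    bytes_per_base=2 is an assumption (base call plus quality score),
    not a figure taken from the slides.
    """
    return genome_bases * coverage * bytes_per_base


GB = 10**9
TB = 10**12

human_genome_bases = 3 * 10**9  # 3 billion DNA subunits, as on the slide
per_sample = raw_data_bytes(human_genome_bases, coverage=100)
population = per_sample * 1000  # assumed: 1000 samples in the population study

print(f"one sample at 100X: {per_sample / GB:.0f} GB")
print(f"1000 samples:       {population / TB:.0f} TB")
```

Under these assumptions the estimate matches the slide: about 600 GB per individual and about 600 TB for a population of 1000.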

  7. Long-Term Data Storage Needs
     • Properly secure the data
       • Plan for data redundancy, which generally means mirroring data in two or more copies
     • Available (24x7x365) for all kinds of uses
       • Readily accessible and in the right format
     • Fast data transfer for collaborations
       • Fast network server (Aspera) instead of mailing a hard drive
     • Scalable, easy to scale up
     • Choosing reliable file systems
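
The redundancy point above (two or more mirrored copies) can be illustrated with a small sketch that copies a file to several locations and verifies each copy by checksum. The function names, the directory names `site_a` and `site_b`, and the choice of SHA-256 are illustrative assumptions, not a description of how BGI mirrors its data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, replica_dirs: list[Path]) -> list[Path]:
    """Copy `source` into every replica directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demonstration in a temporary directory; the two "sites" are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "sample.fq"
    sample.write_text("@read1\nACGT\n+\nIIII\n")
    replicas = mirror(sample, [root / "site_a", root / "site_b"])
    replica_count = len(replicas)
    all_match = all(sha256_of(p) == sha256_of(sample) for p in replicas)
    print(replica_count, all_match)
```

In a real deployment the replica directories would sit on independent hardware or sites; copies on the same disk protect against accidental deletion but not against device failure.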

  8. Hardware infrastructure of data storage

  9. Types of Storage Infrastructure
     • Disk library
       • A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optic (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.
     • Magnetic tape
       • A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.
     • Redundant array of independent disks (RAID)
       • A storage technology that combines multiple disk drive components into a logical unit.
     • Direct-attached storage (DAS)
       • A digital storage system directly attached to a server or workstation, without a storage network in between.
     • Network-attached storage (NAS)
       • File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.
     • Storage area network (SAN)
       • A dedicated network that provides access to consolidated, block-level data storage.
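
One way RAID turns several disks into a single fault-tolerant logical unit is XOR parity, as used in parity-based levels such as RAID 5. The slide does not say which RAID level is meant, so the following is only a sketch of the parity idea; the block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on its own disk.
data_blocks = [b"ACGT", b"TTGA", b"CCAG"]
parity = xor_blocks(data_blocks)  # stored on a fourth disk

# Simulate losing the second disk, then rebuild it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)

print(recovered)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. This tolerates the loss of any single disk, but not two at once.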

  10. Software Level of Data storage

  11. Data flow of NGS
     [Diagram: Sequencer, Raw Data, Complex workflow (Alignment, Assembly, Association), Meaningful Biology Data, Data Store]
     • Meaningful Biology Data
       • Annotation of features
       • Variations/Mutations
       • Protein Structural
       • Gene Expressions
       • Function Networks

  12. Data Management
     • Classify the data into different levels
       • First level of storage: dynamic, fast, temporary
       • Second level of storage: slower than the first level, but durable and safe
       • Third level of storage: high-capacity medium for backups and archives
     • Choosing file systems
       • Currently popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.

  13. Classify the data into different levels
     • First level of storage: dynamic, fast, temporary
       • Intermediate results of data analysis
       • Reference data
       • …
     • Second level of storage: slower than the first level, but durable and safe
       • Sequencing raw data
       • Meaningful data
     • Third level of storage: high-capacity medium for backups and archives
       • Backups and archives of raw data and meaningful data
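
The three-level classification above can be written down as a small lookup table. This is a minimal sketch: the category names such as `intermediate_result` and the `placement` function are hypothetical labels chosen here, and only the mapping of data kinds to levels comes from the slide.

```python
from enum import Enum


class Tier(Enum):
    FIRST = "dynamic, fast, temporary"
    SECOND = "slower, durable and safe"
    THIRD = "high-capacity backups and archives"


# Primary storage level for each kind of data, following the slide.
TIER_BY_KIND = {
    "intermediate_result": Tier.FIRST,
    "reference_data": Tier.FIRST,
    "raw_sequencing_data": Tier.SECOND,
    "meaningful_data": Tier.SECOND,
}

# Kinds that are additionally backed up or archived on the third level.
ARCHIVED_KINDS = {"raw_sequencing_data", "meaningful_data"}


def placement(kind: str) -> list[Tier]:
    """Return every storage level on which data of this kind should live."""
    tiers = [TIER_BY_KIND[kind]]
    if kind in ARCHIVED_KINDS:
        tiers.append(Tier.THIRD)
    return tiers


for kind in TIER_BY_KIND:
    print(kind, [tier.name for tier in placement(kind)])
```

Intermediate results and reference data stay on the first level only, since they can be regenerated or re-downloaded, while raw and meaningful data get a second-level home plus a third-level archive.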

  14. Distributed File systems
     • Lustre
       • Lustre is a large-scale, safe, reliable, and highly available cluster file system, developed and maintained by Sun. It can support more than 10,000 nodes and petabyte-scale storage systems.
     • Hadoop (HDFS)
       • Hadoop is not just a distributed file system (HDFS) for storage; it is a framework for running large-scale, general-purpose distributed applications on a cluster.
     • OneFS
       • OneFS scales a single cluster to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.
     [Diagram: Distributed file systems, Storage Server]

  15. Distributed File systems
     • MogileFS (www.danga.com)
     • FreeNAS (www.openqrm.org)
     • FastDFS (code.google.com/p/fastdfs)
     • OpenAFS (www.openafs.org)
     • MooseFS (derf.homelinux.org)
     • pNFS (www.pnfs.com)
     • GoogleFS

  16. Data compression & Data security
     • Data compression
       • Commonly used:
         • Lempel-Ziv, BWT
       • Used exclusively for DNA sequences:
         • Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp
     • Data security
       • RAID system failure / redundancy
       • File system
       • Network
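
DNA-specific compressors can exploit the four-letter alphabet. The simplest illustration of that idea is 2-bit packing, which stores four bases per byte. This sketch is not the algorithm of any tool named on the slide (those use more elaborate modelling); the code table and function names are assumptions made for the example, and it handles only the characters A, C, G and T.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Recover the original string; `length` trims the padding bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


sequence = "ACGTTGCAAC"
packed = pack(sequence)
print(len(sequence), "bases ->", len(packed), "bytes")
print(unpack(packed, len(sequence)))
```

The original length has to be stored alongside the packed bytes, because the last byte may contain padding. Real sequencing data also carries ambiguous bases (N) and quality scores, which this sketch does not cover.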

  17. Data Storage Architecture In BGI

  18. Data Storage Architecture In BGI [Diagram: Sequencers, Compute Nodes, and Tape Library connected by Read/Write arrows; archiving keeps two copies]

  19. Data Storage Architecture In BGI [Same diagram, with First Level Storage highlighted]

  20. Data Storage Architecture In BGI [Same diagram, with Second Level Storage highlighted]

  21. Data Storage Architecture In BGI [Same diagram, with Third Level Storage highlighted]

  22. Data Storage Architecture In BGI [Same diagram, complete view]

  23. Distributed Computing on Storage Server

  24. Traditional Genome Assembly: costly, unscalable [Diagram: Users, NGS read file, Storage, Sequence Assembly on a large-memory server (>500 GB)]

  25. Distributed Genome Assembly: cost-effective, scalable [Diagram: several storage servers (IBM3630 x 16 for a human genome) …… Assembly]

  26. Hecate
     • Constructing de Bruijn graph
     • Solving tiny repeats
     • Merging bubbles
     • Merging contigs
     • Scaffolding
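
The first Hecate step listed above, constructing a de Bruijn graph, can be sketched in a few lines. This is a generic textbook construction, not Hecate's implementation: nodes are (k-1)-mers, each k-mer in a read adds one edge, and the reads and the value of k are toy values chosen for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; real reads are far longer and k is much larger.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Overlapping reads share k-mers, so their edges merge into a single path (here AC, CG, GT, TC, CA), which is what later steps such as repeat resolution and bubble merging operate on.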

  27. Gaea 2.1
     • Inputs: Reads, Reference genome
     • Stages: Preprocessing, Locating, Aligning, SNP calling
     • Distributed indexing for load balancing
     • Flexible splitting tolerates more mismatches
     • Dynamic programming for robust gap alignment
     • Standard mapping quality for SNP calling
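
The slide mentions dynamic programming for gapped alignment. As a generic illustration (not Gaea's actual aligner, whose scoring scheme is not given in the slides), the sketch below computes a global alignment score in the Needleman-Wunsch style with a linear gap penalty. The scoring values are assumptions picked for the example.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Keeping only two rows of the table holds memory at O(len(b)); recovering the alignment itself, rather than just its score, would require the full table or a traceback scheme.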

  28. Q&A
