Delve into the infrastructure and pipeline updates discussed at the recent Nelson Lab meeting in UCLA's Human Genetics Department. Explore the issues related to data processing speed, networking, and storage scalability, and learn about the current Solexa and Solid work flows, IPAR status, and automation requirements. Gain insights into the state of networking, storage solutions, and the evolving strategies for managing genetics data effectively. Discuss future plans for the SeqWarehouse project and explore available tools for data mining.
Infrastructure & Pipeline Update March 3, 2009: Nelson Lab Meeting Jordan Mendler Nelson Lab, UCLA Human Genetics Department jmendler@ucla.edu
Talking Points • Solexa and ABI work flow • Where is my data? • Why does it take so long to get it back? • State of Networking issues • Why is the sequencer moving images too slowly? • Why is Nils' connection slow? • Future of backup, primary and fast storage • How come /home is not... • Infinitely fast from one connection? • Linearly scaling to multiple compute nodes? • Always online? • Thoughts and Plans for SeqWarehouse Project • How can I get all hits in a given region across every experiment? • What tools are available to better mine my data?
Solexa Work Flow (current) • Sequencer runs and simultaneously copies images to Solexa Assistant • Zugen enters run into LIMS system and flags it for processing • Every hour Solexa Assistant looks for runs that are done • Sets up Illumina Pipeline for processing (tens of minutes) • Generates intensities (days) • Calls bases from intensities (hours) • Copy data to primary solexa_datasets (hours) • Compress, sample and copy images to primary (days, but forked off) • Every hour Rocks looks for runs that have finished #4 • Run bfast (hours) • Generate SLO files from output • Every hour Solexa looks for runs ready for reporting • Generate BED/Wig/etc from SLO files
Solexa Work Flow (IPAR) • Sequencer runs and simultaneously copies images AND generates and copies intensities to Solexa Assistant • Zugen enters run into LIMS system and flags it for processing • Every hour Solexa Assistant looks for runs that are ready • Sets up Illumina Pipeline for base calling (minutes) • Generates intensities (days) • Call bases from intensities (hours) • Copy data to primary solexa_datasets (hours) • Every hour Rocks looks for runs ready for alignment • Run bfast (hours) • Generate SLO files from output • Every hour Solexa looks for runs ready for reporting • Generate BED/Wig/etc from SLO files
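The hourly hand-offs in the two Solexa work flows above can be sketched as a small polling loop. This is only an illustration under stated assumptions: the stage names, the `Run` class and the `poll` function are hypothetical and are not taken from the actual LIMS or SeqTools code.

```python
from dataclasses import dataclass

# Hypothetical stage names; the real LIMS flags are not given in the slides.
STAGES = ["sequenced", "flagged", "processed", "aligned", "reported"]


@dataclass
class Run:
    name: str
    stage: str


def poll(runs, ready_stage, work):
    """One hourly pass: do `work` on every run waiting at `ready_stage`,
    then move that run to the next stage. Returns the names advanced."""
    advanced = []
    for run in runs:
        if run.stage == ready_stage:
            work(run)  # e.g. base calling, bfast, or report generation
            run.stage = STAGES[STAGES.index(ready_stage) + 1]
            advanced.append(run.name)
    return advanced
```

In this sketch Solexa Assistant would poll for `"flagged"`, Rocks for `"processed"`, and the reporting job for `"aligned"`. Each hourly poll can add up to an hour of idle waiting before the next stage picks a run up, on top of the processing times listed on the slides.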
IPAR Status • Brian has implemented IPAR support in the pipeline • LIMS is being updated so the operator can indicate IPAR or no IPAR • Production testing is pending a successful run • IPAR is supposed to have been working since the upgrade, but it has failed on the last 2-3 runs • Switch on order should resolve any hypothetical networking issues • More on this later
Solid Work Flow (current) • In preparation, Technician makes sure Images and Results are empty • Tech runs Tag1 (1 week) • Tech deletes images from Tag1 to make space for Tag2 • Tech runs Tag2 (1 week) • If QC passes, Technician deletes Tag2 images and gives me go-ahead • I email Brian from Solid to generate intensities for Tag1 (1-2 days) • I copy fasta, quality and secondary data to our primary storage (hours) • After generation, I copy and delete intensities for Tag1 (hours) • I email Brian from Solid to generate intensities for Tag2 (1-2 days) • I copy the primary data for Tag2, and delete everything else • I inform the Technician the sequencer is ready for the next run
Solid Work Flow (current) • Why so tedious? • Lack of space on the sequencer until next version • Unfortunate, since an $8k array would be enough for 2-3 full runs
Solid Automation Requirements • More on-sequencer disk space • Out of our control until the next version of Solid • With more disk space, intensities can be generated on-the-fly • After run finishes, SPCH and other temp files can be removed to free most of the storage used (currently needed for intensity generation) • Integration with SeqTools • Brian has started the integration • Data migration is easy, but still some uncertainties as to what to keep • Alignment to be based on Nils' BFAST color space settings • Corona Lite can be run once John and Bret get it working • BFAST output converted to SLO/Bed/Wig can use common tools • Does not help that much, until things can be automated on Solid's side
Nelson Lab Storage • Current design: • Keep things as simple and stable as possible • Data grows unevenly • Speed is limited to a single node-to-server link • A given file only lives on the primary server, so throughput does not scale linearly with more nodes
Lab Storage • As with prior lab meetings, storage is still something we are watching • Commercial: Reliable, Vendor Lock In, Expensive Support Contracts • Isilon ($1500-$2500/TB), BlueArc • NAS design bottlenecks each transfer to a single head node • DDN Hardware w/ Lustre ($1250-$1750/TB) • Do-It-Yourself: Unknown Stability, Potentially much faster, cheaper & more flexible • Lustre ($350-400/TB) • Very fast, and allows for parallel I/O • Not HA, but could be good for scratch space • GlusterFS ($350-400/TB, $700-800/TB with HA) • Unknown performance and stability • Feature set allows for both parallel I/O and HA • ZFS w/ Many JBODs ($250-$300/TB) • Stable filesystem, with all disks locally attached • Questions about how a single server will react to hundreds of drives • Bottleneck will be single network connection, but fine for backup • Built-in snapshots, compression, etc
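To compare the options above at a given size, the per-TB price ranges can be multiplied out. The sketch below uses only the prices quoted on the slide; the 100 TB target capacity is a hypothetical figure chosen for illustration, and BlueArc is omitted because no price is given.

```python
# Price ranges in dollars per TB, copied from the slide above.
PRICE_PER_TB = {
    "Isilon": (1500, 2500),
    "DDN + Lustre": (1250, 1750),
    "Lustre (DIY)": (350, 400),
    "GlusterFS": (350, 400),
    "GlusterFS with HA": (700, 800),
    "ZFS + JBODs": (250, 300),
}


def cost_range(option, capacity_tb):
    """Return (low, high) total hardware cost in dollars for capacity_tb."""
    low, high = PRICE_PER_TB[option]
    return low * capacity_tb, high * capacity_tb


CAPACITY_TB = 100  # hypothetical target size, not a figure from the talk

for option in PRICE_PER_TB:
    low, high = cost_range(option, CAPACITY_TB)
    print(f"{option:18s} ${low:>9,} - ${high:>9,}")
```

These totals cover purchase price only; the support contracts and stability trade-offs noted on the slide are not modeled.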
SeqWarehouse Backend Storage • Relational Database • Full range of queries • Potential for very slow query time due to volume of data far exceeding RAM, and unintelligent ordering of data • Exactly how slow is still being determined • Berkeley DB • Key/Value Database stored in flat file • Very stable, large community and supports all common languages • Support for joins and range queries requires custom API on top • Flat Files • Requires the implementation and support of non-standard components • HDF5/NetCDF • I/O libraries for fast access, and extensive grouping and organization of data • No built-in indexing support
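For the relational option, the "all hits in a given region across every experiment" question from the talking points could look roughly like the sketch below. It uses Python's built-in `sqlite3` purely as a stand-in; the table name, column names and sample rows are hypothetical, and nothing here reflects the actual SeqWarehouse schema. An in-memory toy like this also says nothing about query time once the data far exceeds RAM, which is the open concern on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per aligned hit.
conn.execute(
    "CREATE TABLE hits (experiment TEXT, chrom TEXT, pos INTEGER, seq TEXT)"
)
# Index on (chrom, pos) so region lookups do not need a full table scan.
conn.execute("CREATE INDEX hits_by_region ON hits (chrom, pos)")

# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [
        ("exp1", "chr1", 12345, "ACGT"),
        ("exp2", "chr1", 12401, "TTGA"),
        ("exp2", "chr2", 12350, "GGCA"),
        ("exp3", "chr1", 99999, "CCAT"),
    ],
)


def hits_in_region(conn, chrom, start, end):
    """All hits on chrom with start <= pos <= end, across every experiment."""
    return conn.execute(
        "SELECT experiment, pos, seq FROM hits "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    ).fetchall()
```

Calling `hits_in_region(conn, "chr1", 12000, 13000)` returns the two chr1 rows from `exp1` and `exp2` and skips the chr2 and out-of-range hits.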
SeqWarehouse Backend Storage • HBase/others • Distributed, Replicated Store of Key/Value pairs/groups • Relatively new and developmental, but deployed at Petabyte scale • Keys are stored in sorted order for fast retrieval of similar keys • Designed for search: edu.ucla, edu.ucla.genome, edu.ucla.genome.solexa • Easily modeled as: hg18_chr1_pos12345, hg18_chr1_pos12401, hg18_chr1_pos1248, ... • Optional compression, timestamping • Access is not SQL-like. No joins, query engine, types. Syntax: • Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1"); • Scanner scanner = table.getScanner(new String[] {"myColumnFamily:columnQualifier1"}); • RowResult rowResult = scanner.next(); • rowResult.getRow();
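Because a sorted key store compares keys as strings, a region lookup becomes a range scan between two keys. The sketch below imitates that with a sorted Python list and `bisect`; it is not HBase code. The `row_key` and `scan` helpers and the zero-padding width are assumptions added for illustration, since the keys on the slide are unpadded. Padding matters because, as plain strings, `pos1248` sorts between `pos12401` and `pos13000`, so an unpadded scan for positions 12000-13000 would wrongly pick up position 1248.

```python
from bisect import bisect_left, bisect_right


def row_key(build, chrom, pos, width=9):
    """Build a key like hg18_chr1_pos000012345.

    Zero-padding (an assumption, not shown on the slide) makes string
    order agree with numeric order of positions.
    """
    return f"{build}_{chrom}_pos{pos:0{width}d}"


def scan(sorted_keys, build, chrom, start, end):
    """Return every key for chrom with start <= pos <= end.

    Works on any sorted list of keys, mimicking a range scan over a
    store that keeps its keys in sorted order.
    """
    lo = bisect_left(sorted_keys, row_key(build, chrom, start))
    hi = bisect_right(sorted_keys, row_key(build, chrom, end))
    return sorted_keys[lo:hi]


# Positions taken from the slide's example keys, plus one made-up value.
keys = sorted(row_key("hg18", "chr1", p) for p in (12345, 12401, 1248, 200000))
```

Here `scan(keys, "hg18", "chr1", 12000, 13000)` returns only the keys for positions 12345 and 12401.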
Other Possibly Useful Tools • Pig • Designed by Yahoo for ad-hoc analysis of large data sets on a Hadoop cluster • raw = LOAD '080930_lane1.tab_delim.txt' USING PigStorage('\t') AS (chr, pos, read, ref); • filtered_reads = FILTER raw BY pos < 1500000; • ordered_reads = ORDER filtered_reads BY (chr, pos); • unique_reads = DISTINCT ordered_reads; • just_reads = GROUP unique_reads BY (read, ref); • action_on_reads = FOREACH just_reads GENERATE foo($0), bar($1); • aggregate = JOIN action_on_reads BY $0, some_other_experiment BY $0; • STORE aggregate INTO '/tmp/aggregate-reads-of-interest' USING PigStorage(); • Hive • Designed by Facebook, also for large-scale analysis on a Hadoop cluster • More SQL/RDBMS like than Pig, with a strict schema • Less dynamic, and less information available about the project
Other Possibly Useful Tools • Would anyone be interested in these types of tools, or are they too low-level? • What types of things would be most helpful to everyone?
Generic Thank You Slide • Thanks to: • Brian for help with pipeline and data warehousing project • Bret for help with storage and networking • Mike Yourshaw for input on schemas and data modeling • Patrick @ Ctrl for help with rewiring the patch panel • Stan • Everyone else