Infrastructure & Pipeline Update
March 3, 2009: Nelson Lab Meeting
Jordan Mendler
Nelson Lab, UCLA Human Genetics Department
jmendler@ucla.edu
Talking Points
• Solexa and ABI work flow
  • Where is my data?
  • Why does it take so long to get it back?
• State of networking issues
  • Why is the sequencer moving images so slowly?
  • Why is Nils' connection slow?
• Future of backup, primary and fast storage
  • How come /home is not...
    • Infinitely fast from one connection?
    • Linearly scaling to multiple compute nodes?
    • Always online?
• Thoughts and plans for the SeqWarehouse project
  • How can I get all hits in a given region across every experiment?
  • What tools are available to better mine my data?
Solexa Work Flow (current)
• Sequencer runs and simultaneously copies images to Solexa Assistant
• Zugen enters the run into the LIMS system and flags it for processing
• Every hour Solexa Assistant looks for runs that are done (polling pattern sketched below)
  • Sets up the Illumina Pipeline for processing (tens of minutes)
  • Generates intensities (days)
  • Calls bases from intensities (hours)
  • Copies data to primary solexa_datasets (hours)
  • Compresses, samples and copies images to primary (days, but forked off)
• Every hour Rocks looks for runs whose data has finished copying to primary
  • Runs bfast (hours)
  • Generates SLO files from the output
• Every hour Solexa looks for runs ready for reporting
  • Generates BED/Wig/etc from the SLO files
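All three hourly pollers follow the same pattern: query the LIMS for runs in a given state, atomically advance the state, then fork off the long-running work. A minimal sketch of that pattern in Python, assuming a hypothetical LIMS schema and pipeline launcher (the runs table, status column and paths are illustrative, not the real system):

    import sqlite3
    import subprocess

    # Hypothetical LIMS database and pipeline wrapper; the real LIMS
    # schema, paths and pipeline command differ.
    LIMS_DB = "/var/lims/runs.db"
    PIPELINE = "/opt/pipeline/run_pipeline"

    def poll_once():
        db = sqlite3.connect(LIMS_DB)
        # Find runs the sequencer has finished copying but we have not processed.
        rows = db.execute(
            "SELECT run_id, run_dir FROM runs WHERE status = 'done'"
        ).fetchall()
        for run_id, run_dir in rows:
            # Flip the state first so a crash cannot double-process a run.
            db.execute("UPDATE runs SET status = 'processing' "
                       "WHERE run_id = ?", (run_id,))
            db.commit()
            # Fork off the long-running stage (intensity generation takes days).
            subprocess.Popen([PIPELINE, run_dir])
        db.close()

    if __name__ == "__main__":
        poll_once()  # invoked from an hourly cron job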
Solexa Work Flow (IPAR)
• Sequencer runs and simultaneously copies images AND generates and copies intensities to Solexa Assistant
• Zugen enters the run into the LIMS system and flags it for processing
• Every hour Solexa Assistant looks for runs that are ready
  • Sets up the Illumina Pipeline for base calling (minutes; entry-point sketch below)
  • Calls bases from the IPAR-generated intensities (hours); the multi-day intensity-generation step drops out, since IPAR did it on-instrument
  • Copies data to primary solexa_datasets (hours)
• Every hour Rocks looks for runs ready for alignment
  • Runs bfast (hours)
  • Generates SLO files from the output
• Every hour Solexa looks for runs ready for reporting
  • Generates BED/Wig/etc from the SLO files
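With the LIMS carrying an IPAR flag (next slide), the poller only has to pick a different pipeline entry point per run. A hedged sketch extending the hypothetical wrapper from the previous slide (the ipar flag and the options below are illustrative, not real Illumina Pipeline arguments):

    def pipeline_args(run_dir, ipar):
        # With IPAR the instrument already produced intensities, so the
        # pipeline can start at base calling; otherwise it must do the
        # multi-day image analysis first.
        if ipar:
            return [PIPELINE, "--start-at", "basecalling", run_dir]
        return [PIPELINE, "--start-at", "image-analysis", run_dir]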
IPAR Status
• Brian has implemented IPAR support in the pipeline
• The LIMS is being updated so the operator can indicate IPAR or no IPAR
• Production testing is pending a successful run
  • IPAR is supposed to have been working since the upgrade, but has failed for the last 2-3 runs
• The switch on order should resolve any hypothetical networking issues
  • More on this later
Solid Work Flow (current)
• In preparation, the Technician makes sure Images and Results are empty
• Tech starts the Tag1 run (1 week)
• Tech deletes images from Tag1 to make space for Tag2
• Tech runs Tag2 (1 week)
• If QC passes, the Technician deletes the Tag2 images and gives me the go-ahead
• I email Brian from Solid to generate intensities for Tag1 (1-2 days)
• I copy fasta, quality and secondary data to our primary storage (hours; verified-copy sketch below)
• After generation, I copy and then delete the intensities for Tag1 (hours)
• I email Brian from Solid to generate intensities for Tag2 (1-2 days)
• I copy the primary data for Tag2, and delete everything else
• I inform the Technician that the sequencer is ready for the next run
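Since every copy above is followed by deleting the source to free sequencer space, each transfer should be checksum-verified before anything is removed. A minimal sketch of such a verified copy (the paths are illustrative):

    import hashlib
    import os
    import shutil

    def md5sum(path, bufsize=1 << 20):
        # Stream the file through MD5 so multi-GB files don't fill RAM.
        h = hashlib.md5()
        f = open(path, "rb")
        for chunk in iter(lambda: f.read(bufsize), ""):
            h.update(chunk)
        f.close()
        return h.hexdigest()

    def copy_verify_delete(src, dst):
        # Copy, verify the checksums match, and only then delete the source.
        shutil.copy2(src, dst)
        if md5sum(src) != md5sum(dst):
            raise IOError("checksum mismatch, keeping source: %s" % src)
        os.remove(src)

    # Hypothetical layout; real run directories differ.
    # copy_verify_delete("/solid/results/Tag1/reads.fasta",
    #                    "/primary/solid_datasets/Tag1/reads.fasta")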
Solid Work Flow (current)
• Why so tedious?
  • Lack of space on the sequencer until the next version
  • Unfortunate, since an $8k array would be enough for 2-3 full runs
Solid Automation Requirements
• More on-sequencer disk space
  • Out of our control until the next version of Solid
  • With more disk space, intensities can be generated on-the-fly
  • After a run finishes, SPCH and other temp files can be removed to free most of the storage used (currently needed for intensity generation)
• Integration with SeqTools
  • Brian has started the integration
  • Data migration is easy, but there are still some uncertainties as to what to keep
  • Alignment to be based on Nils' BFAST color space settings
    • Corona Lite can be run once John and Bret get it working
  • BFAST output converted to SLO/BED/Wig can use the common tools
• Does not help that much until things can be automated on Solid's side
Nelson Lab Storage
• Current design:
  • Keep things as simple and stable as possible
  • Data grows unevenly
  • Speed is limited to a single node-to-server link
  • A given file lives only on the primary server, so scaling is not linear
Lab Storage
• As with prior lab meetings, storage is still something we are watching
• Commercial: reliable, but vendor lock-in and expensive support contracts
  • Isilon ($1500-$2500/TB), BlueArc
    • NAS design bottlenecks each transfer at a single head node
  • DDN hardware w/ Lustre ($1250-$1750/TB)
• Do-It-Yourself: unknown stability, but potentially much faster, cheaper and more flexible
  • Lustre ($350-400/TB)
    • Very fast, and allows for parallel I/O
    • Not HA, but could be good for scratch space
  • GlusterFS ($350-400/TB, $700-800/TB with HA)
    • Unknown performance and stability
    • Feature set allows for both parallel I/O and HA
  • ZFS w/ many JBODs ($250-$300/TB)
    • Stable filesystem, with all disks locally attached
    • Questions about how a single server will react to hundreds of drives
    • Bottleneck will be the single network connection, but fine for backup
    • Built-in snapshots, compression, etc.
SeqWarehouse Backend Storage
• Relational Database
  • Full range of queries
  • Potential for very slow queries, since the volume of data far exceeds RAM and the data is not intelligently ordered
  • Exactly how slow is still being determined
• Berkeley DB
  • Key/Value database stored in a flat file
  • Very stable, large community, and supports all common languages
  • Support for joins and range queries requires a custom API on top (sketched below)
• Flat Files
  • Require the implementation and support of non-standard components
• HDF5/NetCDF
  • I/O libraries for fast access, with extensive grouping and organization of data
  • No built-in indexing support
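For a sense of what that custom range-query API over Berkeley DB might look like: a minimal sketch using Python's bsddb B-tree interface, with fixed-width, zero-padded keys so lexicographic order matches genomic order (the key layout and paths are illustrative):

    import bsddb

    # Hypothetical key layout: zero-padding positions to a fixed width
    # makes B-tree (lexicographic) order match genomic order.
    def make_key(assembly, chrom, pos):
        return "%s_%s_pos%010d" % (assembly, chrom, pos)

    def range_query(db, assembly, chrom, start, end):
        # Yield (key, value) for all entries in [start, end) on one chromosome.
        try:
            key, value = db.set_location(make_key(assembly, chrom, start))
            while key < make_key(assembly, chrom, end):
                yield key, value
                key, value = db.next()
        except KeyError:
            return  # ran off the end of the B-tree

    db = bsddb.btopen("/tmp/reads.db", "c")  # illustrative path
    db[make_key("hg18", "chr1", 12345)] = "ACGT...\t37"
    for k, v in range_query(db, "hg18", "chr1", 10000, 20000):
        print k, v
    db.close()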
SeqWarehouse Backend Storage
• HBase/others
  • Distributed, replicated store of key/value pairs/groups
  • Relatively new and developmental, but deployed at petabyte scale
  • Keys are stored in sorted order for fast retrieval of similar keys
    • Designed for search: edu.ucla, edu.ucla.genome, edu.ucla.genome.solexa
    • Easily modeled as: hg18_chr1_pos12345, hg18_chr1_pos12401, hg18_chr1_pos1248, ... (but see the padding note below)
  • Optional compression, timestamping
  • Access is not SQL-like. No joins, query engine, or types. Syntax:
    • Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1");
    • Scanner scanner = table.getScanner(new String[] {"myColumnFamily:columnQualifier1"});
    • RowResult rowResult = scanner.next();
    • rowResult.getRow();
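One caveat for the key scheme above: HBase compares keys as plain byte strings, so unpadded positions sort out of genomic order. Indeed, in the example itself hg18_chr1_pos1248 sorts after hg18_chr1_pos12401. A short demonstration, and the zero-padded fix:

    # Unpadded keys sort lexicographically, not genomically.
    unpadded = ["hg18_chr1_pos12345", "hg18_chr1_pos12401", "hg18_chr1_pos1248"]
    print sorted(unpadded)
    # ['hg18_chr1_pos12345', 'hg18_chr1_pos12401', 'hg18_chr1_pos1248']
    # ...even though position 1248 precedes 12345 on the chromosome.

    # Zero-padding positions to a fixed width restores genomic order.
    padded = ["hg18_chr1_pos%010d" % p for p in (12345, 12401, 1248)]
    print sorted(padded)
    # ['hg18_chr1_pos0000001248', 'hg18_chr1_pos0000012345',
    #  'hg18_chr1_pos0000012401']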
Other Possibly Useful Tools
• Pig
  • Designed by Yahoo for ad-hoc analysis of large data sets on a Hadoop cluster
  • raw = LOAD '080930_lane1.tab_delim.txt' USING PigStorage('\t') AS (chr, pos, read, ref);
  • filtered_reads = FILTER raw BY pos < 1500000;
  • ordered_reads = ORDER filtered_reads BY chr, pos;
  • unique_reads = DISTINCT ordered_reads;
  • just_reads = GROUP unique_reads BY (read, ref);
  • action_on_reads = FOREACH just_reads GENERATE foo($0), bar($1);
  • aggregate = JOIN action_on_reads BY $0, some_other_experiment BY $0;
  • STORE aggregate INTO '/tmp/aggregate-reads-of-interest' USING PigStorage();
• Hive
  • Designed by Facebook, also for large-scale analysis on a Hadoop cluster
  • More SQL/RDBMS-like than Pig, with a strict schema
  • Less dynamic, and less information available about the project
• Would anyone be interested in these types of tools, or are they too low-level?
• What types of things would be most helpful to everyone?
Generic Thank You Slide
• Thanks to:
  • Brian for help with the pipeline and data warehousing project
  • Bret for help with storage and networking
  • Mike Yourshaw for input on schemas and data modeling
  • Patrick @ Ctrl for help with rewiring the patch panel
  • Stan
  • Everyone else