Infrastructure & Pipeline Update
March 3, 2009: Nelson Lab Meeting
Jordan Mendler
Nelson Lab, UCLA Human Genetics Department
jmendler@ucla.edu
Talking Points
• Solexa and ABI work flow
  • Where is my data?
  • Why does it take so long to get it back?
• State of networking issues
  • Why is the sequencer moving images so slowly?
  • Why is Nils' connection slow?
• Future of backup, primary and fast storage
  • How come /home is not...
    • Infinitely fast from one connection?
    • Linearly scaling to multiple compute nodes?
    • Always online?
• Thoughts and plans for the SeqWarehouse project
  • How can I get all hits in a given region across every experiment?
  • What tools are available to better mine my data?
Solexa Work Flow (current)
• Sequencer runs and simultaneously copies images to Solexa Assistant
• Zugen enters the run into the LIMS system and flags it for processing
• Every hour Solexa Assistant looks for runs that are done (polling pattern sketched below)
  • Sets up the Illumina Pipeline for processing (tens of minutes)
  • Generates intensities (days)
  • Calls bases from intensities (hours)
  • Copies data to primary solexa_datasets (hours)
  • Compresses, samples and copies images to primary (days, but forked off)
• Every hour Rocks looks for runs whose data has finished copying to primary
  • Runs bfast (hours)
  • Generates SLO files from the output
• Every hour Solexa looks for runs ready for reporting
  • Generates BED/Wig/etc from the SLO files
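All three hourly pollers follow the same pattern: query the LIMS for runs in a given state, atomically advance the state, then fork off the long-running work. A minimal sketch of that pattern in Python, assuming a hypothetical LIMS schema and pipeline launcher (the runs table, status column and paths are illustrative, not the real system):

    import sqlite3
    import subprocess

    # Hypothetical LIMS database and pipeline wrapper; the real LIMS
    # schema, paths and pipeline command differ.
    LIMS_DB = "/var/lims/runs.db"
    PIPELINE = "/opt/pipeline/run_pipeline"

    def poll_once():
        db = sqlite3.connect(LIMS_DB)
        # Find runs the sequencer has finished copying but we have not processed.
        rows = db.execute(
            "SELECT run_id, run_dir FROM runs WHERE status = 'done'"
        ).fetchall()
        for run_id, run_dir in rows:
            # Flip the state first so a crash cannot double-process a run.
            db.execute("UPDATE runs SET status = 'processing' "
                       "WHERE run_id = ?", (run_id,))
            db.commit()
            # Fork off the long-running stage (intensity generation takes days).
            subprocess.Popen([PIPELINE, run_dir])
        db.close()

    if __name__ == "__main__":
        poll_once()  # invoked from an hourly cron job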
Solexa Work Flow (IPAR)
• Sequencer runs and simultaneously copies images AND generates and copies intensities to Solexa Assistant
• Zugen enters the run into the LIMS system and flags it for processing
• Every hour Solexa Assistant looks for runs that are ready
  • Sets up the Illumina Pipeline for base calling (minutes; entry-point sketch below)
  • Calls bases from the IPAR-generated intensities (hours); the multi-day intensity-generation step drops out, since IPAR did it on-instrument
  • Copies data to primary solexa_datasets (hours)
• Every hour Rocks looks for runs ready for alignment
  • Runs bfast (hours)
  • Generates SLO files from the output
• Every hour Solexa looks for runs ready for reporting
  • Generates BED/Wig/etc from the SLO files
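With the LIMS carrying an IPAR flag (next slide), the poller only has to pick a different pipeline entry point per run. A hedged sketch extending the hypothetical wrapper from the previous slide (the ipar flag and the options below are illustrative, not real Illumina Pipeline arguments):

    def pipeline_args(run_dir, ipar):
        # With IPAR the instrument already produced intensities, so the
        # pipeline can start at base calling; otherwise it must do the
        # multi-day image analysis first.
        if ipar:
            return [PIPELINE, "--start-at", "basecalling", run_dir]
        return [PIPELINE, "--start-at", "image-analysis", run_dir]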
IPAR Status
• Brian has implemented IPAR support in the pipeline
• The LIMS is being updated so the operator can indicate IPAR or no IPAR
• Production testing is pending a successful run
  • IPAR is supposed to have been working since the upgrade, but has failed for the last 2-3 runs
• The switch on order should resolve any hypothetical networking issues
  • More on this later
Solid Work Flow (current)
• In preparation, the Technician makes sure Images and Results are empty
• Tech starts the Tag1 run (1 week)
• Tech deletes images from Tag1 to make space for Tag2
• Tech runs Tag2 (1 week)
• If QC passes, the Technician deletes the Tag2 images and gives me the go-ahead
• I email Brian from Solid to generate intensities for Tag1 (1-2 days)
• I copy fasta, quality and secondary data to our primary storage (hours; verified-copy sketch below)
• After generation, I copy and then delete the intensities for Tag1 (hours)
• I email Brian from Solid to generate intensities for Tag2 (1-2 days)
• I copy the primary data for Tag2, and delete everything else
• I inform the Technician that the sequencer is ready for the next run
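Since every copy above is followed by deleting the source to free sequencer space, each transfer should be checksum-verified before anything is removed. A minimal sketch of such a verified copy (the paths are illustrative):

    import hashlib
    import os
    import shutil

    def md5sum(path, bufsize=1 << 20):
        # Stream the file through MD5 so multi-GB files don't fill RAM.
        h = hashlib.md5()
        f = open(path, "rb")
        for chunk in iter(lambda: f.read(bufsize), ""):
            h.update(chunk)
        f.close()
        return h.hexdigest()

    def copy_verify_delete(src, dst):
        # Copy, verify the checksums match, and only then delete the source.
        shutil.copy2(src, dst)
        if md5sum(src) != md5sum(dst):
            raise IOError("checksum mismatch, keeping source: %s" % src)
        os.remove(src)

    # Hypothetical layout; real run directories differ.
    # copy_verify_delete("/solid/results/Tag1/reads.fasta",
    #                    "/primary/solid_datasets/Tag1/reads.fasta")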
Solid Work Flow (current)
• Why so tedious?
  • Lack of space on the sequencer until the next version
  • Unfortunate, since an $8k array would be enough for 2-3 full runs
Solid Automation Requirements
• More on-sequencer disk space
  • Out of our control until the next version of Solid
  • With more disk space, intensities can be generated on-the-fly
  • After a run finishes, SPCH and other temp files can be removed to free most of the storage used (currently needed for intensity generation)
• Integration with SeqTools
  • Brian has started the integration
  • Data migration is easy, but there are still some uncertainties as to what to keep
  • Alignment to be based on Nils' BFAST color space settings
    • Corona Lite can be run once John and Bret get it working
  • BFAST output converted to SLO/BED/Wig can use the common tools
• Does not help that much until things can be automated on Solid's side
Nelson Lab Storage
• Current design:
  • Keep things as simple and stable as possible
  • Data grows unevenly
  • Speed is limited to a single node-to-server link
  • A given file lives only on the primary server, so scaling is not linear
Lab Storage
• As with prior lab meetings, storage is still something we are watching
• Commercial: reliable, but vendor lock-in and expensive support contracts
  • Isilon ($1500-$2500/TB), BlueArc
    • NAS design bottlenecks each transfer at a single head node
  • DDN hardware w/ Lustre ($1250-$1750/TB)
• Do-It-Yourself: unknown stability, but potentially much faster, cheaper and more flexible
  • Lustre ($350-400/TB)
    • Very fast, and allows for parallel I/O
    • Not HA, but could be good for scratch space
  • GlusterFS ($350-400/TB, $700-800/TB with HA)
    • Unknown performance and stability
    • Feature set allows for both parallel I/O and HA
  • ZFS w/ many JBODs ($250-$300/TB)
    • Stable filesystem, with all disks locally attached
    • Questions about how a single server will react to hundreds of drives
    • Bottleneck will be the single network connection, but fine for backup
    • Built-in snapshots, compression, etc.
SeqWarehouse Backend Storage
• Relational Database
  • Full range of queries
  • Potential for very slow queries, since the volume of data far exceeds RAM and the data is not intelligently ordered
  • Exactly how slow is still being determined
• Berkeley DB
  • Key/Value database stored in a flat file
  • Very stable, large community, and supports all common languages
  • Support for joins and range queries requires a custom API on top (sketched below)
• Flat Files
  • Require the implementation and support of non-standard components
• HDF5/NetCDF
  • I/O libraries for fast access, with extensive grouping and organization of data
  • No built-in indexing support
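For a sense of what that custom range-query API over Berkeley DB might look like: a minimal sketch using Python's bsddb B-tree interface, with fixed-width, zero-padded keys so lexicographic order matches genomic order (the key layout and paths are illustrative):

    import bsddb

    # Hypothetical key layout: zero-padding positions to a fixed width
    # makes B-tree (lexicographic) order match genomic order.
    def make_key(assembly, chrom, pos):
        return "%s_%s_pos%010d" % (assembly, chrom, pos)

    def range_query(db, assembly, chrom, start, end):
        # Yield (key, value) for all entries in [start, end) on one chromosome.
        try:
            key, value = db.set_location(make_key(assembly, chrom, start))
            while key < make_key(assembly, chrom, end):
                yield key, value
                key, value = db.next()
        except KeyError:
            return  # ran off the end of the B-tree

    db = bsddb.btopen("/tmp/reads.db", "c")  # illustrative path
    db[make_key("hg18", "chr1", 12345)] = "ACGT...\t37"
    for k, v in range_query(db, "hg18", "chr1", 10000, 20000):
        print k, v
    db.close()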
SeqWarehouse Backend Storage
• HBase/others
  • Distributed, replicated store of key/value pairs/groups
  • Relatively new and developmental, but deployed at petabyte scale
  • Keys are stored in sorted order for fast retrieval of similar keys
    • Designed for search: edu.ucla, edu.ucla.genome, edu.ucla.genome.solexa
    • Easily modeled as: hg18_chr1_pos12345, hg18_chr1_pos12401, hg18_chr1_pos1248, ... (but see the padding note below)
  • Optional compression, timestamping
  • Access is not SQL-like. No joins, query engine, or types. Syntax:
    • Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1");
    • Scanner scanner = table.getScanner(new String[] {"myColumnFamily:columnQualifier1"});
    • RowResult rowResult = scanner.next();
    • rowResult.getRow();
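One caveat for the key scheme above: HBase compares keys as plain byte strings, so unpadded positions sort out of genomic order. Indeed, in the example itself hg18_chr1_pos1248 sorts after hg18_chr1_pos12401. A short demonstration, and the zero-padded fix:

    # Unpadded keys sort lexicographically, not genomically.
    unpadded = ["hg18_chr1_pos12345", "hg18_chr1_pos12401", "hg18_chr1_pos1248"]
    print sorted(unpadded)
    # ['hg18_chr1_pos12345', 'hg18_chr1_pos12401', 'hg18_chr1_pos1248']
    # ...even though position 1248 precedes 12345 on the chromosome.

    # Zero-padding positions to a fixed width restores genomic order.
    padded = ["hg18_chr1_pos%010d" % p for p in (12345, 12401, 1248)]
    print sorted(padded)
    # ['hg18_chr1_pos0000001248', 'hg18_chr1_pos0000012345',
    #  'hg18_chr1_pos0000012401']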
Other Possibly Useful Tools
• Pig
  • Designed by Yahoo for ad-hoc analysis of large data sets on a Hadoop cluster
  • raw = LOAD '080930_lane1.tab_delim.txt' USING PigStorage('\t') AS (chr, pos, read, ref);
  • filtered_reads = FILTER raw BY pos < 1500000;
  • ordered_reads = ORDER filtered_reads BY chr, pos;
  • unique_reads = DISTINCT ordered_reads;
  • just_reads = GROUP unique_reads BY (read, ref);
  • action_on_reads = FOREACH just_reads GENERATE foo($0), bar($1);
  • aggregate = JOIN action_on_reads BY $0, some_other_experiment BY $0;
  • STORE aggregate INTO '/tmp/aggregate-reads-of-interest' USING PigStorage();
• Hive
  • Designed by Facebook, also for large-scale analysis on a Hadoop cluster
  • More SQL/RDBMS-like than Pig, with a strict schema
  • Less dynamic, and less information available about the project
• Would anyone be interested in these types of tools, or are they too low-level?
• What types of things would be most helpful to everyone?
Generic Thank You Slide
• Thanks to:
  • Brian for help with the pipeline and data warehousing project
  • Bret for help with storage and networking
  • Mike Yourshaw for input on schemas and data modeling
  • Patrick @ Ctrl for help with rewiring the patch panel
  • Stan
  • Everyone else