1 / 56

Ensembl RNASeq Pipeline

Ensembl RNASeq Pipeline. Steve Searle. Outline of Talk. What the Ensembl RNASeq pipeline does Ensembl Pipeline infrastructure How we’ve modified the process as its scaled up Cloud version Tips for running large compute. Uses of RNASeq in Ensembl.

alvaro
Download Presentation

Ensembl RNASeq Pipeline

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EnsemblRNASeq Pipeline Steve Searle

  2. Outline of Talk • What the EnsemblRNASeq pipeline does • Ensembl Pipeline infrastructure • How we’ve modified the process as its scaled up • Cloud version • Tips for running large compute

  3. Uses of RNASeq in Ensembl • Ways we use RNASeq data in Ensembl: • Build complete gene set from scratch for individual or pooled RNASeq data sets • Incorporate into a new Ensembl gene set • Add novel models into a gene set • UTR • Filtering Models • Improve old gene sets • Expression levels (work in progress)

  4. RNASeq pipeline using BWA Align reads Generate proto-transcripts using Exonerate Find splice sites Model generation against UniProt Filtering per tissue & merged RNA-Seq models

  5. RNASeq Pipeline Alignment and Initial Processing • Reads are aligned to the genome with a quick un-gapped alignment using BWA • Transcriptome reads split over introns - we need to allow for this: • Align with up to 50% miss-matches to get intron spanning reads to align • The alignments are then processed to collapse overlapping reads into blocks representing exons • Read pairing is used (if available) to group the exon blocks into approximate transcript structures

  6. RNASeq Pipeline Intron Alignment • We align split reads using Exonerate – has a good splice model but is not a short read aligner • Intron alignment is made faster in 2 ways: • Don’t realign all the reads: • Introns are resolved by realigning partially aligned reads. • Use Exonerate word length to define which reads to realign • Align to a single transcript: • Reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript. • Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length. • Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment that can be incorporated into the final model.

  7. Final Models Collapsed Intron Features Split reads Exonerate spliced alignment Partially aligned reads

  8. 1 2 3 4 A* B C A* B D 1 2 3 4 5 Tissue 1 : 1-A-2-B-3 Tissue 2 : 1-A-2-B-3-D-4 1-A-2-C-4

  9. Display • Data visible in Ensembl • Transcript models • Intron features • BAM files of BWA alignments

  10. Nile tilapia: BAM files

  11. Advantage of using RNASeq in the genebuild Some species have little specific data • Eg. Nile tilapia • 131 proteins in Uniprot • 35 cDNAs, 119531 ESTs • Rely on data from related species RNASeq supplements the above data • Species-specific • Fills gaps, alternate splice sites, filtering

  12. Genebuild process Raw Computes Targetted stage Similarity stage cDNAs/ESTs Filtering UTR addition Merged RNA-Seq models Filtering Final gene set

  13. RNASeq helps with:1. Choice of splice site RNASeq Similarity models Ensembl model

  14. RNASeq helps with:2. UTR addition RNASeq model Similarity model Ensembl model

  15. RNASeq helps with:3. New models & alt. splicing RNASeq introns RNASeq model Similarity model Ensembl model

  16. Gene set Update Pipeline using RNASeq • RNASeq update pipeline is highly automated, many species take around a week to process • Split existing Ensembl gene set into single transcript genes • Transcript scoring / filtering • UTR addition done at the same time • Layering • avoiding pseudogenes • gap filling (includes some fragmented models) • Rebuild core set • Transfer pseudogenes + ncRNAs • Gene set update pipeline is fast and is using existing code in a novel way with very few alterations

  17. Dealing with biological data issues • Retainedintrons / sequence of partiallyprocessedtranscripts • Secondaryalignmentcanbuildintronswithinroughexons • Wide variation in expressionlevels • In highlyexpressedgenesbackgroundexpressioncanlead to joining of genes • In lowlyexpressedgenesneedmerged set to getgoodmodel • Mixture of alternateisoformscomplicatesmodelbuilding • Running on individualtissuesgivesdifferentpattern of genesexpression, whichcanmakemodelseasier to build for somegenes, BUT comes at cost of extracompute

  18. Retained intron problem

  19. Ensemblpipelineinfrastructure • Runs on top of a batchqueuingsystem (weuse LSF), controllingsubmission of jobs • Important for trackingjob status - failuresoccur (maybeusingtoomuchmemory, failedwrites, hardware problems) • Resubmitsfailedjobs with differentsubmissionparameters • Runs on large Sanger computefarm (1000s of cores, largeLustrefilesystem)

  20. Runnables and RunnableDBs Database • Runnables and RunnableDBs are wrappers RunnableDB Runnable RepeatMasker 24

  21. Runnables and RunnableDBs • RunnableDB • Fetches sequence from the database • Passes data to Runnable • Fetches output from Runnable • Writes output to db • Runnable • Wrapper • Is stand-alone from the db • Accepts data from RunnableDB • Passes data to Module eg. Exonerate • Fetches output • Parses output and creates Ensembl objects • Passes output to RunnableDB 25

  22. Core schema • Ensembl core schema • MySQL (MyISAM) • Tables containing biological data • Ensembl pipeline schema • Additional tables to the core tables • Tables containing job management data • API to access data • Uniform method of access to the data • Object-orientated Perl • cvs-controlled repositories: • ensembl • ensembl-pipeline • ensembl-analysis 26

  23. Core schema Gene (Bio::EnsEMBL::Gene) Gene_id Analysis_id Biotype Seq_region_start Seq_region_end Seq_region_strand 459 Targetted_genewise Protein_coding 1000 2000 -1 Exon (Bio::EnsEMBL::Exon) Exon_id Rank Seq_region_start Seq_region_end Seq_region_strand 3562 2 1500 1700 -1 Transcript (Bio::EnsEMBL::Transcript) ) Transcript_id Gene_id Analysis_id Biotype Seq_region_start Seq_region_end Seq_region_strand 783 459 Targetted_genewise Protein_coding 1000 2000 -1 Exon_transcript Transcript_id Exon_id 783 3562 1 n 1 1 1 n 27

  24. Pipeline schema • analysis: • analysis • rules: • rule_conditions • rule_goal • input_ids: • input_id_analysis • input_id_type_analysis • job-tracking tables: • job • job_status ensembl/sql/table.sql ensembl-pipeline/sql/table.sql 28

  25. Configs and rules • Configuration files • General configuration: General.pm • PERL5LIB • Job control configuration in • Bio/Ensembl/Pipeline/Config/BatchQueue.pm • Analysis-specific configuration in • e.g. Bio/Ensembl/Analysis/Config/GeneBuild/RefineSolexaGenes.pm • Database-specific configuration in • Bio/Ensembl/Analysis/Config/Databases.pm 29

  26. Configs and rules Analysis Analysis 1 Analysis 2 Analysis 3 SubmitContig RepeatMasker Pmatch Rule_goal Rule_conditions Analysis 2 Analysis 3 Goal 1 Goal 2 Goal 1 Goal 2 Goal 2 SubmitContig SubmitContig RepeatMasker • Rule tables 1 1 n 1 30

  27. Configs and rules Input_id_type Analysis 1 Analysis 2 Analysis 3 Analysis 4 CONTIG CONTIG CONTIG GENOME Analysis Analysis 1 Analysis 2 Analysis 3 Analysis 4 SubmitContig RepeatMasker Pmatch BestPmatch 1 1 31

  28. Input IDs • Tables • input_id_type_analysis: links analysis to type • input_id_analysis: links analysis to specific data • Input types • Sequence types e.g. Chromosome, Contig, Slice • specified using an Ensembl identifier • coord_system:asm:seq_region_id:start:end:strand • scaffold:gorGor1:Scaffold_1000:1:65750:1 • File eg. Fasta chunk of 100 cDNAs 32

  29. Input IDs Input_id_analysis Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Analysis RepeatMasker Analysis 1 Analysis 2 Analysis 3 SubmitContig RepeatMasker Pmatch Input_id_analysis Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Analysis 2 Analysis 2 Analysis 2 Input_id 1 Input_id 2 Input_id 3 33

  30. Jobs • A job is a single analysis processing a single input_id • Tables • job and job_status • Status of jobs • Pending, Reading, Running, Writing, AWOL, FAILED • Has is_current flag • Compute farm • Platform LSF used as scheduler 34

  31. Jobs on the farm Job Input_id_analysis Job_id 1 Job_id 2 Job_id 3 Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Job_id 4 … Analysis 2 … Input_id 1 … Analysis 2 Analysis 2 Analysis 2 Input_id 1 Input_id 2 Input_id 3 1 1 1 n Job_status Job_id 2 Job_id 2 Job_id 2 Job_id 3 Pending Running Writing Pending - - Is_current Is_current 35

  32. Jobs on the farm SubmitContig Contig Analysis executable parameters Job (input_id) Retry if failed Run on farm Store result to DB 36

  33. Rulemanager • Set_up_pipeline • Get all rules and input_ids Set-up • Foreach input_id_type: • Foreach input_id: • Run job if • Goal type eq input_id_type • Analysis incomplete • Rule conditions are met • Job has failed and can be retried • Batch size is met Infinite loop 37

  34. Dealing with Data GrowthRNASeqper release BAM file size across all tissues (GB) Ensembl release number

  35. Evolution of the RNASeqbuildsystem First version Exonerate for bothalignmentstages Storedalignment data in database - doesn'tscale  Second version Exonerate for bothalignmentstages Storedalignment data in BAM files, and onlygenemodels and supportingfeatures in database  Third version BWA for firststage Exonerate for splicedalignments, useeitherroughmodelsorgenomicrange Storedalignment data in BAM files, and onlygenemodels and supportingfeatures in database Scripts to automatemore of the pipelinesetupprocess Using multi-threadedPicard and Samtools for some BAM filemanipulationsteps

  36. Investigating effect of read depth on model generation • Converted from perl to C and further optimisation of the code • ~750x speed up • Allowing us to explore many different filtering combinations: • Read depth filters for consensus and non consensus introns • Intron length filters • Run on individual tissues (lower depth, different / simpler pattern of expression)

  37. OptimisingRNASeq model building step • Rewriting in C gives ~90x speed up without any other optimisation • Multi-threaded Samtools to speed up reading intron alignments from BAM file • Smaller memory because its C rather than perl • Use tcmalloc • Small increase in speed • easier control of releasing memory back to the OS • Database reading done using few large queries • speed up fetching • Reduce load on database • Sequence fetching optimised • fetch complete sequence once and then do substrings • Optimised longest ORF finding code (single pass 6 frame translation) • Better ordering of data to improve access speed • Many other optimisations where profiler identified slowness

  38. Multi-threaded version of Samtools • Modifications to samtools code by Nils Homer • https://github.com/nh13/samtools • Stops process being CPU bound when reading/writing BAM files due to compression/decompression overhead

  39. Performance comparisonRNASeq model building(Sheep chr 26 refine_all) CPU time Perl: 48586 sec C: 65.5 sec 741x Memory Perl: 2.6Gb C: 500Mb Run on: Intel Xeon X5650 @ 2.67GHz (12288 KB cache)

  40. Sheep chr 1 1-62Mb RNASeq models(72 sets of filtering params)

  41. Sheep 1:1-50Mb RNASeq gene models in 90 different samples

  42. RNASeq pipeline in the cloud Stephen Keenan in Ensembl is working to create a system to run the EnsemblRNASeq pipeline in the cloud

  43. Dynamic Cloud Architecture StarCluster master User datax 100 GBs Control Deploy x TBs shared across all nodes 7+ GB/s data throughput Storage: GlusterFS Compute: SUN Grid Engine HPC Cluster

  44. Sun Grid Engine • 400 Job Slots • 8 core / 8 Gb hosts • 300 Gb GlusterFS

More Related