Ensembl RNASeq Pipeline

EnsemblRNASeq Pipeline Steve Searle

Outline of Talk • What the EnsemblRNASeq pipeline does • Ensembl Pipeline infrastructure • How we’ve modified the process as its scaled up • Cloud version • Tips for running large compute

Uses of RNASeq in Ensembl • Ways we use RNASeq data in Ensembl: • Build complete gene set from scratch for individual or pooled RNASeq data sets • Incorporate into a new Ensembl gene set • Add novel models into a gene set • UTR • Filtering Models • Improve old gene sets • Expression levels (work in progress)

RNASeq pipeline using BWA Align reads Generate proto-transcripts using Exonerate Find splice sites Model generation against UniProt Filtering per tissue & merged RNA-Seq models

RNASeq Pipeline Alignment and Initial Processing • Reads are aligned to the genome with a quick un-gapped alignment using BWA • Transcriptome reads split over introns - we need to allow for this: • Align with up to 50% miss-matches to get intron spanning reads to align • The alignments are then processed to collapse overlapping reads into blocks representing exons • Read pairing is used (if available) to group the exon blocks into approximate transcript structures

RNASeq Pipeline Intron Alignment • We align split reads using Exonerate – has a good splice model but is not a short read aligner • Intron alignment is made faster in 2 ways: • Don’t realign all the reads: • Introns are resolved by realigning partially aligned reads. • Use Exonerate word length to define which reads to realign • Align to a single transcript: • Reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript. • Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length. • Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment that can be incorporated into the final model.

Final Models Collapsed Intron Features Split reads Exonerate spliced alignment Partially aligned reads

1 2 3 4 A* B C A* B D 1 2 3 4 5 Tissue 1 : 1-A-2-B-3 Tissue 2 : 1-A-2-B-3-D-4 1-A-2-C-4

Display • Data visible in Ensembl • Transcript models • Intron features • BAM files of BWA alignments

Nile tilapia: BAM files

Advantage of using RNASeq in the genebuild Some species have little specific data • Eg. Nile tilapia • 131 proteins in Uniprot • 35 cDNAs, 119531 ESTs • Rely on data from related species RNASeq supplements the above data • Species-specific • Fills gaps, alternate splice sites, filtering

Genebuild process Raw Computes Targetted stage Similarity stage cDNAs/ESTs Filtering UTR addition Merged RNA-Seq models Filtering Final gene set

RNASeq helps with:1. Choice of splice site RNASeq Similarity models Ensembl model

RNASeq helps with:2. UTR addition RNASeq model Similarity model Ensembl model

RNASeq helps with:3. New models & alt. splicing RNASeq introns RNASeq model Similarity model Ensembl model

Gene set Update Pipeline using RNASeq • RNASeq update pipeline is highly automated, many species take around a week to process • Split existing Ensembl gene set into single transcript genes • Transcript scoring / filtering • UTR addition done at the same time • Layering • avoiding pseudogenes • gap filling (includes some fragmented models) • Rebuild core set • Transfer pseudogenes + ncRNAs • Gene set update pipeline is fast and is using existing code in a novel way with very few alterations

Dealing with biological data issues • Retainedintrons / sequence of partiallyprocessedtranscripts • Secondaryalignmentcanbuildintronswithinroughexons • Wide variation in expressionlevels • In highlyexpressedgenesbackgroundexpressioncanlead to joining of genes • In lowlyexpressedgenesneedmerged set to getgoodmodel • Mixture of alternateisoformscomplicatesmodelbuilding • Running on individualtissuesgivesdifferentpattern of genesexpression, whichcanmakemodelseasier to build for somegenes, BUT comes at cost of extracompute

Retained intron problem

Ensemblpipelineinfrastructure • Runs on top of a batchqueuingsystem (weuse LSF), controllingsubmission of jobs • Important for trackingjob status - failuresoccur (maybeusingtoomuchmemory, failedwrites, hardware problems) • Resubmitsfailedjobs with differentsubmissionparameters • Runs on large Sanger computefarm (1000s of cores, largeLustrefilesystem)

Runnables and RunnableDBs Database • Runnables and RunnableDBs are wrappers RunnableDB Runnable RepeatMasker 24

Runnables and RunnableDBs • RunnableDB • Fetches sequence from the database • Passes data to Runnable • Fetches output from Runnable • Writes output to db • Runnable • Wrapper • Is stand-alone from the db • Accepts data from RunnableDB • Passes data to Module eg. Exonerate • Fetches output • Parses output and creates Ensembl objects • Passes output to RunnableDB 25

Core schema • Ensembl core schema • MySQL (MyISAM) • Tables containing biological data • Ensembl pipeline schema • Additional tables to the core tables • Tables containing job management data • API to access data • Uniform method of access to the data • Object-orientated Perl • cvs-controlled repositories: • ensembl • ensembl-pipeline • ensembl-analysis 26

Core schema Gene (Bio::EnsEMBL::Gene) Gene_id Analysis_id Biotype Seq_region_start Seq_region_end Seq_region_strand 459 Targetted_genewise Protein_coding 1000 2000 -1 Exon (Bio::EnsEMBL::Exon) Exon_id Rank Seq_region_start Seq_region_end Seq_region_strand 3562 2 1500 1700 -1 Transcript (Bio::EnsEMBL::Transcript) ) Transcript_id Gene_id Analysis_id Biotype Seq_region_start Seq_region_end Seq_region_strand 783 459 Targetted_genewise Protein_coding 1000 2000 -1 Exon_transcript Transcript_id Exon_id 783 3562 1 n 1 1 1 n 27

Pipeline schema • analysis: • analysis • rules: • rule_conditions • rule_goal • input_ids: • input_id_analysis • input_id_type_analysis • job-tracking tables: • job • job_status ensembl/sql/table.sql ensembl-pipeline/sql/table.sql 28

Configs and rules • Configuration files • General configuration: General.pm • PERL5LIB • Job control configuration in • Bio/Ensembl/Pipeline/Config/BatchQueue.pm • Analysis-specific configuration in • e.g. Bio/Ensembl/Analysis/Config/GeneBuild/RefineSolexaGenes.pm • Database-specific configuration in • Bio/Ensembl/Analysis/Config/Databases.pm 29

Configs and rules Analysis Analysis 1 Analysis 2 Analysis 3 SubmitContig RepeatMasker Pmatch Rule_goal Rule_conditions Analysis 2 Analysis 3 Goal 1 Goal 2 Goal 1 Goal 2 Goal 2 SubmitContig SubmitContig RepeatMasker • Rule tables 1 1 n 1 30

Configs and rules Input_id_type Analysis 1 Analysis 2 Analysis 3 Analysis 4 CONTIG CONTIG CONTIG GENOME Analysis Analysis 1 Analysis 2 Analysis 3 Analysis 4 SubmitContig RepeatMasker Pmatch BestPmatch 1 1 31

Input IDs • Tables • input_id_type_analysis: links analysis to type • input_id_analysis: links analysis to specific data • Input types • Sequence types e.g. Chromosome, Contig, Slice • specified using an Ensembl identifier • coord_system:asm:seq_region_id:start:end:strand • scaffold:gorGor1:Scaffold_1000:1:65750:1 • File eg. Fasta chunk of 100 cDNAs 32

Input IDs Input_id_analysis Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Analysis RepeatMasker Analysis 1 Analysis 2 Analysis 3 SubmitContig RepeatMasker Pmatch Input_id_analysis Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Analysis 2 Analysis 2 Analysis 2 Input_id 1 Input_id 2 Input_id 3 33

Jobs • A job is a single analysis processing a single input_id • Tables • job and job_status • Status of jobs • Pending, Reading, Running, Writing, AWOL, FAILED • Has is_current flag • Compute farm • Platform LSF used as scheduler 34

Jobs on the farm Job Input_id_analysis Job_id 1 Job_id 2 Job_id 3 Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Analysis 1 Analysis 1 Analysis 1 Input_id 1 Input_id 2 Input_id 3 Job_id 4 … Analysis 2 … Input_id 1 … Analysis 2 Analysis 2 Analysis 2 Input_id 1 Input_id 2 Input_id 3 1 1 1 n Job_status Job_id 2 Job_id 2 Job_id 2 Job_id 3 Pending Running Writing Pending - - Is_current Is_current 35

Jobs on the farm SubmitContig Contig Analysis executable parameters Job (input_id) Retry if failed Run on farm Store result to DB 36

Rulemanager • Set_up_pipeline • Get all rules and input_ids Set-up • Foreach input_id_type: • Foreach input_id: • Run job if • Goal type eq input_id_type • Analysis incomplete • Rule conditions are met • Job has failed and can be retried • Batch size is met Infinite loop 37

Dealing with Data GrowthRNASeqper release BAM file size across all tissues (GB) Ensembl release number

Evolution of the RNASeqbuildsystem First version Exonerate for bothalignmentstages Storedalignment data in database - doesn'tscale Second version Exonerate for bothalignmentstages Storedalignment data in BAM files, and onlygenemodels and supportingfeatures in database Third version BWA for firststage Exonerate for splicedalignments, useeitherroughmodelsorgenomicrange Storedalignment data in BAM files, and onlygenemodels and supportingfeatures in database Scripts to automatemore of the pipelinesetupprocess Using multi-threadedPicard and Samtools for some BAM filemanipulationsteps

Investigating effect of read depth on model generation • Converted from perl to C and further optimisation of the code • ~750x speed up • Allowing us to explore many different filtering combinations: • Read depth filters for consensus and non consensus introns • Intron length filters • Run on individual tissues (lower depth, different / simpler pattern of expression)

OptimisingRNASeq model building step • Rewriting in C gives ~90x speed up without any other optimisation • Multi-threaded Samtools to speed up reading intron alignments from BAM file • Smaller memory because its C rather than perl • Use tcmalloc • Small increase in speed • easier control of releasing memory back to the OS • Database reading done using few large queries • speed up fetching • Reduce load on database • Sequence fetching optimised • fetch complete sequence once and then do substrings • Optimised longest ORF finding code (single pass 6 frame translation) • Better ordering of data to improve access speed • Many other optimisations where profiler identified slowness

Multi-threaded version of Samtools • Modifications to samtools code by Nils Homer • https://github.com/nh13/samtools • Stops process being CPU bound when reading/writing BAM files due to compression/decompression overhead

Performance comparisonRNASeq model building(Sheep chr 26 refine_all) CPU time Perl: 48586 sec C: 65.5 sec 741x Memory Perl: 2.6Gb C: 500Mb Run on: Intel Xeon X5650 @ 2.67GHz (12288 KB cache)

Sheep chr 1 1-62Mb RNASeq models(72 sets of filtering params)

Sheep 1:1-50Mb RNASeq gene models in 90 different samples

RNASeq pipeline in the cloud Stephen Keenan in Ensembl is working to create a system to run the EnsemblRNASeq pipeline in the cloud

Dynamic Cloud Architecture StarCluster master User datax 100 GBs Control Deploy x TBs shared across all nodes 7+ GB/s data throughput Storage: GlusterFS Compute: SUN Grid Engine HPC Cluster

Sun Grid Engine • 400 Job Slots • 8 core / 8 Gb hosts • 300 Gb GlusterFS

Ensembl RNASeq Pipeline

Ensembl RNASeq Pipeline

Presentation Transcript

Ensembl Compara Perl API

Sequence Variation in Ensembl

Investigating Genomes with Ensembl

Genomes with Ensembl

RNAseq Applications in Genome Studies

RNASeq Gary Williams

RNAseq

RNAseq analysis

The Ensembl Variation API

Ensembl Funcgen Perl API

Ensembl Genome Annotation Overview

The Ensembl Database Schema

Ensembl Developers Meeting

HSC RNAseq Initial Results

Introduction to RNAseq

Ensembl

RNAseq practical course software requirements

EnsEMBL

Genomic Database - Ensembl

Sequence Variation in Ensembl

Ensembl website refactoring James Smith Js5@sanger.ac.uk Ensembl Web Team Project Leader

Ensembl