330 likes | 339 Views
A comprehensive guide on analyzing deep sequencing data using various techniques such as mutation and SNP identification, gene-disease linkage, pathogen identification, transcriptome analysis, DNA methylation study, chromatin study, and more.
E N D
Mouse Genetics January 24, 2013, 14:00–16:30 (A tool kit for) Deep Seq Data Analysis Christophe.antoniewski@upmc.fr http://drosophile.org
Why deep seq analyses ? • Your project involves: • Mutation and SNP identification or analysis (genome re-sequencing) • Gene/Disease Linkage (genome re-sequencing) • Pathogen identification (de novo sequence assembly or re-sequencing) • transcriptome analysis (RNAseq) • DNA methylation study (medip-seq) • Chromatin study (ChIPseq) • Transcription factor study (ChIPseq) • miRNAs, siRNA, piRNA, tRF, etc... (small RNA seq) • Single cell transcriptome analysis Qualitative information Deep seq Quantitative information
Sequencing Technologies : Focus on Illumina technology « Library » « Clusters» Cluster Sequences
small RNAseq library preparation (Directional) (Biases) 20-30nt RNA gel purification Library “Bar coding”
ChIPseq library preparation (Non Directional)
What can I do with my sequence reads ? • Locus discovery/mutation discovery/Splicing annotation • Annotation & visualization • Read quantitative profiling (Transcriptome, chromatin profiling, etc..) • Statistics • Structure analysis of precursors, signatures… • Maths & Statistics • …
What am I going to sequence ? Flowchart of a sequencing project Platform Selection Specific benefits (Read length, single or paired ends, number of reads) Inherent biases Whole genome Whole exome Target enrichment Size selection Amplification Single Cell Protocol Library Preparation Sequencing Number of Cycles Number of lanes Quality Control Adapter Clipping Quality trimming Contaminant and Error identification Bowtie BWA…… Nature Methods 2009 P Flicek & E Birney Velvet SSAKE…… PLoS ONE 6(3) Zhang W, Chen J, et al. (2011) Alignment Assembly • Visualization & Statistics • Normalization (library comparison) • Peak finding (Binding sites, Breakpoints, etc…) • Differential Calling (expression, variants, etc) • Think to the number of replicate when starting R & Open Source software tools
A case study: miRNAs (and other small RNAs) Hen1 met + snoRNA, tRNA, rRNA fragments
small RNAseq library preparation (Directional) (Biases) 20-30nt RNA gel purification Library “Bar coding”
Basic Material • A sequence file (fastq format) • A computer with enough RAM (8 Gigabytes is a good start) • A Unix compliant Operating System + a bit of « basic know how » • A couple of very useful softwares with Graphic User Interface (GUI) • TextWrangler, an advanced text editor with RegEx integration • R (for statistics and, more importantly, Graphics) • … • GALAXY is an (our) option • Knowledge of at least one programming language • Python, Perl, Java, C++…
What is this big* fastq file containning ? • * Size limit to open a text file with a text editor (~1.2 Gb) • Unix Terminal . • more <path/to/the/file> $more GKG-13.fastq @HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA +HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh @HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1 TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC +HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1 ]B]VWaaaaaagggfggggggcggggegdgfgeggbab @HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1 TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA +HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1 aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh @HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1 TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC +HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1 aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff @HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1 TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT +HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1 aB\^^eeeeegcggfffffffcfffgcgcfffffR^^] @HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1 GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC +HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1 aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^ Header Sequence Header Quality (ascii encoded) … … …
How many sequence reads in my file ? • wc - l <path/to/my/file> $wc -l GKG-13.fastq 25703828 GKG-13.fastq >>> 25 703 828 / 4 6 425 957 sequence reads $fastq_complexity.py GKG-13.fastq 6 425 957 reads 550 706 distinct sequences 0.085700 complexity #!/usr/bin/python import sys readDic= {} Nbre_reads = 0 Nbre_lines = 0 F = open(sys.argv[1]) for line in F: Nbre_lines += 1 if Nbre_lines % 4 == 2: Nbre_reads += 1 readDic[line] = readDic.get(line, 0) + 1 F.close() print "%s reads" % Nbre_reads print "%s distinct sequences" % (len(readDic)) print "%f complexity" % (len(readDic)/float(Nbre_reads))
Are my sequence reads containing the adapter ? My 3’ adaptater: CTGTAGGCACCATCAAT • cat <path/file> | grep CTGTAGG | wc –l • grep -c "CTGTAGG" <path/file> lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep CTGTAGG | wc -l 6355061 lbcd-05:GKG13demo deepseq$ grep -c "CTGTAGG" GKG-13.fastq 6355061 6 355 061 out of 6 425 957 sequences … not bad (98.8%) A contrario lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep ATCTCGT| wc -l 308
Take home message n°1 Unix Operating Systems already contain powerful native tools for text analysis Regular expression $ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l wc with –l option counts the lines Outputs the content of a file, line by line perl interpreter is called with –ne options (loop & execute) The output is passed to the input of the next command The output is passed to the input of the next command In line perl code
Quality Control. Can I trust my sequences ? http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Demo with the GUI version
Quality Control. Can I trust my sequences ? fastQC in GALAXY
How to clip the adapter ? http://hannonlab.cshl.edu/fastx_toolkit/index.html 3’ adapter: CTGTAGGCACCATCAAT fastq_to_fasta -r -n -i GKG-13.fastq | fastx_clipper -a CTGTAGGCACCATCAAT -l 18 -o GKG-13_clip-pipe.fasta
http://bowtie-bio.sourceforge.net/ Bowtie aligns reads on indexed genomes • Download, install Bowtie and rtfm. • Download your genome (format FASTA) • Build the Bowtie index using bowtie-build deepseq$ bowtie-build fasta_libraries/dmel-all-chromosome-r5.37.fasta dmel-r5.37 ~nn min (but indexed references available) deepseq$ ls –laht -rw-r--r-- 1 deepseq staff 49M Mar 24 17:24 dmel-r5.37.rev.1.ebwt -rw-r--r-- 1 deepseq staff 19M Mar 24 17:24 dmel-r5.37.rev.2.ebwt -rw-r--r-- 1 deepseq staff 49M Mar 24 17:20 dmel-r5.37.1.ebwt -rw-r--r-- 1 deepseq staff 19M Mar 24 17:20 dmel-r5.37.2.ebwt -rw-r--r-- 1 deepseq staff 331K Mar 24 17:16 dmel-r5.37.3.ebwt -rw-r--r-- 1 deepseq staff 39M Mar 24 17:16 dmel-r5.37.4.ebwt
A bowtie alignment (Demo on Mac) Deepseq$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated # reads processed: 5997502 # reads with at least one reported alignment: 5045151 (84.12%) # reads that failed to align: 952351 (15.88%) Reported 5045151 alignments to 1 output stream(s)
Bowtie outputs deepseq$ ls -laht -rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated -rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa -rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa Tabular alignment report deepseq$ more GKG13_bowtie_output.tabulated 21 + 2L 20487495 TGGAATGTAAAGAAGTATGGAG 30 - 3L 15836559 GTGAATTCTCCCAGTGCCAAG 25 + 3R 5916902 TGAACACAGCTGGTGGTATCC 23 - 2L 11953462 CCCGTGAATTCTTCCAGTGCCATT 27 + 3R 5916902 TGAACACAGCTGGTGGTATC 26 - 3R 9289997 TCCTGCGGCACTAGTACTTA 18 - 2L 11953465 GTGAATTCTTCCAGTGCCATT 22 - 3R 8377246 ATTGCTGGAATCAAGTTGCTGAC 20 + 3L 11650036 TTTGTGACCGACACTAACGGGTA 24 + 2R 16493585 TGGAAGACTAGTGATTTTGTT 28 + 3L 10358380 TAGGAACTTCATACCGTGCTCT 35 + X 18022302 CTTGTGCGTGTGACAGCGGCT 41 - 3RHet 138608 TGGCGACCGTGACAGGACCCG 42 + 3R 5916902 TGAACACAGCTGGTGGTATCC Aligned sequences Unaligned sequences deepseq$ more droso_matched_GKG-13.fa >21 TGGAATGTAAAGAAGTATGGAG >26 TAAGTACTAGTGCCGCAGGA >24 TGGAAGACTAGTGATTTTGTT >23 AATGGCACTGGAAGAATTCACGGG >27 TGAACACAGCTGGTGGTATC deepseq$ more unmatched_GKG13.fa >29 AGGGGGCTATTTCACTACTGGA >33 CGATGATGACGGTACCCGTAGA >37 GCTAGTCGGTACTTGAAAC >59 TGGTTGCAATAGCTTCTGGCGGA >61 GATGAGTGCTAGATGTAGGGA
A pipeline for small RNA annotation (see in GED Galaxy) Sequence reads (fasta format) Pre-miRNAs (miRBase) Matched reads (fasta) Matched reads (fasta) Matched reads (fasta) Matched reads (fasta) Matched reads (fasta) Matched reads (fasta) Read Count Read Count Read Count Read Count Read Count Read Count Bowtie Unmatched reads Bowtie Non coding RNAs Unmatched reads hierarchical annotation of sequence datasets Bowtie Transposons Unmatched reads Bowtie Genes Unmatched reads Bowtie Intergenic regions Unmatched reads Bowtie Viruses, transgenes, etc… Remaining unmatched sequences
samtools http://samtools.sourceforge.net/ • Sam format • Bam format (for Genome Browsers) • Sorted • Indexed • Compressed Preparation of a BAM file and its associated index $ bowtie~/bin/bowtie-0.12.7/indexes/dmel-f GKG-13_clip-pipe.fasta -v 1 -M 1 --best -p 2 -S | samtoolsview -bS -o GKG-13_clip-pipe.fasta.bam - ; samtools sort GKG-13_clip-pipe.fasta.bam GKG-13_clip-pipe.fasta.bam.sorted ; samtools index GKG-13_clip-pipe.fasta.bam.sorted.bam 306K GKG-13_clip-pipe.fasta.bam.sorted.bam.bai 42M GKG-13_clip-pipe.fasta.bam.sorted.bam 80M GKG-13_clip-pipe.fasta.bam ~3 min
Read visualization in a Genome Browser Upload of BAM file to a remote server (amazon cloud) Passing the URL to Ensembl (Gbrowse, Modencode, etc..)
Naive and primed murine pluripotent stem cells have distinct miRNA signatures • Jouneau (INRA Jouy en Josas) • E. Heard (Institut Curie) • Antoniewski (Institut Pasteur) • M. Cohen-Tannoudji (Institut Pasteur) ESC1 ESC2 EpiSC3 EpiSC1 EpiSC2
miRNA profiling deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt Sequence reads (fasta format) Pre-miRNAs (miRBase) Bowtie Bowtie Output Parsing • Read maps for all miRNAs • Hit list for miR_5p and miR_3p miR profiling / hit list agregation
Differential Calling deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt Sequence reads (fasta format) Bowtie Pre-miRNAs (miRBase) Bowtie Output Parsing • Read maps for all miRNAs • Hit list for miR_5p et miR_3p http://www.r-project.org/ DESeq Heatplus (Bioconductor)
touRism Load DESeq, gplots and RcolorBrewer • countsTable <- read.delim( "~/Documents/Pasteur_DEMO/mouse_hits.txt", header=TRUE, stringsAsFactors=TRUE ) • head(countsTable) • rownames(countsTable)<- countsTable$gene • countsTable <- countsTable[ , -1 ] • head(countsTable) • summary(countsTable) • plot(countsTable) • plot(log(countsTable,10)) • conds <- c( "EPI", "EPI", "EPI", "ES", "ES" ) • cds <- newCountDataSet( countsTable, conds ) • cds <- estimateSizeFactors( cds ) • sizeFactors( cds ) • cds = estimateDispersions( cds, method="pooled") • vsd <- getVarianceStabilizedData( cds ) • dists <- dist( t( vsd ) ) • heatmap( as.matrix( dists ), symm=TRUE, margins=c(12,12),cexRow=1, cexCol=1) • SampleVar<-apply(vsd,1,var) • vsd2<-cbind(vsd,SampleVar) • vsd3<-vsd2[order(vsd2[,6], decreasing=TRUE),] • head(vsd3) • vsd3<-head(vsd3,100) • vsd3<-vsd3[,-6] • head(vsd3) • heatmap.2(vsd3, col=brewer.pal(11, "RdBu"), scale="none", trace="none", margins=c(3,45), ,cexRow=0.7, cexCol=1, dendrogram="column", density.info="none", keysize=0.7) • cds = estimateDispersions( cds, method="per-condition", sharingMode="fit-only") • res = nbinomTest( cds, "EPI", "ES" ) • resNA = res[-which(is.na(res[,8])),] • resNA[order(resNA[,8]), ]
Deep Seq Data Analysis, Final Take Home Messages • Think to your deep seq replicates at starting • Keep a hand on your data, from « fastq stage » • Keep a hand on the analysis because this is your project • Always keep an eye on « Normalization » and « Differential » • Don’t be afraid by bioinformatics, but don’t reinvent the wheel • It’s open source, open manual • It’s not magic, yes you can • It’s fun • You cannot escape, so take it easy.