Bioinformatics for Stem Cell Lecture 2. Debashis Sahoo , PhD. Outline. Lecture 1 Recap Multivariate analysis Microarray data analysis Boolean analysis Sequencing data analysis. Multivariate Analysis. Identify Markers of Human Colon Cancer and Normal Colon. Piero Dalerba. Tomer Kalisky.
Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD
Outline • Lecture 1 Recap • Multivariate analysis • Microarray data analysis • Boolean analysis • Sequencing data analysis
Identify Markers of Human Colon Cancer and Normal Colon Piero Dalerba Tomer Kalisky
Hierarchical Clustering • Cluster 3.0 • http://bonsai.hgc.jp/~mdehoon/software/cluster/ • Distance metric • Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation • Linkage • Single, complete, average, median, centroid
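To make the distance and linkage choices concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is not Cluster 3.0's implementation: the function names (`euclidean`, `manhattan`, `pearson_distance`, `agglomerate`) are illustrative, only three of the listed linkages are covered, and the naive search over all cluster pairs is only practical for small inputs.

```python
from math import sqrt
from statistics import mean

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_distance(x, y):
    # 1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

# Linkage = how pairwise distances between two clusters are summarized.
LINKAGE = {"single": min, "complete": max, "average": mean}

def agglomerate(points, n_clusters, dist=euclidean, linkage="average"):
    """Merge the two closest clusters until n_clusters remain; returns lists of point indices."""
    link = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link([dist(points[p], points[q])
                          for p in clusters[i] for q in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(agglomerate(points, 2))
```

Swapping `dist=pearson_distance` groups profiles by shape rather than magnitude, which is often the reason correlation-based distances are chosen for expression data.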
Multivariate Analysis - PCA Principal Component Analysis X = data matrix V = loading matrix U = scores matrix
Fundamentals of PCA • Reduces dimensions of the data • PCA uses orthogonal linear transformation • First principal component has the largest possible variance. • Exploratory tool to uncover unknown trends in the data
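The relationship between the data, loading, and scores matrices can be written out as follows. This is a sketch under two assumptions not stated on the slides: X is column-centered with n samples in rows, and PCA is computed through the singular value decomposition. The symbols W, Σ, and σ_k are introduced here for illustration.

```latex
\begin{aligned}
X &= W \Sigma V^{\top} && \text{(SVD of the centered data matrix)} \\
U &= W \Sigma = X V && \text{(scores: samples projected onto the loadings)} \\
X &= U V^{\top} && \text{(scores times transposed loadings)} \\
\operatorname{Var}(\mathrm{PC}_k) &= \frac{\sigma_k^{2}}{n-1}, && \sigma_1 \ge \sigma_2 \ge \dots \ge 0
\end{aligned}
```

Because the singular values are ordered, the first column of U carries the largest variance, which is the property stated for the first principal component.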
Microarray • Spotted vs. in situ • Two channel vs. one channel • Probe vs. probeset vs. gene
Quantile Normalization • Sort the values of each array (#1, #2, #3) • Average the sorted values across arrays to obtain SortedAvg • Val(Probe_i) = SortedAvg[Rank(Probe_i)]
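The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched in a few lines of plain Python. The function name `quantile_normalize` is illustrative, and ties are broken by input order here, which is a simplification (some implementations average tied ranks instead).

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of probe values per array."""
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # SortedAvg[r] = mean of the r-th smallest value across all arrays.
    sorted_avg = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]  # Val(Probe_i) = SortedAvg[Rank(Probe_i)]
        normalized.append(out)
    return normalized

arrays = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
for row in quantile_normalize(arrays):
    print(row)
```

After normalization every array has exactly the same set of values, so the arrays share one distribution while each probe keeps its within-array rank.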
Invariant Set Normalization (Figure panels: Before Normalization, Invariant set, After Normalization)
SAM Two-Class Unpaired 1. Assign experiments to two groups, e.g., in an expression matrix of Genes 1 to 6 by Experiments 1 to 6, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B. 2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?
SAM Two-Class Unpaired: Permutation tests i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene. ii) Rank the genes in ascending order of their d-values. iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B, respectively. Compute the d-value for each randomized gene. (Figure: Gene 1 across Exp 1 to 6 under the original grouping and under a randomized grouping.)
SAM Two-Class Unpaired iv) Rank the permuted d-values of the genes in ascending order v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene. vi) Plot the observed d-values vs. the expected d-values
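Steps i) through v) can be sketched in plain Python as below. This is an illustration rather than the SAM software itself: the d-value uses the usual pooled-standard-deviation form with a small constant s0 added to the denominator, and s0 = 0.1 is a placeholder (SAM estimates it from the data). The names `d_value` and `sam_observed_expected` are hypothetical, and each group needs at least two experiments.

```python
import random
from math import sqrt
from statistics import mean, variance

def d_value(a, b, s0=0.1):
    """d = (mean_B - mean_A) / (s + s0), with s the pooled standard error."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    s = sqrt(pooled * (1 / na + 1 / nb))
    return (mean(b) - mean(a)) / (s + s0)

def sam_observed_expected(matrix, in_group_b, n_perm=200, s0=0.1, seed=0):
    """matrix: one row per gene; in_group_b: one bool per experiment.
    Returns (gene_index, observed_d, expected_d) tuples in ascending observed d."""
    rng = random.Random(seed)

    def all_d(labels):
        ds = []
        for row in matrix:
            a = [v for v, is_b in zip(row, labels) if not is_b]
            b = [v for v, is_b in zip(row, labels) if is_b]
            ds.append(d_value(a, b, s0))
        return ds

    observed = all_d(in_group_b)                      # step i
    order = sorted(range(len(matrix)), key=lambda g: observed[g])  # step ii
    rank_sums = [0.0] * len(matrix)
    labels = list(in_group_b)
    for _ in range(n_perm):                           # step v: repeat iii and iv
        rng.shuffle(labels)                           # step iii: group sizes preserved
        for rank, d in enumerate(sorted(all_d(labels))):  # step iv
            rank_sums[rank] += d
    expected = [s / n_perm for s in rank_sums]        # average permuted d per rank
    return [(g, observed[g], expected[r]) for r, g in enumerate(order)]

matrix = [
    [1.0, 1.2, 5.0, 5.1, 0.9, 5.3],    # higher in group B
    [2.0, 2.1, 2.2, 1.9, 2.05, 2.15],  # unchanged
    [6.0, 5.5, 1.0, 1.2, 5.8, 0.9],    # higher in group A
]
in_group_b = [False, False, True, True, False, True]  # Exp 3, 4, 6 in group B
for gene, obs, exp in sam_observed_expected(matrix, in_group_b):
    print(gene, round(obs, 2), round(exp, 2))
```

Plotting the second column against the third gives the observed-vs-expected plot of step vi).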
SAM Two-Class Unpaired • The more a gene deviates from the “observed d = expected d” line, the more likely it is to be significant. • Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene) whose observed d-value differs from its expected d-value by at least delta is considered significant. • Significant positive genes: mean expression of group B > mean expression of group A. • Significant negative genes: mean expression of group A > mean expression of group B. (Figure: observed vs. expected d-values with the “observed d = expected d” line.)
GenePattern http://genepattern.broadinstitute.org/
AutoSOME http://jimcooperlab.mcdb.ucsb.edu/autosome/ Aaron Newman Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117
Gene Set Analysis • Your gene set • Compute enrichment in pathways and networks: Cell Cycle, Transcription factor, TGF-beta Signaling Pathway, Wnt-signaling Pathway, Protein-protein interaction network • Tools: GSEA, DAVID, Toppfun, MSigDB, and STRING
Boolean Implication • Analyze pairs of genes. • Analyze the four different quadrants. • Identify sparse quadrants. • Record the Boolean relationships. • If ACPP high, then GABRB1 low • If GABRB1 high, then ACPP low (Figure: GABRB1 vs. ACPP expression across 45,000 Affymetrix microarrays) [Sahoo et al. Genome Biology 08]
Threshold Calculation • A threshold is determined for each gene. • The arrays are sorted by gene expression. • StepMiner is used to determine the threshold. (Figure: CDH expression over sorted arrays, showing the High, Intermediate, and Low regions around the threshold) [Sahoo et al. 07]
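The idea of fitting a step to sorted expression values can be sketched as follows. This is a simplified stand-in, not the StepMiner program: it assumes a single low-to-high step chosen by least squares and takes the threshold as the midpoint between the two fitted levels. The function name `step_threshold` is hypothetical.

```python
def step_threshold(values):
    """Fit one step to the sorted values; return the midpoint of the step."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    best_sse, best_threshold = None, None
    left_sum = left_sq = 0.0
    for k in range(1, n):              # step placed between xs[k-1] and xs[k]
        left_sum += xs[k - 1]
        left_sq += xs[k - 1] ** 2
        low = left_sum / k                      # fitted level before the step
        high = (total - left_sum) / (n - k)     # fitted level after the step
        # Sum of squared errors of the two-level fit.
        sse = (left_sq - k * low * low) + ((total_sq - left_sq) - (n - k) * high * high)
        if best_sse is None or sse < best_sse:
            best_sse, best_threshold = sse, (low + high) / 2
    return best_threshold

expression = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
print(step_threshold(expression))
```

Arrays above the threshold would be called high and those below it low; how wide the intermediate band around the threshold should be is a separate choice that this sketch does not make.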
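BooleanNet Statistics
• Quadrant counts for genes A and B: $a_{00}$ (A low, B low), $a_{01}$ (A low, B high), $a_{10}$ (A high, B low), $a_{11}$ (A high, B high)
• $nA_{low} = a_{00} + a_{01}$, $nB_{low} = a_{00} + a_{10}$
• $total = a_{00} + a_{01} + a_{10} + a_{11}$, $observed = a_{00}$
• $expected = \left(\frac{nA_{low}}{total} \cdot \frac{nB_{low}}{total}\right) \cdot total$
• $\text{statistic} = \frac{expected - observed}{\sqrt{expected}}$
• $\text{error rate} = \frac{1}{2}\left(\frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}}\right)$
• Boolean Implication = (statistic > 3, error rate < 0.1)
[Sahoo et al. Genome Biology 08]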
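As a worked example of these statistics, the sketch below tests whether the low-low quadrant is sparse for one gene pair. The function name `low_low_sparsity` and the example counts are made up for illustration; the thresholds 3 and 0.1 are the ones given on the slide.

```python
from math import sqrt

def low_low_sparsity(a00, a01, a10, a11, min_statistic=3.0, max_error=0.1):
    """a_ij = number of arrays with gene A in state i and gene B in state j (0 = low, 1 = high).
    Returns (statistic, error_rate, is_sparse) for the low-low quadrant."""
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    total = a00 + a01 + a10 + a11
    if n_a_low == 0 or n_b_low == 0:
        raise ValueError("each gene needs at least one low array")
    observed = a00
    expected = (n_a_low / total) * (n_b_low / total) * total
    statistic = (expected - observed) / sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    is_sparse = statistic > min_statistic and error_rate < max_error
    return statistic, error_rate, is_sparse

# Only 2 of 200 arrays have both genes low: expected = 50, observed = 2.
print(low_low_sparsity(2, 98, 98, 2))
# Independent genes: observed equals expected, so the quadrant is not sparse.
print(low_low_sparsity(50, 50, 50, 50))
```

A sparse low-low quadrant corresponds to the implication "if A low, then B high"; the other quadrants can be tested the same way by relabeling the counts.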
Six Boolean Implications [Sahoo et al. Genome Biology 08]
MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes) [Sahoo et al. PNAS 2010]
B Cell Genes KIT CD19 Boolean Implications [Sahoo et al. PNAS 2010]
Jun Seita http://gexc.stanford.edu [Seita, Sahoo et al. PLoS ONE, 2012]
Sequencing Data Format
FASTA
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
FASTQ
@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba
FASTQ quality encodings
• S - Sanger: Phred+33, (0, 40)
• X - Solexa: Solexa+64, (-5, 40)
• I - Illumina 1.3+: Phred+64, (0, 40)
• J - Illumina 1.5+: Phred+64, (3, 40)
• L - Illumina 1.8+: Phred+33, (0, 41)
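To show how the quality line maps to scores, here is a minimal sketch that reads four-line FASTQ records and decodes the quality string with a given offset. The helper names (`read_fastq`, `decode_quality`, `error_probability`) are illustrative, and the reader assumes unwrapped records of exactly four lines. The offset must be supplied by the caller; the example read is decoded with 64 on the assumption that it uses one of the Phred+64 encodings listed above.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4:
        raise ValueError("FASTQ records must have 4 lines each")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual

def decode_quality(qual, offset):
    """Convert quality characters to integer scores: Q = ord(char) - offset."""
    return [ord(c) - offset for c in qual]

def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

record = [
    "@HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT",
    "+HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "efcfffffcfeefffcffffffddf`feed]`]_Ba",
]
for name, seq, qual in read_fastq(record):
    scores = decode_quality(qual, offset=64)
    print(name, len(seq), scores[:5], scores[-2:])
```

Decoding the same string with offset 33 would give scores far above 41, which is one practical hint that a file is not Phred+33.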
Mapping Software • Long reads • BLAST, HMMER, SSEARCH • Short reads • BLAT • Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA
Visualizations • UCSC Genome Browser • GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2 • Integrative Genomics Viewer (IGV)
Quantification • Peak calling • QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT • Expression quantification • Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ • SNP calling • samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH
Peak Discovery [Pepke et al. Nature Methods 2009]
Transcript Quantification RPKM, FPKM [Pepke et al. Nature Methods 2009]
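RPKM stands for reads per kilobase of transcript per million mapped reads; FPKM counts fragments (read pairs) instead of individual reads. A minimal sketch of the calculation, with a hypothetical function name and made-up example numbers:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.
    RPKM = 1e9 * C / (N * L), with C reads on the transcript,
    N total mapped reads, and L transcript length in base pairs."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# 500 reads on a 2 kb transcript in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Passing fragment counts instead of read counts to the same formula gives FPKM.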
Typical RNA-seq Workflow [Trapnell et al. Nature Biotech 2010]