560 likes | 693 Views
An introduction to quantitative biology and R. David Quigley dquigley@cc.ucsf.edu Helen Diller Comprehensive Cancer Center, UCSF Institute for Cancer Research, University of Oslo. Molecular biology 20 years ago. qualitative methods small-scale quantitative tests.
E N D
An introduction to quantitative biology and R David Quigley dquigley@cc.ucsf.edu Helen Diller Comprehensive Cancer Center, UCSF Institute for Cancer Research, University of Oslo
Molecular biology 20 years ago qualitative methods small-scale quantitative tests Suzuki Med Mol Morph 2010 Oh PNAS 1996 Mao Genes Dev 2004 David Quigley dquigley@cc.ucsf.edu
Molecular biology now qualitative methods small-scale quantitative tests large-scale quantitative analysis microarrays, *seq methods, cell phenotype screens... Nik-Zainal Cell 2012 Fullwood Nature 2009 CGAN Nature 2012 David Quigley dquigley@cc.ucsf.edu
Many studies use both approaches Hypothesis-generating: Normalize breast tumor expression data (three cohorts) Call genotypes from SNP arrays Calculate association between genotype and expression genome-wide (eQTL) Identify an interesting candidate Valid ate eQTL in independent cohort Hypothesis-driven: Methylation analysis at MRPS30 promoter In vitro (two cell lines): ChIP-PCR +/- estrogen qPCR sequencing David Quigley dquigley@cc.ucsf.edu
Common quantitative techniques Gene Expression transcription splicing methylation protein binding (ChIP-seq) Genomics and Genetics de novo assembly DNA copy number (CNV and tumors) germline variant analysis tumor variant analysis SNPs, indels, translocation analysis of tumor clonality
Challenges requires statistical sophistication in study design in interpretation many data points 1,000 to 1,000,000 measurements per sample many false positives which look like great stories software becomes part of the experiment divide between engineering, biology culture & thinking David Quigley dquigley@cc.ucsf.edu
Wet lab and quantitative skills: better job prospects Schillebeeckx Nature Biotech. 2013 David Quigley dquigley@cc.ucsf.edu
What approachs are used to analyze quantitative data?
Chosing a tool Cost Learning curve Ease of use Flexibility (closed to open-ended) Software ecosystem (none to extensive) David Quigley dquigley@cc.ucsf.edu
Traditional programming languages Python, C++, Java, others can solve any computable problem creates the fastest tools free requires programming expertise complex to write and test high effort David Quigley dquigley@cc.ucsf.edu
Specialized single-purpose programs command line tools academic research type commands at a prompt or run scripts PLINK, bowtie, GATK, bedtools GUI (point and click) commercial software for a vendor’s platform slick, opaque, hard/impossible to automate David Quigley dquigley@cc.ucsf.edu
Commercial statistics programs STATA, SPSS, GraphPad, others 1) Load one dataset 2) Select analysis by clicking on a GUI 3) Generate a report May have a built-in language Very mature tools for traditional biostatistics Not free David Quigley dquigley@cc.ucsf.edu
Web-based tools Galaxy string together pre-defined analysis steps very easy to use reproducible David Quigley dquigley@cc.ucsf.edu
R: a “software environment” Using R is like writing and using software Traditionally, biologists did not do this. R was written by statisticians to be a free replica of another language called “S”. David Quigley dquigley@cc.ucsf.edu
Why is R popular? Flexible, open-ended, open-source Large library of packages package: easy-to use published methods like a Qiagen kit Free! David Quigley dquigley@cc.ucsf.edu
You use R by typing at the prompt There is no pull-down menu of statistical commands David Quigley dquigley@cc.ucsf.edu
What’s good about this approach? chain analyses work with multiple datasets use packages of code easy to reproduce runs on anything makes sense to computer programmers David Quigley dquigley@cc.ucsf.edu
What’s hard about this approach? hard to get started cryptic commands built-in help is amusingly unhelpful David Quigley dquigley@cc.ucsf.edu
packages: collections of R functions collection of R code that solves a specific task limma: microarray normalization and analysis samr: differential expression impute: dealing with missing data downloaded for free from a central repository David Quigley dquigley@cc.ucsf.edu
bioconductor Curated collection of R packages Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more bioconductor.org David Quigley dquigley@cc.ucsf.edu
Learning R data types by comparing them to Excel spreadsheets
Comparing Excel and R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible David Quigley dquigley@cc.ucsf.edu
Comparing Excel and R Excel Easy tasks are easy non-trivial tasks impossible or expensive No paper trail Mangles gene names Plots look terrible R Easy jobs are hard at first Non-trivial things are possible Easy to make a paper trail Biostatistics researchers publish tools in R Can create publication-ready plots David Quigley dquigley@cc.ucsf.edu
Organizing data in Excel Each subject has a row. Each column has a feature of your subjects. David Quigley dquigley@cc.ucsf.edu
R calls the data points variables variables numbers and characters (letters, words) numbers: 2.6, 4 characters: “Flopsy”, “white, brown paws” David Quigley dquigley@cc.ucsf.edu
R calls the columns vectors vectors ordered collections of a variable name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”] age: [2.5, 2.6, 2.5, 4] David Quigley dquigley@cc.ucsf.edu
R calls the data set a data frame data frame a list of vectors (columns) that have names elements can be read and written by row & column David Quigley dquigley@cc.ucsf.edu
I can slice and dice the data frame David Quigley dquigley@cc.ucsf.edu
Tell R to do things using functions function_name( details about how to do it ) generate sequence from 1 to 5 counting by 0.5 parameters for seq are named from, to, and by David Quigley dquigley@cc.ucsf.edu
Tell R to do things using functions function_name( details about how to do it ) report the mean of my.data. Result of one function is fed into another one. David Quigley dquigley@cc.ucsf.edu
Tell R to do things using functions function_name( details about how to do it ) define a new function that adds 2 to whatever it’s passed compare to original value of my.data David Quigley dquigley@cc.ucsf.edu
Walk-through a straightforward analysis
Primary data from METABRIC study gene expression TP53sequence 1,400 samples from 5 hospitals Is there an association between breast cancer subtype and TP53 mutation? David Quigley dquigley@cc.ucsf.edu
Tasks Normalize data batch effects unwanted inter-sample variation Identify outliers associations between p53 and subtype David Quigley dquigley@cc.ucsf.edu
Quantile Normalization (limma) Force every array to have the same distribution of expression intensities > library(limma) > raw = read.table('raw_extract.txt’, ...) > raw.normalized = normalize.quantiles( raw ) > normalized = log2( raw.normalized ) David Quigley dquigley@cc.ucsf.edu
Identify batch effects in microarrays Principle Components Analysis Identify strongest variation in a matrix gene 1 gene 2 David Quigley dquigley@cc.ucsf.edu
Identify batch effects in microarrays Principle Components Analysis Identify axes of maximal variation in a matrix gene 1 first principle component gene 2 David Quigley dquigley@cc.ucsf.edu
Identify batch effects in microarrays Principle Components Analysis Identify strongest variation in a matrix group A group B gene 1 gene 1 gene 2 gene 2 David Quigley dquigley@cc.ucsf.edu
PCA of identifies a batch effect second principle component hospital 3 (yellow) first principle component > my.pca = prcmp( t( expression.data ) ) > plot( my.pca, ... ) David Quigley dquigley@cc.ucsf.edu
batch correction reduces bias (ComBat) ComBat package reduces user-defined batch effects second principle component first principle component David Quigley dquigley@cc.ucsf.edu
Molecular subtypes of breast carcinoma, defined by gene expression ER status Luminal A N=507 Luminal B N=379 Her2 N=161 Basal N=234 > sa = read.table(‘patients.txt’, ...) > tumor.counts = table( sa$ER.status, sa$PAM50Subtype) (convert counts to percentages) > barplot( c( tumor.counts[1], tumor.counts[2] ), col=c(“red”,”green”), ... ) David Quigley dquigley@cc.ucsf.edu
Find interactions: TP53 and subtype Fit a linear model: > fitted.model = lm( dependent ~ independent ) Perform Analysis of Variance: > anova( fitted.model ) general form of my analysis: > anova( lm( gene.expression ~ PAM * TP53 ) 18,000 genes PAM: {LumA, LumB, Her2, Basal} TP53: {mutant, WT} David Quigley dquigley@cc.ucsf.edu
Automate with loops Calculate anova for 18,000 genes by looping through each gene and storing result. > n_genes = 18000 > result = rep( 0, n_genes ) > for( counter in 1:n_genes ){ result[counter] = anova(...) } sort results identify significant interaction repeat 18,000 times David Quigley dquigley@cc.ucsf.edu
Immune infiltration in TP53-WT Basal CD3E Does p53 have a role in immune surveillance? log2 expression log2 expression absent mild severe David Quigley dquigley@cc.ucsf.edu infiltration
Next steps: getting help and learning more
online forums: expert help for free biostars.org all of bioinformatics David Quigley dquigley@cc.ucsf.edu
online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics David Quigley dquigley@cc.ucsf.edu
online forums: expert help for free seqanswers.com biostars.org Nextgen sequencing all of bioinformatics stats.stackexchange.com statistics David Quigley dquigley@cc.ucsf.edu
UCSF resources Library classes and information Formal courses (BMI, Biostatistics) Cores (Computational Biology, Genomics) QGDG monthly methods discussion group David Quigley dquigley@cc.ucsf.edu
Online classes and blogs Free courses on data analysis http://jhudatascience.org simplystatistics.org Coursera etc... Good tutorials on sequence analysis http://evomics.org/learning David Quigley dquigley@cc.ucsf.edu