Microarray data analysis with Chipster 22.9.2008. Jarno Tuimala. Program – an analysis workflow. Basic functionality of Chipster Data import Quality control Normalization Describing the experiment Filtering and missing value considerations Statistical testing
Microarray data analysis with Chipster 22.9.2008 Jarno Tuimala
Program – an analysis workflow • Basic functionality of Chipster • Data import • Quality control • Normalization • Describing the experiment • Filtering and missing value considerations • Statistical testing • Clustering and visualization • Annotation
Chipster • Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project • Features • Easy to use graphical user interface • Comprehensive selection of tools • Support for different array types (Affymetrix, Agilent, Illumina, cDNA) • Compatible with Windows, Linux and Mac OS X • Easy to install and update • Wizards and workflows • Interactive graphics • Transparency (as opposed to “black box”) • Alternative annotations for Affymetrix arrays • Automatic tracking of performed analyses • http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf • http://chipster.csc.fi
How does it work? • Java Web Start installs and updates client automatically • [Architecture diagram; labels: desktop client, internet, SSL, CSC, front end, security, analyser, SOAP, international Web Services, Corona/Murska, visualisation, analysis]
[Screenshot of the user interface; panel labels: Tools, Data, Visualization]
Phenodata – describing your experiment • Phenodata file is created during normalization • Fill in the group column with numbers describing your experimental setup • e.g. 1 = healthy control, 2 = cancer sample • necessary for the statistical tests to work • If you bring in previously created normalized data and phenodata: • Choose “import directly” in the import tool • Right click on normalized data, choose “Link to” phenodata and link type “Annotation” • If you brought in normalized data and need to create phenodata for it: • Utilities/ Generate phenodata (fill in the chiptype parameter!) • Right click on normalized data, choose “Link to” phenodata and link type “Annotation” • Fill in the group column
Visualizing the data • Data visualization panel • Maximize and redraw for better viewing • Two types of visualizations • Interactive visualizations produced by the client program • Select the visualization method from the pulldown menu of the data visualization panel • Save by right clicking on the image • Static images produced by R/Bioconductor, Weeder, etc. • Select from Analysis tools/ Visualisation • View by double clicking on the image file • Save by right clicking on the file name and choosing “Export”
Interactive visualizations by the client • Spreadsheet • Histogram • Scatterplot • 3D scatterplot • Expression profiles • Clustered profiles • Hierarchical clustering • SOM clustering • Array pseudo-image • Venn diagram Available actions: • Change titles, colors etc • Zoom in/out
Static images produced by R/Bioconductor • Volcano plot • Box plot • Histogram • Heatmap • Venn diagram • Idiogram • Chromosomal position • Correlogram • Dendrogram • QC stats plot • RNA degradation plot • K-means clustering • SOM-clustering
Running many analyses simultaneously • You can have max 5 analysis jobs running at the same time • Use Task manager to • view parameters, status,… • cancel jobs
Workspace – continue later/elsewhere • Saving your workspace allows you to continue later • File/ Save workspace • File/ Load workspace • Currently it is possible to have only one workspace saved at a time • If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location • C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot
Importing files • Affymetrix CEL-files are imported to Chipster automatically • Other files are imported using the Import tool
Import tool, step 1 • Define • Header • Footer • Title row • Delimiter
Import tool, step 2 • Define columns • Modify flags
Importing Agilent files (required fields) • Sample (rMeanSignal) • Sample background (rBGMedianSignal) • Control (gMeanSignal) • Control background (gBGMedianSignal) • Identifier (ProbeName) • Annotation (ControlType) • Flag (IsManualFlag) • https://extras.csc.fi/biosciences/chipster-manual/data-formats.html
Quality control tools • Quality control -tools • Affymetrix basic • RNA degradation + Affy QC • Agilent • MA-plot + density plot + boxplot • Visualization – dendrogram • Statistics - NMDS
Affymetrix I • Quality control tools are run on raw data (CEL files). • Dendrogram and NMDS on normalized data
QC-tools in Chipster • Quality control • Affymetrix basic • Affymetrix RLE and NUSE • Agilent 2-color • Visualization • Dendrogram • Heatmap • Correlogram • Statistics • NMDS
What is normalization? • Normalization is the process of removing systematic variation from the data. • Typically you would normalize your data so that all the chips become comparable.
Methods • Affymetrix • Background correction + expression estimation + summarization • RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing • Agilent • Background correction + averaging duplicate spots + normalization • After normalization the expression values are always expressed on log2-scale
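The quantile normalization step mentioned for RMA can be sketched as follows. This is only an illustration in Python, not the R/Bioconductor code that Chipster runs: the function name `quantile_normalize` and the input layout (one list of intensities per chip) are assumptions, ties are simply broken by probe order, and background correction and median polishing are left out.

```python
def quantile_normalize(chips):
    """Give every chip the same intensity distribution.

    chips: list of chips, each a list of intensities for the same probes
    (hypothetical layout chosen for this sketch).
    """
    n_probes = len(chips[0])
    # Sort each chip, then average the values found at each rank across chips.
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [
        sum(chip[rank] for chip in sorted_chips) / len(chips)
        for rank in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # order[rank] is the index of the probe that holds this rank on the chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


chips = [[5, 2, 3], [4, 1, 6]]
print(quantile_normalize(chips))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step every chip contains exactly the same set of values, only assigned to different probes, which is what makes the chips comparable.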
Affymetrix • Methods: MAS5, Plier, RMA, GCRMA, Li-Wong • MAS5 is the older Affymetrix method, Plier is a newer one • RMA is the default, and works rather nicely if you have more than a few chips • GCRMA is similar to RMA, but also takes GC content into account • Li-Wong is the method implemented in dChip • Variance stabilization makes the variance over all the chips similar • Works only with MAS5 and Plier, since all others output log2-transformed data by default (and are thus already corrected for the same phenomenon) • Custom chiptype • If you want to use reannotated probes (probes reassigned to the genes they actually match), select one from this menu.
Agilent I • Background correction • Background treatment • None, Subtract, Edwards, Normexp • Background offset • 0 or 50 • Normalize chips • None, median, loess • Normalize genes (not typically used) • None, scale (to median), quantile • Chiptype • A must setting!
Agilent II • Background treatment typically generates many negative values, which are coded as missing values after log2-transformation. • The usual Subtract option does this • Using Normexp + offset 50 generates no negative values and gives rather good estimates (the best method reported) • Loess removes curvature from the data (suggested)
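Why plain subtraction produces missing values can be shown with a few lines of Python. This sketch only covers the Subtract treatment; it does not implement Normexp, which is a model-based method. The function name `subtract_and_log2` and the example intensities are made up for illustration.

```python
import math


def subtract_and_log2(foreground, background, offset=0.0):
    """Subtract the background, add an offset, and log2-transform.

    Spots whose corrected intensity is not positive have no logarithm
    and are returned as None (a missing value).
    """
    result = []
    for fg, bg in zip(foreground, background):
        corrected = fg - bg + offset
        result.append(math.log2(corrected) if corrected > 0 else None)
    return result


foreground = [164, 80, 306]   # hypothetical spot intensities
background = [100, 90, 50]    # hypothetical local background

print(subtract_and_log2(foreground, background))  # [6.0, None, 8.0]
```

The second spot is dimmer than its background, so it is lost after the transformation. With plain subtraction an added offset only reduces the number of such spots; it does not guarantee that none remain.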
Gene filtering • Removing probes for genes that are • Not expressed • Expressed at constant level (not changing) • Often a good idea, and necessary before multiple testing correction can be adequately applied • Some controversy on this… • Non-specific filtering • Expression, flags, SD, … • Specific filtering • Statistical testing
Non-specific filtering • Often used for removing bad quality data: • Intensity value too low • Intensity value saturated • Appearance of the spot is abnormal • Typically, non-changing genes are also removed • These can be removed using • Filter by standard deviation • Filter by interquartile range • Filter by expression
Specific filtering • Selecting genes that are associated with some phenotype • Typically involves statistical testing • Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value. • Both tell a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect. • Take both into account by combining the filters. • Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)
Unspecific filtering in Chipster • Pre-processing • Filter by expression • Select the upper and lower cut-offs • Select the number of chips this rule has to be fulfilled on • Select whether to return genes inside or outside the range • Filter by SD • Select the percentage of genes to filter out • Filter by interquartile range (IQR) • Select the IQR • Filter by coefficient of variation (CV) • Median is used for filtering on CV (cannot be changed) • Utilities • Calculate descriptive statistics • Filter using a column
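Two of the filters listed above can be sketched in Python as follows. The function names, the dictionary layout (gene name mapped to log2 values across chips) and the exact rule semantics are assumptions made for this illustration; the Chipster tools themselves may handle boundaries and rounding differently.

```python
import statistics


def filter_by_expression(expr, lower, upper, min_chips, keep_inside=True):
    """Keep genes whose values fall inside (or outside) [lower, upper]
    on at least min_chips chips."""
    kept = set()
    for gene, values in expr.items():
        if keep_inside:
            hits = sum(lower <= v <= upper for v in values)
        else:
            hits = sum(v < lower or v > upper for v in values)
        if hits >= min_chips:
            kept.add(gene)
    return kept


def filter_by_sd(expr, fraction_to_remove):
    """Drop the given fraction of genes with the smallest standard deviation."""
    sds = {gene: statistics.stdev(values) for gene, values in expr.items()}
    ranked = sorted(sds, key=sds.get)  # least variable genes first
    n_remove = int(len(ranked) * fraction_to_remove)
    return set(ranked[n_remove:])


# Hypothetical log2 expression values for four genes on four chips.
expr = {
    "g1": [7.0, 7.1, 6.9, 7.0],
    "g2": [5.0, 9.0, 5.5, 9.5],
    "g3": [3.0, 3.2, 2.8, 3.0],
    "g4": [8.0, 10.0, 8.5, 11.0],
}

print(sorted(filter_by_expression(expr, 8.0, 12.0, min_chips=2)))  # ['g2', 'g4']
print(sorted(filter_by_sd(expr, 0.5)))                             # ['g2', 'g4']
```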
Venn diagram • Select three datasets in Chipster • Run the Venn diagram tool from Visualization tool category • [Figure: Venn diagram of three gene lists labelled SD, CV and IQR]
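What the Venn diagram shows for three gene lists can be reproduced with set operations. The sketch below is an illustration only; the function name `venn_regions` and the gene identifiers are hypothetical.

```python
def venn_regions(a, b, c):
    """Split three gene lists into the seven regions of a Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "A only": a - b - c,
        "B only": b - a - c,
        "C only": c - a - b,
        "A and B only": (a & b) - c,
        "A and C only": (a & c) - b,
        "B and C only": (b & c) - a,
        "A and B and C": a & b & c,
    }


# Hypothetical results of three different filters.
sd_genes = ["g1", "g2", "g3"]
cv_genes = ["g2", "g3", "g4"]
iqr_genes = ["g3", "g5"]

regions = venn_regions(sd_genes, cv_genes, iqr_genes)
print(regions["A and B and C"])  # {'g3'}
```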
Some terminology • Usually tests for comparing means of two or more groups are used • Variance might be of interest too, but in practice this is never done. • Parametric tests (assume data normally distributed) • Typically used for microarray data • Non-parametric tests (do not assume normality) • P-value • Risk of saying that there is a difference when there really isn’t • Traditionally 0.05 is used as a cut-off for significance • False discovery rate (FDR) is a p-value corrected for multiple tests (more on this later)
Statistical testing • Needs replication (>2 chips per group) • Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation. • Comparing means (parametric tests) • One-group tests • Compare to a known mean • Example: One-sample t-test • Two-group tests • Compare two groups’ means • Example: Two-sample t-test • Several group tests • Compare several groups’ means • Example: Analysis of variance (ANOVA) • Two or more groups, two or more factors • Compare means in the groups according to both factors simultaneously • Example: multiple linear regression (linear modeling in Chipster)
t-test • Compares means of two groups • If the p-value is small (<0.05), there is evidence of a difference between the groups. • If the p-value is large (>0.05), the data show no evidence of a difference; this does not prove that the groups are the same. • The p-value is the risk of saying that there is a difference when there actually isn’t. • A test for every gene is run separately -> thousands of tests and p-values
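Running one test per gene can be sketched as below. The code computes only the t statistic, using the unequal-variance (Welch) form; whether Chipster uses this variant is an assumption, and turning the statistic into a p-value additionally requires the t distribution, which is not part of the Python standard library. The function names and the data layout are hypothetical; the group labels follow the phenodata convention (1 and 2).

```python
import math
import statistics


def welch_t(group1, group2):
    """t statistic for the difference of two group means (unequal variances)."""
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return (statistics.mean(group1) - statistics.mean(group2)) / se


def t_per_gene(expr, groups):
    """Run the test separately for every gene.

    expr: gene name -> expression values, one per chip
    groups: group label (1 or 2) of each chip, as in the phenodata group column
    """
    result = {}
    for gene, values in expr.items():
        g1 = [v for v, g in zip(values, groups) if g == 1]
        g2 = [v for v, g in zip(values, groups) if g == 2]
        result[gene] = welch_t(g1, g2)
    return result


groups = [1, 1, 1, 2, 2, 2]
expr = {"geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}  # made-up values
print(round(t_per_gene(expr, groups)["geneA"], 3))  # -3.674
```

The further the statistic is from zero, the smaller the p-value; with thousands of genes this loop yields thousands of statistics and p-values.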
ANOVA • A generalization of t-test. • Compares means of several groups. • Tells whether the means are different, but not which means differ from each other. • For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster) • A test for every gene is run separately -> thousands of tests and p-values
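The sense in which ANOVA generalizes the t-test can be seen from the F statistic, which compares the variation between group means with the variation inside the groups. The sketch below is a plain one-way ANOVA for a single gene; the function name `one_way_f` and the example values are made up, and the p-value (which needs the F distribution) is not computed.

```python
import statistics


def one_way_f(groups):
    """F statistic of a one-way ANOVA; groups is a list of lists of values."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of chips
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2
        for group in groups
    )
    ss_within = sum(
        (v - statistics.mean(group)) ** 2 for group in groups for v in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


print(one_way_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]))  # 27.0
print(one_way_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))                   # 13.5
```

With only two groups, F equals the square of the equal-variance two-sample t statistic. A large F says that some means differ, but not which ones.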
Multiple testing correction I • After getting the results for all the genes, p-values are adjusted for the number of tests conducted. • When making several comparisons using the same test, some of the results will be chance findings. • Example: if the p-value threshold is 0.05, about one in every 20 tests on genes with no real difference is expected to come out significant by chance alone. If 10000 genes were tested, up to 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (about 500) could be false positives, and only a minority (about 50) true positives. • This can be corrected for (to some extent) by using a multiple testing correction. • Benjamini and Hochberg FDR: If the FDR threshold is 0.05, 5% of the significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives. • Thus, the FDR threshold can be set much higher than the usual p-value threshold, and the results can still be meaningful and worth investigating.
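The Benjamini and Hochberg adjustment itself is short enough to sketch. This is an illustration of the standard step-up procedure in Python, not the code Chipster runs; the function name `benjamini_hochberg` and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that an adjusted value
    # never exceeds the adjusted value of a larger raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 3) for q in benjamini_hochberg(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

Each raw p-value is multiplied by the number of tests and divided by its rank. The adjusted values never reverse the order of the raw p-values, although neighbouring genes may end up tied.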
Multiple testing correction II • The ranking of the genes does not change after multiple testing correction! • If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction. • If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.