MSCL Analyst’s Toolbox

MSCL Analyst’s Toolbox Instructors: Jennifer Barb, Zoila G. Rangel, Peter Munson Jan 2008 Mathematical and Statistical Computing Laboratory Division of Computational Bioscience

Course Outline Day 1 • MSCL Analyst’s Toolbox and JMP™ overview • MSCL Toolbox Concepts • JMP™fundamentals • Lunch • Affymetrix ExpressionConsole™, processing .cel files, exporting data • MSCL Toolbox Demo • Data input • Basic Analysis (Master File, Final File, Data normalization, QC, PCA, ) • Gene selection, statistical tests (p-values, FDR) • Annotation Day 2 • Statistical Topics (PCA, Data normalization, FDR) • MSCL Analyst’sToolbox Demo (cont.) • Complex Analysis (2-way ANOVA, blocked ANOVA) • Data Visualization

Topics not included • Exon Array Analysis -- coming soon! • SNP chip • Resequencing analysis, ChIP-Chip, copy number • 2-color or spotted cDNA array analysis • complete JMP tutorial • JMP on Mac, Linux • JMP scripting language • Data management commands in JMP: • Stack, Split, Concatenate, Sort

Why use JMP? • Interactive graphics facilitates data exploration, discovery of features • Powerful, > 2,00,000 rows by 100s of columns (currently, 2 GB limit) • Scripting language -- object oriented, allows matrix manipulation • Connects to database servers including NIHLIMS or local GCOS • JMP is also general purpose statistics pack • Good technical support for JMP from: (919) 677-8008 or www.jmp.com • No direct cost to individual NIH users* (centrally supported in most NIH ICs) • MSCL Analyst's Toolbox is FREE, adds tools for microarray studies

MSCL Analyst’s Toolbox Features • Menu driven • Automated gene annotations • Web link-out** • Highly interactive, intuitive user interface • Analysis pipeline, based on years of experience • Familiar parametric analysis, e.g. ANOVA • Exploratory Data Analysis • Adaptable to new designs, analyses (e.g. Exon chips, SNP chips) • Powerful, handles largest Affy chips, probe-level analysis • Up to hundreds of chips at once • PC, Mac or Linux desktops • Support available through MSCL

MSCL Analyst’s Toolbox Capabilities • Connects to the central NIHLIMS database or local GCOS databases • Reads in Pivot Tables from Affymetrix EC™ or GCOS™ • Visualizes Principal Components • Analyzes simple experiments (paired, unpaired T-tests) • Analyzes complex experiments (multiple treatments, time series, linear trends, slope changes between treatments) • Compensates for “batch” effects • Selects and annotates significant genes • Manages multiple gene lists (intersection, union, Venn diagrams) • Multivariate, Cluster, Discriminant, Neural net analysis • Uses dynamic visualization tools

How to obtain: • JMP • http://isdp.cit.nih.gov/downloads/stats.asp • Find your desktop support person at http://isdp.cit.nih.gov/information/contact_lookup_nih.asp • JMP technical support from (919) 677-8008 • The MSCL Analyst's Toolbox • Download from http://affylims.cit.nih.gov • Help offered on collaborative basis by MSCL • Email questions to: munson@helix.nih.gov

NIH Bioinformatics Cooperativehttp://affylims.cit.nih.gov

MSCL Toolbox Data Pipeline: files Xform • Input files or Fetch data • Transform and normalize • Principal Components Analysis • Create Master file, add treatment groups • Compute statistical test, get p-values • Correct for multiple comparisons or use FalseDiscoveryRate • Compute log fold-change • Visualize results • Select relevant genes PCA Final Master

Data sources: • NIHLIMS database via ODBC connection • Local GCOS database via ODBC connection • GCOS pivot table • EC pivot table (NEW support for this option) • Excel spread sheet • Text files

Data Input or data fetch NIHLIMS database Publish(MAS) EC™ or GCOS™ MAS5™ MSCL Publish DB <username/password> Process DB <experiment> .dat files .cel files .chp files .rpt files DCEG/NCI Publish DB <username/password> CCMD Publish DB Import ODBC access archive(LM) delete(LM) assume ownership(LM) Import(LM) Export(LM) client workstation Analyze (MAS) .txt client files Fluidics Platform Scanner DMT A-SCAN Partek GeneSpring

1 Genes 20,000 1 16 Samples Gene Expression Data Matrix Sample information Expression Matrix Gene Annotations

1 Genes 20,000 Annotations for each gene • Probe Set ID • Genbank ID • Unigene ID, Title • Entrez Gene ID • Cytogenetic map location • Physical map location • HUGO gene symbol, synonyms • Functional relevance • Associated literature references ... • GO terms for molecular process, biological function or cellular component Gene Annotations

Annotation Files: • Affymetrix annotations for each probeset have been downloaded and formatted for MSCL Toolbox, available at affylims.cit.nih.gov • Annotations are updated quarterly • Annotation tables may be JOINed by ProbeSetID • Probe Set ID • Gene Title • Gene Symbol • UnigeneID • Transcript ID • Ensembl • Entrez Gene • Representative Public ID • First SwissProt • Genome Alignment Chromosome • Genome Alignment Start Address • Genome Alignment Stop Address • Genome Alignment Strand • Chromosomal Location Final Annot. Final-Annot

Annotating Genes Netaffx, reformatted Your data file “JOIN” on ProbeSetID

1 Samples 16 Information about the Sample(transposed into MasterFile) • Clinical information (human) • Diagnosis • Demographic information • Treatment (in vivo, in vitro) in designed experiment • Tissue of origin • Cell culture, strain, passage • Sampling date/time • RNA preparation protocol • Operator/batch/lot/laboratory information • QC information (rawQ, scale factor, 3/5-actin, 3/5-GAPDH, etc) Information about each Sample

Table formats • JMP usually deals with a single Table, but… • TWO tables are needed for MSCL Analyst’s Toolbox: • 1. "Master File" layout • Each ROW represents a chip • Columns define treatment, replicate number, etc. • 2. "Final" layout • COLUMNs correspond to chips (rows in Master File) • Each ROW is a probe set, unique identifier is probe set ID • Tables are LINKED by “Shortnames” field in Master

Master File -- one row per chip Final File -- one row per probe set Linked Table Formats

Naming Convention for Final File Columns (prefixes) • Data type: AD-, SG-, PA- • Data transform: L-, Lmed-, GL-, S10- • Statistical results: p-, FDR-, mean-, SFC- • Column Naming Tips: • Avoid punctuation, hyphen, period, slash, etc. • Avoid spaces, use underscore “_” instead • Shorter is better • Toolbox utility available for trimming column names Column Name ITEM_NAME SG-33NH SG-33TH S10-33NH S10-33TH PA-33NH PA-33TH SFC-7 SFC-11 p-slope&cent2 FDR slope&cent2

Data Pipeline: files Xform • Input files or Fetch data • Transform and normalize • Principal Components Analysis • Create Master file, add treatment groups • Compute statistical test, get p-values • Correct for multiple comparisons or use FalseDiscoveryRate • Compute log fold-change • Visualize results • Select relevant genes PCA Final Master

Data Transformation and Normalization

Log(x/median x) transform (“Lmed”)

Principal Components Analysis PC 2(12%) PC 1(38%)

Analysis Scripts • ANOVA1 • T-test, unequal variance • Paired t-test • Consistency test • ANOVA1 with blocking • ANOVA2 with interaction terms (unbalanced data allowed) • ANOVA2 with blocking • Linear regression • ANCOVA with blocking (balanced data case) • ANCOVA2 with blocking (balanced data case) • Other tests are easily added (requires scripting)

Log(FoldChange)=“LFC” FoldChange = treated / control Log(FoldChange) = Log(treated / control) = Log(treated) - Log(control) Rule of Thumb for Base10 Logarithms: Log10(2-fold change) = 0.3 Log10(10-fold change) = 1 Log10(0.1-fold change) = -1

Significance of change Magnitude of change, Log Scale Volcano Plot Selection Regions

Interpreting Gene Lists Final Annot. Filter (FDR<10%) Ingenuity™, GeneGo™ GeneList Significant Terms

GO-SCAN- Gene Ontology Annotations • Gene Ontology for Significant Collection of Annotations: GO-SCAN is a bioinformatics • tool that selects and presents relevant Gene Ontology (GO) annotations for a gene "hit" • list from an Affymetrix microarray experiment. http://goscan.cit.nih.gov/

Ingenuity Pathway Analysis(Doug Joubert, NIH Library)

MSCL Analyst’s Toolbox