Mining Public Data for Insights into Human Disease

Baliga Lab Meeting, 11/16/2009, Chris Plaisier

Utility of Gene Expression for Human Disease

Microarray Technology

Big Picture

Data Access

Gene Expression Microarray Repositories
• Gene Expression Omnibus (GEO)
  • Hosted by: NCBI
  • Platform: All accepted
  • Normalization: Experiment-by-experiment basis
  • Access: R (GEOquery), EUtils
  • Meta-Information: GEOmetadb
• ArrayExpress
  • Hosted by: EMBL-EBI
  • Platform: All accepted
  • Normalization: Experiment-by-experiment basis
  • Access: Web interface, EMBL API
  • Meta-Information: ? (API)
• Many smaller repositories have more phenotypic information for specific diseases
  • Phenotypic information may be hard to access
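As a rough illustration of the EUtils access route listed above, the sketch below only builds an ESearch query URL against the GEO DataSets database (`db=gds`) and does not send any request. The function name `geo_platform_query_url` and the `retmax` default are assumptions made for this example; GPL570 is the GEO platform accession for the HG-U133 Plus 2.0 array.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_platform_query_url(platform_accession, retmax=20):
    """Build an ESearch URL for GEO DataSets records matching one accession.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {
        "db": "gds",                            # GEO DataSets database
        "term": f"{platform_accession}[ACCN]",  # search by accession
        "retmax": retmax,                       # maximum number of IDs returned
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# GPL570 = Affymetrix HG-U133 Plus 2.0 platform record in GEO.
url = geo_platform_query_url("GPL570")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) returns an XML list of record IDs. From R, the GEOquery package named on the slide covers the same ground.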

Gene Expression Omnibus

Samples Per Platform in GEO
[Figure: samples per platform in GEO. Labels: Latest 3’ Affymetrix Array; HGU133 Plus 2.0; HGU133A.]
Affymetrix arrays account for ~67% of human gene expression data in public repositories.

Affymetrix Probesets
[Figure: GeneChip U133 Plus 2.0 Array (image stored as CEL file).]
• >54,000 probesets
• Probeset = 11 probe pairs
• Probe pair = perfect match probe + mismatch probe
• Each probe is 25 nucleotides long

Pre-Processing 101

Pre-Processing Gene Expression Data

Removing Mis-Targeted and Non-Specific Probes
[Figure: CEL file + CDF file yields intensities; CEL file + alternative CDF file yields intensities.]
• Normally the CDF file comes from Affymetrix
• The alternative CDF file is thoroughly cleaned (Zhang, et al. 2005)

What Makes Cells Different?

PANP: Presence/Absence Filtering
• Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
• NSMP probesets are designed to hybridize to the strand opposite the expressed strand
• Use the background distribution from these NSMPs to threshold the entire dataset
• Output is a call for each gene on each array
• Calls are:
  • P = present
  • M = marginal
  • A = absent
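The thresholding idea above can be sketched as follows. This is a simplified illustration, not the PANP package's implementation: it treats the NSMP intensities from one array as an empirical background and calls each probeset from its empirical p-value. The function name and the two cutoffs (0.01 and 0.02) are assumptions chosen for the example.

```python
from bisect import bisect_left


def presence_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Call each intensity P, M, or A against an empirical NSMP background.

    tight/loose are illustrative p-value cutoffs (assumed, not from the slides).
    """
    background = sorted(nsmp_intensities)
    n = len(background)
    calls = []
    for value in intensities:
        # Empirical p-value: fraction of background probesets at or above value.
        p_value = (n - bisect_left(background, value)) / n
        if p_value < tight:
            calls.append("P")   # clearly above background
        elif p_value < loose:
            calls.append("M")   # borderline
        else:
            calls.append("A")   # indistinguishable from background
    return calls


# Toy background of 100 NSMP intensities (made-up values).
toy_background = list(range(1, 101))
print(presence_calls([200, 100, 50], toy_background))
```

Running this per array gives one call per gene per array, matching the output described above.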

Identifying Present Genes
• Filter out genes called absent in ≥ 50% of arrays
  • Whole dataset
  • Subsets
• Only present genes are used in subsequent analyses
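A minimal sketch of this filter, assuming the calls are held in a dictionary mapping each gene ID to its list of per-array calls. The function name is hypothetical, and counting only "A" (so marginal calls are not treated as absent) is an assumption, since the slide does not say how "M" is handled.

```python
def present_genes(calls_by_gene, max_absent_fraction=0.5):
    """Return gene IDs whose fraction of 'A' calls is below the cutoff."""
    kept = []
    for gene, calls in calls_by_gene.items():
        # Assumption: marginal ('M') calls do not count as absent.
        absent_fraction = calls.count("A") / len(calls)
        if absent_fraction < max_absent_fraction:
            kept.append(gene)
    return kept


# Made-up calls for three genes across four arrays.
toy_calls = {
    "g1": ["P", "P", "A", "M"],  # 25% absent: kept
    "g2": ["A", "A", "P", "P"],  # 50% absent: filtered out
    "g3": ["A", "A", "A", "P"],  # 75% absent: filtered out
}
print(present_genes(toy_calls))
```

To filter on a subset rather than the whole dataset, pass only the calls for the arrays in that subset.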

Removing Redundancy

Reason for Removing Redundancy Before Running

Removing Redundancy
• Collapse Affymetrix Probeset IDs to EntrezIDs
• Test for correlation between probesets
  • If correlation is ≥ 0.8 then combine probesets
  • If not then leave them separate
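The collapsing rule above could look roughly like this. Two details are assumptions, because the slide does not specify them: probesets are combined by taking the per-sample mean, and a gene with more than two probesets is merged only if every pair reaches the threshold. All names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # undefined for constant profiles


def collapse_probesets(expression, probeset_to_entrez, min_r=0.8):
    """Merge probesets of one gene when all pairwise correlations reach min_r."""
    by_gene = {}
    for probeset, entrez in probeset_to_entrez.items():
        by_gene.setdefault(entrez, []).append(probeset)

    collapsed = {}
    for entrez, probesets in by_gene.items():
        profiles = [expression[p] for p in probesets]
        concordant = all(
            pearson(a, b) >= min_r for a, b in combinations(profiles, 2)
        )
        if concordant:
            # Assumed combining rule: per-sample mean across the probesets.
            collapsed[entrez] = [sum(v) / len(v) for v in zip(*profiles)]
        else:
            # Discordant probesets stay separate, keyed by gene and probeset.
            for p in probesets:
                collapsed[f"{entrez}|{p}"] = expression[p]
    return collapsed


# Made-up expression profiles (three samples) and Entrez mapping.
toy_expression = {
    "a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0],  # correlated pair
    "c": [3.0, 2.0, 1.0], "d": [1.0, 2.0, 3.0],  # anti-correlated pair
}
toy_mapping = {"a": "100", "b": "100", "c": "200", "d": "200"}
result = collapse_probesets(toy_expression, toy_mapping)
print(sorted(result))
```

Other combining rules (median, or keeping the most variable probeset) would slot into the same place as the mean.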

Pre-Processing Pipeline
[Figure: pipeline flowchart. The legend distinguishes steps implemented in R from steps implemented in Python.]

Big Picture

Glioma: A Deadly Brain Cancer
[Image: Wikimedia Commons]

Brain Anatomy
[Image: Wikimedia Commons]

What do they do?

Neurophysiology

Hierarchy of Nervous Tissue Tumors

Glioma
Gliomas account for 40% of all brain tumors and 78% of malignant brain tumors (Buckner et al., 2007).

Glioma Survival
[Figure: glioma survival at 5 years and 10 years. Source: http://www.neurooncology.ucla.edu/]

Repository of Molecular Brain Neoplasia Data (REMBRANDT)
• REMBRANDT (Madhavan et al., 2009)
• Currently 257 individual specimens
  • Glioblastoma multiforme (GBM) = 110
  • Astrocytoma = 50
  • Oligodendroglioma = 55
  • Mixed = 21
  • Non-Tumor = 21
• Phenotypes
  • Tumor type: GBM, Astrocytoma, etc.
  • WHO Grade: 176 individuals
  • Age: 253 individuals
  • Sex: 250 individuals (partially inferred using Y chromosome genes)
  • Survival (days post diagnosis): 169 individuals

REMBRANDT: Chromosome Y Expression
[Figure: sex-specific gene expression, samples labeled Female and Male.]
• 8 males cluster with females
• 4 females cluster with males
• Conversions of male to female should be more common than the reverse, because it is difficult for females to express Y chromosome genes.

REMBRANDT: Chr. Y Expression – Intelligent Reassignment
[Figure: sex-specific gene expression, samples labeled Female and Male.]
• Intelligent reassignment: if the previous sex call places a sample in the other group, the call is turned into an NA.
• All unknowns are given a call.
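The reassignment rule above reduces to a three-way decision per sample. In this sketch `None` stands in for both "unknown" and NA, and the function name is hypothetical.

```python
def reassign_sex(recorded, inferred):
    """Reconcile the recorded sex with the call inferred from chr. Y expression.

    recorded: "M", "F", or None when unknown.
    inferred: "M" or "F" from the expression-based grouping.
    Returns the final call, with None meaning NA.
    """
    if recorded is None:
        return inferred   # unknowns are given the expression-based call
    if recorded != inferred:
        return None       # conflicting calls are turned into NA
    return recorded       # concordant calls are kept


# (recorded, inferred) pairs: unknown, conflict, agreement.
print([reassign_sex(r, i) for r, i in [(None, "M"), ("F", "M"), ("M", "M")]])
```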

Progression of Astrocytic Glioma
[Figure: Furnari, et al. (2007)]

Modeling Glioma
[Figure: progression schema with labels 0, 1, 2.]
• Increasing metastatic potential and severity of glioma could be modeled using this simple schema
• Correlation of the model with survival post diagnosis is -0.68

Exploring Meta-Information
• Age explains 31% of the variance in survival post diagnosis
• Age explains 25% of the variance in the progression model
• Sex does not have a significant effect on either survival or the progression model
  • Yet it is known that glioblastoma is slightly more common in men than in women
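One common way to arrive at an "explains X%" figure is the R² of a simple linear regression, which equals the squared Pearson correlation. The sketch below assumes that reading; the slides do not state which model was used, and the toy values are made up and are not REMBRANDT data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def variance_explained(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    return pearson(x, y) ** 2


# Made-up ages (years) and survival times (days), perfectly linear.
toy_age = [30, 40, 50, 60]
toy_survival = [900, 700, 500, 300]
r_squared = variance_explained(toy_age, toy_survival)
print(r_squared)
```

By the same arithmetic, the -0.68 correlation between the progression model and survival corresponds to an R² of about 0.46.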

Summary
• Ample dataset with a good amount of meta-information
• Ready for dimensionality reduction and network inference!

Big Picture

Clustering as Dimensionality Reduction

Big Picture

Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity

Relative Genome Sizes

Solutions
• Pre-process genomic sequences
• Reduce data complexity by collapsing redundancies
• Utilize filters that select for only the most variant genes
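A variance filter of the kind mentioned in the last bullet can be sketched as below. Ranking by population variance and keeping a fixed top N are assumptions for illustration; the slides do not say which variability measure or cutoff was used.

```python
from statistics import pvariance


def most_variant_genes(expression, top_n):
    """Return the top_n gene IDs ranked by variance across samples."""
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,  # most variable genes first
    )
    return ranked[:top_n]


# Made-up expression values for three genes across three samples.
toy_expression = {
    "g1": [1, 1, 1],  # no variation
    "g2": [1, 5, 9],  # most variable
    "g3": [1, 2, 3],
}
print(most_variant_genes(toy_expression, 2))
```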

Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity

Eukaryotic Gene Structure

Regulatory Regions
[Figure: gene structure showing the promoter and the 3’ UTR.]
• Promoter: transcription factor binding sites (6-12 bp motifs)
• 3’ UTR: miRNA binding sites (4-9 bp motifs)
• No set length for promoters in eukaryotes. Grabbing 2 Kbp, so we can use 2 Kbp or smaller.
• Median 3’ UTR length is 831 bp
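Grabbing a fixed 2 Kbp promoter window has to respect gene strand. A minimal sketch follows, assuming the chromosome sequence is available as a plain string and the transcription start site (TSS) is a 0-based index; the function names and coordinate convention are assumptions for this example.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq, tss, strand, length=2000):
    """Return up to `length` bases upstream of a 0-based TSS.

    The result reads 5'->3' on the gene's own strand and is truncated
    at the chromosome ends.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    # Minus strand: upstream lies at higher coordinates on the reference,
    # so take that window and reverse-complement it.
    end = min(len(chrom_seq), tss + 1 + length)
    return reverse_complement(chrom_seq[tss + 1:end])


# Made-up 12-base "chromosome" with a 3-base promoter window.
toy_chrom = "AAACCCGGGTTT"
print(promoter_sequence(toy_chrom, 6, "+", length=3))
print(promoter_sequence(toy_chrom, 8, "-", length=3))
```

Because the window is a parameter, the same function can return a shorter promoter than the 2 Kbp that was grabbed.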
