Mining Public Data for Insights into Human Disease. 11/16/2009 Baliga Lab Meeting Chris Plaisier. Utility of Gene Expression for Human Disease. Microarray Technology. Big Picture. Data Access. Gene Expression Microarray Repositories. Gene Expression Omnibus (GEO) Hosted by: NCBI
Mining Public Data for Insights into Human Disease 11/16/2009 Baliga Lab Meeting Chris Plaisier
Gene Expression Microarray Repositories
• Gene Expression Omnibus (GEO)
  • Hosted by: NCBI
  • Platform: All accepted
  • Normalization: Experiment-by-experiment basis
  • Access: R (GEOquery), EUtils
  • Meta-Information: GEOmetadb
• ArrayExpress
  • Hosted by: EMBL
  • Platform: All accepted
  • Normalization: Experiment-by-experiment basis
  • Access: Web interface, EMBL API
  • Meta-Information: ? (API)
• Many smaller repositories have more phenotypic information for specific diseases
  • Phenotypic information may be hard to access
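As a rough illustration of the EUtils access route listed above, the sketch below builds an ESearch query URL against the GEO DataSets database (`db=gds`). The function name `build_geo_search_url` and the example search term are assumptions made for illustration and are not from the talk; the code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term, retmax=20):
    """Build an ESearch URL for the GEO DataSets database (db=gds).

    `term` is an Entrez query string. The term used in the example
    below is an illustrative assumption, not a query from the slides.
    """
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the URL only; fetching it would require network access.
    print(build_geo_search_url("glioma AND Homo sapiens[Organism]"))
```

The returned URL can be opened in a browser or fetched with any HTTP client; the matching record IDs would then be passed to a follow-up E-utilities call to retrieve summaries.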
Samples Per Platform in GEO
(Figure: samples per platform in GEO. Recoverable labels: "Latest 3’ Affymetrix Array", HGU133 Plus 2.0, HGU133A.)
Affymetrix arrays account for ~67% of human gene expression data in public repositories.
Affymetrix Probesets
• GeneChip U133 Plus 2.0 Array: >54,000 probesets (image stored as CEL file)
• Probeset: 11 probe pairs
• Probe pair: perfect match probe and mismatch probe
• Probe: 25 nucleotides
Removing Mis-Targeted and Non-Specific Probes
• Normally: CEL file + CDF file (comes from Affymetrix) → intensities
• Alternative: CEL file + alternative CDF file (thoroughly cleaned) → intensities
Zhang, et al. 2005
PANP: Presence/Absence Filtering
• Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution
• NSMPs are designed to hybridize to the strand opposite the expressed strand
• Use the background distribution from these NSMPs to threshold the entire dataset
• Output is a call for each gene on each array
• Calls are:
  • P = Present
  • M = Marginal
  • A = Absent
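The thresholding idea above can be sketched as follows. This is a simplified stand-in, not the PANP package itself: it computes an empirical p-value for each expression value as the fraction of NSMP background intensities that exceed it, then maps that p-value to a call. The function name `panp_calls` and the cutoffs of 0.01 and 0.02 are assumptions chosen for illustration.

```python
from bisect import bisect_right


def panp_calls(expression, nsmp_background, tight=0.01, loose=0.02):
    """Return a P/M/A call for each expression value.

    The empirical p-value of a value is the fraction of NSMP background
    intensities strictly greater than it. The cutoffs are illustrative
    assumptions, not values taken from the slides.
    """
    background = sorted(nsmp_background)
    n = len(background)
    if n == 0:
        raise ValueError("need at least one NSMP intensity")
    calls = []
    for value in expression:
        # Number of background intensities less than or equal to value.
        at_or_below = bisect_right(background, value)
        p_value = (n - at_or_below) / n
        if p_value < tight:
            calls.append("P")  # clearly above background
        elif p_value < loose:
            calls.append("M")  # borderline
        else:
            calls.append("A")  # indistinguishable from background
    return calls
```

Running this once per array, with that array's own NSMP intensities as the background, would give one call per gene per array, matching the output described above.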
Identifying Present Genes
• Filter out genes called absent in ≥ 50% of arrays
  • Whole dataset
  • Subsets
• Only present genes are utilized in future analyses
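A minimal sketch of this filter, assuming the calls are stored as a dictionary mapping each gene to its list of per-array P/M/A calls. The function name `filter_present_genes` and the choice to count marginal (M) calls as not absent are assumptions; the slides do not say how marginal calls are treated.

```python
def filter_present_genes(calls_by_gene, max_absent_fraction=0.5):
    """Keep genes whose fraction of 'A' calls is below the threshold.

    Genes with an absent fraction >= max_absent_fraction are dropped.
    Marginal ('M') calls are counted as not absent (an assumption).
    """
    kept = []
    for gene, calls in calls_by_gene.items():
        if not calls:
            continue  # no arrays for this gene, nothing to judge
        absent_fraction = sum(1 for c in calls if c == "A") / len(calls)
        if absent_fraction < max_absent_fraction:
            kept.append(gene)
    return kept
```

To filter on a subset rather than the whole dataset, the same function can be applied to calls restricted to the arrays in that subset.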
Removing Redundancy
• Collapse Affymetrix Probeset IDs to EntrezIDs
• Test for correlation between probesets that map to the same EntrezID
  • If correlation is ≥ 0.8, combine the probesets
  • If not, leave them separate
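One way the collapsing step might look for the probesets of a single gene is sketched below. The slides do not say how probesets are combined or how more than two probesets are handled, so two assumptions are made here: probesets are grouped greedily (a probeset joins the first group in which it correlates at ≥ 0.8 with every member), and each group is combined by averaging. The names `pearson` and `collapse_probesets` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def collapse_probesets(profiles, threshold=0.8):
    """Group the probesets of one gene and average each group.

    `profiles` maps probeset ID -> list of expression values (one per
    sample). Greedy grouping and averaging are assumptions made for
    this sketch.
    """
    groups = []  # each group is a list of probeset IDs
    for probeset_id in profiles:
        for group in groups:
            # Join the first group whose members all correlate with it.
            if all(pearson(profiles[probeset_id], profiles[m]) >= threshold
                   for m in group):
                group.append(probeset_id)
                break
        else:
            groups.append([probeset_id])
    collapsed = {}
    for group in groups:
        n_samples = len(profiles[group[0]])
        collapsed["+".join(group)] = [
            sum(profiles[m][i] for m in group) / len(group)
            for i in range(n_samples)
        ]
    return collapsed
```

Probesets that do not reach the threshold with any existing group stay as their own entries, which corresponds to leaving them separate.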
Pre-Processing Pipeline
(Figure: pipeline diagram. The legend distinguishes steps implemented in R from steps implemented in Python.)
Glioma: A Deadly Brain Cancer (Image: Wikimedia Commons)
Brain Anatomy (Image: Wikimedia Commons)
Glioma
Gliomas account for 40% of all brain tumors and 78% of malignant brain tumors. Buckner et al., 2007
Glioma Survival
(Figure: glioma survival at 5 years and 10 years. Source: http://www.neurooncology.ucla.edu/)
Repository of Molecular Brain Neoplasia Data (REMBRANDT)
• REMBRANDT (Madhavan et al., 2009)
• Currently 257 individual specimens
  • Glioblastoma multiforme (GBM) = 110
  • Astrocytoma = 50
  • Oligodendroglioma = 55
  • Mixed = 21
  • Non-Tumor = 21
• Phenotypes
  • Tumor type: GBM, Astrocytoma, etc.
  • WHO Grade: 176 individuals
  • Age: 253 individuals
  • Sex: 250 individuals (partially inferred using Y chromosome genes)
  • Survival (days post diagnosis): 169 individuals
REMBRANDT: Chromosome Y Expression
(Figure: samples clustered by sex-specific gene expression, labeled Female and Male.)
• 8 males cluster with females
• 4 females cluster with males
• Male-to-female conversions should be more common than the reverse, because samples from females are not expected to express Y chromosome genes.
REMBRANDT: Chr. Y Expression – Intelligent Reassignment
(Figure: samples clustered by sex-specific gene expression, labeled Female and Male.)
• Intelligent reassignment: if the previous sex call places a sample in the other group, the call is set to NA
• All unknowns are given a call
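The reassignment rule can be written as a small function. This is a sketch under stated assumptions: the recorded sex is `'M'`, `'F'`, or `None` for unknown, and the expression-based call (from Y chromosome gene expression) is always `'M'` or `'F'`. The function name `reassign_sex` is hypothetical.

```python
def reassign_sex(recorded, inferred):
    """Apply the reassignment rule to one sample.

    recorded: 'M', 'F', or None (unknown), as annotated in the dataset.
    inferred: 'M' or 'F', from Y chromosome gene expression.
    Returns the final call, or None (NA) when the two disagree.
    """
    if recorded is None:
        return inferred  # unknowns are given the expression-based call
    if recorded != inferred:
        return None  # conflicting evidence: set to NA
    return recorded  # annotation and expression agree


if __name__ == "__main__":
    samples = [("M", "M"), ("M", "F"), (None, "F")]
    print([reassign_sex(r, i) for r, i in samples])
```

Setting conflicting samples to NA rather than overwriting them keeps mislabeled or ambiguous samples out of any downstream test of sex effects.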
Progression of Astrocytic Glioma Furnari, et al. (2007)
Modeling Glioma
• Increasing metastatic potential and severity of glioma could be modeled using this simple schema
• Correlation of model to survival post diagnosis is -0.68
(Figure: progression schema with stages coded 0, 1, 2.)
Exploring Meta-Information
• Age explains 31% of the variance in survival post diagnosis
• Age explains 25% of the variance in the progression model
• Sex does not have a significant effect on either survival or the progression model
  • Yet it is known that glioblastoma is slightly more common in men than in women
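If "explains" is read as the R² of a simple linear regression (an assumption; the slides do not state the method), then for a single predictor R² equals the squared Pearson correlation. Under the same reading, the -0.68 correlation reported for the progression model would correspond to roughly 46% of the variance in survival. The sketch below shows the calculation; the function names and the toy numbers in the usage example are invented for illustration and are not REMBRANDT data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def variance_explained(x, y):
    """R^2 of a one-predictor linear regression of y on x."""
    return pearson(x, y) ** 2


if __name__ == "__main__":
    # Toy values only: progression stage (0, 1, 2) vs. survival in days.
    stage = [0, 0, 1, 1, 2, 2]
    survival = [2100, 1800, 1500, 900, 700, 300]
    print(pearson(stage, survival), variance_explained(stage, survival))
    # Squaring the correlation reported on the slide:
    print(round((-0.68) ** 2, 2))
```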
Summary
• Ample dataset with a good amount of meta-information
• Ready for dimensionality reduction and network inference!
Likely Issues
• Size of eukaryotic genomes
• Added complexity of regulatory regions
• Tissue and cell type heterogeneity
• Patient genetic and environmental heterogeneity
Solutions
• Pre-process genomic sequences
• Reduce data complexity by collapsing redundancies
• Utilize filters that select for only the most variant genes
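The last solution, filtering for the most variant genes, can be sketched as a ranking by per-gene variance. The function name `top_variant_genes`, the use of population variance, and the fixed `n_top` cutoff are assumptions; the slides do not specify the variability measure or how many genes are kept.

```python
from statistics import pvariance


def top_variant_genes(expression, n_top):
    """Return the n_top gene IDs with the largest expression variance.

    `expression` maps gene ID -> list of expression values across
    samples. Population variance is used here as an assumption.
    """
    ranked = sorted(expression,
                    key=lambda gene: pvariance(expression[gene]),
                    reverse=True)
    return ranked[:n_top]


if __name__ == "__main__":
    toy = {"g1": [1, 1, 1, 1], "g2": [0, 10, 0, 10], "g3": [4, 5, 6, 5]}
    print(top_variant_genes(toy, 2))
```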
Regulatory Regions
• Promoter: transcription factor binding sites (6-12 bp motifs)
• 3’ UTR: miRNA binding sites (4-9 bp motifs)
• No set length for promoters in eukaryotes. Grabbing 2 Kbp, so we can use 2 Kbp or smaller.
• Median 3’ UTR length is 831 bp
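A minimal sketch of how a 2 Kbp promoter window might be computed from a transcription start site (TSS), taking strand into account. The coordinate convention (0-based, half-open intervals, with the TSS being the first transcribed base) and the function name `promoter_window` are assumptions made for this example.

```python
def promoter_window(tss, strand, length=2000):
    """Return (start, end) of the region upstream of a TSS.

    Coordinates are 0-based and half-open (an assumption). `tss` is
    the position of the first transcribed base. The window is clipped
    at the chromosome start but not at the chromosome end, since the
    chromosome length is not known here.
    """
    if strand == "+":
        # Upstream lies at lower coordinates on the plus strand.
        return max(0, tss - length), tss
    if strand == "-":
        # Upstream lies at higher coordinates on the minus strand.
        return tss + 1, tss + 1 + length
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(promoter_window(5000, "+"))
    print(promoter_window(5000, "-"))
```

Extracting the full 2 Kbp once means shorter promoter definitions can be taken later by trimming the stored sequence, without going back to the genome.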