Data Management and Mining in BioArray Informatics. Prof. Yike Guo, Dept. of Computing, Imperial College, London. Goal: Understand the basic bioarray technology, including microarray technology for gene expression, protein chips, NMR spectroscopy and other high-throughput devices
Data Management and Mining in BioArray Informatics Prof. Yike Guo Dept. of Computing, Imperial College, London
Goal:
• Understand the basic bioarray technology, including microarray technology for gene expression, protein chips, NMR spectroscopy and other high-throughput devices
• Learn the basic analytical technology and its applications to bioarray information
• Learn the processes for processing and analysing bioarray data (e.g. gene expression analysis)
Lecture Overview
• Lecture One: BioArray Informatics Introduction
• Lecture Two: BioArray Technology
• Lecture Three: Analysis Technology (1) - Data Normalisation and Transformation
• Lecture Four: Analysis Technology (2) - Clustering and Classification
• Lecture Five: Analysis Technology (3) - Multivariate Statistics
• Lecture Six: Analysis Applications (1) - Gene Expression Analysis
• Lecture Seven: Analysis Applications (2) - Integrative Analysis of BioArray Data
BioArray Informatics: Integrative Analysis of BioArray Data within the Biological Context
[Figure: kinds of biological data surrounding bioarray data. Labels: sequences (e.g. ATGCAAGTCCCT...), alignments, secondary structure, tertiary structure, polymorphism, expression patterns, receptors, signals, pathways, physiology, patient records, epidemiology, linkage maps, cytogenetic maps, physical maps]
Functional -Omics Analysis
[Figure: the "real world" and the "-omics world" side by side. Real world: "inputs" (noxious agent/stressor) lead to "outputs" ("biological end-points": pathology, altered physiology and metabolism). -Omics world: gene profile, protein profile and metabolic profile, each plotted against time]
Dynamics in BioArray Informatics
[Figure labels: DNA, RNA, Protein, Metabolites, Interactions, Environment, Growth rate, Expression]
A mathematical model
[Figure: forwards-propagated correlations. Labels: mRNA, protein, metabolites, time, event]
BioArray Provides the Means for Revealing the Interaction Relations
[Figure: Gene, Protein, Receptor and Ligand linked by the numbered relations below]
1 - Genes can be homologs
2 - A gene encodes a protein
3 - A protein can regulate the expression of a gene
4 - A protein phosphorylates another protein
5 - A protein binds to another protein
6 - A protein lyses another protein
7 - Proteins can sometimes be receptors
8 - Receptors bind a ligand
9 - Receptors (if bound) activate other proteins
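One way to hold these nine relation types in software is as a graph whose edges carry the relation number. The sketch below is only an illustration under that assumption: the class name `InteractionGraph` and the entity names (`geneA`, `proteinA`, ...) are hypothetical and do not come from the lecture.

```python
from collections import defaultdict

# Relation types 1-9, numbered as on the slide.
RELATIONS = {
    1: "gene is a homolog of gene",
    2: "gene encodes protein",
    3: "protein regulates expression of gene",
    4: "protein phosphorylates protein",
    5: "protein binds protein",
    6: "protein lyses protein",
    7: "protein acts as receptor",
    8: "receptor binds ligand",
    9: "bound receptor activates protein",
}


class InteractionGraph:
    """Directed graph whose edges are labelled with a relation number."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation type: {relation}")
        self._edges[source].append((relation, target))

    def targets(self, source, relation):
        """Return every target linked to `source` by `relation`."""
        return [t for r, t in self._edges[source] if r == relation]


# Hypothetical entities, for illustration only.
graph = InteractionGraph()
graph.add("geneA", 2, "proteinA")     # geneA encodes proteinA
graph.add("proteinA", 3, "geneB")     # proteinA regulates geneB
graph.add("proteinA", 5, "proteinC")  # proteinA binds proteinC

regulated = graph.targets("proteinA", 3)
```

Keeping the relation number on the edge means a question such as "which genes does this protein regulate?" becomes a lookup filtered by type.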
BioArray: Quantitative Measurement of Biological Concepts
1. Microarrays: ~1000 bp hybridization of experiment and control samples against an ORF
• R/G ratios
• R, G values
• quality indicators
2. Affymetrix: 25-bp hybridization of 25-mers (PM and MM probes) against an ORF
• Averaged PM-MM
• "presence"
• feature statistics
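The two headline quantities on this slide can be sketched in a few lines. This is a minimal illustration, not any vendor's algorithm: the function names and the intensity values are made up, and taking log2 of the R/G ratio is a common convention added here as an assumption (the slide only says "R/G ratios").

```python
import math


def averaged_pm_mm(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


def log2_ratio(red, green):
    """log2 of the red/green intensity ratio for one spot (assumed convention)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


# Made-up intensities, for illustration only.
avg_diff = averaged_pm_mm([520.0, 480.0, 610.0], [120.0, 150.0, 130.0])
ratio = log2_ratio(800.0, 200.0)  # red is four times green
```

On the log2 scale, a two-fold increase and a two-fold decrease sit at +1 and -1, which makes up- and down-regulation symmetric.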
Quantitative Analysis
• Reproducibility
• Confidence intervals to find significant deviations
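As a rough illustration of using confidence intervals to find significant deviations, the sketch below builds an interval around the mean of replicate measurements and checks whether a reference value falls outside it. It assumes a normal approximation, which is optimistic for the handful of replicates typical of array experiments (a t-distribution would be the more careful choice). The function names and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(replicates, level=0.95):
    """Normal-approximation confidence interval for the mean of replicates."""
    n = len(replicates)
    if n < 2:
        raise ValueError("need at least two replicates")
    centre = mean(replicates)
    sem = stdev(replicates) / n ** 0.5         # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    return centre - z * sem, centre + z * sem


def deviates_significantly(replicates, reference=0.0, level=0.95):
    """True when `reference` lies outside the confidence interval."""
    low, high = confidence_interval(replicates, level)
    return not (low <= reference <= high)


# Made-up log ratios from four replicate hybridisations.
changed = deviates_significantly([1.9, 2.1, 2.0, 2.2])      # far from 0
unchanged = deviates_significantly([-0.3, 0.2, 0.1, -0.1])  # scattered around 0
```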
BioArray Informatics: BioArray is the data, everything else is Informatics
• Data Engineering
• Data Warehousing
• Data Integration
• Data Analysis
• Knowledge Discovery
• Discovery Integration
• Discovery Validation
• Knowledge Integration
• Knowledge Warehousing
Data Warehousing:
[Figure: data sources (external data sources and operational data sources) feeding the data warehouse. Labels: KEGG, Unigene, Genbank, Sample & Clinical Data, BioArray Data, Experimental/Sample Database, Expression Database, Function Annotations, Structure Annotations]
Data Schema in Warehousing: A Gene Expression Example
[Figure: Gene Expression Warehouse schema. Entity labels: Affy Fragment, Known Gene, Sequence, Sequence Cluster, Protein, Enzyme, Pathway, Metabolite, SNP, Disease. Source labels: ExPASy SwissProt, ExPASy Enzyme, PDB, LocusLink, MGD, SPAD, NCBI dbSNP, UniGene, OMIM, Genbank, KEGG, NMR]
A Workflow of Gene Expression Database (GXDW)
[Figure stages: Warehousing, Queries, Data Reduction, Output]
Queries:
• Comparisons between 2 samples
• Comparisons between multiple samples
Data Reduction:
• Set Fold Change (e.g., > 2X)
• Set higher avg difference value (e.g., > 200)
• A->P / P->A stringency (e.g., 80%)
Output:
• Profile Report
• Data in user-defined analysis dataset
• Visualisation
• Advanced Gene Expression Analysis
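The data-reduction thresholds above can be read as simple filters. The sketch below is one possible reading, not the GXDW implementation: it assumes "higher avg difference" means the larger of the two samples' averaged PM-MM values, and that A->P / P->A stringency is the fraction of paired comparisons in which the absent/present call flips. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    avg_diff: float  # averaged PM-MM for one gene in one sample
    call: str        # "P" (present) or "A" (absent)


def passes_pairwise(a, b, min_fold=2.0, min_avg_diff=200.0):
    """Keep a gene when the higher value and the fold change both clear the cut-offs."""
    higher = max(a.avg_diff, b.avg_diff)
    lower = min(a.avg_diff, b.avg_diff)
    if higher <= min_avg_diff:
        return False
    if lower <= 0:
        # Assumption: a non-positive baseline counts as an unbounded fold change.
        return True
    return higher / lower > min_fold


def call_change_fraction(baseline_calls, experiment_calls):
    """Fraction of paired comparisons where the call flips (A->P or P->A)."""
    pairs = list(zip(baseline_calls, experiment_calls))
    if not pairs:
        raise ValueError("need at least one comparison")
    return sum(1 for b, e in pairs if b != e) / len(pairs)


# Made-up measurements for three genes in two samples.
kept = passes_pairwise(Measurement(100.0, "A"), Measurement(450.0, "P"))
small_change = passes_pairwise(Measurement(300.0, "P"), Measurement(400.0, "P"))
too_weak = passes_pairwise(Measurement(50.0, "A"), Measurement(150.0, "P"))

# Four of five comparisons flip from absent to present: exactly 80%.
stringent = call_change_fraction("AAAAA", "PPPPA") >= 0.8
```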
Queries, Queries...
• Query to the data
• Which genes are linked?
• Which genes are expressed similarly to my gene XYZ?
• Which genes are co-expressed in differing conditions?
• Classification (of tumors, diseased tissues etc.): which patterns are characteristic for a certain class of samples, and which genes are involved?
• Functional classification of genes: are changes clustered in particular classes?
• Metabolic pathway information: is a certain pathway, or route in a pathway, affected?
• Disease information and clinical follow-up: correlation to expression patterns
• Phenotype information for mutants: are there correlations between particular phenotypes and expression patterns?
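A query such as "which genes are expressed similarly to my gene XYZ?" is often answered by ranking genes on the correlation of their expression profiles. The sketch below uses Pearson correlation as one common choice; the lecture does not say which similarity measure is intended, and the gene names and profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / math.sqrt(vx * vy)


def most_similar(profiles, query, top=5):
    """Rank all other genes by correlation with the query gene, best first."""
    target = profiles[query]
    scored = [(pearson(target, p), g) for g, p in profiles.items() if g != query]
    scored.sort(reverse=True)
    return [(gene, r) for r, gene in scored[:top]]


# Invented profiles over four conditions.
profiles = {
    "XYZ": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.1, 6.2, 7.9],  # rises with XYZ
    "geneB": [4.0, 3.0, 2.0, 1.0],  # mirror image of XYZ
    "geneC": [1.0, 1.0, 1.0, 1.0],  # flat
}
ranked = most_similar(profiles, "XYZ", top=2)
```

Ranking on the absolute value of the correlation instead would also surface anti-correlated genes such as `geneB`, which can be just as interesting biologically.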
Gene Expression Data Analysis Work Flow
[Figure stages: Data in analysis, Interactive Analysis Procedures, Knowledge Deliverables]
Interactive Analysis Procedures:
• Cluster by genes
• Study outliers
• Correlate clinical measurements
• Literature analysis
• Time course analysis
Knowledge Deliverables (examples, not exhaustive):
• Defined subsets of genes
• Classic drug targets
• Known disease association
• Cross species indices
(Un)fortunately, scientists never think linearly
• Why are those genes co-expressed?
• What do their protein products do?
• What are the common regulatory motifs of a co-expressed gene set?
• Can we patent them?
• Do we know which metabolic pathway they are in? If there is none, can I synthesise one?
• Are there HTS results for any proteins in the pathway?
• Are there any compounds in the HTS library that hit selectively and consistently against those proteins?
• Which ones have good activity, availability and toxicity?
Advanced Analysis
Discovery Annotation and Validation
• e.g. Annotating a set of co-expressed genes with some conserved regulatory motifs
• e.g. Scoring a co-expression pattern with pathways
• e.g. Literature analysis to annotate biological semantics
Integrative Analysis
• e.g. Multi-modality Analysis
• e.g. Cross Annotation of Discovered Patterns
Modelling and Simulation
• e.g. Pathway Synthesis
• e.g. Virtual Cell Modelling
P1 Pathway Scoring
Analysis of Gene Expression Data with Pathway Scores: Our Approach
[Figure label: GPE-Score(Pathway)]
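The transcript does not define the GPE-Score, so the sketch below is not that score. As a stand-in, it shows a widely used way to score a pathway against a set of selected genes: a hypergeometric over-representation p-value. The function name, parameters and counts are all assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, selected_genes, overlap):
    """P(X >= overlap) when drawing `selected_genes` from `total_genes`,
    of which `pathway_genes` belong to the pathway (hypergeometric tail)."""
    if not 0 <= overlap <= min(pathway_genes, selected_genes):
        raise ValueError("overlap is out of range")
    upper = min(pathway_genes, selected_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, selected_genes)


# Made-up counts: 6000 genes on the array, 40 in the pathway,
# 100 in a co-expressed cluster, 8 of them in the pathway.
p = enrichment_p_value(6000, 40, 100, 8)
```

A small p-value suggests the cluster contains more pathway members than chance would give; with many pathways tested, the p-values would also need a multiple-testing correction.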