Functional annotation and network reconstruction through cross-platform integration of microarray data

X. J. Zhou et al., 2005

Challenges in microarray data analysis • Integration of multiple microarray data sets. • Different platforms, e.g. cDNA arrays, Affymetrix arrays • Alternative experimental parameters • Identification of functionally related genes that do not have similar expression patterns. • Reconstruction of transcriptional regulatory networks. • It is difficult to elucidate the cooperativity between transcription factors (TFs) because changes in their expression are often subtle and their activities are often controlled at levels other than expression.

Data pre-processing • Classify the 618 expression profiles into 39 data sets. Each data set contains expression profiles measured under related conditions. • 19 cDNA data sets from SMD • 4 Affymetrix data sets from GEO • 16 data sets from Rosetta

19 SMD data sets • Alpha factor release • cdc15 block release • DTT Exposure • Elutriation • Forkhead regulation • Gamma radiation • Menadione exposure • DNA damage (MMS) response • Nitrogen depletion • Nutrition limitation • Osmotic shock • SIR proteins (Chromatin Silencing) • Sorbitol effects • H2O2 response • Heat shock • Heat steady • CellCycle Factor • YPD Stationary phase • Zinc homoeostasis. These data sets correspond to 19 SMD subcategories.

4 GEO data sets • Aging • Chitin synthesis • Fermentation time course • Ume6 regulon

16 Rosetta data sets • Cell cycle control • Cell wall organization • Chromatin assembly • Ion homeostasis • Nucleotide metabolism • Organelle biogenesis • Perception of external stimulus • Protein biosynthesis • Protein degradation • Protein metabolism • Protein phosphorylation • Protein transport • Pseudohyphal growth • Steroid metabolism • Amino Acid Starvation • MAPK pathway. Classification is based on the Gene Ontology (GO) biological process categories of the deleted genes.

The idea: 2nd-order expression correlation • 1st-order expression correlation • Correlation of the expression patterns of two genes within one data set • For each pair of genes, a vector of length n is obtained (the 1st-order expression correlation profile), where n is the number of data sets. • 2nd-order expression correlation • Correlation between the 1st-order expression correlation profiles of two gene pairs
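The two levels of correlation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Pearson correlation at both levels, and the function names and the data layout (one genes x conditions array per data set, with rows in the same gene order) are hypothetical choices made for the example.

```python
import numpy as np

def first_order_profile(datasets, gene_a, gene_b):
    """Correlation of two genes within each data set.

    datasets: list of 2-D arrays (genes x conditions), one per data set.
    Returns a vector of length n, where n is the number of data sets.
    Pearson correlation is an assumption of this sketch.
    """
    return np.array([np.corrcoef(d[gene_a], d[gene_b])[0, 1] for d in datasets])

def second_order_correlation(profile_1, profile_2):
    """Correlation between two 1st-order correlation profiles."""
    return np.corrcoef(profile_1, profile_2)[0, 1]

# Toy data (made up for illustration): 3 data sets, 4 genes, 3 conditions each.
datasets = [
    np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 4, 6]], dtype=float),
    np.array([[1, 2, 3], [3, 2, 1], [1, 2, 3], [3, 2, 1]], dtype=float),
    np.array([[1, 2, 3], [1, 3, 2], [1, 2, 3], [2, 6, 4]], dtype=float),
]
pair_1 = first_order_profile(datasets, 0, 1)   # profile of gene pair (0, 1)
pair_2 = first_order_profile(datasets, 2, 3)   # profile of gene pair (2, 3)
score = second_order_correlation(pair_1, pair_2)
```

In this toy input, each pair is co-expressed in the first data set, anti-correlated in the second and only weakly correlated in the third, so the two profiles rise and fall together and the 2nd-order correlation is high.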

An example The overall expression similarity between the two gene pairs is not significantly high. However, their 1st-order expression correlation profiles exhibit high correlation, that is, the four genes have high 2nd-order expression correlation.

Clustering functionally related genes • Procedure • Identification of doublets • A doublet is a pair of genes that is tightly co-expressed in multiple data sets. • Clustering of doublets based on their 1st-order expression correlation profiles • Results • 72 of the 100 tightest clusters are functionally homogeneous.
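The doublet-identification step might look like the sketch below. The slides do not state what counts as "tightly co-expressed" or how many data sets are required, so `r_min` and `min_sets` are hypothetical parameters with placeholder defaults, and Pearson correlation is again an assumption. The returned profiles are the vectors that a clustering method would then group.

```python
from itertools import combinations
import numpy as np

def find_doublets(datasets, r_min=0.7, min_sets=2):
    """Return {(gene_a, gene_b): 1st-order correlation profile} for doublets.

    datasets: list of 2-D arrays (genes x conditions) sharing one gene order.
    r_min, min_sets: illustrative thresholds, not values from the source.
    """
    n_genes = datasets[0].shape[0]
    # One genes x genes correlation matrix per data set.
    corr = np.stack([np.corrcoef(d) for d in datasets])
    doublets = {}
    for a, b in combinations(range(n_genes), 2):
        profile = corr[:, a, b]
        # Keep the pair if it is tightly co-expressed in enough data sets.
        if np.sum(profile >= r_min) >= min_sets:
            doublets[(a, b)] = profile
    return doublets
```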

Gene function prediction • A prediction of function is made for a doublet only if it is in a tight cluster that includes at least three doublets and in which all remaining doublets share the same function. • 79 functions are assigned to 67 unknown genes. Some of these predictions have been verified by experimental studies.
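The prediction rule above can be written as a small check on one cluster. This is a simplified sketch: it assumes each doublet carries at most one function label (`None` meaning unknown), which is a hypothetical data model, and the function name and example labels are made up for illustration. Only the minimum cluster size of three comes from the slide.

```python
def predict_functions(cluster, annotations, min_size=3):
    """Apply the prediction rule to one tight cluster of doublets.

    cluster: list of doublet identifiers.
    annotations: dict mapping a doublet to its function label, or None if unknown.
    Returns {doublet: predicted function} for the unknown doublets, or {} when
    the cluster is too small or its annotated doublets disagree.
    """
    if len(cluster) < min_size:
        return {}
    known = {annotations[d] for d in cluster if annotations.get(d) is not None}
    if len(known) != 1:
        # No annotated doublet, or the annotated ones do not share one function.
        return {}
    function = next(iter(known))
    return {d: function for d in cluster if annotations.get(d) is None}

# Toy cluster: two annotated doublets agree, one doublet is unknown.
cluster = [("g1", "g2"), ("g3", "g4"), ("g5", "g6")]
annotations = {
    ("g1", "g2"): "protein biosynthesis",
    ("g3", "g4"): "protein biosynthesis",
    ("g5", "g6"): None,
}
predictions = predict_functions(cluster, annotations)
```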

Reconstruction of regulatory networks • A transcription module (TM) is defined as a set of genes that are regulated by the same transcription factor(s), based on genome-wide location data, and are coexpressed in multiple data sets. • 60 TMs are identified. • For each transcription module, a 1st-order average expression correlation profile (a vector whose length equals the number of data sets) is calculated. The profile of a module can be interpreted as the activity profile of the transcription factor(s) that regulate the module. • A 2nd-order expression correlation is calculated between the activity profiles of two transcription factors to measure the cooperativity between them. • 34 pairs show high 2nd-order correlation.
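A module's activity profile and the cooperativity score between two modules can be sketched as follows. The assumptions here are that the "average" is the mean pairwise Pearson correlation among the module's genes within each data set, and that cooperativity is the Pearson correlation of two such profiles. The function names are hypothetical.

```python
import numpy as np

def module_activity_profile(datasets, module_genes):
    """Mean pairwise correlation of a module's genes in each data set.

    datasets: list of 2-D arrays (genes x conditions) sharing one gene order.
    module_genes: row indices of the genes in the module (at least two).
    Returns a vector with one entry per data set.
    """
    idx = np.asarray(module_genes)
    upper = np.triu_indices(len(idx), k=1)   # each gene pair counted once
    return np.array([np.corrcoef(d[idx])[upper].mean() for d in datasets])

def cooperativity(profile_1, profile_2):
    """2nd-order correlation between two module activity profiles."""
    return np.corrcoef(profile_1, profile_2)[0, 1]
```

Two modules whose genes tighten and loosen their co-expression across the same data sets get a score near 1, which is how the slide's notion of cooperating transcription factors is read here.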

Clustering of modules

Annotation of TFs • The function of a TF is predicted from two lines of evidence: • The functions of known genes in its target module • The functions of known genes in other modules in the same module cluster • TF GAT3 is predicted to play a role in mitotic and meiotic cell cycles.
