From Expression to Regulation: the online analysis of microarray data

From Expression to Regulation: the online analysis of microarray data • Gert Thijs • K.U.Leuven, Belgium • ESAT-SCD

K.U.Leuven • Founded in 1425 • Situated in the center of Belgium • Some numbers: • 25.000 students • 2.500 researchers • 1.000 professors • University Hospital with 1.500 beds http://www.esat.kuleuven.ac.be/~dna/BioI/

ESAT-SCD • Faculty of Engineering • Mathematical engineering (120) • Systems and control • Data mining and Neural Nets • Biomedical signal processing • Telecommunications • Bioinformatics • Cryptography

Bioinformatics team • Research in medical informatics and bioinformatics • Research on algorithmic methods • Interdisciplinary team • 15 researchers (1 full professor, 4 post-docs, 10 Ph.D. students) • Engineering, physics, mathematics, computer science, biotech, and medicine • Collaborative research with molecular biologists and clinicians • VIB MicroArray Facility: primary analysis of microarray data • University of Gent-VIB, Plant Genetics: motif discovery • KUL-VIB, Center for Human Genetics • Neuronal development in mouse neurons • Targets of PLAG1 (pleiomorphic adenoma gene) • KUL, Obstetrics and Gynecology • Diagnosis of ovarian tumors from ultrasonography (IOTA) • Microarray analysis of ovarian tumor biopsies

Overview • Short introduction to microarrays • Exploratory analysis of microarray data • Clustering gene expression profiles • Upstream sequence retrieval • Motif finding in sets of co-expressed genes

cDNA microarrays • Collaboration with VIB microarray facility • 5000 cDNAs (genes, ESTs) spotted on array • Cy3, Cy5 labeling of samples • Hybridization (test, control) • Laser scanning & image analysis • Arabidopsis, mouse, and human

Microarray experiment • Collecting samples • Extracting mRNA • Labeling • Hybridizing • Scanning • Visualizing

Microarray production • Clones • Plasmid preparation • PCR amplification • Reordering • Spotting (zoom: pins)

From expression to regulation (workflow diagram with labels: Microarrays, Clustering, GenBank, A1234, Z4321, Blast, Gibbs sampler)

Exploratory data analysis

Data exploration • Subset selection based on • Gene Ontology functional classes • Keywords, gene names • Check the expression profiles of individual genes • Visualize the expression profiles of gene families • Link to upstream sequence retrieval

Gene Ontology

Subset selection

Profile inspection

Profile visualization

Sequence Retrieval

Clustering

Goal of clustering • Exploration of microarray data • Form coherent groups of • Genes • Patient samples (e.g., tumors) • Drug or toxin response • Study these groups to get insight into biological processes • Genes in the same cluster can have the same function or the same regulation

K-means • Initialization • Choose the number of clusters K and start from random positions for the K centers • Iteration • Assign points to the closest center • Move each center to the center of mass of the assigned points • Termination • Stop when the centers have converged or the maximum number of iterations is reached

K-means • Iteration 1

K-means • Iteration 3
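
The K-means loop (initialize K centers, assign points, move centers, stop on convergence or an iteration limit) can be sketched as follows. This is a minimal illustration, not the code behind the web tools: the names `k_means` and `squared_distance` are made up here, and the starting centers are drawn from the data points themselves, which is one common way to obtain "random positions".

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(profiles, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Initialization: use K randomly chosen profiles as starting centers
    # (an assumption; the slides only say "random positions").
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iterations):
        # Iteration, part 1: assign each point to the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        # Termination: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Iteration, part 2: move each center to the center of mass
        # of the points assigned to it (empty clusters keep their center).
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(profiles, k=2)
```

Every profile receives a cluster label, including noisy ones; K has to be chosen before the run.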

Hierarchical clustering • Construction of a gene tree based on the correlation matrix
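
A gene tree can be built from pairwise correlations roughly as below. The slide does not say which linkage rule is used, so average linkage on the distance 1 - Pearson correlation is an assumption, and `pearson` and `build_gene_tree` are illustrative names.

```python
import math

def pearson(a, b):
    """Pearson correlation of two profiles (undefined for constant profiles)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def build_gene_tree(profiles):
    """Agglomerative clustering; returns the tree as nested tuples of gene names."""
    names = sorted(profiles)
    # Distance matrix derived from the correlation matrix.
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    }
    # Each node is (subtree, list of leaf names under it).
    nodes = [(name, [name]) for name in names]

    def linkage(x, y):
        # Average linkage: mean distance over all leaf pairs (assumed rule).
        total = sum(dist[frozenset((a, b))] for a in x[1] for b in y[1])
        return total / (len(x[1]) * len(y[1]))

    while len(nodes) > 1:
        # Merge the two closest nodes.
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: linkage(nodes[ij[0]], nodes[ij[1]]),
        )
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)] + [merged]
    return nodes[0][0]

profiles = {
    "g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1], "g4": [8, 6, 4.5, 2],
}
tree = build_gene_tree(profiles)
```

In this toy input g1 and g2 rise together while g3 and g4 fall together, so the tree pairs them accordingly.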

Need for new clustering algorithms (figure: K-means clustering) • Noisy genes deteriorate the consistency of profiles in a cluster • All genes are forced into a cluster

Adaptive quality-based clustering • For discovery, biologists are looking for highly coherent, reliable clusters • Other needs for clustering microarray data • Fast + limited memory (need to analyze thousands of genes) • No need to specify number of clusters in advance • Few and intuitive parameters • AQBC = two-step algorithm • Cluster center localization • Cluster radius estimation with EM • Read more: • De Smet et al. (2002) Bioinformatics, in press.

Step 1: localization of cluster center

Step 2: re-estimation of cluster radius • Distances from the cluster center are randomly distributed, except for a small group of nearby points (= cluster elements) • Size of cluster can be estimated automatically by EM • Step 3: remove cluster points and look for a new cluster
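
The idea of separating cluster elements from background by EM can be illustrated with a deliberately simplified stand-in. The sketch below fits a two-component Gaussian mixture to one-dimensional distances; this is not the distance model of De Smet et al., and `estimate_cluster_members`, the `quality` threshold of 0.95 and the starting values are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def estimate_cluster_members(distances, quality=0.95, iterations=50):
    """Indices of points whose posterior cluster probability is >= quality."""
    low, high = min(distances), max(distances)
    means = [low, high]  # component 0 = cluster, component 1 = background
    sds = [max((high - low) / 4.0, 1e-3)] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for each distance.
        resp = []
        for d in distances:
            p = [weights[k] * normal_pdf(d, means[k], sds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and spread of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            means[k] = sum(r[k] * d for r, d in zip(resp, distances)) / nk
            var = sum(r[k] * (d - means[k]) ** 2 for r, d in zip(resp, distances)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
            weights[k] = nk / len(distances)
    return [i for i, r in enumerate(resp) if r[0] >= quality]

# Four points close to the center, six background points further away.
distances = [0.10, 0.12, 0.15, 0.20, 0.90, 1.00, 1.05, 1.10, 1.20, 1.30]
members = estimate_cluster_members(distances)
```

The largest distance among the returned members plays the role of the cluster radius; points beyond it are left for later rounds or rejected as noise.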

Comparison with K-means
A.Q.B.C. • User-defined parameters • Quality criterion (QC): % that defines how significantly a cluster should be separated from the background • Minimal number of genes in a cluster • Advantages • Outcome not sensitive to parameter setting • Number of clusters is determined automatically • Based on QC an optimal radius is calculated for each cluster • Set of smaller clusters containing genes with highly similar expression profiles (fewer false positives) • Noisy genes are rejected • Disadvantages • Some information is rejected: clusters too small
K-means • User-defined parameters • Number of clusters • Number of iterations • Advantages • Fewer true positives are rejected • Disadvantages • Outcome sensitive to parameter setting • Extensive fine-tuning required to find optimal number of clusters • Separation and merging of clusters based on visual inspection and not on statistical foundation • No quality criterion: more false positives • All genes will be clustered (noisy clusters)

Adaptive Quality-Based Clustering Web Interface

Cluster results page • Upstream Sequence Retrieval

Upstream sequence retrieval

Upstream Sequence Retrieval • Identify all genes in the cluster based on the given accession number and gene name. • Delineate the upstream region based on sequence annotation. • Check for the presence of an annotated upstream gene. • IF upstream gene found THEN select intergenic region ELSE BLAST the gene to find genomic DNA where the gene is annotated. • Parse BLAST reports to find intergenic regions. • Report results in GFF.
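
The IF/ELSE step can be sketched on toy annotation as below. The `Gene` record, the function names, the 2000 bp cap and the GFF source/type labels are hypothetical; returning `None` stands for the case where no upstream gene is annotated and the gene would have to be BLASTed against genomic DNA instead.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    seqid: str
    start: int   # 1-based, inclusive
    end: int
    strand: str  # "+" or "-"

def upstream_region(gene, annotated_genes, max_length=2000):
    """Intergenic region upstream of gene, or None if no upstream gene is annotated."""
    others = [g for g in annotated_genes
              if g.seqid == gene.seqid and g.name != gene.name]
    if gene.strand == "+":
        neighbours = [g for g in others if g.end < gene.start]
        if not neighbours:
            return None  # fall back to BLAST in the real pipeline
        start = max(max(g.end for g in neighbours) + 1, gene.start - max_length)
        end = gene.start - 1
    else:
        # On the reverse strand, "upstream" lies at higher coordinates.
        neighbours = [g for g in others if g.start > gene.end]
        if not neighbours:
            return None
        start = gene.end + 1
        end = min(min(g.start for g in neighbours) - 1, gene.end + max_length)
    return (start, end) if start <= end else None

def to_gff(gene, region):
    """One GFF line: nine tab-separated columns."""
    start, end = region
    return "\t".join([gene.seqid, "sketch", "upstream_region", str(start),
                      str(end), ".", gene.strand, ".", f"gene={gene.name}"])

genes = [
    Gene("A", "chr1", 1000, 2000, "+"),
    Gene("B", "chr1", 2500, 4000, "+"),
    Gene("C", "chr1", 9000, 9500, "-"),
]
region_b = upstream_region(genes[1], genes)
gff_line = to_gff(genes[1], region_b)
```

Gene B gets the intergenic stretch between A and B; genes A and C have no annotated upstream neighbour in this toy set and would go to the BLAST branch.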

Gene Identification

Selected sequences & genes to be blasted

Results of BLAST report parsing

Selected sequences

Motif Finding

Transcriptional regulation • Complex integration of multiple signals determines gene activity • Combinatorial control

Identifying regulatory elements from expression data • Cluster genes from microarray expression data to build clusters of co-expressed genes • Co-expressed genes may share regulatory mechanisms • Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana) • Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences

Upstream sequence model • Motifs are hidden in noisy background sequence. • Data set contains two types of sequences: • Sequences with one or more copies of the common motif. • Sequences with no copy of the common motif.

Motif Sampler • Algorithm based on the original Gibbs Sampling algorithm (Lawrence et al. 1993, Science 262:208-214) • Probabilistic sequence model • Changes and additions: • Use of higher-order background model. • Use of probability distribution to estimate number of copies. • Different motifs are found and masked in consecutive runs of the algorithm. • Read more: • Thijs et al. (2001) Bioinformatics 17(12), 1113-1122 • Thijs et al. (2002) J.Comp.Biol. 9(2), 447-464

Background model • Representation of DNA sequence by a higher-order Markov chain • A reliable model can be built from selected intergenic DNA sequences. • Intergenic sequence = non-coding region between two consecutive genes. • Only regions that contain the core promoter are selected. • (Diagram labels: intergenic region, core promoter, gene)
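
A higher-order Markov background model can be sketched as a table of conditional base probabilities given the preceding bases. The order of 3, the pseudocount, the uniform fallback for unseen contexts and the function names below are assumptions for illustration, not settings taken from the Motif Sampler.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from intergenic sequences."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {b: (base_counts[b] + pseudocount) / total
                          for b in ALPHABET}
    return model

def log_probability(seq, model, order=3):
    """Log-probability of seq under the model (the first `order` bases are skipped)."""
    uniform = 1.0 / len(ALPHABET)
    logp = 0.0
    for i in range(order, len(seq)):
        # Unseen contexts fall back to a uniform distribution (an assumption).
        p = model.get(seq[i - order:i], {}).get(seq[i], uniform)
        logp += math.log(p)
    return logp

model = train_background(["ACGTACGTACGTACGT"], order=2)
typical = log_probability("ACGTACGT", model, order=2)
atypical = log_probability("ACTTACTT", model, order=2)
```

A segment that resembles the training sequences scores higher than one that does not, which is what lets a motif stand out against the background.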

Algorithm: Initialization • Calculate background model score • Start from random set of motif positions • Create initial motif model

Algorithm: iterative procedure • Score sequences with current motif model • Calculate distribution • Sample new alignment position • Iterate for fixed number of steps

Algorithm: Convergence • Select best scoring positions from Wx to create motif and alignment
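
The three stages (initialization, iterative sampling, convergence) can be sketched with a plain site sampler in the spirit of Lawrence et al. (1993). This is a simplification and not the Motif Sampler itself: it assumes exactly one motif copy per sequence, uses a zeroth-order background instead of a higher-order one, and all names and defaults are illustrative. Input sequences may contain only A, C, G and T.

```python
import random

ALPHABET = "ACGT"

def motif_model(sequences, positions, width, skip=None, pseudocount=0.5):
    """Position-specific base frequencies built from the current motif sites."""
    counts = [dict.fromkeys(ALPHABET, pseudocount) for _ in range(width)]
    for index, (seq, pos) in enumerate(zip(sequences, positions)):
        if index == skip:
            continue  # leave out the sequence that is being resampled
        for j in range(width):
            counts[j][seq[pos + j]] += 1
    return [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]

def site_scores(seq, model, background):
    """Motif/background likelihood ratio for every possible start position."""
    width = len(model)
    scores = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            base = seq[start + j]
            ratio *= model[j][base] / background[base]
        scores.append(ratio)
    return scores

def gibbs_site_sampler(sequences, width, steps=100, seed=0):
    rng = random.Random(seed)
    # Initialization: background frequencies and random motif positions.
    joined = "".join(sequences)
    background = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in ALPHABET}
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    # Iterative procedure: score, turn scores into a distribution, sample.
    for _ in range(steps):
        for i, seq in enumerate(sequences):
            model = motif_model(sequences, positions, width, skip=i)
            scores = site_scores(seq, model, background)
            positions[i] = rng.choices(range(len(scores)), weights=scores)[0]
    # Convergence: keep the best-scoring position in each sequence.
    final_model = motif_model(sequences, positions, width)
    best = []
    for seq in sequences:
        scores = site_scores(seq, final_model, background)
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best, final_model

sequences = ["ACGTTGACGTCAGGCT", "TTGACGTCAACCGGTA",
             "GGCATTGACGTCATCA", "CATGCCATGACGTCAG"]
best_positions, final_model = gibbs_site_sampler(sequences, width=8)
```

Because the procedure is stochastic, different seeds can end in different alignments, so running it several times and comparing the results is a sensible precaution.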

Motif Sampler

Motif Sampler results page

Example: Plant wounding • 150 Arabidopsis genes • Mechanical plant wounding • 7 (or 8) time points over a 24h period • Adaptive quality-based clustering produces 8 clusters, of which 4 contain 5 or more genes. • Search for a motif of length 8 and a motif of length 12 in these 4 clusters • Reymond, P. et al. 2000. Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12(5): 707-20.

Results: Cluster 1

Results: Cluster 2
