T. M. Murali Department of Computer Science Virginia Tech Slides prepared by Arjun Krishnan

The State of Gene Function Prediction in Arabidopsis thaliana • Introduction to Computational Biology and Bioinformatics (CS 3824), October 11, 13, 2011

How a cell is wired • [Figure labels: Small molecules, Environment, DNA, mRNA, Protein, Regulatory RNA] • The dynamics of such interactions emerge as cellular processes and functions

Molecular interaction networks • How do the genes and their products interact to collectively perform a function? • [Figure labels: Gene G, 35 RPM, U2AF, Inhibitor, A, B]

Molecular (functional) interaction networks • A network containing genes connected to each other whenever they physically or functionally interact • Proteins that interact/co-complex (ribosomal, polymerase, etc.) • Transcription factors and their targets • Enzymes catalyzing different steps in the same metabolic pathway • Genes with correlated expression • Genes with similar phylogenetic profiles

Arabidopsis is the primary model organism for plants • Complex organization from the molecular to the whole-organism level. • A key challenge … • Understanding the cellular machinery that sustains this complexity. • In the current post-genomic era, a main aspect of this challenge is ‘gene function prediction’: • Identification of the functions of all the (~30,000) genes in the genome.

Extent of gene annotations in Arabidopsis • Total of ~30,000 genes in the genome • ~15% with some experimental annotation • ~8% with ‘expert’ annotation • ~13% with annotations based on manually curated computational analysis • ~14% with electronic annotations • Leaving ~50% of the genome without any annotation • Ashburner et al. (2000) Nat. Genet.; Swarbreck et al. (2008) Nucleic Acids Res.

Exploit high-throughput data • Integrating functional genomic data could lead to • Network models of gene interactions that resemble the underlying cellular map. • Typically these networks contain gene functional interactions • Connecting pairs of genes that participate in the same biological processes. • In such a network, the very place of a gene establishes the functional context of that gene. • ‘Guilt-by-association’: genes of unknown function can be assigned the functions of their annotated neighbors.
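
The guilt-by-association idea can be illustrated with a minimal sketch: score each unannotated gene by the weighted fraction of its neighbors that already carry the function. This is only one simple way to read the idea, not any specific algorithm evaluated in these slides. The function name `neighbor_vote`, the gene names, and the edge weights are all made up for illustration.

```python
from collections import defaultdict


def neighbor_vote(edges, annotated):
    """Score unannotated genes by the weighted fraction of annotated neighbors.

    edges: iterable of (gene_u, gene_v, weight) for an undirected network.
    annotated: set of genes already known to have the function.
    """
    adjacency = defaultdict(dict)
    for u, v, w in edges:
        adjacency[u][v] = w
        adjacency[v][u] = w

    scores = {}
    for gene, neighbors in adjacency.items():
        if gene in annotated:
            continue  # only genes of unknown function receive a score
        total = sum(neighbors.values())
        support = sum(w for n, w in neighbors.items() if n in annotated)
        scores[gene] = support / total if total else 0.0
    return scores


# Hypothetical toy network: g1 and g2 are annotated, the rest are unknown.
edges = [
    ("g1", "g2", 1.0),
    ("g1", "g3", 1.0),
    ("g2", "g3", 0.5),
    ("g3", "g4", 2.0),
    ("g4", "g5", 1.0),
]
annotated = {"g1", "g2"}
scores = neighbor_vote(edges, annotated)
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 3))
```

In this toy network only g3 touches annotated genes directly, so it is the only gene with a non-zero score; genes further away need a method that propagates annotations across the network.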

Functional interaction networks • Functional interaction network models have been developed for Arabidopsis. • Lee et al. (2010) Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. • Very comprehensive in terms of using and integrating datasets in other organisms for application in plants. • Integrated 24 datasets: 5 datasets from Arabidopsis and the rest from other models. • AraNet: 19,647 genes, 1,062,222 interactions.

Goal of this study … • We examine the state of network-based gene function prediction in Arabidopsis. • Evaluate the performance of multiple prediction algorithms on AraNet. • Assess the influence of the number of genes annotated to a function and the source of annotation evidence. • Compute the correlation of prediction performance with network properties. • Evaluate prediction performance for plant-specific functions.

Network-based gene function prediction algorithms • Two approaches: propagation of functional annotations across the network, or guilt-by-association using direct interactions • Use positive and negative examples: SinkSource, Hopfield, Local • Use only positive examples: FunctionalFlow – multiple phases, FunctionalFlow – 1 phase, Local+ • Each gene in the network
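
As a rough illustration of propagation with positive and negative examples, here is a minimal sketch in the spirit of SinkSource as it is usually described: positive examples are held at 1, negative examples at 0, and every other gene repeatedly takes the weighted average of its neighbors' scores. The slides do not spell out the update rule, so the details below (the fixed iteration count, the zero starting score, the name `sinksource`, and the toy network) are assumptions, not the implementation used in the study.

```python
def sinksource(adjacency, positives, negatives, iterations=100):
    """Propagate scores from fixed positive (1.0) and negative (0.0) genes.

    adjacency: dict mapping each gene to a dict {neighbor: edge weight};
               the network is undirected, so every edge appears in both directions.
    """
    score = {gene: 0.0 for gene in adjacency}
    for gene in positives:
        score[gene] = 1.0

    for _ in range(iterations):
        updated = dict(score)
        for gene, neighbors in adjacency.items():
            if gene in positives or gene in negatives:
                continue  # labeled genes keep their fixed scores
            total = sum(neighbors.values())
            if total > 0:
                updated[gene] = sum(w * score[n] for n, w in neighbors.items()) / total
        score = updated
    return score


# Hypothetical path network: p - a - b - n, all edge weights 1.
adjacency = {
    "p": {"a": 1.0},
    "a": {"p": 1.0, "b": 1.0},
    "b": {"a": 1.0, "n": 1.0},
    "n": {"b": 1.0},
}
score = sinksource(adjacency, positives={"p"}, negatives={"n"})
print({gene: round(value, 3) for gene, value in score.items()})
```

On this path the unlabeled gene next to the positive example settles near 2/3 and the one next to the negative example near 1/3, so ranking by score orders genes by how close they sit to known positives relative to known negatives.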

Network-based gene function prediction

Network-based gene function prediction • Function A • Function B

In this study … Sink Source • Precision: fraction of predictions that are correct • TP / (TP + FP) • Recall: fraction of known examples predicted correctly • TP / (TP + FN)
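
Written as ratios, precision is TP / (TP + FP) and recall is TP / (TP + FN). A small sketch of the computation follows; the gene sets are hypothetical and the helper name `precision_recall` is ours.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predictions against known genes."""
    tp = len(predicted & known)   # predicted and truly annotated
    fp = len(predicted - known)   # predicted but not annotated
    fn = len(known - predicted)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: 4 predictions, 3 known genes, 2 in common.
predicted = {"g1", "g2", "g3", "g4"}
known = {"g1", "g2", "g5"}
precision, recall = precision_recall(predicted, known)
print(precision, recall)
```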

Performance of different algorithms • Computational gene function prediction precedes and guides experimental validation • What we get is a ranked list of novel predictions • An experimenter would choose a manageable number of top-scoring predictions to pursue • Precision at the top of the prediction list • We choose precision at 20% recall (P20R) as the measure of performance
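
One plausible way to compute precision at 20% recall from a ranked prediction list is to walk down the list until 20% of the known genes have been recovered and report the precision at that point. The slides do not give the exact procedure (for example, how ties or interpolation are handled), so treat this as a sketch under that assumption; the function name and the toy data are hypothetical.

```python
def precision_at_recall(ranked, known, recall_level=0.2):
    """Precision at the first rank where recall reaches recall_level.

    ranked: genes ordered from highest to lowest prediction score.
    known: set of genes truly annotated with the function.
    """
    true_positives = 0
    for rank, gene in enumerate(ranked, start=1):
        if gene in known:
            true_positives += 1
            if true_positives / len(known) >= recall_level:
                return true_positives / rank
    return 0.0  # the requested recall level is never reached


# Hypothetical ranked list; 5 known genes, so 20% recall means 1 recovered gene.
ranked = ["g7", "g2", "g9", "g1", "g4", "g3"]
known = {"g1", "g2", "g3", "g4", "g5"}
p20r = precision_at_recall(ranked, known)
print(p20r)
```

Here the first known gene appears at rank 2, so one correct prediction out of two gives a P20R of 0.5.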

Performance of different algorithms • SinkSource (SS) seems to be better than the other algorithms • [Figure legend: 3rd quartile, median, 1st quartile; panel label: Using only annotations based on experimental/expert evidence] • What about the influence of the number of genes in a function?

Performance of different algorithms • Functions split into three groups (first, second, third), each containing ~125 functions • [Figure axis labels: Number of functions; Number of genes annotated with a function]

Performance of different algorithms • For ‘small’ functions, the algorithm does not matter, and using just experimental annotations is better when you know little about a function. • For ‘medium’ functions, SS is a little better and the benefit of ‘electronic’ evidence is mixed. • For ‘large’ functions • SS is clearly the best • Using all annotations is better

Performance of different algorithms • Wilcoxon test: SS vs. other algorithms • Two settings: all evidence codes (ECs) and sans IEA/ISS • Overall, SinkSource appears to be the best algorithm.

Correlation of performance with network properties • Performance on a particular function might depend on how its genes are organized / connected among themselves in the network. • Number of nodes • Number of components • Fraction of nodes in the largest connected component • Total edge weight • Weighted density • Average weighted degree • Average segregation

Correlation of performance with network properties

Correlation of performance with network properties • Number of nodes = 9 • Number of components = 3 • Fraction of nodes in the largest connected component = 4/9 • Total edge weight = 8 • Weighted density = 8/36 • Average weighted degree = 16/9
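
The values on this slide are consistent with the following definitions: weighted density is the total edge weight divided by the number of possible gene pairs, n(n-1)/2, and average weighted degree is twice the total edge weight divided by n. The sketch below computes these properties for a made-up network chosen to reproduce the slide's numbers (9 nodes, 3 components, 8 unit-weight edges). It is not the network drawn on the slide, which is not recoverable here, and the helper name `network_properties` is ours. It assumes at least two nodes.

```python
from collections import defaultdict
from fractions import Fraction


def network_properties(nodes, edges):
    """Topological properties of the subnetwork induced by one function's genes."""
    adjacency = defaultdict(dict)
    for u, v, w in edges:
        adjacency[u][v] = w
        adjacency[v][u] = w

    # Connected component sizes via depth-first search.
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)

    n = len(nodes)
    total_weight = sum(w for _, _, w in edges)
    return {
        "nodes": n,
        "components": len(sizes),
        "largest_component_fraction": Fraction(max(sizes), n),
        "total_edge_weight": total_weight,
        "weighted_density": Fraction(total_weight) / (n * (n - 1) // 2),
        "average_weighted_degree": Fraction(2 * total_weight) / n,
    }


# Hypothetical network: a 4-cycle, a triangle, and a single edge, all weights 1.
nodes = list(range(1, 10))
edges = [
    (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 1, 1),
    (5, 6, 1), (6, 7, 1), (5, 7, 1),
    (8, 9, 1),
]
props = network_properties(nodes, edges)
for name, value in props.items():
    print(name, value)
```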

Correlation of performance with network properties Functional modularity: Average Segregation

Correlation of performance with network properties Functional modularity: Average Segregation • Avg. seg = 8/22 • Avg. seg = 12/15

Correlation of performance with network properties • We have … • Vector of SS P20R values for each function • Vector of values of a particular topological property for each function • Spearman rank correlation between the two vectors • [Figure axis labels: P20R; Weighted density]
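
Spearman rank correlation is the Pearson correlation of the two vectors after each is replaced by its ranks, with tied values sharing their average rank. A dependency-free sketch follows; the two input vectors are invented purely to exercise the code and are not results from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-function values, for illustration only.
toy_p20r = [0.10, 0.35, 0.20, 0.80, 0.60]
toy_density = [0.01, 0.05, 0.02, 0.10, 0.30]
rho = spearman(toy_p20r, toy_density)
print(round(rho, 3))
```

Because only ranks matter, the measure captures any monotonic relationship between performance and a topological property, not just a linear one.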

Correlation of performance with network properties Spearman rank correlation

Performance on plant-specific functions • The underlying network is built based on data from multiple non-plant species • For ‘plant-specific’ functions • Performance is much worse compared to ‘conserved’ functions • Using only experimental annotations is better • For ‘conserved’ functions • Performance is better than that for all functions • Using all annotations is better • [Figure legend: 3rd quartile, median, 1st quartile; panel label: Using only annotations based on experimental/expert evidence]

Most predictable ‘conserved’ functions • protein folding • nucleotide transport • innate immunity • cytoskeleton organization, and • cell cycle

Least predictable ‘conserved’ functions Specialized functions • regulation of …

Most predictable ‘plant-specific’ functions Contribution from Arabidopsis datasets • cell wall modification • auxin/cytokinin signaling, and • photosynthesis

Least predictable ‘plant-specific’ functions • development, morphogenesis • pattern formation • phase transitions of various tissues, organs / growth stages

Conclusions • Evaluated the performance of various prediction algorithms on AraNet. • SinkSource is the overall best prediction algorithm. • Measured the influence of the number of genes annotated to a function and the source of annotation evidence. • All algorithms perform poorly when only a small number of genes are ‘known’ or when annotating very specific functions. • When only a small number of genes are ‘known’, use only experimentally verified annotations to make new predictions. • When a considerable number of genes are ‘known’, use all annotations to make new predictions.

Conclusions • Measured the correlation of performance with network properties • Several topological properties correlate well with performance. • ‘Average segregation’ has the strongest correlation.

Conclusions • Assessed performance on conserved/plant-specific functions • Performance on basic ‘conserved’ functions is better than that for all the functions. • Specialized ‘conserved’ functions are hard to predict. • Performance on ‘plant-specific’ functions is very poor. • This is also a consequence of the fact that ‘plant-specific’ functions generally have a small number of annotations.

Conclusions • Avenues for improvement in functional interaction networks • Build functional interaction networks that are based on a larger collection of plant datasets. • Rely as little as possible on data from other species. • Avenues for future experimental work • ‘Plant-specific’ functions and • Specialized ‘conserved’ functions.

Acknowledgements • Arjun Krishnan • Brett Tyler • Andy Pereira
