Andrej Bugrim GeneGo, Inc.

Andrej Bugrim GeneGo, Inc. Protein scoring based on significance in biological networks

Two problems of systems biology
• How to reconstruct condition-specific networks in a biologically robust way
• How to utilize reconstructed networks in day-to-day laboratory practice
We still need to answer questions centered on individual genes/proteins:
• Which genes are most important for a condition/disease?
• What are the best drug targets?
• What are the most robust biomarkers?

Sources of the problems
• Biological networks are highly interconnected because of hubs. Hubs almost always provide “shortest path” connectivity
• Multiple paths can be generated to connect a pair of nodes, with no way to discriminate between alternative hypotheses
• Resulting networks are often large and biologically intractable. It is hard to understand the roles of individual nodes

Some earlier solutions
• Use “canonical pathways” as the basis for reconstruction
  • Limited to known pathways
• Penalize hubs when reconstructing networks
  • Does not discriminate between individual hubs

Our solution
Find nodes that are significant in providing connectivity in a condition-specific dataset

Finding topologically significant nodes
[Figure: nodes A, B and C with their neighbors; legend: differentially expressed genes, non-differentially expressed genes]
• B is topologically significant: 4 out of 6 nodes regulated by B are differentially expressed, which is more than a random share
• C is not topologically significant: only 1 out of 6 nodes regulated by C is differentially expressed, which could be due to a random event
• In reality the algorithm also considers nodes beyond first-degree neighbors
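To illustrate why 4 of 6 counts as significant while 1 of 6 does not, the sketch below computes a simple binomial tail probability. The 10% background rate of differential expression is an assumption chosen only for illustration (the slide gives no background figure), and the binomial model is a simplification, not the scoring procedure of the algorithm itself.

```python
from math import comb


def binomial_tail(n, k, rate):
    """Probability of seeing k or more 'hits' among n neighbors
    when each neighbor is a hit independently with probability `rate`."""
    return sum(comb(n, x) * rate**x * (1 - rate) ** (n - x) for x in range(k, n + 1))


# ASSUMPTION: 10% of all genes are differentially expressed (illustrative only).
BACKGROUND_RATE = 0.10

p_b = binomial_tail(6, 4, BACKGROUND_RATE)  # node B: 4 of 6 targets are hits
p_c = binomial_tail(6, 1, BACKGROUND_RATE)  # node C: 1 of 6 targets is a hit

print(f"B: p = {p_b:.5f}")  # about 0.0013
print(f"C: p = {p_c:.5f}")  # about 0.47
```

Under this assumed background rate, B's neighborhood would rarely arise by chance, whereas C's single hit is unremarkable.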

Why is JAK1 significant in this dataset?
[Figure labels: Regulation via JAK1; Feedback loops]
• JAK1 provides an essential network conduit between PLAUR and many differentially expressed targets of STAT1
• Topological significance helps to find important links in pathways that do not come up in HT screens

Node scoring algorithm
• Let K be a set of experimentally derived nodes of interest (e.g. nodes representing differentially expressed genes). K is a subset of the global network of size N.
• Calculate the shortest path network S by building directed paths from each node in K to the other nodes in K, wherever possible. S is a subset of the global network and may contain nodes in addition to those in K. Also some nodes from K may become part of S.
• Consider a node i ∈ S and one of the nodes of the experimental set, j ∈ K.
• Calculate the shortest path networks between j and every other node in the global network (N-1 pairs) and count how many of them contain i. This number is Nij ≤ N-1.
• Calculate the shortest path networks between j and all other nodes in the experimental set and count how many of them contain node i. This number is Kij ≤ K-1 (K here also denotes the size of the experimental set).
• The probability that node i would be present Kij times or more in the shortest path networks of j by chance follows a hypergeometric distribution (population N-1, Nij marked items, K-1 draws).
• Repeat the procedure for all nodes j in the set, calculating K p-values for node i (pij), each showing the relevance of node i to an individual member of the set K. Because we want to identify nodes that are statistically significant to at least one member of the experimental set, we define the p-value associated with node i as the minimum of the pij values.
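The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the GeneGo implementation. It assumes an unweighted directed graph stored as an adjacency dict, treats node i as "contained" in the shortest path network from j to t when d(j,i) + d(i,t) = d(j,t), and skips the case j = i. The toy graph and all function names are hypothetical.

```python
from collections import deque
from math import comb


def bfs_dist(graph, source):
    """Directed shortest-path lengths (in edges) from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def count_through(graph, j, i, targets):
    """Number of targets t (t != j) for which i lies on a shortest j -> t path."""
    from_j, from_i = bfs_dist(graph, j), bfs_dist(graph, i)
    if i not in from_j:
        return 0
    return sum(
        1
        for t in targets
        if t != j and t in from_j and t in from_i
        and from_j[i] + from_i[t] == from_j[t]
    )


def hypergeom_tail(population, marked, draws, observed):
    """P(X >= observed) when `draws` items are taken without replacement
    from `population` items of which `marked` are marked."""
    upper = min(marked, draws)
    hits = sum(
        comb(marked, k) * comb(population - marked, draws - k)
        for k in range(observed, upper + 1)
    )
    return hits / comb(population, draws)


def node_p_value(graph, i, experimental):
    """Minimum over j in K of p_ij, following the slide's definitions."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    n, k = len(nodes), len(experimental)
    p_values = []
    for j in experimental:
        if j == i:  # ASSUMPTION: a node is not scored against itself
            continue
        n_ij = count_through(graph, j, i, nodes)         # Nij
        k_ij = count_through(graph, j, i, experimental)  # Kij
        p_values.append(hypergeom_tail(n - 1, n_ij, k - 1, k_ij))
    return min(p_values, default=1.0)


# Hypothetical toy network: "h" relays "a" to the experimental nodes "b" and "c".
toy = {"a": ["h", "z"], "h": ["b", "c"], "z": ["w"], "b": [], "c": [], "w": []}
K = {"a", "b", "c"}
print(node_p_value(toy, "h", K))  # h connects a to both b and c
print(node_p_value(toy, "z", K))  # z leads to no experimental node
```

On a graph this small the p-values are far from any usual significance threshold; the point is only that "h" scores lower than "z" because its downstream nodes are enriched in members of K.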

Algorithm validation: PSORIASIS
• Psoriasis is recognized as the most common T cell-mediated inflammatory disease in humans.
• Genetic linkage to as many as six distinct disease loci has been established but the molecular etiology and genetics remain unknown.
• To begin to identify psoriasis disease-related genes and construct in vivo pathways of the implicated processes, genome-wide expression screens of psoriasis patients need to be undertaken
• The disease-related gene map may provide new insights into the pathogenesis of psoriasis

Data
• 4 samples from 4 psoriasis patients were taken at 2 different times
  • At the time of the developed psoriatic lesion (P)
  • And at the time of its complete healing (N)
• The samples were taken from exactly the same spot on the same patient, which eliminates a great deal of experimental bias and uncertainty.
• Affymetrix Human U95A microarray technology was then used to evaluate the expression data
• Only the genes differentially expressed between the lesion sample (P) and the normal sample (N) were then used for comprehensive analysis with the new algorithm and in MetaCore 4.0

Algorithm validation
• As the “experimental set” we use 266 differentially expressed genes identified in the paper
• The shortest path network connecting these genes is built using the global network of protein interactions from MetaCore™. Statistical significance of each node in this network is calculated as described above
• To evaluate whether the nodes deemed significant by our method are indeed likely to be disease-related, we perform an automated search of PubMed abstracts for co-occurrence of the corresponding gene name and the word “psoriasis” for every gene in the shortest path network. Different statistical measures are plotted as a function of the node’s p-value
• Functional analysis of high-scoring genes is performed in MetaCore™
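One of the measures described above, the fraction of nodes with a literature hit among nodes below a p-value cutoff, could be computed along the following lines. This is a sketch under stated assumptions: the node list is invented toy data, not results from the study, and the function name and cutoffs are hypothetical.

```python
def hit_fraction_by_cutoff(nodes, cutoffs):
    """For each cutoff, return the fraction of nodes with p <= cutoff
    that have a literature hit. `nodes` is a list of (p_value, has_hit)."""
    fractions = {}
    for cutoff in cutoffs:
        selected = [has_hit for p, has_hit in nodes if p <= cutoff]
        # None marks a cutoff that selects no nodes at all.
        fractions[cutoff] = sum(selected) / len(selected) if selected else None
    return fractions


# TOY DATA (hypothetical): (node p-value, gene co-occurs with "psoriasis")
toy_nodes = [
    (0.001, True),
    (0.004, True),
    (0.03, False),
    (0.04, True),
    (0.2, False),
    (0.6, False),
]

fractions = hit_fraction_by_cutoff(toy_nodes, [0.01, 0.05, 1.0])
print(fractions)
```

In this toy example the hit fraction falls as the cutoff is relaxed, which is the shape of relationship the validation looks for.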

Fraction of genes related to “psoriasis” scales with significance

High-scoring nodes have higher fraction of psoriasis hits

Enrichment with psoriasis hits among differential genes

No correlation with node degree

Functional analysis: GeneGo processes

Functional analysis: IFN-gamma map

VEGF – key pathway identified! Simonetti O, Lucarini G, Goteri G, Zizzi A, Biagini G, Lo Muzio L, Offidani A. VEGF is likely a key factor in the link between inflammation and angiogenesis in psoriasis: results of an immunohistochemical study. Int J Immunopathol Pharmacol. 2006 October-December;19(4):751-760

Glucocorticoid – another key pathway

Conclusions from algorithm validation
• High-scoring nodes are significantly enriched in disease-related genes
• Important disease-related pathways are identified
• Important drug targets are highly scored

Integration of genomic and proteomic sets
• LNCaP prostate cell lines
  • Treated with androgen
  • Untreated (control)
• Data:
  • Proteomic data: ~70 proteins exclusively present in treated cells
  • Gene expression profiling of androgen-treated cells
• Analysis
  • Topological analysis of the androgen-specific protein network
  • Correlation between topologically significant nodes and gene expression
  • Functional analysis in MetaCore™
  • Network analysis in MetaCore™

Revealing regulation of LNCaP cell response to androgen
• Topologically significant nodes reveal regulation
• Gene expression and proteomic data reveal target pathways
[Figure legend: by differentially expressed genes; by androgen-specific proteins; by topologically significant nodes]

Correlation between expression and significance
• Among topologically significant genes the fraction of differentially expressed genes is high
[Figure axes: p-value related to differential expression; p-value related to topological significance]

Androgen receptor signaling
[Figure legend: 1 = differentially expressed gene; 2 = androgen-specific protein; 3 = topologically significant node]

Regulation of lipid metabolism
• Topologically significant nodes revealed by the new algorithm
• Differentially expressed genes identified by microarray and confirmed by proteomic screen

Fatty acid metabolism: target pathway

Role of PBEF

Possible regulation of PBEF by AR
• PBEF occurs in both the expression and proteomic datasets and is possibly activated by the androgen receptor via HIF1 or HNF4

Possible feedback from Insulin and IGF-1R back to AR

Conclusions
• The presented method assigns priority to nodes in biological networks built on condition-specific datasets
• The presented method predominantly selects genes with high relevance to the condition of interest
• The presented method could be used for cross-validation of different data types, identification of novel drug targets and validation of existing targets

Putting it all together: network activity inference
• Identifying causal relations between putative input and output signals
• Tracking effects of molecular perturbation through activation/inhibition cascades
[Figure labels: Predicted input; Scoring intermediary nodes; Experimental data; Experimental data: start cascade; Experimental data: terminate cascade; Predicted target; Inferred activity]
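Tracking a perturbation through an activation/inhibition cascade can be pictured as propagating a sign along directed edges. The sketch below is only an illustration of that idea, not the method behind the slide: the network, node names and function are hypothetical, each edge carries +1 (activation) or -1 (inhibition), and a node keeps the state from the first (shortest) path that reaches it, so conflicting paths are ignored.

```python
from collections import deque


def propagate_activity(edges, start, start_state=1):
    """Breadth-first propagation of an activity sign from `start`.
    `edges` maps a node to a list of (target, sign) pairs, where sign is
    +1 for activation and -1 for inhibition. Returns {node: +1 or -1}."""
    state = {start: start_state}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v, sign in edges.get(u, ()):
            if v not in state:  # first (shortest) path wins; conflicts ignored
                state[v] = state[u] * sign
                queue.append(v)
    return state


# Hypothetical cascade: input activates a kinase, which activates a
# repressor, which inhibits the target.
cascade = {
    "input": [("kinase", +1)],
    "kinase": [("repressor", +1)],
    "repressor": [("target", -1)],
}

inferred = propagate_activity(cascade, "input")
print(inferred)  # the target ends up with state -1 (inhibited)
```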

“Druggable” network modules

Acknowledgements
• GeneGo: Zoltan Dezso, Yuri Nikolsky, Tatiana Nikolskaya
• University of Michigan: Adaikkalam Vellaichamy, Saravana M Dhanasekaran, Arun Sreekumar, Arul Chinnaiyan, Gilbert Omenn
