Network inference from repeated observations of node sets Saket Bhat CS502
Contents • What is Network Inference? • Network Inference using related entity sets. • Inference Approach. • Results on synthetic data. • Analytical approximation. • Accuracy of Inference. • Applications to systems biology & systems pharmacology.
Overview • Network inference - The deduction of an underlying network of interactions from indirect data. • A general class of network inference problem. • Network inference approach. • Application: • Inference of physical interactions: PPI • Mount Sinai collaboration network.
Examples of network inference • Protein-protein interaction • Cell signaling
Inference using repeated sets • Vertices are entities and edges are relations. • Methods that infer networks from quantitative biological data are widely used, whereas co-occurrence networks built from related sets are used much less. • The popular gene set enrichment analysis (GSEA) method uses libraries of related gene sets stored in Gene Matrix Transposed (GMT) files. • Each subset of related genes provides coarse information about the underlying network structure.
The inference problem • Input: sets of entities (genes, proteins, ...) in the form of a GMT file - the results of experiments, or of sampling more generally. • Assumptions: • 1. An underlying network exists that describes the interactions between the entities in the GMT file. • 2. Each line of the GMT file contains information on the connectivity of the underlying network. • The problem: given a GMT file, can we extract enough information to resolve the underlying network?
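A minimal sketch of reading the input, assuming the usual tab-separated GMT layout (set label, description, then one member per column). The function name `parse_gmt` and the example labels are illustrative and do not come from the slides.

```python
from typing import Dict, Set


def parse_gmt(text: str) -> Dict[str, Set[str]]:
    """Parse GMT-formatted text into {set_label: set_of_members}."""
    sets: Dict[str, Set[str]] = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        # A usable line needs a label, a description and at least one member.
        if len(fields) < 3 or not fields[0]:
            continue
        label, _description, *members = fields
        sets[label] = {m for m in members if m}
    return sets


# Made-up example: two sets sharing the entity "B".
example = "complex_1\tna\tA\tB\tC\ncomplex_2\tna\tB\tD\n"
gene_sets = parse_gmt(example)
```

Each parsed set corresponds to one line of the GMT file, that is, one repeated observation of a node set.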
Exponential random graph model (ERGM) • Used to generate an ensemble of networks with specified statistical properties. • The set of graphs G is the sample space of the model, and $\langle x_i \rangle$, $i = 1, 2, \ldots, r$, are the empirical values observed. • A probability distribution P(g) is defined over the elements g of G such that the empirical constraints are satisfied: $\sum_{g \in G} P(g)\, x_i(g) = \langle x_i \rangle$, with normalization $\sum_{g \in G} P(g) = 1$.
Dependence graphs • An ERGM can be formulated in terms of dependence graphs. • Dependence graphs can be represented by an adjacency matrix A, which is symmetric; they represent undirected, simply connected underlying networks. • $A_{ij} = 1$ if $v_i$ and $v_j$ are connected by an edge in G, and $A_{ij} = 0$ otherwise. • The model probability distribution P(G) depends only on the complete subgraphs (cliques) of the dependence graph.
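As a small illustration of the adjacency-matrix definition, the sketch below builds a symmetric 0/1 matrix from an undirected edge list. The helper name `adjacency_matrix` and the toy nodes are assumptions made for the example only.

```python
def adjacency_matrix(nodes, edges):
    """Return a symmetric 0/1 matrix (list of lists) for an undirected simple graph."""
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        if i != j:  # ignore self-loops: the graph is simple
            A[i][j] = 1
            A[j][i] = 1  # undirected, so the matrix is symmetric
    return A


nodes = ["v1", "v2", "v3"]
A = adjacency_matrix(nodes, [("v1", "v2"), ("v2", "v3")])
```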
Approach • Forget for the moment that we know the underlying network and pretend we only have the GMT file. • Attempt to use the accumulation of our coarse data to infer the fine details of the underlying network. • Consider the set of all networks that are consistent with our data - there are likely to be many. • Use an algorithm to sample this ensemble of networks randomly. • The mean adjacency matrix gives the probability of each link being present within the ensemble.
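The steps above can be sketched as follows. This is not the authors' algorithm: it assumes, for illustration, that a network is "consistent" with the data when every set in the GMT file induces a connected subgraph, and it samples such networks by drawing a random spanning tree for each set and taking the union of the trees. The sampling is not uniform over the ensemble, and the names `sample_network` and `mean_adjacency` are hypothetical.

```python
import random
from collections import Counter


def sample_network(gene_sets, rng):
    """Draw one network in which every observed set is connected (assumption)."""
    edges = set()
    for members in gene_sets:
        order = sorted(members)  # sort first so results depend only on the seed
        rng.shuffle(order)
        # Random attachment: each new member links to a random earlier member,
        # which yields a random spanning tree over the set.
        for k in range(1, len(order)):
            partner = order[rng.randrange(k)]
            edges.add(frozenset((order[k], partner)))
    return edges


def mean_adjacency(gene_sets, n_samples=1000, seed=0):
    """Fraction of sampled networks containing each edge (mean adjacency entry)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        counts.update(sample_network(gene_sets, rng))
    return {tuple(sorted(e)): c / n_samples for e, c in counts.items()}


observed_sets = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}]
weights = mean_adjacency(observed_sets)
```

In this toy case the edges A-B and C-D appear in every sample because a two-member set leaves no alternative, while A-C and B-C receive intermediate weights.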
Analytic Approximation • When applying this approach to real data typically there are large numbers of nodes • Sample space of networks can be very large -> computationally demanding • Write a simple analytical approximation which mimics the action of the algorithm.
Correction for sampling bias • Destroy any information by a random permutation of the GMT file and compare the actual edge weight to the distribution of edge weights from the randomly permuted GMT files.
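A hedged sketch of the permutation comparison. It assumes a simple co-occurrence count as a stand-in for the edge weight (the slides use the inferred link weight), and it permutes the GMT file by pooling all entries, shuffling them, and dealing them back into lines of the original lengths. A shuffled line may then contain the same entity twice, which the counting step collapses. All function names are illustrative.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def cooccurrence(lines):
    """Count how many lines contain each unordered pair of entities."""
    counts = Counter()
    for members in lines:
        for pair in combinations(sorted(set(members)), 2):
            counts[pair] += 1
    return counts


def permute_gmt(lines, rng):
    """Shuffle all entries across lines, keeping every line length fixed."""
    pool = [m for members in lines for m in members]
    rng.shuffle(pool)
    out, start = [], 0
    for members in lines:
        out.append(pool[start:start + len(members)])
        start += len(members)
    return out


def edge_z_scores(lines, n_perm=200, seed=0):
    """Z-score of each observed weight against its permutation distribution."""
    rng = random.Random(seed)
    observed = cooccurrence(lines)
    null = {pair: [] for pair in observed}
    for _ in range(n_perm):
        shuffled = cooccurrence(permute_gmt(lines, rng))
        for pair in observed:
            null[pair].append(shuffled.get(pair, 0))
    scores = {}
    for pair, w in observed.items():
        mu, sigma = mean(null[pair]), pstdev(null[pair])
        scores[pair] = (w - mu) / sigma if sigma > 0 else 0.0
    return scores


# Made-up data: A and B always appear together; C-F pair up once each.
lines = [["A", "B"]] * 6 + [["C", "D"], ["D", "E"], ["E", "F"],
                            ["C", "F"], ["C", "E"], ["D", "F"]]
scores = edge_z_scores(lines)
```

A large positive score indicates an edge weight that the frequencies of the two entities alone do not explain.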
Accuracy of inference • How similar is the inferred network to the underlying network? • The quality of the inference depends on three parameters: • The GMT file. • The length of the lines in the GMT file. • The number of nodes in the underlying network.
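The slides do not say here which similarity measure is used. One common choice, assumed for this sketch, is the area under the ROC curve (AUC): the probability that a true edge receives a higher inferred weight than a non-edge. The function name and toy data are hypothetical.

```python
from itertools import combinations


def roc_auc(scores, true_edges, all_pairs):
    """AUC of inferred edge scores against a known underlying network."""
    pos = [scores.get(p, 0.0) for p in all_pairs if p in true_edges]
    neg = [scores.get(p, 0.0) for p in all_pairs if p not in true_edges]
    if not pos or not neg:
        raise ValueError("need at least one true edge and one non-edge")
    # Count score comparisons won by true edges; ties count as half a win.
    wins = sum((s > t) + 0.5 * (s == t) for s in pos for t in neg)
    return wins / (len(pos) * len(neg))


nodes = ["A", "B", "C", "D"]
all_pairs = list(combinations(sorted(nodes), 2))
true_edges = {("A", "B"), ("C", "D")}
inferred = {("A", "B"): 0.9, ("C", "D"): 0.8, ("A", "C"): 0.1}
auc = roc_auc(inferred, true_edges, all_pairs)
```

An AUC of 1 means every true edge outranks every non-edge, and 0.5 corresponds to a random ranking.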
Applications The applications to systems biology of network inference from repeated observations of sets that the authors discuss are: • Inference of physical interactions: PPI • Inference of gene associations: stem cell genes • Inference of statistical interactions: drug/side-effect network • The first of these, protein-protein interactions, is discussed in detail here.
Application to Infer PPIs Malovannaya A et al. Analysis of the human endogenous coregulator complexome. Cell. 2011 May 27;145(5):787-99
Validation • Compare the inferred PPI network to the following databases: • BioCarta • HPRD PPI • InnateDB • IntAct • KEGG • MINT mammalia • MIPS • BioGRID
Combining networks • Each data source gives a different perspective on the associations between the genes. • New insights may be gained by combining the different perspectives; for example, small but consistent associations across perspectives are revealed by the enhanced signal-to-noise ratio.
Mount Sinai collaboration network • This is a broader application of the network inference approach. • The GMT representation lends itself naturally to the inference of co-authorship networks. • The PubMed E-utilities ESearch function was used to search the latest (early May 2012) publications that contain an affiliation equal to the term Mount Sinai School of Medicine. • The author list was extracted using the EFetch function. • The data was formatted into a GMT file with one line per paper, using the PubMed ID as the set label and the authors of that paper as the members of the set.
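The formatting step can be sketched as below, assuming the author lists have already been retrieved and that the GMT description column is filled with a placeholder ("na" is an assumption). The function name `to_gmt`, the PubMed IDs and the author names are made up for illustration.

```python
def to_gmt(papers, description="na"):
    """Format {pubmed_id: [authors]} as GMT text, one paper per line."""
    lines = []
    for pmid, authors in papers.items():
        # Set label = PubMed ID, members = the authors of that paper.
        lines.append("\t".join([str(pmid), description, *authors]))
    return "".join(line + "\n" for line in lines)


# Hypothetical records, not real publications.
papers = {
    "10000001": ["Author A", "Author B", "Author C"],
    "10000002": ["Author B", "Author D"],
}
gmt_text = to_gmt(papers)
```

Authors who share many set labels then play the same role as genes that co-occur in many gene sets.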
Conclusion • In cases where direct determination of the network is difficult or impossible, it is necessary to use indirect evidence, which can be obtained more easily. • The authors show, with statistical results, that the analytical approximation is preferable to the full algorithm when the GMT data set is very large. • The approach is useful for addressing problems in current biology and biomedicine, is of general significance, and can be applied in other fields that study complex systems.
Future work Three research questions for future work on this topic: • The relevance of network inference to data mining techniques and machine learning (how effectively it can be done). • Social network analysis using network inference on related sets (Facebook and Twitter). • Characterizing the computational complexity of network inference.