ResponseNet

ResponseNet EstiYeger-Lotem, Laura Riva, Linhui Julie Su, Aaron D Gitler, Anil G Cashikar, Oliver D King, Pavan K Auluck, Melissa L Geddie, Julie S Valastyan, David R Karger, Susan Lindquist & Ernest Fraenkel Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synucleintoxicity, Nature Genetics 2009 Presented by Shahar Zini June 2013

Introduction • Gene expression profiling alone reveals very little about the inner workings of the cell and the mechanism involved. • In the last semester we’ve seen many efforts to combine data from gene expression and protein-protein/protein-DNA knowledge in order to reveal more about the processes that occur within the cell. • ResponseNet is a method which adds another high-throughput experimental data into the mix: genetic screening.

Genetic screening • A method which involves alteration or deletion of an organism’s gene products and the examination of the phenotype of the resulting organism. • Phenotype: the composite of an organism’s observable characteristics • The process: • Pick-up an healthy organism • Delete/alter/overexpress a gene or a group of genes • Test the resulting organism’s phenotype against a desirable characteristic (called “scoring”) • For example: delete the MEP1 gene in a yeast cell and check for impaired growth.

Genetic screening • This process can be used as a high-throughput procedure, meaning one could test alterations of 1000x of genes for numerous desired characteristics. • The set of genes found to affect a process is called “genetic hit”. • Can reveal an answer to some questions like: “What genes, if removed, will cause a yeast cell to die?” • Just like gene expression: lots of data, yet, does not reveal a lot about the procedures the took place inside the cell to cause the observed phenotype.

Motivation • Cells respond to stimuli by changes in various processes. • Efforts to identify components of these responses using both methods (gene expression and genetic screening) yields different sets of genes. • Genetic screens tend to identify response regulators. • Gene expression tend to identify metabolic responses. • Combining the data together may reveal something deeper about the processes of the cell. • It is biologically plausible that some of the reported response regulators will be connected through regulatory pathways to differently expressed genes.

ResponseNet • Identify molecular interaction paths connecting genetic hits and differentially expressed genes. • Create a graph containing 2 types of nodes: genes and proteins. • Genetic hits represented as their corresponding protein • Fill the graph with (mostly bi-directional) edges connecting protein to another protein if they have a known interaction • Fill the graph with directional edges connecting a protein node to a gene node if the protein is probable transcriptional regulator of the gene.

Genetic hit Differentially expressed gene Protein

Stimulus-related proteins intermediary proteins stimulus-related genes TFs The signaling and regulatory sub-network, by which stimulus-related proteins detected by genetic screens may lead to the measured transcriptomic response.

Min-cost flow • Represent the problem as a flow problem. • Each edge’s weight set to the negative log of it’s probability • Try to pass flow from the genetic hits to the differently expressed genes using regulatory proteins. • Strong flow => Highly regulated path => Important component in the process • Rank nodes based on amount of flow to distinguish core component of the cellular response and peripherals.

Source Genetic hit Differentially expressed gene Protein Sink

Weighting scheme – protein-protein • There are K types of evidence for a protein-protein interaction • For example, “Two-hybrid” connection can be tested using direct experiment and using high-throughput experiment • We trust the direct experiment more • We are interested in finding regulatory pathways, and thus we would like to emphasis interactions between proteins that function together in a common response pathway.

Weighting scheme – protein-protein • For each pair of proteins A and B we construct a vector X = (x1,x2…,xk) • Xi is 1  There is evidence of type i that A and B are connected • wab = P(a and b are connected | X) • By bayes’ rule: • = P(X | a and b are connected) * P(a and b are connected) / P(X) • P(X) = P(X | a and b are connected) * P(a and b are connected) + P(X | a and b are not connected) * P(a and b are not connected) • Assume independence between types of evidence: • P(X | a and b are [not] connected) = P(Xi| a and b are [not] connected)

Weighting scheme – protein-protein • In order to calculate values using the equations derived we need to calculate the prior distributions P(Xk | a and b are [not] connected) • They used 3 sets of data: • A set of response pathways • A positive set containing all interacting protein pairs in a common response pathway based on reliable GO annotation (direct assays or expert knowledge) • A negative set of interacting protein pairs known not to be in a common response pathway • They capped the weight to 0.7 because some interactions received a score close to 1 (probably well studied proteins)

Weighting scheme – protein-gene • They used data obtained using “Chip-chip” and “Chip-chip motif” methods. • Those experiments are used to detect transcription factors for genes. • The weighting scheme is irrelevant, but it was set so that the weight of each edge would reflect the interaction’s reliability.

Source Sink Flow problem details • For each genetic hit, the weight and capacity of the edge from the source to the node is set to: • Strength is determined by results of the experiment • For each edge connecting a differentially expressed gene to the sink, set the weight and capacity to: • Strength is determined by fold-change or p-value of the experiment’s results • All other capacities are set to 1 (and weight using the weighting scheme)

Linear programming • Min-cost flow optimization can be solved efficiently using linear programming • The objective function used included a nice twist, so we would investigate it further: • The first term is what you would expect from min-cost-flow problem • Reduce the sum of cost*flow in all edges • The second term is used to control the size of the input connected by flow

Smaller values will result in an empty flow (any flow will increase the cost) • A value of 0 means you will only allow for perfect edges (weight=1) • As the value increases, you give room to more and more imperfection, and allow for lower-probabilities edges to be added • Adding more edges => better coverage • Meaning, we can adjust the coverage of the solution and so the conflict between coverage and accuracy continues.

Results • Validation of algorithm results on well-studied response pathways • STE5 deletion • DNA Damage • Broader validation on less-studied response pathways • Alpha-synuclein toxicity

STE5 deletion • A protein that coordinates a MAPK cascade activated by pheromone • Multi-tier pathway activated by the presence of pheromone • Simple and well-studied response pathway • Used a set of genetic hits associated with STE5 and differentially expressed genes from a yeast strain lacking STE5

DNA Damage • There are many mechanisms in the cell which identifies and fixes damage created to the DNA. • A much more complex pathway, but still highly-studied • The input is very large: • 91 genetic hits • 9 differentially expressed genes (exposed to mutagen) • Large output

Results – DNA Damage • Contains 361 edges between 166 intermediate proteins • Highly enriched for DNA damage stimulus (p<10-14) and DNA repair (p<10-14)

Some of highly ranked parts of the network are proteins that Have been uncovered by years of intense research. (MEC1, RAD53, RFC2,RFC3,RFC4,RFC5, and RFX1) MEC1, RAD53,RFC2,RFC3,RFC4 and RFC5 are essential genes And thus could not have been detected via genetic screening

Broader experiments • In order to evaluate the algorithm’s results better a broader evaluation took place • The writers used data from 101 different perturbation, each involved the inactivation of a single gene • In most cases, the response pathways are poorly studied • The genetic hits contained the genetic interactors of the inactive gene • The differentially expressed genes are based on microchip profiling of the inactivated strain

Broader experiments • In order to validate the results, the identity of the inactivated gene was hidden from the algorithm • The result of each test were considered successful if: • They included the inactivated gene of the specific test and it was ranked significantly well • They were significantly enriched for a specific biological process attributed to the specific gene • Significance was tested by creating a randomized inputs for each test: • 50 times the genetic hits inputs were randomized • 50 times the differentially expressed genes were randomized • A significant solution was one which was better (in ranking / enrichment level) than 95% of the randomized solutions

Broader experiments – results • Finding the inactivated gene was successful in 25 of the tests • In 8 of the tests the results contained the inactivated gene but it was not scored high enough • Finding enriched results where successful in 28 of the tests • The success rate for cases based on incomplete genetic hits was 40% and under full genetic screens it was 47% => An indication that the algorithm works well even under very limited data sets • A typical result covered only 1% of the yeast genes • We can infer that both criteria were important • Overall: 41% success (considered high because of the low quality data)

Alpha-synuclein toxicity • A protein known to be a key component to Parkinson disease. • Despite intense research, the cellular pathways leading to cell death are just beginning to get uncovered. • Genetic screen experiment detected 77 proteins • No significant overlap between genetic hits and transcriptional profiling results were found, meaning, we have lots of data, but no of to explain it • Applying ResponseNet revealed a sub-network connecting 34 genetic hits and 166 differentially expressed genes shading light on the inner processes effecting alpha-synyclein. • Some of the conclusions were tested in the lab, and confirmed. • Real contribution to the understanding of the Parkinson disease!

Freedom of data • The authors complained that many experiments are taking place world-wide but only a few release source codes and complete data sets on the internet • So, they decided to open their algorithm and method to the world by creating a web page that anyone can upload data into and receive results in minutes • One can use yeast data (as used in the article) or can upload it’s own data representing the protein-protein and protein-gene relations in a different organism

Any Questions?

Questions • What is the motivation to connect genetic hits and differentially expressed genes? • Why do the authors capped the weight of each edge? What would happen they would have skipped this stage? • What would happen if we set the gamma parameter to value which is too high? Please send answers to my mail (shshzi at gmail)

ResponseNet

ResponseNet

Presentation Transcript

ResponseNet