Prediction of Proteins That Participate in the Learning Process by Machine Learning. Dan Evron, Miri Michaeli. Project Advisors: Dr. Gal Chechik, Ossnat Bar Shira
Biological Background A synapse is a junction between 2 neurons. How does synaptic transmission work?
Hebbian theory Donald Hebb: • "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
Synaptic Plasticity • Synaptic plasticity is the ability of the synapse to change in strength through molecular alteration. • What kinds of alterations happen during synaptic plasticity?
Synaptic Plasticity: examples of changes • Presynaptic release probability: change in the probability of glutamate release. • The number of postsynaptic receptors: insertion or removal of postsynaptic AMPA receptors. • Properties of postsynaptic receptors: phosphorylation and dephosphorylation inducing a change in AMPA receptor conductance.
What is the connection to learning and memory? Synaptic plasticity is one of the important neurochemical foundations of learning and memory.
Learning in Aplysia • Habituation • Sensitization • Classical conditioning • All found in the gill-withdrawal reflex! • Kandel's work connects organism-level learning to cellular-level learning!
And what about us? • In mammals: • Many of the pathways are far from understood. • The nervous system is much bigger and more complex. • Research shows that many principles are the same (LTP/LTD in the hippocampus).
Project Idea & Goal Biological research has found many proteins which are connected to biological pathways involved in learning in the neuron and synapse. Yet, pathways are far from understood and many components are missing. Our goal is to find candidate proteins that may take part in these pathways and have not been discovered yet.
How will we do that? • Collect numerical data on the organism's proteins. • Collect ontologies about synaptic plasticity. • Label each gene as related / not related to the synaptic ontologies (according to the data). • Use an SVM as the classifier. • Search the results for false positive genes: genes predicted as related that are not yet annotated as such. • Publish a great article and win a Nobel prize! (or just dream about it…)
Our research organism is… Mus musculus AKA... The house mouse!
Tools & Databases • GEO (Gene Expression Omnibus) • MGI (Mouse Genome Informatics) • GO (Gene Ontology) • MPPDB (Mouse Protein-Protein Interaction Database) • SynDB (Synapse Database)
Tools & Databases • Classifier: SVM (Support Vector Machine)
The project had 2 main phases: • Phase 1: • Work only on PPI data • Create a baseline for further work • Phase 2: • Increase our PPI data • Add another data type: gene expression (GE) • Combine the PPI and GE data • Try to improve prediction!
Phase 1: • Extract PPI data from BioGRID • Label the matrix for each ontology • Run the SVM on each labeled set • Calculate the baseline
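The Phase 1 steps can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual code: the edge list, gene names, and labels are made up, each gene is represented by its row in a binary interaction matrix, and the classifier is a plain linear soft-margin SVM trained by subgradient descent on the hinge loss (the slides do not say which SVM package or kernel was used). The helper names `ppi_matrix`, `train_linear_svm`, and `decision_value` are hypothetical.

```python
# Sketch: PPI adjacency features + a linear SVM trained by full-batch
# subgradient descent. Genes, edges and labels are invented for illustration.

def ppi_matrix(edges):
    """Return (genes, rows) where rows[i][j] is 1.0 if gene i interacts with gene j."""
    genes = sorted({g for edge in edges for g in edge})
    index = {g: i for i, g in enumerate(genes)}
    rows = [[0.0] * len(genes) for _ in genes]
    for a, b in edges:
        rows[index[a]][index[b]] = 1.0
        rows[index[b]][index[a]] = 1.0  # interactions are symmetric
    return genes, rows

def decision_value(w, b, x):
    """Signed score of the linear classifier; > 0 means 'related'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean hinge loss; labels y must be -1 or +1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1:  # margin violated: hinge is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy network: two hubs, each with three partners.
edges = [("A", "H1"), ("B", "H1"), ("C", "H1"),
         ("D", "H2"), ("E", "H2"), ("F", "H2")]
genes, rows = ppi_matrix(edges)
related = {"A", "B", "C", "H1"}  # hypothetical "related to the ontology" set
y = [1 if g in related else -1 for g in genes]
w, b = train_linear_svm(rows, y)
predictions = {g: decision_value(w, b, x) > 0 for g, x in zip(genes, rows)}
```

In practice a library SVM would replace `train_linear_svm`; the point here is only the shape of the data: one row per gene, one label per gene and ontology.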
Phase 1 - results • Most ontologies had only a few related genes, which is problematic for training. • Baseline:
Phase 2: Will another type of data improve the results? Gene expression.
Step 1 - extracting data • Representative set of mouse proteins from MGI. • Gene expression data from experiments related to synaptic and neuronal learning. • Mouse protein-protein interaction (PPI) data from several databases. • Gene ontologies from GO. • Synaptic ontologies from SynDB.
Step 2 – processing data • Each gene expression dataset comes in a separate file; the files need to be combined. • Normalize the gene expression data. • Create the PPI matrix. • Map the PPI proteins to genes.
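A minimal sketch of the combine-and-normalize step follows. The slides do not state which normalization was used, so per-gene z-scoring is an assumption here, as is the choice to keep only genes measured in every experiment. The gene names and values are invented, and `merge_experiments` and `zscore_rows` are hypothetical helper names.

```python
from statistics import mean, pstdev

def merge_experiments(experiments):
    """Combine per-experiment {gene: [values]} tables into one table.

    Only genes present in every experiment are kept, so all rows end up
    with the same number of columns."""
    shared = set(experiments[0])
    for exp in experiments[1:]:
        shared &= set(exp)
    return {g: [v for exp in experiments for v in exp[g]] for g in sorted(shared)}

def zscore_rows(table):
    """Scale each gene's profile to zero mean and unit (population) variance.

    A gene with a constant profile is mapped to all zeros."""
    out = {}
    for gene, values in table.items():
        mu, sigma = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return out

# Two made-up experiment files, already parsed into dictionaries.
exp_a = {"geneA": [2.0, 4.0], "geneB": [1.0, 1.0], "geneC": [9.0, 9.5]}
exp_b = {"geneA": [6.0], "geneB": [1.0]}

merged = merge_experiments([exp_a, exp_b])  # geneC is dropped: missing in exp_b
normalized = zscore_rows(merged)
```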
Step 3 - combine the data According to the list of genes: • A matrix that combines PPI and GE, where each gene has at least one data type ("union"). • A matrix that combines PPI and GE, where each gene has both data types ("intersect"). • PPI-only matrices restricted to the genes of each of the two matrices above. • GE-only matrices restricted to the genes of each of the two matrices above.
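The difference between the "union" and "intersect" matrices can be sketched as set operations on the gene lists. The sketch assumes each data type is a `{gene: feature list}` table and that a missing data type is filled with zeros in the union; the slides do not say how missing values were handled. `combine` and `restrict` are hypothetical names, and the data is invented.

```python
def combine(ppi, ge, mode):
    """Concatenate PPI and GE features per gene.

    mode='union': genes with at least one data type (missing part zero-filled).
    mode='intersect': only genes that have both data types."""
    n_ppi = len(next(iter(ppi.values())))
    n_ge = len(next(iter(ge.values())))
    if mode == "union":
        genes = sorted(set(ppi) | set(ge))
    elif mode == "intersect":
        genes = sorted(set(ppi) & set(ge))
    else:
        raise ValueError("mode must be 'union' or 'intersect'")
    return {g: ppi.get(g, [0.0] * n_ppi) + ge.get(g, [0.0] * n_ge) for g in genes}

def restrict(table, genes, width):
    """Single-data-type matrix over a fixed gene list (zero rows where missing)."""
    return {g: table.get(g, [0.0] * width) for g in genes}

ppi = {"g1": [1.0, 0.0], "g2": [0.0, 1.0]}
ge = {"g2": [0.5], "g3": [-0.5]}

union = combine(ppi, ge, "union")
intersect = combine(ppi, ge, "intersect")
ppi_on_union = restrict(ppi, sorted(union), width=2)
```

Because all three intersect matrices share one gene list, they can be compared directly, which is the property the results rely on.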
Step 4 - labeling the data • For each set and each ontology, we labeled the genes (related / not related).
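Labeling can be sketched as a lookup of each gene in an annotation table, once per ontology term. The term names, gene names, and the `min_positives` cutoff below are placeholders; the cutoff is only an assumption about how terms with too few related genes might be skipped.

```python
def label_genes(genes, annotations, term):
    """Return {gene: +1 or -1}: +1 if the gene is annotated with the term."""
    return {g: (1 if term in annotations.get(g, ()) else -1) for g in genes}

def usable_terms(genes, annotations, terms, min_positives=2):
    """Keep only terms with enough related genes to train a classifier on."""
    kept = []
    for term in terms:
        labels = label_genes(genes, annotations, term)
        if sum(1 for v in labels.values() if v == 1) >= min_positives:
            kept.append(term)
    return kept

genes = ["g1", "g2", "g3", "g4"]
annotations = {               # hypothetical gene -> ontology terms table
    "g1": {"TERM_A", "TERM_B"},
    "g2": {"TERM_A"},
    "g3": set(),
}                             # g4 has no annotation entry at all

labels_a = label_genes(genes, annotations, "TERM_A")
terms = usable_terms(genes, annotations, ["TERM_A", "TERM_B"])
```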
Step 6 - process the results • Evaluate prediction success (AUC). • Find potential false positive candidates. So how did we do? First we have to build a ROC curve.
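Finding false positive candidates amounts to listing the genes that are labeled "not related" but that the classifier scores as related, strongest first. The sketch below assumes the classifier returns a signed score per gene with 0 as the decision threshold; the scores, labels, and the function name are invented for illustration.

```python
def false_positive_candidates(scores, labels, threshold=0.0, top=10):
    """Genes labeled -1 whose score exceeds the threshold, highest score first."""
    hits = [(g, s) for g, s in scores.items() if labels[g] == -1 and s > threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:top]

scores = {"g1": 1.3, "g2": 0.4, "g3": -0.2, "g4": 0.9}  # made-up SVM scores
labels = {"g1": 1, "g2": -1, "g3": -1, "g4": -1}        # made-up annotations

candidates = false_positive_candidates(scores, labels)
```

Here `g1` is excluded because it is already annotated as related, and `g3` because the classifier scores it as not related.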
What is ROC? ROC = Receiver Operating Characteristic. • Our SVM builds a ROC curve: a plot of sensitivity (true positive rate) against 1 − specificity (false positive rate) as the decision threshold varies. • After classification, the SVM run calculates the AUC of the resulting ROC curve.
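As an illustration of how such a curve is built, the sketch below sweeps the decision threshold from the highest score to the lowest and records one (false positive rate, true positive rate) point per distinct score. The scores and labels are invented, and `roc_points` is a hypothetical helper, not part of any SVM package.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (fpr, tpr) points.

    labels must be +1 (related) or -1 (not related); tied scores are
    processed together so they produce a single point."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

points = roc_points([0.9, 0.8, 0.7, 0.6], [1, -1, 1, -1])
```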
What is AUC? • AUC = Area Under the (ROC) Curve. • The AUC summarizes the quality of the learned model in a single number: the probability that a randomly chosen related gene is scored higher than a randomly chosen unrelated gene. • The AUC ranges from 0 to 1: 0.5 means the classifier does no better than tossing a coin, and 1 indicates perfect ranking, so a useful classifier falls between 0.5 and 1. • The AUC enables us to examine and compare SVM results.
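The AUC can be computed directly as a pairwise ranking statistic, which gives the same value as the area under the ROC curve. The sketch below counts, over all (related, unrelated) pairs, how often the related gene gets the higher score, with ties counted as half. The data is invented and the function name `auc` is just illustrative.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, -1, 1, -1]
example = auc([0.9, 0.8, 0.7, 0.6, 0.5], labels)   # one related gene is ranked too low
perfect = auc([0.9, 0.8, 0.1, 0.7, 0.2], labels)   # every related gene outranks every unrelated one
inverted = auc([0.1, 0.2, 0.9, 0.3, 0.8], labels)  # ranking exactly reversed
```

The quadratic loop is fine for a sketch; for thousands of genes a sort-based rank computation would be the usual choice.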
Results • Intersect of the data: • The 3 matrices are similar in size, which enables comparison. • Average AUC: • GE alone: 75% • PPI alone: 63% • GE + PPI: 75%
Results • Union of the data: • Close to reality in number of genes (14K in the matrices, 15K in the representative list). • Average AUC: GE alone = GE + PPI = 74%. • The matrix size issue: PPI alone corresponded to different GO categories, so it cannot be compared.
Conclusions • We can compare different types of data only with the "intersect" matrices. • In the intersect, the PPI data sets the size, so all matrices share the same GO categories. • In the union, the GE data dominates the size, which is why the GO categories differ (the GO categories in both PPI matrices are the same). • PPI did not contribute to prediction! (bad news…)
The good news… • Still, an average AUC of 75% is a nice result! • We found several false positive genes that may be related to synaptic plasticity and have not yet been identified as such. Examples: • Neurogranin (NRGN) • CADPS
Neurogranin (NRGN) Acts as a "third messenger" substrate of protein kinase C-mediated molecular cascades during synaptic development and remodeling. Binds to calmodulin in the absence of calcium.
Ca++-dependent secretion activator (CADPS) Calcium-binding protein involved in exocytosis of vesicles filled with neurotransmitters and neuropeptides. Probably acts upstream of fusion in the biogenesis or maintenance of mature secretory vesicles.
Next steps • Computationally: • Improve the classification by adding new types of data and/or by representing the data differently. • Biologically: • Test the proteins we have found (the FP list) in biological experiments.