Guan N. Lin (Nick) Bioinformatics Intern

Bioinformatics Prediction of Plant Protein-Protein Interaction Using Sequence Only Guan N. Lin (Nick) Bioinformatics Intern

Outline • Project Background • Goals and Obstacles • Tool Development & Method Design • Results Analysis • PPI prediction for leading genes • Acknowledgements

Protein-protein interaction (PPI) • Each living cell is packed with proteins that continuously interact with each other to control the cell's growth, function and eventual fate. • These interactions can alter protein kinetic properties, substrate binding, catalysis, etc. • Researchers have developed a variety of chemical and biochemical techniques to understand the who, what, where, when and why of those interactions.

Systems biology: From cell to network

PPI (protein-protein interaction) prediction • A field combining bioinformatics and structural biology to identify and catalog interactions between pairs or groups of proteins. • Determination by experiment: yeast two-hybrid, affinity purification, co-immunoprecipitation, etc. • Prediction by computation: model building through pattern discovery using sequences, protein structural information, evolutionary information, etc. • PPI network construction provides important insight into intracellular signaling pathways.


Project goals and obstacles • Goals • Use parts of free tools and open-source code to build a PPI prediction pipeline based on protein sequence information only. • Use cross-species PPI data (Human, Drosophila, Yeast and C. elegans) for genome-scale plant PPI prediction. • Obstacles • Open-source code lacks the organization and documentation needed for system integration. • Computational complexity limits how much analysis can be completed in the time available. • It is difficult to generalize a consistent pattern from cross-species data.

Outline • Project Background • Goals and Obstacles • Tool Development & Method Design • Basic scheme • Tool development • Model design and tuning • Results Analysis • PPI prediction for leading genes • Acknowledgements

Basic scheme (how do we do it?) • Rationale: PPIs are basic structural elements of the molecular circuitry in biological systems and will provide valuable insights for optimization/MOA. • Workflow: training data (sequences of interacting proteins) -> sequence patterns -> SVM kernel classifier -> predict new interactions from sequences -> validation. • Training set for the SVM kernel classifier = positive training set (experimental interactions, some for training, some for validation) + negative training set (mostly randomly generated pairs).

Using conjoint triads for sequence pattern construction • Reduced-alphabet sequence pattern training: • Classify the 20 amino acid (AA) types into 7 classes based on their properties (hydrogen bonding, hydrophobicity, side-chain volume, etc.). • Build AA triplets from the 7 classes, called “conjoint triads” (7 × 7 × 7 = 343 unique types), and save them in a vector V. • Calculate the frequency of each triad for each protein sequence. (Shen, PNAS 2007)
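
The triad counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the seven-class grouping is the one recalled from Shen et al. (PNAS 2007) and should be checked against the paper, and the function names `triad_frequencies` and `pair_features` are hypothetical. Concatenating the two proteins' vectors to describe a pair is also an assumption.

```python
from itertools import product

# Seven amino-acid classes as recalled from Shen et al. (PNAS 2007);
# this grouping is an assumption and should be verified against the paper.
AA_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(AA_CLASSES) for aa in group}

# All 7 * 7 * 7 = 343 possible triads, in a fixed order.
TRIADS = list(product(range(7), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def triad_frequencies(sequence):
    """Return a 343-element list of conjoint-triad counts for one sequence."""
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIADS)
    for i in range(len(classes) - 2):
        window = tuple(classes[i:i + 3])
        if None in window:  # skip windows that contain a non-standard residue
            continue
        counts[TRIAD_INDEX[window]] += 1
    return counts


def pair_features(seq_a, seq_b):
    """Concatenate both proteins' vectors into one 686-element pair vector
    (an assumed way to represent a protein pair)."""
    return triad_frequencies(seq_a) + triad_frequencies(seq_b)
```

Raw counts depend on sequence length, so in practice they would probably be normalized before training; the slides do not say which normalization was used.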

System/tool design flowchart • Implementation: Java code and C/C++ code. • SVM training: raw PPI file (evidence) -> generate negative PPI pairs -> prepare training input -> build sequence pattern (conjoint triads, triad frequency) -> optimize parameters (C, γ) -> build SVM training model. • SVM prediction: input test sequences -> build sequence pattern -> SVM test input + SVM training model -> prediction. • Negative PPI pairs are generated from the proteins in the positive PPI pairs: if AB and IJ are positive PPIs, then AI, AJ, BI and BJ can be considered negative pairs. • Number of negative pairs = number of positive pairs.
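
The negative-pair rule (recombine proteins from different positive pairs and keep as many negatives as positives) could look roughly like the sketch below. The function name `generate_negative_pairs` and the fixed random seed are illustrative assumptions, not part of the original tool.

```python
import random


def generate_negative_pairs(positive_pairs, seed=0):
    """Sample as many presumed non-interacting pairs as there are positives."""
    # Store pairs as frozensets so that (A, B) and (B, A) count as one pair.
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({protein for pair in positive_pairs for protein in pair})

    # Every pair of distinct proteins that is not a known positive is a candidate.
    candidates = [
        (a, b)
        for i, a in enumerate(proteins)
        for b in proteins[i + 1:]
        if frozenset((a, b)) not in positives
    ]
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:len(positives)]
```

With the positives AB and IJ, the candidates are exactly AI, AJ, BI and BJ, and two of them are returned. Enumerating all candidates is quadratic in the number of proteins, so a genome-scale run would more likely sample pairs at random and reject known positives.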

Screenshot of the PPI prediction tool

Publicly available experimental data • Arabidopsis • 4,400 PPI pairs (TAIR, BioGRID, IntAct), 3,000 genes • C. elegans • 5,400 PPI pairs (BioGRID, IntAct) • Human • 23,000 PPI pairs (HPRD, IntAct), 6,900 genes • Drosophila • 24,000 PPI pairs (IntAct), 7,000 genes • Yeast • 48,000 PPI pairs (BioGRID, IntAct), 7,000 genes

SVM for triad pattern model training and tuning • SVM parameter optimization is performed using a grid-search procedure. • Parameters: • C (cost): to minimize training error (value range: 0.125 -> 512) • γ (kernel gamma): to maximize training capability (value range: 0.125 -> 8)
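
A grid search over these two ranges can be sketched as follows. The slides give only the end points of each range, so stepping in powers of two is an assumption, and `evaluate` is a hypothetical callback standing in for "train an SVM with these parameters and return its cross-validation score".

```python
from itertools import product


def power_of_two_grid(low_exp, high_exp):
    """Return [2**low_exp, ..., 2**high_exp], one value per exponent."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


# Assumed step size: powers of two between the stated end points.
C_GRID = power_of_two_grid(-3, 9)      # 0.125 ... 512
GAMMA_GRID = power_of_two_grid(-3, 3)  # 0.125 ... 8


def grid_search(evaluate, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return (best_score, best_C, best_gamma) over the full grid.

    `evaluate(C, gamma)` is a hypothetical callback that trains a model
    with those parameters and returns a score where higher is better.
    """
    best = None
    for c, gamma in product(c_grid, gamma_grid):
        score = evaluate(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best
```

Under this assumption the grid has 13 × 7 = 91 parameter combinations, each requiring a full training run, which is one reason computation time becomes an obstacle on large datasets.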

Outline • Project Background • Goals and Obstacles • Tool Development & Method Design • Results Analysis • Preliminary results and problems • Further method modification • Further results • PPI prediction for leading genes • Acknowledgements

Accuracy measurements • Confusion matrix: real outcome vs. predicted outcome (TP, FP, FN, TN) • Sensitivity = TP/(TP + FN) • Specificity = TN/(FP + TN)
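
As a small worked example of the two formulas, the sketch below counts the four confusion-matrix cells from lists of real and predicted labels (1 = interacting, 0 = not interacting). The function names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for two equal-length lists of 0/1 labels."""
    tp = fp = tn = fn = 0
    for real, pred in zip(actual, predicted):
        if real and pred:
            tp += 1
        elif not real and pred:
            fp += 1
        elif not real and not pred:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of real positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that were predicted negative."""
    return tn / (fp + tn)
```

For real labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 0, 1]`, the counts are TP = 2, FP = 1, TN = 1, FN = 1, giving a sensitivity of 2/3 and a specificity of 1/2.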

Preliminary results and observations • Prediction for Arabidopsis (2,600 positive PPI + 2,600 negative PPI) using different data sets without any filtering or processing. • Observations: 1. Overall accuracy is low. 2. Data from different species produce very different prediction patterns; some, like Human and Yeast, show completely different, extreme patterns. => Conclusion: the predictions are not yet meaningful or useful.

Careful selection of subsets of cross-species data for training is essential to get valid results • Using GO (Gene Ontology) slim categories for data filtering • Red bars: Arabidopsis whole-genome proteins • Blue bars: Arabidopsis PPI proteins • The two distributions show a correlation of 0.92 • Proteins from the PPI data do represent the overall trend of the whole genome • Filter the other species' data by TAIR GO slim.
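
A correlation between two GO slim distributions can be computed as sketched below. The slides do not say which correlation measure produced the 0.92, so Pearson's coefficient is an assumption here, and the function names are illustrative.

```python
import math
from collections import Counter


def slim_fractions(slim_assignments, categories):
    """Fraction of proteins in each slim category, in the given category order."""
    counts = Counter(slim_assignments)
    total = len(slim_assignments)
    return [counts[category] / total for category in categories]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Calling `slim_fractions` once for the whole-genome proteins and once for the PPI proteins, with the same category order, gives the two vectors to pass to `pearson`.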

How to categorize proteins into GO slim terms using GO level indexing • Step 1: make a GO index from the ontology files • Step 2: link the GO index to genes using the gene-to-GO association files • Step 3: get the GO slim term • Example: YBR085W -> GO:0055085 transmembrane transport | 3-10-5-44 • GO slim indices: Developmental process (GO:0007252): 3-9-26; Transport (GO:0006810): 3-10-5; Signal transduction (GO:0007165): 3-7-7-15; … • The index 3-10-5-44 begins with 3-10-5, so YBR085W belongs to the “Transport” slim category
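
The lookup in step 3 amounts to a prefix match on the level index. The sketch below assumes that each index encodes the path from the ontology root, so that a slim term is an ancestor exactly when its index is a leading part of the term's index, and that each term has a single index. The table holds only the three slim entries shown on the slide, and `slim_category` is a hypothetical name.

```python
# Slim indices copied from the slide; the full table is not shown there.
SLIM_INDEX = {
    "Developmental process": "3-9-26",
    "Transport": "3-10-5",
    "Signal transduction": "3-7-7-15",
}


def slim_category(term_index, slim_index=SLIM_INDEX):
    """Return the deepest slim category whose index is a path prefix of term_index."""
    term_path = term_index.split("-")
    best_name, best_depth = None, 0
    for name, index in slim_index.items():
        slim_path = index.split("-")
        # Compare path components, not raw strings, so "3-10-50" does not match "3-10-5".
        if term_path[:len(slim_path)] == slim_path and len(slim_path) > best_depth:
            best_name, best_depth = name, len(slim_path)
    return best_name
```

For the index 3-10-5-44 of GO:0055085 this returns "Transport", matching the slide's example.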

Next step: use the slim category frequency distribution to select subsets of cross-species data • Use the percentages observed in the Arabidopsis data to select similarly distributed subsets for the other species
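
One way to realize this selection is stratified sampling: draw from each slim category in proportion to its share in the Arabidopsis data. The sketch below works at the level of individual proteins, which is an assumption (the slides do not say whether proteins or PPI pairs were sampled), and the function name and parameters are illustrative.

```python
import random
from collections import defaultdict


def select_matching_subset(protein_to_slim, target_fractions, subset_size, seed=0):
    """Sample proteins so that slim-category shares follow target_fractions."""
    by_category = defaultdict(list)
    for protein, category in sorted(protein_to_slim.items()):
        by_category[category].append(protein)

    rng = random.Random(seed)
    selected = []
    for category, fraction in target_fractions.items():
        pool = by_category.get(category, [])
        # Never ask for more proteins than the category actually contains.
        quota = min(round(fraction * subset_size), len(pool))
        selected.extend(rng.sample(pool, quota))
    return selected
```

Here `target_fractions` would hold the Arabidopsis percentages per slim category, and `protein_to_slim` the slim assignment of each protein in the other species. If a category is too small to fill its quota, the subset comes out smaller than `subset_size`.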

Model results comparison with modified datasets • Test data: 1917 positive-evidence Arabidopsis PPI pairs + 1915 negative Arabidopsis PPI pairs. • The probability of predicting a random pair to be a true PPI is 2.6%. • Observation: with the modified datasets, the models reject almost all negative pairs.

ROC curves show that the models predict much better than random prediction. Prepared by Xiao Yang

Model prediction pattern correlations Note: 1. "cdhs" means the C. elegans, Drosophila, Human and Yeast data combined. 2. The Drosophila dataset has the poorest prediction trend correlation with the Arabidopsis dataset. 3. The combined dataset exhibits a stronger correlation than any individual dataset.

Summary • Built an easy-to-use and successful system for PPI prediction based on sequence information only. • Constructed PPI prediction models and proved the concept of using cross-species information for plant PPI prediction when experimental information is lacking. • Apply PPI prediction to the MOA study of leading genes.

Acknowledgements • Zheng Li • J.D. Liu • Everyone in bioinformatics team • Paggy Sullivan and University relations
