Project and Report (30% final score)
Contents • There are two parts: project + report • Project (DNA-binding protein identification) • Report • Review the methods for DNA-binding protein identification. Point out their advantages and disadvantages. • How did you do the experiments? Give information for each step. • What are your results? • What are the advantages, disadvantages, and novelty of your methods?
Problem description DNA-binding proteins are very important components of both eukaryotic and prokaryotic proteomes. At least approximately 2% of prokaryotic and 3% of eukaryotic proteins are able to bind to DNA, and these proteins are important for various cellular processes.
Problem description Therefore, developing an efficient model for distinguishing DNA-binding proteins from non-DNA-binding proteins is an urgent research problem. Although many efforts have been made in this regard, further work is needed to improve prediction power.
Dataset description There are three datasets in this project: the benchmark dataset, independent dataset1, and independent dataset2. All are available on the course website http://bioinformatics.hitsz.edu.cn/course/
Dataset description-benchmark dataset The benchmark dataset has 146 DNA-binding proteins and 250 non-DNA-binding proteins. The DNA-binding proteins are in the file “benchmark(DNA-binding proteins).seq”; the non-DNA-binding proteins are in the file “benchmark(non DNA-binding proteins).seq”
Dataset description-independent dataset1 Independent dataset1 contains 82 DNA-binding proteins in the file “Independent dataset1(DNA-binding proteins).seq” and 100 non-DNA-binding proteins in the file “independent dataset1(non DNA-binding proteins).seq”
Dataset description-independent dataset2 Independent dataset2 contains 770 DNA-binding proteins in the file “Independent dataset2(DNA-binding proteins).seq” and 816 non-DNA-binding proteins in the file “independent dataset2(non DNA-binding proteins).seq”
Task and evaluation Task: Distinguish DNA-binding proteins from non-DNA-binding proteins. Evaluation scheme: 1. Use validation techniques to optimize the parameters of your method (if any), and report the results on the benchmark dataset. 2. Train your classifier on the benchmark dataset, and predict the proteins in the two independent datasets.
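The two-step evaluation scheme can be sketched as follows. This is only a minimal illustration under stated assumptions: the classifier is a toy k-nearest-neighbour vote, the tuned parameter is `k`, and the names `knn_predict`, `cv_accuracy`, and `run_scheme` are hypothetical. Any classifier and any parameter grid could take their place.

```python
import random

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training vectors (labels are 0 or 1)."""
    by_dist = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_x, train_y)
    )
    votes = sum(label for _, label in by_dist[:k])
    return 1 if 2 * votes > k else 0

def cv_accuracy(xs, ys, k, n_folds=5, seed=0):
    """Step 1: estimate accuracy for one parameter value by n-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(n_folds):
        test_idx = idx[f::n_folds]          # every n_folds-th shuffled sample
        held_out = set(test_idx)
        tr_x = [xs[i] for i in idx if i not in held_out]
        tr_y = [ys[i] for i in idx if i not in held_out]
        correct += sum(knn_predict(tr_x, tr_y, xs[i], k) == ys[i] for i in test_idx)
    return correct / len(xs)

def run_scheme(bench_x, bench_y, indep_x, candidates=(1, 3, 5)):
    """Pick k on the benchmark set, then train on all of it and predict the independent set."""
    best_k = max(candidates, key=lambda k: cv_accuracy(bench_x, bench_y, k))
    preds = [knn_predict(bench_x, bench_y, x, best_k) for x in indep_x]  # step 2
    return best_k, preds

# Toy feature vectors standing in for real sequence features.
bench_x = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
           [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]]
bench_y = [0] * 5 + [1] * 5
best_k, preds = run_scheme(bench_x, bench_y, [[0.2, 0.2], [5.8, 5.8]])
```

The independent datasets are never used while choosing `k`, so their results remain an unbiased check of the final model.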
Feature extraction • Extract features from the protein sequences. • Use your imagination to design features that capture the characteristics of the protein sequences.
Classifiers • You are free to choose any classifier, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forest (RF), etc.
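As a starting point, two simple sequence-only features are amino acid composition (20 values) and dipeptide composition (400 values). The sketch below is illustrative rather than prescribed by the project. The function names are hypothetical, and the code assumes sequences use the 20 standard one-letter amino acid codes, ignoring any other character.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 20

def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered pairs of adjacent standard residues."""
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:              # skip pairs containing non-standard letters
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in pairs]

aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
```

Both functions return fixed-length vectors regardless of sequence length, which is what most classifiers require.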
Task and evaluation TP is the number of positive samples that are classified correctly; FP is the number of negative samples that are classified as positive; TN is the number of negative samples that are classified correctly; FN is the number of positive samples that are classified as negative.
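The four counts can be turned into summary measures. The slide text does not preserve which measures were required, so the sketch below assumes four commonly reported ones: sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC). The function names are hypothetical, and labels are assumed to be 1 for DNA-binding and 0 for non-DNA-binding.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and MCC; undefined ratios are reported as 0.0."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

counts = confusion_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
scores = metrics(*counts)
```

Because the datasets are not perfectly balanced, MCC is worth reporting alongside accuracy.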
Scoring function for the project and report • Novelty and completeness: new features, new machine learning models, etc. Write down what makes your method different from others in this field. Does your method work? (30%) • Mid results and source code (10%) • Results (20%) • Report (40%)
Important information • This is individual work, not team work, so do it alone, but you are free to discuss it with others. • Due date: 16th Dec, 2013. All data should be stored in one ZIP or RAR file and sent to the TA via email or USB drive. The title of the email and of your data file: your name + student ID. (If your data is too large, contact the TA directly.)
Data Driven Machine Learning Approaches for Bioinformatics [Figure: protein data is split into training data and test data; training builds a classifier that maps input to output; the test data is used to test it; the classifier then makes predictions on new data.] Input: sequence features. Output: category. Training: build a classifier. Test: test the model. Key idea: learn from known data and generalize to unseen data.
Prediction • DNA: • 1. Gene identification • 1) Transcription start site (TSS) prediction; • 2) Translation initiation site (TIS) prediction; • 3) Coding sequence prediction; • 4) exon, intron prediction; • 5) exon/intron splice site prediction; • 6) alternative splice site prediction; • 7) first exon, terminal exon prediction; • 8) non-coding gene (RNA) prediction; • 9) operon (Transcription Units) prediction;
Prediction • 2. Motif identification • 1) Transcription factor binding site (TFBS) prediction; • 2) Promoter prediction; • 3) Terminator prediction; • 4) DNase hypersensitive site prediction; • 5) Ribosome binding site (RBS) prediction; • …… • 3. Structure prediction • 1) DNA higher-order structure prediction; • 2) RNA secondary structure; • 3) Superhelix; • 4) UTR structure; • ……
Prediction • 4. Other prediction • 1) Replication origin prediction; • 2) CpG prediction; • 3) Isochore; • 4) Alu sequence; • 5) Gene expression; • 6) k-tuple research; • 7) Methylation prediction; • ……
Prediction • Protein: • 1. Protein structure prediction • 1) Secondary structure prediction; • 2) Tertiary (quaternary) structure prediction; • 3) Supersecondary structure prediction; • 4) Protein structural class prediction; • …… • 2. Protein subcellular localization • 1) Eukaryote; • 2) Prokaryote; • 3) Apoptosis; • 4) Submitochondria; • 5) Subnuclear; • ……
Prediction • 3. Modified site • 1) Glycosylation; 2) Sumoylation; • 3) Palmitoylation; 4) Acetylation; • 5) DNA-binding residues; 6) Phosphorylation; • 7) Disulfide; • …… • 4. Other prediction • 1) Superfamily and family; • 2) G protein-coupled receptors; • 3) Enzyme; • 4) Mesophilic and thermophilic; • 5) Signal peptide; • 6) Histones; • ……
Cross-validation • In the literature, the following validation methods are often used to evaluate the quality of a predictor: • Self-consistency test • Independent test • n-fold cross-validation • Jackknife cross-validation
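The difference between n-fold and jackknife cross-validation lies only in how the samples are partitioned. The jackknife holds out one sample at a time, which makes it n-fold cross-validation with as many folds as samples. A minimal sketch with hypothetical helper names follows; it omits shuffling, which is normally applied before n-fold splitting.

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) once per fold; each sample is tested exactly once."""
    for f in range(n_folds):
        test = list(range(f, n_samples, n_folds))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test

def jackknife_splits(n_samples):
    """Leave-one-out: every sample forms its own test fold."""
    return n_fold_splits(n_samples, n_samples)

two_fold = list(n_fold_splits(5, 2))
loo = list(jackknife_splits(3))
```

For a self-consistency test the same samples are used for both training and testing, so no split is needed. An independent test uses a fixed, separate dataset.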
Performance measure TP is the number of positive samples that are classified correctly; FP is the number of negative samples that are classified as positive; TN is the number of negative samples that are classified correctly; FN is the number of positive samples that are classified as negative.
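The formulas that accompanied these definitions are not preserved in the slide text. The measures most commonly built from the four counts are given here as an assumption about what was intended: sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC).

```latex
\[
\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad
\mathrm{Sp} = \frac{TN}{TN + FP}, \qquad
\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}
\]
\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
```

MCC ranges from -1 to 1 and stays informative when the positive and negative classes differ in size.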
Several important components in this model • Feature extraction. • Given a protein, how can features be extracted based only on the primary sequence? Brainstorm.
Several important components in this model • Predictor • SVM, ANN, HMM, CRF, RF, etc. • Redundancy removal • BLAST • CD-HIT:http://bioinformatics.ljcrf.edu/cd-hi/ • PISCES:http://dunbrack.fccc.edu/PISCES.php
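The idea behind redundancy removal is to keep only sequences that are not too similar to any sequence already kept. The sketch below illustrates this with a greedy filter. It is not the algorithm used by BLAST, CD-HIT, or PISCES: the similarity score is `difflib.SequenceMatcher.ratio()` from the Python standard library, used as a rough stand-in for alignment-based sequence identity, and the function names and threshold are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def remove_redundant(seqs, threshold=0.4):
    """Greedily keep sequences, longest first, whose similarity to every kept one is below threshold."""
    kept = []
    for s in sorted(seqs, key=len, reverse=True):
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

toy = ["MKVLAAGIVG", "MKVLAAGIVA", "WWWWPPPPCC"]
non_redundant = remove_redundant(toy, threshold=0.8)
```

For real datasets, the dedicated tools listed above are the appropriate choice. This pairwise filter is quadratic in the number of sequences and only shows the principle.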
A case study: DNA-binding Protein Identification 1. DNAbinder 2. DNA-prot 3. iDNA-Prot
DNAbinder Classifier: SVM Features: three feature extraction methods were proposed. The first encodes the evolutionary information into a 21-dimensional feature vector called PSSM-21, whose elements are the composition of occurrences of each type of amino acid, calculated by summing over each column (residue position) of the PSSM.
DNAbinder The second encodes a sequence into a 420-dimensional feature vector called PSSM-420, whose elements are the composition of occurrences of each type of amino acid corresponding to each type of amino acid in the protein sequence, meaning that each column has 20 values instead of one. The last, called PSSM-400, is similar to PSSM-420 except that the dummy residue ‘X’ is ignored.
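One reading of the PSSM-400 description can be sketched as follows: for each of the 20 residue types found in the sequence, sum the 20 PSSM scores at the positions where that residue occurs, giving a 20 × 20 table flattened to 400 values. This is an interpretation, not DNAbinder's published code. The name `pssm_400` is hypothetical, the PSSM is assumed to be a list of per-position rows of 20 scores in the column order of `AMINO_ACIDS`, and any score normalization is omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_400(seq, pssm):
    """Sum PSSM rows grouped by the residue type at each position; 'X' and other letters are skipped."""
    index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    table = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(seq.upper(), pssm):
        i = index.get(residue)
        if i is None:                   # dummy residue 'X' is ignored, as in PSSM-400
            continue
        for j, score in enumerate(row):
            table[i][j] += score
    return [value for table_row in table for value in table_row]

# Toy PSSM with three positions; real profiles would come from a tool such as PSI-BLAST.
toy_pssm = [[1] * 20, [5] * 20, [2] * 20]
features = pssm_400("AXA", toy_pssm)
```

Under the same reading, keeping a 21st row for ‘X’ instead of skipping it would give the 21 × 20 = 420 values of PSSM-420.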
DNA-prot Classifier: Random Forest Features: 1. Amino acid composition, grouped amino acid composition, and the composition of hydrophobic, hydrophilic, and neutral amino acids; in addition, twenty-seven tripeptides derived from all possible combinations of the hydrophobic, hydrophilic, and neutral amino acid groups. 2. Information about short peptides (10 residues long, in this case) that are rich in hydrophobic, hydrophilic, or neutral amino acids.
DNA-prot 3. Secondary structure: the overall composition of helix (H), beta sheet (E), and coil (C), and the frequencies of 10 amino acid groups and of hydrophobic, hydrophilic, and neutral amino acids in helix, sheet, and coil regions. 4. Fourteen physico-chemical properties: molecular weight, hydrophobicity, hydrophilicity, refractivity, average accessible surface area, flexibility, melting point, side chain volume, side chain hydrophobicity, normalized frequency of beta-sheet and alpha helix, polarity, heat capacity, and isoelectric point.
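The grouped composition and the 27 group tripeptides from feature 1 can be sketched by first mapping each residue to one of three classes and then counting over the reduced sequence. The assignment of amino acids to the hydrophobic (H), hydrophilic (P), and neutral (N) groups below is an illustrative assumption, not the grouping used by DNA-prot, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical three-class grouping, for illustration only.
GROUPS = {"H": "ACFILMVW", "P": "DEKNQR", "N": "GHPSTY"}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}

def reduce_sequence(seq):
    """Rewrite the sequence in the three-letter group alphabet, dropping unknown residues."""
    return "".join(GROUP_OF[a] for a in seq.upper() if a in GROUP_OF)

def group_composition(seq):
    """Fraction of hydrophobic, hydrophilic, and neutral residues, in that order."""
    reduced = reduce_sequence(seq)
    n = len(reduced)
    return [reduced.count(g) / n if n else 0.0 for g in "HPN"]

def group_tripeptide_composition(seq):
    """Fraction of each of the 3 ** 3 = 27 group tripeptides in the reduced sequence."""
    reduced = reduce_sequence(seq)
    keys = ["".join(t) for t in product("HPN", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(reduced) - 2):
        counts[reduced[i:i + 3]] += 1
    total = max(len(reduced) - 2, 0)
    return [counts[k] / total if total else 0.0 for k in keys]

comp = group_composition("AKGA")                # reduced sequence is "HPNH"
tri = group_tripeptide_composition("AKGA")
```

Reducing the alphabet from 20 letters to 3 keeps the tripeptide vector at 27 dimensions instead of 8000.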
iDNA-Prot Classifier: Random Forest Feature: Pseudo Amino Acid Composition 1. Amino acid composition of the sequence. 2. Grey model parameters: first, represent the sequence as a series of real numbers; second, apply the grey dynamic model called GM(2,1) to compute the grey model parameters.