Applications of Data Mining and Machine Learning in Bioinformatics Yen-Jen Oyang Dept. of Computer Science and Information Engineering
Basics of Protein Structures • A typical protein consists of hundreds to thousands of amino acids. • There are 20 standard amino acids, each of which is denoted by a one-letter code.
The 20 Amino Acids - 1 Source: http://prowl.rockefeller.edu/aainfo/struct.htm
The 20 Amino Acids - 2 Source: http://prowl.rockefeller.edu/aainfo/struct.htm
The 20 Amino Acids - 3 Source: http://prowl.rockefeller.edu/aainfo/struct.htm
Three-dimensional Structure of Myoglobin Source: Lectures of BioInfo by yukijuan
Prediction of Protein Functions • Given a protein sequence, biochemists are interested in its functions and its tertiary structure.
Protein Classification Based on the Homology Model • Modern protein databases are growing rapidly. • To speed up the identification of protein functions, it is desirable to classify the protein of interest before biochemistry experiments are conducted.
One widely used approach to classifying proteins is based on the homology model, i.e. proteins are classified by the similarity of their amino acid sequences. • BLAST and FASTA are two of the most widely used software utilities for computing the similarity between two sequences. • We can cluster the proteins in an existing protein database in advance, as the next slide exemplifies.
An Example of Similar Protein Sequences
3BP2_HUMAN
MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGTQLQLLKWPLRFVIIHKRCVYYFKSSTSASPQGAFSLSGYNRVMRAAEETTSNNVFPFKIIHISKKHRTWFFSASSEEERKSWMALLRREIGHFHEKKDLPLDTSDSSSDTDSFYGAVERPVDISLSPYPTDNEDYEHDDEDDSYLEPDSPEPGRLEDALMHPPAYPPPPVPTPRKPAFSDMPRAHSFTSKGPGPLLPPPPPKHGLPDVGLAAEDSKRDPLCPRRAEPCPRVPATPRRMSDPPLSTMPTAPGLRKPPCFRESASPSPEPWTPGHGACSTSSAAIMATATSRNCDKLKSFHLSPRGPPTSEPPPVPANKPKFLKIAEEDPPREAAMPGLFVPPVAPRPPALKLPVPEAMARPAVLPRPEKPQLPHLQRSPPDGQSFRSFSFEKPRQPSQADTGGDDSDEDYEKVPLPNSVFVNTTESCEVERLFKATSPRGEPQDGLYCIRNSSTKSGKVLVVWDETSNKVRNYRIFEKDSKFYLEGEVLFVSVGSMVEHYHTHVLPSHQSLLLRHPYGYTGPR
3BP2_MOUSE
MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGTQLQLLKWPLRFVIIHKRCIYYFKSSTSASPQGAFSLSGYNRVMRAAEETTSNNVFPFKIIHISKKHRTWFFSASSEDERKSWMAFVRREIGHFHEKKELPLDTSDSSSDTDSFYGAVERPIDISLSSYPMDNEDYEHEDEDDSYLEPDSPGPMKLEDALTYPPAYPPPPVPVPRKPAFSDLPRAHSFTSKSPSPLLPPPPPKRGLPDTGSAPEDAKDALGLRRVEPGLRVPATPRRMSDPPMSNVPTVPNLRKHPCFRDSVNPGLEPWTPGHGTSSVSSSTTMAVATSRNCDKLKSFHLSSRGPPTSEPPPVPANKPKFLKIAEEPSPREAAKFAPVPPVAPRPPVQKMPMPEATVRPAVLPRPENTPLPHLQRSPPDGQSFRGFSFEKARQPSQADTGEEDSDEDYEKVPLPNSVFVNTTESCEVERLFKATDPRGEPQDGLYCIRNSSTKSGKVLVVWDESSNKVRNYRIFEKDSKFYLEGEVLFASVGSMVEHYHTHVLPSHQSLLLRHPYGYAGPR
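How similar two such sequences are can be put into a number with an alignment score. The sketch below is not BLAST or FASTA (both use faster heuristics); it is a plain Needleman-Wunsch global alignment, scored on the first 20 residues of the human and mouse 3BP2 sequences. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; real tools use substitution matrices such as BLOSUM62.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b (dynamic programming).

    The scoring values are illustrative assumptions, not BLAST/FASTA defaults.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# First 20 residues of 3BP2_HUMAN and 3BP2_MOUSE from the slide above.
human = "MAAEEMHWPVPMKAIGAQNL"
mouse = "MAAEEMQWPVPMKAIGAQNL"

score = needleman_wunsch_score(human, mouse)
identity = sum(x == y for x, y in zip(human, mouse)) / len(human)
print(score, identity)
```

The two fragments differ at a single position (H versus Q), so the score is close to that of a perfect match.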
When a protein with unknown functions is submitted, the classification software identifies the protein clusters that contain the most similar proteins. • The biochemists can then predict the functions of the protein based on the output of the classification software. • The protein clustering conducted in advance expedites the search process.
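The lookup step described above can be sketched as follows. This is a minimal illustration, not how any particular classification software works: sequence similarity is approximated by the Jaccard index over 3-residue substrings (a stand-in for a BLAST or FASTA score), and the cluster names, the function names, and the members of `cluster_B` are hypothetical.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def closest_cluster(query, clusters, k=3):
    """Return (name, similarity) of the cluster holding the most similar member."""
    best_name, best_sim = None, -1.0
    for name, members in clusters.items():
        sim = max(jaccard(query, m, k) for m in members)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


# Clusters computed in advance (toy data; names are hypothetical).
clusters = {
    "cluster_A": ["MAAEEMHWPVPMKAIGAQNL"],  # start of 3BP2_HUMAN
    "cluster_B": ["GGGSSGGGTTGGSSGGTTGG"],  # made-up sequence
}
query = "MAAEEMQWPVPMKAIGAQNL"  # start of 3BP2_MOUSE, treated as unknown

name, sim = closest_cluster(query, clusters)
print(name, round(sim, 3))
```

Because the clusters are built beforehand, a real system would compare the query against cluster representatives rather than every protein in the database, which is where the speed-up comes from.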
Applications of Data Classification in Microarray Data Analysis • In microarray data analysis, data classification is employed to predict the class of a new sample based on existing samples whose classes are known.
For example, the Leukemia data set contains 72 samples and 7129 genes: • 25 Acute Myeloid Leukemia (AML) samples; • 38 B-cell Acute Lymphoblastic Leukemia samples; • 9 T-cell Acute Lymphoblastic Leukemia samples.
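Predicting the class of a new sample from labeled ones can be sketched with a k-nearest-neighbor classifier. The slides do not say which classifier is used, so k-NN is an assumption here, and the expression values below are invented toy numbers with 3 genes per sample rather than the 7129 genes of the real data set. The labels `B-ALL` and `T-ALL` are shorthand for the two Acute Lymphoblastic Leukemia classes.

```python
import math
from collections import Counter


def knn_predict(train, labels, sample, k=3):
    """Predict the class of `sample` by majority vote among its k nearest
    training samples (Euclidean distance between expression profiles)."""
    dists = sorted((math.dist(x, sample), y) for x, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


# Toy expression profiles (invented values, 3 genes per sample).
train = [
    (5.0, 1.0, 1.0), (4.8, 1.2, 0.9),  # AML
    (1.0, 5.0, 1.1), (0.9, 4.7, 1.3),  # B-cell ALL
    (1.1, 1.0, 5.2), (1.0, 0.8, 4.9),  # T-cell ALL
]
labels = ["AML", "AML", "B-ALL", "B-ALL", "T-ALL", "T-ALL"]

new_sample = (4.9, 1.1, 1.0)
predicted = knn_predict(train, labels, new_sample)
print(predicted)
```

With thousands of genes and only tens of samples, as in the Leukemia data set, gene selection or dimensionality reduction would normally come before a distance-based classifier like this one.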
Applications of Data Clustering in Microarray Data Analysis • Data clustering has been employed in microarray data analysis for • identifying genes with similar expression patterns; • identifying the subtypes of samples.
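Grouping genes with similar expression can be sketched with single-linkage agglomerative clustering. The slides do not name a clustering algorithm, so this choice is an assumption, as are the function name and the toy expression values (4 genes measured in 3 samples).

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    The distance between two clusters is the smallest Euclidean distance
    between any pair of their members. Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                math.dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        clusters[a].extend(clusters[b])  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data: each row is one gene's expression across 3 samples (invented).
gene_profiles = [
    (1.0, 2.0, 3.0),  # gene 0: rising
    (1.1, 2.1, 2.9),  # gene 1: rising
    (3.0, 2.0, 1.0),  # gene 2: falling
    (2.9, 1.9, 1.2),  # gene 3: falling
]

groups = single_linkage(gene_profiles, k=2)
print(groups)
```

Passing the columns of the matrix (one profile per sample) instead of the rows would cluster samples rather than genes, which is the setting for identifying sample subtypes.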