Bioinformatics: The Prediction of Life
Tony C Smith
Department of Computer Science, University of Waikato
tcs@cs.waikato.ac.nz
Bioinformatics
Computation with biological data
Data: genes, proteins, microarrays, mass spectra, written documents, populations of organisms …
Goal: knowledge discovery
Bioinformatics Tony C Smith
The essence is prediction …
My dog is very littl_ ?
• We know that letters do not occur at random in English, and that not all letters are equally common (e.g. ‘e’ is more common than ‘x’)
• We know that context changes the probability of a letter (e.g. what is the most likely letter after the sequence “I eat Weet-Bi_”?)
• Prediction is important in many applications (e.g. encryption, compression, communication, graphics, simulation … and bioinformatics!)
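The idea that context changes the probability of the next letter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `corpus`, the context length `order=2`, and the function names `train` and `predict` are all made up for this example and are not from the slides.

```python
from collections import Counter, defaultdict

def train(text, order=2):
    """Count which letter follows each context of `order` letters."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def predict(counts, context, order=2):
    """Return the most frequent next letter after the context, or None if unseen."""
    followers = counts.get(context[-order:])
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical training text; a real model would use far more prose.
corpus = "my dog is very little. my cat is a little bigger. a little goes a long way."
model = train(corpus)
print(predict(model, "my dog is very littl"))  # e
```

A longer context (a larger `order`) usually sharpens the prediction but needs much more training text before every context has been seen.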
Prediction in bioinformatics
• Predicting the location of genes in DNA
• Predicting the function of proteins
• Predicting diseases from molecular samples
• Predicting population dynamics
• Anything that involves “making a judgment”; typically expressible as a yes/no decision about some sample datum
Representation
W e e t – B i x
0101011101100101011001010111010000101101 …
… to the computer, everything is binary!
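The bit string on this slide is the 8-bit ASCII encoding of the first five characters, "Weet-" (with a plain hyphen). A short Python sketch reproduces it; the helper names `to_bits` and `from_bits` are illustrative only.

```python
def to_bits(text):
    """Encode ASCII text as a string of 0s and 1s, eight bits per character."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))

def from_bits(bits):
    """Decode a bit string produced by to_bits back into text."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")

bits = to_bits("Weet-")
print(bits)             # 0101011101100101011001010111010000101101
print(from_bits(bits))  # Weet-
```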
0101011101100101011001010111010000101101
0101101100100111111011010011010000101101
A A C G T C A T T C G A T G A T T C G A
Just as we can teach a computer to predict things about a sequence of letters in English prose, we can also teach it to predict things about other sequences, such as a genetic sequence.
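Because DNA has only four letters, each nucleotide fits in two bits. The sketch below shows one possible packing; the particular mapping in `CODE` is an assumption chosen for illustration and is not necessarily the one behind the bit string on the slide.

```python
# Assumed 2-bit code for each nucleotide (any one-to-one mapping would work).
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}

def encode(seq):
    """Turn a nucleotide string into a bit string, two bits per base."""
    return "".join(CODE[base] for base in seq.upper())

def decode(bits):
    """Turn a bit string produced by encode back into nucleotides."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode("AACGT"))          # 0000011011
print(decode(encode("AACGT")))  # AACGT
```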
A genetic prediction problem ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagc Bioinformatics Tony C Smith
A genetic prediction problem
[The next two slides fill the screen with progressively longer repetitions of the same nucleotide sequence.]
A genetic prediction problem
• A gene encodes a protein
• It is a blueprint that provides biochemical instructions on how to construct a sequence of amino acids so as to make a working protein that will perform some function in the organism
A genetic prediction problem
[Figure labels: transcription factor, RNA, untranslated region, encoding region]
A genetic prediction problem
Untranslated region: ttgcaatcggcgctacgcttcaaaatttattatattcccggc
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
What transcription factors bind to this gene?
Where is the transcription factor binding site?
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: A binding site is often a short general pattern
E.g. CCGATNATCGG (N stands for any nucleotide)
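Searching for such a pattern is a standard string-matching task. A minimal sketch using Python's `re` module, treating N as "any of A, C, G, T"; the function name `find_sites` and the test sequence `dna` are hypothetical and chosen only so that one match exists.

```python
import re

def find_sites(pattern, dna):
    """Return (start, matched text) for every occurrence of the pattern.

    N in the pattern matches any nucleotide. The lookahead lets
    overlapping occurrences be reported as well.
    """
    regex = re.compile("(?=(" + pattern.upper().replace("N", "[ACGT]") + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(dna.upper())]

# Hypothetical sequence with one planted site starting at index 5.
dna = "ttgcaccgataatcggcgct"
print(find_sites("CCGATNATCGG", dna))  # [(5, 'CCGATAATCGG')]
```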
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: The patterns are often reverse complements (the complementary strand, read in the opposite direction, spells the same pattern)
E.g.
CCGATNATCGG
GGCTANTAGCC
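The reverse-complement property is easy to check mechanically. In the sketch below (helper names are illustrative), the complement of the example pattern is the second line shown on the slide, and reversing it gives back the original pattern.

```python
# Map each base to its Watson-Crick partner; N (any base) pairs with N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complementary strand, read in the opposite direction."""
    return complement(seq)[::-1]

site = "CCGATNATCGG"
print(complement(site))                  # GGCTANTAGCC
print(reverse_complement(site) == site)  # True
```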
A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues: Where there is one binding site, often there is another nearby.
A genetic prediction problem
Computer science has developed algorithms and data structures that identify all of these properties quickly and efficiently, so this is exactly the kind of problem computer scientists should be able to solve.
Proteomics
Three consecutive nucleotides in the coding region form a ‘codon’ … i.e. encode an amino acid.
A string of amino acids makes a protein.
3 nucleotides, 4 possibilities for each, so 4^3 = 64 possible codons
But there are only 20 amino acids!
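The count of 64 can be confirmed by simply enumerating every codon, as in this short Python sketch.

```python
from itertools import product

# Every ordered choice of 3 nucleotides from the 4-letter alphabet.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']
```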
Proteomics
There is quite a bit of redundancy in codons.
Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG
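The redundancy can be seen by translating a sequence with a lookup table. The sketch below uses only the seven codons listed on this slide, so `CODON_TABLE` is deliberately partial; any other codon is reported as "?". The function name `translate` and the input sequence are illustrative.

```python
# Partial codon table: only the codons named on the slide.
CODON_TABLE = {
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
    "TAT": "Tyr", "TAC": "Tyr",
    "ATG": "Met",
}

def translate(dna):
    """Split DNA into consecutive codons and look each one up."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE.get(codon, "?") for codon in codons]

# GGA and GGC are different codons but both give glycine.
print(translate("atgggatacggc"))  # ['Met', 'Gly', 'Tyr', 'Gly']
```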
Amino Acid
[Figure labels: R group, amino group, carboxyl group]
Amino Acid
[Figure: tyrosine, glycine]
Primary structure: MSALVSTTPSLLAGVRNVDB …..
Tertiary Structure
Secondary Structure
Signal peptide
• A relatively short sequence of amino acid residues at the N-terminus of the nascent protein (typically 15-50 residues)
MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALTGL …
• Cleaved off as the protein passes through the membrane (operates like a pass key)
• Knowing the signal peptide helps determine the protein's function in the organism
How do we do it?
See any patterns?
ttgcaatcggcgctacgcttcaaaatttattatattcccggc …
[The slide is filled with several thousand more nucleotides of similar sequence.]
Local biases in residues around the cleavage site
Sequence regularities can be exploited by statistical and pattern-based models
Proteomic prediction
Language:
• letters combine to form words
• words combine to form phrases
• phrases combine to form sentences
• sentences combine to form paragraphs (and ultimately Harry Potter books)
Proteins:
• amino acids combine to form peptides
• peptides combine to form secondary motifs (e.g. α-helices and β-sheets)
• motifs combine to make proteins
• proteins combine to make toenails (and ultimately people)
Approach
• The problem is stated as two-class: an amino acid is either the first residue of the mature protein or it is not
• Each residue is described by a single document, which includes as many electrochemical, structural or contextual facts as are available (or desired)
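The two-class framing can be sketched as a labelling step: every residue in a protein becomes one example, and only the first residue of the mature protein is a positive. The function name `label_residues`, the toy sequence, and the cleavage position are all hypothetical.

```python
def label_residues(sequence, mature_start):
    """Return one (position, residue, label) triple per residue.

    label is True only for the first residue of the mature protein,
    i.e. the residue immediately after the signal peptide.
    """
    return [(i, aa, i == mature_start) for i, aa in enumerate(sequence)]

# Hypothetical toy protein: the first 5 residues stand in for the signal peptide.
examples = label_residues("MKLLAQVTR", mature_start=5)
positives = [example for example in examples if example[2]]
print(positives)  # [(5, 'Q', True)]
```

Each protein therefore contributes exactly one positive example and many negatives, which is worth keeping in mind when judging a classifier's accuracy.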
Properties of amino acids
Residue as a document
E.g. Cysteine (Cys, C): aliphatic [yes], aromatic [no], hydrophobic [yes], charge [-], polarized [yes], small [no], number of nitrogen atoms [1], contains sulphur [yes], has a carbon ring [no], ionized [yes], valence [2], cbeta [no], covalent [yes], h-bond [yes], etc. (whatever else the experimenter wants to include)
Sample document
PRNUM:1. AANUM:21.
AMINO[-8]:L. ALIPH[-8]:-. AROMA[-8]:-. CBETA[-8]:-. CHARG[-8]:-. COVAL[-8]:-. HBOND[-8]:-. HPHOB[-8]:+. IONIZ[-8]:-. NITRO[-8]:1. POLAR[-8]:-. POSNG[-8]:0. SMALL[-8]:-. SULPH[-8]:-. TEENY[-8]:-. CRING[-8]:-. VALEN[-8]:2.
AMINO[-7]:L. ALIPH[-7]:-. AROMA[-7]:-. CBETA[-7]:-. CHARG[-7]:-. COVAL[-7]:-. HBOND[-7]:-. HPHOB[-7]:+. IONIZ[-7]:-. NITRO[-7]:1. POLAR[-7]:-. POSNG[-7]:0. SMALL[-7]:-. SULPH[-7]:-. TEENY[-7]:-. CRING[-7]:-. VALEN[-7]:2.
AMINO[-6]:F. ALIPH[-6]:+. AROMA[-6]:+. CBETA[-6]:-. CHARG[-6]:-. COVAL[-6]:-. HBOND[-6]:-. HPHOB[-6]:+. IONIZ[-6]:-. NITRO[-6]:1. POLAR[-6]:-. POSNG[-6]:0. SMALL[-6]:-. SULPH[-6]:-. TEENY[-6]:-. CRING[-6]:+. VALEN[-6]:2.
AMINO[-5]:A. ALIPH[-5]:-. AROMA[-5]:-. CBETA[-5]:-. CHARG[-5]:-. COVAL[-5]:-. HBOND[-5]:-. HPHOB[-5]:-. IONIZ[-5]:-. NITRO[-5]:1. POLAR[-5]:-. POSNG[-5]:0. SMALL[-5]:+. SULPH[-5]:-. TEENY[-5]:+. CRING[-5]:-. VALEN[-5]:2.
AMINO[-4]:T. ALIPH[-4]:+. AROMA[-4]:-. CBETA[-4]:+. CHARG[-4]:-. COVAL[-4]:-. HBOND[-4]:+. HPHOB[-4]:-. IONIZ[-4]:-. NITRO[-4]:1. POLAR[-4]:+. POSNG[-4]:0. SMALL[-4]:+. SULPH[-4]:-. TEENY[-4]:-. CRING[-4]:-. VALEN[-4]:2.
AMINO[-3]:C. ALIPH[-3]:+. AROMA[-3]:-. CBETA[-3]:-. CHARG[-3]:-. COVAL[-3]:+. HBOND[-3]:+. HPHOB[-3]:+. IONIZ[-3]:+. NITRO[-3]:1. POLAR[-3]:+. POSNG[-3]:-. SMALL[-3]:-. SULPH[-3]:+. TEENY[-3]:-. CRING[-3]:-. VALEN[-3]:2.
AMINO[-2]:I. ALIPH[-2]:-. AROMA[-2]:-. CBETA[-2]:+. CHARG[-2]:-. COVAL[-2]:-. HBOND[-2]:-. HPHOB[-2]:+. IONIZ[-2]:-. NITRO[-2]:1. POLAR[-2]:-. POSNG[-2]:0. SMALL[-2]:-. SULPH[-2]:-. TEENY[-2]:-. CRING[-2]:-. VALEN[-2]:2.
AMINO[-1]:A. ALIPH[-1]:-. AROMA[-1]:-. CBETA[-1]:-. CHARG[-1]:-. COVAL[-1]:-. HBOND[-1]:-. HPHOB[-1]:-. IONIZ[-1]:-. NITRO[-1]:1. POLAR[-1]:-. POSNG[-1]:0. SMALL[-1]:+. SULPH[-1]:-. TEENY[-1]:+. CRING[-1]:-. VALEN[-1]:2.
AMINO[0]:R. ALIPH[0]:+. AROMA[0]:-. CBETA[0]:-. CHARG[0]:+. COVAL[0]:-. HBOND[0]:+. HPHOB[0]:-. IONIZ[0]:+. NITRO[0]:4. POLAR[0]:+. POSNG[0]:+. SMALL[0]:-. SULPH[0]:-. TEENY[0]:-. CRING[0]:-. VALEN[0]:3.
AMINO[1]:H. ALIPH[1]:+. AROMA[1]:+. CBETA[1]:-. CHARG[1]:+. COVAL[1]:-. HBOND[1]:+. HPHOB[1]:-. IONIZ[1]:+. NITRO[1]:3. POLAR[1]:+. POSNG[1]:+. SMALL[1]:-. SULPH[1]:-. TEENY[1]:-. CRING[1]:+. VALEN[1]:3.
AMINO[2]:Q. ALIPH[2]:+. AROMA[2]:-. CBETA[2]:-. CHARG[2]:-. COVAL[2]:-. HBOND[2]:+. HPHOB[2]:-. IONIZ[2]:-. NITRO[2]:2. POLAR[2]:+. POSNG[2]:0. SMALL[2]:-. SULPH[2]:-. TEENY[2]:-. CRING[2]:-. VALEN[2]:2.
AMINO[3]:Q. ALIPH[3]:+. AROMA[3]:-. CBETA[3]:-. CHARG[3]:-. COVAL[3]:-. HBOND[3]:+. HPHOB[3]:-. IONIZ[3]:-. NITRO[3]:2. POLAR[3]:+. POSNG[3]:0. SMALL[3]:-. SULPH[3]:-. TEENY[3]:-. CRING[3]:-. VALEN[3]:2.
AMINO[4]:R. ALIPH[4]:+. AROMA[4]:-. CBETA[4]:-. CHARG[4]:+. COVAL[4]:-. HBOND[4]:+. HPHOB[4]:-. IONIZ[4]:+. NITRO[4]:4. POLAR[4]:+. POSNG[4]:+. SMALL[4]:-. SULPH[4]:-. TEENY[4]:-. CRING[4]:-. VALEN[4]:3.
AMINO[5]:Q. ALIPH[5]:+. AROMA[5]:-. CBETA[5]:-. CHARG[5]:-. COVAL[5]:-. HBOND[5]:+. HPHOB[5]:-. IONIZ[5]:-. NITRO[5]:2. POLAR[5]:+. POSNG[5]:0. SMALL[5]:-. SULPH[5]:-. TEENY[5]:-. CRING[5]:-. VALEN[5]:2.
AMINO[6]:Q. ALIPH[6]:+. AROMA[6]:-. CBETA[6]:-. CHARG[6]:-. COVAL[6]:-. HBOND[6]:+. HPHOB[6]:-. IONIZ[6]:-. NITRO[6]:2. POLAR[6]:+. POSNG[6]:0. SMALL[6]:-. SULPH[6]:-. TEENY[6]:-. CRING[6]:-. VALEN[6]:2.
AMINO[7]:Q. ALIPH[7]:+. AROMA[7]:-. CBETA[7]:-. CHARG[7]:-. COVAL[7]:-. HBOND[7]:+. HPHOB[7]:-. IONIZ[7]:-. NITRO[7]:2. POLAR[7]:+. POSNG[7]:0. SMALL[7]:-. SULPH[7]:-. TEENY[7]:-. CRING[7]:-. VALEN[7]:2.
AMINO[8]:Q. ALIPH[8]:+. AROMA[8]:-. CBETA[8]:-. CHARG[8]:-. COVAL[8]:-. HBOND[8]:+. HPHOB[8]:-. IONIZ[8]:-. NITRO[8]:2. POLAR[8]:+. POSNG[8]:0. SMALL[8]:-. SULPH[8]:-. TEENY[8]:-. CRING[8]:-. VALEN[8]:2.
MULT3:7. MULT5:4. MULT7:3. MULT9:2. 2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.
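A document like this can be generated mechanically from a window of residues around the position of interest. The sketch below is a partial reconstruction under assumptions: it emits only the AMINO and n-gram tokens, reading 2GRAM/3GRAM as the residues just before position 0 and GRAM2/GRAM3 as the residues just after it (an interpretation inferred from the sample, not stated in the slides). The property tokens (ALIPH, AROMA, …) and the MULT tokens are omitted because their definitions are not given here. The function name `residue_document` is hypothetical.

```python
def residue_document(sequence, index, radius=8):
    """Build a space-separated token document for the residue at `index`."""
    tokens = []
    # One AMINO token per position in the window, offsets -radius..+radius.
    for offset in range(-radius, radius + 1):
        pos = index + offset
        if 0 <= pos < len(sequence):
            tokens.append(f"AMINO[{offset}]:{sequence[pos]}.")
    before = sequence[max(0, index - 3):index]  # up to 3 residues before
    after = sequence[index + 1:index + 4]       # up to 3 residues after
    if len(before) >= 2:
        tokens.append(f"2GRAM:{before[-2:]}.")
    if len(after) >= 2:
        tokens.append(f"GRAM2:{after[:2]}.")
    if len(before) == 3:
        tokens.append(f"3GRAM:{before}.")
    if len(after) == 3:
        tokens.append(f"GRAM3:{after}.")
    return " ".join(tokens)

# The 17 residues read off the AMINO tokens of the sample document.
window = "LLFATCIARHQQRQQQQ"
print(residue_document(window, index=8))
```

With the sample's window, the final four tokens come out as `2GRAM:IA. GRAM2:HQ. 3GRAM:CIA. GRAM3:HQQ.`, matching the end of the sample document.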
Artificial Intelligence
• Computers do things only human brains can otherwise do
[Figure labels, built up over three slides: expert, expert system, learning system]
Machine learning
What is machine learning?
• creating computer programs that get better with experience
• learn how to make expert judgments
• discover previously hidden, potentially useful information (data mining)
How does it work?
• user provides learning system with examples of concept to be learned
• induction algorithm infers a characteristic model of the examples
• model is used to predict whether or not future novel instances are also examples, and it does this very consistently, and very, very quickly!
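The three steps under "How does it work?" (examples in, model inferred, predictions out) can be sketched with a very small learner. The slides do not name an induction algorithm, so a naive Bayes classifier over document tokens is used here purely as a stand-in; the training examples and token values are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Infer a model from (tokens, label) pairs; both labels must occur."""
    token_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts[label].update(tokens)
    vocabulary = set(token_counts[True]) | set(token_counts[False])
    return token_counts, class_counts, vocabulary

def predict(model, tokens):
    """Return True if the positive class has the higher log-probability."""
    token_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        n = sum(token_counts[label].values())
        score = math.log(class_counts[label] / total)
        for token in tokens:
            if token in vocabulary:  # tokens never seen in training are ignored
                # Add-one smoothing so unseen (token, label) pairs are not zero.
                score += math.log((token_counts[label][token] + 1) / (n + len(vocabulary)))
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical examples: True means "first residue of the mature protein".
examples = [
    (["AMINO[-1]:A", "AMINO[-3]:A"], True),
    (["AMINO[-1]:A", "AMINO[-3]:V"], True),
    (["AMINO[-1]:L", "AMINO[-3]:L"], False),
    (["AMINO[-1]:K", "AMINO[-3]:L"], False),
]
model = train(examples)
print(predict(model, ["AMINO[-1]:A", "AMINO[-3]:V"]))  # True
print(predict(model, ["AMINO[-1]:L", "AMINO[-3]:L"]))  # False
```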
Bioinformatics
• Biologists know proteins, computer scientists know machine learning
• Together, they can find hidden and potentially useful information about genes and proteins
• Biotechnology is a multi-billion dollar industry
• Biotechnology is one of the best-funded areas of scientific research
• Shortage of people educated in bioinformatics
The University of Waikato
• Waikato University is ranked first in the country in computer science and in molecular, cellular, and whole-organism biology
• centre of the universe for machine learning
The University of Waikato
If you’re interested in getting involved in bioinformatics, or indeed any other area along the leading edge of computer science and/or biology, then … Waikato wants You!