250 likes | 417 Views
Using the AT Grid for Genomics Research at the University of Florida. big biology meets obvious opportunity. Interdisciplinary Center for Biotechnology Research. Established at the University of Florida in 1987 by the Florida Legislature centralized organization of biomedical core facilities
E N D
Using the AT Grid for Genomics Research at the University of Florida big biology meets obvious opportunity
Interdisciplinary Center for Biotechnology Research • Established at the University of Florida in 1987 by the Florida Legislature • centralized organization of biomedical core facilities • supporting biotechnology-based research • Organized under the office of the Vice President of Research • Genomics group founded in 1998 to begin providing large-scale DNA sequencing services
Bill Farmerie Scientific Director ICBR Genomics Group Mick Popp, David Moraga, Sharon Norton, Li Zhang Gene Expression Core Li Liu, Fahong Yu, Brian Dill Bioinformatics Core Ernie Almira, Regina Shaw, Neda Panayotova, Kevin Holland, Patrick Thimote, Tina Langaee, Stephen Marsh Genomics Core ICBR Genomics Group
Large-scale DNA sequencing projects Microarray gene expression analysis Bioinformatics Resource Faculty research programs On campus Satellite research facilities Other SUS Universities Biotech industry What we do
Exploring Genome Space Biology moves from a data-poor to a data-rich science
The Human Genome Project • HGP drives innovation • 2 major technological benefits • stimulated development of high throughputmethods • computational tools for data mining and visualization of biological information
Genbank August 15 2004 37,343,937 loci 41,808,045,653 bases 37,343,937 reported sequences
What is this curiousrelationship betweengenes and computation? It is all about information management
Bioinformatics the intersection of biology and information sciences
InformationalMacromolecules Living things store, search, and selectively retrieve biological information
DNA the structure A:T G:C
CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGAGGAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAGGGTTGGTTAATCCGTGCATGTGAGCTCCTCAGGGTGGAATCCAGGAGGATCCACGAGGGTGAATTGGCGGCATTCTTGTCTTACGCCATCGCCTACCCCCAAAACTTCCTGTCTGTGATTGACAGCTACAGCGTAGGATGCGGTCTGTTGAACTTCTGCGCGGTGGCTCTGGCTCTCTGTGAACTGGGCTACAGGCCTGTGGGGGTGCGTTTGGACAGCGGTGACCTCTGCAGCCTGTCGGTGGATGTCCGCCAGGTCTTCAGACGCTGCAGCGAGCATTTCTCCGTCCCTGCCTTTGATTCGTTGATCATCGTCGGGACGAATAACATCTCAGAGAAAAGCTTGACGGAGCTCAGCCTGAAGGAGAACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTTTACAAGCTGGTGGAGGTGAGGGGGAGGCCCCGGATGAAGATCAGCGAGGATCCGGAAAAGAGCACCGTTCCCGGGAGGAAGCAGGTGTACCGCCTGATGGACACTGATGCTCCTCCAGAACCTGGAGTCCCTCTGAGCTGCTTCCCTCTGTGCTCCGATCGCTCCTCCGTCTCCGTCACCCCGGCGCAGGTTCACCGTCTGCGGCAGGAAGTCTTTGTTGATGGACAGGTCACAGCCCGTCTGTGCAGCGCCACAGAGACCAGAACGGAGGTCCAGACCGCTCTCAAGACCCTCCACCCTCGACACCAGAGGCTGCAGGAGCCAGACTCGTACACGGTGATTCACATTCTGAAGAAAACAACATTGGATCGCGCTTTTCCGCTCTCTTCCCTTAGTTTCCCCTCCGAACTCCGCCGCTGGGCCGGAGGACTGAACCGGCCCCCGACGGTGTCCCAGCGGCGGTGCAATGTGGCCCGGGTCCGGGAGGAGTGCGTGACGCCAGAGCAGAATGGTTCGGTGGACGGGGGCGCACACGCTTCTCGCCGCGGCCGCTCCCCGCGGCCCACGGAACCGCGGGATCGGAGCTGTTTTGTGCCGCCTGAAGGACTCGAAGGGGGACGGATAAATGCTGGATCCCCGAGTCCAGATCTGACCGTCTGCATTCCGCTGGTGAGCTGCCAGACGCATCTGGAAACGAGCGCCGACAGAAGCAGCTCCGGACCATGTCGCCGTCCGCGCACACAGGTCGCGTGTAAAGGGGACTTGGTCAGATCATCTTGCACCGGAACCAGGTCTCCCCTGGAGATGGGGACGGTCATGACCGTCTTCTACCAGAAGAAGTCCCAGCGGCCGGAGAGGAGAACCTTCCAGATCAAGCCTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGGGCCGTGCCGGGACCGGGCCCACGCCGCCCAGAACCTCATGTTCCTGGTGTTCCAGCACCGACCGGCCAGTTCTGGCTCAGCTCCACACAACATCTGACAAACCCTCGTGGTTCCTGGTGGTCGACCACACGGCTGGTGAGGCGGCCTCAGGTAGCTCAGGTAGCTCAGGTTAGCGTAAAGGGAGTTTTAAGCATCACCTGGTGACGGGGCAGGTGAGCTCCAGCCACTCAGCAGTGCACGGCCGTGCACATACACACACACCTCTGTGTCGAGGTTACAGGTGGGGCCAAAGCCCAACACCTTCAATGGCCCTCAGAGCTTTGAGGTTTTGAGGAATTGAGCCTTTAATCAGAAAACTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGAGGAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAGGGTTGGTTAATCCGTGCATGTGAGCTCCTCAGGGTGGAATCCAGGAGGATCCACGAGGGTGAATTGGCGGCATTCTTGTCTTACGCCATCGCCTACCCCCAAAACTTCCTGTCTGTGATTGACAGCTACAGCGTAGGATGCGGTCTGTTGAACTTCTGCGCGGTGGCTCTGGCTCTCTGTGAACTGGGCTACAGGCCTGTGGGGGTGCGTTTGGACAGCGGTGACCTCTGCAGCCTGTCGGTGGATGTCCGCCAGGTCTTCAGACGCTGCAGCGAGCATTTCTCCGTCCCTGCCTTTGATTCGTTGATCATCGTCGGGACGAATAACATCTCAGAGAAAAGCTTGACGGAGCTCAGCCTGAAGGAGAACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTTTACAAGCTGGTGGAGGTGAGGGGGAGGCCCCGGATGAAGATCAGCGAGGATCCGGAAAAGAGCACCGTTCCCGGGAGGAAGCAGGTGTACCGCCTGATGGACACTGATGCTCCTCCAGAACCTGGAGTCCCTCTGAGCTGCTTCCCTCTGTGCTCCGATCGCTCCTCCGTCTCCGTCACCCCGGCGCAGGTTCACCGTCTGCGGCAGGAAGTCTTTGTTGATGGACAGGTCACAGCCCGTCTGTGCAGCGCCACAGAGACCAGAACGGAGGTCCAGACCGCTCTCAAGACCCTCCACCCTCGACACCAGAGGCTGCAGGAGCCAGACTCGTACACGGTGATTCACATTCTGAAGAAAACAACATTGGATCGCGCTTTTCCGCTCTCTTCCCTTAGTTTCCCCTCCGAACTCCGCCGCTGGGCCGGAGGACTGAACCGGCCCCCGACGGTGTCCCAGCGGCGGTGCAATGTGGCCCGGGTCCGGGAGGAGTGCGTGACGCCAGAGCAGAATGGTTCGGTGGACGGGGGCGCACACGCTTCTCGCCGCGGCCGCTCCCCGCGGCCCACGGAACCGCGGGATCGGAGCTGTTTTGTGCCGCCTGAAGGACTCGAAGGGGGACGGATAAATGCTGGATCCCCGAGTCCAGATCTGACCGTCTGCATTCCGCTGGTGAGCTGCCAGACGCATCTGGAAACGAGCGCCGACAGAAGCAGCTCCGGACCATGTCGCCGTCCGCGCACACAGGTCGCGTGTAAAGGGGACTTGGTCAGATCATCTTGCACCGGAACCAGGTCTCCCCTGGAGATGGGGACGGTCATGACCGTCTTCTACCAGAAGAAGTCCCAGCGGCCGGAGAGGAGAACCTTCCAGATCAAGCCTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGGGCCGTGCCGGGACCGGGCCCACGCCGCCCAGAACCTCATGTTCCTGGTGTTCCAGCACCGACCGGCCAGTTCTGGCTCAGCTCCACACAACATCTGACAAACCCTCGTGGTTCCTGGTGGTCGACCACACGGCTGGTGAGGCGGCCTCAGGTAGCTCAGGTAGCTCAGGTTAGCGTAAAGGGAGTTTTAAGCATCACCTGGTGACGGGGCAGGTGAGCTCCAGCCACTCAGCAGTGCACGGCCGTGCACATACACACACACCTCTGTGTCGAGGTTACAGGTGGGGCCAAAGCCCAACACCTTCAATGGCCCTCAGAGCTTTGAGGTTTTGAGGAATTGAGCCTTTAATCAGAAAA another simple linear code is the basis of life
Biological Information function information storage selective information retrieval
From Sequence to Function • The genomic sequence identifies the 'parts' • the next trick is understanding gene function • Post genomic era = functional genomics • Critical concept: genes of similar sequence may have similar functions • Inferring function for a new gene begins with searching for it’s nearest neighbor (or homolog) of known function
BLAST • Most common starting point for gene identification • Input: nucleotide or amino acid sequence • Similarity search of sequence repository (GenBank) • Output • Calculated scores (bit score and e-value) • Text string (definition line), ID Reference Tag • Sequence alignment • Advantages • Fast algorithm, very good at finding close homologs • Disadvantages • Only finds genes existing in the search database • Not good at finding distant relatives
Alternatives to BLAST • HMMER developed by Sean Eddy • Uses Hidden Markov Models • Searches unknown protein query sequence against a database of protein family models • Statistical models constructed from alignment of conserved protein regions (Pfam) • 7677 families in Pfam release 16.0 • Advantages • Superior to BLAST for discovering more distant homology relations • Disadvantages • More computationally intensive than BLAST
AT Grid: http://at.ufl.edu/grid/ • Office of Academic Technology • Fedro Zazueta Director • Mike Kutyna Project Manager • Links 500 desktop PCs using United Devices GridMP 4.2 software • HMMER • BLAST
HMMER Query Pfam DB UD HMMER implementation
Sample HMMER Output Query: 7.UF_CU.3.CB366 804 1 630 nseq=40; translated Scores for sequence family classification (score includes all domains): Model Description Score E-value N -------- ----------- ----- ------- --- KH_1 KH domain 127.6 2.3e-34 2 KH_2 KH domain 3.0 0.88 2 Parsed for domains: Model Domain seq-f seq-t hmm-f hmm-t score E-value -------- ------- ----- ----- ----- ----- ----- ------- KH_2 1/2 36 93 .. 1 78 [] 3.4 0.8 KH_1 1/2 36 98 .. 1 74 [] 66.9 4.4e-16 KH_2 2/2 118 160 .. 1 78 [] 0.1 1.8 KH_1 2/2 118 180 .. 1 74 [] 68.3 1.6e-16 Alignments of top-scoring domains: KH_2: domain 1 of 2, from 36 to 93: score 3.4, E = 0.8 *->avivvirtsrpGivIGKgGsnIkklgkelrklltgkkvqieviEySd i + ++ G +IG+gGs I+ l+++ + ++i E 7.UF_CU.3. 36 SDIMMVESANVGKIIGRGGSKIRDLEQDSNAR-----IKISRDE--- 74 eeFgkkVfLeLwVKVkknWvknpellaqLga<-* + + + v+ ++ ++ a 7.UF_CU.3. 75 ---DENGMKS---------VEISGTDEEIDA 93 KH_1: domain 1 of 2, from 36 to 98: score 66.9, E = 4.4e-16 *->terilippskvgriIGkgGstIkeIreetGakIdipddgseskplpe + + + + vg+iIG+gGs+I+ ++++++a+I+i++d+ 7.UF_CU.3. 36 SDIMMVESANVGKIIGRGGSKIRDLEQDSNARIKISRDE-------- 74 dplngsdertvtIsGtpeavekAkkli<-* + +++ + v+IsGt e++++Ak++i 7.UF_CU.3. 75 -D--ENGMKSVEISGTDEEIDAAKRMI 98
Good news -- Bad news • AT Grid compresses time for HMMER searches ~100X • Accepts batch queries as input • Query sequences must be pre-computed protein translation • Requires additional step & CDS prediction • Output is flat text • Developed Perl-based parser • Tabbed output as input to relational DB • Integrating gene annotation data from as many applications as possible • Facilitates comparison of results for the same query