1 / 25

Using the AT Grid for Genomics Research at the University of Florida

Using the AT Grid for Genomics Research at the University of Florida. big biology meets obvious opportunity. Interdisciplinary Center for Biotechnology Research. Established at the University of Florida in 1987 by the Florida Legislature centralized organization of biomedical core facilities

afya
Download Presentation

Using the AT Grid for Genomics Research at the University of Florida

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using the AT Grid for Genomics Research at the University of Florida big biology meets obvious opportunity

  2. Interdisciplinary Center for Biotechnology Research • Established at the University of Florida in 1987 by the Florida Legislature • centralized organization of biomedical core facilities • supporting biotechnology-based research • Organized under the office of the Vice President of Research • Genomics group founded in 1998 to begin providing large-scale DNA sequencing services

  3. Bill Farmerie Scientific Director ICBR Genomics Group Mick Popp, David Moraga, Sharon Norton, Li Zhang Gene Expression Core Li Liu, Fahong Yu, Brian Dill Bioinformatics Core Ernie Almira, Regina Shaw, Neda Panayotova, Kevin Holland, Patrick Thimote, Tina Langaee, Stephen Marsh Genomics Core ICBR Genomics Group

  4. Large-scale DNA sequencing projects Microarray gene expression analysis Bioinformatics Resource Faculty research programs On campus Satellite research facilities Other SUS Universities Biotech industry What we do

  5. Exploring Genome Space Biology moves from a data-poor to a data-rich science

  6. The Genome Covers

  7. The Human Genome Project • HGP drives innovation • 2 major technological benefits • stimulated development of high throughputmethods • computational tools for data mining and visualization of biological information

  8. Genbank August 15 2004 37,343,937 loci 41,808,045,653 bases 37,343,937 reported sequences

  9. Growth of DNASequenceOutput

  10. What is this curiousrelationship betweengenes and computation? It is all about information management

  11. Bioinformatics the intersection of biology and information sciences

  12. InformationalMacromolecules Living things store, search, and selectively retrieve biological information

  13. DNA the structure A:T G:C

  14. DNA as a linear array

  15. we store lots of information using simple linear codes

  16. CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGAGGAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAGGGTTGGTTAATCCGTGCATGTGAGCTCCTCAGGGTGGAATCCAGGAGGATCCACGAGGGTGAATTGGCGGCATTCTTGTCTTACGCCATCGCCTACCCCCAAAACTTCCTGTCTGTGATTGACAGCTACAGCGTAGGATGCGGTCTGTTGAACTTCTGCGCGGTGGCTCTGGCTCTCTGTGAACTGGGCTACAGGCCTGTGGGGGTGCGTTTGGACAGCGGTGACCTCTGCAGCCTGTCGGTGGATGTCCGCCAGGTCTTCAGACGCTGCAGCGAGCATTTCTCCGTCCCTGCCTTTGATTCGTTGATCATCGTCGGGACGAATAACATCTCAGAGAAAAGCTTGACGGAGCTCAGCCTGAAGGAGAACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTTTACAAGCTGGTGGAGGTGAGGGGGAGGCCCCGGATGAAGATCAGCGAGGATCCGGAAAAGAGCACCGTTCCCGGGAGGAAGCAGGTGTACCGCCTGATGGACACTGATGCTCCTCCAGAACCTGGAGTCCCTCTGAGCTGCTTCCCTCTGTGCTCCGATCGCTCCTCCGTCTCCGTCACCCCGGCGCAGGTTCACCGTCTGCGGCAGGAAGTCTTTGTTGATGGACAGGTCACAGCCCGTCTGTGCAGCGCCACAGAGACCAGAACGGAGGTCCAGACCGCTCTCAAGACCCTCCACCCTCGACACCAGAGGCTGCAGGAGCCAGACTCGTACACGGTGATTCACATTCTGAAGAAAACAACATTGGATCGCGCTTTTCCGCTCTCTTCCCTTAGTTTCCCCTCCGAACTCCGCCGCTGGGCCGGAGGACTGAACCGGCCCCCGACGGTGTCCCAGCGGCGGTGCAATGTGGCCCGGGTCCGGGAGGAGTGCGTGACGCCAGAGCAGAATGGTTCGGTGGACGGGGGCGCACACGCTTCTCGCCGCGGCCGCTCCCCGCGGCCCACGGAACCGCGGGATCGGAGCTGTTTTGTGCCGCCTGAAGGACTCGAAGGGGGACGGATAAATGCTGGATCCCCGAGTCCAGATCTGACCGTCTGCATTCCGCTGGTGAGCTGCCAGACGCATCTGGAAACGAGCGCCGACAGAAGCAGCTCCGGACCATGTCGCCGTCCGCGCACACAGGTCGCGTGTAAAGGGGACTTGGTCAGATCATCTTGCACCGGAACCAGGTCTCCCCTGGAGATGGGGACGGTCATGACCGTCTTCTACCAGAAGAAGTCCCAGCGGCCGGAGAGGAGAACCTTCCAGATCAAGCCTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGGGCCGTGCCGGGACCGGGCCCACGCCGCCCAGAACCTCATGTTCCTGGTGTTCCAGCACCGACCGGCCAGTTCTGGCTCAGCTCCACACAACATCTGACAAACCCTCGTGGTTCCTGGTGGTCGACCACACGGCTGGTGAGGCGGCCTCAGGTAGCTCAGGTAGCTCAGGTTAGCGTAAAGGGAGTTTTAAGCATCACCTGGTGACGGGGCAGGTGAGCTCCAGCCACTCAGCAGTGCACGGCCGTGCACATACACACACACCTCTGTGTCGAGGTTACAGGTGGGGCCAAAGCCCAACACCTTCAATGGCCCTCAGAGCTTTGAGGTTTTGAGGAATTGAGCCTTTAATCAGAAAACTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGAGGAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAGGGTTGGTTAATCCGTGCATGTGAGCTCCTCAGGGTGGAATCCAGGAGGATCCACGAGGGTGAATTGGCGGCATTCTTGTCTTACGCCATCGCCTACCCCCAAAACTTCCTGTCTGTGATTGACAGCTACAGCGTAGGATGCGGTCTGTTGAACTTCTGCGCGGTGGCTCTGGCTCTCTGTGAACTGGGCTACAGGCCTGTGGGGGTGCGTTTGGACAGCGGTGACCTCTGCAGCCTGTCGGTGGATGTCCGCCAGGTCTTCAGACGCTGCAGCGAGCATTTCTCCGTCCCTGCCTTTGATTCGTTGATCATCGTCGGGACGAATAACATCTCAGAGAAAAGCTTGACGGAGCTCAGCCTGAAGGAGAACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTTTACAAGCTGGTGGAGGTGAGGGGGAGGCCCCGGATGAAGATCAGCGAGGATCCGGAAAAGAGCACCGTTCCCGGGAGGAAGCAGGTGTACCGCCTGATGGACACTGATGCTCCTCCAGAACCTGGAGTCCCTCTGAGCTGCTTCCCTCTGTGCTCCGATCGCTCCTCCGTCTCCGTCACCCCGGCGCAGGTTCACCGTCTGCGGCAGGAAGTCTTTGTTGATGGACAGGTCACAGCCCGTCTGTGCAGCGCCACAGAGACCAGAACGGAGGTCCAGACCGCTCTCAAGACCCTCCACCCTCGACACCAGAGGCTGCAGGAGCCAGACTCGTACACGGTGATTCACATTCTGAAGAAAACAACATTGGATCGCGCTTTTCCGCTCTCTTCCCTTAGTTTCCCCTCCGAACTCCGCCGCTGGGCCGGAGGACTGAACCGGCCCCCGACGGTGTCCCAGCGGCGGTGCAATGTGGCCCGGGTCCGGGAGGAGTGCGTGACGCCAGAGCAGAATGGTTCGGTGGACGGGGGCGCACACGCTTCTCGCCGCGGCCGCTCCCCGCGGCCCACGGAACCGCGGGATCGGAGCTGTTTTGTGCCGCCTGAAGGACTCGAAGGGGGACGGATAAATGCTGGATCCCCGAGTCCAGATCTGACCGTCTGCATTCCGCTGGTGAGCTGCCAGACGCATCTGGAAACGAGCGCCGACAGAAGCAGCTCCGGACCATGTCGCCGTCCGCGCACACAGGTCGCGTGTAAAGGGGACTTGGTCAGATCATCTTGCACCGGAACCAGGTCTCCCCTGGAGATGGGGACGGTCATGACCGTCTTCTACCAGAAGAAGTCCCAGCGGCCGGAGAGGAGAACCTTCCAGATCAAGCCTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGGGCCGTGCCGGGACCGGGCCCACGCCGCCCAGAACCTCATGTTCCTGGTGTTCCAGCACCGACCGGCCAGTTCTGGCTCAGCTCCACACAACATCTGACAAACCCTCGTGGTTCCTGGTGGTCGACCACACGGCTGGTGAGGCGGCCTCAGGTAGCTCAGGTAGCTCAGGTTAGCGTAAAGGGAGTTTTAAGCATCACCTGGTGACGGGGCAGGTGAGCTCCAGCCACTCAGCAGTGCACGGCCGTGCACATACACACACACCTCTGTGTCGAGGTTACAGGTGGGGCCAAAGCCCAACACCTTCAATGGCCCTCAGAGCTTTGAGGTTTTGAGGAATTGAGCCTTTAATCAGAAAA another simple linear code is the basis of life

  17. Biological Information function information storage selective information retrieval

  18. From Sequence to Function • The genomic sequence identifies the 'parts' • the next trick is understanding gene function • Post genomic era = functional genomics • Critical concept: genes of similar sequence may have similar functions • Inferring function for a new gene begins with searching for it’s nearest neighbor (or homolog) of known function

  19. BLAST • Most common starting point for gene identification • Input: nucleotide or amino acid sequence • Similarity search of sequence repository (GenBank) • Output • Calculated scores (bit score and e-value) • Text string (definition line), ID Reference Tag • Sequence alignment • Advantages • Fast algorithm, very good at finding close homologs • Disadvantages • Only finds genes existing in the search database • Not good at finding distant relatives

  20. Alternatives to BLAST • HMMER developed by Sean Eddy • Uses Hidden Markov Models • Searches unknown protein query sequence against a database of protein family models • Statistical models constructed from alignment of conserved protein regions (Pfam) • 7677 families in Pfam release 16.0 • Advantages • Superior to BLAST for discovering more distant homology relations • Disadvantages • More computationally intensive than BLAST

  21. This could be a super computer

  22. AT Grid: http://at.ufl.edu/grid/ • Office of Academic Technology • Fedro Zazueta Director • Mike Kutyna Project Manager • Links 500 desktop PCs using United Devices GridMP 4.2 software • HMMER • BLAST

  23. HMMER Query Pfam DB UD HMMER implementation

  24. Sample HMMER Output Query: 7.UF_CU.3.CB366 804 1 630 nseq=40; translated Scores for sequence family classification (score includes all domains): Model Description Score E-value N -------- ----------- ----- ------- --- KH_1 KH domain 127.6 2.3e-34 2 KH_2 KH domain 3.0 0.88 2 Parsed for domains: Model Domain seq-f seq-t hmm-f hmm-t score E-value -------- ------- ----- ----- ----- ----- ----- ------- KH_2 1/2 36 93 .. 1 78 [] 3.4 0.8 KH_1 1/2 36 98 .. 1 74 [] 66.9 4.4e-16 KH_2 2/2 118 160 .. 1 78 [] 0.1 1.8 KH_1 2/2 118 180 .. 1 74 [] 68.3 1.6e-16 Alignments of top-scoring domains: KH_2: domain 1 of 2, from 36 to 93: score 3.4, E = 0.8 *->avivvirtsrpGivIGKgGsnIkklgkelrklltgkkvqieviEySd i + ++ G +IG+gGs I+ l+++ + ++i E 7.UF_CU.3. 36 SDIMMVESANVGKIIGRGGSKIRDLEQDSNAR-----IKISRDE--- 74 eeFgkkVfLeLwVKVkknWvknpellaqLga<-* + + + v+ ++ ++ a 7.UF_CU.3. 75 ---DENGMKS---------VEISGTDEEIDA 93 KH_1: domain 1 of 2, from 36 to 98: score 66.9, E = 4.4e-16 *->terilippskvgriIGkgGstIkeIreetGakIdipddgseskplpe + + + + vg+iIG+gGs+I+ ++++++a+I+i++d+ 7.UF_CU.3. 36 SDIMMVESANVGKIIGRGGSKIRDLEQDSNARIKISRDE-------- 74 dplngsdertvtIsGtpeavekAkkli<-* + +++ + v+IsGt e++++Ak++i 7.UF_CU.3. 75 -D--ENGMKSVEISGTDEEIDAAKRMI 98

  25. Good news -- Bad news • AT Grid compresses time for HMMER searches ~100X • Accepts batch queries as input • Query sequences must be pre-computed protein translation • Requires additional step & CDS prediction • Output is flat text • Developed Perl-based parser • Tabbed output as input to relational DB • Integrating gene annotation data from as many applications as possible • Facilitates comparison of results for the same query

More Related