Managing Gene Annotation Information: the search is over … one problem solved … another begins. Observations from a foot soldier in the bio-information (r)evolution. Bill Farmerie -- ICBR Genomics Group
Interdisciplinary Center for Biotechnology Research • Established at the University of Florida in 1987 by the Florida Legislature • centralized organization of biomedical core facilities • supporting biotechnology-based research • How did information management become my problem?
Why should I care about this problem? • Because my paycheck depends on it. • Avoid fatal failure in the funding loop. Funding loop (diagram labels): PI has $ for large gene-based project; other PIs think this looks like a good idea; PI applies for new funding; Core Lab generates data; downstream data management & analysis; PI writes papers, gives talks
From Sequence to Function • The genomic sequence identifies the 'parts' • The next trick is understanding gene function • Post-genomic era = functional genomics • Critical concept: genes of similar sequence may have similar functions • Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function
BLAST • Most common starting point for gene identification • Similarity search of sequence repository (GenBank) • Output • Calculated scores (bit score and e-value) • Text string (definition line), ID Reference Tag • Sequence alignment • Advantages • Fast algorithm, very good at finding close homologs • Disadvantages • Not good at finding distant relatives • Cluster and Grid-enabled versions available
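The two calculated scores on this slide are linked by the standard Karlin-Altschul relation: for a bit score S', the expected number of chance hits is E = m * n * 2^(-S'), where m is the query length and n is the database length (BLAST uses effective lengths in practice). A minimal sketch follows; the function name and the sample lengths are illustrative assumptions, not figures from the talk.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are raw lengths here; real BLAST
    substitutes edge-corrected effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against a 1e9-residue database.
    for score in (30, 50, 100):
        print(score, expected_hits(score, query_len=300, db_len=10**9))
```

Each additional bit halves the e-value, which is why filtering on either score gives similar rankings within a single search but e-values are not comparable across databases of different size.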
HMMER • HMMER developed by Sean Eddy • Uses Hidden Markov Models • Searches unknown protein query sequence against a database of protein family models • Statistical models constructed from alignment of conserved protein regions (Pfam) • Advantages • Superior to BLAST for discovering more distant homology relations • Disadvantages • More computationally intensive than BLAST • GRID enabled
OK! Great! Sequencing done. Homology searches complete. But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?
Search for summarizing information that restores sanity CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG
BlastQuest A small idea with a big mission
BlastQuest Requirements • Accessible to research groups at remote locations • Privacy-constrained sharing of results among the scientists • Selective browsing of BLAST homology search results • Selective data filtering on statistical criteria • e-value or bit score • Selective data grouping on criteria such as GI number, or a defined number of top-scoring results • Ad hoc search capability on user-determined criteria: • text terms • boolean logic From a computational point of view BlastQuest is embarrassingly simple. However, it solved our problem of information storage, selective retrieval, and distribution.
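The filtering, grouping, and search requirements above can be sketched in a few lines. This is an in-memory illustration only, not BlastQuest's actual implementation; the `Hit` record, the function names, and the sample hits are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout)."""
    query: str
    gi: str
    defline: str
    bit_score: float
    evalue: float


def filter_hits(hits, max_evalue=1e-5, min_bits=0.0):
    """Keep hits that pass both statistical cutoffs."""
    return [h for h in hits if h.evalue <= max_evalue and h.bit_score >= min_bits]


def top_hits(hits, n=1):
    """Group by query and keep the n highest-scoring hits per query."""
    by_query = defaultdict(list)
    for h in hits:
        by_query[h.query].append(h)
    return {q: sorted(hs, key=lambda h: h.bit_score, reverse=True)[:n]
            for q, hs in by_query.items()}


def search(hits, all_of=(), none_of=()):
    """Boolean text search on definition lines: AND over all_of, NOT over none_of."""
    def matches(h):
        text = h.defline.lower()
        return (all(t.lower() in text for t in all_of)
                and not any(t.lower() in text for t in none_of))
    return [h for h in hits if matches(h)]


# Made-up example data, for illustration only.
hits = [
    Hit("contig1", "gi|0001", "heat shock protein 70", 250.0, 1e-60),
    Hit("contig1", "gi|0002", "hypothetical protein", 40.0, 0.5),
    Hit("contig2", "gi|0003", "putative heat shock protein", 90.0, 1e-12),
]

if __name__ == "__main__":
    print(filter_hits(hits))
    print(top_hits(hits, n=1))
    print(search(hits, all_of=["heat shock"], none_of=["putative"]))
```

In a shared, access-controlled setting the same three operations map naturally onto WHERE, GROUP BY, and LIKE clauses in a relational database.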
KEGG Classification • Kyoto Encyclopedia of Genes and Genomes • “Wiring diagrams of life” • KEGG Protein Networks • Metabolic pathways • Regulatory pathways • Molecular complexes • Network-network relations • Network-environment relations
[Figure legend: Common to both; Unique to non-Unigene; Unique to Unigene]
Bacterial Genome Annotation Workbench Another simple idea driven by necessity
Simple problems. Simple solutions. Why are these simple ideas important?
Human Genome Project • HGP drove innovation in biotechnology • 2 major technological benefits • stimulated development of high-throughput methods • reliance on computational tools for data mining and visualization of biological information
The HGP and the cost of DNA sequencing • “finished” quality DNA sequence • a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000 • also known as phred > 40 • contiguous DNA sequence of phred > 40 usually achieved by multifold sequencing of the same region; typically 7-10X coverage • 1985: $10 per finished base • 2001: $1 per 10 finished bases
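The numbers on this slide follow from the phred definition Q = -10 * log10(P), where P is the probability that a base call is wrong. A small sketch of that conversion and of the cost comparison; the function names are illustrative.

```python
import math


def phred(error_prob: float) -> float:
    """Phred quality score for a base-call error probability."""
    return -10.0 * math.log10(error_prob)


def error_prob(quality: float) -> float:
    """Inverse mapping: error probability for a phred score."""
    return 10.0 ** (-quality / 10.0)


# Cost per finished base, taken from the slide.
COST_1985 = 10.0        # $10 per finished base
COST_2001 = 1.0 / 10.0  # $1 per 10 finished bases

if __name__ == "__main__":
    print("phred at 1 error in 10,000:", phred(1e-4))
    print("error probability at phred 20:", error_prob(20))
    print("fold drop in cost, 1985 to 2001:", COST_1985 / COST_2001)
```

An error rate of exactly 1 in 10,000 corresponds to phred 40, so "less than 1 in 10,000" is the same statement as "phred > 40", and the two cost figures differ by about 100-fold.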
GenBank • August 22, 2005: Public Collections of DNA and RNA Sequence Reach 100 Gigabases
Trends in the cost efficiency of DNA sequencing§ §Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) "Advanced sequencing technologies: methods and goals." Nature Reviews Genetics 5:335
454 Life Sciences Corporation The first commercial, massively parallel, DNA sequencing technology
454 Technology • Cyclic-array sequencing on in vitro amplified DNA molecules • individual molecules must be amplified to give a detectable sequencing signal • Instead of biological cloning, we amplify individual DNA fragments on solid-state beads using PCR • Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order • "sequencing by synthesis"
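To make "sequencing by synthesis" concrete: nucleotides are flowed over the beads one species at a time in a fixed cycle, and the light signal from each flow is roughly proportional to the number of bases incorporated. The toy decoder below illustrates that idea only; it is not the vendor's base-calling algorithm, and the default flow order, the rounding rule, and the example signals are assumptions made for the sketch.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert per-flow signal intensities into a base sequence.

    Flow i delivers nucleotide flow_order[i % len(flow_order)].
    A signal near k is read as a run of k identical bases, so
    homopolymers show up as a single stronger signal.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        count = max(0, int(round(signal)))  # noise near 0 means no incorporation
        bases.append(base * count)
    return "".join(bases)


if __name__ == "__main__":
    # Hypothetical normalized signals for six flows: T, A, C, G, T, A.
    print(decode_flowgram([1.02, 0.05, 2.1, 0.9, 0.0, 1.1]))
```

Because run length is inferred from signal strength, long homopolymers are the characteristic weak point of this chemistry: the gap between a signal of 7 and 8 is much harder to resolve than between 1 and 2.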
The bottom line … • efficiency of DNA sequencing increased 100X • cost per finished base declined 10- to 30-fold … so what happens next? • The “democratization” of large-scale genomic biology • Many projects are now possible that were once fiscally inviable • We must deal with basic local data management and information issues or lose this opportunity
If you thought bioinformatics was important before … • By terminator-based sequencing we at UF produce 60-70 Mbp per year • By synthesis-based sequencing we produce 60-70 Mbp per day
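A back-of-the-envelope view of what that shift means for data volume: if one day of synthesis-based sequencing matches a former year of output, the annual increase equals the number of days the instrument runs. The run-day values below are assumptions for illustration, not figures from the talk.

```python
def annual_output_mbp(mbp_per_day: float, run_days: int) -> float:
    """Yearly output in Mbp for a given daily yield and number of run days."""
    return mbp_per_day * run_days


if __name__ == "__main__":
    daily_low, daily_high = 60, 70   # Mbp per day, from the slide
    for run_days in (100, 250, 365): # hypothetical instrument utilization
        low = annual_output_mbp(daily_low, run_days)
        high = annual_output_mbp(daily_high, run_days)
        print(f"{run_days} run days: {low:.0f}-{high:.0f} Mbp per year "
              f"({run_days}x the former annual output)")
```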