Managing Gene Annotation Information the search is over … one problem solved … another begins

Managing Gene Annotation Information: the search is over … one problem solved … another begins. Observations from a foot soldier in the bio-information (r)evolution. Bill Farmerie -- ICBR Genomics Group

Interdisciplinary Center for Biotechnology Research • Established at the University of Florida in 1987 by the Florida Legislature • centralized organization of biomedical core facilities • supporting biotechnology-based research • How did information management become my problem?

1998 GSAC Miami Beach

Why should I care about this problem? • Because my paycheck depends on it. • Avoid fatal failure in the funding loop. Funding loop (diagram labels): PI has $ for large gene-based project; other PIs think this looks like a good idea; PI applies for new funding; Core Lab generates data; downstream data management & analysis; PI writes papers, gives talks

From Sequence to Function • The genomic sequence identifies the 'parts' • The next trick is understanding gene function • Post-genomic era = functional genomics • Critical concept: genes of similar sequence may have similar functions • Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function

BLAST • Most common starting point for gene identification • Similarity search of sequence repository (GenBank) • Output • Calculated scores (bit score and e-value) • Text string (definition line), ID Reference Tag • Sequence alignment • Advantages • Fast algorithm, very good at finding close homologs • Disadvantages • Not good at finding distant relatives • Cluster and Grid-enabled versions available
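The two calculated scores on the slide above are related. As a rough sketch, the standard Karlin-Altschul relation E = m · n · 2^(-S') gives the expected number of chance hits from a bit score S', a query length m, and a database length n. The function name and the example lengths below are illustrative assumptions, and BLAST itself applies effective-length corrections, so reported e-values will differ somewhat from this estimate.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate e-value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used here; BLAST uses effective lengths,
    so this is only an estimate of what the program would report.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 1,000-residue query against a 1,000,000-residue database.
    for score in (30.0, 50.0, 100.0):
        print(score, expected_chance_hits(score, 1_000, 1_000_000))
```

Each additional bit halves the expected number of chance hits, which is why a modest gain in bit score can move an e-value by orders of magnitude.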

HMMER • HMMER developed by Sean Eddy • Uses Hidden Markov Models • Searches unknown protein query sequence against a database of protein family models • Statistical models constructed from alignment of conserved protein regions (Pfam) • Advantages • Superior to BLAST for discovering more distant homology relations • Disadvantages • More computationally intensive than BLAST • GRID enabled

OK! Great! Sequencing done. Homology searches complete.But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?

Search for summarizing information that restores sanity CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG

BlastQuest A small idea with a big mission

BlastQuest Requirements • Accessible to research groups at remote locations • Privacy-constrained sharing of results among the scientists • Selective browsing of BLAST homology search results • Selective data filtering on statistical criteria • e-value or bit score • Selective data grouping on criteria such as GI number, or a defined number of top-scoring results • Ad hoc search capability on user-determined criteria: • text terms • Boolean logic • From a computational point of view BlastQuest is embarrassingly simple. However, it solved our problem for information storage, selective retrieval, and distribution.
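The filtering, grouping, and ad hoc search requirements can be illustrated in a few lines. This is a minimal in-memory sketch, not BlastQuest's actual implementation (the slides do not show its schema or code). The `Hit` record, the function names, and the sample hits with their GI numbers are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical record layout)."""
    query: str
    gi: int
    definition: str
    evalue: float
    bit_score: float


def filter_hits(hits, max_evalue=1e-5, min_bit_score=0.0):
    """Keep hits that pass both statistical cutoffs."""
    return [h for h in hits if h.evalue <= max_evalue and h.bit_score >= min_bit_score]


def top_n_per_query(hits, n=1):
    """Group hits by query and keep the n highest bit scores in each group."""
    grouped = defaultdict(list)
    for h in hits:
        grouped[h.query].append(h)
    return {q: sorted(hs, key=lambda h: h.bit_score, reverse=True)[:n]
            for q, hs in grouped.items()}


def text_search(hits, all_of=(), none_of=()):
    """Boolean text search on the definition line: AND over all_of, NOT over none_of."""
    def matches(h):
        text = h.definition.lower()
        return (all(t.lower() in text for t in all_of)
                and not any(t.lower() in text for t in none_of))
    return [h for h in hits if matches(h)]


# Invented sample data, for illustration only.
HITS = [
    Hit("contig1", 101, "heat shock protein 70 [Homo sapiens]", 1e-50, 200.0),
    Hit("contig1", 102, "hypothetical protein [Mus musculus]", 1e-3, 40.0),
    Hit("contig2", 103, "heat shock protein 90 [Danio rerio]", 1e-20, 95.0),
]

if __name__ == "__main__":
    print([h.gi for h in filter_hits(HITS, max_evalue=1e-10)])
    print({q: [h.gi for h in hs] for q, hs in top_n_per_query(HITS, n=1).items()})
    print([h.gi for h in text_search(HITS, all_of=("heat shock",), none_of=("90",))])
```

In a shared, multi-user setting these operations would more naturally be database queries behind a web front end; the list-based version only shows the logic of each requirement.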

Overview of BlastQuest Architecture

Welcome to BlastQuest

Choose among client projects

Results Selection

Grouped Results

Ad Hoc Text Searching

Internal BLAST Searches

Viewing a Gene Ontology Tree

KEGG Classification • Kyoto Encyclopedia of Genes and Genomes • “Wiring diagrams of life” • KEGG Protein Networks • Metabolic pathways • Regulatory pathways • Molecular complexes • Network-network relations • Network-environment relations

Figure labels: Common to both • Unique to non-Unigene • Unique to Unigene

Bacterial Genome Annotation Workbench Another simple idea driven by necessity

Start

Project Summary

Contig Browser

Contig summary

Physical map linked to annotation

Simple problems. Simple solutions. Why are these simple ideas important?

Human Genome Project • HGP drove innovation in biotechnology • 2 major technological benefits • stimulated development of high-throughput methods • reliance on computational tools for data mining and visualization of biological information

The HGP and the cost of DNA sequencing • “finished” quality DNA sequence • a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000 • also known as phred > 40 • contiguous DNA sequence of phred > 40 usually achieved by multifold sequencing of the same region; typically 7-10X coverage • 1985: $10 per finished base • 2001: $1 per 10 finished bases
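The link between "1 in 10,000" and "phred 40" follows from the usual phred definition, Q = -10 · log10(P), where P is the probability that a base call is wrong. A small sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred quality score for a base-call error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Base-call error probability for a phred quality score: P = 10**(-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    print(phred_from_error(1e-4))   # error rate of 1 in 10,000
    print(error_from_phred(20.0))   # phred 20 corresponds to 1 error in 100
```

Every 10 phred units correspond to a tenfold drop in error probability, so phred 40 is the point at which the error rate reaches 1 in 10,000.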

GenBank • August 22, 2005: Public Collections of DNA and RNA Sequence Reach 100 Gigabases

Trends in the cost efficiency of DNA sequencing§ • §Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: Methods and Goals” Nature Reviews Genetics 5:335

454 Life Sciences Corporation The first commercial, massively parallel, DNA sequencing technology

454 Technology • Cyclic-array sequencing on in vitro amplified DNA molecules • individual molecules must be amplified to give a detectable sequencing signal • Instead of biological cloning, we amplify individual DNA fragments on solid state beads using PCR • Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order • “sequencing by synthesis”

454 Process Overview

The bottom line … • efficiency of DNA sequencing increased 100X • cost per finished base declined 10- to 30-fold … so what happens next? • The “democratization” of large-scale genomic biology • Many projects are now possible that were once fiscally inviable • We must deal with basic local data management and information issues or lose this opportunity

If you thought bioinformatics was important before … • By terminator-based sequencing we @ UF produce 60-70 Mbp per year • By synthesis-based sequencing we produce 60-70 Mbp per day
