
APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

2003.8.27. Sangsoo Kim, Nat’l Genome Informat’n Ct., Korea Res. Inst. of Biosci. & Biotech.


Presentation Transcript


  1. APAN e-Science Workshop
     e-Bio System for Bio-Knowledge Discovery
     2003.8.27
     Sangsoo Kim, Nat’l Genome Informat’n Ct., Korea Res. Inst. of Biosci. & Biotech.

  2. Bio-Databases & Servers
     • Contents
       • Bibliographic (journal abstracts such as Medline)
       • Experimental data (sequences or structures)
       • Results from annotation and analyses
       • Bioinformatic analysis tools
     • Purpose
       • Storing & managing raw data
       • Querying for knowledge discovery
       • Sharing information with others
       • Serving others with online analysis

  3. New Role of Databases
     • New discoveries of biological knowledge are published in scientific journals
     • But journal space is limited and not suited to publishing large amounts of high-throughput data
     • The supplementary information is provided on an accompanying website
     • Readers can download the supplementary information and analyze it from different angles
     • Combining it with other information may yield unexpected results
     • Journal publishers require that supplementary information be deposited in public archives

  4. Example - Nucleotide Sequence Repositories
     • Nucleotide sequences discovered by sequencing experiments are deposited in one of the public archives, and the journal paper lists only the accession numbers (without deposition, you cannot publish a sequence discovery in journals)
     • Public archives are
       • DDBJ, operated by CIB, NIG in Japan
       • EMBL, operated by EMBL-EBI in the UK
       • GenBank, operated by NCBI, NIH in the USA
     • The contents of these archives are exchanged daily and are freely accessible to everybody
     • Now extended to archive DNA chip data as well

  5. Growth of GenBank: A Nucleotide Sequence Repository (chart; annotation: Human Genome Project)

  6. Entrez: Home Page RTFM

  7. Entrez: Display (screenshots: GenBank as HTML; FASTA as HTML)

  8. Example – BLAST Servers
     • Originally developed to compare one's own sequence against those in the repository, to check whether it is novel
     • Extended to detect distantly related sequences, serving as the major sequence annotation tool
     • Servers accept various kinds of queries and return alignment results over the WWW
     • The most widely used bioinformatic tool
     • For the analysis of many sequences, it is better to use a local installation

  9. BLAST (Basic Local Alignment Search Tool)
     http://www.ncbi.nlm.nih.gov/BLAST

     program   query       database
     blastn    dna         dna
     blastp    protein     protein
     blastx    dna (6x)    protein
     tblastn   protein     dna (6x)
     tblastx   dna (6x)    dna (6x)

     (6x): DNA translated in all six reading frames
     RTFM
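The program table above can be read as a lookup from (query type, database type) to a BLAST program, where "dna (6x)" means the DNA is translated in all six reading frames before comparison. The following is a minimal Python sketch of that idea, not the way BLAST itself is implemented. The helper names (`pick_program`, `six_frame_translations`) are assumptions made for illustration; the codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# (query type, database type) -> program, copied from the slide's table.
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",
    ("protein", "protein"): "blastp",
    ("dna (6x)", "protein"): "blastx",
    ("protein", "dna (6x)"): "tblastn",
    ("dna (6x)", "dna (6x)"): "tblastx",
}


def pick_program(query_type: str, database_type: str) -> str:
    """Hypothetical helper: look up the program for a query/database pairing."""
    return BLAST_PROGRAMS[(query_type, database_type)]


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> list:
    """Three forward frames followed by three reverse-complement frames."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]
```

For example, `pick_program("dna (6x)", "protein")` returns `"blastx"`, and `six_frame_translations("ATGGCCTAA")` yields six protein strings, the first of which is `"MA*"` (`*` marks a stop codon).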

  10. BLASTN (Cont'd) Descriptions Alignments

  11. Example – Derived Databases
     • Swiss-Prot & PIR
       • Proteins are predicted from deposited nucleotide sequences, either mRNA or genomic DNA
       • Functions and features of the proteins are annotated manually by experts
     • Protein motifs
       • PROSITE, Pfam, BLOCKS, InterPro
       • Keyword querying and motif detection in the user’s sequence
     • Gene Ontology
       • Hierarchical organization of biological terms
       • Cataloging associated gene products

  12. ExPASy (http://www.expasy.ch) Expert Protein Analysis System

  13. NiceProt View

  14. Gene Ontology
     • Systematic classification of biological terminology
       • Molecular function
       • Biological process
       • Cellular component
     • Controlled vocabulary
     • Associated gene list
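The hierarchical organization of terms and the associated gene list can be illustrated with a small sketch: terms point to their parents, and a query for a general term also returns genes annotated to its more specific descendants. The three GO identifiers are real, but the edges are simplified (intermediate terms are omitted) and the gene names `geneA` and `geneB` are hypothetical.

```python
# Simplified "is_a" edges (child -> parents). Intermediate terms are omitted,
# so this is an illustration, not the real ontology graph.
PARENTS = {
    "GO:0016301": ["GO:0003824"],  # kinase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
    "GO:0003674": [],              # molecular_function (root of its branch)
}

# Hypothetical gene-to-term associations, for illustration only.
ANNOTATIONS = {
    "geneA": {"GO:0016301"},
    "geneB": {"GO:0003824"},
}


def ancestors(term: str) -> set:
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def genes_for(term: str) -> list:
    """Genes annotated to the term itself or to any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )
```

With this toy data, `genes_for("GO:0003824")` returns both genes, while `genes_for("GO:0016301")` returns only `geneA`, which shows why a controlled, hierarchical vocabulary makes queries at different levels of detail possible.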

  15. Data Mining
     • Objective:
       • Discovery of (biological) knowledge by querying information in the databases and comprehending it
     • Problems:
       • Too many databases
       • Different protocols for access
       • Lack of standards
       • Poor quality or propagation of errors
     • Solutions:
       • Data warehousing or federated databases

  16. Catalog of Bio-DBs arranged by Data Domain

  17. Database of Databases
     • Data warehousing
       • Collect all databases by mirroring
       • Store in a unified format
       • Entrez (NCBI) or SRS (EBI)
       • Powerful but heavy maintenance load
     • Federated databases
       • Maintained by participating members
       • Accessed by common protocols
       • Bio-DAS or Web Services via SOAP/XML
       • Next-generation technology, but dependent on both the cooperation of members and Internet bandwidth
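The contrast between the two approaches can be sketched in a few lines: a warehouse copies every member's records into one local store up front, whereas a federation leaves the data with the members and asks each one through a shared interface at query time. Everything here is an assumption for illustration: the `SequenceSource` interface, the `InMemorySource` stand-in for a remote member, and the accession numbers are hypothetical, and no real network protocol is involved.

```python
from typing import Dict, Iterable, Optional, Protocol, Tuple


class SequenceSource(Protocol):
    """The "common protocol": every member exposes the same fetch() call."""

    name: str

    def fetch(self, accession: str) -> Optional[str]: ...

    def dump(self) -> Dict[str, str]: ...


class InMemorySource:
    """Stand-in for a remote member database (hypothetical)."""

    def __init__(self, name: str, records: Dict[str, str]) -> None:
        self.name = name
        self._records = dict(records)

    def fetch(self, accession: str) -> Optional[str]:
        return self._records.get(accession)

    def dump(self) -> Dict[str, str]:
        return dict(self._records)


def build_warehouse(sources: Iterable[SequenceSource]) -> Dict[str, str]:
    """Warehousing: mirror every member into one unified local store."""
    warehouse: Dict[str, str] = {}
    for source in sources:
        warehouse.update(source.dump())
    return warehouse


def federated_fetch(
    sources: Iterable[SequenceSource], accession: str
) -> Optional[Tuple[str, str]]:
    """Federation: ask each member in turn; data stays where it is maintained."""
    for source in sources:
        sequence = source.fetch(accession)
        if sequence is not None:
            return source.name, sequence
    return None
```

The warehouse answers queries locally but must be rebuilt whenever a member changes, which is the maintenance load the slide mentions. The federated lookup always sees current data but depends on every member being reachable and honoring the shared interface.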

  18. www.ngic.re.kr

  19. www.ncbi.nih.gov/LocusLink

  20. New Data Types
     • Textual
       • Nucleotide or amino acid sequences
       • Associated feature annotation
       • Bibliographical texts
     • Numeric
       • Gene expression profiles
       • Results from statistical analysis
     • Graphical
       • Protein-protein interaction network
       • Genetic network
       • Biochemical reaction pathways

  21. Building a Nation from a Land of City States
     Lincoln D. Stein, Cold Spring Harbor Laboratory

  22. Italy in the Middle Ages

  23. Bioinformatics, ca. 2002 Bioinformatics In the XXI Century

  24. Making Easy Things Hard
     "Give me all human sequences submitted to GenBank/EMBL last week."

  25. Lots of ways to do it
     • Download weekly update of GenBank/EMBL from FTP site
     • Use official network-based interfaces to data:
       • NCBI toolkit
       • EBI CORBA & XEMBL servers
     • Use friendly web interfaces at NCBI, EBI

  26. Perl/Java/Python to the Rescue
     • One script to do the web fetch
     • Another to parse the file format
     • A third to move into private database
     • A fourth to repeat this weekly
     • Result:
       • 6,719 scripts that do the same thing
       • None of them work together
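As an illustration of the kind of one-off script the slide describes, here is a minimal sketch of the "parse" and "move into private database" steps. It assumes the data has already been fetched as FASTA text (the web fetch and weekly scheduling steps are left out), and the table layout, function names, and sample records are all hypothetical.

```python
import sqlite3


def parse_fasta(text: str) -> list:
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def load_records(conn: sqlite3.Connection, records: list) -> int:
    """Store parsed records in a private table; returns the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "accession TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    rows = []
    for header, sequence in records:
        # Assumption: the first whitespace-delimited token is the identifier.
        accession, _, description = header.partition(" ")
        rows.append((accession, description, sequence))
    conn.executemany("INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical sample input standing in for a fetched weekly update.
SAMPLE = """>SEQ1 hypothetical human sequence
ACGTAC
GTTA
>SEQ2 another hypothetical record
GGGCCC
"""
```

Calling `load_records(sqlite3.connect(":memory:"), parse_fasta(SAMPLE))` stores two rows. Note how many private conventions even this short script bakes in (header layout, table schema), which is exactly why thousands of such scripts fail to interoperate.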

  27. What’s Wrong with This?
     • My EMBL fetcher is poorly documented so you write your own
     • Your fetcher won’t work with my parser
     • My parser won’t work with your fetcher
     • We’ve now wasted 20 hours rather than 10
     • Multiply this by 6,719

  28. What Else is Wrong?
     • NCBI/EBI tweaks something
     • 6,719 scripts fail at once
     • 6,719 bioinformaticists tear their hair
     • 21,261 biologists curse the bioinformaticists
     • 6,719 bioinformaticists curse their own existence

  29. Unifying Bioinformatics Services
     • MIMBD: Meetings on the Interconnection of Molecular Biology Databases
     • Federated models: Gaea, Kleisli
     • Data warehouses: GUS, MODs, Ensembl, UCSC
     • Ad hoc web services
     • Formal web services

  30. Ad hoc services (diagram labels: BioXXX; Conf file; Your Script)

  31. Formal Web Services (diagram labels: GO Service; SeqFetch Service; BLAST Service; BLAT Service; SeqFetch Service; Microarray Service)

  32. Formal Web Services (diagram labels: GO Service; SeqFetch Service; BLAST Service; BLAT Service; SeqFetch Service; Service Registry; Microarray Service)

  33. Formal Web Services (diagram labels: GO Service; BLAST Service; SeqFetch Service; BLAT Service; SeqFetch Service; BioXXX; Service Registry; Microarray Service; Microarray Service; Your Script)

  34. Technical Infrastructure is Here*
     • Common vocabulary: GO
     • Transport format: XML
     • Data definition language: XSD
     • Wire protocol: SOAP
     • Service definition language: WSDL
     • Service registry: UDDI
     *(almost)
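To make the "transport format: XML" and "wire protocol: SOAP" layers concrete, the sketch below builds and reads a SOAP 1.1 request envelope using only Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:seqfetch`, the `getSequence` operation, and its `accession` parameter are hypothetical and do not describe any real sequence-fetch service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:seqfetch"  # hypothetical service namespace


def build_request(accession: str) -> str:
    """Wrap a hypothetical getSequence call in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}getSequence")
    parameter = ET.SubElement(call, f"{{{SERVICE_NS}}}accession")
    parameter.text = accession
    return ET.tostring(envelope, encoding="unicode")


def read_accession(xml_text: str):
    """Server-side view: pull the accession back out of the envelope."""
    root = ET.fromstring(xml_text)
    node = root.find(
        f"{{{SOAP_ENV}}}Body/{{{SERVICE_NS}}}getSequence/{{{SERVICE_NS}}}accession"
    )
    return node.text if node is not None else None
```

`read_accession(build_request("AC003027"))` round-trips the accession string. In a full web-services stack, the message shape would be declared in XSD, the operation described in WSDL, and the service located through a registry such as UDDI, rather than being hard-coded as it is here.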

  35. Distributed Annotation System
     http://www.biodas.org
     (diagram labels: one Reference Server; three Annotation Servers; AC003027, M10154, AC005122, WI1029, AFM820, AFM1126, WI443)
     Thursday 10:30 AM, Canyon IV

  36. Europe, ca 2000

  37. Bioinformatics, ca 2010?

  38. Collection and Sharing of National Genome Information (diagram labels: Industry; Research Institutes; Universities; KNIH; Human; Microbial; Proteome; NGIC; Plant; Animal; Crop; Ag-Bio)

  39. National Genome Information Network (diagram labels: Data Grid; Application Grid; KISTI; ETRI; KNIH; Human; Microbial; NGIC; Proteome; Plant; Animal; Crop; Ag-Bio)
