
Biological Information and Biological Databases

Biological Information and Biological Databases. Meena K Sakharkar Bioinformatics Centre National University of Singapore. Biological Information. Nature of Life Science Information. Descriptive Classification and Nomenclatural Observational and Phenomenological Experimental


Presentation Transcript


  1. Biological Information and Biological Databases Meena K Sakharkar Bioinformatics Centre National University of Singapore

  2. Biological Information

  3. Nature of Life Science Information • Descriptive • Classification and Nomenclatural • Observational and Phenomenological • Experimental • Deduced/Computed • Simulated? • Theoretical?

  4. Descriptive

  5. Classify and Give Names • Classification and Nomenclature • Linnaeus - binomial nomenclature • Group into kingdoms, phyla, classes, orders, families, genera, species, subspecies, strains, etc • Associate descriptions to these classification schema, and classify according to description etc

  6. Observational/Phenomenological • Like descriptive, yet more active • Observe a lot of biological phenomena • Charles Darwin • Gregor Mendel to McClintock • Start to do some experiments

  7. Experimental • From dissections to complex genetic engineering experiments

  8. BioInformatics • Deduced/Computed • Simulated? • Theoretical?

  9. What is BioInformatics? • Many related terms and buzzwords • A multiplicity of names: • bioinformatics • biocomputing • biological computing • computational biology • computational genomics • biological data mining

  10. Overview of the challenges of Molecular Biology Computing • The huge dataset problem • automated DNA sequencers • the Human Genome Project • bulk sequencing of cDNAs (ESTs)

  11. Human Genome Project • What is the Human Genome Project? • A 15-year effort, formally begun in October 1990 and coordinated by the U.S. Department of Energy and the National Institutes of Health, to: • identify all the estimated 80,000 genes in human DNA, • determine the sequences of the 3 billion chemical bases that make up human DNA, • store this information in databases, • develop tools for data analysis, and • address the ethical, legal, and social issues (ELSI) that may arise from the project.

  12. Who is head of the U.S. Human Genome Project? • The DOE Human Genome Program is directed by Ari Patrinos, and Francis Collins directs the NIH Human Genome Program. • Ari Patrinos also heads the Department of Energy Office of Biological and Environmental Research.

  13. What are the comparative genome sizes of humans and other organisms being studied? • If compiled in books, the human genome sequence data would fill an estimated 200 volumes the size of a Manhattan telephone book (at 1000 pages each), and reading it would require 26 years working around the clock.
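The book analogy can be sanity-checked with a few lines of arithmetic. This is a rough sketch only: it uses the 3 billion bases quoted on the Human Genome Project slide, assumes one printed character per base, and assumes 365-day years; the variable names are illustrative.

```python
# Figures quoted on the slides.
genome_bases = 3 * 10**9      # bases in human DNA
volumes = 200                 # telephone-book-sized volumes
pages_per_volume = 1000
reading_years = 26            # reading around the clock

# Assumption: one printed character per base.
bases_per_page = genome_bases / (volumes * pages_per_volume)

# Assumption: 365-day years, reading 24 hours a day.
seconds = reading_years * 365 * 24 * 3600
bases_per_second = genome_bases / seconds

print(f"Bases per page: {bases_per_page:,.0f}")
print(f"Implied reading rate: {bases_per_second:.1f} bases per second")
```

Under these assumptions each page would hold 15,000 characters, and the reader would need to get through a few bases every second.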

  14. Informatics: Data Collection and Interpretation HUMAN GENETIC DIVERSITY • The Ultimate Human Genetic Database • Any two individuals differ in about 3 x 10^6 bases (0.1%). • The population is now about 5 x 10^9. • A catalog of all sequence differences would require 15 x 10^15 entries. • This catalog may be needed to find the rarest or most complex disease genes.
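The catalog size on this slide is the product of the per-pair difference count and the population. A minimal sketch of that arithmetic follows, using only the figures quoted on the slides; the variable names are illustrative, and the result is the slide's rough estimate rather than a count of distinct variants.

```python
# Figures quoted on the slides.
differences_per_pair = 3 * 10**6   # bases differing between any two individuals
population = 5 * 10**9             # population figure quoted on the slide
genome_bases = 3 * 10**9           # bases in human DNA (Human Genome Project slide)

# 3 x 10^6 differing bases out of 3 x 10^9 gives the 0.1% on the slide.
fraction_differing = differences_per_pair / genome_bases

# The slide's estimate: one entry per difference per individual.
catalog_entries = differences_per_pair * population

print(f"Fraction of bases differing: {fraction_differing:.1%}")
print(f"Catalog entries: {catalog_entries:.1e}")
```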

  15. Databases

  16. Basic Terminology • What are a nucleotide/protein sequence database and a databank? • Database: a collection of nucleotide/protein sequences and their associated annotations. • Databank: a group that collects, compiles, maintains and distributes the database.

  17. Fundamental Dogma

  18. Work from the Code of Life

  19. Deduced and Computed Information in the Era of Computational Biology

  20. Databases • What are the different kinds of databases and their formats? • Nucleic acid sequence: EMBL at EBI, GenBank at NCBI, DDBJ in Japan. • Protein sequence: SWISS-PROT, NBRF (PIR).

  21. Database • Protein structure databases • PDB: structural data for proteins/nucleic acids whose 3-D structure has been solved by X-ray crystallography/NMR. • NRL_3D: a sequence-structure database built from PDB; it can be used in conjunction with PIR (PDB with PIR).

  22. GenBank Entry

  23. EMBL Entry

  24. SwissProt Entry

  25. Other databases • Genome Databases • GDB: Genome Database • OMIM • Pattern Databases • Prosite • TFD

  26. Usage of databases • Annotation Searches - keywords, authors, features. • What is the protein sequence for human insulin? • What does the 3D structure of calmodulin look like? • What is the genetic location of the cystic fibrosis gene? • List all introns in rat. • Homology Searches • Is there any protein sequence that is similar to mine? • Is this gene known in any other species? • Has someone already cloned this sequence?

  27. Usage of databases • Pattern searches • Does my sequence contain any known motif (that can give me a clue about the function)? • Which known sequences contain this motif? • Is any part of my sequence recognised by a transcription factor? • List all known start, splice and stop signals in my genomic sequence. • Prediction - use the database as a knowledge base • What may the structure of my protein be? • Secondary structure prediction • Modeling by homology • What is the gene structure of my genomic sequence? • Which parts of my protein have a high antigenicity?
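The first pattern-search question ("does my sequence contain any known motif?") can be sketched as a small script. The example below assumes a simplified subset of the PROSITE pattern syntax (single residues, `x`, `[...]`, `{...}` and repeat counts, joined by `-`), translates it into a Python regular expression, and scans a sequence. The helper names `prosite_to_regex` and `find_motif` and the toy sequence are hypothetical; the pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            part = "."                          # any residue
        elif residue.startswith("{"):
            part = "[^" + residue[1:-1] + "]"   # any residue except these
        else:
            part = residue                      # literal or [allowed set]
        if low:
            part += "{" + low + ("," + high if high else "") + "}"
        parts.append(part)
    return "".join(parts)

def find_motif(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]

# Toy sequence, made up for illustration.
hits = find_motif("MKNLSAFRNPTA", "N-{P}-[ST]-{P}")
print(hits)
```

A real pattern search would run every pattern in a motif database against the query; the lookahead is used here so that overlapping hits are not missed.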

  28. Usage of Databases • Comparisons: • Gene Families • Phylogenetic Trees

  29. GenBank Growth Chart Bases Year

  30. Evolutionary basis of Alignment • Alignment enables the researcher to determine whether two sequences display sufficient similarity to justify the inference of homology. • Similarity is an observable quantity that may be expressed as, say, % identity or some other measure. • Homology is a conclusion drawn from these data that the two genes share a common evolutionary history.
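To make "similarity is an observable quantity" concrete, here is a minimal sketch that computes % identity for two sequences that have already been aligned. The function name `percent_identity` is hypothetical, and the choice of denominator (columns without a gap) is an assumption; other conventions divide by the alignment length or by the shorter sequence.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue              # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0                # nothing to compare
    return 100.0 * matches / compared

# Toy alignment, made up for illustration: 8 matches in 9 gap-free columns.
score = percent_identity("MSTKYSASAE", "MSTRYS-SAE")
print(f"{score:.1f}% identity")
```

The number itself says nothing about homology; deciding whether it is high enough to infer a common evolutionary history is the separate step the slide describes.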

  31. Sequence Formats

  32. Fasta Format

          >SANJAY REFORMAT of: SANJAY.seq check: 8826 from: 1 to: 573 March 12, 1998
          MASSSVPPMITEEEARFEAEVSAVESWWRTDRFRLTRRPYSARDVVSLRGTLHHSYASDQ
          MAKKLWRTLKSHQSAGTASRTFGALDPVQVTMMAKHLDTIYVSGWQCSSTHTATNEPGPD
          LADYPYNTVPNKVEHLFFAQLYHDRKQHEARVSMTREQRAKTPYVDYLRPIIADGDTGFG
          GATATVKLCKLFVERGAAGVHIEDQSSVTKKCGHMAGKVLVAVSEHINRLVAARLQFDVM
          GVETVLVARTDAVAATLIQSNVDLRDHQFILGATNPDFKRRSLAAVLSAAMAAGKTGAVL
          QAIEDDWLSRAGLMTFSDAVINGINRQLPEYEKQRRLNEWAAATEYSKCVSNEQGREIAE
          RLGAGEIFWDWDIARTREGFYRFRGSVEAAVVRGRAFAPHADLIWMETSSPDLVECGKFA
          QGMKASHPEIMLAYNLSPSFNWDAAGMTDEEMRDFIPRIAKMGFCWQFITLGGFHADALV
          TDTFAREFAKQGMLAYVERIQREERNNGVDTLAHQKWSGANYYDRYLKTVQGGISSTAAM
          GKGVTEEQFKEESRTGTRGLDRGGITVNAKSRL
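A FASTA record such as the one on this slide is a header line beginning with `>` followed by one or more lines of sequence. The sketch below reads records of that shape into a dictionary; the function name `parse_fasta` is hypothetical, and the example record is a shortened version of the slide's sequence with a made-up header.

```python
from typing import Dict, Iterable

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                              # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks) # close the previous record
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)                   # sequence line of current record
    if header is not None:
        records[header] = "".join(chunks)         # close the last record
    return records

# Shortened example: the first 40 residues of the slide's sequence.
example = """>SANJAY example record
MASSSVPPMITEEEARFEAE
VSAVESWWRTDRFRLTRRPY
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Keying on the full header line is a simplification; real files may repeat headers, in which case a list of (header, sequence) pairs would be safer.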

  33. GCG Format

          ckl.seq Length: 473 September 15, 1999 12:25 Type: P Check: 8103 ..

            1 MSTKYSASAE SASSYRRTFG SGLGSSIFAG HGSSGSSGSS RLTSRVYEVT
           51 KSSASPHFSS HRASGSFGGG SVVRSYAGLG EKLDFNLADA INQDFLNTRT
          101 NEKAELQHLN DRFASYIEKV RFLEQQNSAL TVEIERLRGR EPTRIAELYE
          151 EEMRELRGQV EALTNQRSRV EIERDNLVDD LQKLKLRLQE EIHQKEEAEN
          201 NLSAFRADVD AATLARLDLE RRIEGLHEEI AFLRKIHEEE IRELQNQMQE
          251 SQVQIQMDMS KPDLTAALRD IRLQYEAIAA KNISEAEDWY KSKVSDLNQA
          301 VNKNNEALRE AKQETMQFRH QLQSYTCEID SLKGTNESLR RQMSEDGGAA
          351 GREAGGYQDT IARLEAEIAK MKDEMARHLR EYQDLLNVKM ALDVEIATYR
          401 KLLEGEESRI SLPVQSFSSL SFRESSPEQH HHQQQQPQRS SEVHSKKTVL
          451 IKTIETRDGE VVSESTQHQQ DVM
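In the GCG record on this slide, the header ends with a `..` divider, and the sequence that follows is broken into blocks of ten residues with a position number at the start of each line. A minimal sketch for recovering the bare sequence is shown below; the function name `parse_gcg` and the toy record are hypothetical, and the code assumes that `..` does not appear earlier in the header.

```python
import re

def parse_gcg(text: str) -> str:
    """Return the sequence that follows the '..' divider, without numbers or spaces."""
    _header, divider, body = text.partition("..")
    if not divider:
        raise ValueError("no '..' divider found; not a GCG-style record")
    # Position numbers and block spacing are layout only, so strip them.
    return re.sub(r"[\s\d]", "", body)

# Toy record, made up for illustration from the first 23 residues on the slide.
toy_record = """toy.seq  Length: 23  Type: P  ..

       1  MSTKYSASAE SASSYRRTFG SGL
"""
sequence = parse_gcg(toy_record)
print(sequence, len(sequence))
```

Comparing the recovered length with the `Length:` field in the header is a cheap consistency check that a fuller parser could add.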

  34. Taxonomy Database
