Biological Databases

Biology outside the lab

Why do we need Bioinformatics?
Over the past few decades, major advances in molecular biology, coupled with advances in genomic technologies, have led to explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, created an absolute requirement for computerized databases to store, organize, and index the data, and for specialized tools to view and analyze it.

Information flux from data to decision
Biology, chemistry and pharmaceutical research generate a huge amount of data. The rate of information analysis is lower than the rate of data production. Human Genome Project: 22.1 billion bases sequenced, but … what do we really know about it?

Bioinformatics
• Building and managing biological databases (nucleotides, proteins, structures, small molecules, pathways, literature, …)
• Data mining and data analysis (computational biology)
• Ab initio protein modelling, homology modelling, simulations (molecular modelling)

Literature databases
http://www.ncbi.nlm.nih.gov/

Nucleotide databases

Protein databases
• UniProt databases:
  - Swiss-Prot: provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases
  - TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot
• NCBI protein database: a meta-database containing sequences from UniProt entries, PDB-derived sequences and translations of predicted ORFs in GenBank

Structural Database
Protein structures obtained by crystallography or NMR are stored in the PDB (Protein Data Bank).

Microarray Databases
• GEO (Gene Expression Omnibus)
• SMD (Stanford Microarray Database)
Gene expression databases provide raw microarray expression data. Data from different experiments can be merged to obtain previously unidentified results.

EST Databases
• EST: Expressed Sequence Tag
• 5' EST: generated from the 5' end of a transcript; these regions tend to be conserved across species and do not change much within a gene family
• 3' EST: because these ESTs are generated from the 3' end of a transcript, they are likely to fall within non-coding, untranslated regions (UTRs), and therefore tend to exhibit less cross-species conservation than coding sequences do
• Sequence Tagged Site (STS): helps to locate a gene in the genome; 3' ESTs are a good source of STSs
Available DBs: GenBank, dbEST, UniGene

Tools
• ORF finder
• BLAST
• Multiple alignment
• Conserved domain identification
• Secondary structure and folding prediction

Example 1
A clone containing a recombinant plasmid shows an interesting phenotype.
Sequencing -> rough sequence -> ORF identification -> in-frame sequence -> BLAST:
• Phylogenetically similar sequences
• Conserved domains

CDS
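
The ORF identification step of Example 1 can be illustrated with a small Python sketch. This is not the algorithm used by any particular ORF finder tool; it is a minimal illustration that assumes the standard genetic code (ATG start; TAA, TAG, TGA stops) and scans the three reading frames of each strand. The names `find_orfs`, `reverse_complement` and the `min_codons` threshold are hypothetical and chosen only for this example.

```python
# Minimal ORF scan over six reading frames (illustrative sketch only).
# Assumption: standard genetic code, ATG as the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return a list of (strand, frame, start, end, orf_sequence).

    start and end are 0-based, end-exclusive coordinates on the scanned
    strand; the ORF includes its stop codon. min_codons counts the
    codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    rough_sequence = "CCATGAAATTTGGGTAACC"  # toy sequence, not real data
    for orf in find_orfs(rough_sequence, min_codons=3):
        print(orf)
```

The in-frame sequence returned by such a scan is what would then be submitted to BLAST. ORFs that are still open at the end of the sequence are ignored here, which a real tool may handle differently.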

Example 2

Example 2: tune the method
a) Increase the window size used to evaluate the score
  - increases local information by integrating "environmental" data
  - 2-residue window -> 2 frames
  - 3-residue window -> 3 frames
  - …
b) Use degenerate matching methods (based on size, polarity, H-bond behavior, …)
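
The two tuning ideas, a wider scoring window and degenerate matching, can be sketched in Python. The slides do not show the scoring method itself, so everything below is an assumption made for illustration: the residue grouping in `RESIDUE_CLASS`, the 2/1/0 scoring scheme, and the function names are all hypothetical.

```python
# Sketch of window-based scoring with degenerate (class-based) matching.
# Assumption: a hypothetical reduced alphabet grouping residues by rough
# physicochemical class; real methods may group residues differently.
_GROUPS = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}
RESIDUE_CLASS = {res: label for label, group in _GROUPS.items() for res in group}


def residue_score(a, b):
    """2 for identical residues, 1 for same class, 0 otherwise."""
    if a == b:
        return 2
    class_a = RESIDUE_CLASS.get(a)
    if class_a is not None and class_a == RESIDUE_CLASS.get(b):
        return 1
    return 0


def window_score(seq_a, seq_b, i, j, window=3):
    """Sum residue scores over a window starting at seq_a[i] and seq_b[j]."""
    return sum(residue_score(seq_a[i + k], seq_b[j + k]) for k in range(window))


def best_window_match(seq_a, seq_b, window=3):
    """Return (score, i, j) for the best-scoring ungapped window pair."""
    best = (-1, -1, -1)
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = window_score(seq_a, seq_b, i, j, window)
            if score > best[0]:
                best = (score, i, j)
    return best


if __name__ == "__main__":
    # Toy peptides, not real data.
    print(best_window_match("TTALKTT", "DDAIRDD", window=3))
```

A larger `window` lets each score reflect more of the local sequence environment, while the class-based score lets conservative substitutions such as L/I or K/R still count as partial matches.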