BTN323: INTRODUCTION TO BIOLOGICAL DATABASES
Lecturer: Junaid Gamieldien, PhD junaid@sanbi.ac.za
Day 2: Specialized Databases
http://www.sanbi.ac.za/training-2/undergraduate-training/
WHAT YOU NEED TO LEARN:
• What are protein pattern/fingerprint/motif databases and why are they important?
• What are the benefits of using ontologies in database design?
• How do model organism databases support human health research?
PATTERN DATABASES
• Sometimes alignment-based methods find no hits that give us clues about a novel gene’s or protein’s function
• Then we turn to finding MOTIFS: common conserved sequence elements in protein families
• In many cases a motif consists of distinct subparts that are highly conserved across the sequences, while the regions between these subparts have little in common
• If we have a database of these patterns, we can assign a potential function to a novel protein by finding one or more known motifs…
• Similar sequence => similar function
• Also true for subsections of a protein
• Motifs or signature sequences, e.g. DNA-binding motifs
• EVOLUTIONARY CONSTRAINT!
[Figure: protein diagram comparing Sequence A and Sequence B]
INTERPRO: INTEGRATED PATTERN DATABASE
• Integrated resource for protein families, domains, regions and sites
• Combines several databases that apply different methodologies to well-characterised proteins to derive protein signatures
• Capitalises on their individual strengths => powerful integrated database and diagnostic tool (InterProScan)
MEMBER DATABASES
• ProDom: provides sequence clusters
• PROSITE patterns: regular expressions
• PRINTS: provides protein ‘fingerprints’
• PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: provide hidden Markov models (HMMs)
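To make "PROSITE patterns are regular expressions" concrete, here is a minimal sketch that translates the common parts of PROSITE pattern syntax into a Python regular expression and scans a sequence with it. The function names `prosite_to_regex` and `find_motif` are made up for this example, the test sequence is invented, and less common syntax (such as an anchor inside square brackets) is not handled. The pattern used, N-{P}-[ST]-{P}, is the well-known PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax only: '-' separators, 'x' wildcards,
    [..] allowed residues, {..} forbidden residues, (n) / (n,m) repeats,
    and '<' / '>' terminus anchors at the ends of the pattern.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # The lookahead lets overlapping occurrences be reported as separate hits.
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Invented sequence: only the first N is followed by a non-proline residue.
print(find_motif("N-{P}-[ST]-{P}", "MKNASAANPSQ"))
```

A hit of this kind only suggests a potential function; short patterns match by chance quite often, which is one reason the other member databases use fingerprints and HMMs instead.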
INTERPRO PROTEIN ‘SITES’
• Conserved site: any short sequence pattern that may contain one or more unique residues
• Active site: one or more signatures cover all the active-site residues
• Binding site: binds chemical compounds
• Post-translational modification (PTM): modifies the primary protein structure, e.g. glycosylation, phosphorylation, etc.
INTERPRO SEQUENCE ANALYSIS: INTERPROSCAN
• Searching against different functional site databases has become vital for predicting protein function (e.g. where BLAST fails)
• Different databases have different strengths and weaknesses, which stem from their underlying analysis methods
• Ideally, all of the secondary databases should be searched to ensure the best results
• This is exactly what InterProScan does (part of today’s practical topic)
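The "search every database and pool the results" idea can be sketched in a few lines. This is only a toy illustration and not how InterProScan is implemented: the two "databases", their signature names and their patterns are all invented, and plain regular expressions stand in for the HMMs, fingerprints and profiles that real member databases use.

```python
import re

# Toy stand-ins for member databases: signature name -> regular expression.
# Every database name, signature name and pattern here is hypothetical.
TOY_DATABASES = {
    "toy_patterns": {"SIG_A": r"N[^P][ST][^P]"},
    "toy_fingerprints": {"SIG_B": r"C.{2}C", "SIG_C": r"HH"},
}


def scan_all(sequence, databases):
    """Search the sequence against every database and pool the hits.

    Each hit is (database, signature, start, end) with 1-based,
    inclusive coordinates; hits are returned sorted by position.
    """
    hits = []
    for db_name, signatures in databases.items():
        for sig_name, regex in signatures.items():
            for m in re.finditer(regex, sequence):
                hits.append((db_name, sig_name, m.start() + 1, m.end()))
    return sorted(hits, key=lambda hit: (hit[2], hit[3]))


for hit in scan_all("MKNASACAACQ", TOY_DATABASES):
    print(hit)
```

Pooling hits from every source into one position-sorted list means a signature found by only one method still shows up next to the others, which is the benefit the slide describes.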
BIO-ONTOLOGIES
• Community-developed agreements on the terms/concepts that describe a topic, and on the relationships between them
• The Gene Ontology (GO) is the most widely used
• The GO provides a common language to describe a gene product’s biology in terms of:
   • Molecular Function
   • Biological Process
   • Cellular Component (location)
• Several other ontologies exist, e.g. for anatomy, cell types, disease, phenotype, pathway, …
[Figure: GO term hierarchy with an example annotation, in which a term ‘involves’ GENE-X]
ADVANTAGES OF GO (AND MANY OTHER BIO-ONTOLOGIES) IN DB DESIGN
• A common language applicable to any organism
• Represents and organises information in a way that both humans and machines can understand
• GO terms can be used to annotate gene products from any species
• Enables easy comparison of information across species
ADVANTAGES OF GO (AND MANY OTHER BIO-ONTOLOGIES) IN DB DESIGN (2)
• Terms make good entry points for database searches
• Researchers can search for what they really mean (and meaning is more consistent between individuals)
• Transitive links from a query term, via its child terms, to the annotated biological objects ensure that ALL relevant results are returned automatically
• ‘Reverse’ queries can easily be done to return terms when biological objects are used as queries
[Figure: GO term hierarchy in which ‘cytokinesis’ involves GENE-X]
• GENE-X will be returned even if the query is done at the level of a parent term
• Using GENE-X as the query can return ‘cytokinesis’ and even all its parent terms
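The two query directions in the GENE-X example can be sketched with a small parent map. The hierarchy below is a simplified illustration and not the exact structure of the Gene Ontology; the function names are made up for this example, and GENE-X is the placeholder gene from the slides.

```python
# Simplified, illustrative hierarchy (child -> parents); not the exact GO graph.
PARENTS = {
    "cytokinesis": ["cell division"],
    "cell division": ["cellular process"],
    "cellular process": ["biological_process"],
    "biological_process": [],
}

# Direct annotations: gene product -> terms it was annotated with.
ANNOTATIONS = {"GENE-X": {"cytokinesis"}}


def ancestors(term):
    """All parent terms reachable from `term` (excluding the term itself)."""
    found = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


def terms_for(gene):
    """'Reverse' query: direct terms plus every ancestor of those terms."""
    direct = ANNOTATIONS.get(gene, set())
    return direct.union(*(ancestors(term) for term in direct))


def genes_for(term):
    """Forward query: genes annotated to `term` or to any of its child terms."""
    return {gene for gene in ANNOTATIONS if term in terms_for(gene)}


print(genes_for("cell division"))  # query at a parent level still finds GENE-X
print(sorted(terms_for("GENE-X")))  # 'cytokinesis' plus all its parent terms
```

Because the annotation is stored only once, on the most specific term, the transitive lookup is what makes a query at any parent level return GENE-X without extra curation effort.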
MODEL ORGANISM GENETIC DATABASES
• Very useful for collecting results from genetic (and other) experiments that cannot be done on humans
   • Disease models
   • Gene knockouts
   • Drug testing
   • Environmental manipulation
• In terms of genomics, model organism data is invaluable for unravelling:
   • Gene and protein functions
   • Gene to phenotype relationships
   • Gene to disease associations
• The aim of these databases is to integrate all relevant information in one place
   • Easier to mine the database for novel associations
   • Enables linking between databases
RAT AND MOUSE GENOME DATABASES – DATA TYPES
• Genes, proteins and their annotations, including Gene Ontology links and expression information
• Phenotypes, described by terms in the Mammalian Phenotype Ontology
   • From gene knockout models produced by the project and their partners
   • From evidence mined from the literature
• Disease, Pathway and Behaviour ontologies and relevant gene associations are also present in RGD
DESIGNED FOR EASE OF USE
• Web query interfaces are intuitive
• Several traditional ways to query: gene names, symbols, chromosomal location
• Query interfaces for ontologies (Disease, Phenotype, Pathway, Behaviour)
• Ontology annotations can easily be retrieved for any gene or protein
• Both databases have links to human genes, which simplifies in-silico exploration of human diseases and phenotypes driven by mouse and rat evidence
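The value of the links to human genes can be illustrated with a toy lookup that goes from a phenotype term, via mouse genes, to human candidate genes. All gene symbols, phenotype terms and gene links below are hypothetical, and the real databases expose this through their web query interfaces rather than through code like this.

```python
# Hypothetical mouse gene -> phenotype term annotations.
MOUSE_PHENOTYPES = {
    "GeneA": {"abnormal heart morphology"},
    "GeneB": {"abnormal heart morphology", "decreased body weight"},
    "GeneC": {"decreased body weight"},
}

# Hypothetical mouse gene -> human gene links (GeneC has no human link).
MOUSE_TO_HUMAN = {"GeneA": "GENEA", "GeneB": "GENEB"}


def human_candidates(phenotype):
    """Human genes whose linked mouse gene is annotated with the phenotype."""
    return sorted(
        MOUSE_TO_HUMAN[gene]
        for gene, phenotypes in MOUSE_PHENOTYPES.items()
        if phenotype in phenotypes and gene in MOUSE_TO_HUMAN
    )


print(human_candidates("abnormal heart morphology"))
print(human_candidates("decreased body weight"))
```

Mouse genes without a human link simply drop out of the result, so the candidate list is only as complete as the cross-species links in the database.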