Bioinformatics Part 2: The Primary Databases • The nucleic acid and protein databases • Database content: what’s in the databases and how are the records structured? • Searching of and retrieval from the databases
The Primary Nucleic Acid and Protein Databases • Nucleic acids • GenBank, EMBL, DDBJ • Proteins • SWISS-PROT/TrEMBL, PIR and others • Search and retrieval tools • Entrez, Sequence Retrieval System (SRS)
Using the Primary Databases • [Diagram: a query is submitted through Entrez or SRS to the primary databases (GenBank, etc.; SWISS-PROT, etc.), which return a result]
A Nucleic Acid Database: GenBank • Nucleotide sequences of genes and parts of genes: highly annotated • Sequence tagged sites (STSs) • Expressed sequence tags (ESTs) • Genome survey sequences (GSSs) • High throughput genomic sequences (HTGs) • Nucleotide sequences that form part of a patent
Sequence Tagged Sites (STSs) • A short DNA sequence, up to 500 nucleotides in length • Unique in the genome • Location in the genome is known • Can be detected using the polymerase chain reaction (PCR) • Act as “beacons” or “landmarks” for genome mapping studies
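The STS criteria above (short, unique, detectable by PCR) can be illustrated with a toy "in silico PCR" check. This is a minimal sketch, not a real mapping tool: the function names, the toy genome, and the primers are all made up for illustration, and only exact matches on the forward strand are considered.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) exact match."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def sts_hits(genome: str, forward: str, reverse: str, max_len: int = 500):
    """Return (start, end) for every region the primer pair could amplify.

    Hypothetical helper: the reverse primer anneals to the opposite strand,
    so we look for its reverse complement downstream of the forward primer.
    """
    genome = genome.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    hits = []
    for f in find_all(genome, forward):
        for r in find_all(genome, reverse_site):
            end = r + len(reverse_site)
            # The reverse site must lie after the forward primer, and the
            # product must respect the length limit (500 nt on the slide).
            if r >= f + len(forward) and end - f <= max_len:
                hits.append((f, end))
    return hits


def is_unique_sts(genome: str, forward: str, reverse: str, max_len: int = 500) -> bool:
    """An STS candidate should give exactly one product in the genome."""
    return len(sts_hits(genome, forward, reverse, max_len)) == 1


# Toy data, invented for this example only.
toy_genome = "TTTTACGTACGGGGGGGGCCATGATTTT"
print(sts_hits(toy_genome, "ACGTAC", "TCATGG"))
print(is_unique_sts(toy_genome, "ACGTAC", "TCATGG"))
```

A primer pair that yields more than one product, as it would if the toy genome were duplicated, fails the uniqueness test and so would not serve as a landmark.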
Expressed Sequence Tags (ESTs) • Similar to STSs, but obtained from cDNA instead of genomic DNA • Unlike STSs, ESTs may not be unique, as some genes have very similar or even identical sequences • An EST may also be an STS • ESTs provide an indication of gene density • January 2002: almost 4 million ESTs identified for the human genome
Genome Survey Sequences (GSSs) • Random “single pass read” genome survey sequences • Cosmid/BAC/YAC end sequences • Exon-trapped genomic sequences • exon-trapping is a technique that removes introns from a cloned segment of genomic DNA • Alu PCR sequences • Alu PCR amplifies genomic DNA between Alu repeats (short, dispersed elements found in the human genome)
High Throughput Genomic Sequences (HTGs) • An “unfinished” HTG would contain a few contigs (each at least 2 kbp in length), with gaps, possibly unordered, and derived from a single genomic DNA clone • A “finished” HTG would be the assembled sequence with no gaps, and with annotations
What’s in a “Full” GenBank Record? • LOCUS, DEFINITION, ACCESSION, KEYWORDS • SOURCE • REFERENCE - including publication details if available • COMMENT • FEATURES - exons, introns, location of coding sequence (CDS), translation of CDS, etc. • BASE COUNT • ORIGIN - the nucleic acid sequence
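The record layout above lends itself to a simple parser, because each top-level keyword starts in the first column and its continuation lines are indented. The following is a rough sketch under that assumption; the sample record is invented for illustration (it is not a real GenBank entry), and the function names are hypothetical.

```python
# Top-level keywords taken from the slide above.
TOP_LEVEL = (
    "LOCUS", "DEFINITION", "ACCESSION", "KEYWORDS", "SOURCE",
    "REFERENCE", "COMMENT", "FEATURES", "BASE COUNT", "ORIGIN",
)

# Invented record, for illustration only.
SAMPLE_GENBANK = """\
LOCUS       EXAMPLE01                 18 bp    DNA     linear   SYN 01-JAN-2002
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
REFERENCE   1  (bases 1 to 18)
  TITLE     Direct Submission
FEATURES             Location/Qualifiers
     CDS             1..18
                     /translation="MKTAY"
BASE COUNT        8 a      3 c      3 g      4 t
ORIGIN
        1 atgaaaaccg cgtattaa
//
"""


def split_record(text: str) -> dict[str, list[str]]:
    """Group the lines of one flat-file record under their top-level keyword."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        keyword = next((k for k in TOP_LEVEL if line.startswith(k)), None)
        if keyword is not None:
            current = keyword
            sections.setdefault(current, []).append(line[len(keyword):].strip())
        elif current is not None and line.strip():
            # Indented sub-keyword or continuation line: keep it with its parent.
            sections[current].append(line.strip())
    return sections


def sequence_from_origin(sections: dict[str, list[str]]) -> str:
    """Strip position numbers and spaces from the ORIGIN block."""
    raw = "".join(sections.get("ORIGIN", []))
    return "".join(ch for ch in raw if ch.isalpha()).upper()


sections = split_record(SAMPLE_GENBANK)
print(sections["DEFINITION"])
print(sequence_from_origin(sections))
```

For real work an established parser is the safer choice; this sketch only shows how the keywords on the slide map onto the text of a record.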
The Entrez Nucleotides Search and Retrieval System • By default, all major nucleotide databases (GenBank, EMBL, etc.) are searched • Allows limits to be placed on the search (e.g., to a particular field such as keyword, organism, etc.) • Allows subsets of the databases to be searched • Accepts Boolean operators (AND, OR, NOT) • Previous searches can be combined • Results can be saved to a clipboard
The Sequence Retrieval System (SRS) • Performs a similar function to Entrez • SRS can search across several databases simultaneously • Databases to be searched can be defined by the user • Uses a single interface to design the query
A Protein Database: SWISS-PROT/TrEMBL • SWISS-PROT was created in 1986 to provide highly curated, richly annotated records of protein sequences • TrEMBL (containing translations of coding sequences in EMBL) was created in 1996 to provide a supplement to SWISS-PROT • TrEMBL provides less detailed information than SWISS-PROT but allows access to recent sequences • Advanced searching is available through the Sequence Retrieval System (SRS)
What’s in a SWISS-PROT Record? • Identification (ID), accession (AC), dates of entry and modification (DT), description (DE), gene name (GN), organism details (OS, OC, etc.) • Reference details (RN, RA, etc.) • Comments (CC) • Database cross-references (DR) • Keywords (KW) • Feature table (FT) • Amino acid sequence (SQ)
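Because every line of a SWISS-PROT record begins with a two-letter code, the fields listed above can be collected with a few lines of code. This is a minimal sketch: the sample entry is invented and simplified (a real SQ line carries more detail, and X00000 is not a real accession), and the function name is hypothetical.

```python
from collections import defaultdict

# Invented, simplified entry, for illustration only.
SAMPLE_SWISSPROT = """\
ID   EXAMPLE_HUMAN           STANDARD;      PRT;     5 AA.
AC   X00000;
DE   Made-up protein for illustration only.
GN   EXMP.
OS   Homo sapiens (Human).
KW   Hypothetical example.
SQ   SEQUENCE   5 AA;
     MKTAY
//
"""


def parse_swissprot(text: str) -> tuple[dict[str, list[str]], str]:
    """Return ({line code: [data, ...]}, amino acid sequence) for one entry."""
    fields: dict[str, list[str]] = defaultdict(list)
    residues: list[str] = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        # Two-letter code, three spaces, then the data.
        code, data = line[:2], line[5:].rstrip()
        if code == "SQ":
            in_sequence = True
            fields["SQ"].append(data)
        elif in_sequence and code == "  ":
            # Sequence lines have a blank code; residues may be space-separated.
            residues.append(data.replace(" ", ""))
        elif code.strip():
            fields[code].append(data)
    return dict(fields), "".join(residues)


fields, sequence = parse_swissprot(SAMPLE_SWISSPROT)
print(fields["DE"])
print(sequence)
```

Storing each code as a list keeps multi-line fields such as CC, DR and FT together without losing their order.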