Chapter 2 Sequence databases

The uniform resource locators (URLs) of the databases discussed in this section are listed in Box 2.1.

2.1 General nucleic acid sequence databases
• EMBL (European Molecular Biology Laboratory): http://www.ebi.ac.uk/Information/
• GenBank, NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
• DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/
• Entry identifiers: entry name; accession number; version number

2.2 General protein sequence databases
• SWISS-PROT
• PIR
• PRF/SEQDB
• PDB: the largest data bank of three-dimensional (3-D) biological macromolecular structure data
Databases obtained by translating coding sequences (CDS):
• TrEMBL
• GenPept

• SWISS-PROT is a highly curated database with excellent documentation. It systematically merges variants and fragments into a single entry, but it lags far behind the growth of the DNA data banks.
• PIR contains more sequences, including numerous “really sequenced” oligopeptides, but is less tightly curated.
• The “automatic” data banks such as TrEMBL and GenPept are even larger, but contain little documentation and sometimes include conceptual translations that are not actually found in nature.

2.3 Nonredundant sequence databases
• Analyses performed on duplicated or redundant sequences give biased results.
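The idea of removing redundancy can be illustrated with a minimal Python sketch. It drops exact duplicate sequences only; the function name `remove_redundant` and the sample records are hypothetical, and how a particular nonredundant database defines and merges redundant entries is not covered here.

```python
def remove_redundant(records):
    """Keep the first entry for each distinct sequence.

    `records` is a list of (identifier, sequence) pairs. Sequences are
    compared case-insensitively; only exact duplicates are removed.
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical example records (not taken from any real database).
records = [
    ("seq1", "MKTAYIAKQR"),
    ("seq2", "MKTAYIAKQR"),  # exact duplicate of seq1
    ("seq3", "MKTAYIAKQL"),
]
nonredundant = remove_redundant(records)
print(nonredundant)
```

Without this step, the duplicated sequence would be counted twice in any downstream statistic, such as a residue frequency table.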

2.4 Specialized sequence databases
• The database forms a well-defined set of sequences
• The specialized data bank is often nonredundant
• The data field definitions or keywords are sometimes (better) standardized
• The documentation is often more extensive
Examples: HIV Databases, HPVSD, IMGT, NRL_3D

2.5 Databases with aligned protein sequences
• In numerous databases, proteins are grouped into (sub)families that are already prealigned.
• Blocks: local alignments without gaps
• DOMO: homologous domains
• ProDom: local alignments with gaps
• HSSP: global alignments
• FSSP: global alignments
http://www.ebi.ac.uk/Information/

2.6 Database documentation search
• The user normally scans only the documentation that accompanies the sequences, not the sequences themselves.
• GCG provides the program stringsearch and EMBOSS provides textsearch. The major drawback of this simple type of search is its large consumption of computer time. However, the method has a virtue that can sometimes be useful: any string of characters can be sought. For example, both “HIV-1” and “HIV1” can be found. Typographical errors in the documentation are a worse problem, such as “psuedogene” instead of “pseudogene.”
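The trade-off between scanning the documentation and looking words up in a prebuilt index can be sketched in Python. This is an illustration under stated assumptions, not how stringsearch or textsearch are implemented: the entries in `docs` are invented, and `string_search`, `build_index`, and `index_search` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Hypothetical entries: identifier -> documentation text.
docs = {
    "ENTRY1": "HIV-1 envelope glycoprotein",
    "ENTRY2": "HIV1 reverse transcriptase",
    "ENTRY3": "putative psuedogene, kinase family",  # misspelt on purpose
}


def string_search(docs, pattern):
    """Scan every documentation text; the cost grows with database size."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(eid for eid, text in docs.items() if regex.search(text))


def build_index(docs):
    """Map each lower-cased word to the set of entries that contain it."""
    index = defaultdict(set)
    for eid, text in docs.items():
        for word in re.findall(r"[a-z0-9-]+", text.lower()):
            index[word].add(eid)
    return index


def index_search(index, word):
    """Look up one exact word; fast, but only indexed spellings are found."""
    return sorted(index.get(word.lower(), set()))


index = build_index(docs)
both_spellings = string_search(docs, r"HIV-?1")   # matches "HIV-1" and "HIV1"
exact_word = index_search(index, "hiv-1")         # only the indexed spelling
missed = index_search(index, "pseudogene")        # the misspelt entry is missed
print(both_spellings, exact_word, missed)
```

The string search visits every entry on every query, which is where the computer time goes, while the index answers in one lookup but cannot find a spelling that was never indexed.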

• By index

2.7 ENTREZ database
http://www.ncbi.nlm.nih.gov/

2.8 BLAST (Basic Local Alignment Search Tool)
• The BLAST algorithm breaks the query sequence into short fragments, or “words,” and looks for an identical or close match between those words and words from the database sequences. When such a match or “hit” is encountered, the hit is extended in both directions to generate a local alignment segment. The quality of each alignment is quantified in a score, and the high-scoring segment pairs (HSPs) are reported in a table.
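The word-and-extend idea described above can be sketched as a toy Python program. It is a simplification under stated assumptions: only identical words are used as hits (real BLAST also accepts close matches), scoring is a plain match/mismatch scheme, extension is ungapped and stops once the running score falls a fixed amount below the best score seen, and no statistical significance is computed. The function names, the word length, and all scoring values are illustrative choices, not BLAST defaults.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a w-letter word is identical."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_hit(query, subject, qpos, spos, w, match=1, mismatch=-2, drop=3):
    """Extend a seed without gaps in both directions.

    Returns (query_start, subject_start, length, score) of the best segment.
    """
    def score(a, b):
        return match if a == b else mismatch

    # Extend to the right of the seed.
    best_right = run = right = k = 0
    while qpos + w + k < len(query) and spos + w + k < len(subject):
        run += score(query[qpos + w + k], subject[spos + w + k])
        k += 1
        if run > best_right:
            best_right, right = run, k
        elif best_right - run >= drop:
            break

    # Extend to the left of the seed.
    best_left = run = left = k = 0
    while qpos - 1 - k >= 0 and spos - 1 - k >= 0:
        run += score(query[qpos - 1 - k], subject[spos - 1 - k])
        k += 1
        if run > best_left:
            best_left, left = run, k
        elif best_left - run >= drop:
            break

    total = w * match + best_left + best_right
    return (qpos - left, spos - left, w + left + right, total)


def toy_blast(query, subject, w=4):
    """Collect distinct extended segments, highest score first."""
    hsps = {extend_hit(query, subject, i, j, w)
            for i, j in find_seeds(query, subject, w)}
    return sorted(hsps, key=lambda h: -h[3])


hsps = toy_blast("ACGTTGCA", "TTACGTTGGA")
for q_start, s_start, length, score in hsps:
    print(q_start, s_start, length, score)
```

In this example three overlapping word hits lie on the same diagonal, and all of them extend to the same segment, so a single HSP is reported.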

• BLASTN compares a nucleotide query sequence with a nucleotide sequence database.
• BLASTP compares a protein query sequence with a protein sequence database.
• BLASTX compares a nucleotide query sequence translated in all six reading frames with a protein sequence database.
• TBLASTN compares a protein query sequence with a nucleotide sequence database dynamically translated in all six reading frames.
• TBLASTX compares the six-frame translation of a nucleotide query sequence with the six-frame translations of a nucleotide sequence database.
http://www.ddbj.nig.ac.jp/
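The six-frame translation used by the translating BLAST programs can be illustrated with a short Python sketch. It assumes the standard genetic code and an unambiguous DNA alphabet; the frame labels (`+1` to `-3`) and the function names are hypothetical conventions chosen for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three frames on each strand and return them by label."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

A program such as BLASTX would then compare each of the six translated sequences with the protein database, which is why a translated search costs more than a direct one.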

• The Position-Specific Iterated BLAST (PSI-BLAST) program is designed to detect biologically significant similarities between distantly related sequences. It builds a position-specific scoring matrix, or profile, from the multiple alignment of the highest-scoring hits of an initial BLAST search.
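How a profile can be derived from a multiple alignment is sketched below in Python. This is a simplified illustration, not the PSI-BLAST procedure: it assumes an ungapped alignment, a uniform background frequency, and a flat pseudocount, and it omits sequence weighting. The names `build_pssm` and `score_sequence` and the three aligned sequences are hypothetical.

```python
import math


def build_pssm(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / (
                n_seqs + pseudocount * len(alphabet)
            )
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence as long as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Hypothetical ungapped alignment of three short sequences.
alignment = ["MKV", "MRV", "MKI"]
pssm = build_pssm(alignment)
print(score_sequence(pssm, "MKV"), score_sequence(pssm, "WWW"))
```

Because each column has its own scores, a residue that is conserved at one position is rewarded there and nowhere else, which is what lets a profile pick up distant relatives that a single-sequence search misses.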

http://www.ebi.ac.uk/clustalw/
