Similarity Searches on Sequence Databases

tangia

Presentation Transcript


  1. Similarity Searches on Sequence Databases • Chapter 7, page 215

  2. A story • H. pylori was discovered in the early 1980s • its genome was first sequenced in 1997 • this was published in Nature • the publication also reported all the proteins encoded by the genome • HOW did they do this in such a short time?

  3. HOW? • They compared the genome sequence of H. pylori with those of other bacteria. • Then they predicted the proteins of H. pylori and its metabolism.

  4. What does this similarity mean? • If two protein or gene sequences are similar enough, they are probably homologues (they share a common ancestor). • SO • they often come from related organisms • similar proteins tend to mean: • similar functions • similar structures • that is, similar characteristics

  5. How similar is very similar? • For proteins: • if there is >25% identity between 2 proteins, they are considered similar • identity below about 25% is called the TWILIGHT ZONE: nothing is sure about homology there • For nucleotides, the limit is about 70% identity (above it, the sequences are considered homologous)
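
The rule of thumb on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST itself reports identity: the function names are hypothetical, the input is assumed to be two already aligned sequences with `-` for gaps, and identity is counted over gap-free columns only (BLAST divides by the full alignment length instead).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns without a gap character in either sequence.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def classify(identity: float, is_protein: bool = True) -> str:
    """Apply the slide's cutoffs: 25% for proteins, 70% for nucleotides."""
    cutoff = 25.0 if is_protein else 70.0
    return "likely homologous" if identity > cutoff else "uncertain"


if __name__ == "__main__":
    identity = percent_identity("MKVLAG-T", "MKILGGST")
    print(round(identity, 1), classify(identity))
```

The cutoffs are only a first filter; the next slide lists the additional evidence needed before claiming homology.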

  6. Homology • In addition to % identity, other information is essential before claiming homology between 2 sequences: • Expectation value (E-value): the lower the value, the stronger the evidence for homology • Length of the similar segments • Patterns of amino acid conservation • Number of insertions/deletions
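
For reference, the expectation value mentioned above is usually written with the Karlin-Altschul formula. The symbols are not defined on the slide and are introduced here for illustration: m is the query length, n the database length, S the raw alignment score, and K and λ are statistical parameters that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S},
\qquad
S' = \frac{\lambda S - \ln K}{\ln 2},
\qquad
E = m \, n \, 2^{-S'}
```

Here S' is the bit score. The formula makes the slide's points concrete: E falls exponentially as the score rises, and grows with the size of the database searched.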

  7. BLAST (Basic Local Alignment Search Tool) • 30 years ago, scanning the similarity between a query and hundreds of other sequences took several hours (print, put on the wall, compare one by one manually) • NOW, with fast computers, we compare our query with millions of sequences in a few minutes at most.

  8. BLASTing protein sequences • 2 strategies • Compare: • a protein with a protein database: BLASTP • a protein with a nucleotide database: TBLASTN (the program translates each database nucleotide sequence in all 6 reading frames) • Important BLAST servers: • the NCBI BLAST server (USA) • the Swiss EMBnet BLAST server • if you learn one, you can use the other(s)
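
The six-frame translation that TBLASTN performs on nucleotide sequences can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and uppercase A/C/G/T input; the function names are hypothetical, and codons containing any other character are translated as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Translate the three forward and three reverse-strand reading frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Each nucleotide sequence yields six candidate protein sequences, which is why translated searches are slower than BLASTP or BLASTN.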

  9. Which one should we choose? • Depending on: • Database: choose the server that offers the database you want • Speed: choose a server that is not crowded (in Turkey, no problem during the day until 5 pm, because most users in the US and Japan are offline then) • Different BLAST servers can return different results for the same query because their databases differ

  10. BLAST output contains: • A graphic display • A hit list • The alignments • The parameters

  11. A graphic display • shows which part of each other sequence is similar to yours • This part can be different or absent on some servers. • What the colors say: best, good, moderate, worse, worst • What the length says: a bar as long as the query suggests a full-length homologue; a shorter bar corresponds to a shared domain

  12. A hit list • Accession number (sp: SWISS-PROT) & name • Description: you judge whether the hit is interesting or not • Score: if <50, unreliable • E-value: the lower the E-value, the more significant the similarity; E > 0.001 is the twilight zone; E approaching 0 is the best
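
The score and E-value cutoffs above can be applied to a hit list mechanically. The sketch below is only an illustration: the `Hit` record, its field names, and the example hits are hypothetical and do not correspond to any real BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str      # database identifier of the hit (made-up values below)
    description: str
    bit_score: float
    e_value: float


def triage(hit: Hit, min_score: float = 50.0, max_e: float = 0.001) -> str:
    """Label a hit using the slide's thresholds (score < 50, E > 0.001)."""
    if hit.bit_score < min_score:
        return "unreliable"
    if hit.e_value > max_e:
        return "twilight zone"
    return "significant"


if __name__ == "__main__":
    hits = [
        Hit("DEMO_0001", "hypothetical close homologue", 320.0, 1e-90),
        Hit("DEMO_0002", "hypothetical weak match", 60.0, 0.5),
        Hit("DEMO_0003", "hypothetical short match", 40.0, 1e-5),
    ]
    for hit in hits:
        print(hit.accession, triage(hit))
```

A hit labeled "twilight zone" is not necessarily wrong; it needs the extra evidence listed on slide 6 before homology can be claimed.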

  13. Alignments • Alignments say something about the similarities between sequences • % identity: >25% is good • Length: the length of the alignment; short alignments generally give high E-values • The top line is our query; the bottom line is the hit; (+) marks similar amino acids • XXXXXX: masked low-complexity region • The numbers show the coordinates
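
Reading an alignment block can be sketched in code. The example assumes the usual three-line layout (query on top, a middle line in which a letter marks an identity, `+` a similar amino acid, and a space a mismatch or gap, and the hit at the bottom). The function name and the toy alignment are hypothetical.

```python
def summarize_alignment(query: str, midline: str, subject: str) -> dict:
    """Count identities, positives and gaps in one three-line alignment block."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("the three alignment lines must have equal length")
    length = len(midline)
    identities = sum(1 for m in midline if m.isalpha())   # letters = identical residues
    positives = identities + midline.count("+")           # '+' = similar residues
    gaps = query.count("-") + subject.count("-")
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length if length else 0.0,
    }


if __name__ == "__main__":
    print(summarize_alignment("MKVLA", "MK+L ", "MKILG"))
```

Here the percentage is taken over the full alignment length, gaps included, which is one common convention; other tools may use a different denominator.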

  14. BLASTing DNA sequences • If the sequence is a reading frame, translate it to protein and then BLAST it. • If not, choose one of the options below: • DNA query vs. DNA database: BLASTN • translated DNA query vs. translated DNA database: TBLASTX • translated DNA query vs. protein database: BLASTX • T: translated; it means BLAST translates the sequence into its 6 possible protein sequences (reading frames)
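
The five query/database combinations from slides 8 and 14 can be collected in a small lookup table. This is just a sketch for remembering which program goes with which pairing; the labels `"protein"`, `"dna"` and `"translated_dna"` and the function name are hypothetical.

```python
# (query type, database type) -> BLAST program, as listed on slides 8 and 14.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("protein", "translated_dna"): "tblastn",
    ("dna", "dna"): "blastn",
    ("translated_dna", "translated_dna"): "tblastx",
    ("translated_dna", "protein"): "blastx",
}


def choose_program(query: str, database: str) -> str:
    """Return the BLAST flavor for a query/database pairing."""
    try:
        return PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(f"no BLAST program for {query!r} vs {database!r}") from None


if __name__ == "__main__":
    print(choose_program("translated_dna", "protein"))
```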

  15. Strategies for the right choice of BLAST type for DNA

  16. Controlling BLAST: the right parameters

  17. Control sequence masking • Protein: remove (mask) low-complexity regions • DNA: many repeats; filter “human repeats”
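
To show what masking a low-complexity region means, the sketch below replaces residues with `X` wherever a sliding window has low Shannon entropy. This is a simplified stand-in, not the filter that BLAST servers actually run; the function names, window size and threshold are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        mask_char: str = "X") -> str:
    """Mask every residue that lies in at least one low-entropy window."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    protein = "ACDEFGHIKL" + "S" * 12 + "MNPQRTVWYA"
    print(mask_low_complexity(protein))
```

Masked stretches are ignored when hits are seeded, which avoids many spurious matches between unrelated sequences that merely share a biased composition.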

  18. BLAST output • A less similar sequence can still be important • What to do? Adjust the parameters • Suitable database: to get fewer results, use Swiss-Prot • Use the “magic tags” of an Entrez query • Adjust the E-value threshold

  19. PSI-BLAST (Position-Specific Iterated BLAST) • BLAST finds close relatives. • To find distant relatives, use PSI-BLAST. • It uses a more complex scoring procedure.
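
The "more complex scoring" is position-specific: hits from one round are aligned and turned into a profile that scores each amino acid differently at each position, and the next round searches with that profile. The sketch below builds such a profile from a toy alignment. It is a strong simplification made for illustration: it assumes a uniform background frequency and a flat pseudocount, whereas the real program uses sequence weighting and more careful statistics, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """Build per-column log-odds scores from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm: list, segment: str) -> float:
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    profile = build_pssm(["MKV", "MKI", "MRV"])
    print(round(score(profile, "MKV"), 2), round(score(profile, "AAA"), 2))
```

A segment that matches the conserved positions scores well even if its overall identity to any single sequence is low, which is how distant relatives become detectable.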
