Sequence Searching Strategies

A guide to efficient database searching. Jennifer McDowall, EMBL-EBI. Overview: know the data, the toolbox, search guidelines.


Presentation Transcript


  1. Sequence Searching Strategies A guide to efficient database searching Jennifer McDowall EMBL-EBI

  2. Overview • Know the data • The Toolbox • Search Guidelines

  3. Know the data

  4. Know the Data… • Many databases, each getting bigger • Efficient searching requires knowledge of what data is stored in a database • Don’t assume annotation can be transferred because of a good match • Databases can contain errors • Data can change • Deletions, sequence modifications • Daily updates, identifier changes…

  5. Know the Data…Nucleotides EMBL-Bank • Divided into classes and divisions... • Release and updates • Supplementary sets: EMBL-CDS, EMBL-MGA Specialist databases • Immunoglobulins: IMGT/HLA, IMGT/LIGM… • Alternative splicing: ASTD… • Completed genomes: Ensembl, Integr8… • Variation: HGVBase, dbSNP…

  6. Know the Data…Proteins UniProt • Divided into 3 sections • Release and updates Specialist databases • Sequence from structure: PDB, SGT… • Immunoglobulins: IMGT/HLA… • Alternative splicing: ASTD… • Completed proteomes: Ensembl, Integr8… • Protein interactions: IntAct • Patent proteins: EPO, USPTO, JPO, KIPO

  7. Homology vs. Similarity • Homologous sequences share a common origin • Presence of similar features because of common descent • Statistically significant similar sequences are considered ‘homologous’ • Homology is like pregnancy: either one is or one isn’t! (Gribskov – 1999) • Similarity is a measure of the “likeness” of 2 sequences • Uses statistics to determine ‘significance’ of similarity • If significant, considered to be homologous • If not significant → uncertain • Similarity does not necessarily reflect homology

  8. The Toolbox

  9. Sequence Similarity Search Tools

  10. Sequence Similarity Search Tools BLAST FASTA Iterative searches

  11. Sequence Similarity Search Tools BLAST • NCBI-BLAST • Wu-BLAST FASTA • FASTA • SSEARCH • GGSEARCH • GLSEARCH Iterative search • PSI-BLAST • PSI-SEARCH

  12. Tools: NCBI BLAST • BLASTP: protein query against a protein DB • BLASTN: DNA query against a DNA DB • BLASTX: translated DNA query against a protein DB

  13. Tools: NCBI BLAST Nucleotide search Protein search

  14. Tools: Wu-BLAST • BLASTP: protein query against a protein DB • BLASTN: DNA query against a DNA DB • BLASTX: translated DNA query against a protein DB • TBLASTN: protein query against a translated DNA DB • TBLASTX: translated DNA query against a translated DNA DB
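
The query/database pairings on the BLAST slides can be written down as a small lookup. This is a minimal sketch: the helper name `choose_program` and the type labels (`"protein"`, `"dna"`, `"translated_dna"`) are illustrative and not part of any BLAST package.

```python
# Hypothetical lookup: (query type, database type) -> BLAST program name.
PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("dna", "dna"): "BLASTN",
    ("translated_dna", "protein"): "BLASTX",
    ("protein", "translated_dna"): "TBLASTN",
    ("translated_dna", "translated_dna"): "TBLASTX",
}


def choose_program(query, database):
    """Return the program for a query/database pairing, or raise ValueError."""
    try:
        return PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for a {query} query against a {database} database"
        )


if __name__ == "__main__":
    print(choose_program("translated_dna", "protein"))
```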

  15. Tools: Wu-BLAST Nucleotide search Protein search

  16. Tools: FASTA • FASTA: protein or DNA query against a protein DB or DNA DB • FASTX/Y: translated DNA query against a protein DB • SSEARCH: protein or DNA query against a protein DB or DNA DB • GLSEARCH: protein query against a protein DB • GGSEARCH: protein query against a protein DB

  17. Tools: FASTA Nucleotide search Protein search

  18. When to use which search? [Chart: NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA placed by query length against database size]

  19. When to use which search? [Chart: speed of search for NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA across the databases PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]

  20. BLAST v FASTA
      BLAST • Fast • Excels with proteins • Good local alignments + short global alignments • Proteins: BLOSUM62 (-11/-1), alignments good at >85% identity • Good at finding siblings
      FASTA • Slower • Excels with proteins and DNA (better than BLASTN for DNA) • Produces S-W (Smith-Waterman) alignments • Proteins: BLOSUM50 (-10/-2), longer alignments good at >70% identity • Good at finding cousins

  21. GLSEARCH and GGSEARCH GLSEARCH • Global (query) - Local (target DB) alignment • For global query alignments to domains/patterns in target proteins GGSEARCH • Global (query) – Global (target DB) alignment • Specific for searching short sequences against short targets or for gene-to-gene comparisons

  22. What are global and local alignments? • BLAST, FASTA: local (query) - local (subject) • GLSEARCH: global (query) - local (subject) • GGSEARCH: global (query) - global (subject)
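
To make the global/local distinction concrete, here is a minimal sketch of the two scoring schemes: Needleman-Wunsch style (global) and Smith-Waterman style (local). The scores (match +1, mismatch -1, linear gap -2) are arbitrary illustrative values, not the defaults of BLAST, FASTA, GLSEARCH or GGSEARCH, and real tools use substitution matrices and affine gaps.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a against b with a linear gap penalty.

    local=False: both sequences must be aligned end to end (global).
    local=True:  the best-scoring pair of substrings is reported (local).
    """
    # First row of the dynamic-programming table.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
            cur.append(score)
        if local:
            best = max(best, max(cur))
        prev = cur
    return best if local else prev[-1]


if __name__ == "__main__":
    query, subject = "ACGT", "TTACGTTT"
    print("local :", align_score(query, subject, local=True))
    print("global:", align_score(query, subject, local=False))
```

A short query embedded in a longer subject scores well locally but is penalised globally for the unaligned flanks, which is why global-global comparison suits sequences of similar length.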

  23. Tools: PSI (Position Specific Iterated) Search • Single protein sequence → search database → estimate significance → generate alignment → construct profile → iterate (search the database again with the profile)

  24. Tools: PSI Search • PSI-BLAST • Part of NCBI-BLAST package • Automatic iteration service (PSSM = position-specific scoring matrix) • Manually guided service • PSI-SEARCH • Combines SSEARCH (S&W algorithm) + PSI-BLAST (iterative strategy) • Manually guided service
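
The "construct profile" step of an iterated search can be illustrated with a heavily simplified position-specific scoring matrix. This sketch assumes gap-free, equal-length aligned sequences, a uniform background frequency and a flat pseudocount; PSI-BLAST and PSI-SEARCH use sequence weighting and more careful statistics, so this is not their actual procedure.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Column-wise log-odds scores from equal-length, gap-free sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def profile_score(pssm, seq):
    """Sum the position-specific scores of seq against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    hits = ["ACD", "ACE", "ACD"]  # made-up aligned hits from a first search
    pssm = build_pssm(hits)
    print(profile_score(pssm, "ACD"), profile_score(pssm, "WWW"))
```

In an iterated search the profile built from one round's significant hits becomes the query for the next round.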

  25. Let’s look at a FASTA search

  26. FASTA search Step 1: Select a database

  27. Which database to choose? Database size is important • ENA-Annotation >124 million • UniParc (non-redundant) >24 million • Databases grow every day

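  28. How database size affects results • Query sequence: gatctccatggg (BLAST) • [Figure: the same query searched against databases of different sizes; sizes shown include >122M, >700,000, >15M and >1.5M; hit counts shown include 489, 3, 60 and 0; e-values of 100% matches shown include (>1000), 621.0, 0.96 and 789.0]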

  29. How database size affects results • Search smallest database likely to contain your sequence • Run multiple small searches (can run all ENA/UniParc as well)
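
Why database size matters follows from the Karlin-Altschul expectation, E = K · m · n · e^(-λS), where m is the query length, n the database length and S the alignment score. The sketch below uses placeholder values for K and λ (in practice they depend on the scoring system), so only the ratio between results is meaningful.

```python
import math


def expected_hits(score, query_len, db_len, K=0.1, lam=0.3):
    """E-value for a raw score S: E = K * m * n * exp(-lambda * S).

    K and lam are illustrative placeholders, not values from any real
    scoring matrix.
    """
    return K * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Same 12-base perfect match, databases differing 100-fold in size.
    for db_len in (1_000_000, 100_000_000):
        print(db_len, expected_hits(score=24, query_len=12, db_len=db_len))
```

The same alignment score gives an e-value that grows in proportion to the database size, so a perfect match to a short query can look insignificant in a very large database.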

  30. Protein or nucleotide database search? Two issues are worth considering…

  31. Protein or nucleotide database search? Codon degeneracy • Amino acids: Ser v Ser → match • Nucleotides: UCU v AGC → mismatch

  32. Protein or nucleotide database search? Over-simple match/mismatch scoring

      CONSERVATION        AMINO ACIDS              NUCLEOTIDES
      highly conserved    Ser v Ser (identical)    UCU v AGC (mismatch)
      weakly conserved    Ser v Asn (similar)      UCU v AAC (mismatch)
      not conserved       Ser v Leu (mismatch)     UCU v CUC (mismatch)

      Nucleotide scoring makes no distinction between the three cases.
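
The codon examples from these two slides can be checked directly. This small sketch only includes the four codons shown on the slides; the function name `compare_codons` is illustrative.

```python
# Only the codons that appear on the slides (RNA alphabet).
CODON_TO_AA = {"UCU": "Ser", "AGC": "Ser", "AAC": "Asn", "CUC": "Leu"}


def compare_codons(codon_a, codon_b):
    """Return (fraction of identical bases, whether the amino acids match)."""
    identical = sum(x == y for x, y in zip(codon_a, codon_b))
    same_amino_acid = CODON_TO_AA[codon_a] == CODON_TO_AA[codon_b]
    return identical / 3, same_amino_acid


if __name__ == "__main__":
    for other in ("AGC", "AAC", "CUC"):
        print("UCU v", other, compare_codons("UCU", other))
```

All three comparisons share no bases with UCU, yet only the protein-level view can tell that UCU and AGC encode the same residue.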

  33. Protein or nucleotide database search? [Figure: human CKS1B kinase v zebra finch CDC28 kinase 1B, compared as protein and as nucleotide sequences]

  34. Protein or nucleotide search? Identify homologs. [Figure: timeline in billions of years ago, running from the formation of Earth through chemical evolution, self-replicating cells, photosynthesis, complex cells, multicellular life, the Cambrian explosion and the extinction of the dinosaurs to today, marking how far back DNA searches and protein searches reach across groups such as prokaryotes, archaea, cyanobacteria, eukaryotes, plants, land plants, flowers, arthropods, insects, fish, amphibians, reptiles, birds, mammals and the genus Homo] Protein comparisons identify homologues 5-10x further back in evolution

  35. Protein or nucleotide database search? …therefore, searching a protein database could pull out many more homologues than searching a nucleotide database …if you start with a nucleotide sequence, try BLASTX or FASTX to translate your query sequence and search a protein database
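
Translated searches such as BLASTX and FASTX compare the query's conceptual translations against a protein database. A minimal sketch of six-frame translation with the standard genetic code is shown below; it illustrates the idea only and ignores details such as alternative codon tables and frameshift handling.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```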

  36. FASTA search Step 1: Select a database Step 2: Paste sequence

  37. FASTA search Step 1: Select a database Step 2: Paste sequence Step 3: Choose parameters

  38. Choosing parameters

  39. Choosing parameters User manual provides help

  40. Which parameters to choose? Matrix Nucleotide search ‘simpler’ - only match/mismatch Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)

  41. Which parameters to choose? Choice of matrix depends on: • strictness of search • length of query sequence

      QUERY LENGTH   MATRIX     OPEN   EXT
      >300           BLOSUM50   -10    -2
      85-300         BLOSUM62   -7     -1
      50-85          BLOSUM80   -16    -4
      >300           PAM250     -10    -2
      85-300         PAM120     -16    -4
      35-85          MDM40      -12    -2
      <=35           MDM20      -22    -4
      <=10           MDM10      -23    -4
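
The query-length table on this slide can be expressed as a lookup. In this sketch the function name `suggest_matrix` is hypothetical, and the handling of boundary values (85 and 35 appear in two rows) and of BLOSUM queries shorter than 50 is an assumption, since the slide does not resolve them.

```python
# (smallest query length covered, matrix, gap open, gap extension)
# Boundary handling is an assumption: 85 goes to the 85-300 row,
# 35 goes to the <=35 row.
ROWS = {
    "BLOSUM": [
        (301, "BLOSUM50", -10, -2),
        (85, "BLOSUM62", -7, -1),
        (50, "BLOSUM80", -16, -4),
    ],
    "PAM": [
        (301, "PAM250", -10, -2),
        (85, "PAM120", -16, -4),
        (36, "MDM40", -12, -2),
        (11, "MDM20", -22, -4),
        (1, "MDM10", -23, -4),
    ],
}


def suggest_matrix(query_length, family="BLOSUM"):
    """Return (matrix, gap_open, gap_extend), or None if no row applies."""
    for lower_bound, matrix, gap_open, gap_extend in ROWS[family]:
        if query_length >= lower_bound:
            return matrix, gap_open, gap_extend
    return None


if __name__ == "__main__":
    print(suggest_matrix(450))
    print(suggest_matrix(20, family="PAM"))
```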

  42. Matrices - controlling search sensitivity PAM (point accepted mutation) • Based on global alignments of related proteins • 1 substitution in 100 residues = PAM 1 • Other matrices extrapolated from PAM 1 • Model of evolutionary divergence • Bias against rare substitutions (e.g. Cys → Tyr) due to seed proteins

  43. Matrices - controlling search sensitivity BLOSUM (BLOcks SUbstitution Matrix) • Based on protein domain alignments from the BLOCKS database • Observed substitutions in conserved domains • Based on percentage identity, so BLOSUM50 reaches more divergent (deeper) relationships than BLOSUM80

  44. [Figure: effect of applying PAM10 to PAM500 matrices (panels at 10, 100, 200, 300, 400 and 500) to the human LDL receptor sequence]

  45. Which parameters to choose? • Matrix: protein searches • Match/mismatch: nucleotide searches in FASTA and BLAST instead have match/mismatch scores

  46. Match/mismatch scores • “Reward” for match, “penalty” for mismatch • Reward/penalty ratio: • Increase ratio to find more divergent sequences: • Ratio of 0.33 (1/-3) for 99% conserved • Ratio of 0.5 (1/-2) for 95% conserved • Ratio of 1 (1/-1) for 75% conserved
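
The three reward/penalty pairs on this slide can be selected from the expected conservation of the target. This is a sketch only: the function name is hypothetical, and treating each percentage as a lower bound (with 1/-1 as the fallback for anything more divergent) is an assumption.

```python
# (percent conserved, reward, penalty) from the slide, strictest first.
SCHEMES = [(99, 1, -3), (95, 1, -2), (75, 1, -1)]


def match_mismatch(expected_conservation):
    """Pick (reward, penalty) for an expected percent conservation."""
    for threshold, reward, penalty in SCHEMES:
        if expected_conservation >= threshold:
            return reward, penalty
    return SCHEMES[-1][1:]  # most permissive scheme as the fallback


if __name__ == "__main__":
    for pct in (99, 96, 80):
        reward, penalty = match_mismatch(pct)
        print(pct, reward, penalty, round(reward / abs(penalty), 2))
```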

  47. Which parameters to choose? Gap penalties • Nucleotide search: gap open = -2 to -16, gap extension = 0 to -4 • Protein search: gap open = 0 to -23, gap extension = 0 to -8

  48. Which parameters to choose? Choice of gap penalties depends on: • strictness of search (larger penalty → fewer gaps) • the scoring matrix (penalties should match it)

      QUERY LENGTH   MATRIX     OPEN   EXT
      >300           BLOSUM50   -10    -2
      85-300         BLOSUM62   -7     -1
      50-85          BLOSUM80   -16    -4
      >300           PAM250     -10    -2
      85-300         PAM120     -16    -4
      35-85          MDM40      -12    -2
      <=35           MDM20      -22    -4
      <=10           MDM10      -23    -4
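
The effect of open and extension penalties can be seen with a simple affine gap cost. The sketch assumes the convention that a gap of length n costs open + n × extension; tools differ on whether the first gapped residue is charged the extension penalty, so check the manual of the program in use.

```python
def gap_cost(length, gap_open=-10, gap_extend=-2):
    """Affine gap cost, assuming cost = open + length * extension."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


if __name__ == "__main__":
    # One gap of 4 residues versus four separate single-residue gaps.
    print("one long gap   :", gap_cost(4))
    print("four short gaps:", 4 * gap_cost(1))
```

Under this convention a single long gap is much cheaper than several short ones, and raising the open penalty makes the aligner even more reluctant to start new gaps.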

  49. Which parameters to choose? KTUP (word length) • KTUP = ‘word-length’ of search • Large word-length → less sensitive → faster • Nucleotide search: fewer bases than amino acids → higher KTUP
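
The trade-off behind KTUP can be illustrated by counting exact word matches between a query and a target for different word lengths. This is a simplified sketch of the seeding idea only, with made-up sequences; it is not the FASTA implementation.

```python
def shared_words(query, target, ktup):
    """Count target positions whose ktup-length word also occurs in query."""
    words = {query[i:i + ktup] for i in range(len(query) - ktup + 1)}
    return sum(
        1
        for j in range(len(target) - ktup + 1)
        if target[j:j + ktup] in words
    )


if __name__ == "__main__":
    query, target = "ACGTACGT", "ACGAACGT"  # one mismatch in the middle
    for ktup in (2, 4, 6):
        print(ktup, shared_words(query, target, ktup))
```

Longer words give fewer seeds to extend, which speeds up the search but can miss a related sequence entirely when mismatches interrupt every long word.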

  50. Which parameters to choose? Do I mask my sequence? • Low complexity regions should be masked to avoid spurious results • CA repeats • poly-A tails • proline-rich regions **Be careful you don’t mask what you are looking for
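
Masking can be sketched with a regular expression that replaces CA repeats and poly-A runs with N. The function name and the `min_run` threshold are illustrative assumptions; dedicated filters such as DUST (nucleotide) and SEG (protein) use statistical complexity measures rather than fixed patterns.

```python
import re


def mask_low_complexity(dna, min_run=8):
    """Replace CA repeats and poly-A runs of at least min_run bases with N."""
    dna = dna.upper()
    pattern = re.compile(r"(?:CA){%d,}|A{%d,}" % (min_run // 2, min_run))
    return pattern.sub(lambda m: "N" * len(m.group()), dna)


if __name__ == "__main__":
    seq = "GATT" + "CA" * 5 + "GGT" + "A" * 10
    print(mask_low_complexity(seq))
```

Raising `min_run` masks less, which is one way to avoid hiding a repeat that is itself the feature being searched for.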
