Sequence Searching Strategies

A guide to efficient database searching
Jennifer McDowall, EMBL-EBI

Overview • Know the data • The Toolbox • Search Guidelines

Know the data

Know the Data…
• Many databases, each getting bigger
• Efficient searching requires knowledge of what data is stored in a database
• Don’t assume annotation can be transferred because of a good match
• Databases can contain errors
• Data can change
  • Deletions, sequence modifications
  • Daily updates, identifier changes…

Know the Data… Nucleotides
EMBL-Bank
• Divided into classes and divisions...
• Release and updates
• Supplementary sets: EMBL-CDS, EMBL-MGA
Specialist databases
• Immunoglobulins: IMGT/HLA, IMGT/LIGM…
• Alternative splicing: ASTD…
• Completed genomes: Ensembl, Integr8…
• Variation: HGVBase, dbSNP…

Know the Data… Proteins
UniProt
• Divided into 3 sections
• Release and updates
Specialist databases
• Sequence from structure: PDB, SGT…
• Immunoglobulins: IMGT/HLA…
• Alternative splicing: ASTD…
• Completed proteomes: Ensembl, Integr8…
• Protein interactions: IntAct
• Patent proteins: EPO, USPTO, JPO, KIPO

Homology vs. Similarity
Homology
• Homologous sequences share a common origin
• Presence of similar features because of common descent
• Statistically significant similar sequences are considered ‘homologous’
• Homology is like pregnancy: either one is or one isn’t! (Gribskov – 1999)
Similarity
• Similarity is a measure of the “likeness” of 2 sequences
• Uses statistics to determine ‘significance’ of similarity
  • If significant, considered to be homologous
  • If not significant → uncertain
• Similarity does not necessarily reflect homology

The Toolbox

Sequence Similarity Search Tools
BLAST
• NCBI-BLAST
• Wu-BLAST
FASTA
• FASTA
• SSEARCH
• GGSEARCH
• GLSEARCH
Iterative search
• PSI-BLAST
• PSI-SEARCH

Tools: NCBI BLAST
• BLASTP: protein query against a protein DB
• BLASTN: DNA query against a DNA DB
• BLASTX: translated DNA query against a protein DB

Tools: NCBI BLAST Nucleotide search Protein search

Tools: Wu-BLAST
• BLASTP: protein query against a protein DB
• BLASTN: DNA query against a DNA DB
• BLASTX: translated DNA query against a protein DB
• TBLASTN: protein query against a translated DNA DB
• TBLASTX: translated DNA query against a translated DNA DB
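The choice of BLAST program follows mechanically from the type of the query and the type of the database. The sketch below encodes that mapping as a lookup. The function name `choose_blast_program` and the string labels are hypothetical, introduced only for illustration; they are not part of any BLAST package.

```python
# Hypothetical helper: names and labels are illustrative, not a BLAST API.
PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("dna", "dna"): "BLASTN",
    ("dna", "protein"): "BLASTX",    # the DNA query is translated
    ("protein", "dna"): "TBLASTN",   # the DNA database is translated
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Return the program name for a query/database type combination."""
    if translate_both:
        # TBLASTX compares a translated DNA query with a translated DNA database.
        if (query_type, db_type) != ("dna", "dna"):
            raise ValueError("TBLASTX needs a DNA query and a DNA database")
        return "TBLASTX"
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type} v {db_type}")


if __name__ == "__main__":
    print(choose_blast_program("dna", "protein"))
    print(choose_blast_program("dna", "dna", translate_both=True))
```

A lookup table keeps the five cases visible at a glance, which mirrors how the slide lays them out.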

Tools: Wu-BLAST Nucleotide search Protein search

Tools: FASTA
• FASTA: protein or DNA query against a protein or DNA DB
• FASTX/Y: translated DNA query against a protein DB
• SSEARCH: protein or DNA query against a protein or DNA DB
• GLSEARCH: protein query against a protein DB
• GGSEARCH: protein query against a protein DB

Tools: FASTA Nucleotide search Protein search

When to use which search?
[Figure: NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA positioned by query length and database size]

When to use which search?
[Figure: NCBI BLAST, WU-BLAST, PSI-SEARCH and FASTA positioned by speed of search across the databases PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]

BLAST v FASTA
BLAST
• Fast
• Excels with proteins
• Good local alignments + short global alignments
• Proteins: BLOSUM62 (-11/-1), alignments good at >85% identity
• Good at finding siblings
FASTA
• Slower
• Excels with proteins and DNA (better than BLASTN for DNA)
• Produces S-W (Smith-Waterman) alignments
• Proteins: BLOSUM50 (-10/-2), longer alignments good at >70% identity
• Good at finding cousins

GLSEARCH and GGSEARCH
GLSEARCH
• Global (query) - Local (target DB) alignment
• For global query alignments to domains/patterns in target proteins
GGSEARCH
• Global (query) - Global (target DB) alignment
• Specific for searching short sequences against short targets or for gene-to-gene comparisons

What are global and local alignments?
• BLAST, FASTA: Local - Local (part of the query aligned to part of the subject)
• GLSEARCH: Global - Local (the whole query aligned to part of the subject)
• GGSEARCH: Global - Global (the whole query aligned to the whole subject)
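The difference between global and local alignment shows up directly in the dynamic-programming recurrence. The sketch below scores one pair of sequences both ways, assuming a simple match/mismatch score and a linear gap penalty. It is a teaching illustration, not the implementation used by BLAST, FASTA, GLSEARCH or GGSEARCH, and the function name and default scores are assumptions.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a against b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end gaps are charged.
    local=True:  local alignment (Smith-Waterman style), scores never drop below 0
                 and the best cell anywhere in the table is returned.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)      # a local alignment may restart anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


if __name__ == "__main__":
    query = "ACGTACGT"
    subject = "GGGGACGTACGTCCCC"   # contains the query, flanked by unrelated bases
    print("local :", align_score(query, subject, local=True))
    print("global:", align_score(query, subject, local=False))
```

The local score rewards the embedded exact match, while the global score is pulled down by the gaps needed to span the flanking bases of the longer sequence.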

Tools: PSI (Position Specific Iterated) Search
Single protein sequence → Search database → Estimate significance → Generate alignment → Construct profile → iterate (search the database again with the profile)

Tools: PSI Search
• PSI-BLAST
  • Part of NCBI-BLAST package
  • Automatic iteration service (PSSM = position-specific scoring matrix)
  • Manually guided service
• PSI-SEARCH
  • Combines: SSEARCH (S&W algorithm) + PSI-BLAST (iterative strategy)
  • Manually guided service
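The "construct profile" step of an iterative search can be illustrated with a toy position-specific scoring matrix. The sketch below builds log-odds scores from a handful of aligned, ungapped hits, assuming a uniform background and a fixed pseudocount. Real PSI-BLAST and PSI-SEARCH use sequence weighting and more careful pseudocounts, so the function names and parameters here are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes all sequences have equal length and contain no gaps, and
    a uniform background frequency of 1/20 (a simplification).
    """
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["ACDE", "ACDE", "ACDF"]          # made-up aligned hits
    profile = build_pssm(hits)
    for candidate in ("ACDE", "ACDF", "WWWW"):
        print(candidate, round(score_segment(profile, candidate), 2))
```

In an iterative search the profile would be used to search the database again, and newly found significant hits would be added to the alignment before the next round.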

Let’s look at a FASTA search

FASTA search Step 1: Select a database

Which database to choose?
Database size is important
• ENA-Annotation >124 million
• UniParc (non-redundant) >24 million
• Databases grow every day

How database size affects results
Query sequence: gatctccatggg
[Figure: BLAST results for this query against databases of >122M, >700,000, >15M and >1.5M entries; hit counts of 489, 3, 60 and 0 (>1000); e-values of 100% matches: 621.0, 0.96, 789.0]

How database size affects results
• Search smallest database likely to contain your sequence
• Run multiple small searches (can run all ENA/UniParc as well)
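Why a larger database weakens the significance of the same match can be seen from the standard Karlin-Altschul relation E = m × n × 2^(-S'), where m is the query length, n the database length and S' the bit score. The sketch below is a minimal illustration of that scaling. The function name, the bit score and the database sizes are assumptions chosen for the example, not values from the slides, and real programs use effective lengths rather than raw lengths.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for query length m, database length n, bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    query_len = 12      # length of the example query gatctccatggg
    bit_score = 24.0    # hypothetical bit score for a perfect 12-base match
    # Hypothetical database sizes in residues.
    for db_len in (1_000_000, 100_000_000, 10_000_000_000):
        e_value = expected_chance_hits(bit_score, query_len, db_len)
        print(f"database of {db_len:>14,} residues: E = {e_value:.3g}")
```

Because E grows linearly with n, the same perfect match to a short query can be significant in a small database and indistinguishable from chance in a very large one.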

Protein or nucleotide database search? Two issues are worth considering…

Protein or nucleotide database search?
Codon degeneracy
• Amino acids: Ser v Ser → match
• Nucleotides: UCU v AGC → mismatch

Protein or nucleotide database search?
Over-simple match/mismatch scoring
• Highly conserved: Ser v Ser (identical); UCU v AGC (mismatch)
• Weakly conserved: Ser v Asn (similar); UCU v AAC (mismatch)
• Not conserved: Ser v Leu (mismatch); UCU v CUC (mismatch)
Amino acid scoring separates the three cases; nucleotide scoring makes no distinction between them.
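The codon pairs on this slide can be checked directly. The sketch below compares each pair at the nucleotide level and at the amino acid level, using a lookup restricted to the four codons shown. The function name is hypothetical, and only these four entries of the standard genetic code are included.

```python
# Only the codons that appear on the slide (standard genetic code).
CODONS = {"UCU": "Ser", "AGC": "Ser", "AAC": "Asn", "CUC": "Leu"}


def compare_codons(codon_a, codon_b):
    """Return (identical bases out of 3, same amino acid?) for two codons."""
    identical_bases = sum(x == y for x, y in zip(codon_a, codon_b))
    same_amino_acid = CODONS[codon_a] == CODONS[codon_b]
    return identical_bases, same_amino_acid


if __name__ == "__main__":
    for other in ("AGC", "AAC", "CUC"):
        bases, same = compare_codons("UCU", other)
        print(f"UCU ({CODONS['UCU']}) v {other} ({CODONS[other]}): "
              f"{bases}/3 bases identical, same amino acid: {same}")
```

All three pairs share no identical bases, so a nucleotide comparison cannot tell the synonymous Ser/Ser pair from the Ser/Leu substitution.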

Protein or nucleotide database search?
Human CKS1B kinase v Zebra finch CDC28 kinase 1B
[Figure: protein alignment compared with nucleotide alignment]

Protein or nucleotide search?
Identify homologs searching: proteins v DNA
[Figure: timeline in billions of years ago, from the formation of Earth through chemical evolution, self-replicating cells, photosynthesis, complex cells, multicellular life, the Cambrian explosion and the extinction of dinosaurs to today, showing how far back protein and DNA searches reach across taxa from archaea, cyanobacteria and other prokaryotes to eukaryotes such as land plants, arthropods, fish, amphibians, reptiles, birds, mammals and genus Homo]
Protein comparisons identify homologues 5-10x further back in evolution

Protein or nucleotide database search?
• …therefore, searching a protein database could pull out many more homologues than searching a nucleotide database
• …if you start with a nucleotide sequence, try BLASTX or FASTX to translate your query sequence and search a protein database
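Translating a nucleotide query before searching a protein database means reading it in all six frames, three on each strand. The sketch below shows that step in plain Python with the standard genetic code. It is an illustration of the idea, not the code used by BLASTX or FASTX, and the function names are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X, stops become *."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six translations keyed as +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("gatctccatggg").items():
        print(name, protein)
```

Each of the six peptides would then be compared against the protein database, so the search no longer depends on knowing the reading frame or strand in advance.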

FASTA search Step 1: Select a database Step 2: Paste sequence

FASTA search Step 1: Select a database Step 2: Paste sequence Step 3: Choose parameters

Choosing parameters

Choosing parameters User manual provides help

Which parameters to choose? Matrix
• Nucleotide search is ‘simpler’: only match/mismatch
• Protein search uses substitution matrix tables (based on amino acid similarities and rate of change)

Which parameters to choose?
Choice of matrix depends on:
• strictness of search
• length of query sequence
QUERY LENGTH   MATRIX     open   ext
>300           BLOSUM50   -10    -2
85-300         BLOSUM62   -7     -1
50-85          BLOSUM80   -16    -4
>300           PAM250     -10    -2
85-300         PAM120     -16    -4
35-85          MDM40      -12    -2
<=35           MDM20      -22    -4
<=10           MDM10      -23    -4
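The query-length table can be turned into a small lookup. The sketch below returns the matrix and gap penalties for a given query length. The function name and the `family` argument are hypothetical, and the handling of boundary lengths (85 and 300 appear in two rows) is an assumption, since the table does not say which row wins.

```python
def suggest_parameters(query_length, family="BLOSUM"):
    """Return (matrix, gap_open, gap_extend) following the query-length table.

    Boundary handling is an assumption: 85-300 is treated as inclusive,
    and shorter ranges stop just below 85.
    """
    if family == "BLOSUM":
        if query_length > 300:
            return ("BLOSUM50", -10, -2)
        if query_length >= 85:
            return ("BLOSUM62", -7, -1)
        if query_length >= 50:
            return ("BLOSUM80", -16, -4)
        raise ValueError("the table lists no BLOSUM matrix for queries under 50")
    if family == "PAM":
        if query_length > 300:
            return ("PAM250", -10, -2)
        if query_length >= 85:
            return ("PAM120", -16, -4)
        if query_length > 35:
            return ("MDM40", -12, -2)
        if query_length > 10:
            return ("MDM20", -22, -4)
        return ("MDM10", -23, -4)
    raise ValueError(f"unknown matrix family: {family}")


if __name__ == "__main__":
    for length in (8, 20, 60, 120, 400):
        print(length, suggest_parameters(length, family="PAM"))
```

Keeping the matrix and its gap penalties together in one return value reflects the point that the penalties are chosen to match the scoring matrix.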

Matrices - controlling search sensitivity
PAM (point accepted mutation)
• Based on global alignments of related proteins
• 1 substitution in 100 residues = PAM 1
• Other matrices extrapolated from PAM 1
• Model of evolutionary divergence
• Bias against rare substitutions (e.g. Cys → Tyr) due to seed proteins
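Extrapolation from PAM 1 means multiplying the PAM 1 mutation probability matrix by itself: PAM n is PAM 1 raised to the n-th power. The sketch below shows this on a toy two-letter alphabet rather than the real 20 x 20 amino acid matrix. The toy probabilities and function names are assumptions made for illustration only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def extrapolate(pam1, n):
    """Return pam1 raised to the n-th power (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy PAM 1: each residue has a 1% chance of being substituted per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    for n in (1, 10, 120, 250):
        unchanged = extrapolate(TOY_PAM1, n)[0][0]
        print(f"PAM {n:>3}: probability a residue is unchanged = {unchanged:.3f}")
```

The probability of seeing the original residue falls as n grows, which is why high-numbered PAM matrices suit more divergent sequences.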

Matrices - controlling search sensitivity
BLOSUM (BLOcks SUbstitution Matrix)
• Based on protein domain alignments from the BLOCKS database
• Observed substitutions in conserved domains
• The matrix number is the percentage identity used to cluster the sequences, so BLOSUM50 is deeper (suits more divergent sequences) than BLOSUM80

[Figure: Effect of applying PAM10 to PAM500 matrices (10, 100, 200, 300, 400, 500) to the human LDL receptor sequence]

Which parameters to choose?
• Protein searches: matrix
• Nucleotide searches: ...instead have match/mismatch scores
[Figure: FASTA and BLAST parameter forms]

Match/mismatch scores
• “Reward” for match, “penalty” for mismatch
• Reward/penalty ratio: increase ratio to find more divergent sequences
  • Ratio of 0.33 (1/-3) for 99% conserved
  • Ratio of 0.5 (1/-2) for 95% conserved
  • Ratio of 1 (1/-1) for 75% conserved
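The effect of the reward/penalty ratio is easy to see on a small ungapped comparison. The sketch below computes the ratio and scores the same pair of sequences under the three settings listed above. The function names and the example sequences are assumptions made for illustration.

```python
def reward_penalty_ratio(reward, penalty):
    """Ratio of match reward to the size of the mismatch penalty."""
    return reward / abs(penalty)


def ungapped_score(a, b, reward=1, penalty=-3):
    """Score two equal-length sequences position by position, without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


if __name__ == "__main__":
    query = "ACGTACGTAC"
    subject = "ACGTTCGTAG"   # 8 of 10 positions identical
    for reward, penalty in ((1, -3), (1, -2), (1, -1)):
        ratio = reward_penalty_ratio(reward, penalty)
        score = ungapped_score(query, subject, reward, penalty)
        print(f"{reward}/{penalty}: ratio {ratio:.2f}, score {score}")
```

With a harsh penalty the two mismatches cancel most of the score, while a ratio of 1 leaves the alignment clearly positive, which is why higher ratios let more divergent sequences through.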

Which parameters to choose? Gap penalties
Nucleotide search
• Gap open = -2 to -16
• Gap extension = 0 to -4
Protein search
• Gap open = 0 to -23
• Gap extension = 0 to -8

Which parameters to choose?
Choice of gap penalties depends on:
• strictness of search (larger penalty → fewer gaps)
• to match scoring matrix
QUERY LENGTH   MATRIX     open   ext
>300           BLOSUM50   -10    -2
85-300         BLOSUM62   -7     -1
50-85          BLOSUM80   -16    -4
>300           PAM250     -10    -2
85-300         PAM120     -16    -4
35-85          MDM40      -12    -2
<=35           MDM20      -22    -4
<=10           MDM10      -23    -4
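How the open and extension penalties combine can be shown with a one-line affine gap score. The sketch below assumes the convention in which a gap of length k scores open + k × ext; some programs instead charge the opening penalty for the first gapped residue, so treat the convention and the function name as assumptions.

```python
def affine_gap_score(gap_length, gap_open, gap_extend):
    """Score of one gap of gap_length residues: open + length * extend.

    Assumed convention; check the program's manual before relying on it.
    """
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return gap_open + gap_extend * gap_length


if __name__ == "__main__":
    gap_open, gap_extend = -10, -2      # the BLOSUM50 row of the table
    one_long_gap = affine_gap_score(3, gap_open, gap_extend)
    three_short_gaps = 3 * affine_gap_score(1, gap_open, gap_extend)
    print("one gap of 3 residues :", one_long_gap)
    print("three gaps of 1 residue:", three_short_gaps)
```

Under this scheme one long gap is much cheaper than several short ones, and making the opening penalty larger reduces the number of gaps the alignment will accept.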

Which parameters to choose? KTUP (word length)
• KTUP = ‘word-length’ of search
• Large word-length → less sensitive, but faster
• Nucleotide search: fewer bases than amino acids → higher KTUP
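The trade-off behind KTUP can be seen by listing the words two sequences share at different word lengths. The sketch below is a simplified illustration of word matching, not the FASTA lookup-table code, and the function names and example sequences are assumptions.

```python
def words(sequence, ktup):
    """Set of all overlapping words of length ktup in the sequence."""
    return {sequence[i:i + ktup] for i in range(len(sequence) - ktup + 1)}


def shared_words(query, subject, ktup):
    """Words of length ktup that occur in both sequences."""
    return words(query, ktup) & words(subject, ktup)


if __name__ == "__main__":
    query = "ACGTTGCA"
    subject = "ACGATGCA"    # differs from the query at one position
    for ktup in (3, 6):
        print(f"ktup={ktup}: shared words {sorted(shared_words(query, subject, ktup))}")
    # Number of possible words: 4 bases v 20 amino acids.
    print("DNA words of length 6:    ", 4 ** 6)
    print("protein words of length 2:", 20 ** 2)
```

Short words still seed a match around the single difference, while long words find nothing to start from. Because there are only 4 bases, a nucleotide word has to be longer than a protein word to be comparably specific.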

Which parameters to choose? Do I mask my sequence?
• Low complexity regions should be masked to avoid spurious results
  • CA repeats
  • poly-A tails
  • proline-rich regions
**Be careful you don’t mask what you are looking for
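Masking replaces low complexity stretches with a neutral character so that they cannot seed matches. The sketch below masks poly-A runs and CA repeats in a DNA sequence with a regular expression. The thresholds, the mask character and the function name are assumptions for illustration; dedicated filters such as DUST and SEG use complexity measures rather than fixed patterns.

```python
import re


def mask_simple_repeats(dna, min_poly_a=10, min_ca_units=5, mask_char="N"):
    """Replace long poly-A runs and CA repeats with mask_char, keeping length.

    The thresholds are arbitrary illustrative choices.
    """
    pattern = re.compile(r"A{%d,}|(?:CA){%d,}" % (min_poly_a, min_ca_units))
    return pattern.sub(lambda m: mask_char * len(m.group()), dna.upper())


if __name__ == "__main__":
    sequence = "GATTC" + "CA" * 6 + "GGT" + "A" * 12
    print(sequence)
    print(mask_simple_repeats(sequence))
```

Keeping the sequence length unchanged preserves alignment coordinates. If the poly-A tail or the repeat is itself the feature of interest, masking has to be switched off, as the warning on the slide says.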
