Fundamentals of Sequence Analysis

Fundamentals of Sequence Analysis Chuong Huynh NIH/NLM/NCBI Sept 30, 2004 huynh@ncbi.nlm.nih.gov

MLH1 MutL Human Fly Worm Yeast Bacteria Molecular Evolution Common ancestry allows us to infer similar function 3000 Myr 1000 Myr 540 Myr Pancreatic carcinoma Alzheimer’s Disease Ataxia telangiectasia Colon cancer

Why do we need similarity searching? • Identification and annotation • Incomplete or no annotations (GenBank) • Incorrectly annotated sequences • Evolutionary relationships homologous molecules may have similar functions

Why Search Databases? • To find out if a new DNA sequence is already fully or partially present in the databanks. • To find homologous proteins to a putative coding ORF that might share similar 3D structure. • to identify homology (“relatedness”) between a query and entries in a database

Some Terminology

Searching Sequence Databases • Two sequences are homologous when they share a common ancestry. This ancestry is reflected in strong sequence similarity. • Computationally, threshold limits for sequence similarity can be defined by : • length of the stretch of similar sequence • percentage of identity between the sequence • statistical measurements, like E-value, P-value, Bit-score, etc.

Similarity and Homology • Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness. • Homology is an evolutionary term used to describe relationship via descent from a common ancestor. • Homologous things are often similar, but not always (whale flipper <-> human arm) • Homology is NEVER expressed as a percentage

Orthologs vs Paralogs • Homologs can be separated into two classes: orthologs and paralogs. • Orthologs are homologous genes that perform the same function in different species. • Paralogs are homologous genes within a species that may perform different functions.

Similarity and Homology • Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length. • Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor. • Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ <-> tubulin) • Remember: pair of sequences either is or isn't homologous. There is no such thing as “64% homologous"

Searching sequence databases • When we search a sequence database, we are usually looking for related sequences. • Unfortunately, the algorithms that we have for searching databases, do not search for homology, they search for similarity. • When similarity is found, we must determine if this similarity is a result of homology or it if comes from another source.

Pairwise Sequence Alignments • Purpose: • identification of sequences with significant similarity to (a) sequence(s) in a sequence-repository • identification of all homologous sequences the repository • identification of domains with sequence similarity • Terminology • Global alignment • Local alignment

Terminology: Global Alignment • Finds the optimal alignment over the entire length of the two compared sequences • Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA • Suitable for sequences of homologous molecules

Terminology: Local Alignment • short regions of similarity between a pair of sequences. • compared sequences can receive high local similarity scores, without the need to have high levels of similarity over their entire length • useful when looking for domains within proteins or looking for regions of genomic DNA that contain coding exons

Substitutions, Insertions, Deletions • Mutation: one of • switch from one nucleotide to another • insertion • deletion • Substitution: a switch in nucleotides which spreads throughout most of a species. • Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other): cgggtatccaa cggtatgcca ccctaggtccca

Example • For the previous example cggtatgcca cgggtatccaa , ccctaggtccca, the two descendent sequences align as follows c g g g t a - - t - c c a a c c c - t a g g t c c c - a • “-” (indel) represents an insertion or deletion.

Algorithms: definition Webster’s definition: “a procedure for solving a mathematical problem in a finite number of steps that frequently involves a repetition of an operation; or broadly: a step-by-step procedure for solving a problem or accomplishing some end”

Algorithms • Needleman-Wunsch • Exhaustive global alignment • most rigorous method when aligning conserved sequences of similar length (no exon shuffling, insertion/deletion etc) • Smith-Waterman • Exhaustive local alignment • alignment does not have to extend along the full length of the sequences • In contrast to N-W alignments initiating at all possible positions of the sequence-space will be considered • Can be very slow

Basic Local Alignment Search Tool http://www.ncbi.nlm.nih.gov/BLAST/

Basic Local Alignment Search Tool • Widely used similarity search tool • Heuristic approach based on Smith Waterman algorithm • Finds best local alignments • Provides statistical significance • All combinations (DNA/Protein) query and database. • DNA vs DNA • DNA translation vs Protein • Protein vs Protein • Protein vs DNA translation • DNA translation vs DNA translation • www, standalone, and networkclients

BLAST Selection Matrix

Choosing The Right BLAST Flavor for Proteins Claverie & Notredame 2003

Choosing the Right BLAST Flavor for DNA Claverie & Notredame 2003

Choosing The Right BLAST Flavor for DNA Sequences Claverie & Notredame 2003

WWW BLAST

Web BLAST

BLAST Databases: Non-redundant protein nr (non-redundant protein sequences) • GenBank CDS translations • NP_ RefSeqs • Outside Protein • PIR, Swiss-Prot, PRF • PDB(sequences from structures)

BLAST Databases: Nucleic Acid • nr (nt) • Traditional GenBank Divisions • NM_ and XM_ RefSeqs • dbest • EST Division • htgs • HTG division • gss • GSS division • chromosome • NC_ RefSeqs

>Mutated in Colon Cancer IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILER VQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSS DKVYAHQMVRTDSREQKLDAFLQPLSKPLSS Protein BLAST Page Protein database

BLAST Formatting Page

BLAST Output Overview • Graphic Display: Shows you where your query is similar to other sequences. • Hit List: Name of sequences similar to your query ranked by similarity • Alignments: Every alignment between your query and the reported hits • Parameters: List of the various parameters used for the search

BLAST Output: Graphic Red bar = most similar sequence Pink = almost as similar Green – even less similarBlue/Black – worse scores Sort by taxonomy mouse over, click for active links

sorted by e values 4 X 10-56 Bit scores < 50 unreliable link to entrez LocusLink Default e value cutoff 10 Bacterial mismatch repair proteins BLAST Output: Descriptions

A Little on Interpretation • How similar must sequences be in order to be considered homologous? • More than 25% of the amino acids present are identical for proteins and more than 70% of the nucleotides present are identical for DNA. Above these limits, you can be sure that two proteins have same structure and same common ancestor. • Rem: only > 100 aa or nt in length

A Little on Interpretation: E-value • Determine how much you can trust your conclusion on homology. • E-value = Expectation Values • Allow for comparing pairwise alignment with different similarities and different length. Advantage over Percent Identity (not discussed). • Definition: Number of times your database match may have occurred by chance. Match unlikely to occur by chance is a good match. The loest E-values (as close to 0 as possible) are the best. Thus, most significant, since we know we can trust them enough to infer homology • If you want to be certain of homology your E-values must be below 10-4 or (0.0001).

TaxBLAST: Taxonomy Reports

BLAST Output: Pairwise Alignments >gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL Length = 615 Score = 44.3 bits (103), Expect = 5e-05 Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%) Query: 9 LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59 L + P L LEI P VDVNVHP KHEV F +H+ + +L +QQ +E+ L Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338

low complexity sequence filtered BLAST Output: Alignments >gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1) Length = 756 Score = 233 bits (593), Expect = 8e-62 Identities = 117/131 (89%), Positives = 117/131 (89%) Query: 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335 Query: 61 GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120 GSNSSRMYFTQTLLPGLAGPSGEMVK DKVYAHQMVRTDSREQKLDA Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395 Query: 121 FLQPLSKPLSS 131 FLQPLSKPLSS Sbjct: 396 FLQPLSKPLSS 406

Results from nr Sequences producing significant alignments: (bits) Value gi|604369|gb|AAA85687.1| (U17857) hMLH1 gene product [Homo ... 233 3e-61 gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mut... 233 4e-61 gi|466462|gb|AAA17374.1| (U07418) human homolog of E. coli ... 233 4e-61 gi|13878583|sp|Q9JK91|MLH1_MOUSE DNA mismatch repair protei... 214 2e-55 gi|19387852|ref|NP_081086.1| (NM_026810) mutL homolog 1; DN... 213 2e-55 gi|13591989|ref|NP_112315.1| (NM_031053) mismatch repair pr... 212 5e-55 gi|12835158|dbj|BAB23172.1| (AK004105) DNA MISMATCH REPAIR ... 205 6e-53 gi|3192877|gb|AAC19117.1| (AF068257) mutL homolog [Drosophi... 128 1e-29 gi|17136968|ref|NP_477022.1| (NM_057674) Mlh1-P1 [Drosophil... 127 1e-29 gi|17861656|gb|AAL39305.1| (AY069160) GH18717p [Drosophila ... 125 8e-29 gi|20146218|dbj|BAB89000.1| (AP003238) putative MLH1 [Oryza... 87 2e-17 gi|11357265|pir||T51620 DNA mismatch repair protein MLH1 [i... 83 5e-16 gi|18413196|ref|NP_567345.1| (NM_116983) MLH1 protein [Arab... 83 5e-16 gi|6323819|ref|NP_013890.1| (NC_001145) Required for mismat... 72 1e-12 gi|460627|gb|AAA16835.1| (U07187) Mlh1p [Saccharomyces cere... 71 2e-12 gi|19112991|ref|NP_596199.1| (NC_003423) putative DNA misma... 70 5e-12 gi|13517948|gb|AAK29067.1|AF346620_1 (AF346620) MLH1 [Trypa... 57 3e-08 gi|16272041|ref|NP_438240.1| (NC_000907) DNA mismatch repai... 54 3e-07 gi|19173567|ref|NP_597370.1| (NC_003232) DNA MISMATCH REPAI... 52 9e-07 gi|13543339|gb|AAH05833.1|AAH05833 (BC005833) Similar to mu... 50 5e-06 gi|15602769|ref|NP_245841.1| (NC_002663) MutL [Pasteurella ... 50 6e-06 gi|15642797|ref|NP_227838.1| (NC_000853) DNA mismatch repai... 48 2e-05 >gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mutL (E. coli) homolog 1; coli) homolog 1 (colon cancer, nonpolyposis type 2) [Homo sapiens] gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 (MutL protein homolog 1) gi|631299|pir||S43085 DNA mismatch repair protein MLH1 - human gi|463989|gb|AAC50285.1|(U07343) hMLH1 [Homo sapiens] gi|1079787|gb|AAA82079.1|(U40978) DNA mismatch repair protein homolog [Homo sapiens] gi|13905126|gb|AAH06850.1|AAH06850 (BC006850) mutL (E. coli) homolog 1 type 2) [Homo sapiens] gi|741682|prf||2007430A DNA mismatch repair protein [Homo sapiens] Length = 756 Score = 233 bits (593), Expect = 4e-61 Identities = 117/131 (89%), Positives = 117/131 (89%) Query: 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

tblastn Results Against ESTs >gi|12794555|emb|AL531062.1|AL531062 AL531062 LTI_NFL001_NBC4 Homo sapiens cDNA clone CS0DM005YM23 5 prime. Length = 878 Score = 167 bits (422), Expect(3) = 1e-42 Identities = 81/82 (98%), Positives = 81/82 (98%) Frame = +2 Query: 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL Sbjct: 512 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 691 Query: 61 GSNSSRMYFTQTLLPGLAGPSG 82 GSNSSRMYFTQTLLPGLAGP G Sbjct: 692 GSNSSRMYFTQTLLPGLAGPLG 757 Score = 24.3 bits (51), Expect(3) = 1e-42 Identities = 11/26 (42%), Positives = 11/26 (42%) Frame = +1 Query: 80 PSGEMVKXXXXXXXXXXXXXXDKVYA 105 PSG MVK DKVYA Sbjct: 748 PSG*MVKSTTSLTSSSTSGSSDKVYA 825 combined expect for hits to multiple frames

BLAST Tips • It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides. • If in doubt use blastp. • When possible restrict to the subset of the database you are interested in. • Look around for the database you need or create your own custom BLAST database. BUT HOW??? • When is the best time to use the BLAST server?

Asking Biological Problems with BLAST Claverie & Notredame 2003

BLAST and PSI-BLAST Servers on the Internet

Alternative Method for Homology Searches • Smith-Waterman (ssearch): slower but more accurate • FASTA: slower than BLAST, but more accurate when making DNA comparison • BLAT: for locating cDNA in a genome or finding close proteins in a genome

Common Mistake Sequence 1: AAAAAABBBBBB Sequence 2: AAAAAA Sequence 3: BBBBBB • Seq1 has domain A & B; Seq2 has domain A and Seq3 has domain B • Use Seq 1 as query sequence • What happens? E-value of both of these hits may be very high if domain A and B are long and well conserved. • Seq1 is homologous to Seq2&3, but remember Seq1 is not homlogous over the entire length to Seq2&3 • Just don’t depend on the E-value • “BLAST hits are not transitive, unless the alignments are overlapping” • Most proteins have more than one domain, so becareful when looking a BLAST results, not all reported hits belong to the same big family.

Common Questions • When I do a blast job using WU-BLAST vs NCBI BLAST with the same query sequence, I get a different result? Both are based on the same algorithm, but a different implementation. So why the difference? Usually this is due to the slight variation in the database version, but differences in BLAST program version also play a minor role in the difference. Usually the result, do not change in a dramatic manner, but they do change a bit.

Self Guided Exercises - BLAST • If you need further help on Blast. • First READ then try the problem set. • Blast Course: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.htmlBlast • Tutorial: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html • Blast Quick Start (click on P for the problem set) http://www.ncbi.nlm.nih.gov/Class/minicourses/blastex2.html

Fundamentals of Sequence Analysis

Fundamentals of Sequence Analysis

Presentation Transcript

Fundamentals in Sequence Analysis 1.(part 1)

Sequence analysis

Fundamentals of RCM Analysis

Sequence analysis

Sequence Analysis

Sequence Analysis

Sequence Analysis

Sequence Analysis

Sequence analysis

Sequence Analysis

Sequence Analysis

SEQUENCE ANALYSIS

Sequence analysis

Sequence Analysis

Sequence Analysis

SEQUENCE ANALYSIS

Sequence Analysis

Sequence Analysis

Sequence Analysis