120 likes | 326 Views
Pairwise Sequence Analysis-I. Analogous Concepts Domain Concepts Dot-matrix Analysis Pairwise Sequence Analysis Why is it useful? What are the underlying concepts? What algorithms are used? Limitations/Open questions. Concepts. Pairwise “everything” analysis
E N D
Pairwise Sequence Analysis-I • Analogous Concepts • Domain Concepts • Dot-matrix Analysis • Pairwise Sequence Analysis • Why is it useful? • What are the underlying concepts? • What algorithms are used? • Limitations/Open questions Lecture 2 CS566
Concepts • Pairwise “everything” analysis • Humans are experts at pairwise comparison • “Peas of a pod” • Basis for pecking order in society • Basis for notion of fairness (“He got a better deal than I did!”) • Basis for grouping • Basis for sorting/ranking algorithms • Basis for search strategies Lecture 2 CS566
Concepts • Identity versus similarity • Brand name versus generic products • Coincidence versus plagiarism • Co-discoveries versus “Microsoft Explorer” • “What you see is not what you got” • Process not evident from product • “If it ain’t broke, don’t fix it” • Why messy code that works is left to remain messy Lecture 2 CS566
Domain Concepts • Sequences change over generations • More changes in DNA than in proteins (Why?) • Causes of change • Error rates of replication ~ 1 in a billion • Exchange • Legitimate: Recombination • Illegitimate: Lateral transfer (a la Napster) • Mutation • Exposure to chemicals Lecture 2 CS566
Dot-matrix Analysis Lecture 2 CS566
Why (Motivation) • Given sequence A, what are its properties? • Find any sequence(s) B similar to A, where B {Sequences with known properties} • Given sequences A & B, are they • Identical? • Similar? • Evolutionary related? • Given sequence A, is this a new discovery? • Given a set of sequences, can they be clustered into families of related sequences? Lecture 2 CS566
Components • Compare the two sequences and assign score • Evaluate the score • Use of a scoring scheme to assign a value to a candidate alignment • Finding the best alignment between two sequences • Use of a probability model to assess the significance of the similarity (coincidence or plagiarism?) Lecture 2 CS566
Concepts - Alignment • Sequences may be aligned ‘locally’ (look for regions/subsequences that are similar) or ‘globally’ (align along entire length) • The result of a local alignment may differ from that a global alignment (Why?) • Potentially large number of alignments to be considered • Each alignment is a path through a fully connected dot matrix graph, i.e., a subset of the set of all diagonal edges, connected by horizontal or vertical lines. Lecture 2 CS566
Concepts – Scoring scheme • Scores are based on amino-acid substitution matrices (log-odds ratio of observed versus expected random substitutions) • PAM (Percent amino acid substitution) matrices: Based on evolutionary model. E.g., PAM 1%, PAM 250%. • BLOSUM (Blocks substitution) matrices: Based on percent identity. E.g., Blosum50, Blosum62. Lecture 2 CS566
Concepts – Scoring scheme Identity based scoring (match XOR non-match) Isawa --- dancer Isawatapdancer Isawa ---- danc-er Isawapinkpanther Similarity based scoring (partial matching) Eyl dkv Lecture 2 CS566
Algorithms • Optimal/Exact solutions: • Take longer time • Typically used for comparing a small number of sequences • (Needleman-Wunsch; Smith-Waterman) • Heuristic: • Frequently close to the optimal solution • Rapid • Typically used for 1:n searches involving large number of sequences • (BLAST, FASTA) Lecture 2 CS566
Open Questions/Limitations Highly similar sequences or highly dissimilar sequences are easily found BUT • Gray area lies in weak similarity; probability distribution is continuous • Other forms of similarity (e. g., structural or species similarity) may be necessary reach a decision • Sequence similarity per se does not guarantee functional equivalence; small difference can be responsible for large difference in function Lecture 2 CS566