
Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx A7-421 Ext -4536+103 BT4007



Presentation Transcript


  1. Blast and Alignments. Bioinformatics, Dr. Víctor Treviño, vtrevino@itesm.mx, A7-421, Ext -4536+103, BT4007

  2. Paper Presentations in March
     Find a research article related to your project that has a strong bioinformatics component. For example:
     • Generation of a database
     • Development of a program or service
     • Discovery of genes, metabolic pathways, etc. by means of, or with the help of, bioinformatics methods
     Then:
     • Propose the paper to the professor and confirm it
     • Study the paper
     • Prepare the presentation
     • Present it in class: 15 minutes (10 minutes of presentation + 5 of questions)
     • Presentations are evaluated by the professor and the students, using a rubric that grades elements such as: Topic, Intro, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time

  3. Papers for NEXT Session

  4. SEQUENCE SIMILARITY
     • Sequences are similar because they are derived from a common ancestor.
     • Similarity will most often be the result of duplication events, and will then depend on divergence times.
     • General rule: 25% identity over a 100 aa sequence is good evidence of common ancestry.
     Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

  5. SEQUENCE SIMILARITY - REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES
     Within a protein sequence, some regions will be more conserved than others. The more conserved a region is, the more important it is likely to be:
     • for function
     • for 3D structure
     • for localization
     • for modification
     • for interaction
     • for regulation/control
     • for transcriptional regulation (in DNA)

  6. SEQUENCE SIMILARITY - TERMS
     • Homologous: similar due to common ancestry
     • Analogous: similar due to convergent evolution
     • Orthologous: homologous genes separated by a speciation event (found in different species), usually with conserved function
     • Paralogous: homologous genes separated by a duplication event (commonly within the same species), often with different function
     Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

  7. SEQUENCE SIMILARITY - TERMS
     • Xenologous: homologous due to horizontal gene transfer
       • HGT (horizontal gene transfer): transfer of genetic material to an organism that is not its offspring
       • VGT (vertical gene transfer): transfer of genetic material from an ancestor to its offspring (mitosis) [VGT is not related to xenologous]
     • Ohnologous: paralogous genes that originated by whole genome duplication
     • Gametologous: homologous genes on non-recombining, opposite sex chromosomes
     Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI; Wikipedia

  8. SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  9. SEQUENCE SIMILARITY – ORIGINS OF GENES
     • a1-S1 and a1-S2 are Orthologous
     • a2-S1 and a2-S2 are Orthologous
     • a1 & a2 are Paralogous
     • Analogous Genes – Same Function, Different Origin
     • Xenologous
     Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  10. SEQUENCE SIMILARITY – TYPES OF MODIFICATION
      Original sequence: …ACCAGTGTGCCGTACA…
      Mutations occur during evolution by:
      • Insertions: …ACCAGTaGTGCCGTACA…
      • Deletions: …ACCAGTCCGTACA… (GTG removed)
      • Substitutions: …ACCAGTGCGCCGTACA…
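
The three mutation types on this slide can be reproduced with simple string operations. The sketch below is illustrative only: the helper names (`insert`, `delete`, `substitute`) and the 0-based positions are assumptions chosen to match the example sequences shown on the slide.

```python
# Illustrative helpers (hypothetical names) that reproduce the slide's examples.
ORIGINAL = "ACCAGTGTGCCGTACA"

def insert(seq, pos, bases):
    """Insert `bases` before 0-based position `pos`."""
    return seq[:pos] + bases + seq[pos:]

def delete(seq, pos, length):
    """Delete `length` residues starting at 0-based position `pos`."""
    return seq[:pos] + seq[pos + length:]

def substitute(seq, pos, base):
    """Replace the residue at 0-based position `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]

if __name__ == "__main__":
    print(insert(ORIGINAL, 6, "a"))      # insertion example
    print(delete(ORIGINAL, 6, 3))        # deletion example (removes GTG)
    print(substitute(ORIGINAL, 7, "C"))  # substitution example (T to C)
```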

  11. SIMILARITY AND DISTANCE BETWEEN SEQUENCES
      • SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues
        • More useful for database searching
      • DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other
        • More useful for phylogenetic tree reconstruction
      • The two are opposite and interconvertible concepts
      • WEIGHT accounts for the different roles of mutation events, amino acid residue similarity, etc.
        • e.g. synonymous mutations are weighted differently from nonsense mutations
      Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
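
As a rough illustration of DISTANCE as a minimal sum of weights, the sketch below computes a weighted edit distance by dynamic programming. The function name and the two weights (`sub_cost`, `indel_cost`) are assumptions for illustration; real scoring schemes use residue-specific weights.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Minimal total weight of substitutions, insertions and deletions
    needed to transform sequence a into sequence b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel_cost]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            cur.append(min(prev[j - 1] + cost,     # match or substitution
                           prev[j] + indel_cost,   # deletion from a
                           cur[j - 1] + indel_cost))  # insertion into a
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    ref = "ACCAGTGTGCCGTACA"
    print(edit_distance(ref, "ACCAGTGCGCCGTACA"))   # one substitution
    print(edit_distance(ref, "ACCAGTCCGTACA"))      # three deleted residues
    print(edit_distance(ref, "ACCAGTaGTGCCGTACA"))  # one insertion
```

Raising `sub_cost` above twice `indel_cost` makes a deletion plus an insertion cheaper than a substitution, which shows how the choice of weights changes the resulting distance.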

  12. SEQUENCE ALIGNMENT
      Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that appear in the same order in the sequences.
      • Identical residues (nt or aa) are placed in the same column
      • Non-identical residues can be placed in the same column or indicated as gaps
      (figure: overall similarity)
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press; Wikipedia; http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm

  13. SEQUENCE ALIGNMENT
      GLOBAL - Procedure applied to the entire sequence, to include as many matches as possible up to the end of the sequence
      • Methods
        • Brute Force – impractical
        • Dot Matrix – graphical, easy to understand
        • Dynamic Programming – the most accurate
        • Heuristic Methods – fast, not so accurate
          • Word k-tuple – Database Searching – BLAST
      Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI; Wikipedia

  14. GLOBAL AND LOCAL ALIGNMENTS
      Proteins are MODULAR
      • Patterns formed by exchange of whole EXONS
      • Example:
        • F12: Coagulation Factor XII
        • PLAT: Tissue-type plasminogen activator
      • Domains: F1/2 - Fibronectins; E - Epidermal Growth Factors; K - "Kringle" domain
      Global alignment methods do not consider these issues; this motivates LOCAL ALIGNMENT.
      A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

  15. GLOBAL AND LOCAL ALIGNMENTS Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  16. LOCAL ALIGNMENT
      • The alignment stops at the end of regions of identity or strong similarity
      • Much higher priority is given to finding these local regions than to extending the alignment
      A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

  17. DOT-MATRIX METHOD
      • Primary method for comparing sequences
      • Provides a global and local overview of similarity
      • Useful for direct or inverted repeats
      • Useful for self-complementary RNA regions
      • Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET (http://myhits.isb-sib.ch/cgi-bin/dotlet)
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  18. DOT-MATRIX METHOD
      Align the aa sequence "DOROTHYHODGKIN" against "DOROTHYCROWFOOTHODGKIN"
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  19. DOT-MATRIX METHOD – EX 1
      • WINDOW SIZE = 11
      • STRINGENCY = 7 (how many residues in the window must be identical)
      • (figure: a window sliding over …ACCAGTGTGCCGTACA…)
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
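
The window/stringency filter can be sketched in a few lines. This is a minimal illustration, not the implementation used by any of the programs named in these slides: the function names are hypothetical, and placing the dot at the window's start position (rather than its center) is an assumption.

```python
def dot_matrix(a, b, window=1, stringency=1):
    """Return the set of (i, j) window start positions that receive a dot:
    the window of `window` residues starting at a[i] and b[j] must contain
    at least `stringency` identical positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots

def render(a, b, dots):
    """Draw the matrix as text: rows follow a, columns follow b."""
    lines = ["  " + b]
    for i, residue in enumerate(a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)

if __name__ == "__main__":
    a, b = "DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"
    print(render(a, b, dot_matrix(a, b)))            # every identical pair
    print(render(a, b, dot_matrix(a, b, 7, 7)))      # strict filter
```

With `window=1` the plot shows every identical residue pair, including background noise; a larger window with a high stringency keeps only the two diagonal segments (DOROTHY and HODGKIN).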

  20. DOT-MATRIX METHOD – EX 2 A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

  21. DOT-MATRIX METHOD – EX 3 – Repeats
      Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  22. DOT-MATRIX METHOD – Programs
      (you could also use PubMed)
      Bioinformatics for Dummies – Claverie – Notredame – Wiley - 2nd Ed. 2007

  23. DOT-MATRIX EXAMPLES
      • http://hits.isb-sib.ch/util/dotlet/doc/dotlet_examples.html
      • http://myhits.isb-sib.ch/cgi-bin/dotlet

  24. Dynamic Programming Method
      • Provides the very best (optimal) alignment in a very reasonable amount of time
      • Several parameters to choose, though
      • Global: Needleman-Wunsch
      • Local: Smith-Waterman
      • A method for statistical significance is available: it provides a p-value for obtaining the alignment by chance from unrelated sequences
      • Results depend on the scoring system

  25. Dynamic Programming Method
      • Provides the very best (optimal) alignment
      • Several parameters to choose, though
      • Global: Needleman-Wunsch
      • Local: Smith-Waterman
      • A method for statistical significance is available: it provides a p-value for obtaining the alignment by chance from unrelated sequences
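
A minimal sketch of the local (Smith-Waterman) variant follows. The scoring values (match = 2, mismatch = -1, gap = 2 per gapped position) and the function name are assumptions for illustration; they are not taken from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))
```

The only difference from the global recurrence is the floor at zero and the traceback that starts at the highest-scoring cell, which is what lets the alignment stop at the end of a region of strong similarity.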

  26. Dyn. Prog. Method - Scoring
      Results depend on the scoring system:
      • SCORING MATRICES, depending on the pair-wise comparison of residues
      • Gap Penalties
      • DNA alignments require a similar scoring system

  27. Dynamic Programming Method
      (figure labels: gap penalties; values from the scoring matrix; matrix indices i and j; x, y are the "radius")
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  28. Dynamic Programming Method
      (figure labels: gap penalties; values from the scoring matrix; matrix indices i and j; x, y are the "radius")
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  29. Dynamic Programming Method Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  30. DYNAMIC PROGRAMMING EXAMPLE
      • X = 1, Y = 1
      • Gap weights: W(x=1) = 1, W(x=2) = 1, …; W(y=1) = 1, …
      • s(a,b) = 2 if a = b; s(a,b) = 0 if a ≠ b
      • Alignment:
        ACGGATAT
        --GGCTA-
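
The example can be reproduced with a small global (Needleman-Wunsch) sketch. The match and mismatch scores come from the slide; the gap model is an assumption, since the slide's gap weights are ambiguous: here every gapped position costs 1, end gaps included. The function name is hypothetical.

```python
def needleman_wunsch(a, b, match=2, mismatch=0, gap=1):
    """Global alignment; returns (score, aligned a, aligned b).
    Assumes a linear gap penalty of `gap` per gapped position."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] - gap,    # gap in b
                          F[i][j - 1] - gap)    # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACGGATAT", "GGCTA")
    print(score)
    print(top)
    print(bottom)
```

Under these assumed weights the alignment on the slide has 4 matches, 1 mismatch and 3 gapped positions, for a score of 4 × 2 + 0 - 3 = 5.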

  31. Dyn. Prog. Method - Scoring
      Results depend on the scoring system:
      • SCORING MATRICES, depending on the pair-wise comparison of residues
      • Gap Penalties
      • The Dayhoff PAM (point accepted mutation) matrix is based on an evolutionary model for proteins
        • One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed, measured in very similar sequences
      • BLOSUM (blocks substitution matrix) matrices are designed to identify members of the same family
        • Derived from the BLOCKS database (for distant sequences)

  32. Dynamic Programming - Scoring
      • Remember the "SUM OF WEIGHTS" for similarity/distance
      • BLOSUM62: sequences that are at least 62% identical are merged into one before substitutions are counted
      • BLOSUM90 is for comparing more similar sequences; BLOSUM30 is for very different ones
      • PAM250 corresponds to 250 PAM units (the PAM1 matrix multiplied by itself 250 times)
      Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  33. Dynamic Programming Method
      • Some programs provide alternative alignments, depending on the goal:
        • domains
        • structural
        • same family
        • biological function
        • common ancestor
      • There are several variations on the original Needleman-Wunsch and Smith-Waterman methods that improve memory usage, CPU time, and other features

  34. Dynamic Programming Method - Output Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

  35. Dynamic Programming – Statistical Significance
      • To assign a p-value, we could "shuffle" both sequences 100,000 times
        • The proportion of times we obtain SCORES larger than the score of the real alignment represents the p-value
      • Another, quicker method is to convert the alignment to a BINARY sequence (match or no match)
        • e.g. the probability of obtaining HTHTHHHH in a coin toss experiment
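
The shuffling procedure can be sketched as follows. This is only an illustration under stated assumptions: the local scoring function and its parameters (match = 2, mismatch = -1, gap = 2) are placeholders, the function names are hypothetical, shuffled scores greater than or equal to the observed score are counted (a slightly conservative choice compared with "larger than"), and far fewer than 100,000 shuffles are used to keep the run short.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=2):
    """Best Smith-Waterman style local alignment score (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap))
        best = max(best, max(cur))
        prev = cur
    return best

def shuffle_p_value(a, b, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequence pairs scoring at least the observed score."""
    rng = random.Random(seed)      # fixed seed so the estimate is reproducible
    observed = local_score(a, b)
    la, lb = list(a), list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(la)            # shuffling keeps composition, destroys order
        rng.shuffle(lb)
        if local_score(la, lb) >= observed:
            hits += 1
    return hits / n_shuffles

if __name__ == "__main__":
    print(shuffle_p_value("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN", 500))
```

With N shuffles the smallest non-zero p-value that can be reported is 1/N, which is why a large number of shuffles is needed for small p-values.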

  36. Dynamic Programming – Statistical Significance
      • Two random sequences of lengths m and n, with p = probability of a match
      • Expected length of the longest match = log_{1/p}(mn)
      • DNA sequences of length 100 (m = n = 100), p = 0.25 (equal nt frequencies)
        • the longest match = log_4(100 x 100) = 2 x log_4(100) ≈ 6.64
      • More precise formula
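
The simple estimate log_{1/p}(mn) is easy to check numerically. The function name below is hypothetical, and only the simple formula from this slide is implemented (not the more precise one).

```python
import math

def expected_longest_match(m, n, p):
    """Expected length of the longest run of matches between two random
    sequences of lengths m and n, using the estimate log base 1/p of (m*n)."""
    return math.log(m * n) / math.log(1.0 / p)

if __name__ == "__main__":
    # DNA, two sequences of length 100, equal nucleotide frequencies.
    print(round(expected_longest_match(100, 100, 0.25), 2))
```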

  37. Dynamic Programming – Statistical Significance
      • Simplifying: k = number of mismatches; m and n are the sequence lengths
      • (mean of the highest possible local alignment score)
      • Effective length = n – E(m) (used in BLAST)

  38. Alignment Procedure Overview

  39. Word k-tuple method - BLAST
      • Search a database for sequences that share at least W identical residues
      • For a sequence of length L, the number of "internal searches" (words) is L-W+1
      • All "potential" sequences are then "extended" using the Dynamic Programming Method
      • A statistical significance score is estimated, representing the number of similar sequences expected in the database (E value, roughly equivalent to a p-value for the entire database)
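
The word step can be sketched as below. This illustrates only the seeding idea (exact words of length W looked up in an index); it is not BLAST's actual implementation, which among other things also uses neighboring words and an extension step. All names are hypothetical.

```python
from collections import defaultdict

def words(seq, w):
    """All overlapping words of length w: there are len(seq) - w + 1 of them."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def seed_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = defaultdict(list)
    for j, word in enumerate(words(subject, w)):
        index[word].append(j)           # word -> start positions in the subject
    return [(i, j)
            for i, word in enumerate(words(query, w))
            for j in index.get(word, [])]

if __name__ == "__main__":
    print(len(words("ACCAGTGTGCCGTACA", 3)))            # L - W + 1 words
    print(seed_hits("HODGKIN", "DOROTHYCROWFOOTHODGKIN", 3))
```

Each hit is a candidate starting point for the extension step; hits that fall on the same diagonal (constant subject_pos minus query_pos) belong to the same ungapped region.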

  40. BLAST
      • Pi – random residue probability
      • Sij – from the score matrix
      • Score: S = sum(Pi Pj Sij)
      • Transformation (for statistical comparisons, expressed in bits): S' = (λS - ln K) / ln 2
      • Expected number of matches with a score of at least S': E = m n 2^(-S')
      • Lengths: query = m, database = n
      • Example:
        • m = 250, n = 50,000,000, to achieve E = 0.05
        • S' = 38 bits
        • S = [(38 * ln 2) + ln K] / λ, so S = 76.6
      • (for the ungapped version: λu = 0.3176 and Ku = 0.134)
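
The worked example can be checked numerically. The sketch assumes the standard bit-score relations S' = (λS - ln K) / ln 2 and E = m n 2^(-S'), which are consistent with the example's formula for S; the function names are hypothetical.

```python
import math

def bit_score(raw, lam, K):
    """Convert a raw score S into a bit score S'."""
    return (lam * raw - math.log(K)) / math.log(2)

def raw_score(bits, lam, K):
    """Inverse transformation: S = (S' * ln 2 + ln K) / lambda."""
    return (bits * math.log(2) + math.log(K)) / lam

def e_value(m, n, bits):
    """Expected number of chance matches scoring at least `bits`."""
    return m * n * 2.0 ** (-bits)

def bits_needed(m, n, e):
    """Bit score required to reach an expected count of e."""
    return math.log2(m * n / e)

if __name__ == "__main__":
    m, n = 250, 50_000_000
    print(bits_needed(m, n, 0.05))        # rounds up to about 38 bits
    print(raw_score(38, 0.3176, 0.134))   # raw score for the ungapped parameters
```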
