1 / 35

Bioinformatics: Perl Programming and Sequence Analysis Overview

Learn Perl programming, Unix basics, Viterbi algorithm, sequence analysis, Pairwise Alignment, HMMs, BLAST, and more in this course. Recommended texts included.

bmayle
Download Presentation

Bioinformatics: Perl Programming and Sequence Analysis Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 1 BNFO 601 Usman Roshan

  2. Course overview • Perl progamming language (and some Unix basics) • Unix basics • Intro Perl exercises • Dynamic programming and Viterbi algorithm in Perl • Sequence analysis • Algorithms for exact and heuristic pairwise alignment • Hidden Markov models • BLAST • Short read alignments • Whole genome sequence alignment • High performance methods for very large datasets

  3. Overview (contd) • Grade: Two 25% exams, 30% final exam, and six programming assignments (20%) • Exams will cover Perl and bioinformatics algorithms • Recommended Texts: • Introduction to Bioinformatics Algorithms by PavelPevzner and Neil Jones • Biological sequence analysis by Durbin et. al. • Introduction to Bioinformatics by Arthur Lesk • Beginning Perl for Bioinformatics by James Tisdall • Introduction to Machine Learning by EthemAlpaydin

  4. -3 mil yrs AAGACTT AAGACTT AAGACTT AAGACTT AAGACTT -2 mil yrs AAGGCTT AAGGCTT AAGGCTT AAGGCTT T_GACTT T_GACTT T_GACTT T_GACTT -1 mil yrs _GGGCTT _GGGCTT _GGGCTT TAGACCTT TAGACCTT TAGACCTT A_CACTT A_CACTT A_CACTT today _G_GCTT (Mouse) TAGGCCTT (Human) TAGCCCTTA (Monkey) A_CACTTC (Lion) A_C_CTT (Cat) GGCTT (Mouse) TAGGCCTT (Human) TAGCCCTTA (Monkey) ACACTTC (Lion) ACCTT (Cat) Nothing in biology makes sense, except in the light of evolution

  5. Representing DNA in a format manipulatable by computers • DNA is a double-helix molecule made up of four nucleotides: • Adenosine (A) • Cytosine (C) • Thymine (T) • Guanine (G) • Since A (adenosine) always pairs with T (thymine) and C (cytosine) always pairs with G (guanine) knowing only one side of the ladder is enough • We represent DNA as a sequence of letters where each letter could be A,C,G, or T. • For example, for the helix shown here we would represent this as CAGT.

  6. Transcription and translation

  7. Amino acids Proteins are chains of amino acids. There are twenty different amino acids that chain in different ways to form different proteins. For example, FLLVALCCRFGH (this is how we could store it in a file) This sequence of amino acids folds to form a 3-D structure

  8. Protein folding

  9. Protein folding • The protein folding • problem is to determine • the 3-D protein structure • from the sequence. • Experimental techniques • are very expensive. • Computational are cheap • but difficult to solve. • By comparing sequences • we can deduce the • evolutionary conserved • portions which are also • functional (most of the time).

  10. Protein structure • Primary structure: sequence of • amino acids. • Secondary structure: parts of the • chain organizes itself into alpha helices, beta sheets, and coils. Helices and sheets are usually evolutionarily conserved and can aid sequence alignment. • Tertiary structure: 3-D structure of entire chain • Quaternary structure: Complex of several chains

  11. Key points • DNA can be represented as strings consisting of four letters: A, C, G, and T. They could be very long, e.g. thousands and even millions of letters • Proteins are also represented as strings of 20 letters (each letter is an amino acid). Their 3-D structure determines the function to a large extent.

  12. Comparison of sequences • Fundamental in bioinformatics • Many applications • Evolutionary conserved regions • Functional sites in proteins • Phylogeny reconstruction • Protein structural contact prediction • Gene splicing • Genome assembly • Database search

  13. Pairwise sequence alignment • Comparison of two sequences • We want to determine the substitutions, insertions, and deletions that occurred between the two sequences

  14. Pairwise alignment • Definition of a sequence alignment

  15. Pairwise alignment

  16. Pairwise alignment • X: ACA, Y: GACAT • Match=8, mismatch=2, gap-5 ACA-- -ACA- --ACA ACA---- GACAT GACAT GACAT G—ACAT 2+2+2-5-5 -5+8+8+8-5 -5-5+2+2+2 2-5-5-5-5-5-5 Score =-4 14 -4 -28

  17. Traceback • We can compute an alignment of DNA (or protein or RNA) sequences X and Y with a traceback matrix T. • Sequence X is aligned along the rows and Y along the columns. • Each entry of the matrix T contains D, L, or U specifying diagonal, left or upper

  18. Traceback • X: ACA, Y=TACAG

  19. Traceback • X: ACA, Y=TACAG

  20. Tracebackpseudocode aligned_seq1 = "" aligned_seq2 = "" i = len(seq1) j = len(seq2) while(i !=0 or j != 0): if(T[i][j] == “L”): aligned_seq1 = “-” + aligned_seq1 aligned_seq2 = seq2[j-1] + aligned_seq2 j = j - 1 elif(T[i][j] == "U"): aligned_seq2 = "-" + aligned_seq2 aligned_seq1 = seq1[i-1] + aligned_seq1 i = i - 1 else: aligned_seq1 = seq1[i-1] + aligned_seq1 aligned_seq2 = seq2[j-1] + aligned_seq2 i = i - 1 j = j - 1

  21. Optimal alignment • An alignment can be specified by the traceback matrix. • How do we determine the traceback for the highest scoring alignment? • Needleman-Wunsch algorithm for global alignment • First proposed in 1970 • Widely used in genomics/bioinformatics • Dynamic programming algorithm

  22. Needleman-Wunsch • Input: • X = x1x2…xn, Y=y1y2…ym • (X is seq1 and Y is seq2) • Notation: • X1..i = x1x2…xi • V(X1..i,Y1..j) = Optimal alignment score of sequences X1..i and Y1..j. • Suppose we know the optimal alignment scores of • X1…i-1 and Y1…j-1 • X1…i and Y1...j-1 • X1...i-1 and Y1…j

  23. Needleman-Wunsch • Then the optimal alignment score of X1…i and Y1…j is the maximum of • V(X1…i-1,Y1…j-1) + match/mismatch • V (X1…i,Y1…j-1) + gap • V (X1…i-1,Y1…j) + gap • We build on this observation to compute V(X1...n,Y1...m)

  24. Needleman-Wunsch • Define V to be a two dimensional matrix with len(X)+1 rows and len(Y)+1 columns • Let V[i][j] be the score of the optimal alignment of X1…i and Y1…j. • Let m be the match cost, mm be mismatch, and g be the gap cost.

  25. NW pseudocode Initialization: for i = 1 to length of seq1 { V[i][0] = i*g; } For i = 1 to length of seq2 { V[0][i] = i*g; } Recurrence: for i = 1 to length of seq1{ for j = 1 to length of seq2{ V[i-1][j-1] + m(or mm) V[i][j] = max { V[i-1][j] + g V[i][j-1] + g if(maximum is V[i-1][j-1] + m(or mm)) then T[i][j] = ‘D’ else if (maximum is V[i-1][j] + g) then T[i][j] = ‘U’ else then T[i][j] = ‘L’ } }

  26. NW example Input: seq1: ACA seq2: GACAT m = 5 mm = -4 gap = -20 seq1 is lined along the rows and seq2 is along the columns V G A C A T A C A T

  27. NW parameters • How do we pick gap penalty and match/mismatch costs? • Proteins/RNA both have 3-D structure. • We can produce high quality manual alignments by hand if the structure is available. • These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

  28. Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

  29. Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively. Computer generated aligned proteins Structural alignment - example 2 Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

  30. Benchmark alignments • Protein alignment benchmarks • BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment. • Proteins benchmarks are generally large and have been in the research community for sometime now. • BAliBASE 3.0

  31. NW parameters • With a procedure called cross-validation we use the benchmark to train gap parameters so that the alignment program produces correct alignments. • How can we quantify the “correctness” of a test alignment with respect to a reference? • We measure the number of pairs in the test that are also aligned in the reference alignment. • This is also called the sum-of-pairs accuracy.

  32. NW parameters • For match and mismatch costs we use a more direct approach: log likelihood scores • We calculate probabilities from the benchmark • The same cannot be done for gap penalties because in benchmarks because of uncertainties in gap placements

  33. Biologically realistic scoring matrices • PAM and BLOSUM are most popular • PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins • BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

  34. Computing scoring matrices • Start with a set of reference alignments • Suppose we want to compute the score of A aligning to C • Count the number of times A aligns to C • Count the number of A’s and C’s • Compute pAC the probability of A aligning to C and pA and pC the background probabilities of A and C • Compute the log likelihood ratio

  35. Next week • Basics of Unix • Perl programming • Basics • Exercises • Dynamic programming solution to sequence alignment in Perl

More Related