Welcome to Introduction to Bioinformatics
• I. Scenario 4: Sequence alignment
• Bring up course web site
• Go to Scenario 4
• Open the first sequence alignment notes
Scenario 3: Our Story
You: Our first defense at CDC
Outbreak: . . . Anthrax?
Samples:
• Confirm agent
• Identify strain
Toxin gene-specific primers Scenario 3: Our Story
PCR
Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
PCR
Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene? (no product)
Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
>gi|16031490|emb|AJ413935.1|BAN413935 Bacillus anthracis partial lef gene, isolate Microsoft-6259
Length = 2417
Score = 155 bits (78), Expect = 2e-35
Identities = 138/158 (87%)
Strand = Plus / Plus
Query: 1    aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1267 aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 1326
Query: 61   tatgaaaacatgaatataaataacttaacagcaacgttaggtgccgatttagtagattcc 120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1327 tatgaaaacatgaatataaataacctaacagcaacgttaggtgccgatttagtagattcc 1386
Query: 121  acagataatacaaaaattaatcgaggtatattcaatga 158
            ||||||||||||||||||||||||||||||||||||||
Sbjct: 1387 acagataatacaaaaattaatcgaggtatattcaatga 1424
PCR Toxin gene present Scenario 3: Our Story
Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Do it!
Scenario 3: Our Story Maybe it’s not from the toxin gene??
Scenario 3: Our Story
DG47
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Translate
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGIFNEFKKNFKYSIS
Do it!
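The translation step on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the tool used in class: the function name `translate` and the compact codon-table construction are illustrative assumptions, and only the first 60 bases of DG47 (as printed in the deck) are used.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon, starting at offset `frame`.

    Codons containing unexpected letters are reported as 'X';
    a trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# First 60 bases of the DG47 PCR product, copied from the slides.
dg47_start = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
print(translate(dg47_start))  # NIDALLHQSIGSTLYNKIYL
```

The printed peptide is the start of the amino acid sequence shown on the slide, which is the sequence that was then compared against the protein database.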
DG47 nucleotide sequence: Matches nothing in GenBank DG47 amino acid sequence: 100% match to toxin gene
Scenario 3: Our Story
Compare nucleotide sequences by hand: DG47 vs lef
Do it!
Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47        1 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
DG47       61 TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
              |||||||| |||||||| |||||| | ||||||||  ||||||| |||||||| ||||||
lef gene 1891 TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC
DG47      121 ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
              || |||||||| ||||||||| ||||||| |||||||| |||||||||||||||||||||
lef gene 1951 ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT
DG47      181 AGTATTTCTA
              ||||||||||
lef gene 2011 AGTATTTCTA
89% identical!
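The by-hand comparison amounts to counting matching columns. The sketch below does that for the first 60-column row of the alignment above; the helper names `match_line` and `percent_identity` are illustrative, and the figure it prints covers this one row only, not the full 190-base comparison.

```python
def match_line(seq_a, seq_b):
    """Return the '|' / ' ' line drawn between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    return "".join("|" if a == b else " " for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Return (number of identical columns, percent identity)."""
    matches = match_line(seq_a, seq_b).count("|")
    return matches, 100.0 * matches / len(seq_a)


# First row of the hand alignment, copied from the slide.
dg47_row = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
lef_row = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

print(dg47_row)
print(match_line(dg47_row, lef_row))
print(lef_row)

matches, pct = percent_identity(dg47_row, lef_row)
print(f"{matches}/{len(dg47_row)} identical ({pct:.1f}%)")  # 52/60 identical (86.7%)
```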
Scenario 3: Our Story
Compare nucleotide sequences by hand: DG47 + lef gene
AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Sequence 1: lcl|PCR Product DG47, Length 190
Sequence 2: lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020, Length 190
No significant similarity was found
Scenario 3: Our Story
DG47        1 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
              |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene 1831 AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
89% identical!
Sequence 1: lcl|PCR Product DG47, Length 190
Sequence 2: lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020, Length 190
No significant similarity was found
Why can’t Blast figure out what you can plainly see?
Scenario 3: How does Blast work?
Clearly we need to understand more about how sequence alignment really works!
• Theory behind nucleotide vs nucleotide Blast
• Working BlastN program
• Theory behind protein-protein Blast
• How to get Blast to do what you want
“Flavours” of sequence alignment
Global Alignment
- Needleman-Wunsch algorithm
- Compares two sequences across their whole length
- Mostly useful only when you already know the sequences might be similar
- Not useful for comparing a short query to an entire genome
- Not discussed further in this class
Local Alignment
- Allows alignment of subsequences of the target and the query
- Usually what we want; the query can be searched against entire genomes or large databases
Crude Local Alignment Methods
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Represents the query and target sequences as a matrix (a two-dimensional array), using a sliding window of similarity
The human eye can powerfully distinguish the identity line from the noise
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Normally a “window size” and “stringency” are specified
e.g. if the window size is 8 and the stringency is 6, a dot is placed only if at least 6 of the current 8 positions in the query match the target
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Example: window = 2, stringency = 2
[Figure: dot matrix; axis labels as extracted: G G T A A T A G / G T A A T A]
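The window and stringency rule can be sketched as follows. This is a minimal illustration rather than the original Gibbs and McIntyre program: the function names `dot_matrix` and `render`, and the two 7-base sequences, are assumptions chosen to keep the output small.

```python
def dot_matrix(query, target, window, stringency):
    """Return the set of (query_index, target_index) positions that get a dot.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    positions starting at query[i] and target[j] are identical.
    """
    dots = set()
    for i in range(len(query) - window + 1):
        for j in range(len(target) - window + 1):
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(query, target, dots):
    """Draw the matrix as text: target across the top, query down the side."""
    lines = ["  " + " ".join(target)]
    for i, base in enumerate(query):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(target)))
        lines.append(base + " " + row)
    return "\n".join(lines)


query, target = "GGTACTA", "GGTAATA"   # illustrative sequences
dots = dot_matrix(query, target, window=2, stringency=2)
print(render(query, target, dots))
```

Every window pair is still visited, so the work grows with m*n, and nothing in the output scores the diagonal; both points come up again on the next slide.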
Problems with the Dot Matrix method • Requires human supervision! • A memory and processor time pig (a complete m*n matrix is calculated each time) • No explicit handling of gaps • No good quantitative score of alignment quality
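The Smith-Waterman Algorithm (no gaps version)
Match Extension = +1
NoMatch Penalty = -2
Negative values are reset to zero!!
    G  G  T  A  A  T  A
G   1  1  .  .  .  .  .
G   1  2  .  .  .  .  .
T   .  .  3  .  .  1  .
A   .  .  .  4  1  .  2
C   .  .  .  .  2  .  .
T   .  .  1  .  .  3  .
A   .  .  .  2  1  .  4
(Empty boxes, shown as ".", are zero.)
Download SmithWaterman1.py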
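The no-gaps scoring rule on this slide can be sketched as below. This is not the course's SmithWaterman1.py, whose contents are not shown here; the function name `sw_ungapped` and its return values are assumptions, and the two sequences are the ones on the slide's matrix.

```python
def sw_ungapped(query, target, match=1, mismatch=-2):
    """Fill the no-gaps Smith-Waterman matrix.

    Each box depends only on its upper-left neighbour: add `match` or
    `mismatch`, and reset anything negative to zero. Row 0 and column 0
    are a border of zeros. Returns (matrix, best score, position of the
    first box holding that score).
    """
    rows, cols = len(query), len(target)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0, score[i - 1][j - 1] + step)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return score, best, best_pos


query, target = "GGTACTA", "GGTAATA"   # side and top of the slide's matrix
score, best, best_pos = sw_ungapped(query, target)

print("    " + "  ".join(target))
for base, row in zip(query, score[1:]):
    print(base + "   " + "  ".join(str(v) if v else "." for v in row[1:]))
print("best score:", best, "at", best_pos)  # best score: 4 at (4, 4)
```

The same top score of 4 also appears in the last box, where the full-length diagonal recovers from the single C/A mismatch.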
Smith Waterman – Dynamic Programming An optimal alignment can be found starting from the highest scoring box and working backwards. Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution. If we incorporate gaps, we must start keeping track of this “traceback” pathway.
The Smith-Waterman Algorithm (with gaps)
Match Extension = +1
NoMatch Penalty = -2
Gap Penalty = -3
Take the max of: 0; adding a query gap; adding a target gap; match/no match
[Figure: partially filled scoring matrix, G G T A A T A across the top and G G T A C T A down the side; the box being filled in shows the candidate values 2, -2, -2]
Download SmithWaterman2.py
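A gapped version with traceback might look like the sketch below. It is not the course's SmithWaterman2.py; the function name `smith_waterman`, the move labels, and the second pair of sequences (made up so that a gap actually pays off under these penalties) are all assumptions.

```python
def smith_waterman(query, target, match=1, mismatch=-2, gap=-3):
    """Local alignment with a fixed per-position gap penalty.

    Each box takes the max of: 0, the diagonal move (match/no match),
    and the two gap moves. The move that won is recorded so the best
    alignment can be traced back from the highest-scoring box.
    Returns (best score, aligned query, aligned target).
    """
    rows, cols = len(query), len(target)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    move = [[None] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            candidates = [
                (0, None),                            # start a fresh alignment
                (score[i - 1][j - 1] + step, "diag"),  # match / no match
                (score[i - 1][j] + gap, "up"),         # query base opposite a gap
                (score[i][j - 1] + gap, "left"),       # target base opposite a gap
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: walk from the best box until a zero-score box is reached.
    i, j = best_pos
    q_aln, t_aln = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            q_aln.append(query[i - 1]); t_aln.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            q_aln.append(query[i - 1]); t_aln.append("-")
            i -= 1
        else:
            q_aln.append("-"); t_aln.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(q_aln)), "".join(reversed(t_aln))


# Slide example: with these penalties a gap never pays off.
print(smith_waterman("GGTACTA", "GGTAATA"))              # (4, 'GGTA', 'GGTA')
# Made-up pair: the target carries one extra base in the middle.
print(smith_waterman("TTGACCTGGCTA", "TTGACCATGGCTA"))   # (9, 'TTGACC-TGGCTA', 'TTGACCATGGCTA')
```

In the second call, twelve matches minus one gap penalty gives 9, which beats either six-base ungapped block on its own.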
Problems with Smith-Waterman
Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m*n matrix is still calculated each time!)
Do we really need to calculate the whole matrix?
BlastN – “word” based heuristics
Notice that in a typical S-W matrix, most of the boxes are empty!
What if we find exact matches of some seed words, then work only in the area surrounding these seeds, trying to extend the alignment?
This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (see figure on page 6 of Alignment notes)
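The seed-word idea can be sketched as an index lookup. This is a simplified illustration, not BlastN's actual implementation: the name `find_seeds` is made up, and the word sizes 8 and 11 are assumptions for demonstration, since the slides do not say which word size was in effect. Only the first 60-column row of the DG47 vs lef hand alignment is used, so the result says nothing definitive about the full 190-base comparison.

```python
from collections import defaultdict


def find_seeds(query, target, word_size):
    """Return (query_index, target_index) pairs where a word matches exactly.

    The target is indexed once by word; each query word is then looked up,
    so no m*n matrix is ever built.
    """
    index = defaultdict(list)
    for j in range(len(target) - word_size + 1):
        index[target[j:j + word_size]].append(j)

    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


# First row of the hand alignment, copied from the slides.
dg47_row = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
lef_row = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

for word_size in (8, 11):
    seeds = find_seeds(dg47_row, lef_row, word_size)
    print(f"word size {word_size}: {len(seeds)} seed(s)")
```

Within this row the longest run of consecutive matching bases is 8, so a word size of 8 finds seeds while a word size of 11 finds none, even though most columns match.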
BlastN Procedure
• Filter the query sequence for repetitive “low complexity” sequences
• Identify the subsequences of length word (the word size) in the query
• Find the exact matches in the target of all the words
• Use a modified S-W to extend the hits around the seed words
• Score and report on the best matches
More on scoring in the next class!
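The extension step can be illustrated with a much simpler stand-in than the modified S-W the slide refers to. The sketch below extends one seed along its diagonal without gaps and stops once the running score falls too far below the best seen so far. The name `extend_ungapped`, the `x_drop` cutoff of 5, and the seed chosen (the first 8 bases of the DG47 vs lef hand alignment) are all assumptions for illustration.

```python
def extend_ungapped(query, target, q_start, t_start, word_size,
                    match=1, mismatch=-2, x_drop=5):
    """Extend an exact seed in both directions along its diagonal.

    Returns (score, (query_from, query_to), (target_from, target_to)),
    with half-open index ranges. Extension in a direction stops at the
    sequence end or when the score drops more than `x_drop` below the
    best score reached in that direction.
    """
    def walk(qi, ti, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            ti += step
        return best, best_len

    right_score, right_len = walk(q_start + word_size, t_start + word_size, +1)
    left_score, left_len = walk(q_start - 1, t_start - 1, -1)

    total = word_size * match + left_score + right_score
    q_from = q_start - left_len
    q_to = q_start + word_size + right_len
    t_from = t_start - left_len
    return total, (q_from, q_to), (t_from, t_from + (q_to - q_from))


# First row of the hand alignment, copied from the slides.
dg47_row = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
lef_row = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

# Seed: the 8-base exact match at the very start of both sequences.
print(extend_ungapped(dg47_row, lef_row, q_start=0, t_start=0, word_size=8))
# (36, (0, 57), (0, 57))
```

Only the boxes along one diagonal are visited, which is the saving the seed heuristic is after; how such scores are judged significant is left to the scoring discussion.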