
Recap


Presentation Transcript


  1. Recap • Don’t forget to • pick a paper and • Email me • See the schedule to see what’s taken • http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html

  2. Agenda • Questions for you (10 minutes) • Overview (40 minutes) • chromosomes • sequence comparison • string matching • alignment • Quiz (25+ minutes)

  3. Questions for you • List two different functions performed by genes. • What is the length of the human genome? • Why is the double-helix/base-pairing so important?

  4. Questions for you • Protein sequences are composed of a chain of what? • How many different amino acids are found in proteins? • Proteins always form in a helix shape (True or False)?

  5. Questions that would stump Dr. B. • What is the lower limit on the length of a functional protein? • 10-20 • 40-50 • 60-70 • 100 • What is the upper limit on the length of proteins found in cells? • 100’s • 1000’s • 1000000’s

  6. Questions that would stump Dr. B. • What is the average length of a human gene? • 300 • 3000 • 30,000 • Approximately how many genes are in the human genome? • 400 • 4000 • 40,000 • 400,000 • 4,000,000

  7. Remember this picture? • [Figure: a DNA strand drawn as a chain of alternating Sugar and Acid units, carrying the bases A C A A T T T G]

  8. Chromosomes • DNA molecule and associated proteins • The 3,000,000,000-nucleotide human genome is divided among • 22 pairs of autosomes and • 1 pair of sex chromosomes • Together the 23 chromosome pairs carry all the hereditary information of an organism.

  9. Chromosomes

  10. DNA Sequence Comparison • Overview • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)

  11. Whole Genome Comparison • Problem: Exactly how similar are two different genomes? • Given a set of genomes • which two are most similar • which two are least similar

  12. Whole Genome Comparison • Ranking a set of genomes based on similarity gives us clues about • heredity • evolution • Similarity rank (genome pair, similarity) • G2 G5 0.99 • G3 G1 0.97 • G4 G5 0.91 • G4 G2 0.90 • G4 G1 0.80 • G4 G3 0.78 • [Figure: diagram of genomes G1, G2, G3, G4, G5]
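
The ranking on slide 12 can be read in code as the minimal sketch below. It assumes the similarity scores have already been produced by some metric (the slide does not say which one); the `scores` dictionary copies the six pairs from the slide, and the variable names are illustrative.

```python
# Similarity scores copied from the slide-12 table: (genome, genome) -> score.
# The metric that produced them is not stated on the slide.
scores = {
    ("G2", "G5"): 0.99,
    ("G3", "G1"): 0.97,
    ("G4", "G5"): 0.91,
    ("G4", "G2"): 0.90,
    ("G4", "G1"): 0.80,
    ("G4", "G3"): 0.78,
}

# Sort pairs from most similar to least similar.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

most_similar = ranking[0][0]    # pair with the highest score
least_similar = ranking[-1][0]  # pair with the lowest score

for (first, second), score in ranking:
    print(first, second, score)
```

With any similarity metric plugged in, the same sort answers both questions from slide 11: which two genomes are most similar and which two are least similar.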

  13. Whole Genome Comparison • Solution: Design a metric that quantifies similarity • something you can measure or • something you can compute • that accurately quantifies similarity

  14. Whole Genome Comparison • But what does it really mean for two genomes to be similar? • Obviously, if two genomes exactly match then they are similar • But, what’s more important • rough, overall similarity, or • exact, local similarity • A picture will explain

  15. Whole Genome Comparison • Exact matching genomes GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA

  16. Whole Genome Comparison • Rough overall similarity • 2 mismatched pairs • 4 unmatched nucleotides GCTTACTTAGACAAGTCGCTGATCATGCTATGCA GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA

  17. Whole Genome Comparison • Exact local similarities TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT CTGACTTAGACAGCTGATCGATGCTATGCAAGCT
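
One simple way to make "exact local similarity" concrete is to look for the longest run of nucleotides that appears unbroken in both sequences. The sketch below is an illustration only, not the method used in the course: it assumes a standard dynamic-programming longest-common-substring search, and the function name `longest_common_substring` is made up for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of symbols that occurs contiguously in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at the previous symbol of a and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run along the diagonal
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# The two sequences from slide 17.
s1 = "TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT"
s2 = "CTGACTTAGACAGCTGATCGATGCTATGCAAGCT"
shared = longest_common_substring(s1, s2)
print(shared, len(shared))
```

For the slide-17 pair this finds a shared 19-nucleotide block, CTTAGACAGCTGATCGATG, even though the two sequences differ at both ends.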

  18. Whole Genome Comparison • The first metric: Edit Distance • The number of edit operations needed to make the two sequences equal • Edit Distance was previously used in • Spell checkers • Approximate database searching

  19. Edit Distance • 3 edit operations • delete a symbol • insert a symbol • modify a symbol • modify = delete + insert • modify counts as two edit operations
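
The three operations on slide 19 can be sketched as a small dynamic program. This is a minimal illustration under the slide's cost model (delete = 1, insert = 1, modify = 2 because it counts as a delete plus an insert); the function name `edit_distance` and the table layout are assumptions of this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum total cost to turn a into b with delete = 1, insert = 1, modify = 2."""
    # dist[i][j] = cheapest way to turn the first i symbols of a into the first j of b
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete every symbol of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert every symbol of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            modify_cost = 0 if a[i - 1] == b[j - 1] else 2
            dist[i][j] = min(
                dist[i - 1][j] + 1,                # delete a[i-1]
                dist[i][j - 1] + 1,                # insert b[j-1]
                dist[i - 1][j - 1] + modify_cost,  # keep a match or modify
            )
    return dist[len(a)][len(b)]


print(edit_distance("CAT", "CUT"))   # one modify, counted as two operations
print(edit_distance("ACGT", "AGT"))  # one deletion
```

Because a modify costs the same as a delete followed by an insert, the result equals the number of deletions and insertions needed to make the two strings equal.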

  20. Edit Distance • What is the edit distance between these two sequences? • Note: edit distance implies the minimum number of basic edit operations needed to make the strings equal • ERICWASABIGNERD • ERICSTILLISANERD • ERICWASABIGNERD (5 deletions) • ERICSTILLISANERD (6 deletions)

  21. Edit Distance • ERICWASABIGNERD (15 symbols) • ERICSTILLISANERD (16 symbols) • ERICWASABIGNERD (5 deletions) • ERICSTILLISANERD (6 deletions) • Metrics • Matches 10 / Smaller Sequence 15 = 66% • (Symbols 31 – Edits 11) / Symbols 31 = 64%
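
The numbers on slide 21 can be reproduced with a short sketch. It assumes that "matches" means the length of the longest common subsequence of the two strings, and that the second metric is (symbols minus edits) divided by symbols; the helper name `lcs_length` and the variable names are illustrative.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence: the symbols that survive all deletions."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


a, b = "ERICWASABIGNERD", "ERICSTILLISANERD"
matches = lcs_length(a, b)      # symbols kept in both strings
symbols = len(a) + len(b)       # total symbols in the two strings
edits = symbols - 2 * matches   # deletions needed across both strings

match_metric = matches / min(len(a), len(b))
edit_metric = (symbols - edits) / symbols

print(matches, edits, symbols)
print(match_metric, edit_metric)
```

The two ratios come out at roughly 0.667 and 0.645, which the slide shows as 66% and 64%.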

  22. Edit Distance • There are problems with edit distance • It doesn’t properly reward exact local similarity • which is often a true sign of biological similarity • Similar organisms often share a lot of similar genes • But may have a few genes that don’t match at all • Biologists need a metric that can reflect this type of situation

  23. Edit Distance • Another problem • Two organisms might have almost identical DNA • Except one has extra segments • Metrics • Matches 99 / Smaller Sequence 100 = 99% • (Symbols 250 – Edits 50) / Symbols 250 = 80%

  24. Edit Distance • How is it possible that two metrics based on the same principle (edit distance) could produce such different results? • Metrics • Matches 99 / Smaller Sequence 100 = 99% • (Symbols 250 – Edits 50) / Symbols 250 = 80%
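
The arithmetic behind the two figures can be laid out as a small sketch. The function names `match_metric` and `edit_metric` are made up for this example, and the inputs are the numbers printed on the slide.

```python
def match_metric(matches: int, smaller_length: int) -> float:
    """Fraction of the shorter sequence that is matched."""
    return matches / smaller_length


def edit_metric(edits: int, total_symbols: int) -> float:
    """Fraction of all symbols, in both sequences, that did not need an edit."""
    return (total_symbols - edits) / total_symbols


print(match_metric(99, 100))  # normalised by the shorter sequence only
print(edit_metric(50, 250))   # normalised by the symbols of both sequences
```

One possible reading, offered as an assumption since the slide leaves the question open: the first metric divides by the shorter sequence, so extra segments in the longer one cost nothing, while the second charges one edit for every extra symbol.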

  25. Recall • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)

  26. Gene Search • Problem: Biologists have sequenced a brand new segment of DNA from a previously un-sequenced organism. • They want to know • Is this segment a gene? • Advantage: Genes are similar across different organisms. • Two organisms that perform the same function are likely to have nearly identical genes for it.

  27. Gene Search • Solution: • Take your newly sequenced segment • And search all the previously sequenced genomes. • Find segments (in other genomes) that highly match your segment. • Advantage: • Other genomes are marked-up • Segments that are known to be genes are labeled • If your segment matches a known gene then BAM! • You’ve found a gene in a previously un-sequenced organism.

  28. Gene Search • Obviously, you want to search for a segment that is highly similar to your target segment. • However, this type of comparison is completely different from whole genome comparison • What is the fundamental difference?

  29. Gene Search vs. Whole Genome Comparison • Whole genome comparison considers sequences in their entirety • Two sequences • Beginning to End

  30. Gene Search vs. Whole Genome Comparison • Gene search doesn’t consider the entire search sequence when evaluating similarity • Two sequences • Target (the segment you sequenced) • Search Sequence (possibly a genome)

  31. Gene Search • You want to find a sub-segment of the search sequence that highly matches the target sequence. • The entire search sequence is analyzed • But in evaluating similarity, we don’t need to consider the search sequence in its entirety • Looking for localized similarity
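
The idea of scanning the whole search sequence while scoring only one sub-segment at a time can be sketched with a sliding window. This is a deliberately naive stand-in, not the alignment technique announced for the next class: it assumes the matching sub-segment has the same length as the target and simply counts position-by-position agreements. The name `best_window` and the toy sequences are invented for this example.

```python
def best_window(target: str, search: str) -> tuple[int, int]:
    """Return (start, matches) for the window of `search` that agrees with
    `target` in the most positions; (-1, -1) if `search` is shorter than `target`."""
    n = len(target)
    best_start, best_matches = -1, -1
    for start in range(len(search) - n + 1):
        window = search[start:start + n]
        matches = sum(t == s for t, s in zip(target, window))
        if matches > best_matches:  # keep the earliest best window
            best_start, best_matches = start, matches
    return best_start, best_matches


target = "GATTACA"      # hypothetical newly sequenced segment
search = "CCGATTTCAGG"  # hypothetical search sequence
start, matches = best_window(target, search)
print(start, matches, search[start:start + len(target)])
```

Every window of the search sequence is examined, but the score of each window depends only on that window, which is the sense in which the similarity is localized.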

  32. Gene Search • How do you even know that your newly sequenced segment is a gene? • Perhaps only part of it is a gene and the rest is junk.

  33. Gene Search • Now, you are trying to find a portion of your segment that highly matches a portion of the search sequence. • Writing an algorithm to find such matches is hard

  34. Gene Search • Writing such algorithms required coordination between • Biologists • Who have some clues about true biological similarity • And Computer Scientists • Who have some clues about what problems can be solved efficiently and reliably.

  35. Recall • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)

  36. Next Class • Motif discovery (computer science perspective) • Alignment (the technique used to measure similarity) • Global alignment • Local alignment • Scoring matrices

  37. Homework • Pick a paper! Email me. • Read pages 159-172
