Recap • Don’t forget to • pick a paper and • Email me • See the schedule to see what’s taken • http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html
Agenda • Questions for you (10 minutes) • Overview (40 minutes) • chromosomes • sequence comparison • string matching • alignment • Quiz (25+ minutes)
Questions for you • List two different functions performed by genes. • What is the length of the human genome? • Why is the double-helix/base-pairing so important?
Questions for you • Protein sequences are composed of a chain of what? • How many different amino acids are found in proteins? • Proteins always form in a helix shape (True or False)?
Questions that would stump Dr. B. • What is the lower limit on the length of a functional protein? • 10-20 • 40-50 • 60-70 • 100 • What is the upper limit on the length of proteins found in cells? • 100’s • 1000’s • 1000000’s
Questions that would stump Dr. B. • What is the average length of a human gene? • 300 • 3000 • 30,000 • Approximately how many genes are in the human genome? • 400 • 4000 • 40,000 • 400,000 • 4,000,000
Remember this picture? • [Figure: DNA strand with a sugar and acid backbone carrying the bases A C A A T T T G]
Chromosomes • DNA molecule and associated proteins • The 3,000,000,000 nucleotide human genome is divided among • 22 pairs of autosomes and • 1 pair of sex chromosomes • Together the 23 pairs of chromosomes carry all the hereditary information of an organism.
DNA Sequence Comparison • Overview • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)
Whole Genome Comparison • Problem: Exactly how similar are two different genomes? • Given a set of genomes • which two are most similar • which two are least similar
Whole Genome Comparison • Ranking a set of genomes based on similarity gives us clues about • heredity • evolution • Similarity rank (pair: similarity) • G2, G5: 0.99 • G3, G1: 0.97 • G4, G5: 0.91 • G4, G2: 0.90 • G4, G1: 0.80 • G4, G3: 0.78 • [Figure: diagram relating genomes G1 through G5]
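The ranking step can be sketched in Python. This is only an illustration: the toy sequences are made up, and `similarity` is a stand-in metric (longest common subsequence length divided by the shorter sequence length), not the metric that produced the numbers on the slide.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Stand-in metric (an assumption): shared symbols over the shorter length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


def rank_pairs(genomes: dict[str, str]) -> list[tuple[str, str, float]]:
    """Score every pair of genomes and sort from most to least similar."""
    scored = [
        (n1, n2, similarity(genomes[n1], genomes[n2]))
        for n1, n2 in combinations(sorted(genomes), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy "genomes", far shorter than real ones.
toy = {
    "G1": "GCCTGACTTAGACA",
    "G2": "GCCTGACTTAGACA",
    "G3": "TTTTTTTTTTTTTT",
}
for name1, name2, score in rank_pairs(toy):
    print(name1, name2, round(score, 2))
```

The first entry of the result is the most similar pair and the last entry is the least similar pair, which answers both questions from the previous slide.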
Whole Genome Comparison • Solution: Design a metric that quantifies similarity • something you can measure or • something you can compute • that accurately quantifies similarity
Whole Genome Comparison • But what does it really mean for two genomes to be similar? • Obviously, if two genomes exactly match then they are similar • But, what’s more important • rough, overall similarity, or • exact, local similarity • A picture will explain
Whole Genome Comparison • Exact matching genomes GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA
Whole Genome Comparison • Rough overall similarity GCTTACTTAGACAAGTCGCTGATCATGCTATGCA GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA • 2 mismatched pairs • 4 unmatched nucleotides
Whole Genome Comparison • Exact local similarities TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT CTGACTTAGACAGCTGATCGATGCTATGCAAGCT
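One simple way to look for exact local similarity is to find the longest run of symbols that appears in both sequences. The Python sketch below assumes that "longest common substring" is a reasonable stand-in for what the slide's picture highlights; the function name is hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of consecutive symbols found in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous symbol of both strings.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


s1 = "TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT"
s2 = "CTGACTTAGACAGCTGATCGATGCTATGCAAGCT"
shared = longest_common_substring(s1, s2)
print(shared, len(shared))
```

The two sequences on the slide share the 19-nucleotide run CTTAGACAGCTGATCGATG even though their beginnings and ends differ.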
Whole Genome Comparison • The first metric: Edit Distance • The number of edit operations needed to make the two sequences equal • Edit Distance was previously used in • Spell checkers • Approximate database searching
Edit Distance • 3 edit operations • delete a symbol • insert a symbol • modify a symbol • modify = delete + insert • modify counts as two edit operations
Edit Distance • What is the edit distance between these two sequences? • Note: edit distance means the minimum number of basic edit operations needed to make the strings equal • ERICWASABIGNERD • ERICSTILLISANERD • ERICWASABIGNERD (5 deletions) • ERICSTILLISANERD (6 deletions)
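A minimal dynamic-programming sketch of this definition in Python, assuming the cost model from the slides: insert and delete cost one operation each, and a modify is a delete plus an insert, so it costs two. The function name is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-symbol inserts and deletes that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # a[:i] against the empty string: i deletions
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1])  # symbols agree, no edit needed
            else:
                # Either delete x from a, or insert y into a.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(edit_distance("ERICWASABIGNERD", "ERICSTILLISANERD"))  # 11
```

The result, 11, agrees with the 5 + 6 deletions listed on the slide: deleting 5 symbols from the first string and 6 from the second leaves the same 10-symbol string.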
Edit Distance • ERICWASABIGNERD (15 symbols) • ERICSTILLISANERD (16 symbols) • ERICWASABIGNERD (5 deletions) • ERICSTILLISANERD (6 deletions) • Metrics • Matches 10 / Smaller Sequence 15 ≈ 67% • (Symbols 31 - Edits 11) / Symbols 31 ≈ 65%
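Both metrics can be computed from the number of matching symbols. The Python sketch below assumes that "matches" means the length of the longest common subsequence and that edits are inserts and deletes only; the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_metrics(a: str, b: str) -> tuple[float, float]:
    """Return (matches / smaller length, (symbols - edits) / symbols)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    matches = lcs_length(a, b)
    symbols = len(a) + len(b)
    edits = symbols - 2 * matches  # every unmatched symbol is one insert or delete
    return matches / min(len(a), len(b)), (symbols - edits) / symbols


by_smaller, by_total = similarity_metrics("ERICWASABIGNERD", "ERICSTILLISANERD")
print(f"{by_smaller:.1%} {by_total:.1%}")
```

For the two example strings this gives 10/15 for the first metric and 20/31 for the second.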
Edit Distance • There are problems with edit distance • It doesn’t properly reward exact local similarity • which is often a true sign of biological similarity • Similar organisms often share a lot of similar genes • But may have a few genes that don’t match at all • Biologists need a metric that can reflect this type of situation
Edit Distance • Another problem • Two organisms might have almost identical DNA • Except one has extra segments • Metrics • Matches 99 / Smaller Sequence 100 = 99% • (Symbols 250 - Edits 50) / Symbols 250 = 80%
Edit Distance • How is it possible that two metrics based on the same principle (edit distance) could produce such different results? • Metrics • Matches 99 / Smaller Sequence 100 = 99% • (Symbols 250 - Edits 50) / Symbols 250 = 80%
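A small Python sketch makes the gap between the two metrics concrete. The function names and the scenario in the loop (a shared 100-symbol region plus a growing extra segment in one organism) are assumptions for illustration, not data from the slides.

```python
def match_ratio(matches: int, smaller_len: int) -> float:
    """First metric: matches divided by the length of the smaller sequence."""
    return matches / smaller_len


def edit_ratio(edits: int, symbols: int) -> float:
    """Second metric: fraction of all symbols that did not need an edit."""
    return (symbols - edits) / symbols


# The figures from the slide.
print(match_ratio(99, 100), edit_ratio(50, 250))

# Hypothetical scenario: both sequences share the same 100 symbols,
# and one of them carries `extra` additional symbols that must be deleted.
for extra in (0, 50, 150, 900):
    symbols = 100 + (100 + extra)
    print(extra, match_ratio(100, 100), round(edit_ratio(extra, symbols), 2))
```

The first metric divides by the smaller sequence only, so extra segments in the larger sequence never lower it. The second metric divides by all symbols, so every extra symbol counts against the score.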
Recall • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)
Gene Search • Problem: Biologists have sequenced a brand new segment of DNA from a previously un-sequenced organism. • They want to know • Is this segment a gene? • Advantage: Genes are similar across different organisms. • Two organisms that perform the same function are likely to have nearly identical genes for it.
Gene Search • Solution: • Take your newly sequenced segment • And search all the previously sequenced genomes. • Find segments (in other genomes) that highly match your segment. • Advantage: • Other genomes are marked-up • Segments that are known to be genes are labeled • If your segment matches a known gene then BAM! • You’ve found a gene in a previously un-sequenced organism.
Gene Search • Obviously, you want to search for a segment that is highly similar to your target segment. • However, this type of comparison is completely different from whole genome comparison • What is the fundamental difference?
Gene Search vs. Whole Genome Comparison • Whole genome comparison considers sequences in their entirety • Two sequences • Beginning to End
Gene Search vs. Whole Genome Comparison • Gene search doesn’t consider the entire search sequence when evaluating similarity • Two sequences • Target (the segment you sequenced) • Search Sequence (possibly a genome)
Gene Search • You want to find a sub-segment of the search sequence that highly matches the target sequence. • The entire search sequence is analyzed • But in evaluating similarity, we don’t need to consider the search sequence in its entirety • Looking for localized similarity
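One way to sketch this search in Python is a variant of the edit-distance table in which a match may start anywhere in the search sequence at no cost. The function name and the cost model (inserts and deletes cost one each, as on the earlier edit-distance slides) are assumptions, and real gene-search tools use more elaborate scoring.

```python
def best_submatch(target: str, search: str) -> tuple[int, int, int]:
    """Return (edits, start, end) such that search[start:end] is a sub-segment
    with the fewest insert/delete edits relative to the whole target."""
    # Row 0: an empty target prefix matches an empty segment starting at any j.
    prev = [(0, j) for j in range(len(search) + 1)]
    for i, t in enumerate(target, start=1):
        curr = [(i, 0)]  # target[:i] against an empty segment: i deletions
        for j, s in enumerate(search, start=1):
            candidates = [
                (prev[j][0] + 1, prev[j][1]),          # drop t from the target
                (curr[j - 1][0] + 1, curr[j - 1][1]),  # absorb s as an extra symbol
            ]
            if t == s:
                candidates.append(prev[j - 1])         # symbols agree, no edit
            curr.append(min(candidates))
        prev = curr
    # The match may also end anywhere, so scan the last row for the lowest cost.
    end = min(range(len(search) + 1), key=lambda j: prev[j][0])
    edits, start = prev[end]
    return edits, start, end


edits, start, end = best_submatch("GATTACA", "CCCGATACATTT")
print(edits, "CCCGATACATTT"[start:end])
```

The whole search sequence is scanned, but only the best sub-segment contributes to the score, which is the localized similarity described above.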
Gene Search • How do you even know that your newly sequenced segment is a gene? • Perhaps only part of it is a gene and the rest is junk.
Gene Search • Now, you are trying to find a portion of your segment that highly matches a portion of the search sequence. • Writing an algorithm to find such matches is hard
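A classic technique for matching a portion of one sequence against a portion of another is local alignment in the style of Smith-Waterman. The Python sketch below is illustrative only: the scores (match +2, mismatch -1, gap -1) are arbitrary assumptions, and the function name is hypothetical.

```python
def local_alignment(a: str, b: str, match: int = 2,
                    mismatch: int = -1, gap: int = -1) -> tuple[int, str, str]:
    """Return (score, segment of a, segment of b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below zero: a bad stretch simply ends the alignment.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:best_pos[0]], b[j:best_pos[1]]


# Hypothetical example: only the middle of each sequence is shared.
print(local_alignment("XXXGATTACAYYY", "ZZGATTACAWW"))
```

Neither sequence has to be used in full: the unrelated symbols on both sides of the shared region do not affect the score.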
Gene Search • Writing such algorithms requires coordination between • Biologists • Who have some clues about true biological similarity • And Computer Scientists • Who have some clues about what problems can be solved efficiently and reliably.
Recall • There are 3 different types of comparisons that are important • Whole genome comparison • Gene search • Motif discovery (shared pattern discovery)
Next Class • Motif discovery (computer science perspective) • Alignment (the technique used to measure similarity) • Global alignment • Local alignment • Scoring matrices
Homework • Pick a paper! Email me. • Read pages 159-172