Bioinformatics PhD Course. Summary (approximate) • 1. Biological introduction • 2. Comparison of short sequences (<10,000 bp) • 3. Comparison of large sequences (up to 250,000,000 bp) • 4. Sequence assembly • 5. Efficient data search structures and algorithms • 6. Proteins...
2. Comparison of short sequences (<10,000 bp). Summary (more or less) • 2.1 Dot matrix • 2.2 Pairwise alignment • 2.3 Hash algorithms • 2.4 Multiple alignment
2.1 Dot matrix. Given two sequences S1 and S2, how can we analyse their degree of identity? By searching for the parts that match: place S1 along one axis (position x) and S2 along the other (position y), and set cell (x, y) to 1 if both characters coincide and to 0 otherwise.
2.1 Dot matrix. Example sequences: accaccacaccacaacgagcata… and acctgagcgatat. With L = window length, a cell can be defined in three ways: • m(i,j)=1 iff S1(i..i+L-1)=S2(j..j+L-1): exact matching • m(i,j)=1 iff at least k of the L positions coincide: approximate matching • m(i,j)=k, where k is the number of the L positions that coincide: approximate matching. What is the cost of the algorithm? When are the matchings relevant?
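The three cell definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's own implementation: the function name `dot_matrix`, the parameter `min_matches` (standing in for the threshold k), and the toy sequences in the usage note are all assumptions made for the example.

```python
def dot_matrix(s1, s2, window, min_matches=None):
    """Windowed dot matrix of s1 against s2.

    Hypothetical helper, not from the slides.
    - min_matches=None: m[i][j] is the number of coinciding positions
      in the two windows (the m(i,j)=k variant).
    - min_matches=t: m[i][j] is 1 if at least t positions coincide,
      else 0. Using t == window gives exact matching.
    """
    rows = len(s1) - window + 1
    cols = len(s2) - window + 1
    if rows <= 0 or cols <= 0:
        return []  # a sequence is shorter than the window
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Compare the two windows position by position: O(window) work.
            k = sum(1 for d in range(window) if s1[i + d] == s2[j + d])
            if min_matches is None:
                m[i][j] = k
            else:
                m[i][j] = 1 if k >= min_matches else 0
    return m
```

For the made-up inputs `dot_matrix("acca", "acct", 3)` the counts variant gives `[[3, 1], [1, 2]]`; passing `min_matches=3` keeps only the exact match at cell (0, 0). The two nested loops plus the inner window comparison are what the cost question refers to.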
2.1. Dot matrix: algorithm cost. Comparing every window of S1 with every window of S2 takes length(S1) * length(S2) * L character comparisons, in other words $O(n^2 L)$. • Is a cost of length(S1) * length(S2) possible? • Can we also say that $O(n^2)$ is independent of L?
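One possible answer to the two questions, offered as a sketch rather than as the course's intended solution: consecutive windows along a diagonal share all but one character pair, so the count for cell (i+1, j+1) can be derived from cell (i, j) in constant time. The function name `dot_matrix_fast` is an assumption for this example.

```python
def dot_matrix_fast(s1, s2, window):
    """Match counts per window, updated incrementally along each diagonal.

    Hypothetical helper, not from the slides. Returns the same counts as
    a brute-force comparison of every pair of windows.
    """
    rows = len(s1) - window + 1
    cols = len(s2) - window + 1
    if rows <= 0 or cols <= 0:
        return []
    m = [[0] * cols for _ in range(rows)]
    # Every diagonal starts on the first column or on the first row.
    starts = [(i, 0) for i in range(rows)] + [(0, j) for j in range(1, cols)]
    for i0, j0 in starts:
        # First window of the diagonal: counted in full, O(window).
        k = sum(1 for d in range(window) if s1[i0 + d] == s2[j0 + d])
        m[i0][j0] = k
        i, j = i0 + 1, j0 + 1
        while i < rows and j < cols:
            # Slide one step: remove the pair that leaves the window,
            # add the pair that enters it. O(1) per cell.
            k -= int(s1[i - 1] == s2[j - 1])
            k += int(s1[i + window - 1] == s2[j + window - 1])
            m[i][j] = k
            i += 1
            j += 1
    return m
```

With n the sequence length, there are about 2n diagonals, each paying O(L) once, and every remaining cell costs O(1). The total is O(n^2 + nL), which is O(n^2) whenever L is at most n.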
2.1. Dot matrix: signals. [Figure panels: A: transposons; B: S1 = S2; C: random.] When are signals statistically significant?
2.1. Dot matrix: statistical significance. Given the window length L, we need to define a random model against which to compare the signals. Define the random variable X as the number of characters that coincide within a window, and let p be the probability that two characters coincide by chance. Then $\mathrm{Prob}(X = k) = \binom{L}{k}\, p^{k} (1-p)^{L-k}$. What is its expected value?
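Since X follows a binomial distribution with parameters L and p, its expected value is E[X] = L p. The sketch below checks this numerically and also computes the tail probability Prob(X >= k), which is one way to judge whether a window with k matches stands out from the random model. The function names are made up for the example, and the value p = 0.25 in the usage note is an assumption (four equally likely nucleotides), not a figure from the slides.

```python
from math import comb

def match_count_pmf(window, k, p):
    """Prob(X = k): exactly k of `window` positions coincide by chance."""
    return comb(window, k) * p**k * (1 - p) ** (window - k)

def expected_matches(window, p):
    """E[X] summed directly from the distribution; equals window * p."""
    return sum(k * match_count_pmf(window, k, p) for k in range(window + 1))

def tail_probability(window, k, p):
    """Prob(X >= k): chance of seeing at least k matches in a random window."""
    return sum(match_count_pmf(window, j, p) for j in range(k, window + 1))
```

For example, with `window=10` and the assumed `p=0.25`, `expected_matches(10, 0.25)` agrees with 10 * 0.25 = 2.5 up to floating-point error, and `tail_probability(10, 8, 0.25)` gives the chance that a random window shows 8 or more matches.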