
Bioinformatics PhD. Course

Bioinformatics PhD. Course. Summary (approximate): 1. Biological introduction. 2. Comparison of short sequences (< 10,000 bp). 3. Comparison of large sequences (up to 250,000,000 bp). 4. Sequence assembly. 5. Efficient data search structures and algorithms. 6. Proteins.

kateb

Presentation Transcript


  1. Bioinformatics PhD. Course Summary (approximate) • 1. Biological introduction • 2. Comparison of short sequences (< 10,000 bp) • 3. Comparison of large sequences (up to 250,000,000 bp) • 4. Sequence assembly • 5. Efficient data search structures and algorithms • 6. Proteins...

  2. 3. Comparison of large sequences Summary (more or less) • 3.1 Overview • 3.2 Suffix trees • 3.3 MUMs

  3. Sequence assembly It has two applications: • DNA sequencing: determining the bases of a DNA sequence. • EST assembly: using mRNA fragments to find the genes expressed in a cell.

  4. DNA sequencing Techniques employed: • Hybridization: finds the tuples of a given length that occur in a sequence. • Shotgun: breaks a sequence into small pieces.

  6. Hybridization Imagine we want to determine the sequence xxxxxxxxxxxxx and we know that it contains the following triplets: AAC GAT TGC ACG CGG GCC TTG GGA ATT How can the sequence be established?

  7. Hybridization We create a graph based on suffix-prefix overlaps AAC GAT TGC ACG CGG GCC TTG GGA ATT The sequence is deduced following the path in the graph AACGGATTGCC What is the cost of finding the path?
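The graph construction on this slide can be sketched in a few lines. This is only an illustration under a simplifying assumption: every triplet has at most one successor, which holds for the nine triplets shown but not in general. The function names `overlap_graph` and `follow_path` are hypothetical.

```python
def overlap_graph(tuples):
    """Link a -> b when the last L-1 letters of a equal the first L-1 letters of b."""
    return {a: [b for b in tuples if a != b and a[1:] == b[:-1]] for a in tuples}


def follow_path(graph):
    """Spell the sequence by walking the graph (assumes a single unbranched chain)."""
    targets = {b for successors in graph.values() for b in successors}
    starts = [a for a in graph if a not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one node without predecessor")
    node = starts[0]
    sequence, visited = node, {node}
    while graph[node]:
        if len(graph[node]) != 1:
            raise ValueError("branching node: a simple walk is not enough")
        node = graph[node][0]
        if node in visited:
            raise ValueError("cycle found")
        visited.add(node)
        sequence += node[-1]  # each step contributes one new base
    return sequence


triplets = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(follow_path(overlap_graph(triplets)))  # AACGGATTGCC
```

Building the graph this way compares every pair of tuples, so it is quadratic in the number of tuples; the walk itself is linear once the graph exists.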

  8. Hybridization Let us consider a more realistic case: AAC CAA GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT In the general case we must find a Hamiltonian path (NP-complete). What is the cost of the entire hybridization technique?

  9. Hybridization technique: Cost: 1. Find the L-tuples AAC, CAA, ACG,...: all $4^L$ possible tuples are constructed and searched. 2. Find the overlaps AAC ACA,...: if there are m pieces of length L, then there are $O(m^2 L^2)$ comparisons. 3. Create the graph and find the Hamiltonian path: NP-complete.
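When nodes can have several successors, a general search is needed. The following is a minimal backtracking sketch, not an efficient method: its worst-case running time grows exponentially with the number of nodes, which is the practical meaning of NP-completeness here. It is exercised on the nine triplets AAC GAT TGC ACG CGG GCC TTG GGA ATT as test data; `hamiltonian_path` and `spell` are hypothetical names.

```python
def hamiltonian_path(nodes, edges):
    """Return a path that visits every node exactly once, or None if there is none."""
    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in edges[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None  # dead end: backtrack

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Merge consecutive tuples that overlap in all but one letter."""
    return path[0] + "".join(t[-1] for t in path[1:])


triplets = ["GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT", "AAC"]
edges = {a: [b for b in triplets if a != b and a[1:] == b[:-1]] for a in triplets}
path = hamiltonian_path(triplets, edges)
print(spell(path) if path else "no Hamiltonian path")
```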
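Step 1 can be simulated in software to see where the $4^L$ factor comes from. The sketch below assumes the sequence is known (in the laboratory the probes are tested against the unknown sample instead); `find_tuples` is a hypothetical helper.

```python
from itertools import product


def find_tuples(sequence, L):
    """Build every one of the 4**L possible tuples and keep those found in the sequence."""
    probes = ("".join(letters) for letters in product("ACGT", repeat=L))
    return {probe for probe in probes if probe in sequence}


found = find_tuples("AACGGATTGCC", 3)
print(len(found), "of", 4 ** 3, "probes hybridize:", sorted(found))
```

The result is a set, so a tuple that occurs twice in the sequence is reported only once; this loss of multiplicity is one source of ambiguity when the sequence is rebuilt.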

  10. Note • Linear cost, $O(m)$: m → t = 1 ms; 10m → 10t = 10 ms; 1000m → 1000t = 1 s • Quadratic cost, $O(m^2)$: m → t = 1 ms; 10m → 100t = 100 ms; 1000m → 1,000,000t ≈ 16 min • Exponential cost, $O(2^m)$: m → t = 1 ms; 10m → $2^{10}$ t ≈ 1 s; 1000m → $2^{1000}$ t ≈ $10^{301}$ t ≈ $3 \times 10^{290}$ years
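The arithmetic in this note can be checked with a short script. It assumes, as the note does, a base time of 1 ms and the scaling factors k, k^2 and 2^k for an input k times larger; `GROWTH` and `time_ms` are hypothetical names.

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365  # milliseconds in a 365-day year

GROWTH = {
    "linear": lambda k: k,
    "quadratic": lambda k: k ** 2,
    "exponential": lambda k: 2 ** k,
}


def time_ms(kind, k, base_ms=1):
    """Time in ms for an input k times larger, if the base input takes base_ms."""
    return GROWTH[kind](k) * base_ms


for kind in GROWTH:
    t10, t1000 = time_ms(kind, 10), time_ms(kind, 1000)
    print(f"{kind:12s} k=10: {t10} ms   k=1000: {t1000 / MS_PER_YEAR:.1e} years")
```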

  11. Hybridization technique: Cost: 1. Find the L-tuples AAC, CAA, ACG,...: all $4^L$ possible tuples are constructed and searched. 2. Find the overlaps AAC ACA,...: if there are m pieces of length L, then there are $O(m^2 L^2)$ comparisons. 3. Create the graph and find the Hamiltonian path: NP-complete. How can we avoid NP-completeness?

  12. Hybridization: two reductions AAC GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT • Triplets as nodes: find the Hamiltonian path (NP-complete) • Pairs as nodes (AA AC CG GG GA TG GC TT CC AT) and triplets as edges: find the Eulerian path (linear)

  13. Hybridization: Eulerian path Finding the Eulerian path of a graph: • Unbalanced nodes: in-degree ≠ out-degree (starting or ending nodes) • Balanced nodes: in-degree = out-degree (traversal nodes)

  14. Hybridization: Eulerian path Algorithm: Create a random path from a starting node to an ending node Add circuits at balanced nodes
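One way to realize this algorithm is a stack-based walk in the style of Hierholzer's algorithm; that this is the formulation the slide intends is an assumption. In the sketch, nodes are the (L-1)-letter prefixes and suffixes and every L-tuple is an edge, and the input is the nine-triplet example AAC GAT TGC ACG CGG GCC TTG GGA ATT. The name `eulerian_path` is hypothetical.

```python
from collections import defaultdict


def eulerian_path(tuples):
    """Spell a sequence that uses every tuple exactly once as an edge."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for t in tuples:
        prefix, suffix = t[:-1], t[1:]
        out_edges[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    unbalanced = {node: b for node, b in balance.items() if b != 0}
    if sorted(unbalanced.values()) not in ([], [-1, 1]):
        raise ValueError("no Eulerian path: too many unbalanced nodes")
    starts = [node for node, b in unbalanced.items() if b == 1]
    start = starts[0] if starts else next(iter(out_edges))

    # Walk until stuck, then back up; circuits get spliced in at balanced nodes.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(tuples) + 1:
        raise ValueError("no Eulerian path: graph is not connected")
    return path[0] + "".join(node[-1] for node in path[1:])


triplets = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(eulerian_path(triplets))  # AACGGATTGCC
```

Each edge is pushed and popped once, so the walk is linear in the number of tuples.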

  16. Hybridization technique: Cost: 1. Find the L-tuples AAC, CAA, ACG,...: all $4^L$ possible tuples are constructed and searched. 2. Find the overlaps AAC ACA,...: if there are m pieces of length L, then there are $O(m^2 L^2)$ comparisons. 3. Create the graph and find the Eulerian path: linear. What is the limiting factor?

  17. Hybridization: limitations of the technique AAC CAA GAT TGC ACG CGG GCC TTG GGA ATT GAC Repeated fragments: CAACGGATTGCC versus CAACGGACGGATTGCC (in the second sequence ACG, CGG and GGA each occur twice). What is the probability that a fragment repeats?

  18. Hybridization We estimate the probability that a fragment repeats. Model: a random sequence of length N with uniformly distributed bases (probability 1/4 each). • Given 2 fragments, the probability that they are identical is $4^{-L}$ • Given 3 fragments, the probability that two of them are identical is approximately $\binom{3}{2} 4^{-L}$ • Given m fragments, the probability that two of them are identical is approximately $\binom{m}{2} 4^{-L}$ If L = 8 and we want this probability to be 1%, then m ≈ 36. Conclusion: the hybridization technique can only be applied to short sequences.
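The estimate can be evaluated numerically. This sketch takes the pair-count expression $\binom{m}{2} 4^{-L}$ at face value (it is an approximation that is only meaningful while the value stays well below 1) and searches for the largest m that keeps it under a threshold; `repeat_probability` and `max_fragments` are hypothetical names.

```python
from math import comb


def repeat_probability(m, L):
    """Pair-count approximation C(m, 2) * 4**-L for m random fragments of length L."""
    return comb(m, 2) * 4.0 ** -L


def max_fragments(L, threshold):
    """Largest m (at least 2) whose approximate repeat probability stays within threshold."""
    m = 2
    while repeat_probability(m + 1, L) <= threshold:
        m += 1
    return m


m = max_fragments(8, 0.01)
print(m, f"{repeat_probability(m, 8):.4%}")
```

For L = 8 and a 1% threshold this gives a few dozen fragments, which illustrates why the technique is limited to short sequences.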

  19. Excursion: the equiprobability hypothesis Chromosome 21 has about 34 Mb, distributed as A: 30% C: 20% G: 20% T: 20%, and if we consider pairs of bases, for example, AA: 10% AC: 5% To what extent are sequences equiprobable?
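Such frequencies are easy to measure on any sequence. The sketch below counts overlapping k-letter words and compares them with the value $4^{-k}$ expected under equiprobability; the short input string is only a toy example, and `kmer_frequencies` is a hypothetical helper.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-letter word, counted with overlaps."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


toy = "AACGGATTGCC"
for k in (1, 2):
    observed = kmer_frequencies(toy, k)
    print(k, "expected", 4.0 ** -k, "observed", observed)
```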

  20. DNA sequencing Techniques available: • Hybridization: reveals which words of a fixed length occur in the sequence. • Shotgun: fires at the sequence and breaks it into pieces.

  21. Shotgun Imagine we want to determine the sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx and our technique allows us to: • copy it • split it at random into pieces of different lengths, without knowing their order What can we do?

  22. Shotgun: algorithm Imagine xxxxx|xxxxxxx|xxxxxxx|xxxx xxxxxxxx|xxxxxx|xxxxxx|xxx xxxx|xxxxxx|xxxxxx|xxxxxxx The algorithm is: 1. Compare all pieces pairwise to find out how they overlap (removing inclusions). 2. Build the suffix-prefix graph. 3. Find the path.

  23. Shotgun We copy it three times xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx and obtain the pieces accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt

  24. Shotgun The pieces must be compared to see which ones join suffix to prefix • Directly with dynamic programming (quadratic cost: all against all, and most pairs will not join) • In two steps: • Detect the pairs that join (linear cost with the hash algorithm) • Apply dynamic programming only to the pairs that join
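The two-step idea can be sketched with an ordinary hash table. The version below makes two assumptions that are not in the slide: pieces are error-free, so an exact string check stands in for the dynamic-programming step, and only overlaps of at least k letters are reported. `suffix_prefix_overlaps` is a hypothetical name.

```python
from collections import defaultdict


def suffix_prefix_overlaps(pieces, k=3):
    """Map (a, b) to the longest overlap of at least k letters between a's suffix and b's prefix."""
    # Step 1: hash every piece by its first k letters.
    by_prefix = defaultdict(list)
    for piece in pieces:
        if len(piece) >= k:
            by_prefix[piece[:k]].append(piece)

    # Step 2: look up each k-letter word of a; verify only the candidates returned.
    overlaps = {}
    for a in pieces:
        for i in range(1, len(a) - k + 1):  # i = start of a proper suffix of a
            for b in by_prefix.get(a[i:i + k], []):
                if a != b and (a, b) not in overlaps and b.startswith(a[i:]):
                    overlaps[(a, b)] = len(a) - i
    return overlaps


pieces = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
          "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
for (a, b), n in sorted(suffix_prefix_overlaps(pieces).items()):
    print(f"{a} -> {b} ({n})")
```

Each table lookup takes constant expected time, so the detection step touches each letter of each piece a bounded number of times instead of comparing all pairs of pieces.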

  25. Excursion: the hash algorithm

  26. Shotgun tacctt accttta tttaac taacga accgtacc acgatac accgt accg gataca tacaggt We build the graph (quadratic cost) and obtain the sequence (exponential cost): accgtacctttaacgatacaggt
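As an illustration, the twelve pieces accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt can be assembled with a greedy heuristic. Note that this is a swapped-in technique, not the exhaustive path search whose cost is exponential: it repeatedly merges the pair with the longest overlap, which is fast but can fail on repeats. The minimum overlap of 3 and the names `overlap` and `greedy_assemble` are assumptions.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(pieces, min_overlap=3):
    """Merge the best-overlapping pair until no overlap of min_overlap letters remains."""
    unique = set(pieces)
    # Remove inclusions: pieces fully contained in another piece.
    contigs = [p for p in unique if not any(p != q and p in q for q in unique)]
    while len(contigs) > 1:
        k, a, b = max((overlap(a, b), a, b)
                      for a in contigs for b in contigs if a != b)
        if k < min_overlap:
            break
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[k:])
    return contigs


pieces = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
          "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
print(greedy_assemble(pieces))  # ['accgtacctttaacgatacaggt']
```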

  27. Shotgun: problems xxxxx xxxxxx xxxxxx xxxxxx xxxxxxxx xxxxxxx accgt xxxxxxx accg xxxxxxx Problems: • Consecutive repeats • Short distant repeats • Lack of coverage (sequencing problems) • Errors in the pieces (sequencing problems)

  28. Shotgun: coverage properties Important questions: • What percentage of the sequence is covered? • What is the expected number of contigs? • What is the average length of the contigs? Let us study the coverage:

  29. Shotgun: coverage percentage What percentage of the sequence is covered? Take a sequence of length L and N segments of length d, and assume that the segments are uniformly distributed. The probability that a base of the sequence is covered by k segments is given by the binomial distribution B(N, d/L): $\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$ Degree of coverage of the sequence: $N d / L$

  30. Excursion: binomial distribution We have two urns, with probabilities p and 1-p that a ball falls into each. What is the probability that, out of n balls, k fall into the first urn? Binomial distribution B(n, p): $\mathrm{Prob}\{X = k\} = \binom{n}{k} p^k (1-p)^{n-k}$

  31. Excursion: Poisson distribution What is the limit of the binomial distribution as $n \to \infty$ and $p \to 0$ while the product $np = \lambda$ stays constant? The Poisson distribution P($\lambda$): $\mathrm{Prob}\{X = k\} = e^{-\lambda} \lambda^k / k!$ (proof in class) Hence the probability that at least one ball falls is $\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda}$
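The limit can be observed numerically without the proof. This sketch holds $\lambda = np$ fixed at 5 (an arbitrary choice) and lets n grow; the binomial probabilities approach the Poisson value. `binomial_pmf` and `poisson_pmf` are hypothetical names.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected number is lam."""
    return exp(-lam) * lam ** k / factorial(k)


lam, k = 5.0, 3
for n in (10, 100, 1_000, 10_000):
    print(n, binomial_pmf(k, n, lam / n), poisson_pmf(k, lam))
```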

  32. Shotgun: coverage percentage As $N \to \infty$ and $d/L \to 0$, the binomial distribution B(N, d/L) tends to the Poisson distribution P(N d / L). The coverage percentage is then the probability that at least one piece covers each point: $1 - e^{-N d / L}$ • For 99% coverage we need N d / L = 4.6 • For 99.9% coverage we need N d / L = 6.9
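Both directions of this calculation fit in a few lines, using the Poisson approximation with N pieces of length d on a sequence of length L: the expected covered fraction is $1 - e^{-N d / L}$, and inverting it gives the required $N d / L = -\ln(1 - f)$ for a target fraction f. The function names are hypothetical.

```python
from math import exp, log


def covered_fraction(N, d, L):
    """Expected fraction of the sequence covered by at least one piece."""
    return 1 - exp(-N * d / L)


def required_depth(fraction):
    """Value of N*d/L needed to reach the given covered fraction."""
    return -log(1 - fraction)


for target in (0.99, 0.999):
    print(f"{target:.1%} coverage needs N d / L = {required_depth(target):.1f}")
```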

  33. EST assembly We have thousands of pieces about 500 bases long, belonging to different … The algorithm is: 1. Compare all pieces pairwise to find out which are related (removing inclusions). 2. Build the suffix-prefix graph (many small graphs result). 3. Find the path.
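The "many small graphs" of step 2 are the connected components of the overlap relation. A minimal sketch with a union-find structure follows; the four reads are invented toy data, exact overlaps of at least 3 letters are an assumption, and `overlaps` and `clusters` are hypothetical names.

```python
def overlaps(a, b, min_len=3):
    """True if some suffix of a with at least min_len letters is a prefix of b."""
    return any(a.endswith(b[:k]) for k in range(min_len, min(len(a), len(b)) + 1))


def clusters(pieces, min_len=3):
    """Group pieces into connected components of the overlap graph."""
    parent = {p: p for p in pieces}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a in pieces:
        for b in pieces:
            if a != b and overlaps(a, b, min_len):
                parent[find(a)] = find(b)

    groups = {}
    for p in pieces:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())


reads = ["accgtacc", "tacctt", "ggatcga", "tcgaaa"]  # toy data, two unrelated groups
print(clusters(reads))
```

Each component can then be assembled on its own, which keeps the path search small.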
