
Hashing Algorithm and its Applications in Bioinformatics




Presentation Transcript


  1. Hashing Algorithm and its Applications in Bioinformatics By Zemin Ning Informatics Division The Wellcome Trust Sanger Institute

  2. Outline of the Talk: • Research Background • SSAHA – The Fastest Sequence Search Engine - Hash table; - Sequence search based on the hash table; - Various applications. • Euler Path – consensus generation - Euler Path; - Consensus generation; - SNP calling. • Phusion – the WGS assembler: - Phusion pipeline; - Reads grouping; - Applications. • Current Research

  3. Powder Simulation

  4. Hair Dynamics. Genetics and Human Hair Structure: East Asian, Caucasian, African

  5. Sequence Search and Alignment • Algorithms - Dynamic programming; - Suffix tree; - Hash method; - … • Software tools - FASTA; - BLAST; - Cross_Match; - Blat; - … • CPU vs Memory

  6. Objectives: With the SSAHA algorithm, we aim to achieve the following objectives: (i) To develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy; (ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection; (iii) To provide possible tools for sequence analysis based on the search engine.

  7. Automatic Sequencing ATGCAGGTCC …….

  8. SSAHA Index: Sequence Representation. Sequence S: (s_1 s_2 … s_i … s_m), i = 1, 2, …, m. K-tuple: (s_i s_{i+1} … s_{i+k-1}). Using two binary digits for each base, we have the following representation: "A" = 00; "C" = 01; "G" = 10; "T" = 11. For any of the m/k non-overlapping k-tuples in the sequence, an integer E may be used to represent the k-tuple in a unique way, where b_i = 0 or 1, depending on the value of the sequence base, and Emax is the maximum of the possible E values.
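The two-bit representation can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the bit order (first base in the highest bits) is an assumption, chosen because it reproduces the E column of the hash table on slide 9 (AA = 0, AC = 1, CA = 4, TT = 15).

```python
# Two-bit code per base, as given on the slide.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_ktuple(ktuple):
    """Return the integer E for a k-tuple.

    Assumption: the first base occupies the highest-order bits.
    """
    e = 0
    for base in ktuple:
        e = (e << 2) | BASE_CODE[base]
    return e


def e_max(k):
    """Largest possible E for a given k: all 2k bits set."""
    return 4 ** k - 1


print(encode_ktuple("AC"), encode_ktuple("CA"), encode_ktuple("TT"), e_max(2))
```

With k = 2 there are 16 possible k-tuples, so E runs from 0 to 15.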

  9. Hash Table: A 2-tuple hashing table of S1, S2 and S3
     S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
     S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
     S3=(GGATCCCCTGTCCTCTCTGTCACATA)

     E    k-tuple  Ni  Indices and Offsets
     0    AA       1   (2,19)
     1    AC       3   (1,9) (2,5) (2,11)
     2    AG       2   (1,15) (2,35)
     3    AT       2   (2,13) (3,3)
     4    CA       7   (2,3) (2,9) (2,21) (2,27) (2,33) (3,21) (3,23)
     5    CC       4   (1,21) (2,31) (3,5) (3,7)
     6    CG       1   (1,5)
     7    CT       6   (1,23) (2,39) (2,43) (3,13) (3,15) (3,17)
     8    GA       4   (1,3) (1,17) (2,15) (2,25)
     9    GC       0
     10   GG       5   (1,25) (1,31) (2,17) (2,29) (3,1)
     11   GT       6   (1,1) (1,27) (1,29) (2,1) (2,37) (3,19)
     12   TA       1   (3,25)
     13   TC       6   (1,7) (1,11) (1,19) (2,23) (2,41) (3,11)
     14   TG       3   (1,13) (2,7) (3,9)
     15   TT
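A minimal sketch of how such a table could be built, assuming 1-based indices and offsets and non-overlapping k-tuples (offsets 1, k+1, 2k+1, ...), which is what the entries above imply. The function name `build_hash_table` is hypothetical, and a Python dictionary keyed by the k-tuple string stands in for the array indexed by E.

```python
from collections import defaultdict

SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]


def build_hash_table(sequences, k):
    """Map each k-tuple to a list of (sequence index, offset) pairs.

    Indices and offsets are 1-based. Only the non-overlapping k-tuples
    of each subject sequence are stored.
    """
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


table = build_hash_table(SUBJECTS, k=2)
for ktuple in ("AA", "CA", "TG"):
    print(ktuple, len(table[ktuple]), table[ktuple])
```

Storing only every k-th k-tuple of the subject is what keeps the table small; the query side compensates by using every overlapping k-tuple.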

  10. Hash table as on slide 9. Query sequence: Sq = (TGCAACAT)

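  11. Array of index and offset data. Query sequence: Sq = (TGCAACAT)
      f(t): (index, offset) entries found in the hash table for the t-th k-tuple of the query; F(t): the same entries with the offset shifted by -(t-1); Fs(t): all F(t) entries sorted (this column is a single sorted list and is not aligned with the k-tuple rows).

      k-tuple  f(t)     F(t)     -(t-1)  Fs(t)
      TG       (1,13)   (1,13)    0      (1,5)
               (2,7)    (2,7)     0      (1,13)
               (3,9)    (3,9)     0      (2,-2)
      GC                         -1
      CA       (2,3)    (2,1)    -2      (2,1)
               (2,9)    (2,7)    -2      (2,1)
               (2,21)   (2,19)   -2      (2,4)
               (2,27)   (2,25)   -2      (2,7)
               (2,33)   (2,31)   -2      (2,7)
               (3,21)   (3,19)   -2      (2,7)
               (3,23)   (3,21)   -2      (2,7)
      AA       (2,19)   (2,16)   -3      (2,16)
      AC       (1,9)    (1,5)    -4      (2,16)
               (2,5)    (2,1)    -4      (2,19)
               (2,11)   (2,7)    -4      (2,22)
      CA       (2,3)    (2,-2)   -5      (2,25)
               (2,9)    (2,4)    -5      (2,28)
               (2,21)   (2,16)   -5      (2,31)
               (2,27)   (2,22)   -5      (3,-3)
               (2,33)   (2,28)   -5      (3,9)
               (3,21)   (3,16)   -5      (3,16)
               (3,23)   (3,18)   -5      (3,18)
      AT       (2,13)   (2,7)    -6      (3,19)
               (3,3)    (3,-3)   -6      (3,21)

      The run of four (2,7) entries in Fs(t) marks a match of the query in S2 starting at offset 7.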
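The search step can be sketched as follows, under the same assumptions as the table (1-based positions, non-overlapping k-tuples in the subjects, every overlapping k-tuple in the query). The names `build_hash_table` and `search` are illustrative only, and counting identical shifted hits with a `Counter` is a simplification of scanning the sorted list for runs.

```python
from collections import Counter, defaultdict

SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]


def build_hash_table(sequences, k):
    """Non-overlapping k-tuples of each subject, 1-based (index, offset)."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table


def search(query, table, k):
    """Return the sorted shifted hits and a count of each distinct hit."""
    hits = []
    # t is 0-based here, so the slide's shift -(t-1) becomes -t.
    for t in range(len(query) - k + 1):
        for index, offset in table.get(query[t:t + k], []):
            hits.append((index, offset - t))
    hits.sort()
    return hits, Counter(hits)


table = build_hash_table(SUBJECTS, k=2)
hits, counts = search("TGCAACAT", table, k=2)
best, support = counts.most_common(1)[0]
print(hits)
print("best candidate:", best, "supported by", support, "k-tuples")
```

Hits that belong to the same alignment share the same (index, shifted offset) pair, so after sorting they sit next to each other; that is why sorting dominates the search cost.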

  12. Index and Offset: 64 Bit Machines. In order to carry out the search quickly and effectively, it is helpful in the computer code to combine the two integer arrays (index and offset) into a single long integer array. We are targeting implementations on 64 bit machines. The long integer array can be expressed as F(t) = {H(E(t),1), H(E(t),2), …, H(E(t),Nt)} with H(E(t),i) = 2^32 · H1(E(t),i) + H2'(E(t),i), i = 1, 2, …, Nt. It is seen from the above equation that the offset value takes the low bits while the index part takes the high-order bits of the long integer.
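A small sketch of the packing idea, assuming 32 bits for each part. The names `pack` and `unpack` are made up for illustration. The slide does not say how H2' is derived from the raw offset; this sketch simply requires a non-negative value, so shifted offsets that can go negative would need some adjustment first (for example a constant bias, which is an assumption and not something the slide states).

```python
OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack(index, offset):
    """Combine index (high 32 bits) and offset (low 32 bits) into one integer."""
    if not (0 <= index <= OFFSET_MASK and 0 <= offset <= OFFSET_MASK):
        raise ValueError("index and offset must each fit in 32 unsigned bits")
    return (index << OFFSET_BITS) | offset


def unpack(h):
    """Split a packed value back into (index, offset)."""
    return h >> OFFSET_BITS, h & OFFSET_MASK


pairs = [(3, 9), (2, 7), (1, 13), (2, 1)]
packed = sorted(pack(i, o) for i, o in pairs)
print([unpack(h) for h in packed])
```

Because the index sits in the high bits, sorting the packed integers orders the hits by index first and offset second, so one integer sort replaces a two-key sort.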

  13. Power Law: CPU time vs query length. Fig. 1: Normalized CPU time plotted against the number of k-tuples in the query (k=12), using Quicksort.

  14. SSAHA Memory. Memory for subject: Ms = 4*Ns/k + 4*2^(2k). Memory for query: Mq = Nq. Housekeeping: 10-20% of the total. Total memory: M = 1.2*(Ms+Mq).
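The memory estimate is easy to evaluate numerically. The sketch below is an illustration only: the function name is hypothetical, and the unit is assumed to be bytes (reading the factor 4 as four bytes per 32-bit integer), which the slide does not state explicitly.

```python
def ssaha_memory(ns, nq, k, housekeeping=0.2):
    """Estimate total memory from the slide's formulas.

    ns: number of bases in the subject database
    nq: number of bases in the query
    k:  k-tuple length
    housekeeping: overhead fraction (the slide quotes 10-20%)
    Assumption: the result is in bytes.
    """
    ms = 4 * ns / k + 4 * 2 ** (2 * k)  # position entries + one slot per possible k-tuple
    mq = nq
    return (1 + housekeeping) * (ms + mq)


# Example: a 3 Gb subject database with k = 12 and a 1 kb query.
print(f"{ssaha_memory(3_000_000_000, 1_000, 12) / 1e9:.2f} GB")
```

The first term shrinks as k grows while the second grows as 4^k, which is the CPU-versus-memory trade-off behind the choice of k.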

  15. The SSAHA Trace Server (diagram labels: Client, SSAHA2). The aim is to provide a near real-time (under 10 seconds) search service for a clustered 1.0 TB database. The solution is extensible by plugging in extra appliances.

  16. The Seven Bridges of Konigsberg (figure: the Pregel River and the four regions a, b, c, d) • During the 18th century, the city of Konigsberg (in East Prussia) was divided into four sections (a, b, c and d) by the Pregel River. Seven bridges connected these regions. • Question: Is it possible to walk about the city so as to cross each bridge exactly once and then return to the starting point?

  17. Vertex Degree, Euler Circuit and Euler Path (figure: graph with vertices a, b, c, d, e, f) • Vertex degree: For an undirected graph G, the degree of a vertex is the number of edges incident to that vertex. • Euler circuit: For an undirected graph G, if there is a circuit in G that traverses every edge of the graph exactly once, then G is said to have an Euler circuit. • Euler path: If there is an open trail from a to c in G and this trail traverses each edge in G exactly once, then the trail is called an Euler trail or Euler path.
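The definitions above lead to the classical degree test: a connected undirected graph has an Euler circuit if every vertex has even degree, and an Euler path if exactly two vertices have odd degree. The sketch below applies it to the Konigsberg bridges. The function names are made up, the graph is assumed to be connected, and the assignment of the labels a, b, c, d to particular land masses is an assumption (here "a" is taken to be the region touched by five bridges).

```python
from collections import Counter


def vertex_degrees(edges):
    """Degree of each vertex: the number of edge ends incident to it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def classify(edges):
    """Classify a connected undirected multigraph by its odd-degree vertices."""
    odd = sum(1 for d in vertex_degrees(edges).values() if d % 2 == 1)
    if odd == 0:
        return "Euler circuit"
    if odd == 2:
        return "Euler path"
    return "neither"


# Seven bridges; the labelling of the regions is hypothetical.
BRIDGES = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "c"),
           ("a", "d"), ("b", "d"), ("c", "d")]

print(dict(vertex_degrees(BRIDGES)))
print(classify(BRIDGES))
```

All four regions have odd degree, so the walk asked for on slide 16 does not exist.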

  18. Sequence Reconstruction - Hamiltonian path approach. S=(ATGCAGGTCC). ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC • Vertices: k-tuples from the spectrum, shown in red (8): ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG; • Edges: overlapping k-tuples (7); • Path: visiting all vertices corresponding to the sequence.

  19. Sequence Reconstruction - Euler path approach. ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA. Sequences with this spectrum: ATGCGTGGCA, ATGGCGTGCA • Vertices: correspond to (k-1)-tuples (7): CG, GT, GC, AT, TG, CA, GG; • Edges: correspond to k-tuples from the spectrum (8); • Path: visiting all EDGES corresponding to the sequence.
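The Euler path approach can be sketched with Hierholzer's algorithm, a standard method for finding such a path; the slides do not say which algorithm was actually used, so treat this as one possible illustration. The function name `reconstruct` is hypothetical. Each k-tuple becomes an edge from its first k-1 bases to its last k-1 bases.

```python
from collections import Counter, defaultdict

SPECTRUM = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]


def reconstruct(spectrum):
    """Spell a sequence from an Euler path through the (k-1)-tuple graph."""
    graph = defaultdict(list)
    out_deg, in_deg = Counter(), Counter()
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    starts = [n for n in sorted(out_deg) if out_deg[n] - in_deg[n] == 1]
    start = starts[0] if starts else min(out_deg)
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: emit the vertex
    path.reverse()
    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum has no Euler path")
    return path[0] + "".join(node[-1] for node in path[1:])


print(reconstruct(SPECTRUM))
```

The result is one of the two sequences listed on the slide; which one depends on the order in which edges are followed, which illustrates that a spectrum does not always determine the sequence uniquely.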

  20. SSAHA Type Hash Table
      S1=(ATGCAGGTCC), S2=(ATCAGGTCC), S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

      E    k-tuple  Indices, Offsets and links to the next
      7    ATG      (1,1,28) (3,1,28) (4,1,28)
      8    ATC      (2,1,29)
      10   AGT      (4,5,38)
      11   AGG      (1,5,42) (2,4,42) (3,6,42)
      19   TAG      (3,5,11)
      24   TTC      (4,7,32)
      28   TGC      (1,2,45) (3,2,46) (4,2,45)
      29   TCA      (2,2,51)
      32   TCC      (1,8,-1) (2,7,-1) (3,9,-1) (4,8,-1)
      38   GTT      (4,6,24)
      40   GTC      (1,7,32) (2,6,32) (3,8,32)
      42   GGT      (1,6,40) (2,5,40) (3,7,40)
      45   GCA      (1,3,51) (4,3,51)
      46   GCT      (3,3,53)
      51   CAG      (1,4,11) (2,3,11) (4,4,10)
      53   CTA      (3,4,19)

  21. Point to the Next - Hash Table Links. Same hash table and sequences as on slide 20: the third number in each entry is the E value of the k-tuple that follows at that position in the same sequence, and -1 marks the end of a sequence.
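A minimal sketch of a hash table with links to the next k-tuple, assuming 1-based indices and offsets and overlapping k-tuples (offsets 1, 2, 3, ...), as the entries on slide 20 imply. For readability the link is stored as the next k-tuple string, with `None` at the end of a read, where the slide stores the E value and -1. The name `build_linked_table` is hypothetical.

```python
from collections import defaultdict

READS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]


def build_linked_table(reads, k):
    """Map each k-tuple to (read index, offset, next k-tuple) entries."""
    table = defaultdict(list)
    for index, read in enumerate(reads, start=1):
        last = len(read) - k  # 0-based start of the final k-tuple
        for start in range(last + 1):
            nxt = read[start + 1:start + 1 + k] if start < last else None
            table[read[start:start + k]].append((index, start + 1, nxt))
    return table


table = build_linked_table(READS, k=3)
for ktuple in ("ATG", "TGC", "CAG", "TCC"):
    print(ktuple, table[ktuple])
```

Following the links from an entry walks along one read, while comparing the links of all entries under the same k-tuple shows where the reads diverge.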

  22. Consensus
      S1    ATGC-AGGTCC
      S2    AT-C-AGGTCC
      S3    ATGCTAGGTCC
      S4    ATGC-AGTTCC
      CONS  ATGC-AGGTCC
      ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
      CONS=(ATGCAGGTCC)
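One simple way to arrive at this consensus is a majority vote over the "next k-tuple" links: start from the most common first k-tuple and always follow the link supported by the most reads. This is a simplified stand-in and not necessarily the procedure used in the talk; the function name `consensus` is hypothetical.

```python
from collections import Counter, defaultdict

READS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]


def consensus(reads, k):
    """Walk the most-supported links from the most common starting k-tuple."""
    links = defaultdict(Counter)  # k-tuple -> votes for the following k-tuple
    first = Counter()
    for read in reads:
        first[read[:k]] += 1
        for i in range(len(read) - k):
            links[read[i:i + k]][read[i + 1:i + 1 + k]] += 1
    current = first.most_common(1)[0][0]
    result, seen = current, {current}
    while links[current]:
        current = links[current].most_common(1)[0][0]
        if current in seen:  # guard against cycles in repetitive reads
            break
        seen.add(current)
        result += current[-1]
    return result


print(consensus(READS, k=3))
```

On the four reads of slide 20 the minority branches (ATC, GCT, AGT) are each supported by a single read and are outvoted.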

  23. eulerSNP. Reads shown: ATGC--AGGTCC ATGC--AGGTCC ATTCCAGGTCC ATTC--AGCTCC ATGCTAGGTCC ATGCTAGGTCC ATGC--AGGTCC ATGC--AGGTCC ATGCTAGGTCC ATGC--AGGTCC ATGCTAGGTCC ATGCTAGGTCC. In polymorphic datasets of shotgun reads, eulerSNP uses a combination of the Euler path and hashing algorithms to detect SNPs and replace them with the most commonly occurring base at that location.

  24. Phusion Assembler Pipeline (flow diagram; box labels as extracted): Assembly; Data Process; Shotgun Reads; Supercontig; FPC Mapping; Read-pair Tracker; PRono; RPjoin – Merge; Reads Group; RPphrap - Contig.

  25. Kmer Word Hashing
      Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
      Contiguous Base Hash, K = 12: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT
      Gap-Hash 4x3: ATGGGCAGATGT, TGGCCAGTTGTT, GGCGAGTCGTTC, GCGTGTCCTTCG
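Both word types can be generated with a few lines of Python. The gap layout used here (three blocks of four bases, separated by gaps of three skipped bases) is inferred from the example words on the slide and is an assumption about what "4x3" means; the function names are made up.

```python
SEQ = "ATGGCGTGCAGTCCATGTTCGGATCA"


def contiguous_kmers(seq, k):
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, blocks=3, gap=3):
    """Words made of `blocks` runs of `block` bases, skipping `gap` bases between runs."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap  # bases covered by one word
    words = []
    for i in range(len(seq) - span + 1):
        words.append("".join(seq[i + b * step:i + b * step + block]
                             for b in range(blocks)))
    return words


print(contiguous_kmers(SEQ, 12)[:5])
print(gapped_kmers(SEQ)[:4])
```

Both kinds of word contain 12 bases and so fit the same hash table, but a gapped word spans 18 bases of the read.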

  26. Zebrafish as a model organism • Danio rerio • Fish length: 3 cm • Estimated genome size: 1.55 Gb • Easy to maintain: short generation time; can be kept at high densities • Easy to manipulate: external fertilisation and development; transparent embryos. Sanger Institute WGS project, started in spring 2001: • DNA source: Tuebingen embryos; • WGS read insert sizes: 2 - 10 kb; • BAC-end insert sizes: 165 – 175 kb; • Polymorphism: ~1000 5-day-old embryos; • SNP density: one in every 200 bp; • Indel density: one in every 1500 bp; • Indel length: 2 – 30 bp.

  27. Acknowledgements: • Jim Mullikin • Yong Gu • Adam Spargo • Richard Durbin • Kerstin Jekosch • Sean Humphray • Jane Rogers • Sanger Systems Support • Sanger Sequencing Facilities
