
For audio portion of webcast please dial:



Presentation Transcript


  1. For audio portion of webcast please dial: +44 (0)870 22 333 65 (please omit zero if calling from outside the UK) PIN = 444888

  2. Personal Introductions
     Robert Hercus - MD and Inventor, Synamatix • Over 30 years IT experience • Pioneered many large-scale IT projects • “Language of Biology” basis of Synamatix • Interests: Linguistics, Genomics, Artificial Intelligence
     Ali Zamli – Bioinformatician • Research Scientist • Synamatix applications development
     Dr. Arif Anwar – VP, Synamatix • 10 yrs+ post-Ph.D. US and EU genomics background • Ex – Agilent, CLONTECH and Axon Instruments

  3. Questions to answer today • What is a SynaBASE? • What are the advantages of using SynaBASE? • In which situations has SynaBASE been applied? • Does the use of SynaBASE offer any advantages for phylogenetics?

  4. Core IP - SynaBASE™ - PLATFORM • Main partners and users in US and EU • 50+ staff split across group • Open approach to development – engine not software • Focused on efficient HPC for Genomics and Life Sciences

  5. CORE Database platform [Diagram labels: API calls; Graphical Interface; Command line interface; Applications; Data analysis; Develop Tools; SXParse; SXSequenceRefs; SynaSearch Bulk; SXLRESearch; SynaRex Bulk; SXFuzzyPatternSearch; SynaProbe Bulk; SXAlign; Sxpet]

  6. Software policy • More than 40 existing applications • All open source to licensees of SynaBASE • Users can also develop, modify and share all applications

  7. What do we know about data? • Similarity & association • Common PATTERNS and functionality

  8. Pattern Trie • More memory efficient than variable length data structures • Going to leaf node finds all sources and positions
     [Diagram: trie nodes by level: A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; AACT, ACTC; AACTC]
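
The trie on this slide can be illustrated with a minimal Python sketch. It is not the SynaBASE implementation: the names `TrieNode`, `build_pattern_trie` and `find_occurrences` are hypothetical, and recording occurrences at every node (rather than only at leaves) is an assumption made to keep the example short.

```python
class TrieNode:
    """One pattern in the trie, e.g. the node reached by A -> C -> T is "ACT"."""

    def __init__(self):
        self.children = {}     # next symbol -> TrieNode
        self.occurrences = []  # (source_id, start_position) pairs


def build_pattern_trie(sequences, max_len):
    """Insert every substring of length <= max_len from every sequence.

    `sequences` maps a source identifier to its sequence string.
    `max_len` is an assumed cap on pattern length (hypothetical parameter).
    """
    root = TrieNode()
    for source_id, seq in sequences.items():
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:start + max_len]:
                node = node.children.setdefault(symbol, TrieNode())
                node.occurrences.append((source_id, start))
    return root


def find_occurrences(root, pattern):
    """Walk down the trie; the node reached lists all sources and positions."""
    node = root
    for symbol in pattern:
        node = node.children.get(symbol)
        if node is None:
            return []
    return node.occurrences


trie = build_pattern_trie({"s1": "AACTC", "s2": "ACTC"}, max_len=5)
print(find_occurrences(trie, "ACT"))  # [('s1', 1), ('s2', 0)]
```

Shared prefixes such as A, AC and ACT are stored once, however many sequences contain them, which is one way to read the slide's memory-efficiency claim.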

  9. Pattern Trie • Low complexity repeats (e.g. AAA) - filtered • High frequency patterns removed from alignment seeding
     [Diagram: trie nodes by level: A, C, T; AA, AC, CT, TC; AAA, AAC, ACT, CTC; AACT, ACTC; AACTC; frequency labels f=100 and f=20]
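
As a rough illustration of the two filters named on this slide, the sketch below counts fixed-length patterns and keeps only those that are neither low complexity nor too frequent. Everything here is an assumption for illustration: the function names, the use of fixed-length patterns, the definition of "low complexity" as a single repeated symbol, and the `max_frequency` threshold are hypothetical and are not taken from SynaBASE.

```python
from collections import Counter


def count_patterns(sequence, k):
    """Count every overlapping pattern of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def is_low_complexity(pattern):
    """Assumed definition: a run of one repeated symbol, such as AAA."""
    return len(set(pattern)) == 1


def seeding_patterns(counts, max_frequency):
    """Keep patterns usable as alignment seeds under the assumed filters."""
    return {
        pattern
        for pattern, freq in counts.items()
        if freq <= max_frequency and not is_low_complexity(pattern)
    }


counts = count_patterns("AAAAACTC", k=3)
print(sorted(seeding_patterns(counts, max_frequency=2)))  # ['AAC', 'ACT', 'CTC']
```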

  10. Building a SynaBASE – easy and fast

  11. Takes 8 minutes for Swissprot • The fields in the build form are equivalent to the command-line XML configuration • Field data is converted into XML format and added to the existing entry in the SynaBASE XML configuration file

  12. Pattern Trie • Trie Boundary • Frequency is greater than build limit
      [Diagram: trie nodes by level: A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; AACT, ACTC; AACTC]

  13. Flexibility to use CMD line

  14. Single-server IT architecture
      SynaBASE & SynaSuite Server: HP Integrity rx4640 server • Dual Intel Itanium2 1.5GHz CPU • 64 GB DDR memory • 146GB Ultra320 SCSI hard disk x 2 • Red Hat Enterprise Linux AS 3 for IA64

  15. 1. SynaBASE scales efficiently

  16. 2. SynaBASE enables very fast access
      • Number of levels small
      • For a query:
        • Match 1st longest pattern
        • Follow Eulerian path through network, picking up the longest matching pattern for each position in the query
      • Processing time is:
        • Proportional to query size to obtain all unique subpatterns
      [Diagram: trie nodes by level: A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; AACT, ACTC, CTCG, TCGA; AACTC, ACTCG, CTCGA]
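
The idea of picking up the longest matching pattern at each query position can be sketched as follows. This is a simplified stand-in, not the SynaBASE algorithm: it replaces the network traversal described on the slide with a plain scan over a dictionary of patterns, and the names `build_pattern_index`, `longest_matches` and the `max_len` cap are hypothetical.

```python
def build_pattern_index(sequences, max_len):
    """Map every substring of length <= max_len to its (source, position) list."""
    index = {}
    for source_id, seq in sequences.items():
        for start in range(len(seq)):
            longest = min(max_len, len(seq) - start)
            for length in range(1, longest + 1):
                pattern = seq[start:start + length]
                index.setdefault(pattern, []).append((source_id, start))
    return index


def longest_matches(query, index, max_len):
    """For each query position, return the longest indexed pattern starting there."""
    results = []
    for pos in range(len(query)):
        best = ""
        longest = min(max_len, len(query) - pos)
        for length in range(1, longest + 1):
            candidate = query[pos:pos + length]
            # The index holds every prefix of every stored pattern, so once a
            # prefix is missing no longer candidate at this position can match.
            if candidate not in index:
                break
            best = candidate
        results.append((pos, best))
    return results


index = build_pattern_index({"ref": "AACTCGA"}, max_len=5)
for pos, pattern in longest_matches("ACTCGT", index, max_len=5):
    print(pos, pattern)
```

With `max_len` fixed, the work done per query position is bounded, so the total time of this sketch grows linearly with query length, in the spirit of the slide's claim that processing time is proportional to query size.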

  17. Efficiency leads to high performance • Only 15 million nodes are needed to represent 56 million residues • The storage of the shorter nodes has little effect

  18. 3. SynaBASE is very fast - Q * log N (base A)
      [Chart: speed (milliseconds, axis up to 900) against size of database (mega bp, axis from 1 to 1000) for Conventional and SynaBASE]

  19. Novel hits: BLASTN vs. SynaSearch-Bulk • Cumulative number of hits shows SynaSearch Bulk found extra hits at low-mid identities • SynaBASE and BLAST DB of 700,000 bacterial ORFs queried with 100 1kb sequences

  20. 4. Novel annotation using SynaBASE • Example sentence: "The elephant and the giraffe walked up the mountain" • A graph showing the frequency of “string (word)” patterns in a sentence does not reflect meaning • A graph showing the probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning

  21. SIGNIFICANCE = Actual Freq / Expected Freq, for a pattern a1a2a3 built from the sub-patterns a1a2, a2a3 and a2
      \[ Ef(a_1a_2a_3) = \frac{F(a_1a_2)\,F(a_2a_3)}{F(a_2)} \]
      \[ Sig(a_1a_2a_3) = \frac{F(a_1a_2a_3)}{Ef(a_1a_2a_3)} = \frac{F(a_1a_2a_3)\,F(a_2)}{F(a_1a_2)\,F(a_2a_3)} \]
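
The significance ratio on this slide can be worked through in a few lines of Python. This is a sketch under stated assumptions: frequencies are counted as overlapping occurrences in a single text, and the formula is generalised from three symbols to any pattern of length 3 or more by taking the prefix, the suffix and their overlap as the three sub-patterns. The function names are hypothetical.

```python
def frequency(text, pattern):
    """Count overlapping occurrences of pattern in text."""
    last_start = len(text) - len(pattern)
    return sum(1 for i in range(last_start + 1) if text.startswith(pattern, i))


def significance(text, pattern):
    """Actual frequency divided by expected frequency.

    Expected frequency follows the slide: F(prefix) * F(suffix) / F(overlap),
    where prefix = a1a2, suffix = a2a3 and overlap = a2 for a 3-symbol pattern.
    """
    if len(pattern) < 3:
        raise ValueError("pattern needs at least 3 symbols")
    prefix, suffix, overlap = pattern[:-1], pattern[1:], pattern[1:-1]
    overlap_freq = frequency(text, overlap)
    if overlap_freq == 0:
        return 0.0
    expected = frequency(text, prefix) * frequency(text, suffix) / overlap_freq
    if expected == 0:
        return 0.0
    return frequency(text, pattern) / expected


text = "ABCABCXBY"
print(significance(text, "ABC"))  # F=2, Ef=2*2/3, so Sig=1.5
print(significance(text, "XBY"))  # F=1, Ef=1*1/3, so Sig=3.0
```

In this toy text the rare pattern XBY scores higher than the more frequent ABC, which matches the earlier slide's point that raw frequency and significance measure different things.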

  22. Gene models correlate with “SIGNIFICANCE” [Figure labels: Ensembl Gene; F2; F3; PIM1 Oncogene]

  23. Example 1 - 454 assembly result • 400,000 reads assembled into 11 contigs in 11 minutes, 2 minutes for error correction • Genome coverage 99.89%

  24. FragBASE – using the SynaBASE structure…. • Use corrected FragBASE • Select patterns of high coverage • Use FragBASE network* to extend patterns • Increase pattern size to overcome shorter repeat sections

  25. Example 2 - Microarrays
      • Probe design – 30,000 75mer probes, 8 per gene, in 8h compared to previous 3 month+ process
      • Probe evaluation and mapping
        • Mapping of 600,000 Affymetrix 25mer probes to Human genome in 17s
        • Compares to over 2 weeks with BLAST

  26. Example 3 – Comparative Genomics [Chart: run times shown: 3 yrs, 22 days, 6h; tools shown: SynaBASE, PatternHunter, BLAST]

  27. Example 4 – Genome mapping
      • Aims:
        • Mapping of whole genome shotgun reads from a mammalian genome to the Human Genome, to facilitate genome assembly using Synamatix and public tools.
        • Compare sensitivity, specificity and performance advantages of Synamatix technologies.
      • Results: in comparison to BLASTz, SynaSearch:
        • Is 219 fold faster
        • Finds 11% more true positives
        • Finds 17% more unique hits to queries
        • Has a higher specificity:
          • 113% fewer false positives
          • fewer multiple placements per read – 2.7 v 5.3
      • Benefits:
        • Enables significant enhancements in workflow throughput.
          • 219 fold compute time improvement
          • SynaSearch requires only 1 search process whereas BLASTz requires genome to be separated into 5MB chunks and apportioned across multiple processors.
        • Results in better assemblies of new genomes.
        • Reduces current reliance on outsourcing of BLASTz analysis.

  28. “Inference of a phylogenetic network of whole prokaryotic genomes using SynaBASE” Further example of use of SynaBASE engine: applying SynaBASE to Phylogenetics

  29. Outline of study
      • Primary data set 1: 101 Bacterial and Archaeal Genomes
      • Used “SynaTree” – exhaustive comparison between “Sequences” in SynaBASE structure
        • Generates phylogenetic tree
      • Used prototype Synamatix application: “SXComparePattern” – exhaustive pattern based similarity matching
      • Evaluation of methods using:
        • C-score method*
        • Group visualisation and clustering analysis
      • Tested “SXComparePattern” method with a larger 488 Bacterial Genome data set
      *Henz S.R., Huson D.H., Auch A.F., Nieselt-Struwe K. and Schuster S.C. (2005) Whole-genome prokaryotic phylogeny. Bioinformatics. 21(10): 2329-2335

  30. Phylogenetics using SynaTree • For each query genome, can search SynaBASE for all alignments with all other genome sequences {srefs, posn, length} • The alignment scores can then be used to calculate a distance matrix, where: A = alignment score; L = length of respective genomes • The distance matrix is used to generate a phylogenetic tree

  31. SynaTree Interface

  32. SynaTree uses the SXAlign API for comparing alignments • It can be seen from the chart that the resulting triplets in a sliding window include significant alignments and also spurious short matches that are not significant • The SynaBASE align function, SXAlign, includes a filter to remove the random short alignments or 'noise' from the alignment data • The alignment scores are then used to calculate a distance matrix

  33. Example of filtering • Chart shows the effect of using the diagonal alignment filter on the alignment of 2 Serine Kinase aa sequences

  34. SynaTree for 101 bacterial & archaeal genomes • 95 minutes! • Compared to 7 days with BLAST

  35. SynaTree for 101 bacterial & archaeal genomes [Tree figure; labelled groups: Cyanobacteria, Firmicute, Chlamydiae]

  36. 2nd method: SXComparePattern • Frequency of each pattern • Raw score for patterns • Calculation of distance matrix from raw score by distance formula

  37. SXComparePattern Approach • Distance matrix calculation is the same as before, with some exceptions: here, the calculation is based on shared patterns between genomic sequences • Where: A = shared patterns between genomes i and j; L = number of patterns for respective genomes
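
A pattern-based distance matrix of this kind might be sketched as below. The distance formula itself is not reproduced in the transcript, so the normalisation used here, `1 - A / min(L_i, L_j)`, is purely a placeholder assumption and should not be read as the SXComparePattern formula. The use of fixed-length patterns of length `k` and all function names are likewise hypothetical.

```python
def pattern_set(sequence, k):
    """Distinct patterns of length k in the sequence (assumed pattern definition)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distance_matrix(genomes, k):
    """Pairwise distances from shared patterns.

    A = number of patterns shared by genomes i and j
    L = number of patterns in each genome
    distance = 1 - A / min(L_i, L_j)   (placeholder normalisation, an assumption)
    """
    names = sorted(genomes)
    sets = {name: pattern_set(genomes[name], k) for name in names}
    matrix = {}
    for a in names:
        for b in names:
            shared = len(sets[a] & sets[b])
            smaller = min(len(sets[a]), len(sets[b]))
            matrix[a, b] = 1.0 - shared / smaller if smaller else 1.0
    return names, matrix


names, matrix = distance_matrix(
    {"g1": "AACTC", "g2": "AACTG", "g3": "GGGGG"}, k=3
)
for a in names:
    print(a, [round(matrix[a, b], 3) for b in names])
```

The resulting matrix could then be passed to any standard distance-based tree-building method; which method the presenters used is not stated in the transcript.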

  38. SXComparePattern tree for 101 bacterial and archaeal genomes • 23 seconds! • Compared to 7 days with BLAST

  39. SXComparePattern tree for 101 bacterial and archaeal genomes [Tree figure; labelled groups: Chlamydiae, Cyanobacteria, Firmicute]

  40. Performance based on grouping

  41. Evaluation of phylogenetic networks • Evaluation of phylogenetic networks based on the c-score proposed by Henz et al. (2005), which is essentially the sum of compatible non-trivial splits (Tc) divided by the sum of all non-trivial splits in the test tree • The assumption is that the compatibility of non-trivial splits is compared against a reference tree which is deemed 'correct'.
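
The ratio described on this slide can be sketched in Python as follows. This is one reading of the slide, not a reference implementation of the method of Henz et al.: a split is represented by the set of taxa on one side, "sum" is taken to mean a sum of per-split weights (use a weight of 1 for plain counts), and a test split counts as compatible only if it is compatible with every split of the reference tree. All names are hypothetical.

```python
def is_trivial(side, taxa):
    """A split is trivial if either side holds fewer than two taxa."""
    return len(side) < 2 or len(taxa - side) < 2


def splits_compatible(side_a, side_b, taxa):
    """Two splits are compatible if at least one of the four intersections
    between their sides is empty."""
    sides_a = (side_a, taxa - side_a)
    sides_b = (side_b, taxa - side_b)
    return any(not (x & y) for x in sides_a for y in sides_b)


def c_score(test_splits, reference_splits, taxa):
    """Weight of compatible non-trivial test splits over all non-trivial ones.

    test_splits: dict mapping frozenset (one side of a split) -> weight
    reference_splits: iterable of frozensets from the tree deemed 'correct'
    """
    reference = list(reference_splits)
    total = 0.0
    compatible = 0.0
    for side, weight in test_splits.items():
        if is_trivial(side, taxa):
            continue
        total += weight
        if all(splits_compatible(side, ref, taxa) for ref in reference):
            compatible += weight
    return compatible / total if total else 0.0


taxa = frozenset("ABCDE")
reference = [frozenset("AB"), frozenset("ABC")]
test = {frozenset("AB"): 2.0, frozenset("AC"): 1.0, frozenset("A"): 5.0}
print(round(c_score(test, reference, taxa), 4))  # 0.6667
```

In the example, the split AB|CDE agrees with the reference, AC|BDE conflicts with it, and the trivial split A|BCDE is ignored, giving 2.0 / 3.0.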

  42. NCBI Reference Tree

  43. Zoomed tree of 488 Bacterial Genomes

  44. Performance comparison • Rapid method for inferring phylogenetic networks. • The SXComparePattern entry highlighted above and marked with * is with 488 bacterial sequences

  45. Summary • SynaBASE platform extensible to phylogenetics • Pattern based approach provides for a very rapid and scalable means of clustering genomes into phylogenetic networks • Enables multi-supercomputer performance from a single server • This same approach can be used to cluster and analyse previously improbable data sets, e.g. • All primate genomes • All genes • Iterative analysis of evolutionary phylogenetics

  46. END OF WEBCAST • Thank you for your participation! • Next Webcast will be on April 30 – “Use of SynaBASE for assembly of reads from 454 Life Sciences sequencing platform” • A full paper of the work presented will be sent to you on Monday next week • Please email enquiries@synamatix.com if you have any questions or would like a free trial
