For audio portion of webcast please dial: +44 (0)870 22 333 65 (please omit zero if calling from outside the UK) PIN = 444888
Personal Introductions
Robert Hercus – MD and Inventor, Synamatix
• Over 30 years IT experience
• Pioneered many large-scale IT projects
• “Language of Biology” basis of Synamatix
• Interests: Linguistics, Genomics, Artificial Intelligence
Ali Zamli – Bioinformatician
• Research Scientist
• Synamatix applications development
Dr. Arif Anwar – VP, Synamatix
• 10 yrs+ post-Ph.D. US and EU genomics background
• Ex – Agilent, CLONTECH and Axon Instruments
Questions to answer today? • What is a SynaBASE? • What are the advantages of using SynaBASE? • In which situations has SynaBASE been applied? • Does the use of SynaBASE offer any advantages for phylogenetics?
Core IP - SynaBASE™ - PLATFORM • Main partners and users in US and EU • 50+ staff split across group • Open approach to development – engine not software • Focused on efficient HPC for Genomics and Life Sciences
CORE database platform • Interfaces: API calls, graphical interface, command line interface • Applications: data analysis, develop tools • Components shown: SXParse, SXSequenceRefs, SynaSearch Bulk, SXLRESearch, SynaRex Bulk, SXFuzzyPatternSearch, SynaProbe Bulk, SXAlign, Sxpet
Software policy • More than 40 existing applications • All open source to licensees of SynaBASE • Users can also develop, modify and share all applications
What do we know about data? • Similarity & association • Common PATTERNS and functionality
Pattern Trie • More memory efficient than variable length data structures • Going to leaf node finds all sources and positions (Figure: trie with nodes A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; AACT, ACTC; AACTC)
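The idea that reaching a node yields all sources and positions can be sketched with a plain prefix trie. This is only an illustration: the class names, the `max_depth` limit and the `(source_id, start)` record layout are assumptions made here, not the SynaBASE data structure or its API.

```python
class Node:
    """One trie node; the path from the root spells a pattern."""

    def __init__(self):
        self.children = {}  # next character -> child Node
        self.hits = []      # (source_id, start) for every occurrence of the pattern


class PatternTrie:
    """Toy pattern trie that indexes every substring up to max_depth characters."""

    def __init__(self, max_depth):
        self.root = Node()
        self.max_depth = max_depth

    def add_sequence(self, source_id, seq):
        # Walk down from the root once per start position, recording the
        # occurrence at every node visited on the way.
        for start in range(len(seq)):
            node = self.root
            for ch in seq[start:start + self.max_depth]:
                node = node.children.setdefault(ch, Node())
                node.hits.append((source_id, start))

    def find(self, pattern):
        # Follow the pattern character by character; a missing child means
        # the pattern does not occur in any indexed sequence.
        node = self.root
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.hits


trie = PatternTrie(max_depth=5)
trie.add_sequence("s1", "AACTC")
trie.add_sequence("s2", "GACTC")
hits_act = trie.find("ACT")
hits_full = trie.find("AACTC")
hits_missing = trie.find("GG")
```

Here `hits_act` lists both sequences, because "ACT" starts at position 1 in each, while `hits_full` lists only "s1". Lookup cost depends on the pattern length, not on how many sequences were indexed.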
Pattern Trie • Low complexity repeats (e.g. AAA) filtered • High frequency patterns removed from alignment seeding (Figure: trie with nodes A, C, T; AA, AC, CT, TC; AAA, AAC, ACT, CTC; AACT, ACTC; AACTC; frequency labels f=100 and f=20)
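A rough sketch of this kind of seed filtering follows. The function name, the fixed pattern length `k`, the `max_freq` threshold and the definition of "low complexity" as a single repeated letter are all assumptions chosen for illustration; the slide does not state the actual SynaBASE criteria.

```python
from collections import Counter


def seed_patterns(seq, k, max_freq):
    """Return the k-length patterns of seq that are kept as alignment seeds."""
    # Count every overlapping k-length pattern in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {
        pattern
        for pattern, freq in counts.items()
        # Drop high-frequency patterns and single-letter repeats such as "AAA".
        if freq <= max_freq and len(set(pattern)) > 1
    }


seeds_a = seed_patterns("AAAAAACTGAAAA", k=3, max_freq=2)
seeds_b = seed_patterns("ACACACACGT", k=2, max_freq=3)
```

In the second call "AC" occurs four times and is dropped by the frequency limit alone, while "CA" (three occurrences) is kept.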
Takes 8 minutes for Swissprot. The fields in the build form are equivalent to the command-line XML configuration. Field data is converted into XML format and added to the existing entry in the SynaBASE XML configuration file.
Pattern Trie • Trie Boundary • Frequency is greater than build limit (Figure: trie with nodes A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; AACT, ACTC; AACTC)
Single-server IT architecture SynaBASE & SynaSuite Server HP Integrity rx4640 server Dual Intel Itanium2 1.5GHz CPU 64 GB DDR memory 146GB Ultra320 SCSI hard disk x 2 Red Hat Enterprise Linux AS 3 for IA64
2. SynaBASE enables very fast access • Number of levels small • For a query: match 1st longest pattern, then follow Eulerian path through network, picking up longest matching pattern for each posn. in query • Processing time is proportional to query size to obtain all unique subpatterns (Figure: pattern network with nodes A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; CTCG, TCGA, AACT, ACTC; AACTC, ACTCG, CTCGA)
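Picking up the longest matching pattern for each query position can be illustrated as below. This sketch restarts from the trie root at every position, so it costs query length times trie depth; it does not reproduce the Eulerian-path traversal named on the slide. The function names and the nested-dict trie are assumptions for illustration only.

```python
END = None  # marker key; never collides with a character key


def build_trie(patterns):
    """Build a nested-dict trie holding the given patterns."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = True  # a stored pattern ends at this node
    return root


def longest_matches(trie, query):
    """For each query position, return the longest stored pattern starting there."""
    result = []
    for start in range(len(query)):
        node, best = trie, ""
        for end in range(start, len(query)):
            node = node.get(query[end])
            if node is None:
                break  # no stored pattern continues with this character
            if END in node:
                best = query[start:end + 1]
        result.append(best)
    return result


stored = ["A", "C", "T", "AA", "AC", "CT", "TC",
          "AAC", "ACT", "CTC", "AACT", "ACTC", "AACTC"]
matches = longest_matches(build_trie(stored), "AACTCG")
```

With the patterns from the trie slides, the query "AACTCG" yields "AACTC" at position 0 and progressively shorter matches afterwards; "G" is not stored, so the last entry is empty.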
Efficiency leads to high performance • Only 15 million nodes are needed to represent 56 million residues • The storage of the shorter nodes has little effect
3. SynaBASE is very fast: Q * log N (base A) (Chart: speed in milliseconds, 100 to 900, against size of database in mega bp, 1 to 1000; series: Conventional, SynaBASE)
Novel hits: BLASTN vs. SynaSearch-Bulk • Cumulative number of hits shows SynaSearch Bulk found extra hits at low-mid identities • SynaBASE and BLAST DB of 700,000 bacterial ORFs queried with 100 1 kb sequences
4. Novel annotation using SynaBASE • Example sentence: “The elephant and the giraffe walked up the mountain” • A graph showing frequency of “string (word)” patterns in a sentence does not reflect meaning • A graph showing probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning
SIGNIFICANCE = actual frequency / expected frequency, for patterns a1, a2, a3, a1a2, a2a3, a1a2a3:

$$\mathrm{Ef}(a_1a_2a_3) = \frac{F(a_1a_2)\,F(a_2a_3)}{F(a_2)}$$

$$\mathrm{Sig}(a_1a_2a_3) = \frac{F(a_1a_2a_3)}{\mathrm{Ef}(a_1a_2a_3)} = \frac{F(a_1a_2a_3)\,F(a_2)}{F(a_1a_2)\,F(a_2a_3)}$$
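As a worked illustration of significance as actual frequency over expected frequency, the sketch below counts substrings in a short string and evaluates Sig = F(a1a2a3) * F(a2) / (F(a1a2) * F(a2a3)). The function names are hypothetical, and applying the same left/right/overlap split to patterns longer than three symbols is an assumption that goes beyond what the slide states.

```python
from collections import Counter


def substring_counts(text, max_len):
    """Count every substring of text with length 1..max_len (overlaps included)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def significance(counts, pattern):
    """Actual frequency divided by expected frequency, for len(pattern) >= 3."""
    left, right, overlap = pattern[:-1], pattern[1:], pattern[1:-1]
    if counts[overlap] == 0:
        return 0.0
    expected = counts[left] * counts[right] / counts[overlap]
    if expected == 0:
        return 0.0
    return counts[pattern] / expected


counts = substring_counts("ABCXBYABC", max_len=3)
sig_abc = significance(counts, "ABC")
```

In "ABCXBYABC", B occurs three times but AB, BC and ABC each occur twice, so the expected frequency of ABC is 2 * 2 / 3 and its significance is about 1.5: ABC occurs more often than its parts alone would predict.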
Gene models correlate with “SIGNIFICANCE” (Figure labels: Ensembl Gene, F2, F3, PIM1 Oncogene)
Example 1 - 454 assembly result • 400,000 reads assembled into 11 contigs in 11 minutes, 2 minutes for error correction • Genome coverage 99.89%
FragBASE – using the SynaBASE structure…. Use corrected FragBASE Select patterns of high coverage Use FragBASE network* to extend patterns Increase pattern size to overcome shorter repeat sections
Example 2 - Microarrays • Probe design – 30000 75mer probes, 8 per gene in 8h compared to previous 3 month+ process • Probe evaluation and mapping • Mapping of 600,000 Affymetrix 25mer probes to Human genome in 17s • Compares to over 2 weeks with BLAST
Example 3 – Comparative Genomics (Chart: run times for SynaBASE, PatternHunter and BLAST; values shown: 6 h, 22 days, 3 yrs)
Example 4 – Genome mapping
• Aims:
  • Mapping of whole genome shotgun reads from a mammalian genome to the Human Genome, to facilitate genome assembly using Synamatix and public tools.
  • Compare sensitivity, specificity and performance advantages of Synamatix technologies.
• Results: in comparison to BLASTz, SynaSearch:
  • Is 219-fold faster
  • Finds 11% more true positives
  • Finds 17% more unique hits to queries
  • Has a higher specificity: 113% fewer false positives, and fewer multiple placements per read (2.7 v 5.3)
• Benefits:
  • Enables significant enhancements in workflow throughput.
  • 219-fold compute time improvement
  • SynaSearch requires only 1 search process, whereas BLASTz requires the genome to be separated into 5 MB chunks and apportioned across multiple processors.
  • Results in better assemblies of new genomes.
  • Reduces current reliance on outsourcing of BLASTz analysis.
“Inference of a phylogenetic network of whole prokaryotic genomes using SynaBASE” Further example of use of SynaBASE engine: applying SynaBASE to Phylogenetics
Outline of study
• Primary data set 1: 101 Bacterial and Archaeal Genomes
• Used “SynaTree” – exhaustive comparison between “Sequences” in SynaBASE structure
  • Generates phylogenetic tree
• Used prototype Synamatix application: “SXComparePattern” – exhaustive pattern based similarity matching
• Evaluation of methods using:
  • C-score method*
  • Group visualisation and clustering analysis
• Tested “SXComparePattern” method with a larger 488 Bacterial Genome data set
*Henz S.R., Huson D.H., Auch A.F., Nieselt-Struwe K. and Schuster S.C. (2005) Whole-genome prokaryotic phylogeny. Bioinformatics. 21(10): 2329-2335
Phylogenetics using SynaTree • For each query genome, can search SynaBASE for all alignments with all other genome sequences {srefs, posn, length} • The alignment scores can then be used to calculate a distance matrix, where: A = alignment score, L = length of respective genomes • The distance matrix is used to generate a phylogenetic tree
SynaTree uses the SXAlign API for comparing alignments • It can be seen from the chart that the resulting triplets in a sliding window include significant alignments and also spurious short matches that are not significant. • The SynaBASE align function, SXAlign, includes a filter to remove the random short alignments or 'noise' from the alignment data. • The alignment scores are then used to calculate a distance matrix.
Example of filtering • Chart shows the effect of using the diagonal alignment filter on the alignment of 2 Serine Kinase aa sequences
SynaTree for 101 bacterial & archaeal genomes: 95 minutes! Compared to 7 days with BLAST
SynaTree for 101 bacterial & archaeal genomes (Tree figure; labelled groups: Cyanobacteria, Firmicute, Chlamydiae)
2nd method: SXComparePattern • Frequency of each pattern • Raw score for patterns • Calculation of distance matrix from raw score by distance formula
SXComparePattern Approach • Distance matrix calculated is the same as before with some exceptions: A = shared patterns between genomes i and j, L = number of patterns for respective genomes • Here, the calculation is based on shared patterns between the genomic sequences
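A minimal sketch of a shared-pattern distance matrix is given below. The distance formula itself did not survive on the slide, so the normalisation d = 1 - A / min(L_i, L_j), the use of fixed-length patterns (k-mers) and the function names are all assumptions for illustration, not the SXComparePattern calculation.

```python
def kmers(seq, k):
    """Set of distinct k-length patterns in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pattern_distance_matrix(genomes, k):
    """Pairwise distances from shared patterns (assumed formula, see lead-in)."""
    names = sorted(genomes)
    sets = {name: kmers(genomes[name], k) for name in names}
    matrix = {}
    for i in names:
        for j in names:
            shared = len(sets[i] & sets[j])            # A: shared patterns
            smaller = min(len(sets[i]), len(sets[j]))  # derived from L_i and L_j
            # Assumed normalisation: 0 for identical pattern sets, 1 for disjoint ones.
            matrix[i, j] = 1.0 - shared / smaller if smaller else 1.0
    return matrix


toy_genomes = {"g1": "ACGTAC", "g2": "ACGTTT", "g3": "GGGGGG"}
distances = pattern_distance_matrix(toy_genomes, k=3)
```

The resulting matrix could be passed to any standard distance-based tree-building method. Because only set intersections are needed, no pairwise alignment is involved.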
SXComparePattern tree for 101 bacterial and archaeal genomes: 23 seconds! Compared to 7 days with BLAST
SXComparePattern tree for 101 bacterial and archaeal genomes (Tree figure; labelled groups: Chlamydiae, Cyanobacteria, Firmicute)
Evaluation of phylogenetic networks • Evaluation of phylogenetic networks based on the c-score proposed by Henz et al. (2005) • The c-score is essentially the sum of compatible non-trivial splits (Tc) divided by the sum of all non-trivial splits in the test tree • The assumption is that the compatibility of non-trivial splits is compared against a reference tree which is deemed 'correct'.
Performance comparison • Rapid method for inferring phylogenetic networks. • The SXComparePattern entry that is highlighted and marked with * is with 488 bacterial sequences
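The c-score idea can be sketched by counting splits. The version below is unweighted (each split counts as 1), represents a split by one of its two sides, and uses hypothetical function names; whether Henz et al. weight the splits is not stated on the slide, so treat this as an assumption-laden illustration only.

```python
def is_compatible(split_a, split_b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    sides_a = (split_a, taxa - split_a)
    sides_b = (split_b, taxa - split_b)
    return any(not (x & y) for x in sides_a for y in sides_b)


def c_score(test_splits, reference_splits, taxa):
    """Fraction of non-trivial test splits compatible with every reference split."""
    if not test_splits:
        return 0.0
    compatible = sum(
        1
        for split in test_splits
        if all(is_compatible(split, ref, taxa) for ref in reference_splits)
    )
    return compatible / len(test_splits)


taxa = frozenset("ABCDE")
reference = [frozenset("AB"), frozenset("DE")]  # AB|CDE and DE|ABC
test_tree = [frozenset("AB"), frozenset("AC")]  # AB|CDE and AC|BDE
score = c_score(test_tree, reference, taxa)
```

In this toy case the split AC|BDE conflicts with the reference split AB|CDE, so one of the two test splits is compatible and the score is 0.5.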
Summary • SynaBASE platform extensible to phylogenetics • Pattern based approach provides for a very rapid and scalable means of clustering genomes into phylogenetic networks • Enables multi-supercomputer performance from a single server • This same approach can be used to cluster and analyse previously improbable data sets, e.g. • All primate genomes • All genes • Iterative analysis of evolutionary phylogenetics
END OF WEBCAST • Thank you for your participation! • Next Webcast will be on April 30 – “Use of SynaBASE for assembly of reads from 454 Life Sciences sequencing platform” • A full paper of the work presented will be sent to you on Monday next week • Please email enquiries@synamatix.com if you have any questions or would like a free trial