Paracel GeneMatcher2 Overview
GeneMatcher2 • The GeneMatcher system comprises hardware and software components that significantly accelerate a number of computationally intensive sequence similarity search algorithms. • There are two hardware components: • GeneMatcher accelerator • Post-Processor (Blastmachine) • Two client interfaces: • Unix command line • Web-based GUI (BioView Workbench)
GeneMatcher2 Architecture • [Diagram. Labeled elements: web interface; queries Query #1 (agaggt..) through Query #n; switch; GeneMatcher2 processor array, CPU 1 through CPU 6912; Blast machine]
GeneMatcher2 System • Massively parallel bioinformatics supercomputer • Array of ASIC (Application-Specific Integrated Circuit) chips combined with state-of-the-art Linux cluster technology • Accelerates dynamic programming search algorithms • 3,000 to 220,000 processors • Thousands of times faster than general-purpose computers
GeneMatcher2 Components • 3 processor units (6,142 processors per unit) • UltraSPARC computer • Up to 4 disk drives for database storage
GeneMatcher2 Algorithms • HMM and HMM-Frame • Searches protein or DNA sequence data with domain models • HMM-Frame aligns protein models to DNA with frame shift and optional intron tolerance • Profile and Profile-Frame • Position-specific scoring with profile models • Frame-shift-tolerant protein profile searches against DNA sequence data • GeneWise • Aligns protein sequences or HMMs against genomic data • Tolerates introns and frame shifts
GeneMatcher2 Algorithms cont. • Smith-Waterman • Comparison of DNA-DNA, Protein-Protein, Protein-DNA, or DNA-DNA through protein • Frame algorithms tolerate frame shifts, unlike their BLAST counterparts • Optional intron tolerance for searches of genomic data • Highly sensitive search finds hits that BLAST may miss • NCBI BLAST
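The Smith-Waterman search named above is a dynamic programming algorithm, which is what the accelerator parallelizes. A minimal software sketch of the local alignment score follows. The scoring values (match, mismatch, linear gap penalty) are illustrative assumptions, not GeneMatcher2 settings, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumed scoring: +2 match, -1 mismatch, -2 per gap position (linear gaps).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths. That exhaustive fill is the reason the full algorithm is slow on general-purpose machines and the reason heuristic tools approximate it.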
What about BLAST? • BLAST is an approximation of Smith-Waterman • So is FASTA, but it's better and has protein fragment searches • The approximation may not yield correct results in some situations: • Data with many ambiguities or frameshifts, such as raw ESTs and unfinished genomic sequence • Distantly related sequences • When global alignments are desired • Protein alignment of sequences with introns (not penalized on GeneMatcher)
Why GeneMatcher2 • Comparison of sensitivity and selectivity of various sequence search methods • Sensitivity: What proportion of the real hits are reported? (More sensitive means more true positives) • Selectivity: What proportion of the reported hits are real? (More selective means fewer false positives)
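The two definitions above can be written as simple ratios. The sketch below assumes the set of real hits is known (for example from a curated benchmark). The function name and the example hit identifiers are hypothetical.

```python
def sensitivity_selectivity(reported, real):
    """Return (sensitivity, selectivity) for a set of reported hits.

    sensitivity = true positives / all real hits
    selectivity = true positives / all reported hits
    """
    reported, real = set(reported), set(real)
    true_positives = len(reported & real)
    sensitivity = true_positives / len(real) if real else 0.0
    selectivity = true_positives / len(reported) if reported else 0.0
    return sensitivity, selectivity


if __name__ == "__main__":
    # Hypothetical identifiers: 2 of 3 real hits found, 2 of 4 reported hits real.
    reported_hits = {"seqA", "seqB", "seqC", "seqD"}
    real_hits = {"seqA", "seqB", "seqE"}
    print(sensitivity_selectivity(reported_hits, real_hits))
```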
GeneMatcher2 Performance • Time-to-completion comparison of original methods and methods on GeneMatcher2 • TBLASTX improvement is 20-fold • Other methods show at least 100-fold improvement • [Figure: bar chart, "Runtime for an average query"; y-axis: Seconds; x-axis: Method. Methods shown: NCBI TBLASTX, EBI GeneWise, Paracel TBLASTX, Decypher HMM, Paracel GeneWise, Decypher TBLASTX, WUSTL HMM cluster, GeneMatcher2 SW, FASTA Smith-Waterman] • Source: Genome Canada Bioinformatics Platform Project
Running a search • Load a sequence (or set of sequences) as a query set if it will be used several times • Select the appropriate search depending on the query type and database type (only suitable candidates will be displayed on the search forms) • Check your form options! • Watch the search queue (can raise priority of small jobs if machine is busy) • Select a result format
Databases • While you can load your own databases, disk space on the post-processor is not infinite! Ask us about maintaining public databases that are not currently available. • If you upload a private database, special files need to be created to use translated database searches such as rframe. • You can create private data sets to search against (e.g. Unigene-mouse and Unigene-rat in a data set called Unigene-rodent). These don’t take up any space.
Hidden Markov Models • Positive examples (Seq 1 to Seq 4): THE LAST FAT CAT; THE FAST CAT; THE VERY FAST CAT; THE FAT CAT • Multiple sequence alignment (Clustalw or T-coffee) of the positive examples, with gaps: THE LAST FA T CAT; THE FAST CAT; THE VERY FAST CAT; THE FA T CAT • Position specific: THE LAST FAST CAT, where each letter of LAST may instead be the corresponding letter of VERY or a gap • Queries: THE VAST VERY FAST CAT; THE VAST FAST CAT • Position-specific match: +++ ++++ ++++ +++ (all matches: “AST” from LAST, “V” from VERY) • Hidden Markov Model: only nothing, “LAST” or “VERY” in that position • GeneMatcher2 HMM Build
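The position-specific idea in the CAT example can be sketched in a few lines: record which characters the positive examples show in each alignment column, then check a query column by column. The gap placement in the alignment below is an assumed reconstruction of the slide, and the function names are hypothetical. This is not the GeneMatcher2 HMM Build procedure, which additionally models transitions between positions.

```python
# Assumed gapped alignment of the four positive examples ("-" marks a gap).
ALIGNMENT = [
    "THE LAST FA-T CAT",
    "THE ---- FAST CAT",
    "THE VERY FAST CAT",
    "THE ---- FA-T CAT",
]


def column_profile(rows):
    """Return, for each alignment column, the set of characters seen there."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [set(column) for column in zip(*rows)]


def match_mask(profile, query):
    """Mark each query position '+' if the column allows it, '.' otherwise."""
    if len(query) != len(profile):
        raise ValueError("query must have the same length as the alignment")
    return "".join("+" if ch in allowed else "."
                   for ch, allowed in zip(query, profile))


if __name__ == "__main__":
    profile = column_profile(ALIGNMENT)
    for query in ("THE LAST FAST CAT", "THE VAST FAST CAT", "THE BEST FAST CAT"):
        print(query, match_mask(profile, query))
```

Under these assumptions VAST passes every column check, with "V" allowed by VERY and "AST" allowed by LAST, even though no positive example contains that word. A model that only accepts nothing, "LAST" or "VERY" at that position would reject it.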
GeneWise • Predicts introns and exons based on conserved protein domains (e.g. Pfam database) • Uses HMMs, so the reverse query/data set relationship holds • Unlike genscan or fgenes, you can believe these hits, though they may not be complete where exons don’t contain conserved domains.