ILRI/BECA Bioinformatics Platform Introduction. Etienne de Villiers ILRI - Kenya. Outline. ILRI/BECA Bioinformatics Platform Hardware Specialized software: Database searching Assembly software CGIAR Bioinformatics Grid. International Livestock Research Institute.
ILRI/BECA Bioinformatics Platform Introduction Etienne de Villiers ILRI - Kenya
Outline • ILRI/BECA Bioinformatics Platform • Hardware • Specialized software: • Database searching • Assembly software • CGIAR Bioinformatics Grid
International Livestock Research Institute A lab in Africa at the foot of Kenya’s Ngong Hills
ILRI Research Objectives • Overall mandate is livestock research for poverty alleviation in Africa and South East Asia. • Undertakes a balance of fundamental and applied research with long-, medium- and short-term objectives. • Focus areas: livestock health, genetics, and management.
ILRI Facilities • State-of-the-art laboratories (2,500 m²) • Large and small animal facilities • Level-2/3 biosafety facility for cattle and sheep • Bioinformatics unit • 64 CPU Paracel 64-bit HPC cluster • Sequencing unit • ABI 3730 and ABI 3100 • Microarray facility • Proteomics facility • Oligonucleotide synthesis unit • FACS analysis facility • Tick unit
BECA - Biosciences East and Central Africa • Under NEPAD, several centers of excellence are being established in Africa. • One center is being established at ILRI: Biosciences East and Central Africa (BECA). • The center will provide state-of-the-art facilities for scientists in the region. • Facilities include: • Genetics and Genomics lab with high-throughput sequencers • Microarray laboratory • Proteomics laboratory • Immunology and molecular biology laboratories • Bioinformatics Platform
ILRI/BECA – Bioinformatics Platform • Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support. • EMBNet node for East and Central Africa
IBBP services • Access to bioinformatics tools through either: • web-based bioinformatics tools through the BBP website • secure shell (ssh) access for registered users • Facilities for storage of large datasets • Systems administration and backup of datasets • Training and support in the use of BBP resources • Graduate and Post-graduate Fellowships in Bioinformatics
IBBP Facilities • Training room • 18 computers with MS Windows and Linux • High-speed internet connection • Servers • 66 CPU Beowulf Linux cluster • High-availability Web server
IBBP Website www.becabioinfo.org
Selection of available tools on IBBP • Paracel Blast • GeneMatcher2 • PTA • Oligocheck • EMBOSS 200+ bioinformatics tools • ClustalW multiple alignment software • T-Coffee multiple alignment software • FASTA sequence alignment tool • HMMER multiple alignment and sequence searching software • Staden sequence assembly and analysis package • Primer3 primer design package • PAUP tree-inference package • PHYLIP tree-inference package • Phred/Phrap DNA editing and assembly tools • R statistical package • Rosetta – ab initio protein prediction • SRS – sequence retrieval tool • etc.
IBBP Hardware Systems • HPC Linux cluster • 66 CPUs (AMD 64-bit) • 72 gigabytes RAM • 3 terabytes disk storage • Paracel Blast Machine • Parallel NCBI-Blast (20 CPUs) • Blast • PSI-Blast • Mega-Blast • GeneMatcher2 • 6,144 CPU supercomputer • HMM • Smith-Waterman • GeneWise • Profile
Linux cluster • Rocks 4.1 (RedHat) operating system • Platform LSF batch queuing • shares resources equally between users • MPI libraries • Parallel computations [Diagram: cluster software stack, top to bottom: Application Software (e.g. BLAST, EMBOSS, Rosetta); Middleware (Platform LSF); Operating System (Red Hat - ROCKS); compute nodes; Network (GigE). Side labels: Application Integration, Batch Queue Setup, Turnkey HPC Integration, Cluster Build and Configuration]
Database searching • Heuristic Algorithms (FASTA and BLAST) • Gapped BLAST • Traditional ungapped BLAST • Are fast but give approximate alignments • Dynamic Programming Algorithms • Global – Needleman-Wunsch • Local – Smith-Waterman • Give optimal alignment but are very slow
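The dynamic programming approach listed above can be illustrated with a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration; they are not taken from the slides, and production tools typically use substitution matrices and affine gap penalties. The only difference from a Needleman-Wunsch style global recurrence in this sketch is the floor at zero, which lets an alignment start and end anywhere.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Illustrative scoring only: linear gap penalty, simple match/mismatch.
    Runs in O(len(a) * len(b)) time, which is why exhaustive dynamic
    programming is slow on large databases compared with heuristics.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 term restarts the alignment; this makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The quadratic cost per sequence pair is the reason such searches are usually run on parallel or specialized hardware rather than on a single workstation.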
Paracel Blast Server • Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems • 20 CPU parallel NCBI-Blast • 20x faster than NCBI-Blast server • Benchmark: Blastn, Paracel Blast vs. NCBI Blast • Query: Chromosome 8 (1 sequence, 150,000,000 bases) • Database: Human Ref. Seq (10,300 sequences, 24,300,000 bases) • Paracel Blast: 1h 9m 56s • NCBI: 6 days 2h 20m 34s
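As a worked check of the benchmark timings quoted on this slide, the short sketch below converts both runtimes to seconds and computes their ratio. The helper name `to_seconds` is hypothetical. For this single query the ratio comes out at roughly 126-fold, which is larger than the 20x figure in the bullet; the slide does not say how the 20x figure was measured, so the two numbers may refer to different workloads.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Runtimes as quoted on the slide.
ncbi = to_seconds(days=6, hours=2, minutes=20, seconds=34)
paracel = to_seconds(hours=1, minutes=9, seconds=56)
speedup = ncbi / paracel

if __name__ == "__main__":
    print(f"NCBI: {ncbi} s, Paracel: {paracel} s, ratio: {speedup:.1f}x")
```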
BioView Viewer Paracel Blast Server
Gene Structure Determination • To compare a cDNA or EST database to a genomic database, one must allow for introns • Two approaches: • Double-affine Smith-Waterman (separate gap penalty for introns) • GeneWise – protein or HMM versus genomic DNA (models the important features of protein families better)
How to get more distant homologs • Use dynamic programming algorithms • Use position-specific or HMM profiles • Do iterated searches • Use translated searches • Must be careful in interpretation (statistics)
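The idea behind a translated search can be sketched as follows: a DNA sequence is translated in all six reading frames, and the resulting protein strings are what get compared. This is only an illustration of the translation step, not the implementation used by TBLASTX or GeneMatcher2. The function names are hypothetical, the standard genetic code is assumed, and codons containing anything other than A, C, G or T are rendered as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Comparing at the protein level tolerates synonymous DNA changes, which is one reason translated searches can recover more distant homologs, at the price of six times as many sequences to score.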
GeneMatcher2 • Do things you either can’t or wouldn’t attempt at NCBI (100x faster) • A computer specialized for executing calculation-intensive methods in bioinformatics: • Especially fast at the very sensitive Smith-Waterman pairwise alignment method • compensates for frameshifts • GeneWise • intron- and frameshift-tolerant search method • Needleman-Wunsch alignments • HMM searches • 6,144 parallel processor computer
Why GeneMatcher2? • Comparison of sensitivity and selectivity of various sequence search methods • Blue denotes a software method • Yellow denotes a hardware-accelerated method [Figure: sensitivity vs. selectivity plot; axis labels: "Fewer false positives", "More true positives"]
GeneMatcher2 - Performance • Time-to-completion comparison of original methods and methods on GeneMatcher2 • TBLASTX improvement is 20-fold • Other methods at least 100-fold [Figure: bar chart "Runtime for an average query"; y-axis: Seconds (0 to 1000); x-axis: Method; bar labels: NCBI TBLASTX, EBI GeneWise, Paracel TBLASTX, Decypher HMM, Paracel GeneWise, Decypher TBLASTX, WUSTL HMM cluster, GeneMatcher2 SW, FASTA Smith-Waterman; bar values shown: 1000, 376, 270, 16, 13, 16, 4, 1, 0.1] Source: Genome Canada Bioinformatics Platform Project
BioView Viewer BioView Workbench
Assembly Software • Paracel Transcript Assembler (PTA) • High-capacity solution for EST-based transcript reconstruction • Can assemble large numbers of ESTs, allowing for splice variants • Complete pipeline for sequence cleaning, clustering and assembly • Detection, alignment and visualization of alternative splice forms • Visualization through intuitive graphical interfaces
Scientific problems for PTA • Proteomics • Gene discovery • Verify gene predictions for genome assembly • Detecting splice variants • Patterns of expression, tissue specificity • SNP detection • Combinations of all the above...
Paracel Oligocheck • Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2 • Searches oligos quickly against a whole genome • Software used by companies designing and synthesizing oligonucleotides, e.g. MWG
Ensembl mirror • Ensembl is a joint project between EMBL-EBI and the Sanger Institute. • A software system which produces and maintains automatic annotation on selected eukaryotic genomes. • Our site provides free access to selected areas of the data and software from the Ensembl project.
CGIAR – HPC GRID computing [Diagram: grid sites and resources, labels in slide order: ILRI Kenya; ICRISAT India; 33 nodes; GeneMatcher2; 4 nodes; 49 nodes; 89 CPUs; BECA/Partners; IRRI Philippines; CIP Peru; 8 nodes; 4 nodes]