ILRI/BECA Bioinformatics Platform Introduction

ILRI/BECA Bioinformatics Platform Introduction • Etienne de Villiers • ILRI - Kenya

Outline • ILRI/BECA Bioinformatics Platform • Hardware • Specialized software: • Database searching • Assembly software • CGIAR Bioinformatics Grid

International Livestock Research Institute • A lab in Africa at the foot of Kenya’s Ngong Hills

ILRI Research Objectives • Overall mandate is livestock research for poverty alleviation in Africa and South East Asia. • Undertakes a balance of fundamental and applied research with long-, medium- and short-term objectives. • Focus areas: livestock health, genetics, and management.

ILRI Facilities • State-of-the-art laboratories (2,500 m²) • Large and small animal facilities • Level-2/3 biosafety facility for cattle and sheep • Bioinformatics unit • 64 CPU Paracel 64-bit HPC cluster • Sequencing unit • ABI 3730 and ABI 3100 • Microarray facility • Proteomics facility • Oligonucleotide synthesis unit • FACS analysis facility • Tick unit

BECA - Biosciences East and Central Africa • Under NEPAD, several centers of excellence are being established in Africa. • One center is being established at ILRI – Biosciences East and Central Africa (BECA). • The center will provide state-of-the-art facilities for scientists in the region. • Facilities include: • Genetics and Genomics lab with high-throughput sequencers • Microarray laboratory • Proteomics laboratory • Immunology and molecular biology laboratories • Bioinformatics Platform

ILRI/BECA – Bioinformatics Platform • Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support. • EMBNet node for East and Central Africa

IBBP services • Access to bioinformatics tools through either: • web-based bioinformatics tools through the BBP website • secure shell (ssh) access for registered users • Facilities for storage of large datasets • Systems administration and backup of datasets • Training and support in the use of BBP resources • Graduate and Post-graduate Fellowships in Bioinformatics

IBBP Facilities • Training room • 18 computers with MS Windows and Linux • High-speed internet connection • Servers • 66 CPU Beowulf Linux cluster • High-availability Web server

IBBP Website www.becabioinfo.org

Selection of available tools on IBBP • Paracel Blast • GeneMatcher2 • PTA • Oligocheck • EMBOSS (200+ bioinformatics tools) • ClustalW multiple alignment software • T-Coffee multiple alignment software • FastA sequence alignment tool • HMMER multiple alignment and sequence searching software • Staden sequence assembly and analysis package • Primer3 primer design package • PAUP tree-inference package • PHYLIP tree-inference package • Phred/Phrap DNA editing and assembly tools • R statistical package • Rosetta – ab initio protein prediction • SRS – sequence retrieval tool • etc.

IBBP Hardware Systems • HPC Linux cluster • 66 CPUs (AMD 64-bit) • 72 gigabytes RAM • 3 terabytes disk storage • Paracel Blast Machine • Parallel NCBI-Blast (20 CPUs) • Blast • PSI-Blast • Mega-Blast • GeneMatcher2 • 6,144 CPU supercomputer • HMM • Smith-Waterman • GeneWise • Profile

Linux cluster • Rocks 4.1 (Red Hat) operating system • Platform LSF batch queuing • shares resources equally between users • MPI libraries • Parallel computations • [Diagram: turnkey HPC integration stack. Labels: Application Software (e.g. BLAST, EMBOSS, Rosetta); Application Integration; Middleware (Platform LSF); Batch Queue Setup; Operating System (Red Hat - ROCKS); Cluster Build and Configuration; compute nodes; Network (GigE)]
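As a rough illustration of how work reaches the cluster through the batch queue, the sketch below assembles a Platform LSF `bsub` submission in Python. It only builds and prints the command; it does not submit anything. The job name, the query file `query.fa`, the database name `nt` and the helper `build_bsub_command` are hypothetical placeholders, and the legacy `blastall` invocation is an assumption about how BLAST might be run on the cluster, not something stated in the slides.

```python
import shlex


def build_bsub_command(job_name, n_slots, command, queue=None, log_file="%J.out"):
    """Build an LSF `bsub` argument list (hypothetical helper, not part of LSF).

    -J sets the job name, -n the number of job slots, -o the output file
    (%J is replaced by the job ID), and -q selects a queue.
    """
    args = ["bsub", "-J", job_name, "-n", str(n_slots), "-o", log_file]
    if queue is not None:
        args += ["-q", queue]
    return args + list(command)


# Hypothetical single-slot BLAST job; file and database names are placeholders.
cmd = build_bsub_command(
    "blast_test", 1, ["blastall", "-p", "blastn", "-d", "nt", "-i", "query.fa"]
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list separately from running it makes it easy to inspect the submission before handing it to the scheduler.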

Database searching • Heuristic Algorithms (FASTA and BLAST) • Gapped BLAST • Traditional ungapped BLAST • Are fast but give approximate alignments • Dynamic Programming Algorithms • Global – Needleman-Wunsch • Local – Smith-Waterman • Give optimal alignment but are very slow
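To make the dynamic programming side of this comparison concrete, here is a minimal sketch of Smith-Waterman local alignment scoring in Python. It uses a simple linear gap penalty and returns only the best score, not the alignment itself; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are illustrative assumptions, not parameters from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    # Only the previous row of the DP matrix is needed to compute the score.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # The floor at 0 is what makes the alignment local.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))  # prints 8
```

Every cell of the len(a) x len(b) matrix is filled, which is why the method is exact but slow on large databases, whereas heuristics such as BLAST and FASTA examine only promising regions.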

Paracel Blast Server • Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems • 20 CPU parallel NCBI-Blast • 20x faster than NCBI-Blast server • Benchmark: Blastn, Paracel Blast vs. NCBI Blast • Query: Chromosome 8, 1 sequence, 150,000,000 bases • Database: Human Ref. Seq, 10,300 sequences, 24,300,000 bases • Paracel Blast: 1h 9m 56s • NCBI: 6 days 2h 20m 34s
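The speed-up implied by the two run times quoted on this slide can be checked with a few lines of Python. This is plain arithmetic on the reported timings; for this single benchmark the ratio comes out well above the 20x headline figure.

```python
from datetime import timedelta

# Run times as reported on the slide.
paracel = timedelta(hours=1, minutes=9, seconds=56)
ncbi = timedelta(days=6, hours=2, minutes=20, seconds=34)

# Dividing one timedelta by another yields a plain float ratio.
speedup = ncbi / paracel
print(f"Speed-up: {speedup:.1f}x")  # prints "Speed-up: 125.6x"
```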

BioView Viewer Paracel Blast Server

BioView Viewer

Gene Structure Determination • To compare a cDNA or EST database to a genomic database, one must allow introns • Two approaches: • Double-affine Smith-Waterman (separate gap penalty for introns) • Genewise – protein or HMM versus genomic DNA (models the important features of protein families better)
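The idea behind a double-affine gap model can be sketched as taking the cheaper of two affine penalties: one tuned for short insertions and deletions, and one with a high opening cost but little or no extension cost, so that a long intron-sized gap stays affordable. The parameter values and function names below are illustrative assumptions, not the settings used by GeneMatcher2.

```python
def affine_gap(length, gap_open, gap_extend):
    """Cost of a gap of `length` residues under a single affine penalty."""
    return gap_open + gap_extend * length


def double_affine_gap(length, short=(10, 2), intron=(60, 0)):
    """Cost of a gap under two affine penalties, whichever is cheaper.

    `short` and `intron` are (open, extend) pairs; the values are hypothetical.
    """
    if length <= 0:
        return 0
    return min(affine_gap(length, *short), affine_gap(length, *intron))


print(double_affine_gap(3))     # short indel: 10 + 2*3 = 16
print(double_affine_gap(1000))  # intron-sized gap: 60
```

With a single affine penalty, the 1000-base gap would cost 2010 and no cDNA-to-genome alignment could span it; the second penalty caps that cost.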

How to get more distant homologs • Use dynamic programming algorithms • Use position-specific or HMM profiles • Do iterated searches • Use translated searches • Must be careful in interpretation (statistics)
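Translated searches compare sequences at the protein level, which requires translating nucleotide sequence in all six reading frames. The sketch below shows that step using the standard genetic code; the function names are hypothetical, and unknown codons (for example those containing N) are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3 as a dict."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
print(frames["+1"], frames["-1"])  # prints "MA GH"
```

A TBLASTX-style search would translate both query and database this way before comparing the resulting protein sequences, which is part of why such searches are computationally expensive.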

GeneMatcher2 • Do things you either can’t or wouldn’t attempt at NCBI (100x faster) • A computer specialized for executing calculation-intensive methods in bioinformatics: • Especially fast at the very sensitive Smith-Waterman pairwise alignment method • compensates for frameshifts • GeneWise • intron- and frameshift-tolerant search method • Needleman-Wunsch alignments • HMM searches • 6,144 parallel processor computer

Why GeneMatcher2? • Comparison of sensitivity and selectivity of various sequence search methods • Blue denotes a software method • Yellow denotes a hardware-accelerated method • [Figure: search methods plotted by selectivity (fewer false positives) against sensitivity (more true positives)]

GeneMatcher2 - Performance • Time-to-completion comparison of original methods and methods on GeneMatcher2 • TBLASTX improvement is 20-fold • Other methods at least 100-fold • [Chart: Runtime for an average query; y-axis: seconds (0 to 1000); x-axis: method. Methods compared: NCBI, Paracel and Decypher TBLASTX; EBI and Paracel GeneWise; Decypher HMM and WUSTL HMM cluster; FASTA Smith-Waterman and GeneMatcher2 SW] • Source: Genome Canada Bioinformatics Platform Project

BioView Viewer BioView Workbench

BioView Viewer

Assembly Software • Paracel Transcript Assembler (PTA) • High-capacity solution for EST-based transcript reconstruction • Can assemble large numbers of ESTs, allowing for splice variants • Complete pipeline for sequence cleaning, clustering and assembly • Detection, alignment and visualization of alternative splice forms • Visualization through intuitive graphical interfaces

Scientific problems for PTA • Proteomics • Gene discovery • Verify gene predictions for genome assembly • Detecting splice variants • Patterns of expression, tissue specificity • SNP detection • Combinations of all the above...

PTA – Contig view

PTA – Splice variant alignment

Paracel Oligocheck • Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2 • Searches oligos quickly against a whole genome • Software used by companies designing and synthesizing oligonucleotides, e.g. MWG

Ensembl mirror • Ensembl is a joint project between EMBL-EBI and the Sanger Institute. • A software system which produces and maintains automatic annotation on selected eukaryotic genomes. • Our site provides free access to selected areas of the data and software from the Ensembl project.

CGIAR – HPC Grid Computing • Sites: ILRI (Kenya), ICRISAT (India), IRRI (Philippines), CIP (Peru), BECA/Partners • [Map: grid sites annotated with their resources. Labels shown: 33 nodes; GeneMatcher2; 4 nodes; 49 nodes; 89 CPUs; 8 nodes; 4 nodes]

Thank you
