Workflow Systems in Bioinformatics and the Bioinformatics Educational Grid. Tan Tin Wee Associate Professor National University of Singapore tinwee@bic.nus.edu.sg Shoba Ranganathan, Victor Tong, Justin Choo, Richard Tan, G.S.Ong, Simon See, TS Lim, Mark de Silva and KSLim.
International Symposium on Grid Computing ISGC2004 “Making the World Wide Grid a Reality” 27 July 2004
In a Nutshell • Weaving several threads of development in Bioinformatics such as Workflow Integration and DataGrid (over the past 5 years or so) • to build an integrated educational grid “info”-structure • that will support HR development, education, training, self-learning etc. in the emerging discipline of bioinformatics • for conventional as well as “E-” education
Making the World Wide Grid a reality: Contribution of Bioinformatics • Bioinformatics is the science of using information and ICT to understand biology • It is driven by rapid progress in allied disciplines in the “New Biology” (genomics, proteomics, metabolomics, transcriptomics, other ‘omics, computational biology, systems biology), which generate unprecedented volumes of data • Despite this, Grid computing is not yet ubiquitous in life sciences
In Vitro, In Vivo, In Situ, In Silico Biology and Personalised Medicine • Imaging • Modeling • Simulation • Theoretical Biology • D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, Review, 2000
“Tools have changed, but the job hasn’t” Cartoons from talk by Rozhan Mohammed Idrus & Hanafi Atan, Universiti Sains Malaysia, APAN 2003 • Bioinformatics: emergent and almost pervasive in all biological and life science disciplines
Computational Demands and Data Processing in Life Sciences are expanding! • ‘omics, Genomics, Proteomics, Bioinformatics, Computational Biology, Medical Informatics, BioStatistics • LIFE SCIENCE INFORMATICS • LIFE SCIENCES and HEALTH SCIENCES
Where does the Grid fit in? • Life Science Informatics • BIOTECHNOLOGY and NEW BIOLOGY • INFOCOMMUNICATIONS TECHNOLOGY
Why no Grid here yet? • Lack of widespread awareness and training in computational skills in the life sciences community • Few computational, networking and grid computing experts with first-hand domain knowledge in life sciences • Data-intensive nature of life science grid computing applications • Labour-intensive nature of building life science grids • Lack of Killer Applications • Bioinformatics is a rapidly changing target
Biotech and InfoComm Technology - Parallel Growth (timeline figure, 1990 to 2004) • Biotechnology milestones shown: Systems Biology, Worm Genome, Dolly & DNA chips, Human Genome, Genome Project, Microbial Genomes, BioX • InfoCommunication Technology milestones shown: Dotcom boom and crash, Wais, Gopher, Lambda Networking, Internet 2, WWW boom, Grid Computing, Java, ISP
Grids applied to Life Sciences • Internet2 demos of late 1990s: quasi-realtime data collection from synchrotrons for 3D structure determination • iGrid98, SC’98, SC’99, SC2003 (most geographically dispersed grid computing award – arthropod phylogenetics) • Anthrax research – United Devices • Encyclopedia of Life (EOL) • OBIGrid • Kansai BioGrid • Large scale mega projects • Not a WorldWide Grid
iGrid98, SC’98 http://www.startap.net/startap/igrid98/maxLikeAnApbionet98.html
INET’99 demo http://www.bic.nus.edu.sg/admin/News/Jun99/inet/inet99.html http://www.startap.net/startap/APPLICATIONS/collabForStruct.html
When will the World Wide Grid be a reality for life sciences? • Like the World Wide Web: everyone uses it, from publishing content to accessing it • Plug and play: Tap computational cycles anywhere from everywhere anytime • Secure to use • Killer application like Mosaic in 1993 • Generate meaningful results • Control key tools and automate mundane processes • Connect people, computation, data, instruments
Focus on two key areas • Grid-enabled bioinformatics workflow systems as the killer application • Building a bioinformatics educational grid
Workflow integration • 1996/7 Java based FlowBot project • 1998 Inet98 Internet Flowbot Protocol • http://www.isoc.org/inet98/proceedings/8x/8x_1.htm • 1998 Application to Life Sciences – Workflow Integration – BIC-CNPR joint project – Lim et al • 1998 PSB’98 From Sequence to Structure to Literature: The protocol approach to Bioinformation (Wu et al); spinoff company GeneticXchange.com • 2001 Spinoff Company KOOPrime Pte Ltd • 2002 BioWorldWideWorkFlow initiative in APBioNet • Workflow integration is the Killer Application for a World Wide Life Science Grid!
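As a rough illustration of what workflow integration means in practice, the sketch below chains independent analysis steps so that the output of one step becomes the input of the next. The step functions and the `run_workflow` helper are hypothetical names chosen for this example; they are not taken from FlowBot, KOOPrime or any other project listed above.

```python
from typing import Callable, List

# A workflow step is any function that takes a sequence string and returns one.
Step = Callable[[str], str]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def transcribe(seq: str) -> str:
    """Rewrite a DNA strand as RNA by replacing T with U."""
    return seq.upper().replace("T", "U")


def run_workflow(steps: List[Step], data: str) -> str:
    """Feed the data through each step in order (hypothetical helper)."""
    for step in steps:
        data = step(data)
    return data


result = run_workflow([reverse_complement, transcribe], "ATGC")
print(result)  # GCAU
```

Because each step shares the same signature, steps can be reordered or swapped without touching the runner, which is the property a grid-enabled workflow system would need in order to dispatch steps to different machines.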
Bioinformatics Educational Grid • 2001 - S* Life Sciences Informatics Alliance – 3 years of experience in online bioinformatics education: 5 courses and >1000 persons worldwide trained in basic bioinformatics; team of Online Teaching Assistants • Workshop on Education in Bioinformatics: WEB01, WEB02, WEB03, WEB04 • 2004 – Problem Based Learning (PBL) in Bioinformatics online using emeet.nus.edu.sg • 2004 – Building the Bioinformatics Educational Grid • Education is the answer to making the World Wide Grid a reality
Background • Biologists and Biotechnologists need to be equipped and trained to carry out tomorrow’s biological research today! • Integration of • Network Infrastructure • Databases • Software • Computational Grid • Online educational and teaching and learning materials • Education + Killer Application
1. Network Infrastructure APAN Advanced Research Network 1996-2004
Internet2 and beyond • 1st Country outside North America to connect: SINGAREN – Singapore Advanced Research and Education Network • TANET2 from Taiwan and APAN-Transpac were next. • Then Abilene…. • … Today’s Starlight and Lambda networking
2. Databases • Key major databases - 1.5 Terabytes today! • Publicly accessible data over the Internet doubling every 12 to 18 months http://www.bio-mirror.net/ • Mirroring Moore’s Law for chip technology
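To make the doubling claim concrete, the sketch below applies the standard doubling formula, size(t) = size(0) × 2^(t / T), to the 1.5 Terabyte figure on the slide. The three-year horizon is an assumption chosen for illustration; the slide gives no projection.

```python
def projected_size_tb(initial_tb: float, months: float, doubling_months: float) -> float:
    """Size after `months`, if the data doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


# 1.5 TB today, projected 36 months ahead (illustrative horizon).
fast = projected_size_tb(1.5, 36, 12)  # doubling every 12 months: 3 doublings
slow = projected_size_tb(1.5, 36, 18)  # doubling every 18 months: 2 doublings

print(fast, slow)  # 12.0 6.0
```

Under these assumptions a mirror holding 1.5 TB would need between 6 TB and 12 TB three years later.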
BIODATABASES: Genbank, Genbank Genomes, InterPro, PDB, BlastDB, BLOCKS, DDBJ, EMBL, ENZYME, PROSITE, PIR, PFAM, REBASE, NCBI REFSEQ, SRC, SWISSPROT, Taxonomy, TrEMBL, UniGene, euGenes
BioDataGrid: Registry of Databases • NUS BioDataGrid initiative everest.bic.nus.edu.sg/lsdb • Singapore National Grid Office has a new initiative – to be announced soon. • Facilitate varying levels of granularity of access to structured and unstructured biological data
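One way to picture a registry with varying levels of access granularity is sketched below. This is not the NUS BioDataGrid design: the class names, the placeholder database names and the granularity labels ("database", "record", "field") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DatabaseEntry:
    name: str                       # placeholder name, not a real database
    structured: bool                # True for structured records, False for free text
    access_levels: Tuple[str, ...]  # hypothetical granularity labels


class DatabaseRegistry:
    """Minimal in-memory registry keyed by database name (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatabaseEntry] = {}

    def register(self, entry: DatabaseEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_access(self, level: str) -> List[str]:
        """Names of databases that can be queried at the given granularity."""
        return sorted(
            e.name for e in self._entries.values() if level in e.access_levels
        )


registry = DatabaseRegistry()
registry.register(DatabaseEntry("db_sequences", True, ("database", "record", "field")))
registry.register(DatabaseEntry("db_literature", False, ("database", "record")))
registry.register(DatabaseEntry("db_images", False, ("database",)))

record_level = registry.find_by_access("record")
field_level = registry.find_by_access("field")
print(record_level)  # ['db_literature', 'db_sequences']
print(field_level)   # ['db_sequences']
```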
3. Software • APBioBox project • Funded by IDRC Pan Asia Networking R&D Grant • Rapid and Easy Replication of Grid enabled software crucial to grid growth
3. Software - APBioBox • Funded by International Development Research Centre of Canada, under their Pan Asia Networking (PAN) ICT grant • To build an easily installable, widely and freely accessible, integrated suite of bioinformatics applications to facilitate training and research amongst biologists in developing countries • A/P Tan Tin Wee, National University of Singapore • Adjunct Professor Shoba Ranganathan, NUS and Chair Professor, Macquarie University, Sydney • Ong Guan Sin, Consultant programmer, Singapore Computer Systems Pte Ltd
3. Software - APBioBox • Shrink-wrapped bundle of some 300 software applications used in bioinformatics • Preconfigured and integrated • Installs in 15 minutes on a Linux RedHat 9 platform, compared with the several weeks such a setup typically takes • Partnered with Sun Microsystems to come up with Bio-Cluster Grid, the equivalent on the Sun Solaris platform • Available on CD-ROM and as a download: http://www.apbionet.org/apbiogrid/apbiobox
3. Software – APBioBox applications • Logical abstraction through Java wrappers built for: • EMBOSS ~160 applications • PHYLIP ~30 applications • HMMER • CLUSTALW • BLAST • FASTA, SSEARCH (in progress) • MySQL • SRS (Lion Bioscience) • Globus Grid Toolkit 2.4 • Unix Utilities • KOOPlite • Key Bioinformatics Databases (in progress)
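The idea of a wrapper layer can be sketched as follows: callers name parameters logically and the wrapper translates them into the tool's own command-line flags. The APBioBox wrappers were written in Java; this Python version only illustrates the pattern, and the tool name `example_aligner` and its `--in` and `--out` flags are invented for the example rather than taken from any real program.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolWrapper:
    """Maps logical parameter names onto one tool's command-line flags."""

    program: str              # executable name (hypothetical in this sketch)
    flag_for: Dict[str, str]  # logical parameter name -> command-line flag

    def build_command(self, **params: str) -> List[str]:
        """Return the argument list without executing anything."""
        command = [self.program]
        for name, value in params.items():
            if name not in self.flag_for:
                raise KeyError(f"{self.program} has no parameter named {name!r}")
            command += [self.flag_for[name], value]
        return command


# Hypothetical tool and flags, for illustration only.
aligner = ToolWrapper("example_aligner", {"sequences": "--in", "result": "--out"})
command = aligner.build_command(sequences="input.fasta", result="aligned.txt")
print(command)
```

Keeping the flag mapping in data rather than in code means that a new tool can be supported by adding one `ToolWrapper` entry, so a workflow layer can address every tool through the same logical vocabulary.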
4. Computational Grid • APBioGrid 2002 • To facilitate the building of shared computational grid resources for the Asia Pacific region.
APBioGRID Project • APBioGrid aims to provide computational resources to bioinformaticians and biological researchers to facilitate education and research through sharing each other’s computers over the Grid
Why is the APBioNet Grid needed? • Large-scale [life] science [..] are done through the interaction of people, heterogeneous computing resources, information systems, and instruments, all of which are geographically and organizationally dispersed. • The overall motivation for “Grids” is to facilitate the routine interactions of these resources in order to support large-scale [life] science […]. Altered from Bill Johnston 27 July 01
Why the “Grid”? • 1998: advent of Grid Computing – distributed computing • E.g. Tapping idle CPU cycles globally in the SETI project or the Anthrax online projects. • “Like tapping electrons from the power grid, just plug in the appliance into the socket” • Currently, one of the hottest areas in ICT. • So the basis for BioGrids has been laid
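The "tap idle cycles" idea amounts to splitting a job into independent work units and handing them to whichever workers are free. The sketch below imitates this on a single machine: a local thread pool stands in for remote grid nodes, which is an assumption of the example and not how the SETI or Anthrax projects distribute work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Each sequence is an independent work unit that any worker could process.
work_units: List[str] = ["ATGC", "GGCC", "ATAT"]

# The pool hands units to free workers; map() returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(gc_fraction, work_units))

print(results)  # [0.5, 1.0, 0.0]
```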
5. Online Learning Material • Eight institutions from 5 continents since 2001 – The S* Life Science Informatics Alliance • Sweden: Karolinska Institutet, University of Uppsala • USA: Stanford University, University of California, San Diego • Singapore: National University of Singapore • Australia: University of Sydney, Macquarie University • South Africa: University of the Western Cape
Wide Range of S* learning materials - Tutorial ppt presentation materials on introductory bioinformatics - Frequently Asked Questions in Forum discussion archives - Overview lectures on: • Introductory Molecular Biology • An Overview of the Computational Analysis of Biological Sequences • Transcript Analysis and Reconstruction • Comparative Genomics • Representations and Algorithms for Computational Molecular Biology • Protein Structure Primer, Structure Prediction and Protein Physics • Genomics and Computational Molecular Biology Genomics • Protein and Nucleic Acid Structure, Dynamics, and Engineering • Proteomics and Proteomes • Structure Prediction for Macromolecular Interactions • Protein-Ligand Modeling • Microarray informatics
Goals of S* • Provide a GLObal Bioinformatics Unified Learning Environment (GLOBULE) made up of modular courses in the disciplines of bioinformatics, medical informatics and genomics • Provide accessibility to the highest possible quality of online courseware approved by the educators from the host institutions. • Develop an integrated modular learning environment that allows a student to select from both pre-requisite modules and advanced modules in order to build a comprehensive program.
S* Geographical Comparison (chart; surviving labels: North America, Africa, South America)
Feedback • Pretty good. A few rough edges but I'm sure you'll work them out over time. I really enjoyed it. Most of the lectures were very well presented and the participants in the forums helpful. I'm very impressed at the amount of work that has obviously gone into setting up the course. ~ Alan Wardroper, Thailand • The international participation of the lecturers and students. The relevance of the field of bioinformatics in meeting the biomedical needs of today. The level of communication provided by the IVLE system enhanced learning considerably. The range of professional and academic background of students. The technical support provided by SStar was rapid and efficient to queries. ~ C.A.O. IDOWU, England
Feedback • To think that a world-class, web based education with such valued lectures is brought to your desk free of cost is impossible elsewhere. The course was wonderfully well managed. Our requests and problems were quickly and well attended to. I had a great time doing this course and thank the S*STAR team whole heartedly for making me a fortunate participant with this fantastic experience. ~ Naidu Ratnala Thulaja, Singapore • I think it is a very useful course, it is exactly what it says it is: an introduction to bioinformatics. It covers nicely major topics and provides enough information in order for us to understand what bioinformatics is all about. I enjoyed it very much and I am even a bit sad it is over. Thank you very much! ~ Patricia Severino, Romania
Emergence of Grid Technologies • “The Grid” - Grid Computing • Next Generation Internet technologies (Internet2) and their applications • Computational Grids • Informational Grids • Access Grid • Educational Grids do the same for the educational process – the learner or the teacher can tap into learning materials, tools, information, computational hands-on, in the so-called classroom without walls!
Educational Grid for Bioinformatics • Increase repository of regularly used bioinformatics software • Registry of tools, software and databases • Higher level abstraction of resources • Virtual classrooms and discussions • Distributed repository of learning objects and materials • Self assessment tests • Project Based modules • Problem Based Learning • Integrated learning environment for the practice of bioinformatics in the life sciences • Support both conventional and e-learning/e-education
Problem-Based Learning (PBL) • Started at McMaster University Medical School over 25 years ago • Encourages hands-on work and critical thinking. Its hands-on approach is particularly suited to bioinformatics, where many of the skills require practical execution and the problems encountered are generally open-ended. • PBL encourages: • acquisition of critical knowledge. • problem solving proficiency; problems tackled are generally open-ended. • self-motivated learning. • team participation.
Role Change • In PBL, there’s a fundamental change in the role played by the participants. • a facilitator guides the entire session. • a scribe records the entire session. • some participants field questions; others try to brainstorm and provide answers. • There is no student-teacher relationship; everybody is treated equally. • Focus is on peer learning
PBL Asynchronous Sessions • S* is currently experimenting with PBL sessions using the IVLE discussion forum and, eventually, a web-based collaboration platform – TWiki (http://twiki.org) • Considerations/issues to resolve: • How to accommodate so many participants • How to host so many TWiki pages • Will participants with slow connections be able to access them?
PBL synchronous sessions • Emeet.nus.edu.sg • CENTRA technology • Low bandwidth requirement • VOIP for voice, Video if necessary • Agenda, Whiteboard, Shared applications, File transfer, Web Safari
Projects • 8 different projects • 8 teams of volunteer facilitators • 300 students into 8 groups • Two phases • Set them up to solve various topical bioinformatics problems from bottom up in PBL style.
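Since 300 does not divide evenly by 8, the groups cannot all be the same size. The slide does not say how students were allocated; the sketch below assumes a simple round-robin assignment, which keeps group sizes within one of each other.

```python
from typing import List


def assign_round_robin(n_students: int, n_groups: int) -> List[List[int]]:
    """Deal student indices 0..n_students-1 into groups like cards."""
    groups: List[List[int]] = [[] for _ in range(n_groups)]
    for student in range(n_students):
        groups[student % n_groups].append(student)
    return groups


groups = assign_round_robin(300, 8)
sizes = [len(group) for group in groups]
print(sizes)  # [38, 38, 38, 38, 37, 37, 37, 37]
```

With these numbers each team of volunteer facilitators would be responsible for 37 or 38 students.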
Online Delivery Mechanism • We are considering, and want to explore, various advanced networking technologies, particularly video conferencing software. • e.g. AccessGrid™ http://www.accessgrid.org/