Bioinformatics in Cancer Biotechnology • Bob Stephens • Advanced Biomedical Computing Center • Advanced Technology Program • SAIC-Frederick, Inc. • National Cancer Institute at Frederick • April 19, 2007
Objectives • Overview/introduce bioinformatics concepts, applications and databases. • Describe interplay between bioinformatics, technologies and the web. • Profile importance of bioinformatics in cancer research. Cancer Biotechnology Series
What is bioinformatics? • Bioinformatics is the application of computational methods to the analysis of any type of biological data. • Bioinformatics has become a diverse, multidisciplinary field that originally derived from computer science and biological science.
Evolution of bioinformatics • Rapid technological advances in sequence determination set the pace for data acquisition. • Similar advances in computing power, algorithmic approaches for sequence analysis, and robotics-enabled instruments. • Co-evolution with web browser and programming language technologies.
Bioinformatics evolution (contd.) • Additional high-throughput technologies becoming available almost daily - microarrays, proteomics, population and genetic data, medical literature etc. • Data volume is increasing at the same time as data complexity. • Data distribution/synchronization is becoming an increasingly difficult task.
Interplay between technology and bioinformatics • New high-throughput (HT) technologies, e.g. mRNA microarray • Analysis and storage software • Computational infrastructure • Data integration
Example • mRNA expression chip (20,000 genes x 16 probes per gene), a few MB per sample. • Data normalization software. • Exon array - multiple probes for each exon for each of the 20,000 genes - one file is about 1 GB. • New normalization method requires all samples to be loaded simultaneously. • More complex analysis reveals alternative splicing etc.
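The slide does not name the normalization method. As a minimal sketch, assuming something like quantile normalization (a common array method, chosen here only for illustration), the code below shows why every sample has to be in memory at once: the reference distribution is the mean across all samples at each rank. The function name and the toy values are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (one list of probe values per array).

    Assumption: this is only an illustration of a method that needs all
    samples loaded together; the source does not say which method was used.
    Ties are broken by original position rather than averaged.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Sort each sample independently.
    sorted_samples = [sorted(s) for s in samples]

    # Mean across ALL samples at each rank: this is the step that
    # requires every sample to be available simultaneously.
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples)
        for r in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Indices of this sample's probes, ordered from smallest to largest value.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized
```

With roughly 1 GB per exon-array file, holding every sample for a step like this is what drives the memory requirement mentioned on the slide.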
Interface of technologies and biology • Experimental design very important in HT biology • Experiments shaped by data access and availability • Re-analysis of old data with new methods important
Bioinformatics - historical perspective • Stage 1 - the term bioinformatics is coined to represent what had been DNA and protein sequence analysis (ca. 1995). • Stage 2 - additional disciplines are rolled into bioinformatics, including literature mining, statistical analysis, and virtually anything to do with computational analysis of biological data (ca. 2000).
Bioinformatics - historical perspective (contd.) • Realization that bioinformatics is too broad a term; other disciplines break away, e.g. the OMICs fields (genomics, proteomics and others) (ca. 2001). • Still later (current), the realization is made that we won't be able to make sense of individual disciplines without integrating them; the term now changes to integrative biology or systems biology (ca. 2003).
Importance of bioinformatics • Bioinformatics has become a major part of both the NCI 2015 directive and the NIH Roadmaps. • Virtually impossible to perform biological research without some form of computer aided analysis, especially in areas like genomics and proteomics. • Important to keep scientific community in touch with developing technologies and capabilities for highest return on research investment.
Bioinformatics infrastructures • Command-line implementations. • Primitive GUI implementations. • Sophisticated GUI interfaces and application packaging. • Web interfaces and the Java language give platform-independent access. • PC-based, web-based and server-based architectures. • Multi-tier infrastructures distribute the computational burden.
What does bioinformatics technology involve? • Computer-readable form of some type or types of biological data (instruments). • Automation also requires programmable robotics capabilities (process science). • Computer infrastructure for storing and analyzing the data. • As data volume and complexity grow, the dependency on computer analysis increases.
Sources of bioinformatics technology • Computer science leveraged technologies including algorithms and data representation models, visualization frameworks and programming languages. • Web industry leveraged technologies including communication protocols, web servers and secure access. • Database industry derived connectivity and technologies. • Robotics and process engineering technologies for faster, cheaper throughput.
What can bioinformatics technology do for biological science? • Develop uniform data standards and controlled vocabularies to allow for integration of disparate sources/types of data. • Connect scientists to the entire wealth of knowledge, from basic science results to clinical trial data, in a context-sensitive manner. • Fully integrate the worldwide volume of knowledge, for example patient information (disease->treatment->outcome) across multiple centers, to allow for cross-comparisons.
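To make the controlled-vocabulary point concrete, here is a minimal sketch of mapping free-text diagnosis labels from different centers onto one preferred term, so that records can be compared. The vocabulary entries, function names and example labels are all hypothetical; a real system would draw on a curated terminology rather than a hand-written dictionary.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms.
CONTROLLED_VOCAB = {
    "breast carcinoma": {"breast cancer", "carcinoma of breast", "mammary carcinoma"},
    "non-small cell lung carcinoma": {"nsclc", "non small cell lung cancer"},
}


def build_lookup(vocab):
    """Flatten the vocabulary into a synonym -> preferred-term dictionary."""
    lookup = {}
    for preferred, synonyms in vocab.items():
        lookup[preferred] = preferred
        for synonym in synonyms:
            lookup[synonym] = preferred
    return lookup


def normalize_term(raw, lookup):
    """Return the preferred term for a free-text label, or None if unmapped."""
    key = " ".join(raw.lower().split())  # lower-case and collapse whitespace
    return lookup.get(key)


LOOKUP = build_lookup(CONTROLLED_VOCAB)
```

Returning None for unmapped labels, instead of guessing, keeps unknown terms visible so a curator can extend the vocabulary.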
NCI Resources • caBIG NCICB initiatives to develop an integrated data/tool environment. • Long-term project requiring unprecedented cooperation, sharing. • Short-term solutions for day-to-day problems. • Solution - use multiple approaches, staged implementation and layered technologies.
ABCC hardware • 128 CPU Linux cluster (3.0 GHz processors). • 256 CPU Linux SMP box with 1 TB memory. • 64 CPU IRIX SMP box with 256 GB memory. • 32 CPU IBM AIX SMP computers. • 16 CPU IBM HPC AIX SMP computer. • 8 x 8 CPU IRIX computers. • Other miscellaneous computers, disk storage, tape backup and network connectivity. • Graphics visualization wall.
ABCC Organization • Networking and Security • System administration • Scientific program development • Bioinformatics support • Staff ~ 40
ABCC Training Programs • Classes for NIH/NCI scientists: • Unix, GCG, Java, High throughput sequence analysis, Geospiza (LIMS) • Eudora, Advanced Eudora, Webmail • Homology, Docking, QSAR, Intro to Modeling, Phred, Phrap, Consed • One-on-one consulting services and training. • Organize and host vendor specific training in genomics, pathways, and modeling
ABCC Support within ATP Proteomics and Analytical Technologies (LPAT) Computational Support Database Tools/Pathways Mass Storage and Archive Pattern Analysis and Clustering Molecular Technologies (LMT) Image Analysis (IAL) Computational Support Database Tools and LIMS Mass Storage and Archive Bioinformatics/Web Pattern/SNP Analysis ABCC Algorithm and Software Image Database Mass Storage and Archive Viz Technology Development Gene Expression (GEL) Protein Chemistry (PCL) Software Support Gene Assembly and Validation Protein Expression (PEL) Animal Sciences (LASP) Mass Storage Database POET/Web Cancer Biotechnology Series
ABCC applications • Sequence analysis - protein and nucleic acid, GCG and EMBOSS. • Sequence assembly, SNP detection. • Gene finders, analysis tools. • Molecular modeling, docking. • Molecular evolution and phylogeny. • Computational chemistry. • Linkage analysis. • Proteomics. • Classification tools (microarray and proteomics).
ABCC databases • GenBank and derived divisions. • RefSeq, WGS, UniGene divisions. • dbSNP, Gene, OMIM, HomoloGene. • UCSC, EBI and NCBI genome datasets. • LIMS systems, data management. • UniProt, PDB, PIR, iProClass, Swiss-Prot. • CGAP, MGC data files, pathways. • MEDLINE, TRANSFAC and repeats data files.
ABCC web resources • ABCC General information web page http://www.abcc.ncifcrf.gov • ABCC account application information http://www.abcc.ncifcrf.gov/apps_apply.shtml • ABCC Training web page http://www.abcc.ncifcrf.gov/training/courses.shtml • ABCC scientific applications webpage http://www.abcc.ncifcrf.gov/app/htdocs/appdb/index.php • ABCC GRID Database web page http://grid.abcc/ncifcrf.gov • ABCC Pipelines web page http://www.abcc.ncifcrf.gov/app/login/login.php Cancer Biotechnology Series
The role of bioinformatics in cancer research • Diagnosis - identify classifiers to better sub-divide cancer etiologies into groups. Better individual-level data to match treatment to the individual. • Treatment - identify better methods to track treatment progress and indicate problems earlier. • Prevention - understand mechanisms for cancer initiation, progression and development and identify targets in this process. • Connect data from geographically distributed cancer patients for more complete analysis.
Protein analysis tools • Protein composition, isoelectric point, molecular weight analysis tools. • Comparable alignment/searching tools for proteins. • Protein secondary structure prediction tools. • Protein structure modeling tools.
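As a rough illustration of the first bullet, the sketch below computes amino acid composition and an approximate average molecular weight for an unmodified peptide chain. The function names are hypothetical, and the residue masses are standard approximate average values rather than anything taken from the ABCC tools.

```python
from collections import Counter

# Approximate average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini


def composition(sequence):
    """Count each amino acid in a one-letter protein sequence."""
    return Counter(sequence.upper())


def molecular_weight(sequence):
    """Approximate average molecular weight of an unmodified peptide chain."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
```

Isoelectric point estimation is left out here because it depends on a choice of pKa table, which the slide does not specify.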
Genomics tools • Gene finder and general genome annotation tools. • Cross genome comparison tools and databases. • Large scale sequence assembly and polymorphism identification tools. • Genomic visualization tools (UCSC, NCBI, Ensembl). • Data cleansing tools - vector screening, repeat masking.
Gene expression tools • EST Clustering and differential expression analysis tools and databases. • SAGE Analysis tools and databases. • Microarray data collection, calibration and analysis tools and databases. • Gene clustering and visualization tools. • Integration tools - pathways, regulatory networks and medical literature. • Databases for housing and querying the data.
Proteomics tools • Mass spectrometry tools for peptide identification. • Fragment classification tools for identification of diagnostics. • Peptide fragment resolution tools - identification of protein mixtures from peptide sets. • Databases for storing and querying the data.
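Identifying proteins from peptide sets usually starts from a list of peptides each candidate protein could produce. The slide does not name an enzyme, so the sketch below assumes trypsin with the common simplified rule (cleave after K or R unless the next residue is P) and no missed cleavages. The function name and example sequence are hypothetical.

```python
def tryptic_digest(sequence):
    """In-silico digest under a simplified trypsin rule.

    Assumptions: cleavage after K or R, suppressed when the next residue
    is P, and no missed cleavages. Real search tools handle more cases.
    """
    sequence = sequence.upper()
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if residue in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides
```

Matching observed peptide masses against such predicted peptide lists, one list per database protein, is one way a mixture can be resolved into its member proteins.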
Inherent bioinformatics problems • Keeping data sources synchronized and up to date. • Keeping applications up to date. • Remaining aware of the current palette of available tools and resources. • Separation between computer developers and biologist users of software and databases. • The silo concept - separate, dysfunctional units. • Lack of a common language or database schema.
Data Analysis • Pathway analysis • Polymorphism • Proteomics • Image analysis • Homology Modeling • Live polymorphism analysis (if time permits)
Pathway Analysis • Identify specific requirements of individual tumor. • Advance to detection from diagnosis. • Multiple points to cause aberrations and multiple points to act to correct them. • Identify/characterize tissue, cell specific targets.
Pathway Gene Set Analysis • Many experiments result in sets of genes, e.g. microarray, proteomics, literature searches etc. • Clustering genes based on expression etc. provides only the first dimension. • View prospective pathways impacted by changes in expression, protein levels, phosphorylation etc.
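One common way to ask whether an experimental gene set points at a pathway is an over-representation test. The slides do not say which statistic the ABCC pathway tools use, so the sketch below is only an assumption: a one-sided hypergeometric test on the overlap between the gene set and the pathway. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, gene_set_size, pathway_size, background_size):
    """One-sided hypergeometric p-value: P(overlap >= observed).

    overlap         - genes shared by the experimental set and the pathway
    gene_set_size   - genes in the experimental set
    pathway_size    - genes annotated to the pathway
    background_size - all genes that could have been selected
    """
    max_overlap = min(gene_set_size, pathway_size)
    total = comb(background_size, gene_set_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, gene_set_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total
```

When many pathways are tested against the same gene set, the resulting p-values would normally need a multiple-testing correction before being interpreted.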
[Figure: sample columns G5G8Tg1Liver, G5G8Tg2Liver, G5G8-/-1Liver, G5G8-/-2Liver, G5G8-/-3Liver]
Integrative Strategy for Microarray Analysis (flowchart labels) • Microarray Data • Clustering Analysis • Load into WPS • WSCP • Unassigned Genes • Integrate with WPS • Lists of Genes • Assign to uncharacterized pathway(s) • Assign to known pathway(s) • Putative Pathway • PSCP (shown three times)
Project Goal: Integrate Biological Data and/or Information Databases into Biological Networks • User input: Microarray Data, Proteomics • Protein Interaction Database (BIND, DIP etc.) • Comparative Genomics • P1 P2 • Protein Modification (Phos., Glyco.) • Gene regulation (Promoter etc.) • Gene Ontology • SNP & Haplotype Database (SNPinfo etc.) • Literature DB (e.g. Pubgene, ResNet) • NCBI resources (OMIM etc.) … • Statistical Evaluation • Network Expansion (high, low confidence)
One example of analysis scenario • Microarray data • Pathway analysis or clustering on local PC • Candidate gene sets • Candidate pathway sets • Pre-computed DBs or run-time computed, Internet-enabled • SNP & Haplotype data (SNPinfo; disease association) • Promoter Comparison (1. CGI generator 2. CoreSearch 3. ConsInspector) • Protein interaction • Literature-based (Pubgene etc., NCBI OMIM etc.) • GO • Known gene training • Weighted scoring (statistical analysis, filtering) • Final set of candidate genes (visualization and re-creation of the new subnetwork within the whole network) • Pathway expansion
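The weighted scoring and filtering step in this scenario can be sketched as a weighted sum over evidence sources followed by a threshold. Everything specific here is hypothetical: the evidence categories, the weights, the threshold and the gene names are placeholders, and the slide does not describe the actual scoring scheme.

```python
# Hypothetical weights for each evidence source (not from the slides).
WEIGHTS = {"expression": 0.4, "interaction": 0.3, "literature": 0.2, "snp": 0.1}


def score_gene(evidence, weights=WEIGHTS):
    """Weighted sum of per-source evidence scores in [0, 1]; missing sources count as 0."""
    return sum(weight * evidence.get(source, 0.0) for source, weight in weights.items())


def rank_candidates(evidence_by_gene, threshold=0.5, weights=WEIGHTS):
    """Score every gene, drop those below the threshold, and sort best first."""
    scored = {gene: score_gene(ev, weights) for gene, ev in evidence_by_gene.items()}
    kept = [(gene, score) for gene, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

The surviving genes would then seed the pathway expansion step, where neighbours in the interaction network are added with high or low confidence.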
Polymorphism Impacts • Variation within species as great as differences between closely related species • Confounds correlation analysis • Impacts gene structure and expression • Start with complete sequence for individual, obtain polymorphism data for populations/strains and breeds etc. • Strains/breeds allow for good start
Polymorphism Types • SNPs • Indels • STRs • Tandem • NonTandem (Copy number variation) • Retroelement • Complex • Inversion/translocation
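Of the polymorphism types listed, short tandem repeats (STRs) are easy to illustrate in code. The sketch below is a minimal regex-based scan for perfect tandem repeats; the function name, the unit-length limit and the minimum copy number are assumptions for illustration, not the method behind the ABCC polymorphism views.

```python
import re


def find_strs(sequence, max_unit=6, min_copies=4):
    """Find perfect short tandem repeats in a DNA sequence.

    Returns (start, unit, copies) tuples with 0-based start positions.
    Assumptions: repeat units of 1..max_unit bases, at least min_copies
    exact copies, and no tolerance for interrupted repeats.
    """
    # Group 1 is the shortest unit that works; \1{n,} demands further copies.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits
```

Comparing the copy number found at the same locus across strains or individuals is what turns a repeat like this into a polymorphism call.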
STR Polymorphism View
Strain Trace and Contig Coverage View
InDel Polymorphism Information View
Location Polymorphism Locator Query
STR Query results
Polymorphism Visualization
Proteomics Initiative: ABCC Projects • Disk Storage and Archiving (centralized storage) • LAN Support • Software Development • Spectral Filtering • Clustering/Biomarker Identification • Database Development and Update • Peptide identification DB • MS Integration with Pathways • ABCC Pathway tool • Provide Scalable Computational Resources • Software Optimization • Sequest (working with LPAT, Yates Lab, and Thermoelectron)
Raw Data • Binning • Biological Marker • Clustering
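The binning step between raw data and clustering can be sketched as follows, assuming each raw spectrum is a list of (m/z, intensity) pairs and that fixed-width bins are acceptable. The function name, the bin width and the choice to sum intensities are hypothetical; the slide does not specify how the ABCC pipeline bins spectra.

```python
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs.
    Returns {bin_start: summed_intensity}, ordered by bin start.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    binned = defaultdict(float)
    for mz, intensity in peaks:
        bin_index = int(mz // bin_width)
        binned[bin_index * bin_width] += intensity
    return dict(sorted(binned.items()))
```

Binning puts every spectrum on the same set of m/z features, which is what allows samples to be compared and clustered in the later steps.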