Opportunities of Systems Engineering/Operations Research in Bioinformatics. Hyoungtae Kim (Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU). Outline. Introduction on Bioinformatics Paradigm Shift in Biology Systems Engineering/Operations Research for Bioinformatics
Opportunities of Systems Engineering/Operations Research in Bioinformatics Hyoungtae Kim (Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)
Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research for Bioinformatics • About Funding Opportunities • Conclusions
What is Bioinformatics/Computational Molecular Biology? • The application of mathematical, statistical, and computational tools to the analysis of very large biological data sets • In most cases, it involves analyzing information stored in large databases • Multi-disciplinary: Biology, Mathematics, Statistics, Physics, Chemistry, Computer Science, Engineering • It has not yet found its own natural home department
Why Bioinformatics? • Current data analysis tools are far from efficient at analyzing vast amounts of biological data • The pace of biological understanding is much slower than the pace of the technological advances that have powered experimental discovery and data collection • Benefits: advances in the detection and treatment of disease and in the production of genetically engineered foods; profound impact on health and medicine
Three Elements of Bioinformatics Research • Significant Biological Problems: gene, motif, and signal recognition; protein structure prediction; metabolic pathway deduction; etc. • Theory & Methods: algorithms; statistical methods; ontologies; etc. • Data: microarrays; mass spectrometry; etc.
Prerequisites of Bioinformatics • A scientific mind, plus: • Basic knowledge of Molecular Biology • Prokaryotic and eukaryotic cells • Genes, codons, DNA, RNA, the central dogma of biology • Etc. • Computing Skills • Programming languages: Python, Perl, Java, etc. • Knowledge of relational databases, etc. • Other Skills • General statistical knowledge • Optimization tools: math programming, network optimization, etc.
Various Problems in Bioinformatics • Standard Problems: DNA and Protein Sequence Analysis; Gene Finding and Prediction; etc. • Emerging Problems: Microarray Experiment and Data Analysis; Protein Structure Prediction; Deduction of Metabolic Pathways; and more…
Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research for Bioinformatics • About Funding Opportunities • Concluding Remarks
Paradigm Shift in Biology • The Human Genome Project (HGP) • Working Draft of the human genome (2001) • Goal of the HGP = sequencing of the human genome • Shift from hypothesis-driven reductionism to a discovery-science approach • Drove the development of high-throughput technologies and computer applications to transmit, analyze, and model very large data sets
Paradigm Shift in Biology • High-throughput Technologies • Microarrays – allow the expression of thousands of genes to be surveyed at one time • Protein Arrays – can examine all proteins in a cell and check if they are interacting under designed conditions • Mass Spectrometry – The basic modality is protein mass fingerprinting
Paradigm Shift in Biology • Microarray Chip Technology • Allows data collection in a high-throughput manner • Can put all genes in a microbe on a chip • Interpretation of the data is very challenging [Figure: “Raw Data” expression-level matrix (genes × samples) and the gene × gene similarity or distance matrix derived from it]
253x15154 Microarray Gene Expression Data: 162 cancer vs 91 normal patients
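One way to get from a raw expression matrix to a gene-by-gene distance matrix is sketched below. This is only an illustration: the slides do not say which similarity measure is used, so the Pearson correlation and the `1 - r` distance are assumptions, and the function names and the tiny `expression` table are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_distance_matrix(expression):
    """Gene x gene matrix of 1 - r, where each row of `expression` is one gene."""
    return [[1.0 - pearson(gi, gj) for gj in expression] for gi in expression]

# Hypothetical toy data: 3 genes measured in 3 samples.
expression = [
    [1.0, 2.0, 3.0],   # gene 0
    [2.0, 4.0, 6.0],   # gene 1, perfectly correlated with gene 0
    [3.0, 2.0, 1.0],   # gene 2, perfectly anti-correlated with gene 0
]
distances = correlation_distance_matrix(expression)
```

With this choice, co-expressed genes end up close to 0 and anti-correlated genes close to 2; other measures (for example Euclidean distance) would fit the same slot.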
Paradigm Shift in Biology [Figure labels: Genes and proteins; Protein-protein interaction data; Gene activity data; Protein structure data; Proteomic data; Regulatory elements; Metabolite data; Black box] • A gigantic amount of biological information is hidden in these data and in the relationships among them!
Paradigm Shift in Biology • Concept of Systems Biology • The reductionist paradigm has been phenomenally successful in biology since the 1950s • The genomics era has produced exhaustive lists of biological parts (i.e., genes and proteins) together with their functional characteristics • A system-level perspective is required to make sense of how all of these individual parts emerge and act collectively to perform a biological function
Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research tools for Bioinformatics • About Funding Opportunities • Concluding Remarks
Systems Engineering/Operations Research tools • Three Categories • Network & Optimization: Combinatorial; Integer Programming; Dynamic Programming; Network Optimization; Minimum Spanning Tree; etc. • Statistics: MLE; Regression; Sampling; Linear Model; Cross Validation; Statistical Estimation and Test; Multivariate Analysis (or ANOVA); Wavelet Transformation; Bayesian Networks; etc. • Stochastics: Hidden Markov Models; MCMC; Simulation Models; etc.
Systems Engineering tools for Bioinformatics • Some Examples • Hidden Markov Model for Gene Finding • Dynamic Programming for Sequence Alignment • Integer Programming for Protein Folding • Minimum Spanning Tree approach to Clustering for Motif Identification (Xu et al., 2001) • And many more …
A Significant Biological Problem • Identification of Transcription Factor Binding Sites (Motifs) • A gene’s transcriptional level is regulated by proteins (transcription factors), which bind to specific sites in the gene’s promoter region, called binding sites • The binding-site identification problem is to find short “conserved” fragments in a set of genomic sequences • Features of transcription factor binding sites • These short DNA fragments in the upstream regions of genes are generally very similar to each other • They occur at relatively high frequencies compared to other sequence fragments
Data Collection • Data Set (D) = set of all short DNA fragments in the upstream regions of genes • Microarray gene expression technologies allow a simultaneous view of the transcription levels of many thousands of genes under various cellular conditions • Upstream regions of genes: GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG CCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGT CCAGCTACTCGGGAGGCTGAGGCAGAAGAATCGCTTGAAT GAGATTGCACTGAGCTGAGATCACGCCACTGCGCTCCAGC GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAATGAT AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC ACAGCTACTGCAAACAACCATGATGATGATACATCTCTTG GGTTGTTTGAGACACATTCTATGCTCCTTGATTTGATTGG GGTTCCTTGGGGACTTGGAGGTGACGAAAGCCTCCCTGGG ACCTTCACTTCTCTAATATCAAGCTTCAGCAACCTGCTCC CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC CACCCCAGCCAGACTCTGACTTCACTCCCGGCAGGTTTCA CAGCAGTTGGAGCGAGCTGGCTTCTTGCGGTAGGCAGCCA GCTCCCAATAGTCCTCGTTTCCTGGTAATCTCATGCTTGG • Experiment: find groups of genes having correlated expression profiles
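The data set D of short fragments could be built roughly as follows. This is a minimal sketch, assuming that "short DNA fragments" means every window of a fixed length `k` in each upstream region; the slides do not state the fragment length, and the function names and the toy `regions` list are hypothetical.

```python
from collections import Counter

def fragments(sequence, k):
    """Return every length-k window (k-mer) of a DNA sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_data_set(upstream_regions, k):
    """Count how often each k-mer occurs across all upstream regions."""
    counts = Counter()
    for region in upstream_regions:
        counts.update(fragments(region, k))
    return counts

# Hypothetical toy input; real upstream regions would be much longer.
regions = ["ACGTACGT", "TTACGTAA"]
data_set = build_data_set(regions, k=4)
most_common_fragment, occurrences = data_set.most_common(1)[0]
```

Counting occurrences alongside the fragments makes it easy to inspect the "relatively high frequency" feature that binding sites are expected to show.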
Some test data sets are available on the internet or in the literature • For example • CRP binding sites: 18 sequences of 105 bp • Yeast binding sites: 8 sequences of 1000 bp • Human binding sites: 113 sequences of 30 bp
[Figure: CRP binding sites, 18 sequences of 105 bp (legend: A, C, T, G)]
Theory & Methods • Traditional approaches • Various sampling techniques, including Gibbs sampling • EM Algorithm • Greedy Algorithm • Multi-Order Markov Chain Algorithm • All of these are heuristic algorithms, so the problem remains challenging and unsolved
Brief Review: Minimum Spanning Tree • Input = A graph, G = (V,E), with weighted edges • Output = the cheapest subset of edges that keeps the graph in one connected component • Two Popular Algorithms • Prim’s Algorithm • Kruskal’s Algorithm
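For reference, Kruskal's algorithm can be sketched in a few lines. The edge format `(weight, u, v)`, the function name, and the small example graph are assumptions made for this illustration only.

```python
def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning tree.

    `edges` is a list of (weight, u, v) tuples with vertices numbered
    0 .. num_vertices - 1. Edges are scanned from lightest to heaviest, and an
    edge is kept only if it joins two components that are still separate.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

# Hypothetical 4-vertex graph.
example_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 1, 3), (3, 2, 3)]
mst = kruskal(4, example_edges)
total_weight = sum(weight for weight, _, _ in mst)
```

Prim's algorithm reaches a tree of the same total weight by growing it from a single start vertex instead of sorting all edges first.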
Theory & Methods • Minimum Spanning Tree approach • Step 1: Define a distance measure on the data set (D), and compute the distance between each pair of data points (i.e., for all A, B in D) • The higher the sequence similarity between two fragments, the smaller the distance between their mapped positions
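Step 1 might look like the sketch below. The slide does not name the distance measure, so the Hamming distance (number of mismatching positions between two equal-length fragments) is an assumption chosen only because it shrinks as sequence similarity grows; the function names and sample fragments are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length fragments differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(frags):
    """Pairwise distances for all fragments A, B in the data set."""
    n = len(frags)
    return [[hamming(frags[i], frags[j]) for j in range(n)] for i in range(n)]

# Hypothetical fragments of length 4.
frags = ["ACGT", "ACGA", "TTTT"]
dist = distance_matrix(frags)
```

Any other measure can be substituted as long as more similar fragments receive smaller distances.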
Theory & Methods • Minimum Spanning Tree approach • Step 2: Find the MST, T, representing D, with edge weights defined by the distance measure, and treat the task as a data clustering problem on T • Example: removing three edges e1, e2, e3 from T identifies four clusters, c1 to c4 [Figure: tree T with edges e1, e2, e3 separating clusters c1, c2, c3, c4]
Evaluation of the MST Method • Comparison with Other Methods • MST is based on a combinatorial approach and can identify all clusters of possible binding sites, whereas existing heuristic methods are likely to miss some clusters • The implemented results are at least as good as those of other methods • The simple structure of a tree facilitates efficient implementations of rigorous algorithms
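Step 2 can be illustrated with the sketch below, which builds an MST from a distance matrix with Prim's algorithm and then cuts edges to form clusters. The rule used here, removing the heaviest `num_clusters - 1` edges, is an assumption for illustration; the slides do not say how the edges to remove are chosen. The function names and the toy `positions` data are hypothetical.

```python
def prim_mst(dist):
    """Return MST edges (weight, u, v) for a symmetric distance matrix."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection to the tree
    link = [-1] * n             # tree vertex providing that connection
    best[0] = 0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if link[u] >= 0:
            edges.append((dist[u][link[u]], link[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v], link[v] = dist[u][v], u
    return edges

def mst_clusters(dist, num_clusters):
    """Cut the heaviest num_clusters - 1 MST edges and return the clusters."""
    n = len(dist)
    if not 1 <= num_clusters <= n:
        raise ValueError("num_clusters must be between 1 and the number of points")
    kept = sorted(prim_mst(dist))[:n - num_clusters]
    label = list(range(n))

    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]
            x = label[x]
        return x

    for _, u, v in kept:
        label[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical data: six points on a line, forming three obvious groups.
positions = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in positions] for a in positions]
clusters = mst_clusters(dist, 3)
```

Because clusters correspond to subtrees, the search for a good partition is carried out on n - 1 tree edges rather than on all pairs of points.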
Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research tools for Bioinformatics • About Funding Opportunities • Concluding Remarks
Funding Overviews by Funding Institution (Top) / Field of Research (Bottom) • Total of $54.1 billion in FY2004 [Figure: Percentage of Total Federal Funding, Preliminary 2004 Statistics; field labels: Environmental science, Physical science, Life science, Engineering; values shown: $9.1 billion, $29.3 billion] Source: National Science Foundation/Division of Science Resources Statistics, Survey of Federal Funds for Research
How to Search for Funding Opportunities? • NIH Computer Retrieval of Information on Scientific Projects (CRISP) • http://crisp.cit.nih.gov • NIH Office of Extramural Research (OER) • http://grants1.nih.gov • Other Websites • http://www.grants.gov • http://fedgrants.gov • http://www.nsf.gov/pubsys/ods/index.html
Growing Opportunities in Bioinformatics From CRISP Search Data
NIH Funded Projects in 2004 From CRISP Search Data • Searched all related Institutes, Centers, and States for the 2004 fiscal year • Number of NIH grants in Bioinformatics: 826 • Systems Biology: 80 grants • Microarray: 214 grants • Cancer: 63 grants
NIH Funding Opportunities for 2004 ~ From http://grants1.nih.gov • 2004 Program Announcements (PAs) • Total of 171 PAs • Larger variety of topics • Cancer is the most prevalent topic • Many call for a “multidisciplinary” outlook on their topics • 2005 Requests for Applications (RFAs) • Total of 68 RFAs • Although listed for 2005, some application deadlines have passed • 2 directly related to bioinformatics • Cancer is still the most prevalent topic
Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research for Bioinformatics • About Funding Opportunities • Conclusions
Developing Potential Research Plans • Two Takeaways • The Systems Engineering/Operations Research community already has tools to solve various bioinformatics problems • Funding is available to support your research • Then, what do we need to start? Biological problems to solve
Concluding Remarks!! • The main driving force of bioinformatics/computational biology is high-throughput data production • Industrial Engineering (I.E.) tools together with computing power can play an important role in this process • Funding opportunities in this area are plentiful
Thank you! Any Questions?
Central Dogma of Biology • DNA → (Transcription) → RNA → (Translation) → Protein • Example: TTG CTG CGG (DNA) → Transcription → UUG CUG CGG (RNA) → Translation → Leu Leu Arg (Protein)
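The slide's example can be reproduced with a very small sketch. The codon table below is deliberately limited to the three codons shown on the slide (an assumption made to keep the example short; the full genetic code has 64 codons), and the function names are hypothetical.

```python
# Only the three codons from the slide; any other codon raises KeyError.
CODON_TABLE = {"UUG": "Leu", "CUG": "Leu", "CGG": "Arg"}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(mrna) - len(mrna) % 3   # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]

mrna = transcribe("TTGCTGCGG")
protein = translate(mrna)
```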
Gene • A gene is a region of DNA that controls a hereditary characteristic, usually corresponding to a single mRNA carrying the information for constructing a protein. • The human genome contains about 30,000 genes. (February 2001)
Pair-wise Sequence Alignment
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG
Sequence Alignment • Purposes: • Learn about evolutionary relationships • Find genes, domains, signals … • Classify protein families (function, structure) • Identify common domains (function, structure)
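The dynamic programming idea behind sequence alignment can be sketched as a global alignment score in the style of Needleman-Wunsch. The scoring parameters (`match=1`, `mismatch=-1`, `gap=-1`) and the function name are assumptions for illustration, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b.

    Each cell holds the best score for aligning a prefix of `a` with a prefix
    of `b`; it is the maximum over aligning the last two characters, or
    placing a gap in either sequence. Only two rows are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

identical = global_alignment_score("ACGT", "ACGT")
one_gap = global_alignment_score("ACGT", "AGT")
```

Recovering the alignment itself, not just its score, would require keeping the full table and tracing back from the last cell.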
Scoring Systems for Alignment: Simple case (DNA)
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Scoring matrix (Match: 1, Mismatch: 0):
     A   G   C   T
A    1   0   0   0
G    0   1   0   0
C    0   0   1   0
T    0   0   0   1
Score = 5
Scoring Systems for Alignment: Complex case (Protein)
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix (excerpt):
     C   S   T   P   A   G   N   D  ...
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
...
Examples: T:G = -2, T:T = 5
Score = 48
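The simple scoring scheme (match 1, mismatch 0) amounts to counting identical positions in an alignment. The sketch below assumes the two sequences are already aligned to equal length without gaps; it does not try to reproduce the slide's Score = 5, since the alignment used there is not recoverable from the text. The function name and sample sequences are hypothetical.

```python
def identity_score(seq1, seq2, match=1, mismatch=0):
    """Score two aligned, equal-length DNA sequences position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(match if a == b else mismatch
               for a, b in zip(seq1.upper(), seq2.upper()))

# Hypothetical aligned pair: four of the five positions agree.
example_score = identity_score("ACGTA", "ACCTA")
```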
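For the protein case, the lookup works the same way but uses a substitution matrix. The sketch below hard-codes only the eight-residue excerpt visible on the slide, so residues outside C, S, T, P, A, G, N, D raise a `KeyError`, and it therefore cannot score the slide's two full sequences. The names `SCORES` and `matrix_score` are hypothetical.

```python
RESIDUES = "CSTPAGND"

# Lower triangle of the excerpt shown on the slide, one row per residue.
LOWER_TRIANGLE = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
    [-3, 1, 0, -2, -2, 0, 5],
    [-3, 0, -1, -1, -2, -1, 1, 6],
]

# Expand to a symmetric lookup so that (T, G) and (G, T) both work.
SCORES = {}
for i, row in enumerate(LOWER_TRIANGLE):
    for j, value in enumerate(row):
        SCORES[(RESIDUES[i], RESIDUES[j])] = value
        SCORES[(RESIDUES[j], RESIDUES[i])] = value

def matrix_score(seq1, seq2):
    """Sum substitution scores over two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(SCORES[(a, b)] for a, b in zip(seq1, seq2))

# Hypothetical aligned pair: T:G (-2) + S:S (4) + A:A (4).
example_score = matrix_score("TSA", "GSA")
```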
Protein Structure • Secondary: alpha-helices, beta-sheets • Tertiary • Quaternary • Dinitrogenase as an example
Public Databases • Big 3 Centers: National Center for Biotechnology Information (NCBI); European Bioinformatics Institute (EBI); DNA Data Bank of Japan (DDBJ)