
Opportunities of Systems Engineering/Operations Research in Bioinformatics

Opportunities of Systems Engineering/Operations Research in Bioinformatics. Hyoungtae Kim (Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU). Outline. Introduction on Bioinformatics Paradigm Shift in Biology Systems Engineering/Operations Research for Bioinformatics


Presentation Transcript


  1. Opportunities of Systems Engineering/Operations Research in Bioinformatics Hyoungtae Kim (Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)

  2. Outline • Introduction on Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research for Bioinformatics • About Funding Opportunities • Conclusions

  3. What is Bioinformatics/Computational Molecular Biology? • The application of mathematical, statistical, and computational tools to the analysis of very large biological data sets • In most cases, it involves analyzing information stored in large databases • Multi-disciplinary: Biology, Mathematics, Statistics, Physics, Chemistry, Computer Science, Engineering • It has not yet found its own natural home department

  4. Why Bioinformatics? • Current data analysis tools are far from efficient at analyzing vast amounts of biological data • The pace of biological understanding is much slower than the pace of the technological advances that have powered experimental discovery and data collection • Benefits: advances in the detection and treatment of disease and in the production of genetically engineered foods; profound impact on health and medicine

  5. Three Elements of Bioinformatics Research • Significant biological problems: gene, motif, and signal recognition; protein structure prediction; metabolic pathway deduction; etc. • Theory & methods: algorithms; statistical methods; ontologies; etc. • Data: microarrays; mass spectroscopy; etc.

  6. Prerequisites of Bioinformatics • Scientific Mind + • Basic knowledge in Molecular Biology • Prokaryotic and Eukaryotic cells • Genes, Codons, DNA, RNA, Central dogma of biology • Etc. • Computing Skills • Programming Languages: Python, Perl, Java, etc. • Knowledge of Relational Databases, etc. • Other Skills • General Statistical Knowledge • Optimization Tools: Math Programming, Network Optimization, etc.

  7. Various Problems in Bioinformatics • Standard Problems: DNA and protein sequence analysis; gene finding and prediction; etc. • Emerging Problems: microarray experiment and data analysis; protein structure prediction; deduction of metabolic pathways; and more…

  8. Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research for Bioinformatics • About Funding Opportunities • Concluding Remarks

  9. Paradigm Shift in Biology • The Human Genome Project (HGP) • Working draft of the human genome (2001) • Goal of the HGP = sequencing of the human genome • Shift from hypothesis-driven reductionism to a discovery-science approach • Drove the development of high-throughput technologies and computer applications to transmit, analyze, and model very large data sets

  10. Paradigm Shift in Biology • High-throughput Technologies • Microarrays – allow the expression of thousands of genes to be surveyed at one time • Protein Arrays – can examine all proteins in a cell and check if they are interacting under designed conditions • Mass Spectrometry – The basic modality is protein mass fingerprinting

  11. Paradigm Shift in Biology • Microarray Chip Technology • Allows data collection in a high-throughput manner • Can put all genes in a microbe on a chip • Interpretation of the data is very challenging • [Figure: “raw data” matrix of expression levels (genes × samples), converted into a gene × gene similarity or distance matrix]

  12. 253x15154 Microarray Gene Expression Data: 162 cancer vs 91 normal patients

  13. Paradigm Shift in Biology • [Diagram: a “black box” surrounded by data types: genes and proteins, protein-protein interaction data, gene activity data, protein structure data, proteomic data, regulatory elements, metabolite data] • A gigantic amount of biological information is hidden in these data and in the relationships among them!

  14. Paradigm Shift in Biology • Concept of Systems Biology • The reductionist paradigm has been phenomenally successful in biology since the 1950s • The genomics era has produced exhaustive lists of biological parts (i.e., genes and proteins) together with their functional characteristics • A system-level perspective is required to make sense of how all of these individual parts emerge and act collectively to perform a biological function

  15. Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research tools for Bioinformatics • About Funding Opportunities • Concluding Remarks

  16. Systems Engineering/Operations Research tools • Three Categories • Network & Optimization • Combinatorial • Integer Programming • Dynamic Programming • Network Optimization • Minimum Spanning Tree • Etc. • Statistics • MLE • Regression • Sampling • Linear Model • Cross Validation • Statistical Estimation and Test • Multivariate Analysis (or ANOVA) • Wavelet Transformation • Bayesian Networks, Etc. • Stochastics • Hidden Markov Models • MCMC • Simulation Models • Etc.

  17. Systems Engineering tools for Bioinformatics • Some Examples • Hidden Markov Model for Gene Finding • Dynamic Programming for Sequence Alignment • Integer Programming for Protein Folding • Minimum Spanning Tree approach to Clustering for Motif Identification (Xu et al., 2001) • And many more …

  18. A Significant Biological Problem • Identification of Transcription Factor Binding Sites (Motifs) • A gene’s transcriptional level is regulated by proteins (transcription factors), which bind to specific sites in the gene’s promoter region, called binding sites • The binding-site identification problem is to find short “conserved” fragments in a set of genomic sequences • Features of transcription factor binding sites • These short DNA fragments in the upstream regions of genes are generally very similar to each other • They occur at relatively high frequencies compared to other sequence fragments

  19. Data Collection • Data Set (D)= Set of All Short DNA fragments in the upstream regions of genes • Microarray gene expression technologies allow simultaneous view of the transcription levels of many thousands of genes under various cellular conditions Upstream regions of genes GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG CCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGT CCAGCTACTCGGGAGGCTGAGGCAGAAGAATCGCTTGAAT GAGATTGCACTGAGCTGAGATCACGCCACTGCGCTCCAGC GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAATGAT AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC ACAGCTACTGCAAACAACCATGATGATGATACATCTCTTG GGTTGTTTGAGACACATTCTATGCTCCTTGATTTGATTGG GGTTCCTTGGGGACTTGGAGGTGACGAAAGCCTCCCTGGG ACCTTCACTTCTCTAATATCAAGCTTCAGCAACCTGCTCC CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC CACCCCAGCCAGACTCTGACTTCACTCCCGGCAGGTTTCA CAGCAGTTGGAGCGAGCTGGCTTCTTGCGGTAGGCAGCCA GCTCCCAATAGTCCTCGTTTCCTGGTAATCTCATGCTTGG Experiment Find group of genes having correlated expression profiles

  20. Some test data sets are available on the internet or in the literature • For example • CRP binding sites: 18 sequences with 105 BPs • Yeast binding sites: 8 sequences with 1000 BPs • Human binding sites: 113 sequences with 30 BPs

  21. A C T G CRP binding sites: 18 sequences with 105 BPs

  22. Theory & Methods • Traditional approaches • Various sampling techniques, including Gibbs sampling • EM Algorithm • Greedy Algorithm • Multi-Order Markov Chain Algorithm • All of these are heuristic algorithms, so the problem remains challenging and unsolved

  23. Brief Review: Minimum Spanning Tree • Input = A graph, G = (V,E), with weighted edges • Output = the cheapest subset of edges that keeps the graph in one connected component • Two Popular Algorithms • Prim’s Algorithm • Kruskal’s Algorithm
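As a rough illustration of the second algorithm named on this slide, here is a minimal sketch of Kruskal's algorithm with a simple union-find structure. The function name `kruskal_mst`, the edge format `(weight, u, v)`, and the four-vertex example graph are assumptions made for illustration; none of them come from the presentation.

```python
def kruskal_mst(num_vertices, edges):
    """Return (total_weight, mst_edges) for a connected, undirected graph.

    edges is a list of (weight, u, v) tuples; vertices are numbered
    0 .. num_vertices - 1. (Hypothetical input format.)
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst_edges = []
    total_weight = 0
    # Consider edges from cheapest to most expensive.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            mst_edges.append((weight, u, v))
            total_weight += weight
    return total_weight, mst_edges


# Hypothetical example graph with 4 vertices and 5 weighted edges.
example_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
total_weight, tree = kruskal_mst(4, example_edges)
```

The edges are scanned in order of increasing weight, and an edge is kept only if it connects two components that are still separate, so the result never contains a cycle.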

  24. Theory & Methods • Minimum Spanning Tree approach • Step 1: Define a distance measure on the data set (D), and compute the distance between each pair of data points (i.e., for all A, B in D) • The higher the sequence similarity between two fragments, the smaller the distance between their mapped positions
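Step 1 can be sketched as follows. The slide does not say which distance measure is used, so this sketch assumes the Hamming distance (number of mismatched positions) between equal-length fragments; the function names and the three example fragments are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length fragments differ.

    Assumption: Hamming distance stands in for the slide's unspecified
    distance measure.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(fragments):
    """Map each index pair (i, j), i < j, to the distance between fragments."""
    return {
        (i, j): hamming_distance(fragments[i], fragments[j])
        for i, j in combinations(range(len(fragments)), 2)
    }


# Hypothetical fragments, not taken from the presentation's data set.
fragments = ["GATCAC", "GATCAG", "TTTAAA"]
distances = pairwise_distances(fragments)
```

With this choice, similar fragments get a small distance, which matches the requirement that higher sequence similarity means smaller distance.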

  25. Theory & Methods • Minimum Spanning Tree approach • Step 2: Find the MST, T, representing D, with edge weights defined by the distance measure, and treat motif identification as a data clustering problem • [Figure: tree T; removing three edges e1, e2, e3 identifies four clusters, c1 to c4]
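The idea of cutting tree edges to obtain clusters can be sketched as below. The slide does not say how the edges to remove are chosen, so this sketch assumes one common rule: drop the (k - 1) heaviest MST edges to obtain k clusters. The function `mst_clusters`, the Hamming distance, and the five example fragments are all assumptions for illustration and are not the published method of Xu et al.

```python
def mst_clusters(points, distance, num_clusters):
    """Cluster points by building an MST and dropping its heaviest edges."""
    n = len(points)
    edges = sorted(
        (distance(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal's algorithm: collect the n - 1 edges of the MST,
    # in order of increasing weight.
    mst = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            mst.append((weight, i, j))

    # Keep all but the (num_clusters - 1) heaviest MST edges,
    # then regroup the points into connected components.
    kept = mst[: n - num_clusters]
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())


def hamming(a, b):
    # Assumed distance measure: number of mismatched positions.
    return sum(x != y for x, y in zip(a, b))


# Hypothetical fragments; two tight groups plus one outlier.
clusters = mst_clusters(["AAAA", "AAAT", "CCCC", "CCCG", "GGTT"], hamming, 3)
```

When several edges share the same weight, the choice of which edge to cut depends on sort order, so results on real data may need a more careful tie-breaking rule.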

  26. Evaluation of the MST Method • Comparison with Other Methods • MST is based on a combinatorial approach, so it can identify all clusters of possible binding sites • Existing heuristic methods, by contrast, are likely to miss some clusters • Implemented results are at least as good as those of other methods • The simple structure of a tree facilitates efficient implementation of rigorous algorithms

  27. Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research tools for Bioinformatics • About Funding Opportunities • Concluding Remarks

  28. Funding Overviews by Funding Institution (Top) / Field of Research (Bottom) • Total of $54.1 billion in FY2004 • [Chart: percentage of total federal funding by field (environmental science, physical science, life science, engineering); labeled values: $9.1 billion and $29.3 billion] • Preliminary 2004 statistics • Source: National Science Foundation/Division of Science Resources Statistics, Survey of Federal Funds for Research

  29. How to Search for Funding Opportunities? • NIH Computer Retrieval of Information on Scientific Projects (CRISP) • http://crisp.cit.nih.gov • NIH Office of Extramural Research (OER) • http://grants1.nih.gov • Other Websites • http://www.grants.gov • http://fedgrants.gov • http://www.nsf.gov/pubsys/ods/index.html

  30. Growing Opportunities in Bioinformatics From CRISP Search Data

  31. NIH Funded Projects in 2004 (From CRISP Search Data) • Searched all related Institutes, Centers, and States for the 2004 fiscal year • Number of NIH grants in Bioinformatics: 826 • Systems Biology: 80 grants • Microarray: 214 grants • Cancer: 63 grants

  32. NIH Funding Opportunities for 2004 (From http://grants1.nih.gov) • 2004 Program Announcements (PAs) • 171 PAs in total • Larger variety of topics • Cancer is the most prevalent topic • Many call for a “multidisciplinary” outlook on topics • 2005 Requests for Applications (RFAs) • 68 RFAs in total • Although listed for 2005, some application deadlines have passed • 2 directly related to bioinformatics • Cancer is still the most prevalent topic

  33. Outline • Introduction to Bioinformatics • Paradigm Shift in Biology • Systems Engineering/Operations Research for Bioinformatics • About Funding Opportunities • Conclusions

  34. Developing Potential Research Plans • Two Takeaways • The systems engineering/operations research community already has tools to solve various bioinformatics problems • Funding is available to support your research • Then, what do we need to get started? Biological problems to solve

  35. Concluding Remarks!! • The main driving force of bioinformatics/computational biology is high-throughput data production • Industrial engineering (IE) tools, together with computing power, can play an important role in this process • Funding opportunities in this area are very rich

  36. Thank you! Any Questions?

  37. Level of Organization and Related Field of Study

  38. Central Dogma of Biology • DNA → (transcription) → RNA → (translation) → Protein • Example: TTG CTG CGG → (transcription) → UUG CUG CGG → (translation) → Leu Leu Arg
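The slide's example can be reproduced with a few lines of Python. This is a minimal sketch: following the slide, the DNA is treated as the coding strand, so transcription is just a T to U substitution, and the codon table contains only the three codons that appear in the example rather than the full genetic code. The names `CODON_TABLE`, `transcribe`, and `translate` are hypothetical.

```python
# Only the three codons from the slide's example; a real table has 64 entries.
CODON_TABLE = {"UUG": "Leu", "CUG": "Leu", "CGG": "Arg"}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


rna = transcribe("TTGCTGCGG")
protein = translate(rna)
```

A codon outside the small table raises a `KeyError`, which makes the limitation of the sketch explicit.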

  39. Transcription and Translation

  40. Gene • A gene is a region of DNA that controls a hereditary characteristic, usually corresponding to a single mRNA carrying the information for constructing a protein. • The human genome contains about 30,000 genes. (February 2001)

  41. Introns and Exons

  42. Pair-wise Sequence Alignment VLSPADKTNVKAAWAKVGAHAAGHG ||| | | |||| | |||| VLSEAEWQLVLHVWAKVEADVAGHG

  43. Sequence Alignment • Purposes: • Learn about evolutionary relationships • Finding genes, domains, signals … • Classify protein families (function, structure). • Identify common domains (function, structure).

  44. Multiple Sequence Alignment

  45. Scoring Systems for Alignment • Simple case (DNA) • Sequence 1: actaccagttcatttgatacttctcaaa • Sequence 2: taccattaccgtgttaactgaaaggacttaaagact • Scoring matrix (rows and columns in the order A, G, C, T): A = 1 0 0 0; G = 0 1 0 0; C = 0 0 1 0; T = 0 0 0 1 • Match: 1 • Mismatch: 0 • Score = 5
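The simple scoring scheme (match = 1, mismatch = 0) amounts to counting identical positions in an alignment. Below is a minimal sketch for an ungapped alignment of two equal-length sequences. The aligned positions behind the slide's score of 5 are not recoverable from the transcript, so the example uses a short hypothetical pair; the function name `identity_score` is also an assumption.

```python
def identity_score(seq1, seq2, match=1, mismatch=0):
    """Score an ungapped alignment position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


# Hypothetical pair: positions 1, 2, 4, and 5 match.
score = identity_score("ACGTAC", "ACCTAG")
```

Changing `mismatch` to a negative value turns this into a simple penalty scheme without altering the rest of the code.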

  46. Scoring Systems for Alignment • Complex case (Protein) • Sequence 1: PTHPLASKTQILPEDLASEDLTI • Sequence 2: PTHPLAGERAIGLARLAEEDFGM • Scoring matrix (lower triangle; rows and columns in the order C, S, T, P, A, G, N, D, …): C = 9; S = -1 4; T = -1 1 5; P = -3 -1 -1 7; A = 0 1 0 -1 4; G = -3 0 -2 -2 0 6; N = -3 1 0 -2 -2 0 5; D = -3 0 -1 -1 -2 -1 1 6; … • T:G = -2 • T:T = 5 • Score = 48
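A substitution-matrix lookup like the one on this slide can be sketched as follows. The matrix holds only the eight residues shown on the slide (C, S, T, P, A, G, N, D), so it cannot reproduce the slide's total of 48, which needs the full matrix. The names `ORDER`, `LOWER_TRIANGLE`, `pair_score`, and `alignment_score`, as well as the three-residue example, are hypothetical.

```python
ORDER = "CSTPAGND"  # residue order used by the slide's partial matrix

# Lower triangle exactly as listed on the slide, one row per residue.
LOWER_TRIANGLE = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
    [-3, 1, 0, -2, -2, 0, 5],
    [-3, 0, -1, -1, -2, -1, 1, 6],
]


def pair_score(a, b):
    """Look up the score for residues a and b; the matrix is symmetric."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        i, j = j, i  # always read from the stored lower triangle
    return LOWER_TRIANGLE[i][j]


def alignment_score(seq1, seq2):
    """Sum the pair scores of an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# Hypothetical three-residue alignment: P:P + T:T + G:T = 7 + 5 - 2.
example_score = alignment_score("PTG", "PTT")
```

Storing only the lower triangle mirrors the slide's layout and relies on the score for a pair being the same in either order.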

  47. Protein Structure • Alpha-helices • Beta-sheets • Tertiary • Quaternary • Dinitrogenase as an example

  48. Public Databases • Big 3 Centers • National Center for Biotechnology Information (NCBI) • European Bioinformatics Institute (EBI) • DNA Data Bank of Japan (DDBJ)
