
Computer Science and Bioinformatics



Presentation Transcript


  1. James Edwards and Rajinder Singh Bhatti. Computer Science and Bioinformatics http://www.csee.umbc.edu/~smer1/bioinformatics.gif

  2. Biology and Computer Science? • Initially Biology depended on Chemistry to make major strides • Biochemistry • Biology then needed to work at the atomic level to explain phenomena • Biophysics • The modern era of Biology needs to interpret a wealth of data, which requires tools that only Computer Science is able to provide • Hence Bioinformatics

  3. What is Bioinformatics? • The study of computational methods to expand the use of biological data (Data Orientated). • Often (incorrectly) used instead of the term ‘Computational Biology’. However, this is a slightly different discipline. • Computational Biology is the use of computational and mathematical methods to study or simulate biological systems (Hypothesis Orientated). [source: National Institutes of Health]

  4. Overlaps Between the two Disciplines • 1 – Bioinformatics problems • 2 – Computational Biology problems • 3 – Problems in both categories • 4 – Problems in neither category [Figure: diagram with regions labelled 1 to 4]

  5. Motivation for Bioinformatics • Quote from Donald Knuth, 1974 Turing Award winner: • “…I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on. It’s at that level.” [source – Wikiquote] • Can Biological life be equated with Computing? • Results so far would suggest the answer is yes!

  6. Common Bioinformatics Problems • Finding and assessing Similarities between Strings (next slides). • Detecting patterns in strings. • Constructing trees of the evolution of organisms. • Classifying new data by clustering existing data. • Also applications of Machine Vision to detect interactions between proteins

  7. Data Structures Used in Biology • Strings for representing sequences (e.g. DNA, RNA, Amino Acid Sequences): “ATACGGCGCGCAAGGCT” “TATGCCGCGCGTTCCGA” • Trees for representing the evolution of organisms and other purposes. [Figure: tree with branches labelled Prokaryotes, Eukaryotes, Reptiles, Birds]
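The two example strings on this slide pair up base by base (A with T, C with G), so they look like complementary DNA strands. A minimal Python sketch of sequences as plain strings, assuming that reading; the function names are illustrative and not part of the presentation:

```python
# Map each DNA base to its pairing partner (A-T, C-G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


# First slide string in, second slide string out.
print(complement("ATACGGCGCGCAAGGCT"))  # TATGCCGCGCGTTCCGA
```

Plain strings are enough here because every operation is a per-character lookup or a slice.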

  8. Data Structures (Cont.) • Graphs can represent signalling pathways (often found in Neural networks). • 3D points and their linkages can represent protein structures. [Figure: example graph with numeric labels]
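Both structures on this slide map onto ordinary containers. The sketch below is only an illustration: the pathway node names, the atom coordinates and the linkages are hypothetical values made up for the example, not data from the presentation.

```python
import math
from collections import deque

# Hypothetical signalling pathway stored as an adjacency list (directed graph).
pathway = {
    "receptor": ["kinase_a"],
    "kinase_a": ["kinase_b", "inhibitor"],
    "kinase_b": ["transcription_factor"],
    "inhibitor": [],
    "transcription_factor": [],
}


def reachable(graph, start):
    """Breadth-first search: every node a signal starting at `start` can reach."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical 3D points plus linkages given as pairs of point indices.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0)]
links = [(0, 1), (1, 2)]


def link_lengths(points, bonds):
    """Euclidean length of every linkage."""
    return [math.dist(points[i], points[j]) for i, j in bonds]
```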

  9. First Instance of a Problem – DNA Shotgun Sequencing • In order to derive a DNA sequence, the DNA must first be duplicated many times. • The copies are then broken at random into smaller pieces named ‘fragments’, which Gel Electrophoresis separates by size. • Example: copies of ATCACCGTAAGAGGA are broken into fragments such as ATC, AAGA, CCGT, TCA, AGGA, TAA, AAG, GTA, CCGT. • This is a very simplified instance of the problem; typically each fragment is between 250 and 1000 bases long.
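The fragmentation step can be simulated to produce test input for an assembler. This is a toy sketch under stated assumptions: cut points are drawn uniformly at random, every copy is cut independently, and the function name and parameters are hypothetical.

```python
import random


def shotgun_fragments(sequence, copies, min_len, max_len, seed=0):
    """Cut several copies of `sequence` into random-length fragments.

    The last piece of each copy may be shorter than `min_len`.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    fragments = []
    for _ in range(copies):
        pos = 0
        while pos < len(sequence):
            length = rng.randint(min_len, max_len)
            fragments.append(sequence[pos:pos + length])
            pos += length
    rng.shuffle(fragments)  # the original order of the pieces is not known
    return fragments


reads = shotgun_fragments("ATCACCGTAAGAGGA", copies=4, min_len=3, max_len=5)
```

Because each copy is cut at different points, fragments from different copies overlap, and those overlaps are what the alignment step later exploits.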

  10. Alignments – the Smith Waterman Method • How do we identify fragments which link together? • Can use dynamic programming to compute optimal alignment scores between fragments. • Score each aligned position as a match (+1), a mismatch (-1) or a gap (-1/3 × length of gap). • The score in each cell is the best total score from an already computed neighbouring cell plus the cost of the alignment step. If a score is < 0 it is set to 0. • The first row is always filled with 0’s. • Example matrix:
        A T C
      A 0 0 0
      T 0 1 0
      A 0 0 0

  11. The Smith Waterman Algorithm (Continued) • After filling the matrix, trace back a path through the optimum alignment, starting at the highest number in the matrix and stopping at the first 0. • In this case it is: ‘AT’ • The algorithm is expensive: O(NM) run time and O(NM) storage complexity. • Always finds the optimum solution. • Example matrix:
        A T C
      A 0 0 0
      T 0 1 0
      A 0 0 0
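The fill and trace-back steps of slides 10 and 11 can be sketched in Python as follows. The scoring values come from the slides (match +1, mismatch -1, a gap costing 1/3 per gapped position); the extra leading row and column of zeros is the usual textbook layout and is an assumption here, so the intermediate numbers differ from the small matrix on the slide.

```python
def smith_waterman(a, b, match=1.0, mismatch=-1.0, gap=1.0 / 3.0):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j]: best score of a local alignment ending at a[i-1] and b[j-1].
    score = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0.0,                      # negative scores are reset to 0
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] - gap,      # gap in b
                score[i][j - 1] - gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest cell until the first 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("ATC", "ATA"))  # (2.0, 'AT', 'AT')
```

The nested loops over both sequences and the full `score` table are what give the O(NM) time and storage mentioned on the slide.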

  12. Alternative to SW Algorithm • Sequences are usually at the very least tens of thousands of characters long. • This makes the O(NM) runtime (and storage complexity) unacceptable. • Alternative – use the BLAST (Basic Local Alignment Search Tool) algorithm. • Gives a much more reasonable run time of O(N+M). • However, it does not always compute the best solution.

  13. BLAST Algorithm • Computing an entire matrix of values will always require N x M space. • Iterating over all of those values will always require N x M time. • Solution: ignore parts of the alignment which are unlikely to improve the score. • This improves the storage complexity, as only a single alignment must be stored. • It also improves the runtime complexity, as at each stage of the algorithm only the optimum so far is processed.

  14. BLAST Illustrated • Consider forming an alignment between two sequences:
      CTCTCTCTCATTGATTGCGGGGGG
      GGGGGGGGGATTGATTGCCCCCCC
    • The strings at the beginning and end are very unlikely to improve the score of the alignment. • Therefore no gaps or mismatches are computed for them in the matrix, leaving only the shared region:
      ---------ATTGATTGC------
      ---------ATTGATTGC------
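The idea of skipping regions that are unlikely to improve the score can be illustrated with a simplified seed-and-extend heuristic. This is a sketch in the spirit of the slides, not the actual BLAST implementation: it uses exact k-letter seeds, allows no gaps, and stops extending once the running score falls `drop` below the best seen. The names `extend`, `seed_and_extend`, `k` and `drop` are assumptions made for the example.

```python
def extend(a, b, i, j, step, match, mismatch, drop):
    """Walk from (i, j) in direction `step`; return (best gain, letters kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break  # this direction is unlikely to improve the score
        i += step
        j += step
    return best, best_len


def seed_and_extend(a, b, k=4, match=1, mismatch=-1, drop=2):
    """Return (score, piece_of_a, piece_of_b) for the best ungapped hit."""
    # Index every k-letter word of `a` by its start positions.
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)

    best_score, best_a, best_b = 0, "", ""
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):  # only shared words are examined
            right, r_len = extend(a, b, i + k, j + k, 1, match, mismatch, drop)
            left, l_len = extend(a, b, i - 1, j - 1, -1, match, mismatch, drop)
            score = k * match + left + right
            if score > best_score:
                best_score = score
                best_a = a[i - l_len:i + k + r_len]
                best_b = b[j - l_len:j + k + r_len]
    return best_score, best_a, best_b


print(seed_and_extend("CTCTCTCTCATTGATTGCGGGGGG", "GGGGGGGGGATTGATTGCCCCCCC"))
# (9, 'ATTGATTGC', 'ATTGATTGC')
```

No N x M matrix is ever built: only the word index and the best hit so far are stored, which is the trade-off the slides describe, at the cost of possibly missing the optimum.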

  15. Alignments’ Relation to Shotgun Sequencing • Now that there is a way to measure which fragments are likely to align, we still need a way to find the correct order efficiently. • An in-depth algorithm is beyond the scope of this presentation. • However, the best current techniques are: • Greedy Methods (align every pair of fragments, then use only the best solutions). • Evolutionary Algorithms (start with an initial set of solutions, compute the sum of alignment scores, then ‘evolve’ the set of solutions in each iteration). • The problem is NP-Hard, so these techniques give approximations.
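A greedy method of the kind listed above can be sketched as follows. To keep the example short it scores a pair of fragments by their exact suffix/prefix overlap instead of an alignment score, and it assumes error-free fragments from a single strand; both are simplifications, and the function names are hypothetical.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the two fragments with the largest overlap."""
    unique = set(fragments)
    # Drop fragments that are fully contained in another fragment.
    frags = sorted(f for f in unique
                   if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap size, index of left, index of right)
        for i, left in enumerate(frags):
            for j, right in enumerate(frags):
                if i != j:
                    size = overlap(left, right)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        merged = frags[i] + frags[j][size:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(greedy_assemble(["CCGTAAG", "ATCACCG", "TAAGAGGA"]))  # ATCACCGTAAGAGGA
```

Always taking the locally best merge is fast but can commit to a wrong order when fragments contain repeats, which is why the result is only an approximation.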

  16. Relating Computer Science to Biology • What have we Computer Science students studied so far in this MSc course that can be of use to Bioinformatics? • Data Mining • Artificial Intelligence • Heuristic approaches (e.g. Knowledge Representation – Logics) • Algorithm Techniques

  17. Data Mining and Bioinformatics: How and why? • Some of you do COMP 527 Data Mining with Rob • Why Data Mining is essential in Bioinformatics. • KDD (Knowledge Discovery in Databases) is the process of finding useful information and patterns in data. • Data Mining is the use of algorithms to extract information and patterns derived by the KDD process. • Graphical techniques such as brushing, data smoothing etc.

  18. Data Mining and Bioinformatics: Algorithm implementation examples • Data Mining algorithms are used for tackling problems in Bioinformatics • In conjunction with microarray technology • Predict a patient’s outcome, such as • survival time • disease recurrence • health risk assessments etc. • How does Data Mining help? • Accurate predictions could help provide better treatment!

  19. AI and Bioinformatics: Artificial Intelligence? • Research in genetics, molecular biology etc. generates enormous amounts of data • Use AI to extract useful information from the wealth of available data • Build good probabilistic models (gene models) • AI provides several powerful algorithms and techniques for solving these problems using the stored data

  20. AI and Bioinformatics: AI techniques used • Neural networks (Biological and Artificial) • Hidden Markov models (Probabilistic Statistical models) • Bayesian networks (Models logic) • and many others...

  21. Logic and Bioinformatics • Biology works by applying prior knowledge (“what is known”) to unknown entities. • Therefore Biology is said to be knowledge-based (rather than axiom-based) • Use pre-existing knowledge to make inferences about the item under investigation. • Description Logic?

  22. Description Logic and Bioinformatics • Why Description Logic? • decidable logic with good systems • impossible for a single biologist to deal with all of a domain’s knowledge! • similar to programmers writing extremely complex programs without an IDE to help with libraries • medical diagnosis systems make good use of ABOX and TBOX assertions • for example, determine if a patient’s problem is an element of a particular known disease

  23. Description Logic Example
      TBOX
        sick ≡ person ⊓ ∃isInfected.Cancer
        non_sick ≡ person ⊓ ∃isInfected.Cold
      ABOX
        Tim : person
        Steven : person
        Cancer : Problem
        Cold : Problem
        (Tim, Cancer) : isInfected
        (Steven, Cold) : isInfected
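The slide's example can be mimicked with plain Python dictionaries. This is only a closed-world lookup, not a Description Logic reasoner, and it assumes the TBOX lines read as "a sick person is a person who isInfected by Cancer" and "a non_sick person is a person who isInfected by Cold", with Cancer and Cold treated as the named individuals from the ABOX. All variable and function names are hypothetical.

```python
# ABOX: assertions about individuals, copied from the slide.
concept_assertions = {
    "Tim": {"person"},
    "Steven": {"person"},
    "Cancer": {"Problem"},
    "Cold": {"Problem"},
}
role_assertions = {
    ("Tim", "Cancer"): "isInfected",
    ("Steven", "Cold"): "isInfected",
}

# TBOX (assumed reading): concept name -> (base concept, role, role filler).
tbox = {
    "sick": ("person", "isInfected", "Cancer"),
    "non_sick": ("person", "isInfected", "Cold"),
}


def instances(concept):
    """Individuals that satisfy the definition of `concept` in the TBOX."""
    base, role, filler = tbox[concept]
    return {
        individual
        for individual, concepts in concept_assertions.items()
        if base in concepts
        and role_assertions.get((individual, filler)) == role
    }


print(instances("sick"))      # {'Tim'}
print(instances("non_sick"))  # {'Steven'}
```

A real reasoner would also handle open-world semantics and subsumption between concepts, which this lookup deliberately leaves out.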

  24. Improvements: How far has Bioinformatics come? • “One is struck both by how far the field has come in a relatively short period of time, and also by how far it has yet to go.” - Jessica D. Tenenbaum • The discipline of Bioinformatics has vastly improved over recent years due to • Fast technological development of the computer industry • Demand for Computer Scientists - more computer scientists than ever before! • Biological “unknown” discoveries – things that are discovered with no previous knowledge base • Growth of sub-disciplines of Biology, such as molecular Biology

  25. Improvements: How far will Bioinformatics go? • Depends entirely on whether the gap between Biology and Computer Science increases or decreases • The gap increases if educational institutions decide to ignore Bioinformatics • Put emphasis on prospective students • Computer Scientists choose to ignore Biology • Biologists choose to ignore Computer Science

  26. Closing the gap I • Biologists cannot build their own analytical tools • Computer Scientists don't know what to build!

  27. Closing the gap II • Putting a Computer Scientist (Data Mining expert) into a room with a Biologist investigator won’t solve the problem • Boundaries such as methodologies and discipline language are a problem.

  28. Closing the gap III • Computer Science is the “science of the artificial” • Biology is the “science of discovery” • The only way to bridge the gap is for both parties to learn the basic fundamentals of each science

  29. Breakthroughs of Bioinformatics • Spatial patterns of structures for understanding protein folding, evolution, and biological functions • To predict protein functions, we develop a method by rapidly matching local surfaces and by incorporating evolutionary information specific to individual binding region via a Bayesian Monte Carlo approach. • These kinds of breakthroughs encourage the computer industry to get involved and work with Biology.

  30. Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology? • Cheminformatics/Chemoinformatics • the application of informatics tools to solve discovery chemistry problems • an integral component of hit and lead generation • development of new computational methods or efficient algorithms for chemical software, and pharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery

  31. Related Problems? Are there any other disciplines which involve the similar integration of Computer Science with Biology? • Other similar interests • Ecoinformatics • Geoinformatics • Quantum informatics • Astroinformatics • Business informatics • And many others...

  32. Follow-ups of Jacques Cohen • Bioinformatics: An Introduction for Computer Scientists is a previous publication from Jacques Cohen • It aims to encourage Computer Scientists to get involved with Biology • Updating Computer Science Education was released after Bioinformatics and Computer Science • It talks about showing the next generation of Computer Scientists that Computer Science is more than just programming.

  33. Who is Jacques Cohen? • Has been at Brandeis University since 1968. • Doctor in the fields of analysis of algorithms, parsing and compiling, memory management, logic and constraint logic programming, and parallelism • Recently started researching his interest in Bioinformatics • His most recent publication is about methods used in microarray Data Interpretation • See http://www.cs.brandeis.edu/~jc/publications.html

  34. References and related material (All web links last accessed 4th February 2008) • Shotgun sequencing: G. Luque, E. Alba Torres and S. Khuri, Assembling DNA Fragments with a Distributed Genetic Algorithm, Parallel Computing for Bioinformatics and Computational Biology, Wiley-Interscience, New Jersey, 2006, Chapter 12, pp. 285-302. • L.D. Paulson, Bioinformatics Experiences Important Breakthroughs, 2005, pp. 26-27. • J. Cohen, Bioinformatics: An Introduction for Computer Scientists, ACM Computing Surveys, 36(2), 122-158, 2004. • B. Tjaden, J. Cohen, A Survey of Computational Methods used in Microarray Data Interpretation, Applied Mycology and Biotechnology, Bioinformatics 6, 2006. • J. Cohen, Updating Computer Science Education, Communications of the ACM, 48(6), 29-31, 2005. • J. Cohen, Computational Molecular Biology: A Promising Application Using Logic Programming and Constraint Logic Programming, Lecture Notes in Artificial Intelligence, 1999. • R. Stevens, C.A. Goble and S. Bechhofer, Ontology-based Knowledge Representation for Bioinformatics, 2000. • Jinyan Li, Limsoon Wong and Qiang Yang, Data Mining in Bioinformatics, 2005. • Various material about Bioinformatics, http://www.aaai.org/AITopics/html/bioinf.html • Data Mining in Bioinformatics, http://www.dbs.informatik.uni-muenchen.de/Forschung/Bioinformatics/
