Databases and Data Mining

Lecture 1: Introduction to Data Mining for Bioinformatics. Fall 2005. Peter van der Putten (putten_at_liacs.nl).

Presentation Transcript


  1. Databases and Data Mining Lecture 1: Introduction to Data Mining for Bioinformatics Fall 2005 Peter van der Putten (putten_at_liacs.nl)

  2. Course Outline • Objective • Understand the basics of data mining • Gain understanding of the potential for applying it in the bioinformatics domain • Limited hands on experience • Schedule • Evaluation • Practical assignment (2nd) plus take home exercise

  3. Agenda Today • What is data mining? • A short summary of life • Data mining revisited

  4. What is data mining?

  5. Genomic Microarrays – Case Study • Problem: • Leukemia (different types of Leukemia cells look very similar) • Given data for a number of samples (patients), can we • Accurately diagnose the disease? • Predict outcome for given treatment? • Recommend best treatment? • Solution • Data mining on micro-array data

  6. Example: ALL/AML data • 38 training patients, 34 test patients, ~ 7,000 patient attributes (micro array gene data) • 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) • Use training data to build a diagnostic model that separates ALL from AML • Results on test data: • 33/34 correct; the 1 error may be a mislabeled sample

  7. Sources of (artificial) intelligence • Reasoning versus learning • Learning from data • Patient data • Customer records • Stock prices • Piano music • Criminal mug shots • Websites • Robot perceptions • Etc.

  8. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  9. A short summary of life Bio Building Blocks Biotech Data Mining Applications

  10. The Promise…

  11. The Promise…

  12. The Promise…

  13. DNA, Proteins, Cells

  14. DNA, Proteins, Cells

  15. From DNA to Proteins

  16. Discovering the structure of DNA: James Watson & Francis Crick, Rosalind Franklin

  17. The structure of DNA

  18. DNA Trivia • DNA stores instructions for the cell to perform its functions • Double helix, two interwoven strands • Each strand is a sequence of so-called nucleotides • Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides (bases): adenine (A), thymine (T), cytosine (C) and guanine (G) • Nucleotide uracil (U) doesn’t occur in DNA • Each strand is the reverse complement of the other • Complementary bases • A with T • C with G
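
The reverse-complement relation between the two strands can be illustrated with a minimal Python sketch. The function name and the example sequence are made up for illustration; only the pairing rules (A with T, C with G) come from the slide.

```python
# Complementary base pairs as listed on the slide: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand.

    The strand is read backwards and every base is swapped for its partner.
    A character outside A, C, G, T raises a KeyError.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


# Hypothetical example sequence.
print(reverse_complement("ATGC"))  # prints GCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.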

  19. DNA Trivia • Each nucleus contains 3 x 10^9 nucleotides • Human body contains 3 x 10^12 cells • Human DNA contains 26k expressed genes; in principle each gene codes for a protein • DNA of different persons varies 0.2% or less • Human DNA contains 3.2 x 10^9 base pairs • φX174 virus: 5,386 • Salamander: 100 x 10^9 • Amoeba dubia: 670 x 10^9

  20. Primary Protein Structure • Proteins are built out of peptides, which are polymer chains of amino acids • Twenty amino acids are encoded by the standard genetic code shared by nearly all organisms and are called standard amino acids (100 amino acids exist in nature)

  21. Protein Structure: from Primary to Quaternary

  22. Proteins: 3D Structure A representation of the 3D structure of myoglobin, showing coloured alpha helices. This protein was the first to have its structure solved by X-ray crystallography by Max Perutz and Sir John Cowdery Kendrew in 1958, which led to them receiving a Nobel Prize in Chemistry. http://en.wikipedia.org/wiki/Protein

  23. Proteins: 3D Structure Molecular surface of several proteins showing their comparative sizes. From left to right are: Antibody (IgG), Hemoglobin, Insulin (a hormone), Adenylate Kinase (an enzyme), and Glutamine Synthetase (an enzyme).

  24. Proteins: 3D Structure G Protein-Coupled Receptors (GPCR) represent more than half the current drug targets

  25. DNA Codes for Proteins but Proteins also Control Gene Expression • Protein regulation occurs at each step of synthesis

  26. Repressor Protein Switching Genes On and Off

  27. Regulatory Protein Coordinating Gene Expression

  28. Importance of Combinatorial Gene Control • combinations of a few gene regulatory proteins can generate many different cell types during development

  29. Some working definitions…. • Bioinformatics = • Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data [http://www.bisti.nih.gov/]. • Or more pragmatic: Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems [Wikipedia Nov 2005]

  30. NCBI Tools for data mining: • Nucleotide sequence analysis • Protein sequence analysis • Structures • Genome analysis • Gene expression • Data mining or not?

  31. Bioinformatics and data mining • From sequence to structure to function • Genomics (DNA), Transcriptomics (RNA), Proteomics (proteins), Metabolomics (metabolites) • Pattern matching and search • Sequence matching and alignment • Structure prediction • Predicting structure from sequence • Protein secondary structure prediction • Function prediction • Predicting function from structure • Protein localization • Expression analysis • Genes: micro array data analysis etc. • Proteins • Regulation analysis

  32. Bioinformatics and data mining • Classical medical and clinical studies • Medical decision support tools • Text mining on medical research literature (MEDLINE) • Spectrometry, Imaging • Systems biology and modeling biological systems • Population biology & simulation • Spin Off: Biologically inspired computational learning • Evolutionary algorithms, neural networks, artificial immune systems

  33. Examples of my related research • Topology preserving property of self-organizing maps • Neural network for clustering & classification inspired by cortical maps • Benchmarking Artificial Immune Systems • Predicting throat cancer survival rate • Value of fusing data from various sources for this purpose • Automated recognition of sick yeast cells in images (with prof. Verbeek) • Recommender systems in bioinformatics • Amazon.com style recommendations

  34. Data mining revisited

  35. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  36. Some working definitions…. • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Example: the relation between patient characteristics and the probability of being diabetic • Instances: the individual, independent examples of a concept • Example: a patient, candidate drug etc. • Attributes: measuring aspects of an instance • Example: age, weight, lab tests, microarray data etc. • Pattern or attribute space

  37. Data mining tasks • Predictive data mining • Classification: classify an instance into a category • Regression: estimate some continuous value • Descriptive data mining • Matching & search: finding instances similar to x • Clustering: discovering groups of similar instances • Association rule extraction: if a & b then c • Summarization: summarizing group descriptions • Link detection: finding relationships • …

  38. Data Mining Tasks: Search • Finding best matching instances • Every instance is a point in pattern space • Attributes are the dimensions of an instance, e.g. age, weight, gender etc. • Pattern spaces may be high dimensional (10 to thousands of dimensions) • [Figure: pattern space with axes e.g. age and e.g. weight]
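
Finding the best matching instances can be sketched as a nearest-neighbour search in a two-dimensional pattern space. This is a minimal sketch, not the lecture's implementation: the instances, the query and the choice of Euclidean distance are assumptions made for illustration.

```python
import math

# Hypothetical instances, each a point (age in years, weight in kg).
instances = [(25, 60), (40, 82), (67, 90), (70, 65)]


def best_matches(query, data, k=2):
    """Return the k instances closest to the query (Euclidean distance)."""
    return sorted(data, key=lambda inst: math.dist(inst, query))[:k]


# The two stored instances most similar to a 66-year-old weighing 88 kg.
print(best_matches((66, 88), instances))  # prints [(67, 90), (70, 65)]
```

In practice attributes measured on different scales are usually rescaled first, otherwise the attribute with the largest numeric range dominates the distance.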

  39. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • [Figure: pattern space with axes e.g. age and e.g. weight]

  40. Data Mining Tasks: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • In >3 dimensions this is not possible • [Figure: pattern space with axes e.g. age and e.g. weight]
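
When the data cannot be inspected visually, an algorithm has to find the groups. The slides do not name a clustering method; the sketch below uses k-means as one common choice, with made-up (age, weight) points and hand-picked starting centroids.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of its cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical instances: (age, weight).
points = [(20, 55), (22, 60), (25, 58), (65, 85), (70, 90), (68, 88)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

The same code works unchanged for points with more than three attributes, which is exactly the case where visual inspection stops being an option.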

  41. Data Mining Tasks: Classification • The goal of a classifier is to separate classes on the basis of known attributes • The classifier can be applied to an instance with unknown class • For instance, classes are healthy (circle) and sick (square); attributes are age and weight • [Figure: pattern space with axes age and weight]

  42. Examples of Classification Techniques • Majority class vote • Machine learning & AI • Decision trees • Nearest neighbor • Neural networks • Genetic algorithms / evolutionary computing • Artificial Immune Systems • Good old statistics • …..

  43. Example Classification Algorithm 1: Decision Trees • [Figure: decision tree] • Root: 20000 patients, split on age > 67 • yes: 1200 patients, split on weight > 85kg (yes: 400 patients; no: 800 patients) • no: 18800 patients, split on gender = male? (etc.) • Leaf labels shown: Diabetic (50%), Diabetic (10%)
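
A decision tree applied to a new instance is just a chain of attribute tests. The sketch below hard-codes the part of the tree visible on the slide. It assumes that the 50% leaf belongs to the "weight > 85kg: yes" branch and the 10% leaf to the "no" branch; the flattened transcript does not make this certain. The function name is hypothetical.

```python
def diabetic_fraction(age, weight_kg, gender):
    """Walk the example tree and return the fraction of diabetics in the leaf.

    Returns None for the branch whose subtree the slide only shows as "etc.".
    """
    if age > 67:
        if weight_kg > 85:
            return 0.50  # assumed: leaf with 400 patients
        return 0.10      # assumed: leaf with 800 patients
    # age <= 67: the slide splits on "gender = male?" but elides the leaves.
    return None


print(diabetic_fraction(age=70, weight_kg=90, gender="male"))    # prints 0.5
print(diabetic_fraction(age=70, weight_kg=80, gender="female"))  # prints 0.1
```

Learning the tree means choosing these tests and thresholds automatically from training data; the sketch only shows how a finished tree is used.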

  44. Decision Trees in Pattern Space • The goal of a classifier is to separate classes (circle, square) on the basis of attributes age and income • Each line corresponds to a split in the tree • Decision areas are ‘tiles’ in pattern space • [Figure: pattern space with axes age and weight]

  45. Example classification algorithm 3: Neural Networks • Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!)) • Input (attributes) is coded as activation on the input layer neurons, activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no) • Algorithm learns to find optimal weights using the training instances and a general learning rule.

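  46. Neural Networks • Example simple network (2 layers) • Probability of being diabetic = $f(\text{age} \cdot w_{\text{age}} + \text{BMI} \cdot w_{\text{BMI}})$, where BMI is the body mass index and $w_{\text{age}}$, $w_{\text{BMI}}$ are the weights on the links • [Figure: input neurons age and body_mass_index, connected by weighted links to one output neuron giving the probability of being diabetic]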
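
The two-layer network on this slide reduces to a weighted sum passed through a function f. The sketch below assumes f is the logistic (sigmoid) function and adds a bias term; neither is stated on the slide, and the weights are invented rather than learned.

```python
import math


def sigmoid(x):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def p_diabetic(age, bmi, w_age, w_bmi, bias=0.0):
    """Output activation of a two-layer network with two input neurons."""
    return sigmoid(age * w_age + bmi * w_bmi + bias)


# Hypothetical weights; a learning rule would fit these to training instances.
w_age, w_bmi, bias = 0.05, 0.10, -6.0
print(p_diabetic(40, 24, w_age, w_bmi, bias))
print(p_diabetic(70, 32, w_age, w_bmi, bias))
```

With positive weights the output rises with age and body mass index, so the second call prints the larger probability.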

  47. Neural Networks in Pattern Space • Classification • Simple network: only a line available (why?) to separate classes • Multilayer network: any classification boundary possible • [Figure: pattern space with axes e.g. age and e.g. weight]
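
One way to see why the simple network only has a line available: its output switches class where the weighted sum crosses a threshold, and that set of points is a straight line in a two-attribute pattern space. The sketch below trains a single threshold unit with the classic perceptron learning rule (an assumption, since the slides do not name the rule) on two toy problems: AND, which a line can separate, and XOR, which it cannot.

```python
def predict(weights, bias, x):
    """Threshold unit: class 1 if the weighted sum is positive, else class 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(data, epochs=20, rate=1.0):
    """Perceptron rule: nudge weights towards the target after every mistake."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


def accuracy(data):
    """Train on the data and report the fraction classified correctly."""
    weights, bias = train_perceptron(data)
    return sum(predict(weights, bias, x) == t for x, t in data) / len(data)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(accuracy(AND))  # prints 1.0: one line separates the classes
print(accuracy(XOR))  # stays below 1.0: no single line separates XOR
```

Adding a hidden layer removes this restriction, which is the point of the multilayer bullet on the slide.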

  48. Descriptive data mining: association rules • Discovery of interesting patterns • Rule format: if A (and B and C etc) then Z • Example: • If customer buys potatoes (A) and sauerkraut (B) then customer buys sausage (Z) • Important measures • Support of the condition: how often do potatoes and sauerkraut occur together (A,B) • Confidence of the rule: how often sausages occur as well, divided by the support of the condition (is A,B → Z always true?) • Could be used for instance for mining gene expression data
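
Support and confidence for the potatoes and sauerkraut rule can be computed directly from a list of transactions. The baskets below are invented for illustration, and support is expressed as a fraction of all baskets, which is one common convention.

```python
# Hypothetical shopping baskets.
baskets = [
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "sausage"},
    {"potatoes", "sauerkraut", "sausage", "mustard"},
    {"bread"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(condition, consequence, transactions):
    """Support of condition and consequence together, over support of the condition."""
    return (support(condition | consequence, transactions)
            / support(condition, transactions))


condition, consequence = {"potatoes", "sauerkraut"}, {"sausage"}
print(support(condition, baskets))                  # 3 of 5 baskets
print(confidence(condition, consequence, baskets))  # 2 of those 3 baskets
```

For gene expression data the baskets would be samples and the items would be genes flagged as over- or under-expressed.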

  49. Quiz Question

  50. What have we learned today • An introduction to applying data mining for bioinformatics • A short history of life • Basic data mining concepts
