
Lecture 2: Predictive Data Mining Fall 2005 Peter van der Putten (putten_at_liacs.nl)


Presentation Transcript


  1. Databases and Data Mining. Lecture 2: Predictive Data Mining. Fall 2005. Peter van der Putten (putten_at_liacs.nl)

  2. Course Outline • Objective • Understand the basics of data mining • Gain understanding of the potential for applying it in the bioinformatics domain • Hands on experience • Schedule • Evaluation • Practical assignment (2nd) plus take home exercise • Website • http://www.liacs.nl/~putten/edu/dbdm05/

  3. Agenda Today • Recap Lecture 1 • A short introduction to life • Data mining explained • Predictive data mining concepts • Classification and regression • Bioinformatics applications • Predictive data mining techniques • Logistic Regression • Nearest Neighbor • Decision Trees • Naive Bayes • Neural Networks • Evaluating predictive models • WEKA Demo (optional) • Lab session • Predictive Modeling using WEKA

  4. What is data mining?

  5. Sources of (artificial) intelligence • Reasoning versus learning • Learning from data • Patient data • Customer records • Stock prices • Piano music • Criminal mug shots • Websites • Robot perceptions • Etc.

  6. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  7. A short summary of life: Bio Building Blocks, Biotech, Data Mining Applications

  8. The Promise…

  9. The Promise…

  10. Discovering the structure of DNA: James Watson & Francis Crick, Rosalind Franklin

  11. The structure of DNA

  12. DNA Trivia • DNA stores instructions for the cell to perform its functions • Double helix, two interwoven strands • Each strand is a sequence of so-called nucleotides • Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides (bases): adenine (A), thymine (T), cytosine (C) and guanine (G) • Nucleotide uracil (U) doesn’t occur in DNA • Each strand is the reverse complement of the other • Complementary bases • A with T • C with G
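
The base-pairing rule on this slide (A with T, C with G; each strand is the reverse complement of the other) can be sketched in a few lines of Python. The function name `reverse_complement` and the example sequences are illustrative assumptions, not part of the lecture.

```python
# Complementary bases as listed on the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand: complement every base, read in reverse order."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    strand = "GATTACA"  # hypothetical example sequence
    print(strand, "->", reverse_complement(strand))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.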

  13. DNA Trivia • Each nucleus contains 3 x 10^9 nucleotides • Human body contains 3 x 10^12 cells • Human DNA contains 26k expressed genes, each gene codes for a protein in principle • DNA of different persons varies 0.2% or less • Human DNA contains 3.2 x 10^9 base pairs • ΦX-174 virus: 5,386 • Salamander: 100 x 10^9 • Amoeba dubia: 670 x 10^9

  14. Primary Protein Structure • Proteins are built out of peptides, which are polymer chains of amino acids • Twenty amino acids are encoded by the standard genetic code shared by nearly all organisms and are called standard amino acids (100 amino acids exist in nature)

  15. Protein Structure: from Primary to Quaternary (source: Wikipedia)

  16. Proteins: 3D Structure A representation of the 3D structure of myoglobin, showing coloured alpha helices. This protein was the first to have its structure solved by X-ray crystallography by Max Perutz and Sir John Cowdery Kendrew in 1958, which led to them receiving a Nobel Prize in Chemistry. http://en.wikipedia.org/wiki/Protein

  17. Proteins: 3D Structure G Protein-Coupled Receptors (GPCR) represent more than half the current drug targets

  18. From DNA to Proteins

  19. Standard Genetic Code • Each tri-nucleotide unit (‘codon’) codes for one amino acid • This code is the same for nearly all living organisms (table: The Standard Genetic Code, Wikipedia)

  20. Standard Genetic Code • Each tri-nucleotide unit (‘codon’) codes for one amino acid • This code is the same for nearly all living organisms (table: The Standard Genetic Code, Wikipedia)
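
As a rough illustration of how codons map to amino acids, the sketch below builds the 64-entry standard genetic code table and translates a DNA sequence three bases at a time. It assumes coding-strand DNA letters (T rather than U), one-letter amino acid codes, and `*` for stop codons; the names `CODON_TABLE` and `translate` are hypothetical helpers, not something shown in the slides.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering (first base
# varies slowest). One-letter amino acid codes; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon (trailing partial codon ignored)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # hypothetical sequence: start codon, one codon, stop
```

The table has 64 codons for 20 amino acids plus stop, so several codons map to the same amino acid.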

  21. Importance of Combinatorial Gene Control • Combinations of a few gene regulatory proteins can generate many different cell types during development

  22. Some working definitions…. • Bioinformatics = • Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data [http://www.bisti.nih.gov/]. • Or more pragmatic: Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems [Wikipedia Nov 2005]

  23. NCBI Tools for data mining: • Nucleotide sequence analysis • Protein sequence analysis • Structures • Genome analysis • Gene expression • Data mining or not?

  24. Bioinformatics and data mining • From sequence to structure to function • Genomics (DNA), Transcriptomics (RNA), Proteomics (proteins), Metabolomics (metabolites) • Pattern matching and search • Sequence matching and alignment • Structure prediction • Predicting structure from sequence • Protein secondary structure prediction • Function prediction • Predicting function from structure • Protein localization • Expression analysis • Genes: microarray data analysis etc. • Proteins • Regulation analysis

  25. Bioinformatics and data mining • Classical medical and clinical studies • Medical decision support tools • Text mining on medical research literature (MEDLINE) • Spectrometry, Imaging • Systems biology and modeling biological systems • Population biology & simulation • Spin-off: Biologically inspired computational learning • Evolutionary algorithms, neural networks, artificial immune systems

  26. Data mining revisited

  27. Genomic Microarrays – Case Study • Problem: • Leukemia (different types of Leukemia cells look very similar) • Given data for a number of samples (patients), can we • Accurately diagnose the disease? • Predict outcome for given treatment? • Recommend best treatment? • Solution • Data mining on micro-array data

  28. Microarray data • 50 most important genes • Rows: genes • Columns: samples / patients

  29. Example: ALL/AML data • 38 training patients, 34 test patients, ~ 7,000 patient attributes (microarray gene data) • 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) • Use training data to build a diagnostic model (ALL vs AML) • Results on test data: • 33/34 correct; the 1 misclassified sample may itself be mislabeled

  30. Some working definitions…. • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

  31. The Knowledge Discovery Process

  32. Some working definitions…. • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Example: the relation between patient characteristics and the probability to be diabetic • Instances: the individual, independent examples of a concept • Example: a patient, candidate drug etc. • Attributes: measuring aspects of an instance • Example: age, weight, lab tests, microarray data etc • Pattern or attribute space

  33. Data mining tasks • Predictive data mining • Classification: classify an instance into a category • Regression: estimate some continuous value • Descriptive data mining • Matching & search: finding instances similar to x • Clustering: discovering groups of similar instances • Association rule extraction: if a & b then c • Summarization: summarizing group descriptions • Link detection: finding relationships • …

  34. Data Mining Tasks: Classification. The goal of a classifier is to separate classes on the basis of known attributes. The classifier can then be applied to an instance with unknown class. For instance, classes are healthy (circle) and sick (square); attributes are age and weight. (Plot axes: age, weight)

  35. Data Preparation for Classification • On attributes • Attribute selection • Attribute construction • On attribute values • Outlier removal / clipping • Normalization • Creating dummies • Missing values imputation • ….
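
A minimal sketch of four of the attribute-value steps listed on this slide (clipping, normalization, creating dummies, missing value imputation), in plain Python. The function names, the choice of mean imputation and the choice of min-max scaling are assumptions made for illustration; the slide does not prescribe specific methods.

```python
def impute_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def clip(values, low, high):
    """Clip outliers to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:  # constant attribute: nothing to scale
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def dummies(values):
    """Turn a symbolic attribute into one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


if __name__ == "__main__":
    ages = [25, None, 67, 120]            # hypothetical raw attribute values
    ages = clip(impute_mean(ages), 0, 100)
    print(min_max_normalize(ages))
    print(dummies(["f", "m", "f"]))
```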

  36. Examples of Classification Techniques • Majority class vote • Logistic Regression • Nearest Neighbor • Decision Trees, Decision Stumps • Naive Bayes • Neural Networks • Genetic algorithms • Artificial Immune Systems

  37. Example classification algorithm: Logistic Regression • Linear regression • For regression, not classification (outcome numeric, not symbolic class) • Predicted value is a linear combination of inputs • Logistic regression • Apply logistic function to linear regression formula • Scales output between 0 and 1 • For binary classification use thresholding
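
The three steps on this slide (linear combination of inputs, logistic function, thresholding) can be sketched as follows. The weights, bias and the 0.5 threshold are made-up values for illustration; in practice the weights would be fitted to training data, which this sketch does not do.

```python
import math


def linear_score(weights, bias, x):
    """Linear regression formula: bias plus weighted sum of the inputs."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def logistic(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, x):
    return logistic(linear_score(weights, bias, x))


def classify(weights, bias, x, threshold=0.5):
    """Binary classification by thresholding the predicted probability."""
    return 1 if predict_probability(weights, bias, x) >= threshold else 0


if __name__ == "__main__":
    weights, bias = [0.05, 0.03], -6.0   # hypothetical coefficients for (age, weight)
    for patient in [(70, 90), (30, 60)]:
        print(patient, round(predict_probability(weights, bias, patient), 3),
              classify(weights, bias, patient))
```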

  38. Example classification algorithm: Logistic Regression. Classification: linear decision boundaries can be represented well with linear classifiers like logistic regression. (Plot axes: e.g. age, e.g. weight)

  39. Logistic Regression in attribute space. Prediction: linear decision boundaries can be represented well with linear classifiers like logistic regression. (Plot axes: e.g. age, e.g. weight)

  40. Logistic Regression in attribute space. Prediction: non-linear decision boundaries cannot be represented well with linear classifiers like logistic regression. (Plot axes: e.g. age, e.g. weight)

  41. Logistic Regression in attribute space. Non-linear decision boundaries cannot be represented well with linear classifiers like logistic regression. Well-known example: the XOR problem. (Plot axes: e.g. age, e.g. weight)
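
To make the XOR remark concrete, the sketch below tries a grid of linear threshold classifiers (weights `w1`, `w2` and bias `b`) on the four XOR points and, for comparison, on the linearly separable AND function. The grid and the function name `best_linear_accuracy` are assumptions; a grid search is only an illustration, not a proof that no linear boundary exists.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR_LABELS = [0, 1, 1, 0]   # not linearly separable
AND_LABELS = [0, 0, 0, 1]   # linearly separable

# Candidate values for w1, w2 and b: -3.0 to 3.0 in steps of 0.5.
GRID = [i / 2 for i in range(-6, 7)]


def best_linear_accuracy(labels, grid=GRID):
    """Highest number of points any classifier 'w1*x1 + w2*x2 + b > 0' on the grid gets right."""
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        correct = sum(
            int(w1 * x1 + w2 * x2 + b > 0) == label
            for (x1, x2), label in zip(POINTS, labels)
        )
        best = max(best, correct)
    return best


if __name__ == "__main__":
    print("AND:", best_linear_accuracy(AND_LABELS), "of 4 correct")
    print("XOR:", best_linear_accuracy(XOR_LABELS), "of 4 correct")
```

On this grid a single linear boundary classifies all four AND points correctly but never more than three of the four XOR points.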

  42. Example classification algorithm: Nearest Neighbour • The data itself is the classification model, so there is no model abstraction like a tree etc. • For a given instance x, search for the k instances that are most similar to x • Classify x as the most frequently occurring class among the k most similar instances
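
A minimal k-nearest-neighbour sketch following the steps on this slide: keep the training data as the model, find the k most similar instances, and take a majority vote. Euclidean distance as the similarity measure, k = 3 and the tiny (age, weight) data set are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical training data: ((age, weight), class label).
TRAIN = [
    ((25, 60), "healthy"),
    ((30, 65), "healthy"),
    ((35, 70), "healthy"),
    ((60, 95), "sick"),
    ((65, 100), "sick"),
    ((70, 90), "sick"),
]


def knn_classify(train, x, k=3):
    """Classify x as the most frequent class among its k nearest training instances."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify(TRAIN, (62, 97)))
    print(knn_classify(TRAIN, (28, 62)))
```

In practice the attributes would usually be normalized first, since otherwise the attribute with the largest numeric range dominates the distance.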

  43. Nearest Neighbor in attribute space. Classification of a new instance. Any decision area possible. Condition: enough data available. (Plot axes: e.g. age, e.g. weight)

  44. Nearest Neighbor in attribute space. Prediction: any decision area possible. Condition: enough data available. (Plot axes: e.g. age, e.g. weight)

  45. Example Classification Algorithm: Decision Trees (diagram) • Root: 20000 patients, test: age > 67 • yes: 1200 patients, test: weight > 85kg • yes: 400 patients, Diabetic (50%) • no: 800 customers, Diabetic (10%) • no: 18800 patients, test: gender = male? • etc.

  46. Building Trees: Weather Data example (KDNuggets / Witten & Frank, 2000)

  47. Building Trees • An internal node is a test on an attribute. • A branch represents an outcome of the test, e.g., Color=red. • A leaf node represents a class label or class label distribution. • At each node, one attribute is chosen to split training examples into distinct classes as much as possible. • A new case is classified by following a matching path to a leaf node. • Example tree (diagram): Outlook • sunny: test Humidity (high: No, normal: Yes) • overcast: Yes • rain: test Windy (false: Yes, true: No) (KDNuggets / Witten & Frank, 2000)
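
Classifying a new case by following a matching path to a leaf can be sketched with a nested data structure. The tree below encodes the weather example as it is usually given in Witten & Frank (sunny tests humidity, overcast is always yes, rain tests windy); the tuple-and-dict representation and the name `classify` are assumptions made for this sketch.

```python
# Internal node: (attribute name, {attribute value: subtree}); leaf: class label string.
WEATHER_TREE = (
    "outlook",
    {
        "sunny": ("humidity", {"high": "no", "normal": "yes"}),
        "overcast": "yes",
        "rain": ("windy", {"false": "yes", "true": "no"}),
    },
)


def classify(tree, instance):
    """Follow the branch matching the instance at each internal node until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


if __name__ == "__main__":
    new_case = {"outlook": "sunny", "humidity": "high", "windy": "false"}
    print(classify(WEATHER_TREE, new_case))
```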

  48. Split on what attribute? • Which is the best attribute to split on? • The one which will result in the smallest tree • Heuristic: choose the attribute that produces best separation of classes (the “purest” nodes) • Popular impurity measure: information • Measured in bits • At a given node, how much more information do you need to classify an instance correctly? • What if at a given node all instances belong to one class? • Strategy • choose attribute that results in greatest information gain KDNuggets / Witten & Frank, 2000

  49. Which attribute to select? • Candidate: outlook attribute • What is the info for the leaves? • Info[2,3] = 0.971 bits • Info[4,0] = 0 bits • Info[3,2] = 0.971 bits • Total: take the average weighted by the number of instances • Info([2,3], [4,0], [3,2]) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693 bits • What was the info before the split? • Info[9,5] = 0.940 bits • What is the gain for a split on outlook? • Gain(outlook) = 0.940 – 0.693 = 0.247 bits (Witten & Frank, 2000)
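
The arithmetic on this slide can be checked with a short sketch, assuming "info" is the usual entropy in bits of a class-count list such as [2,3]. The function names `info`, `weighted_info` and `gain` are illustrative; results agree with the slide up to rounding in the last digit.

```python
import math


def info(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [2, 3]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def weighted_info(partitions):
    """Average info of the leaves, weighted by the number of instances in each leaf."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)


def gain(parent_counts, partitions):
    """Information gain: info before the split minus weighted info after it."""
    return info(parent_counts) - weighted_info(partitions)


if __name__ == "__main__":
    outlook_leaves = [[2, 3], [4, 0], [3, 2]]
    print("info[2,3]     =", round(info([2, 3]), 3))
    print("info[9,5]     =", round(info([9, 5]), 3))
    print("gain(outlook) =", round(gain([9, 5], outlook_leaves), 3))
```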

  50. Which attribute to select? • Gain(outlook) = 0.247 bits • Gain(humidity) = 0.152 bits • Gain(windy) = 0.048 bits • Gain(temperature) = 0.029 bits (Witten & Frank, 2000)
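
The four gains can be recomputed from the data. The sketch below assumes the standard 14-instance weather data set from Witten & Frank, which is not reproduced in this transcript and is written out here from memory, so treat the table as an assumption. The strategy from the earlier slide (choose the attribute with the greatest information gain) then selects the root attribute.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Assumed weather data: (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rain", "mild", "high", "false", "yes"),
    ("rain", "cool", "normal", "false", "yes"),
    ("rain", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rain", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rain", "mild", "high", "true", "no"),
]


def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, column):
    """Information gain of splitting the rows on the attribute in the given column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[-1] for row in rows]) - remainder


GAINS = {name: gain(WEATHER, i) for i, name in enumerate(ATTRIBUTES)}
BEST = max(GAINS, key=GAINS.get)  # attribute with the greatest information gain

if __name__ == "__main__":
    for name, value in GAINS.items():
        print(f"gain({name}) = {value:.3f} bits")
    print("split on:", BEST)
```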
