Research on Data Mining and Knowledge Discovery at WPI. Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute. Outline of this talk. Short tutorial on Data Mining and Knowledge Discovery in Databases (KDD) Sample ongoing KDD research projects at WPI.
Research on Data Mining and Knowledge Discovery at WPI Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute
Outline of this talk • Short tutorial on Data Mining and Knowledge Discovery in Databases (KDD) • Sample ongoing KDD research projects at WPI
Need for Data Mining • Data are being gathered and stored extremely fast • Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000). • Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data
What is Data Mining? Or more generally, Knowledge Discovery in Databases (KDD) • “Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996] • Raw Data → Data Mining → Patterns • Analytical and Statistical Patterns (rules, decision trees, …) • Visual Patterns • Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54. Fall 1996.
Data Analysis (KDD) Process • Data management (databases, data warehouses): data sources → data • Data “pre”-processing (noisy/missing data, dim. reduction): data → clean data • Data mining (analytical, statistical, visual): clean data → models • Model/pattern evaluation (quantitative, qualitative): models → “good” model • Model/patterns deployment (prediction, decision support): “good” model applied to new data
KDD is Interdisciplinary: techniques come from multiple fields • Machine Learning (AI): contributes (semi-)automatic induction of empirical laws from observations & experimentation • Statistics: contributes language, framework, and techniques • Pattern Recognition: contributes pattern extraction and pattern matching techniques • Databases: contributes efficient data storage, data cleansing, and data access techniques • Data Visualization: contributes visual data displays and data exploration • High Performance Comp.: contributes techniques for efficiently handling complexity • Application Domain: contributes domain knowledge
Data Mining Modes • Confirmatory (verification): given a hypothesis, verify its validity against the data • Exploratory (discovery), with two kinds of patterns: • Prescriptive patterns: patterns for predicting behavior of newly encountered entities • Descriptive patterns: patterns for presenting the behavior of observed entities in a human-understandable format
Analytical and Visual Data Mining • Analytical: a model that represents the data is constructed using computational methods • Visual: data are displayed on a computer screen using colors and shapes; patterns in the data are identified by the human (user) eye
What do you want to learn from your data? KDD approaches • regression • classification • clustering • change/deviation detection • summarization • dependency/assoc. analysis [Figure: the data at the center, surrounded by the approaches above, with sample patterns such as “IF a & b & c THEN d & k”, “IF k & a THEN e”, “A, B -> C 80%”, and “C, D -> A 22%”]
Data Mining Academic Systems • CBA: Liu et al., National Univ. of Singapore • WEKA: Frank et al., University of Waikato, New Zealand • ARMiner: Cristofor et al., UMass/Boston • WPI WEKA: our temporal/spatial association rules
Some Current Analytical Data Mining Research Projects at WPI • Mining Complex Data: Set and Sequence Mining • Systems performance Data • Sleep Data • Financial Data • Web Data • Data Mining for Genetic Analysis • Correlating genetic information with diseases • Predicting gene expression patterns • Data Mining for Electronic Commerce • Collaborative and Content-Based Filtering • Using Association Rules and using Neural Networks
Mining Complex Data • Each person (P1, P2, P3, …) is described by attributes such as names/aliases, bank account, age, felonies, gender, iris scan, … • Based partially on work w/ Norfolk County Sheriff Office
Sample Complex Patterns Potential temporal/spatial association: • Teenage males from Eastern Massachusetts who are convicted of burglary are likely (7%) to commit violent crimes when they are adults.
Analyzing Sleep Data (WPI, UMassMedical, BC) • Purpose: • Associations between sleep patterns and health/pathology • Obtain patterns of different sleep stages (4 sleep + REM + Wake) • Data set: • Clinical (sequential): electro-encephalogram (EEG), electro-oculogram (EOG), electro-myogram (EMG), probe measuring flow of oxygen in blood, etc. • Diagnostic (tabular): questionnaire responses, patient’s demographic info., patient’s medical history (Source: http://www.blsc.com) • Potential rules: • (A) Association Rules: (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy, confidence=92%, support=13% • (B) Classification Rules: (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian), confidence=70%, support=8% • *AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
Input Data • Each instance: [tabular | set | sequential]* attributes: attr1 attr2 attr3 attr4 attr5 [class] • Example attributes for each patient (P1, P2, P3, …): illnesses, heart rate, age, oxygen, gender, Epworth
Analyzing Financial Data • Sequential data: daily stock values • “Normal” (tabular/relational) data: sector (computers, agricultural, educational, …), type of government, product releases, company awards, … • Desired rules: • If DELL’s stock value increases & 1999<year<2002 => IBM’s stock value decreases
Financial Data Analysis [Figure panels: Cisco, AMD]
Events – Sleep Data: 6 basic sleep events/stages (W, S1, S2, S3, S4, REM) • SaO2: the mean oxygen saturation, around 90% • Heart rate shown by ECG in beats per minute • The sleep stages: W or Wake, 1 or Stage 1, 2, 3, 4, and REM or Rapid Eye Movement stage • Also shown, as brown markings, are: • Epoch (of duration 30 sec) and • Clock time (indicating total sleep time)
Events – Financial Data • Basic events: 16 or so financial templates [Little&Rhodes78] • Difficult pattern matching: alignments and time warping • Example templates: Panic Reversal, Head & Shoulders Reversal, Rounding Top Reversal, Descending Triangle Reversal
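The alignment and time-warping difficulty named on this slide can be illustrated with dynamic time warping (DTW), a standard technique for comparing a series against a template that may be stretched or compressed in time. This is only a minimal sketch: the slides do not say which matching algorithm the WPI project uses, and the function name `dtw_distance` and the toy template values are assumptions made for illustration.

```python
import math
from typing import Sequence


def dtw_distance(series: Sequence[float], template: Sequence[float]) -> float:
    """Classic O(n*m) dynamic time warping distance with absolute-difference cost."""
    n, m = len(series), len(template)
    # cost[i][j] = cheapest alignment of series[:i] with template[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series[i - 1] - template[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # series point repeats a template point
                cost[i][j - 1],      # template point repeats a series point
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical three-peak shape, loosely in the spirit of "head & shoulders".
template = [0, 1, 0, 2, 0, 1, 0]
# The same shape, stretched unevenly in time.
stretched = [0, 0, 1, 1, 0, 2, 2, 0, 1, 0]
# A steadily falling series, which should match poorly.
falling = [3, 2, 1, 0, -1, -2, -3]

stretched_distance = dtw_distance(stretched, template)
falling_distance = dtw_distance(falling, template)
```

Because DTW lets one point align with several points of the other series, the stretched copy matches the template at zero cost, while the falling series does not.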
Example: Event Identification • Templates = increase, decrease, sustain • Confidence = 90%, support = 15%, class = Epworth • Attributes for each patient (P1, P2, P3, …): illnesses, heart rate, age, oxygen, gender, Epworth
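One way to picture event identification with the three templates above is to label each step of a numeric series and merge consecutive steps that share a label. The sketch below is an assumption about how such events could be extracted, not the project's actual procedure; the function name `identify_events`, the `tolerance` parameter, and the heart-rate values are hypothetical.

```python
from typing import List, Sequence, Tuple

# (template name, start index, end index) of a maximal run in the series
Event = Tuple[str, int, int]


def identify_events(values: Sequence[float], tolerance: float = 0.0) -> List[Event]:
    """Split a series into maximal increase / decrease / sustain events."""

    def label(delta: float) -> str:
        if delta > tolerance:
            return "increase"
        if delta < -tolerance:
            return "decrease"
        return "sustain"

    events: List[Event] = []
    for i in range(1, len(values)):
        current = label(values[i] - values[i - 1])
        if events and events[-1][0] == current:
            # Extend the running event to cover this step as well.
            events[-1] = (current, events[-1][1], i)
        else:
            events.append((current, i - 1, i))
    return events


# Hypothetical heart-rate readings, one per epoch.
heart_rate = [70, 72, 75, 75, 75, 71, 68]
events = identify_events(heart_rate, tolerance=0.5)
```

Each event carries its start and end index, so consecutive events share an endpoint and can later be compared with temporal relations.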
Temporal Relations between two Events (event1, event2) • meets • before • after • overlaps • is equal to • starts • during • finishes
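The relations listed on this slide can be made concrete by comparing the endpoints of two time intervals. The following is a minimal sketch, assuming each event is a `(start, end)` pair with `start < end`. The function name `temporal_relation` is hypothetical, and the inverse names ("met by", "started by", "finished by", "contains", "overlapped by") come from Allen's interval algebra and are added here for completeness; they do not appear on the slide.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end), with start < end


def temporal_relation(event1: Interval, event2: Interval) -> str:
    """Name the temporal relation of event1 with respect to event2."""
    s1, e1 = event1
    s2, e2 = event2
    if s1 >= e1 or s2 >= e2:
        raise ValueError("each interval must satisfy start < end")
    if s1 == s2 and e1 == e2:
        return "is equal to"
    if e1 < s2:
        return "before"
    if s1 > e2:
        return "after"
    if e1 == s2:
        return "meets"
    if s1 == e2:
        return "met by"
    if s1 == s2:
        return "starts" if e1 < e2 else "started by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Only partial overlap remains at this point.
    return "overlaps" if s1 < s2 else "overlapped by"


# Example with hypothetical time points: an oxygen-increase event on [0, 1]
# followed immediately by a heart-rate-decrease event on [1, 2].
relation = temporal_relation((0, 1), (1, 2))
```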
Example: temporal association rules (time points t0, t1, t2, t3) • Heart rate decreases immediately after oxygen stops increasing & gender=M => epworth=10 (conf=95%, supp=23%) • In interval notation: HR-dec[t1,t2] & oxygen-inc[t0,t1] & gender=M => epworth=10 • Heart rate sustains while oxygen increases & patient suffers from dementia => ethnicity=white (conf=99%, supp=16%) • Patient suffers from dementia and depression & gender=F & REM[t0,t2] => oxygen-inc[t1,t3] (conf=91%, supp=17%)
Closer Look: WPI Weka • Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis (w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)) • SNP analysis • discovering correlations between sequence variations and diseases • Gene expression • discovering patterns that cause a gene to be expressed in a particular cell
Correlating Genetics with Diseases • Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research • Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources Wirth, B. et al. Journal of Human Molecular Genetics
Data Mining Techniques • Association Rule Mining • Metrics for evaluation of mined rules: • Confidence = P(Consequent | Premise) • Support = P(Consequent ∪ Premise), i.e. the fraction of records that contain every item of both the premise and the consequent • Lift = P(Consequent | Premise) / P(Consequent) • Example: [Ag1-CA, 110 = absent] & [Ag1-CA, 108 = associated] & [Gender = Female] => [SMA Type = Severe] (Confidence: 100%, Support: 9.364%, Lift: 2.39)
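The three rule metrics on this slide can be computed directly from a set of records. The sketch below is illustrative only: the records are made up, the item names (`gender=F`, `marker=absent`, `sma=severe`) are hypothetical stand-ins rather than the study's data, and `support` and `rule_metrics` are assumed helper names.

```python
from typing import Dict, FrozenSet, Iterable, List


def support(records: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for record in records if wanted <= record) / len(records)


def rule_metrics(
    records: List[FrozenSet[str]],
    premise: Iterable[str],
    consequent: Iterable[str],
) -> Dict[str, float]:
    """Support, confidence, and lift of the rule premise => consequent."""
    premise, consequent = frozenset(premise), frozenset(consequent)
    rule_support = support(records, premise | consequent)
    premise_support = support(records, premise)
    consequent_support = support(records, consequent)
    confidence = rule_support / premise_support if premise_support else 0.0
    lift = confidence / consequent_support if consequent_support else 0.0
    return {"support": rule_support, "confidence": confidence, "lift": lift}


# Made-up records: each one is the set of attribute=value items for a patient.
records = [
    frozenset({"gender=F", "marker=absent", "sma=severe"}),
    frozenset({"gender=F", "marker=absent", "sma=severe"}),
    frozenset({"gender=M", "marker=absent", "sma=mild"}),
    frozenset({"gender=F", "marker=present", "sma=mild"}),
    frozenset({"gender=M", "marker=present", "sma=severe"}),
]
metrics = rule_metrics(records, {"gender=F", "marker=absent"}, {"sma=severe"})
```

A lift above 1 means the premise makes the consequent more likely than its base rate; in this toy data the rule holds in 2 of 5 records, always when its premise holds, against a base rate of 3 in 5.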
Mining Gene Expression Patterns • Different cells require different proteins • DNA uses a four letter alphabet (ATCG) • Cell expression pattern depends on motifs [Figure labels: promoter sequences; neurons n1, n2, n3; muscle m1, m2, m3; 10 basepairs; 20 basepairs; red = ON, white = OFF]
Gene Expression Analysis
Gene | Promoter | Cell type | Motifs | Promoter sequence
Gene 1 | PR1 | neural | M1 M4 M2 M5 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT
Gene 2 | PR2 | neural | M1 M4 M5 | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
Gene 3 | PR3 | muscle | M4 M1 | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
Gene 4 | PR4 | neural | M1 M2 M5 | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
Gene 5 | PR5 | muscle | M1 M4 | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
Gene 6 | PR6 | neural | M3 M4 M5 | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
Gene 7 | PR7 | neural | M5 M2 M3 M1 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
Gene 8 | PR8 | neural | M2 M4 M5 | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
Gene 9 | PR9 | muscle | M4 M3 | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA
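A simple way to turn promoter sequences into input for a data mining algorithm is to record which motifs occur in each sequence. The sketch below assumes exact string matching, which is a simplification. The slide names motifs M1 to M5 but does not give their sequences, so the motif strings and names used here are hypothetical; the promoter string is the PR1 sequence from the slide. `find_motif` and `motif_features` are assumed helper names.

```python
from typing import Dict, List


def find_motif(sequence: str, motif: str) -> List[int]:
    """Return every start index (0-based) where `motif` occurs, overlaps included."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def motif_features(sequence: str, motifs: Dict[str, str]) -> Dict[str, bool]:
    """Presence/absence of each named motif, usable as tabular attributes."""
    return {name: bool(find_motif(sequence, pattern)) for name, pattern in motifs.items()}


# PR1 promoter sequence as listed on the slide.
pr1 = "CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT"

# Hypothetical motif strings, chosen only to illustrate the matching step.
hypothetical_motifs = {
    "motif_a": "GGGCC",
    "motif_b": "TATATAG",
    "motif_c": "ACGTACGT",
}

features = motif_features(pr1, hypothetical_motifs)
```

The resulting presence/absence attributes, together with the cell type as the class, form one tabular instance per promoter.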
Our System: CAGE • To predict gene expression based on DNA sequences [Figure: CAGE marks Gene 1, Gene 2, and Gene 3 as On or Off in each cell type: Muscle Cell, Neural Cell, Seam Cells]
Summary • KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” • The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of those patterns • Data mining is the discovery and extraction of patterns from data, not the extraction of data • Important challenges in data mining: privacy, security, scalability, real-time, and handling non-conventional data
Data Mining Resources – Books • Advances in Knowledge Discovery and Data Mining. Eds.: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995. • Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann Publishers, 2001. • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. I. Witten and E. Frank. Morgan Kaufmann Publishers, 2000. • Data Mining: Technologies, Techniques, Tools, and Trends. B. Thuraisingham. CRC, 1998. • Principles of Data Mining. D. J. Hand, H. Mannila, and P. Smyth. MIT Press, 2000. • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, and J. Friedman. Springer Verlag, 2001. • Data Mining Cookbook: Modeling Data for Marketing, Risk, and CRM. O. Parr Rud. Wiley, 2001. • Data Mining: A Hands-On Approach for Business Professionals. R. Groth. Prentice Hall, 1998. • Data Preparation for Data Mining. Dorian Pyle. Morgan Kaufmann, 1999. • Data Mining Methods for Knowledge Discovery. Cios, Pedrycz, & Swiniarski. Kluwer, 1998.
Data Mining Resources – Books (cont.) • Mastering Data Mining. M. Berry & G. Linoff. John Wiley & Sons, 2000. • Data Mining Techniques for Marketing, Sales and Customer Support. Berry & Linoff. John Wiley & Sons, 1997. • Decision Support using Data Mining. S. Anand and A. Buchner. Financial Times Pitman Publishing, 1998. • Feature Selection for Knowledge Discovery and Data Mining. Liu and Motoda. Kluwer, 1998. • Feature Extraction, Construction and Selection: A Data Mining Perspective. Eds.: Motoda and Liu. Kluwer, 1998. • Knowledge Acquisition from Databases. Xindong Wu. • Mining Very Large Databases with Parallel Processing. A. Freitas & S. Lavington. Kluwer, 1998. • Predictive Data-Mining: A Practical Guide. Weiss & Indurkhya. Morgan Kaufmann, 1998. • Machine Learning and Data Mining: Methods and Applications. Michalski, Bratko, and Kubat. John Wiley & Sons, 1998. • Rough Sets and Data Mining: Analysis of Imprecise Data. Eds.: Lin and Cercone. Kluwer. • Seven Methods for Transforming Corporate Data into Business Intelligence. Vasant Dhar and Roger Stein. Prentice-Hall, 1997.
Data Mining Resources – Journals • Data Mining and Knowledge Discovery Journal • Newsletters: • ACM SIGKDD Explorations Newsletter • Related Journals: • TKDE: IEEE Transactions on Knowledge and Data Engineering • TODS: ACM Transactions on Database Systems • JACM: Journal of the ACM • Data and Knowledge Engineering • JIIS: Journal of Intelligent Information Systems
Data Mining Resources – Conferences • KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining • ICDM: IEEE International Conference on Data Mining • SIAM International Conference on Data Mining • PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases • PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining • DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery • Related Conferences: • ICML: Intl. Conf. on Machine Learning • IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning • IJCAI: International Joint Conference on Artificial Intelligence • AAAI: American Association for Artificial Intelligence Conference • SIGMOD/PODS: ACM Intl. Conference on Data Management • ICDE: International Conference on Data Engineering • VLDB: International Conference on Very Large Data Bases