Data, Databases, and Discovery
Andy Novobilski, PhD
UT Chattanooga Computer Science
Nuts and Bolts Research Methods Symposium
UT College of Medicine Chattanooga
September 29, 2006
An Introduction to Knowledge Discovery
• Data Collection
• Data Validation
• Preprocessing of Data
• Mining the Data
• Comparing Methods
Data Collection …
• Paper or Electronic?
• Fingernet
• Continuous or Discrete?
• And the Understatement of the Year …
  Health Insurance Portability and Accountability Act of 1996
  The HIPAA website http://www.hipaa.org/ links to the government’s website http://aspe.hhs.gov/admnsimp/, which states “Administrative Simplification in the Health Care Industry”
… And Raw Storage …
• Alphanumeric Data
  • Excel Worksheets
  • Comma/Tab Delimited Text Files
  • XML: The Extensible Markup Language
    • http://www.xml.com/
• Binary Data
  • Images
    • GIF, BMP, EPS
• Streaming Data
  • HL7 - http://www.hl7.org/ (http://en.wikipedia.org/wiki/HL7)
  • DICOM - http://medical.nema.org/
… Stored in a Relational Manner
• Relational Databases
  • Inexpensive
    • MS Access
  • Expensive
    • MS SQL Server, Oracle, Sybase, …
  • Free (sort of … open source)
    • MySQL, PostgreSQL
• Licensing Varies by Usage
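As a rough illustration of what "stored in a relational manner" can look like, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the database products listed above (SQLite is not named in the slides). The table names, column names, and rows are all hypothetical.

```python
import sqlite3

# In-memory database; a stand-in for Access, MySQL, PostgreSQL, etc.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per patient, many visits per patient.
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, gender TEXT, age INTEGER)"
)
conn.execute(
    "CREATE TABLE visit ("
    " id INTEGER PRIMARY KEY,"
    " patient_id INTEGER REFERENCES patient(id),"
    " smoker TEXT)"
)

# Invented sample rows; None is stored as NULL (missing data).
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)", [(1, "F", 54), (2, "M", 61)]
)
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [(10, 1, "N"), (11, 2, "Y"), (12, 2, None)],
)


def visits_with_patient(connection):
    """Join each visit to its patient record, ordered by visit id."""
    return connection.execute(
        "SELECT p.id, p.gender, p.age, v.smoker"
        " FROM patient p JOIN visit v ON v.patient_id = p.id"
        " ORDER BY v.id"
    ).fetchall()
```

Keeping patient attributes in one table and visits in another means each fact is stored once and recombined with a join when needed.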
Data Validation
• Patient 002 is a …
  • Pregnant Male (hit the 9 instead of 0)
  • With Ice Water in His Veins (misplaced decimal)
  • Who Might or Might Not Smoke (missing data)
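The three problems with Patient 002 can be caught by simple rule-based checks. The following is a minimal sketch, not the author's method: the field names, the coding of each field, and the plausible temperature range are assumptions made for illustration.

```python
def validate(record):
    """Return a list of problems found in one patient record.

    Field names and thresholds are hypothetical.
    """
    problems = []

    # Cross-field consistency: catches the "pregnant male" keying error.
    if record.get("gender") == "M" and record.get("pregnant") == 1:
        problems.append("male patient marked pregnant")

    # Range check: catches a misplaced decimal such as 3.70 for 37.0.
    # The 30-43 degree Celsius window is an assumed plausibility range.
    temp = record.get("temp_c")
    if temp is not None and not (30.0 <= temp <= 43.0):
        problems.append("implausible body temperature")

    # Completeness check: smoking status must be explicitly Y or N.
    if record.get("smoker") not in ("Y", "N"):
        problems.append("smoking status missing")

    return problems


# Invented record mirroring the three errors described on the slide.
patient_002 = {
    "id": "002",
    "gender": "M",
    "pregnant": 1,
    "temp_c": 3.70,
    "smoker": None,
}
```

Calling `validate(patient_002)` reports all three issues, while a clean record yields an empty list.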
Preprocessing the Data
• Clean-up
  • Out of Scope vs. Out of Family
• Feature Extraction
  • Data Aggregation
• Feature Transformation
  • Normalization
  • Principal Component Analysis
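Two of the feature transformations named above, normalization and principal component analysis, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: it uses z-score normalization (one of several possible normalizations), finds only the first principal component, and uses power iteration rather than a full eigendecomposition. The sample rows are invented, and every column is assumed to have nonzero variance.

```python
import math
import statistics


def zscore_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.pstdev(c) for c in cols]  # assumes no constant column
    return [
        [(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows
    ]


def covariance_matrix(centered_rows):
    """Covariance of already-centered data (population form, divide by n)."""
    n, d = len(centered_rows), len(centered_rows[0])
    return [
        [sum(r[i] * r[j] for r in centered_rows) / n for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(centered_rows, iterations=200):
    """Approximate the top eigenvector of the covariance by power iteration."""
    cov = covariance_matrix(centered_rows)
    d = len(cov)
    # Arbitrary asymmetric start vector; power iteration fails only if the
    # start happens to be orthogonal to the top eigenvector.
    vec = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Invented data: the second feature is exactly ten times the first.
raw = [[1, 10], [2, 20], [3, 30], [4, 40]]
normalized = zscore_columns(raw)
pc1 = first_principal_component(normalized)
```

After normalization the two columns carry the same information, so the first principal component weights them equally.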
Turning Data into Information
• Data Mining …
  • Clustering
  • Decision Trees
  • Neural Networks
  • Bayesian Networks
Clustering
• K-Means
[Figure: K-Means clustering illustration with data points labeled Y and N]
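For readers who have not seen K-Means, the following is a minimal sketch of the standard assign-then-update loop. It is not taken from the talk: the points and starting centroids are invented, and the starting centroids are passed in explicitly (rather than chosen at random) so that the result is reproducible.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain K-Means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[k]  # keep an empty cluster's centroid in place
            for k, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Invented 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

With these inputs the first cluster collects the three points near the origin and the second collects the rest.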
Decision Trees
• Division of Data Based on Information Gain
• White Box
[Figure: example decision tree with splits on Gender (M/F), Smoker (N/Y), and Age, and Y/N leaves]
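Information gain, the quantity used to choose each split, is the drop in label entropy after partitioning the data on a feature. A small sketch follows; the four records and their field names are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical records: here the outcome follows smoking status exactly
# and is unrelated to gender.
rows = [
    {"gender": "M", "smoker": "Y", "outcome": "Y"},
    {"gender": "M", "smoker": "N", "outcome": "N"},
    {"gender": "F", "smoker": "Y", "outcome": "Y"},
    {"gender": "F", "smoker": "N", "outcome": "N"},
]
```

On this toy data, splitting on `smoker` removes all uncertainty about the outcome while splitting on `gender` removes none, so a tree builder would split on `smoker` first.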
Neural Networks
• Functional Approximation to Data
• Black Box
• Most Common Type is Feed-Forward with Backpropagation
• Considerations in Training the Network
  • Many Types of Neural Networks
  • Difficulties with Discrete Data
  • Missing Data Requires Careful Consideration
[Figure: neural network diagram mapping Case Data to Forecast]
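To make "feed-forward, backpropagation" concrete, here is a minimal sketch of a network with one hidden layer trained by stochastic gradient descent on squared error. Everything specific in it is an assumption for illustration: the layer size, learning rate, epoch count, random seed, and the tiny invented data set (two binary inputs, outcome is 1 if either input is 1).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, hidden=3, epochs=2000, rate=0.5, seed=0):
    """Train a one-hidden-layer network; return (predict, per-epoch losses)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each weight list ends with a bias term.
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        # zip stops at the shorter sequence, so the bias is added separately.
        h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + ws[-1]) for ws in w_h]
        y = sigmoid(sum(w * v for w, v in zip(w_o, h)) + w_o[-1])
        return h, y

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            h, y = forward(x)
            total += (y - target) ** 2
            # Backward pass for the loss 0.5 * (y - target)^2.
            d_y = (y - target) * y * (1 - y)
            d_h = [d_y * w_o[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w_o[j] -= rate * d_y * h[j]
            w_o[-1] -= rate * d_y
            for j in range(hidden):
                for i in range(n_in):
                    w_h[j][i] -= rate * d_h[j] * x[i]
                w_h[j][-1] -= rate * d_h[j]
        losses.append(total / len(samples))

    return (lambda x: forward(x)[1]), losses


# Invented case data: (inputs, outcome).
SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, losses = train(SAMPLES)
```

The learned weights are not directly interpretable, which is the "black box" property noted above; the only visible evidence of learning is that the mean squared error falls over the epochs.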
Bayesian Networks
• Belief Networks
• White Box
• Causal Orientation
• Beliefs are Updated Based on New Information
• Nodes Can Serve as Both Evidence and Query Points
• Handle Missing Data Gracefully
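The last three properties can be seen in a very small example. The sketch below assumes a hypothetical three-node chain, Smoker → Disease → Test, with invented probabilities, and answers queries by summing the joint distribution over every variable that is not observed. It is an illustration of belief updating in general, not the network from the author's study.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers invented).
P_SMOKER = {True: 0.3, False: 0.7}
# P(disease | smoker)
P_DISEASE = {True: {True: 0.2, False: 0.8}, False: {True: 0.05, False: 0.95}}
# P(test positive | disease)
P_TEST = {True: {True: 0.9, False: 0.1}, False: {True: 0.1, False: 0.9}}

NAMES = ("smoker", "disease", "test")


def joint(smoker, disease, test):
    """Joint probability factored along the chain smoker -> disease -> test."""
    return P_SMOKER[smoker] * P_DISEASE[smoker][disease] * P_TEST[disease][test]


def query(target, evidence):
    """P(target is True | evidence), by enumeration.

    Variables absent from `evidence` are summed out, which is how
    missing values are handled.
    """
    totals = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if all(world[name] == value for name, value in evidence.items()):
            totals[world[target]] += joint(**world)
    return totals[True] / (totals[True] + totals[False])
```

With these tables, `query("disease", {})` gives the prior belief (0.095), `query("disease", {"test": True})` updates it with a positive test while smoking status is unknown, and `query("smoker", {"test": True})` treats the smoker node as the query rather than the evidence.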
An Example
• Novobilski, Andrew, F. Fesmire, D. Sonnemaker. "Mining Bayesian Networks to Forecast Adverse Outcomes Related to Acute Coronary Syndrome." The 17th International FLAIRS Conference, 2004.
Comparing Models – The ROC Curve
• The Receiver Operating Characteristic (ROC) Curve
  • Plots the percentage of true positives against the percentage of false positives as the cutoff value is varied from everyone classified as ill to everyone classified as healthy.
  • The area under the curve provides a consistent measure of model fitness that varies between 0 and 100.
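The construction described above can be sketched in a few lines. The scores and labels are invented, and two conventions are assumed: a higher score means "more likely ill", and the area is computed with the trapezoid rule as a fraction between 0 and 1 (multiply by 100 for the 0 to 100 scale). The sweep here runs from the strictest cutoff to the loosest, which traces the same curve in the opposite direction.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    labels: 1 = ill, 0 = healthy. A case is classified as ill when its
    score is at or above the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nobody classified ill
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points  # ends at (1.0, 1.0): everybody classified ill


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented model outputs and true conditions for six cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

An area of 0.5 (50) corresponds to a model no better than chance, and 1.0 (100) to one that ranks every ill case above every healthy one, which makes the number comparable across different kinds of models.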
An Illustration
[Figure: Healthy and Ill cases separated by a Cutoff Value]
In Summary …
• A Process to Consider …
  • Collect, Validate, Preprocess, Mine, Compare
• Excellent Software is Available
  • Both Commercial and Open Source
• Sample Data Is Available
Thank You!
• Questions and/or Comments are Welcome …
Dr. Andy Novobilski
UT Chattanooga Computer Science
615 McCallie Ave., Dept. 2302
Chattanooga, TN 37403
(423) 425-4202
Andy-Novobilski@utc.edu
http://www.utc.edu/Faculty/Andy-Novobilski