Data Mining Motivation: “Necessity is the Mother of Invention”

Data Mining Motivation: “Necessity is the Mother of Invention”. Automated data collection tools and mature database technology have led to tremendous amounts of stored data. We are drowning in data, but starving for knowledge! Solution: Data mining

Presentation Transcript


  1. Data Mining Motivation: “Necessity is the Mother of Invention” • Automated data collection tools and mature database technology have led to tremendous amounts of stored data. • We are drowning in data, but starving for knowledge! • Solution: Data mining • Extract interesting rules, patterns, and constraints • Reduce volume, raise information/knowledge levels

  2. What Is Data Mining? • Data mining: • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases • Alternative names: • Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data prospecting, data archeology, data dredging, information harvesting, business intelligence, etc. • What is not data mining? • (Deductive) query processing.

  3. Applications • Database analysis and decision support • Market analysis and management • Target marketing, customer relationship management, market basket analysis, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and management • Other Applications • Text mining (newsgroups, email, documents) and Web analysis • Intelligent query answering

  4. More Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat • Astronomy • 22 quasars discovered with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs to discover customer preferences and behaviors, analyze the effectiveness of Web marketing, improve Web site organization, etc.

  5. Data Mining: A KDD Process • Data mining: the core of the knowledge discovery process. • [Figure: KDD pipeline, from bottom to top: Databases -> Data Cleaning/Integration (missing data, outliers, noise, errors) -> Data Warehouse -> Selection (feature extraction, attribute selection) -> Task-relevant Data -> Data Mining (classification, clustering, ARM) -> Pattern Evaluation -> Knowledge]

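  6. Association Rule Mining: The “Walmart” Example • Rule: {Diaper, Milk} => Beer • $\text{Support} = \dfrac{\sigma(\{\text{Diaper, Milk, Beer}\})}{|D|} = 0.4$ • $\text{Confidence} = \dfrac{\sigma(\{\text{Diaper, Milk, Beer}\})}{\sigma(\{\text{Diaper, Milk}\})} \approx 0.66$ • Here $\sigma(X)$ is the number of transactions that contain itemset $X$, and $|D|$ is the total number of transactions in the database $D$.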
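The support and confidence figures for this rule can be reproduced with a few lines of Python. This is a minimal sketch: the slide does not list the underlying transactions, so the five baskets below are hypothetical data chosen so that the rule comes out at support 0.4 and confidence 2/3 (which the slide shows as 0.66). The function names are illustrative as well.

```python
from typing import FrozenSet, List, Set

# Hypothetical market-basket data (not from the slides), chosen so that
# {Diaper, Milk} => Beer has support 2/5 and confidence 2/3.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]


def count(itemset: Set[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent: Set[str], consequent: Set[str],
            transactions: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent: Set[str], consequent: Set[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Fraction of antecedent-containing transactions that also contain the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


if __name__ == "__main__":
    lhs, rhs = {"Diaper", "Milk"}, {"Beer"}
    print(support(lhs, rhs, TRANSACTIONS), confidence(lhs, rhs, TRANSACTIONS))
```

Support measures how often the whole rule applies in the database, while confidence measures how reliable the implication is once the antecedent is present.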

  7. Precision Ag example: Find image antecedents that imply high yield • [Figure: TIFF image alongside yield map] • High Green reflectance => High Yield (obvious) • High (NearInfraRed - Red) => High Yield (higher confidence)

  8. Grasshopper Infestation Prediction • Grasshoppers caused significant economic loss last year. • These insects are likely to return this year. • Early prediction of the infestation is a key step in decreasing damage. • Association rule mining on remotely sensed imagery holds significant potential for early detection. • How do we identify the signature of an initial infestation from the RGB bands?

  9. Gene Regulation Pathway Discovery example • Results of clustering may indicate that nine genes are involved in a pathway. • High-confidence rule mining on that cluster will discover the relationships among the genes in which the expression of one gene (e.g., Gene2) is regulated by others. Other genes (e.g., Gene4 and Gene7) may not be directly involved in regulating Gene2 and can therefore be excluded. • [Figure: clustering groups Gene1 through Gene9; ARM then relates Gene1, Gene3, Gene5, Gene6, Gene8, and Gene9 to Gene2, leaving out Gene4 and Gene7]

  10. Data Mining: Confluence of Multiple Disciplines • Database Technology • Statistics • Machine Learning • Visualization • Information Science • Other Disciplines

  11. Spatial Data Formats (Cont.) • BAND-1 • Row 1: 254 (1111 1110), 127 (0111 1111) • Row 2: 14 (0000 1110), 193 (1100 0001) • BAND-2 • Row 1: 37 (0010 0101), 240 (1111 0000) • Row 2: 200 (1100 1000), 19 (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19

  12. Spatial Data Formats (Cont.) • BAND-1 • Row 1: 254 (1111 1110), 127 (0111 1111) • Row 2: 14 (0000 1110), 193 (1100 0001) • BAND-2 • Row 1: 37 (0010 0101), 240 (1111 0000) • Row 2: 200 (1100 1000), 19 (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19 • BIL format (1 file) • 254 127 37 240 14 193 200 19

  13. Spatial Data Formats (Cont.) • BAND-1 • Row 1: 254 (1111 1110), 127 (0111 1111) • Row 2: 14 (0000 1110), 193 (1100 0001) • BAND-2 • Row 1: 37 (0010 0101), 240 (1111 0000) • Row 2: 200 (1100 1000), 19 (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19 • BIL format (1 file) • 254 127 37 240 14 193 200 19 • BIP format (1 file) • 254 37 127 240 14 200 193 19
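The three byte-level layouts differ only in the order in which the same eight values are written out. The following is a minimal sketch using the 2x2, two-band image from the slides; the nested-list layout `image[band][row][col]` and the function names are assumptions made for illustration.

```python
from typing import List

Image = List[List[List[int]]]  # image[band][row][col]

# The 2x2, two-band example image from the slides.
IMAGE: Image = [
    [[254, 127], [14, 193]],  # Band 1
    [[37, 240], [200, 19]],   # Band 2
]


def to_bsq(image: Image) -> List[List[int]]:
    """Band sequential: one flat pixel list (one file) per band."""
    return [[v for row in band for v in row] for band in image]


def to_bil(image: Image) -> List[int]:
    """Band interleaved by line: for each row, that row of every band in turn."""
    n_rows = len(image[0])
    return [v for r in range(n_rows) for band in image for v in band[r]]


def to_bip(image: Image) -> List[int]:
    """Band interleaved by pixel: for each pixel, its value in every band in turn."""
    n_rows, n_cols = len(image[0]), len(image[0][0])
    return [band[r][c]
            for r in range(n_rows)
            for c in range(n_cols)
            for band in image]


if __name__ == "__main__":
    print("BSQ:", to_bsq(IMAGE))
    print("BIL:", to_bil(IMAGE))
    print("BIP:", to_bip(IMAGE))
```

BSQ keeps each band contiguous, which suits per-band processing; BIP keeps each pixel's full spectrum contiguous; BIL sits between the two.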

  14. Spatial Data Formats (Cont.) • BAND-1 • Row 1: 254 (1111 1110), 127 (0111 1111) • Row 2: 14 (0000 1110), 193 (1100 0001) • BAND-2 • Row 1: 37 (0010 0101), 240 (1111 0000) • Row 2: 200 (1100 1000), 19 (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19 • BIL format (1 file) • 254 127 37 240 14 193 200 19 • BIP format (1 file) • 254 37 127 240 14 200 193 19 • bSQ format (16 files; each column below is one file, each row one pixel)
      B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
       1   1   1   1   1   1   1   0   0   0   1   0   0   1   0   1
       0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0
       0   0   0   0   1   1   1   0   1   1   0   0   1   0   0   0
       1   1   0   0   0   0   0   1   0   0   0   1   0   0   1   1
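The bSQ layout splits every band into one file per bit position. A minimal sketch of that split, assuming 8-bit values and the naming used on the slide (`B11` is the most significant bit of band 1, `B28` the least significant bit of band 2); the function names are illustrative.

```python
from typing import Dict, List

# Flat pixel lists for the two bands of the slide's 2x2 image.
BAND_1 = [254, 127, 14, 193]
BAND_2 = [37, 240, 200, 19]


def bit_planes(pixels: List[int], n_bits: int = 8) -> List[List[int]]:
    """Return n_bits lists; plane 0 holds the most significant bit of each pixel."""
    return [[(p >> (n_bits - 1 - k)) & 1 for p in pixels]
            for k in range(n_bits)]


def to_bsq_bits(bands: List[List[int]], n_bits: int = 8) -> Dict[str, List[int]]:
    """Map names such as 'B11' (band 1, bit 1 = MSB) to the matching bit plane."""
    files: Dict[str, List[int]] = {}
    for b, pixels in enumerate(bands, start=1):
        for k, plane in enumerate(bit_planes(pixels, n_bits), start=1):
            files[f"B{b}{k}"] = plane
    return files


if __name__ == "__main__":
    for name, plane in to_bsq_bits([BAND_1, BAND_2]).items():
        print(name, plane)
```

Each resulting bit plane is a pure 0/1 image, which is the kind of input a P-tree is built from.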

  15. Peano Count Tree (P-tree) • P-trees are a lossless, compressed representation of data in a recursive, quadrant-oriented structure. • NDSU holds patents on P-tree technology.

  16. An example of P-tree • Peano or Z-ordering • Quadrant • Root Count • [Figure: 8x8 bit matrix and its P-tree]
      1 1 1 1 1 1 0 0
      1 1 1 1 1 0 0 0
      1 1 1 1 1 1 0 0
      1 1 1 1 1 1 1 0
      1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1
      0 1 1 1 1 1 1 1
      Root count: 55
      Level 1 (quadrant counts in Z-order): 16 8 15 16
      Level 2 under 8: 3 0 4 1; under 15: 4 4 3 4
      Level 3 (bits) under the 3: 1 1 1 0; under the 1: 0 0 1 0; under the second 3: 1 1 0 1
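The counts in this example can be reproduced by recursively splitting the bit matrix into quadrants and stopping at pure quadrants. The sketch below is an illustration of that idea only, not the patented NDSU implementation; it assumes a square matrix whose side is a power of two, and the tuple-based node layout and function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A node is (count_of_ones, children); children is None for a pure quadrant.
Node = Tuple[int, Optional[list]]

# The 8x8 bit matrix from the slide.
BITS: List[List[int]] = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(bits: List[List[int]], row: int = 0, col: int = 0,
                size: Optional[int] = None) -> Node:
    """Build the count tree for the size x size quadrant whose corner is (row, col).

    Children are listed in Peano (Z) order: upper-left, upper-right,
    lower-left, lower-right. Pure quadrants (all 0s or all 1s) and
    single cells are leaves.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    if size == 1 or count == 0 or count == size * size:
        return (count, None)
    half = size // 2
    children = [
        build_ptree(bits, row, col, half),
        build_ptree(bits, row, col + half, half),
        build_ptree(bits, row + half, col, half),
        build_ptree(bits, row + half, col + half, half),
    ]
    return (count, children)


def child_counts(node: Node) -> List[int]:
    """Counts of a node's four sub-quadrants, or [] for a leaf."""
    _, children = node
    return [] if children is None else [c for c, _ in children]


if __name__ == "__main__":
    root = build_ptree(BITS)
    print("root count:", root[0])
    print("level 1:", child_counts(root))
```

Because pure quadrants are never expanded, large uniform regions cost a single node, which is where the compression comes from.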

  17. An example of P-tree • Pure (Pure-1/Pure-0) quadrant • Root Count • Level • Fan-out • QID (Quadrant ID) • [Figure: the same 8x8 bit matrix and P-tree (root count 55; level-1 counts 16 8 15 16), annotated with levels 0 to 3 and a QID example: the pixel at (7, 1), in binary (111, 001), has quadrant ID 2.2.3, in binary 10.10.11]
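The QID example on this slide follows from interleaving the bits of the row and column coordinates, one pair per tree level. A minimal sketch, assuming zero-based (row, column) coordinates and the quadrant numbering 0 = upper-left, 1 = upper-right, 2 = lower-left, 3 = lower-right; the function names are hypothetical.

```python
from typing import List


def quadrant_id(row: int, col: int, levels: int) -> List[int]:
    """Quadrant digits from the root down for the pixel at (row, col).

    At each level, one row bit and one column bit (most significant
    first) are combined into a digit 2 * row_bit + col_bit.
    """
    digits = []
    for k in range(levels - 1, -1, -1):
        row_bit = (row >> k) & 1
        col_bit = (col >> k) & 1
        digits.append(2 * row_bit + col_bit)
    return digits


def format_qid(digits: List[int]) -> str:
    """Dotted decimal form, e.g. '2.2.3'."""
    return ".".join(str(d) for d in digits)


def format_qid_binary(digits: List[int]) -> str:
    """Dotted two-bit binary form, e.g. '10.10.11'."""
    return ".".join(format(d, "02b") for d in digits)


if __name__ == "__main__":
    qid = quadrant_id(7, 1, levels=3)  # pixel (7, 1) in an 8x8 image
    print(format_qid(qid), format_qid_binary(qid))
```

Reading the binary form back, the first bit of each pair rebuilds the row (111) and the second bit rebuilds the column (001).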

  18. Tuple Count Cube (T-cube) • The $(v_1, v_2, v_3)$th cell of the T-cube contains the Root Count of $P(v_1, v_2, v_3) = P_{1,v_1} \text{ AND } P_{2,v_2} \text{ AND } P_{3,v_3}$
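The T-cube definition can be illustrated with plain bitmasks standing in for P-trees: each mask marks the pixels where a band takes a given value, ANDing masks combines bands, and counting the set bits gives the root count. This is a sketch under stated assumptions: it uses uncompressed Python integers rather than actual P-trees, and the three small bands are hypothetical data.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Hypothetical 1-bit bands over six pixels (not from the slides).
B1 = [0, 0, 1, 1, 0, 1]
B2 = [0, 1, 1, 1, 0, 0]
B3 = [1, 1, 0, 1, 1, 0]


def value_mask(band: Sequence[int], value: int) -> int:
    """Bitmask with bit i set where band[i] == value (stand-in for P_{b,value})."""
    mask = 0
    for i, v in enumerate(band):
        if v == value:
            mask |= 1 << i
    return mask


def t_cube(bands: List[Sequence[int]],
           values: Sequence[int]) -> Dict[Tuple[int, ...], int]:
    """Count, for every value combination, the pixels matching it in all bands."""
    masks = [{v: value_mask(band, v) for v in values} for band in bands]
    n_pixels = len(bands[0])
    cube: Dict[Tuple[int, ...], int] = {}
    for cell in product(values, repeat=len(bands)):
        combined = (1 << n_pixels) - 1  # start with every pixel selected
        for b, v in enumerate(cell):
            combined &= masks[b][v]     # AND in the mask for band b, value v
        cube[cell] = bin(combined).count("1")  # the "root count"
    return cube


if __name__ == "__main__":
    for cell, n in sorted(t_cube([B1, B2, B3], values=[0, 1]).items()):
        print(cell, n)
```

Every pixel falls into exactly one cell, so the cell counts always sum to the number of pixels.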

  19. High confidence Association Rules • Assume minimum confidence threshold 80%, minimum support threshold 10% • Start with 1-bit values and 2 bands, B1 and B2 • [Figure: 2x2 T-cube with marginal sums and 80% confidence thresholds]
                     B1=0 (1,0)   B1=1 (1,1)   sums   thresholds
      B2=0 (2,0)         25           15        40       32
      B2=1 (2,1)          5           19        24       19.2
      sums               30           34
      thresholds         24           27.2
      • C: B1={0} => B2={0}, c = 25/30 = 83.3%
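The rule on this slide can be checked directly from the T-cube counts. In the sketch below, the assignment of the four counts (25, 15, 5, 19) to cells is inferred from the slide's sums and thresholds and should be treated as an assumption; only rules of the form B1 => B2 are tested, and the function names are illustrative.

```python
from typing import Dict, List, Tuple

# T-cube counts keyed by (B1 value, B2 value); layout inferred from the slide.
T_CUBE: Dict[Tuple[int, int], int] = {
    (0, 0): 25,
    (1, 0): 15,
    (0, 1): 5,
    (1, 1): 19,
}


def rule_stats(cube: Dict[Tuple[int, int], int],
               b1_value: int, b2_value: int) -> Tuple[float, float]:
    """Support and confidence of the rule B1={b1_value} => B2={b2_value}."""
    total = sum(cube.values())
    both = cube[(b1_value, b2_value)]
    antecedent = sum(n for (v1, _), n in cube.items() if v1 == b1_value)
    return both / total, both / antecedent


def confident_rules(cube: Dict[Tuple[int, int], int],
                    min_support: float = 0.10,
                    min_confidence: float = 0.80
                    ) -> List[Tuple[int, int, float, float]]:
    """All B1 => B2 rules that meet both thresholds."""
    rules = []
    for b1_value, b2_value in cube:
        s, c = rule_stats(cube, b1_value, b2_value)
        if s >= min_support and c >= min_confidence:
            rules.append((b1_value, b2_value, s, c))
    return rules


if __name__ == "__main__":
    for b1, b2, s, c in confident_rules(T_CUBE):
        print(f"B1={{{b1}}} => B2={{{b2}}}  support={s:.3f}  confidence={c:.3f}")
```

The threshold row in the figure serves the same purpose as the confidence test here: a cell passes when its count is at least 80% of the corresponding antecedent sum, for example 25 >= 24 for B1={0}.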

  20. The End • Thank you |:~)
