Data Mining Motivation: “Necessity is the Mother of Invention” • Automated data collection tools and mature database technology have led to tremendous amounts of stored data. • We are drowning in data, but starving for knowledge! • Solution: Data mining • Extract interesting rules, patterns, constraints • (reduce volume, raise information/knowledge levels)
What Is Data Mining? • Data mining: • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases • Alternative names: • Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data prospecting, data archeology, data dredging, information harvesting, business intelligence, etc. • What is not data mining? • (Deductive) query processing.
Applications • Database analysis and decision support • Market analysis and management • Target marketing, customer relationship management, market basket analysis, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and management • Other Applications • Text mining (newsgroups, email, documents) and Web analysis • Intelligent query answering
More Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat • Astronomy • 22 quasars discovered with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs to discover customer preferences and behaviors, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Data Mining: A KDD Process • Data mining: the core of the knowledge discovery process. • [Figure: KDD pipeline. Databases → Data Cleaning/Integration (missing data, outliers, noise, errors) → Data Warehouse → Selection (feature extraction, attribute selection) → Task-relevant Data → Data Mining (Classification, Clustering, ARM) → Pattern Evaluation → Knowledge]
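Association Rule Mining: The “Walmart” Example • Rule: {Diaper, Milk} => Beer • $\text{Support} = \frac{\operatorname{count}(\{\text{Diaper, Milk, Beer}\})}{|D|} = 0.4$ • $\text{Confidence} = \frac{\operatorname{count}(\{\text{Diaper, Milk, Beer}\})}{\operatorname{count}(\{\text{Diaper, Milk}\})} = 0.66$ • (count(X) is the number of transactions in the database D that contain every item of X)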
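The support and confidence formulas can be checked with a small sketch. The five transactions below are illustrative toy data, not taken from the slides; they are an assumption chosen so that the rule {Diaper, Milk} => Beer gives support 2/5 = 0.4 and confidence 2/3 (which the slide reports as 0.66). The helper names `count`, `support`, and `confidence` are hypothetical.

```python
# Toy transactions (an assumption, not data from the slides), chosen so the
# rule {Diaper, Milk} => Beer reproduces support 0.4 and confidence 2/3.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def support(antecedent, consequent, db):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent, db) / len(db)


def confidence(antecedent, consequent, db):
    """Fraction of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent, db) / count(antecedent, db)


antecedent, consequent = {"Diaper", "Milk"}, {"Beer"}
rule_support = support(antecedent, consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Support is measured against the whole database, while confidence is measured only against the transactions that satisfy the antecedent, which is why the two denominators differ.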
Precision Ag example: Find image antecedents that imply high yield • [Figure: TIFF image and yield map] • High Green reflectance => High Yield (obvious) • High (NearInfraRed - Red) => High Yield (higher confidence)
Grasshopper Infestation Prediction • Grasshoppers caused significant economic loss last year. • These insects are likely to return this year. • Early prediction of the infestation is a key step in decreasing damage. • Association rule mining on remotely sensed imagery holds significant potential for early detection. • How do we identify the signature of an initial infestation from RGB bands?
Gene Regulation Pathway Discovery example • Results of clustering may indicate that nine genes are involved in a pathway. • High-confidence rule mining on that cluster will discover the relationships among the genes in which the expression of one gene (e.g., Gene2) is regulated by others. Other genes (e.g., Gene4 and Gene7) may not be directly involved in regulating Gene2 and can therefore be excluded. • [Figure: Gene1 through Gene9 → Clustering → ARM → Gene1, Gene3, Gene5, Gene6, Gene8, and Gene9 shown with Gene2; Gene4 and Gene7 set apart]
Data Mining: Confluence of Multiple Disciplines • Database Technology • Statistics • Machine Learning • Visualization • Information Science • Other Disciplines
Spatial Data Formats (Cont.) • BAND-1 • 254 127 • (1111 1110) (0111 1111) • 14 193 • (0000 1110) (1100 0001) • BAND-2 • 37 240 • (0010 0101) (1111 0000) • 200 19 • (1100 1000) (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19
Spatial Data Formats (Cont.) • BAND-1 • 254 127 • (1111 1110) (0111 1111) • 14 193 • (0000 1110) (1100 0001) • BAND-2 • 37 240 • (0010 0101) (1111 0000) • 200 19 • (1100 1000) (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19 • BIL format (1 file) • 254 127 37 240 14 193 200 19
Spatial Data Formats (Cont.) • BAND-1 • 254 127 • (1111 1110) (0111 1111) • 14 193 • (0000 1110) (1100 0001) • BAND-2 • 37 240 • (0010 0101) (1111 0000) • 200 19 • (1100 1000) (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19 • BIL format (1 file) • 254 127 37 240 14 193 200 19 • BIP format (1 file) • 254 37 127 240 14 200 193 19
Spatial Data Formats (Cont.) • BAND-1 • 254 127 • (1111 1110) (0111 1111) • 14 193 • (0000 1110) (1100 0001) • BAND-2 • 37 240 • (0010 0101) (1111 0000) • 200 19 • (1100 1000) (0001 0011) • BSQ format (2 files) • Band 1: 254 127 14 193 • Band 2: 37 240 200 19 • BIL format (1 file) • 254 127 37 240 14 193 200 19 • BIP format (1 file) • 254 37 127 240 14 200 193 19
bSQ format (16 files; file Bij holds bit j of band i, and each row below is one pixel in raster order)
B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0   0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0   1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1   0   0   0   1   0   0   1   1
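The four layouts can be derived from the same 2 × 2, two-band image. The following is a minimal sketch that keeps everything in Python lists rather than files; the variable names (`image`, `bsq`, `bil`, `bip`, `bsq_bits`) are assumptions made for illustration.

```python
# The 2x2, two-band example image from the slides: image[band][row][col].
image = [
    [[254, 127], [14, 193]],   # Band 1
    [[37, 240], [200, 19]],    # Band 2
]
n_bands, n_rows, n_cols = len(image), len(image[0]), len(image[0][0])

# BSQ (band sequential): one file per band, pixels in raster order.
bsq = [[v for row in band for v in row] for band in image]

# BIL (band interleaved by line): for each row, that row from every band.
bil = [v for r in range(n_rows) for band in image for v in band[r]]

# BIP (band interleaved by pixel): for each pixel, its value in every band.
bip = [image[b][r][c]
       for r in range(n_rows) for c in range(n_cols) for b in range(n_bands)]

# bSQ (bit sequential): one file per (band, bit position), most significant
# bit first, so Bij holds bit j of band i for every pixel in raster order.
bsq_bits = {
    f"B{b + 1}{j + 1}": [(v >> (7 - j)) & 1 for v in bsq[b]]
    for b in range(n_bands) for j in range(8)
}

print(bil)
print(bip)
print(bsq_bits["B11"], bsq_bits["B28"])
```

Each bSQ file is one column of the 16-column bit table: B11, for example, collects the most significant bit of Band 1 across the four pixels.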
Peano Count Tree (P-tree) • P-trees are a lossless, compressed representation of data in a recursive, quadrant-oriented form. • NDSU holds patents on P-tree technology
An example of P-tree • Peano or Z-ordering • quadrant • Root Count
8 × 8 bit matrix:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
P-tree counts (quadrants in Peano order at each level):
Root count: 55
Level 1: 16 8 15 16
Level 2: 3 0 4 1 (under 8); 4 4 3 4 (under 15)
Leaf bits: 1 1 1 0 (count 3 under 8); 0 0 1 0 (count 1 under 8); 1 1 0 1 (count 3 under 15)
An example of P-tree • Pure (Pure-1/Pure-0) quadrant • Root Count • Level • Fan-out • QID (Quadrant ID)
[Figure: the same 8 × 8 bit matrix as the previous slide, with quadrants numbered 0 1 2 3 in Peano order]
P-tree counts: root count 55; level 1: 16 8 15 16; level 2: 3 0 4 1 (under 8) and 4 4 3 4 (under 15); leaf bits: 1 1 1 0, 0 0 1 0, 1 1 0 1
QID example: pixel (7, 1) = (111, 001) in binary has QID 2.2.3 = 10.10.11
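The counts and the QID in this example can be reproduced with a short recursive sketch. This is not the patented P-tree implementation; it is a plain quadtree-of-counts under stated assumptions: nodes are `(count, children)` tuples, pure quadrants are not subdivided, and the first coordinate of a pixel is taken to be its row. The names `build_ptree` and `qid` are hypothetical.

```python
# The 8x8 bit matrix from the slide (root count 55).
BITS = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(bits, row=0, col=0, size=None):
    """Return (count, children); children is [] for pure quadrants and pixels."""
    if size is None:
        size = len(bits)
    if size == 1:
        return (bits[row][col], [])
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    children = [
        build_ptree(bits, row + dr, col + dc, half)
        for dr in (0, half) for dc in (0, half)
    ]
    total = sum(child[0] for child in children)
    if total == 0 or total == size * size:
        return (total, [])   # pure-0 or pure-1 quadrant: stop subdividing
    return (total, children)


def qid(row, col, levels):
    """Quadrant ID of a pixel: one digit (0-3) per level, from the root down."""
    digits = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        digits.append(str(2 * row_bit + col_bit))
    return ".".join(digits)


root_count, quadrants = build_ptree(BITS)
print(root_count, [q[0] for q in quadrants])
print(qid(7, 1, levels=3))
```

Each QID digit interleaves one row bit with one column bit, which is why (111, 001) reads off as 10.10.11.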
Tuple Count Cube (T-cube) • The $(v_1, v_2, v_3)$th cell of the T-cube contains the Root Count of $P_{(v_1,v_2,v_3)} = P_{1,v_1} \text{ AND } P_{2,v_2} \text{ AND } P_{3,v_3}$
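A rough sketch of how a T-cube cell is filled follows. As a simplification, each value P-tree is replaced by a flat integer bitmask (bit k set when pixel k has that value), so the AND of P-trees becomes a bitwise AND and the root count becomes a count of 1 bits. The three 1-bit bands and the names `value_mask`, `masks`, and `tcube` are hypothetical and not taken from the slides.

```python
from itertools import product

# Hypothetical 1-bit values for 3 bands over 8 pixels (not from the slides).
bands = [
    [0, 0, 1, 1, 0, 1, 1, 0],   # band 1
    [0, 1, 1, 1, 0, 0, 1, 0],   # band 2
    [1, 1, 0, 1, 0, 0, 1, 0],   # band 3
]
values = (0, 1)


def value_mask(band, v):
    """Bitmask standing in for the value P-tree: bit k is set when pixel k == v."""
    mask = 0
    for k, pixel in enumerate(band):
        if pixel == v:
            mask |= 1 << k
    return mask


masks = [{v: value_mask(band, v) for v in values} for band in bands]

# Cell (v1, v2, v3) = number of 1 bits in P1,v1 AND P2,v2 AND P3,v3.
tcube = {}
for combo in product(values, repeat=len(bands)):
    anded = masks[0][combo[0]]
    for i in range(1, len(bands)):
        anded &= masks[i][combo[i]]
    tcube[combo] = bin(anded).count("1")

print(tcube[(0, 0, 0)], sum(tcube.values()))
```

Because every pixel falls into exactly one cell, the cell counts sum to the number of pixels.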
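High confidence Association Rules • Assume minimum confidence threshold 80%, • minimum support threshold 10% • Start with 1-bit values and 2 bands, B1 and B2
T-cube for B1 and B2 (label i,v means band i has value v; each threshold is 80% of the corresponding sum):
             1,0    1,1    sums   thresholds
2,0          25     15     40     32
2,1           5     19     24     19.2
sums         30     34
thresholds   24     27.2
C: B1={0} => B2={0}, c = 25/30 = 83.3%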
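The threshold test on this slide can be sketched directly from the four cell counts (25, 15, 5, 19). The dictionary layout `tcube[(b1, b2)]` and the names `MIN_CONF`, `MIN_SUPP`, and `rules` are assumptions made for illustration; the sketch only covers the 1-bit, 2-band starting case.

```python
# 2x2 T-cube from the slide: tcube[(b1, b2)] = pixel count.
tcube = {(0, 0): 25, (1, 0): 15, (0, 1): 5, (1, 1): 19}
MIN_CONF = 0.80
MIN_SUPP = 0.10

total = sum(tcube.values())
rules = []
for (b1, b2), count in tcube.items():
    if count / total < MIN_SUPP:
        continue   # cell is too rare to support any rule
    b1_sum = sum(n for (v1, _), n in tcube.items() if v1 == b1)
    b2_sum = sum(n for (_, v2), n in tcube.items() if v2 == b2)
    # A rule holds when the cell count reaches MIN_CONF times the antecedent's sum.
    if count >= MIN_CONF * b1_sum:
        rules.append((f"B1={{{b1}}} => B2={{{b2}}}", count / b1_sum))
    if count >= MIN_CONF * b2_sum:
        rules.append((f"B2={{{b2}}} => B1={{{b1}}}", count / b2_sum))

for rule, conf in rules:
    print(f"{rule}  c = {conf:.1%}")
```

Comparing each count against a precomputed threshold (80% of the row or column sum) avoids a division per candidate rule, which is how the sums and thresholds on the slide are used.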
The End Thank you |:~)