Data Mining I: KnowledgeSEEKER Jennifer Davis Kelly Davis Saurabh Gupta Chris Mathews Shantea Stanford
Overview of Presentation • Introduction to Data Mining Methods and Products • Tutorial: How to Use KnowledgeSEEKER? • Exercises: How much did you learn?
What is Data Mining? • Filtering large amounts of data • Searching for hidden patterns and/or trends • Predicting future results • Creating a competitive advantage and improving decision making • Data mining is a form of artificial intelligence, but is very different from other BI tools. • Discovery versus Verification
What Sparked Data Mining? • “Motivated by business need, large amounts of available data, and humans’ limited cognitive processing abilities • Enabled by data warehousing, parallel processing, and data mining algorithms” Source: Dr. Hugh Watson
Popular Data Mining Methods • Neural networks – learning from data patterns and predicting new data • Genetic Algorithms – optimizing techniques • Decision trees – rules for classifying data • Regression Analysis - statistical • K-nearest neighbor – classifying and clustering technique based on weighting of selected variables • Data Visualization – visually showing patterns
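To make one of the methods above concrete, here is a minimal sketch of k-nearest neighbor classification. The records, the field choice (age, weight), and the function name `knn_classify` are hypothetical and not taken from the presentation. The sketch uses plain unweighted Euclidean distance, so the per-variable weighting mentioned in the slide is left out.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs; query: tuple of features.
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    # Majority vote among the labels of the k closest points.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, weight in kg) -> hypertension level
train = [
    ((35, 60), "low"), ((40, 65), "low"), ((38, 70), "low"),
    ((60, 90), "high"), ((65, 95), "high"), ((58, 88), "high"),
]

print(knn_classify(train, (62, 92)))  # prints "high"
```

In practice the variables would usually be rescaled first so that no single one dominates the distance.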
Types of Data Mining • Association – identifies relationships • Sequential pattern – identifies sequencing • Classifying – identifies potential outcomes for predetermined categories • Clustering – identifies categories • Prediction – estimates future values or forecasts
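As an illustration of the association type, relationships between items are commonly scored with support and confidence. These two measures are standard in association mining but are not named in the presentation, and the basket data and function names below are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the share that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 2 of 4 baskets -> 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```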
Data Mining Process • “Requires personnel with domain, data warehousing, and data mining expertise • Requires data selection, data extraction, data cleansing, and data transformation • Most data mining tools work with highly granular flat files • Is an iterative and interactive process” Source: Dr. Hugh Watson
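The selection, cleansing, and transformation steps above can be sketched as a small function that turns raw records into a flat table. This is only an illustration under assumptions: the field names, the sample records, and the rule of dropping incomplete rows are hypothetical, and extraction from a source system is not shown.

```python
def prepare(records, fields):
    """Return a flat list of rows ready for mining."""
    prepared = []
    for rec in records:
        # Selection: keep only the fields of interest.
        row = {f: rec.get(f) for f in fields}
        # Transformation: normalize text values to trimmed lowercase.
        row = {f: v.strip().lower() if isinstance(v, str) else v
               for f, v in row.items()}
        # Cleansing: skip rows with missing or empty values.
        if any(v is None or v == "" for v in row.values()):
            continue
        prepared.append(row)
    return prepared

# Hypothetical raw records
raw = [
    {"id": 1, "age": 45, "smoker": " Regular "},
    {"id": 2, "age": None, "smoker": "never"},
    {"id": 3, "age": 60, "smoker": ""},
    {"id": 4, "age": 52, "smoker": "Occasional"},
]

flat = prepare(raw, ["age", "smoker"])
print(flat)
```

Because the process is iterative, a function like this would typically be revised and rerun as the analysis uncovers new data quality problems.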
How Is Data Mining Used? • CRM: Research, churn and promotional management. • Process Mgmt: Reduce operational delays. • Analysis: Develop forecasting models and fraud prevention. • Predictive Capabilities: Develop rules for queries or expert systems and oil exploration. • Health Care: Medical research and trends. • Banking: Identify bank locations. • Sports: Guide movement of players.
Data Mining Products • See product list, http://www.xore.com/prodtable.html • According to Jackie Sweeney, International Data Corporation, “Data mining has matured, producing fortunes for the Big Three vendors - SPSS, IBM and SAS Institute - and robust revenues for a number of smaller vendors who market solutions tailored to vertical markets.”
Data Mining Products • Off-the-shelf applications and bundling are becoming more common. • Wide range of pricing • SAS Institute’s Enterprise Miner ~ $80k • IBM Intelligent Miner ~ $60k • Angoss KnowledgeSEEKER = $4,750 per license, including upgrades and unlimited tech support for 1 year. Annual license renewal fees are 20% of the list price. • Desktop products start at a few hundred dollars
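The KnowledgeSEEKER pricing above implies a renewal fee of 20% of $4,750, or $950 per year. The sketch below works out the cost of one license over several years. It assumes the first year is covered by the list price and that renewals begin in year 2, which the slide suggests but does not state outright. The function name `total_cost` is hypothetical.

```python
LIST_PRICE = 4750      # dollars per license, from the slide
RENEWAL_PERCENT = 20   # annual renewal fee as a percent of list price

def total_cost(years):
    """Cost of one license held for `years` years (assumes renewals start in year 2)."""
    if years < 1:
        raise ValueError("years must be at least 1")
    renewal_fee = LIST_PRICE * RENEWAL_PERCENT // 100  # 950
    return LIST_PRICE + (years - 1) * renewal_fee

print(total_cost(1))  # 4750
print(total_cost(3))  # 4750 + 2 * 950 = 6650
```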
Selection Process – Questions to Ask • Are the data and variables currently available? • Will mining involve numerical and nominal data? • Can the tool build models, predict outcomes and verify results? • Can it process the amount of data required? • Can the tool handle incomplete data? • Can the tool process noisy data? • Can it provide the degree of granularity desired? • How much technical knowledge is required?
KnowledgeSEEKER by Angoss • Angoss Software Corp = Canadian public company specializing in data mining solutions • Decision tree modeling • Fully scalable and easy to use • Specifications • Operating Systems: Unix, Windows 3.1, 95, 98 and NT. • Databases: Access, dBase II, III and IV, ODBC, SAS, SPSS.
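Decision tree modeling repeatedly picks the variable that best separates the dependent variable and splits the data on it. The sketch below shows one such step, assuming Gini impurity as the split criterion. That is a common choice, not necessarily the one KnowledgeSEEKER uses, and the records and function names are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(rows, feature, target):
    """Reduction in impurity from splitting `rows` on `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    # Impurity of the child groups, weighted by group size.
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([row[target] for row in rows]) - weighted

def best_split(rows, features, target):
    """Feature whose split gives the largest impurity reduction."""
    return max(features, key=lambda f: split_gain(rows, f, target))

# Hypothetical records
rows = [
    {"smoker": "regular", "gender": "m", "hypertension": "high"},
    {"smoker": "regular", "gender": "f", "hypertension": "high"},
    {"smoker": "never", "gender": "m", "hypertension": "low"},
    {"smoker": "never", "gender": "f", "hypertension": "low"},
]

print(best_split(rows, ["smoker", "gender"], "hypertension"))  # prints "smoker"
```

Growing a full tree would apply the same step recursively to each resulting group.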
Users of KnowledgeSEEKER • IRS – fraud detection • University of Rochester – Cancer research • Hewlett Packard – process and quality control • Readers’ Digest – market segmentation • MGM Grand – survey analysis
Sources • Angoss Whitepaper: http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html • “Data Mining for Golden Opportunities”, Smart Computing, January 2000 • “Your Business Intelligence Arsenal”, Telephony, Chicago, Apr 24, 2000, Douglas Hackney • Examples and testimonials: http://www.data-mining-software.com/data_mining_examples.htm • Data Management, Richard T. Watson, 2002 • http://www.xore.com/prodtable.html (Data Mining Products) • Dr. Hugh Watson’s slide • “Data Mining Gets Real”, Enterprise Systems Journal, April 1999, Jon William Toigo • http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm (examples of Data Mining uses)
KnowledgeSEEKER Exercises • According to KnowledgeSeeker, which is the most important variable influencing hypertension for those between the ages of 51-62 who are “regular” or “occasional” smokers? Answer - Cheese Last Week
KnowledgeSEEKER Exercises • What is the total number of 51-62 year olds who have identified themselves as “former/never smokers” and have an eating pattern that includes “a lot/moderate salt?” Answer – 32
KnowledgeSEEKER Exercises • What percent of women between the ages of 32-50 who occasionally drink have high hypertension? Answer - 28.6%
KnowledgeSEEKER Exercises • What percent of people in income groups 4, 5, 7, and 8 and the 32-50 age bracket have high hypertension? Answer - 11.8%
KnowledgeSEEKER Exercises • In the sample data, how many people have never smoked before? Answer - 94
KnowledgeSEEKER Exercises • What is the most important factor contributing to hypertension according to KnowledgeSeeker for those in the 51-62 age bracket? Answer - Smoking. Next, by right-clicking and selecting “Go to Split”, find the 4th most important factor from the table. Answer - Deep fried last week
KnowledgeSEEKER Exercises • What is the percentage of males who are “regular” smokers among all male participants? Answer - 30.8%
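Percentages like the one in this exercise are a count within a filtered subgroup divided by the size of that subgroup. The sketch below shows the arithmetic on a handful of hypothetical records. It does not use the tutorial's sample data, so it does not reproduce the exercise answers, and the function name is an assumption.

```python
def percent_in_category(records, filters, field, value):
    """Percent of records matching `filters` whose `field` equals `value`."""
    # Keep only the records that satisfy every filter condition.
    subset = [r for r in records if all(r[k] == v for k, v in filters.items())]
    if not subset:
        return None  # no records in the subgroup, so the percentage is undefined
    matches = sum(1 for r in subset if r[field] == value)
    return round(100 * matches / len(subset), 1)

# Hypothetical participants
people = [
    {"gender": "m", "smoker": "regular"},
    {"gender": "m", "smoker": "never"},
    {"gender": "m", "smoker": "occasional"},
    {"gender": "m", "smoker": "never"},
    {"gender": "f", "smoker": "regular"},
]

# 1 regular smoker among 4 males
print(percent_in_category(people, {"gender": "m"}, "smoker", "regular"))  # 25.0
```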
KnowledgeSEEKER Exercises • Create a graph of the distribution of smoking males.
KnowledgeSEEKER Exercises • Complete the following steps: set the dependent variable to Hypertension, then click on Grow / Automatic. What is the total number of males between the ages of 63-72 who had fish last week? Answer – 24
KnowledgeSEEKER Exercises • What is the next split after age that has the highest effect on hypertension according to KnowledgeSeeker? Answer - Height
KnowledgeSEEKER Exercises • Among 32-50 year olds who report a drink pattern of former/never, how many have high hypertension? Answer - 0
KnowledgeSEEKER Exercises • According to KnowledgeSeeker, what is the most important variable influencing hypertension for women between the ages of 51-62? How is this different from men aged 51-62? Answer - Women: weight; Men: drinking pattern