Data mining in Wikipedia 2011-09-26 Sinchoo Kim
Terms • Data • Unorganized and unprocessed facts • Information • Data that have been processed to be useful • Provides answers to "who", "what", "where", and "when" questions • Knowledge • Application of data and information • Answers "how" questions
KDD • KDD (Knowledge Discovery in Databases) • Describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data • Selection • Preprocessing • Transformation • Data Mining • Interpretation/Evaluation
Data mining • Definition • The analysis step of the Knowledge Discovery in Databases process • Discovering previously unknown patterns • Example • Home equity loan
Case : Home equity loan • Select the subset of records for customers who have received a home equity loan offer
Case : Home equity loan • Find rules to predict whether or not a customer would respond to a home equity loan offer • IF (Salary < 40k) and (numChildren > 0) and (ageChild1 > 18 and ageChild1 < 22) THEN YES
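The IF-THEN rule on this slide can be written as a small predicate. This is only a sketch: the function name `likely_responder`, the reading of "40k" as 40,000, and the three customer records are assumptions made for illustration and are not part of the original case.

```python
def likely_responder(salary, num_children, age_child1):
    """Return True when the slide's IF-THEN rule predicts a YES response."""
    return (
        salary < 40_000            # Salary < 40k
        and num_children > 0       # numChildren > 0
        and 18 < age_child1 < 22   # ageChild1 > 18 and ageChild1 < 22
    )


# Hypothetical customer records, invented for illustration only.
customers = [
    {"salary": 35_000, "num_children": 2, "age_child1": 19},
    {"salary": 35_000, "num_children": 0, "age_child1": 0},
    {"salary": 80_000, "num_children": 1, "age_child1": 20},
]

# Apply the rule to every record.
predictions = [likely_responder(**customer) for customer in customers]
print(predictions)
```

Only the first record satisfies all three conditions, so it is the only one the rule marks as a likely respondent.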
Case : Home equity loan • Group customers into clusters and investigate clusters • [Figure: customers grouped into four clusters, Group 1 to Group 4]
Case : Home equity loan • Evaluate results • Many “uninteresting” clusters • One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents
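One simple way to carry out this evaluation is to compare each cluster's response rate with the overall rate. The sketch below assumes cluster labels have already been assigned; the records, the function name `response_rates`, and the "twice the overall rate" threshold are all hypothetical choices, not taken from the original case.

```python
from collections import defaultdict

# Hypothetical (cluster_id, responded) pairs, invented for illustration.
records = [
    (1, False), (1, False), (1, True), (1, False),
    (2, False), (2, False), (2, False), (2, False),
    (3, True), (3, True), (3, False), (3, True),
    (4, False), (4, True), (4, False), (4, False),
]


def response_rates(rows):
    """Return the fraction of respondents in each cluster."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [respondents, total]
    for cluster, responded in rows:
        counts[cluster][0] += int(responded)
        counts[cluster][1] += 1
    return {cluster: hits / total for cluster, (hits, total) in counts.items()}


rates = response_rates(records)
overall = sum(responded for _, responded in records) / len(records)

# Assumed threshold: a cluster is "interesting" if its response rate
# is at least twice the overall response rate.
interesting = [c for c, rate in sorted(rates.items()) if rate >= 2 * overall]
print(rates, overall, interesting)
```

With this toy data only cluster 3 clears the threshold, mirroring the slide's single interesting cluster among several uninteresting ones.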
Common classes of tasks • Association rule learning • Searches for relationships between variables • Clustering • Discovers groups and structures in the data that are in some way similar • Anomaly detection • Identifies unusual data records
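As an illustration of association rule learning, the sketch below computes support and confidence, two measures commonly used to score a candidate rule. The shopping-basket data and the function names are assumptions made for this example; the slides do not name a specific algorithm or data set.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Score the candidate rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk appear together in two of four baskets, and two of the three baskets containing bread also contain milk.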
Common classes of tasks • Classification • Generalizes known structure to apply to new data • Regression • Finds a function which models the data with the least error • Summarization • Provides a more compact representation of the data set
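For the regression task, "least error" is often taken to mean least squared error. The sketch below fits a straight line by ordinary least squares under that assumption; the function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the sample points lie on a line, the fitted slope and intercept recover that line; with noisy data the same formulas give the line with the smallest sum of squared residuals.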
Notable uses • Business • Customer management • Marketing • Identifying purchase patterns • Human resources • Identifying the characteristics of the most successful employees • Decision-making support • Integrated-circuit production lines
Notable uses • Science and engineering • Human genetics • Relation between genetics and diseases • Electrical power engineering • Detecting abnormal conditions • Estimating the nature of the abnormalities
Notable uses • Visual Data Mining • Large data sets have been generated, collected, and stored • Finds trends and information hidden in those data sets
Issues • Need for a reliable data set • Overfitting • Finding patterns in the training set which are not present in the general data set
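Overfitting can be seen by comparing a model that memorizes its training set with a simpler one. In the sketch below, a nearest-neighbor predictor reproduces every training label, including two deliberately noisy ones, while a threshold rule ignores them. The data, the cutoff of 5, and all function names are assumptions chosen for illustration.

```python
# Hypothetical 1-D data: the underlying pattern is "label 1 when x >= 5".
# The training points at x=3 and x=7 carry deliberately noisy labels.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
test = [(1.4, 0), (2.6, 0), (3.4, 0), (6.6, 1), (7.4, 1), (8.6, 1)]


def memorizer(x):
    """Predict the label of the closest training point (1-nearest neighbor)."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def threshold_rule(x, cutoff=5):
    """Predict 1 when x is at or above the assumed cutoff."""
    return int(x >= cutoff)


def accuracy(predict, data):
    """Fraction of points whose predicted label matches the given label."""
    return sum(predict(x) == label for x, label in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorizer scores perfectly on the training set but worse than the threshold rule on the test set, because it has learned the noisy labels as if they were real patterns.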
Issues • Privacy concerns and ethics • The term "data mining" itself has no ethical implications • Compiled data can allow anyone who has access to the newly compiled data set to identify specific individuals, especially when the data were originally anonymous
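The re-identification risk can be sketched as a join between an "anonymous" data set and a public one that share a few attributes. Everything below is hypothetical: the field names, the choice of ZIP code, birth year, and sex as linking attributes, and all records were invented for illustration.

```python
# Hypothetical "anonymous" records: names removed, other attributes kept.
anonymous_records = [
    {"zip": "12345", "birth_year": 1970, "sex": "F", "diagnosis": "A"},
    {"zip": "12345", "birth_year": 1985, "sex": "M", "diagnosis": "B"},
    {"zip": "67890", "birth_year": 1990, "sex": "F", "diagnosis": "C"},
]

# Hypothetical public records that include names.
public_records = [
    {"name": "Person 1", "zip": "12345", "birth_year": 1970, "sex": "F"},
    {"name": "Person 2", "zip": "12345", "birth_year": 1985, "sex": "M"},
    {"name": "Person 3", "zip": "12345", "birth_year": 1985, "sex": "M"},
]

# Attributes assumed to appear in both data sets.
KEYS = ("zip", "birth_year", "sex")


def link(anonymous, public):
    """Return (name, diagnosis) pairs where the shared attributes match exactly one person."""
    index = {}
    for row in public:
        index.setdefault(tuple(row[k] for k in KEYS), []).append(row["name"])
    linked = []
    for row in anonymous:
        names = index.get(tuple(row[k] for k in KEYS), [])
        if len(names) == 1:  # a unique match identifies the individual
            linked.append((names[0], row["diagnosis"]))
    return linked


reidentified = link(anonymous_records, public_records)
print(reidentified)
```

In this toy data one record is linked to a single named person, one stays ambiguous because two people share the same attributes, and one has no match at all.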