Data Mining CSCI 307, Spring 2019 Lecture 1

Lecture 1
• Course Introduction
• Introduction to Data Mining
• Bonferroni’s Principle

Course Description
Data Mining is the discovery of useful, possibly unexpected, patterns in data. We will learn techniques for discovery, interpretation, and visualization of patterns in large collections of data.

Course Description
Topics include:
• Classification
• Rule-based learning
• Decision Trees
• Association rules
• Data Visualization
Theoretical Basis (we will not cover in depth):
• AI
• Machine Learning
• Statistical Analysis
• Etc.

Course Webpage
Information about the course can be found at the course webpage: http://mathcs.holycross.edu/~csci307
This site contains:
• Basic information about the course
• The course schedule (approximate)
• Lecture notes
• Assignments

Project, in brief
Work with your study group:
• Choose a data mining problem to solve.
• Find a dataset, convert it, and clean it.
• Choose a model and several algorithms.
• Explain the testing and results.
• Write a report and create a presentation.

Why Data Mining?
• Explosive data growth: from terabytes to petabytes
  • Data collection and data availability: automated data collection tools, database systems, Web, computerized society
  • Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: Remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.

What is Data Mining?
• Discovery of useful, possibly unexpected, patterns in data.
• Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
• Subsidiary issues:
  • Data cleaning: detection of bogus data, e.g., age = 150.
  • Visualization: something better than multiple megabyte files of output.
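The data-cleaning point can be illustrated with a small sketch. The records, the field names, and the cutoff MAX_PLAUSIBLE_AGE are all hypothetical choices made for illustration; they are not from the lecture.

```python
# Hypothetical records; the second one carries the bogus age from the slide.
records = [
    {"name": "A", "age": 34},
    {"name": "B", "age": 150},
    {"name": "C", "age": 61},
    {"name": "D", "age": -3},
]

# Assumed cutoff: ages outside 0..120 are treated as data-entry errors.
MAX_PLAUSIBLE_AGE = 120


def is_plausible_age(age):
    """Return True if age is a number within the assumed plausible range."""
    return isinstance(age, (int, float)) and 0 <= age <= MAX_PLAUSIBLE_AGE


# Split the data into rows to keep and rows to inspect by hand.
clean = [r for r in records if is_plausible_age(r["age"])]
bogus = [r for r in records if not is_plausible_age(r["age"])]

print("kept:", [r["name"] for r in clean])
print("flagged:", [(r["name"], r["age"]) for r in bogus])
```

Flagging rather than silently dropping the suspicious rows keeps the decision about what counts as "bogus" visible to the analyst.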

What is Data Mining?
• Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing → Pattern / Information / Knowledge]
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities.

Uses of Data Mining
1) Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
2) Association and correlation analysis, e.g., what kinds of items are purchased together at Walmart?
3) Classification, e.g., classify countries based on climate, or classify cars based on gas mileage.

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Visualization, Applications, Algorithm, Database Technology, and High-Performance Computing]
Why?
• Tremendous amount of data
• High complexity of data
• New and sophisticated applications

Data Mining: Hybrid of Areas
• Databases: Structured Query Language, Relational Data, Scalability
• Algorithms: Design, Complexity, Data Structures
• Machine Learning: Decision Trees, Neural Networks, Genetic Algorithms, etc.
• Statistics: Bayes' Theorem, Regression, Time Series Analysis

Social Implications
• Privacy: Advertising is targeted at you; you are no longer "lost in the numbers" (Might we recommend…).
• Profiling: Credit card risk, terrorist risk, etc.

Meaningfulness of Answers
• A big data-mining risk is that you will "discover" patterns that are meaningless.
• Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find meaningless results.

Examples of Bonferroni’s Principle
• A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
• The Rhine Paradox: A great example of how not to conduct scientific research.

The "TIA" Story
Example from Professor Jeff Ullman, Stanford University
• Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
• We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

The Details
• 10^9 people being tracked.*
• 1000 days.
• Each person stays in a hotel 1% of the time (10 days out of 1000).
• Hotels hold 100 people (so 10^5 hotels are needed for the 10^7 people in hotels on any given day).
• If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
* This is less than the number of people in China (or India for that matter).

Ullman Calculations – (1)
• Probability that given persons p and q will be at the same hotel on given day d: 1/100 × 1/100 × 10^-5 = 10^-9
• Probability that p and q will be at the same hotel on given days d1 and d2: 10^-9 × 10^-9 = 10^-18
• Pairs of days: (1000 choose 2) ≈ 5 × 10^5

Ullman Calculations – (2)
• Probability that p and q will be at the same hotel on some two days: 5 × 10^5 × 10^-18 = 5 × 10^-13
• Pairs of people: (10^9 choose 2) ≈ 5 × 10^17
• Expected number of "suspicious" pairs of people: 5 × 10^17 × 5 × 10^-13 = 250,000
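The expected count can be checked numerically. The sketch below assumes the figures on "The Details" slide are read as 10^9 people and 10^5 hotels, and it approximates the chance of sharing a hotel on some two days by multiplying the per-pair-of-days probability by the number of pairs of days. The variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# Figures from "The Details" slide, read as powers of ten (assumed reading).
n_people = 10**9
n_days = 1000
p_in_hotel = Fraction(1, 100)   # each person is in a hotel 1% of the time
n_hotels = 10**5

# Both p and q are in hotels on day d, and q picks the same hotel as p.
p_same_hotel_one_day = p_in_hotel * p_in_hotel * Fraction(1, n_hotels)

# The same coincidence on two specific days d1 and d2 (independent days).
p_same_hotel_two_given_days = p_same_hotel_one_day ** 2

# Number of ways to choose the two days, and the two people.
pairs_of_days = comb(n_days, 2)
pairs_of_people = comb(n_people, 2)

# Approximation: sum the probability over all pairs of days.
p_same_hotel_some_two_days = pairs_of_days * p_same_hotel_two_given_days

expected_suspicious_pairs = pairs_of_people * p_same_hotel_some_two_days

print(f"P(same hotel, one day)      = {float(p_same_hotel_one_day):.1e}")
print(f"P(same hotel, some 2 days)  = {float(p_same_hotel_some_two_days):.1e}")
print(f"Expected suspicious pairs   = {float(expected_suspicious_pairs):,.0f}")
```

With exact binomial coefficients the result comes out slightly under the rounded quarter-million figure, because (1000 choose 2) and (10^9 choose 2) are a little smaller than 5 × 10^5 and 5 × 10^17.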

Conclusion
• Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
• Analysts have to sift through a quarter million candidates to find the 10 real cases.
• Not gonna happen.
• But how can we improve the scheme?

Moral
• When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that the property does not allow so many possibilities that random data will surely produce facts "of interest."

Rhine Paradox – (1)
• Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).
• He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
• He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine Paradox – (2)
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
WHAT DID HE CONCLUDE?
MORAL:
• Understanding Bonferroni’s Principle may help you avoid similar misinterpretations of results.
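Pure guessing already accounts for the "almost 1 in 1000" figure: with two colors and 10 cards, the chance of getting all 10 right is (1/2)^10 = 1/1024. The simulation below is a rough sketch of both rounds of the experiment under that assumption. The number of subjects and the random seed are hypothetical values chosen for illustration, not numbers from the lecture.

```python
import random

N_CARDS = 10

# Chance of guessing every red/blue card correctly by luck alone.
p_all_correct = 0.5 ** N_CARDS   # 1/1024


def count_perfect_guessers(n_subjects, rng):
    """Count subjects who guess all N_CARDS cards right purely by chance."""
    perfect = 0
    for _ in range(n_subjects):
        # Each guess is right with probability 1/2, independently.
        if all(rng.random() < 0.5 for _ in range(N_CARDS)):
            perfect += 1
    return perfect


rng = random.Random(307)   # fixed seed so the run is repeatable
n_subjects = 100_000       # hypothetical number of test subjects

# Round 1: everyone guesses; the perfect scorers are told they "have ESP".
first_round = count_perfect_guessers(n_subjects, rng)

# Round 2: only the "ESP" group is retested with the same kind of test.
second_round = count_perfect_guessers(first_round, rng)

print(f"chance of a perfect score: {p_all_correct:.6f}")
print(f"perfect in round 1: {first_round} of {n_subjects}")
print(f"perfect again in round 2: {second_round} of {first_round}")
```

Because the retested group is small and each member again has only a 1/1024 chance of a perfect score, nearly all of them are expected to "lose" their ability in the second round.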
