Course on Data Mining (581550-4) • 24./26.10.: Intro/Ass. Rules • 30.10.: Episodes • 7.11.: Text Mining • 14.11.: Clustering • 21.11.: KDD Process • 28.11.: Appl./Summary • Home Exam
Accepted to Autumn 2001 Course • Arkko Jouko • Asikainen Tomi • Aunimo Lili • Hyvönen Leena • Johansson Carl • Jokinen Sakari • Kerminen Antti • Kuokkanen Ville • Lehmussaari Kari • Lehtonen Miro • Löfström Jaakko • Malinen Johanna • Mäkelä Eetu • Ojala Petri • Palin Kimmo • Pasanen Janne • Pietilä Mikko • Pitkänen Esa • Rapiokallio Maarit • Roos Teemu • Sahlberg Mauri • Saikku Arja • Sundman Jonas • Tarvainen Tero • Tiihonen Sami • Tolvanen Juha • Uusitalo Petri • Vasankari Minna • Virtanen Otso
Course Organization Lecturers Lectures CourseMaterial Exercises Contents
Course Organization • PhD Mika Klemettinen: • Email: Mika.Klemettinen@nokia.com • WWW: http://www.cs.helsinki.fi/u/mklemett/ • Room: B356 • Tel: 050-483 6661 • PhD in January 1999: • Thesis: A Knowledge Discovery Methodology for Telecommunication Network Alarm Databases • Data mining and SGML/XML related research at UH/CS (1994-2000) and at Nokia (2000-) Dr. Mika Klemettinen
Course Organization • PhD Pirjo Moen: • Email: Pirjo.Moen@cs.helsinki.fi • WWW: http://www.cs.helsinki.fi/pirjo.moen/ • Room: B350 • Tel: 191 44238 • PhD in February 2000: • Thesis: Attribute, Event Sequence, and Event Type Similarity Notions for Data Mining • Data mining related research at UH/CS (1994-) Dr. Pirjo Moen
Course Organization • RATI (A structured text database system/ Rakenteiset tekstitietokannat), 1988-91 • Data mining from telecommunication alarm data, 1994-97 • Structured and Intelligent Documents (SID), 1995-98 • From Data to Knowledge (FDK), 1995- • Knowledge worker’s workstation (TYTTI), 2000-02 • DM Group (99), DOREMI Group (00) DM/SGML/XML at UH/CS Linux was invented here!
Course Organization • Nokia is the global leader in digital communication technologies with around 60 000 employees all over the world • Nokia Research Center (NRC) has around 1 200 employees in Finland, USA, Japan, China, Germany, Hungary, UK, etc. • NRC's role is to enhance Nokia's technological competitiveness by exploring and developing new technologies • Strongly involved in many European Union and national research projects NRC in Short
Course Organization • Background: • At the University of Helsinki, Department of Computer Science: data mining methods and theory of data mining since the late 80s • Association and episode rule mining, time series similarity, analysis of telecommunication alarm data and web logs, etc. • Other members include: • Dr. Heikki Mannila (group leader) • Dr. Hannu Toivonen DM Group at NRC
Course Organization • 24.10.-30.11.2001 (12 lectures): • 7 normal lectures • 5 seminar like lectures • Wed 14-16, Fri 12-14 (A217): • Wed: normal lecture • Fri: seminar like lecture (except for 26.10.) • Lectures are obligatory: • Normal lectures: 5/7 • Seminar like lectures: 4/5 • Lists are circulated Lectures (1)
Course Organization • Lecturing language is Finnish, slides are in English: • Students can also use English • A foreign student group can be established • Normal lectures: • Basics, terminology, standard methods • Lecturer driven teaching • Seminar like lectures: • Extensions to the basic methods • Lecturer gives an introduction • Student groups give short presentations Lectures (2)
Course Organization • Group for seminar (and exercise) work: • 10 groups, à 3 persons, 2 groups/lecture • Dates are agreed at the beginning of the course • Articles are given on the previous week's Wed • Seminar presentations: • Written presentation (around 3-5 printed pages) due by the start of the seminar: • Can be either an HTML page or a printable document in PostScript/PDF format • 30 minutes of presentation • 5-15 minutes of discussion • Active participation Lectures (3)
Course Organization • Lecture slides • Original articles • Seminar presentations • Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8 • Remember to check course website and folder for the material! Course Material
Course Organization • Given by Pirjo Moen: • Email: Pirjo.Moen@cs.helsinki.fi • Room: B350 • Tel: 191 44238 • 1.11.-29.11.2001 (5 exercises) • Thu 12-14 (A318) • Exercises are obligatory: • Exercises: 4/5 • Lists are circulated • Discussion is an essential part! Exercises
Course Organization • Usually around 3-4 exercises: • 2-3 "normal" exercises (with subtasks): • Available due Thu mornings at 9 • 1 group work: • A practical exercise • Available due Thu mornings at 9 • A written report (not hand-written!) must be returned at the exercise session • Group = the seminar presentation group • Foreign students: • Return all exercises in written format to Pirjo Moen Exercises
Course Organization • The home exam is given on 28.11.2001 • Must be returned by 21.12.2001 (printed version, not hand-written, not by email) • Tentatively: • Course lectures, seminar presentations and exercises are the material for the exam • Questions contain both theoretical and practical issues • Around 4-6 smaller questions • Around 1-2 bigger questions Home Exam
Course Organization • Scale: 1-/3 … 3/3 or rejected • Grade = home exam + exercises + experiments + group presentations: • home exam: max 30 points • (4 X 5p) + (1 X 10p) • normal exercises (10): max 5 points • 2: 1p, 4: 2p, 6: 3p, 8: 4p, 10: 5p • experiments (5): max 15 points • max 3 points/experiment • group presentation: max 10 points Course Evaluation
Course Organization • Passing the course: min 30 points • home exam: min 13 points (max 30 points) • exercises/experiments: min 8 points (max 20 points) • at least 3 returned and reported experiments • group presentation: min 4 points (max 10 points) • Remember also the other requirements: • Attending the lectures (5/7) • Attending the seminars (4/5) • Attending the exercises (4/5) Course Evaluation
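The point rules on the two Course Evaluation slides can be checked with a small calculation. The following is only a sketch: the function names are hypothetical, rounding an odd number of completed exercises down to the nearest listed step is an assumption (the slide lists only even counts), and the attendance requirements are not modelled.

```python
def exercise_points(completed: int) -> int:
    """Normal exercises: 2 done = 1p, 4 = 2p, ..., 10 = 5p.

    Assumption: odd counts round down to the previous listed step.
    """
    return min(completed // 2, 5)


def course_result(home_exam, exercises_done, experiment_points, presentation):
    """Return (total_points, passed) under the point rules on the slides.

    experiment_points holds one entry (0-3 points) per returned experiment.
    Attendance requirements (lectures, seminars, exercises) are not checked.
    """
    ex_and_exp = exercise_points(exercises_done) + sum(experiment_points)
    total = home_exam + ex_and_exp + presentation
    passed = (
        total >= 30                       # overall minimum
        and home_exam >= 13               # home exam minimum
        and ex_and_exp >= 8               # exercises/experiments minimum
        and len(experiment_points) >= 3   # at least 3 returned experiments
        and presentation >= 4             # group presentation minimum
    )
    return total, passed


# Hypothetical student: 20p exam, 8 exercises, three experiments, 7p presentation
print(course_result(20, 8, [3, 2, 3], 7))
```

A student with a high total can still fail on a single component, for example a home exam below 13 points.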
Course Organization • Module/Week 1: • What is Data Mining? • Association rules • 24.10. normal lecture by Mika • 26.10. normal lecture by Mika • Module/Week 2: • Recurrent patterns • Episode rules, minimal occurrences • 31.10. normal lecture by Mika • 2.11. seminar like lecture by Pirjo Course Contents (1)
Course Organization • Module/Week 3: • Text mining • 7.11. normal lecture by Mika • 9.11. seminar like lecture by Mika • Module/Week 4: • Clustering • Classification • Similarity • 14.11. normal lecture by Pirjo • 16.11. seminar like lecture by Mika Course Contents (2)
Course Organization • Module/Week 5: • Knowledge discovery process • Pre- and postprocessing • 21.11. normal lecture by Pirjo • 23.11. seminar like lecture by Pirjo • Module/Week 6: • Data mining tools • Summary, future • 28.11. normal lecture by Pirjo • 30.11. seminar like lecture by Pirjo Course Contents (3)
Course Organization / Groups • Group is for both seminar and weekly group exercise work • 10 groups à 3 persons Group Establishment Get grouped!
Course Organization / Groups • Group presentation time allocation: • Fri 2.11.: Group 1, Group 2 (associations) • Fri 9.11.: Group 3, Group 4 (episodes) • Fri 16.11.: Group 5, Group 6 (text mining) • Fri 23.11.: Group 7, Group 8 (clustering) • Fri 30.11.: Group 9, Group 10 (KDD process)
Course Organization / Groups • Group 1: • Asikainen Tomi, Hyvönen Leena • Group 2: • Löfström Jaakko, Pitkänen Esa, Tarvainen Tero • Group 3: • Jokinen Sakari, Kuokkanen Ville, Tolvanen Juha • Group 4: • Lehmussaari Kari, Pietilä Mikko, Uusitalo Petri • Group 5: • Johansson Carl, Kerminen Antti, Sundman Jonas
Course Organization / Groups • Group 6: • Malinen Johanna, Sahlberg Mauri, Vasankari Minna • Group 7: • Arkko Jouko, Ojala Petri, Rapiokallio Maarit • Group 8: • Palin Kimmo, Pasanen Janne (, X) • Group 9: • Aunimo Lili, Lehtonen Miro, Saikku Arja • Group 10: • X, X, X
Introduction to Data Mining (DM) What? Why? Applications KDD Process DM Views Major Issues
Personal Home Network in 2000s [Figure: several storage units in the home network, plus storage on the Internet]
Evolution of Database Technology • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: • Data mining and data warehousing, multimedia databases, and Web technology
Why Data Mining? • Enormous amounts of data available: • Automated data collection tools and mature database technology lead to huge amounts of data stored in databases, data warehouses and other information repositories • Manual inspection is either tedious or just impossible
What is Data Mining? • Ultimately: • "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases" • Often just: • "Tell something interesting about this data", "Describe this data" • Exploratory, semi-automatic data analysis on large data sets
What is Data Mining? • Rather established terminology: • Data mining (DM): usually one part of the KDD process • Knowledge discovery in databases (KDD): the general term that covers, e.g., data preprocessing, DM, and post-processing • Not so often used terms: • Knowledge extraction, data archeology • Newest hype: • Business intelligence, knowledge management
What is DM Useful for? • Increase knowledge to base decisions upon • E.g., impact on marketing [Figure labels: Marketing, Database Marketing, KDD & Data Mining, Data Warehousing] • The role and importance of KDD and DM have grown rapidly - and are still growing! • But DM is not just marketing...
Potential Applications? • Database analysis and decision support: • Market analysis and management • Risk analysis and management • Fraud detection and management • Other applications: • Web mining • Text mining • etc.
Example (1) • You are a marketing manager for a cellular telephone company: • Customers receive a free phone (worth 150€) with one-year contract; you pay a sales commission of 250€ per contract • Problem: Turnover (after contract expires) is 25% • Giving a new phone to everyone whose contract is expiring is very expensive • Bringing back a customer after quitting is both difficult and expensive
Example (1) • Three months before a contract expires, predict which customers will leave: • If you want to keep a customer that is predicted to leave, offer them a new phone Yippee! I won't leave!
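The economics behind this example can be sketched with the figures from the slide (150€ phone, 250€ commission, 25% turnover). Everything else below is an assumption made for illustration: the function name, the idea that every customer who is offered a phone stays, that each lost customer is replaced by a new one at phone plus commission cost, and the counts of correctly and wrongly flagged customers.

```python
PHONE_COST = 150   # euros, from the slide
COMMISSION = 250   # euros per new contract, from the slide
CHURN_RATE = 0.25  # turnover after the contract expires, from the slide


def retention_costs(customers, caught_leavers, false_alarms):
    """Compare three strategies for a batch of expiring contracts.

    Assumptions (not from the slides): an offered phone always keeps the
    customer, and every lost customer is replaced by a new contract.
    caught_leavers / false_alarms describe a hypothetical prediction model.
    """
    leavers = round(customers * CHURN_RATE)
    replace_cost = PHONE_COST + COMMISSION
    return {
        "do_nothing": leavers * replace_cost,
        "phone_for_all": customers * PHONE_COST,
        "targeted": (caught_leavers + false_alarms) * PHONE_COST
        + (leavers - caught_leavers) * replace_cost,
    }


# Hypothetical model: flags 400 of 1000 customers, 200 of them real leavers
print(retention_costs(1000, caught_leavers=200, false_alarms=200))
```

Under these assumptions even a model that is right only half of the time beats both giving everyone a phone and doing nothing.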
Example (2) • You are an insurance officer and you should define a suitable monthly payment for an 18-year-old boy who has bought a Ferrari … what to do? Oh, yes! I love my Ferrari!
Example (2) • Analyze all previous customer data and paid compensations data • What is the predicted accident probability based on… • Driver's gender (male/female) and age • Car model and age, place of living • etc. • If the accident probability is higher than average, set the monthly payment accordingly!
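One simple way to read this slide is to estimate the accident rate per customer group from historical records and compare it with the overall rate. The sketch below assumes a hypothetical record format of (group, had_accident) pairs and a purely illustrative pricing rule that scales a base payment by the ratio of group rate to overall rate; neither comes from the slides.

```python
from collections import defaultdict


def accident_rates(records):
    """Return {group: accident rate} from (group, had_accident) records."""
    counts = defaultdict(lambda: [0, 0])  # group -> [accidents, customers]
    for group, had_accident in records:
        counts[group][0] += int(had_accident)
        counts[group][1] += 1
    return {group: acc / n for group, (acc, n) in counts.items()}


def monthly_payment(base, group_rate, overall_rate):
    """Illustrative rule: raise the base payment in proportion to the risk,
    but never go below the base payment."""
    return base * max(1.0, group_rate / overall_rate)


# Hypothetical history: (age group, car class) and whether an accident occurred
young_sports = ("18-25", "sports car")
older_family = ("40-60", "family car")
history = (
    [(young_sports, True)] * 2 + [(young_sports, False)] * 2
    + [(older_family, True)] * 1 + [(older_family, False)] * 7
)

rates = accident_rates(history)
overall = sum(acc for _, acc in history) / len(history)
print(monthly_payment(40.0, rates[young_sports], overall))
```

A real model would use many more attributes and far more data per group; the grouping here only illustrates the idea of comparing against the average.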
Example (3) • You are in a foreign country and somebody steals or duplicates your credit card or mobile phone … • Credit card companies … • use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances • Phone companies … • analyze patterns that deviate from an expected norm (destination, duration, etc.)
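Detecting patterns that "deviate from an expected norm" can be illustrated with a z-score test on call durations. This is a minimal sketch, not how any particular phone company works: the function name, the threshold of 3 standard deviations, and the sample durations are all assumptions.

```python
from statistics import mean, stdev


def is_unusual(history, value, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations
    from the mean of the customer's own history (needs >= 2 data points)."""
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # No variation in the history: anything different is unusual
        return value != mu
    return abs(value - mu) / sd > threshold


# Hypothetical call durations in minutes for one subscriber
durations = [3, 4, 5, 4, 3, 5, 4, 4]
print(is_unusual(durations, 45))  # a 45-minute call
print(is_unusual(durations, 5))   # a 5-minute call
```

The same test could be applied to other attributes named on the slide, such as call destination frequencies.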
Example (4) • Web access logs can be analyzed for … • discovering customer preferences • improving Web site organization • Similarly … • all kinds of log information analysis • user interface/service adaptation Excellent surfing experience!
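As a small illustration of web log analysis, the sketch below counts successful page requests per URL. It assumes log lines in the Common Log Format; the sample lines, paths, and the function name are made up for the example.

```python
import re
from collections import Counter

# Matches the request and status part of a Common Log Format line,
# e.g. "GET /path HTTP/1.0" 200
REQUEST = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+" (\d{3})')


def page_hits(lines):
    """Count successful (2xx) requests per requested path."""
    hits = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match and match.group(2).startswith("2"):
            hits[match.group(1)] += 1
    return hits


# Hypothetical log excerpt
SAMPLE_LOG = [
    '10.0.0.1 - - [24/Oct/2001:14:00:01 +0300] "GET /courses/dm HTTP/1.0" 200 2326',
    '10.0.0.2 - - [24/Oct/2001:14:00:07 +0300] "GET /courses/dm HTTP/1.0" 200 2326',
    '10.0.0.2 - - [24/Oct/2001:14:00:09 +0300] "GET /exercises HTTP/1.0" 200 512',
    '10.0.0.3 - - [24/Oct/2001:14:01:30 +0300] "GET /missing HTTP/1.0" 404 209',
]

print(page_hits(SAMPLE_LOG).most_common(1))
```

Counts like these are a starting point for both goals on the slide: frequently visited pages hint at customer preferences, and rarely reached pages hint at site organization problems.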
Knowledge Discovery Process (1) Learning the domain Creating a target data set Data cleaning/preprocessing Data reduction/projection Choosing the DM task
Knowledge Discovery Process (2) Choosing the DM algorithm(s) Data mining: Search Pattern evaluation Knowledge presentation Use of discovered knowledge
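The steps listed on these two slides can be read as a pipeline in which each stage feeds the next. The sketch below is a toy version under stated assumptions: the record layout, the function names, and the choice of frequent item pairs as the data mining task are all hypothetical and serve only to show how the stages connect.

```python
from collections import Counter
from itertools import combinations


def select_target(rows, store):
    """Creating a target data set: keep only the rows of interest."""
    return [row for row in rows if row["store"] == store]


def clean(rows):
    """Data cleaning/preprocessing: drop rows without any items."""
    return [row for row in rows if row.get("items")]


def project(rows):
    """Data reduction/projection: keep only the item sets."""
    return [frozenset(row["items"]) for row in rows]


def mine_pairs(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }


def kdd(rows, store, min_support):
    baskets = project(clean(select_target(rows, store)))
    return evaluate(mine_pairs(baskets), len(baskets), min_support)


# Hypothetical raw data
ROWS = [
    {"store": "A", "items": ["beer", "chips"]},
    {"store": "A", "items": ["beer", "chips", "milk"]},
    {"store": "A", "items": []},
    {"store": "B", "items": ["milk", "bread"]},
    {"store": "A", "items": ["milk", "bread"]},
]

# Knowledge presentation: print the surviving patterns with their support
for pair, support in kdd(ROWS, "A", min_support=0.5).items():
    print(pair, round(support, 2))
```

Learning the domain, choosing the task and algorithm, and using the discovered knowledge are human decisions that sit around such a pipeline rather than inside it.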
Typical KDD Process [Figure: Operational Database → (1) Preprocessing: time-based selection of raw data, producing cleaned, verified and focused input data → (2) Data mining, producing results → (3) Postprocessing: evaluation of interestingness and selection, producing selected usable patterns → Utilization]
Utilization [Figure: pyramid with increasing potential to support business decisions towards the top] • Making Decisions (End User) • Data Presentation: Visualization Techniques (Business Analyst) • Data Mining: Information Discovery (Data Analyst) • Data Exploration: Statistical Analysis, Querying and Reporting • Data Warehouses / Data Marts: OLAP, MDA (DBA) • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
The Value Chain • Decision • Promote product A in region Z • Mail ads to families of profile P • Cross-sell service B to clients C • Knowledge • A quantity Y of product A is used in region Z • Customers of class Y use x% of C during period D • Information • X lives in Z • S is Y years old • X and S moved • W has money in Z • Data • Customer data • Store data • Demographic data • Geographical data
Data Mining Views • General approaches: • Descriptive data mining: • Describe what interesting things can be found in this data! • Explain this data to me! • Predictive data mining: • Based on this and previous data, tell me what will happen in the future! • Show me the future trends!
Data Mining Views • Views based on … • Databases to be mined • Knowledge to be discovered • Techniques utilized • Applications adapted • Let's take a closer look at these views...
Data Mining Views • Databases to be mined: • Relational • Transactional • Object-oriented • Object-relational • Active • Spatial • Time-series • Text, XML • Multi-media • Heterogeneous • Legacy • Inductive • WWW • etc.
Data Mining Views • Knowledge to be mined (= tasks): • Characterization • Discrimination • Association • Classification • Clustering • Trend • Deviation analysis • Outlier analysis • etc.
Data Mining Views • Techniques utilized: • Database-oriented • Data warehouse (OLAP) • Machine learning • Statistics • Visualization • Neural networks • etc.
Data Mining Views • Applications adapted: • Retail (supermarkets etc.) • Telecom • Banking • Fraud analysis • DNA mining • Stock market analysis • Web mining • Log data analysis • etc.