1 / 26

CSE/CBS 572 Data Mining

In this interactive course, you will learn about the principles and applications of data mining through paper readings, discussions, projects, and presentations. Topics include classification, clustering, association, and more.

lampkins
Download Presentation

CSE/CBS 572 Data Mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE/CBS 572 Data Mining Huan Liu, CSE, CEAS, ASU http://www.public.asu.edu/~huanliu/DM06S/cse572.html CSE572: Data Mining by H. Liu

  2. CSE 591 • Contents of basic and advanced topics Classification, Clustering, Association, and Applications • Format – An interactive course with ample opportunities to work, create and share Paper reading, discussion, project, presentation • Assessment Class participation, assignments, project proposal, presentations, exam(s) CSE572: Data Mining by H. Liu

  3. You: our future data miner • TA: Trevor Lei Tang, l.tang@asu.edu • Me: Huan Liu, huanliu@asu.edu • Where: Brickyard 566 • When: see on the class website, or by appointment • “No pain, no gain”, or “As you sow, so you shall reap”, we will also learn the principle of “No Free Lunch”. • MyASU will be used, so make sure your email address is correct & won’t miss important announcement CSE572: Data Mining by H. Liu

  4. Course Format • What is the effective teaching of graduate data mining ? Your help is wanted. • Research papers - the main categories to be found on the course web site. • You can choose one of the textbooks listed. It is an entering point for you to access related subjects. It is a fast changing field. • Everyone is expected to read research papers and participate in class discussion. • Selected research paper presentation. • Project presentations will be evaluated during presentation. CSE572: Data Mining by H. Liu

  5. Point distribution (tentative) • Projects (35%) • Reading/presentation assignment (10%) • Exam(s) (40%) • Assignments (15%), and class participation, quizzes (up to 10% extra credit) • Late penalty, YES, increased exponentially. • Academic integrity (http://www.public.asu.edu/~huanliu/conduct.html) CSE572: Data Mining by H. Liu

  6. Research paper reading • We provide a reading list and you can also choose your favorite • All are expected to search for and read the selected papers. • What is it about (e.g., key idea, basic algorithm)? • What are points to discuss and improve? • What can we do with it? • What to submit? (see more on the class website) • A brief report that describes the above and 2 questions suitable for quizzes/tests with solutions • A set of presentation slides for 20 minutes • Due date: TBA, use digital drop box • Grading criteria include (1) quality of additional papers you select, (2) slides for presentation, (3) the report, and (4) oral presentation will be selected among the best submissions and presenters will be given extra credit based on presentation • Presentation can start as early as in February, we hope - CSE572: Data Mining by H. Liu

  7. Project • Proposal • Proposal presentation, discussion, revision • A project that can be completed in a semester • Project • Class presentation and/or demo • Report • One key goal of this course is to take advantage of your intelligence and experience to create something useful and with impact CSE572: Data Mining by H. Liu

  8. Topic Distribution (tentative) CSE572: Data Mining by H. Liu

  9. Categories of interests (including design and implementation) • Data and application security • Data mining and privacy • Data reduction and selection • Streaming data reduction • Dealing with large data (column- & row-wise) • Search bias, overfitting • Learning algorithms • Ensemble methods • Semi-supervised learning • Active learning and co-training • Bioinformatics for CBS 572 or others • A discussion board will be created CSE572: Data Mining by H. Liu

  10. Your first assignment • Think about what you want to accomplish. • List 2 your areas of interests (don’t be restricted by the previous list). • Pick an area of interest and choose a general topic for paper presentation. • Submission via MyASU or hardcopy • Complete the above and submit it in the 3rd class (next Wednesday 1/25). CSE572: Data Mining by H. Liu

  11. 2nd Assignment due on Feb 1 • First, choose your category of interest • Second, form presentation groups (3-4 a group) • Third, each group picks a paper from the given list of papers and find additional 2 high-quality relevant papers • Submit it through myASU • TA will help you and compile a list of all papers at the end • Write a summary for each paper including • What is it about • Why is it significant and relevant • Where is it published and when CSE572: Data Mining by H. Liu

  12. Introduction • The need for data mining • Data mining • Text mining • Image mining • Web mining (log, link, content) • Bioinformatics • Many products and abundant applications • Where do we stand CSE572: Data Mining by H. Liu

  13. What is data mining • Data mining is • extraction of useful patterns from data sources, e.g., databases, texts, web, image. • the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. CSE572: Data Mining by H. Liu

  14. Patterns (1) • Patterns are the relationships and summaries derived through a data mining exercise. • Patterns must be: • valid • novel • potentially useful • understandable CSE572: Data Mining by H. Liu

  15. Patterns (2) • Patterns are used for • prediction or classification • describing the existing data • segmenting the data (e.g., the market) • profiling the data (e.g., your customers) • Detection (e.g., intrusion, fault, anomaly) CSE572: Data Mining by H. Liu

  16. Data (1) • Data mining typically deals with data that have already been collected for some purpose other than data mining. • Data miners usually have no influence on data collection strategies. • Large bodies of data cause new problems: representation, storage, retrieval, analysis, ... CSE572: Data Mining by H. Liu

  17. Data (2) • Even with a very large data set, we are usually faced with just a sample from the population. • Data exist in many types (continuous, nominal) and forms (credit card usage records, supermarket transactions, government statistics, text, images, medical records, human genome databases, molecular databases). CSE572: Data Mining by H. Liu

  18. Typical DM tasks • Classification: • mining patterns that can classify future data into known classes. • Association rule mining: • mining any rule of the form X Y, where X and Y are sets of data items. • Clustering: • identifying a set of similar groups in the data CSE572: Data Mining by H. Liu

  19. Sequential pattern mining: A sequential rule: A B, says that event A will be immediately followed by event B with a certain confidence • Deviation/anomaly/exception detection: discovering the most significant changes in data • Data visualization: using graphical methods to show patterns in data. • High performance computing • Bioinformatics CSE572: Data Mining by H. Liu

  20. Why data mining • Rapid computerization of businesses produces huge amounts of data • How to make best use of data? • A growing realization: knowledge discovered from data can be used for competitive advantage and to increase business intelligence. CSE572: Data Mining by H. Liu

  21. Make use/sense of your data assets • Many interesting things you want to find cannot be found using database queries “find me people likely to buy my products” “Who are likely to respond to my promotion” • Fast identify underlying relationships and respond to emerging opportunities CSE572: Data Mining by H. Liu

  22. Why now and for the near future • The data is abundant. • The data is being collected or warehoused. • The computing power is affordable. • The competitive pressure is increasingly. • Data mining tools have become available. • New challenges • New data types evolve • New applications emerge CSE572: Data Mining by H. Liu

  23. DM fields • Data mining is an emerging multi-disciplinary field: Statistics Machine learning Databases Visualization OLAP and data warehousing High-performance computing ... CSE572: Data Mining by H. Liu

  24. Summary • What is data mining? KDD - knowledge discovery in databases: non-trivial extraction of implicit, previously unknown and potentially useful information • Why do we need data mining? Wide use of computer systems - data explosion - knowledge is power – but we’re data rich, knowledge poor – useful, understandable and actionable knowledge ... • Data mining is not a plug-and-play, so we are not done yet and need to continue this class … CSE572: Data Mining by H. Liu

  25. An Overview of KDD Process (Guess which is which) CSE572: Data Mining by H. Liu

  26. Web mining – an application • The Web is a massive database • Semi-structured data • XML and RDF • Web mining • Content • Structure • Usage • Link analysis CSE572: Data Mining by H. Liu

More Related