CSE591 (575) Data Mining 1/21/2003 - 5/6/2003 Computer Science & Engineering ASU
Introduction Introduction to this Course Introduction to Data Mining
Introduction to the Course • First, about you - why take this course? • Your background and strength • AI, DBMS, Statistics, Biology, … • Your interests and requests • What is this course about? • Problem solving • Handling data • transform data to workable data • Mining data • turn data to knowledge • validation and presentation of knowledge
This course • What can you expect from this course? • Knowledge and experience about DM • Problem solving and solution presentation • How is this course conducted? • Presentations • Individual projects • Course Format • Individual Projects 40% • Exams and/or quizzes 40% • Class participation 20% • off-campus students?
Projects - Start NOW! • How to start? • Projects should be sufficiently challenging but reasonable, suitable for one semester • How to choose your individual project • Real-world problems • Problems that might make a difference • Two types of projects • Available projects • Self-proposed projects (approval is needed)
Some project ideas • Dealing with high dimensional data • Data of supervised, unsupervised learning • Image mining • Feature extraction, clustering of images • Active sampling • Various data structures (kd-trees, R-trees, Multi-Dimen Scaling) • Meta data (RDF, namespace) for mining • Ensemble learning • Sequence mining (HMM learning) • Bioinformatics and applications (feature selection) • Intelligent driving data analysis • Data integration, data reduction (random projection)
How is a project evaluated? • It depends on • What you want to achieve • Its impact • Your effort • The sooner you start, the better • The beginning is not easy
Course Web Site • http://www.public.asu.edu/~huanliu/cse591.html • My office and office hours • GWC 342 • T 10:30 - 11:30am and Th 4:00-5:00pm • My email: hliu@asu.edu • Slides and relevant information will be made available at the course web site
Any questions and suggestions? • Your feedback is most welcome! • I need it to adapt the course to your needs. • Please feel free to provide yours anytime. • Share your questions and concerns with the class – very likely others may have the same. • No pain no gain – no magic for data mining. • The more you put in, the more you get • Your grades are proportional to your efforts.
Introduction to Data Mining Definitions Motivations of DM Interdisciplinary Links of DM
What is DM? • Or more precisely KDD (knowledge discovery from databases)? • Many definitions • A process, not plug-and-play: raw data → transformed data → preprocessed data → data mining → post-processing → knowledge • One definition is • A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data
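The process view above can be sketched as a chain of small functions, one per stage. This is only an illustrative sketch: the toy basket data, the function names, and the choice of pair counting as the "mining" step are all assumptions made for the example, not part of the course material.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: one comma-separated basket per line,
# with inconsistent casing, stray spaces, and an empty line.
RAW = [
    "Bread, milk",
    "bread,  Milk, eggs",
    "",
    "milk, eggs",
    "BREAD, milk",
]

def transform(raw_lines):
    """Raw data -> transformed data: split each line into item strings."""
    return [line.split(",") for line in raw_lines]

def preprocess(baskets):
    """Transformed data -> preprocessed data: normalise items, drop empty baskets."""
    cleaned = [{item.strip().lower() for item in basket if item.strip()}
               for basket in baskets]
    return [basket for basket in cleaned if basket]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def postprocess(counts, n_baskets, min_support=0.5):
    """Post-processing: keep only pairs whose support reaches the threshold."""
    return {pair: count / n_baskets
            for pair, count in counts.items()
            if count / n_baskets >= min_support}

def kdd(raw_lines, min_support=0.5):
    """Run the whole chain from raw lines to retained patterns."""
    baskets = preprocess(transform(raw_lines))
    return postprocess(mine(baskets), len(baskets), min_support)

patterns = kdd(RAW)
print(patterns)
```

Each stage can be inspected or replaced on its own, which is the sense in which KDD is a process rather than plug-and-play.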
Need for Data Mining • Data accumulate and double every 9 months • There is a big gap from stored data to knowledge; and the transition won’t occur automatically. • Manual data analysis is not new but a bottleneck • Fast developing Computer Science and Engineering generates new demands • Seeking knowledge from massive data • Any personal experience?
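To make the doubling claim concrete: if data doubles every 9 months, the volume after t months is the starting volume times 2^(t/9). The short sketch below only does this arithmetic; the function name and the default period parameter are illustrative assumptions.

```python
def growth_factor(months, doubling_period_months=9.0):
    """Factor by which the data volume grows after `months`,
    assuming steady doubling every `doubling_period_months`."""
    return 2.0 ** (months / doubling_period_months)

for months in (9, 12, 36):
    print(months, "months ->", round(growth_factor(months), 2), "x")
```

Under this assumption the volume grows roughly 2.5 times in one year and 16 times in three years, which is why manual analysis quickly becomes the bottleneck.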
When is DM useful • Data rich • Two invited talks so far have convincingly demonstrated it • Large data (dimensionality and size) • Image data (size) • Gene data (dimensionality) • Little knowledge about data (exploratory data analysis) • What if we have some knowledge?
DM perspectives • Prediction, description, explanation, optimization, and exploration • Completion of knowledge (patterns vs. models) • Understandability and representation of knowledge • Some applications • Business intelligence (CRM) • Security (Info, Comp Systems, Networks, Data, Privacy) • Scientific discovery (bioinformatics)
Challenges • Increasing data dimensionality and data size • Various data forms • New data types • Streaming data, multimedia data • Efficient search and data access • Intelligent update and integration
Interdisciplinary Links of DM • Statistics • Databases • AI • Machine Learning • Visualization • High Performance Computing • supercomputers, distributed/parallel/cluster computing
Statistics • Discovery of structures or patterns in data sets • hypothesis testing, parameter estimation • Optimal strategies for collecting data • efficient search of large databases • Static data • constantly evolving data • Models play a central role • algorithms are of a major concern • patterns are sought
Relational Databases • A relational database can contain several tables • Tables and schemas • The goal in data organization is to maintain data and quickly locate the requested data • Queries and index structures • Query execution and optimization • Query optimization finds the best possible evaluation method for a given query • Providing fast, reliable access to data for data mining
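The role of index structures in query optimization can be observed with Python's built-in sqlite3 module. The sketch below is an assumption-laden illustration: the table, column, and index names are made up, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT, spend REAL)")
# Hypothetical data: every tenth customer lives in Tempe.
conn.executemany(
    "INSERT INTO customers (city, spend) VALUES (?, ?)",
    [("Tempe" if i % 10 == 0 else "Other", float(i)) for i in range(1000)],
)

def plan(sql):
    """Return SQLite's textual query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description

query = "SELECT AVG(spend) FROM customers WHERE city = 'Tempe'"

plan_without_index = plan(query)  # typically a full scan of the table
conn.execute("CREATE INDEX idx_customers_city ON customers(city)")
plan_with_index = plan(query)     # typically a search through the new index

average_spend = conn.execute(query).fetchone()[0]
print(plan_without_index)
print(plan_with_index)
print(average_spend)
```

The query text is identical in both cases; only the evaluation method chosen by the optimizer changes once the index exists.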
AI • Intelligent agents • Perception-Action-Goal-Environment • Search • uniform cost and informed search algorithms • Knowledge representation • FOL, production rules, frames with semantic networks • Knowledge acquisition • Knowledge maintenance and application
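As a minimal sketch of the uniform-cost search mentioned above, the function below expands the cheapest frontier node first using a priority queue. The graph, its edge costs, and the function name are hypothetical and chosen only for illustration.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None.

    `graph` maps each node to a dict of {neighbour: non-negative step cost}.
    """
    frontier = [(0, start, [start])]  # priority queue ordered by path cost
    settled = {}                      # cheapest cost at which a node was expanded
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue                  # already expanded via a path at least as cheap
        settled[node] = cost
        for neighbour, step in graph.get(node, {}).items():
            heapq.heappush(frontier, (cost + step, neighbour, path + [neighbour]))
    return None

# Hypothetical weighted graph.
graph = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 7},
    "C": {"D": 2},
    "D": {},
}
result = uniform_cost_search(graph, "A", "D")
print(result)
```

An informed search such as A* would use the same loop but order the frontier by path cost plus a heuristic estimate of the remaining cost.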
Machine Learning • Focusing on complex representations, data-intensive problems, and search-based methods • Flexibility with prior knowledge and collected data • Generalization from data and empirical validation • statistical soundness and computational efficiency • constrained by finite computing & data resources • Challenges from KDD • scaling up, cost info, auto data preprocessing
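Empirical validation of generalization can be sketched with a simple holdout test: fit on one part of the data and measure accuracy on the part held back. Everything in this sketch is an assumption for illustration, including the toy data, the 1-nearest-neighbour learner, and the deterministic every-third-point split (a random split is more common in practice).

```python
def nearest_neighbour_predict(train, x):
    """Predict the label of the training point closest to x (1-NN, one feature)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def holdout_accuracy(data, every=3):
    """Hold out every `every`-th example for testing and train on the rest."""
    test = data[::every]
    train = [pair for i, pair in enumerate(data) if i % every != 0]
    correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical toy data: the label is 1 when the feature exceeds 50.
data = [(float(x), int(x > 50)) for x in range(0, 100, 2)]
accuracy = holdout_accuracy(data)
print(accuracy)
```

The held-out examples never influence the learner, so the measured accuracy is an estimate of how well it generalizes rather than how well it memorizes.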
Visualization • Producing a visual display with insights into the structure of the data with interactive means • zoom in/out, rotating, displaying detailed info • Various branches of visualization methods • show summary properties and explore relationships between variables • investigate large databases and convey lots of information • analyze data with geographic/spatial location • A pre- and post-processing tool for KDD
Bibliography • W. Klosgen & J.M. Zytkow, edited, 2001, Handbook of Data Mining and Knowledge Discovery.