Current Trends in Machine Learning and Data Mining

Current Trends in Machine Learning and Data Mining Dr Marcus Gallagher School of Information Technology and Electrical Engineering University of Queensland 4072 Australia marcusg@itee.uq.edu.au

Talk Overview • Machine Learning • Data mining • Where machine learning fits into data mining • Scientific Data Mining • Machine learning and the scientific process • Current topics & future directions • Summary Australian Virtual Observatory Workshop 2003

Machine Learning • “…the study of computer algorithms capable of learning to improve their performance on a task on the basis of their own experience.” • Often this is “learning from data”. • A sub-discipline of artificial intelligence, with large overlaps into statistics, pattern recognition, visualization, robotics, control, … Australian Virtual Observatory Workshop 2003

Data Mining • “…the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.” Australian Virtual Observatory Workshop 2003

Data Mining • Modern science is driven by data analysis like never before. We have an ability to collect and process data that is increasing exponentially! Australian Virtual Observatory Workshop 2003

Data mining: increase in CPU performance (1/execution time) Australian Virtual Observatory Workshop 2003

Data mining: trends in memory capacity and cost Australian Virtual Observatory Workshop 2003

Data mining • General trends: • Increase in the size of datasets (in terms of observations and dimensionality). • What to do when #dimensions > # observations? • Interest in data mining for non-numerical data. • “We are drowning in information and starving for knowledge”. – Rutherford D. Roger Australian Virtual Observatory Workshop 2003

Data Mining Define problem Data collection Data preparation Data modelling Interpretation/ Evaluation Implement/ Deploy model Australian Virtual Observatory Workshop 2003

Data Mining Define problem Machine Learning Data collection Data preparation Data modelling Interpretation/ Evaluation Implement/ Deploy model Australian Virtual Observatory Workshop 2003

Data modelling and the scientific process • Data modelling plays an important role at several stages in the scientific process: • Observe and explore interesting phenomena. • Generate hypotheses. • Formulate model to explain phenomena. • Test predictions made by the theory. • Modify theory and repeat (at 2 or 3). • The explosion of data suggests that we need to (partially) automate numerous aspects of the scientific process. Australian Virtual Observatory Workshop 2003

1. Observe and explore interesting phenomena. • Problems here are typically referred to as unsupervised learning in the ML community: • Given a set of d-dimensional data vectors (X1,…,Xn), Xi = (x1,…,xd), build a model of the data to infer properties of the underlying distribution (process) that generated the data. • Key problems: • Dimensionality reduction: developing algorithms that can reduce a dataset of hundreds or thousands of dimensions to just a few for visualization, while retaining as much of the “information” as possible in the original dataset. • Clustering of data – outlier detection: Identifying trends and/or anomalies in datasets. Australian Virtual Observatory Workshop 2003

1. Observe and explore interesting phenomena • Current ML approaches: • Independent Component Analysis – decompose multivariate data with the aim of producing components that are as “statistically independent” as possible. • Related to PCA and factor analysis. • Gaussian mixture models for clustering – uses a semi-parametric probability density estimator that is trained iteratively on data. • Implements a “soft” version of k-means. Australian Virtual Observatory Workshop 2003

1. Observe and explore interesting phenomena • Self-organizing maps and topographic mapping – similar to clustering but where the cluster-centres are constrained to lie in a low-dimensional manifold (and so have a spatial relationship). Australian Virtual Observatory Workshop 2003

3. Formulate model to explain phenomena • Problems here are typically referred to as supervised learning in the ML community: • Given a training set of pairs of input and output data vectors {(Xi,Yi),…,(Xn,Yn)}, where the input values are thought to have some influence on the corresponding output values, build a model of the data that can predict the outputs of unseen (test) inputs. • Key problems: • Regression, classification, forecasting. Australian Virtual Observatory Workshop 2003

3. Formulate model to explain phenomena • Established ML approaches: • Neural networks • Decision-trees and rule-based classifiers • Example applications in astronomy: • Star/galaxy classification – on the basis of optical data. • Photometric redshift evaluation. • Noise identification and removal in gravitational waves detectors. Australian Virtual Observatory Workshop 2003

Current/New Machine Learning Models • Current ML approaches: • Support vector machines – uses insights from computation learning theory and geometry to produce predictors with powerful discrimination and good generalization. • Ensemble methods – improving the accuracy of predictions by using multiple models and bootstrap sampling. • Example: boosting – incrementally constructs an ensemble of “weak” models, where each model is forced to concentrate on the mistakes made by previous models. Australian Virtual Observatory Workshop 2003

Current/New Machine Learning Models • Gaussian Processes – use the machinery of Bayesian inference to model data using stochastic processes. • All information is represented as a probability distribution. • Incorporates uncertainty associated with prior information and predictions made. • Probabilistic graphical models – the model is a probability distribution where dependencies are explicitly encoded. • Generative models. Australian Virtual Observatory Workshop 2003

Current/New Problems in Machine Learning • Active learning: • Assume that data can be generated (measured or labelled) on demand – build a learning algorithm that learns on the basis of self-selected data points (queries). • Aims to reduce the amount of data required, training time of the model, or amount of data that must be manually labelled. Australian Virtual Observatory Workshop 2003

Current/New Problems in Machine Learning • Semi-supervised learning: • Learning from both labelled (expensive, scarce) and unlabelled (abundant, cheap) data. • Aims are similar to active learning. • Transductive learning: • All unlabelled data points belong to the test set. • The algorithm is able to take advantage of the spatial distribution of the data that the real-world generates (and that it will be tested on). Australian Virtual Observatory Workshop 2003

Current/New Problems in Machine Learning • Non-numerical (unreal!) data • Examples: • Strings – text, DNA, computer programs. • Trees – parse trees, XML, phylogenetic trees. • Graphs – organic molecules, go board positions. • Embedding unreal data in continuous space and applying standard ML techniques has limitations. • Need to define appropriate kernels, distance metrics over these data spaces. Australian Virtual Observatory Workshop 2003

Summary • Automated data mining plays an increasingly important role in the scientific process. • Machine Learning techniques are driven by the problems in data mining and provide some effective solutions. • Machine Learning is a highly active area of research – new & evolving algorithms. • The scope of learning problems is widening to deal with important real-world data. Australian Virtual Observatory Workshop 2003

References • E. Mjolsness and D. DeCoste. Machine Learning for Science: State of the Art and Future Prospects. Science (293):2051-2055, 2001. • D. Hand et al. Principles of Data Mining. MIT Press, 2001. • T. Hastie et al. The Elements of Statistical Learning. Springer, 2001. • R. Tagliaferri et al. Neural networks in astronomy. Neural Networks (16):297-319, 2003. • S. Harmeling et al. Kernel-based nonlinear blind source separation. Neural Computation v15, n5, 1089-1124, 2003. • T. Graepel. Getting Real with Unreal Data: Lessons Learned and the Way Ahead. NIPS Workshop on Unreal Data, 2002. • S. J. Hong and S. Weiss: Advances in predictive models for data mining. Pattern Recognition Letters 22(1): 55-61 (2001). Australian Virtual Observatory Workshop 2003

Current Trends in Machine Learning and Data Mining

Current Trends in Machine Learning and Data Mining

Presentation Transcript

Data Mining (and machine learning)

Data Mining and Machine Learning

Data Mining and Machine Learning

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining and Machine Learning

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining (and machine learning)

Data Mining and Machine Learning

Data Mining and Machine Learning

Data Mining (and machine learning)

Data Mining (and machine learning)

Data mining and Machine Learning

Data Mining (and machine learning)

Data Mining and Machine Learning

Data Mining (and machine learning)