Data mining: theory and applications Heikki Mannila
Data Mining: Theory and Applications • Data analysis becoming more important in other sciences and in industry • New measurement methods • Ability to store data • High-dimensional large data sets • Non-traditional forms (e.g., strings, trees, graphs) • Data analysis lags behind
Data mining • Has emerged as a major research area at the interface of computer science and statistics • Machine learning, databases, algorithms • Data analysis questions are increasingly visible in database and algorithms research • Theory and practice interact
Goals • Develop novel data analysis techniques for the use of other sciences and industry • How? • Look at data analysis problems arising in practice • Abstract new computational concepts from them • Analyse the concepts and develop new computational methods • Take the results into practice • Theoretical work in algorithms and foundations of data analysis can have fast impact in the application areas • The applications feed interesting novel questions to theoretical research
Major themes in methods • Pattern discovery • Methods for sequence decomposition • Interplay of combinatorial and continuous methods in data mining • Techniques for the decomposition of large 0-1 data sets.
Application areas • Genome structure • Gene expression data analysis • Palaeontology • Linguistic applications • Ubiquitous computing
The people • Heikki Mannila, Hannu Toivonen, Jaakko Hollmen, Aristides Gionis, Floris Geerts, Bart Goethals • 6 Ph.D. students • A visible position in the international community
Highlights • Finding recurrent sources in sequences • global structure in genomic sequences • recognizing recurrent contexts in mobile device usage • (k,h)-segmentation • Finding orderings of attributes from unordered binary data • Fragments of order • Spectral ordering techniques • Pattern discovery and mixture modelling techniques for onomastic data sets • Methods for finding topics in 0-1 datasets on the basis of co-occurrence information
Finding recurrent sources in sequences • Sequences • DNA • Telecommunications • Etc. • How to find some global structure in a sequence? • Try to find homogeneous segments in the sequence
Finding homogeneous segments T = S1 S2 S3 S4 S5 S6 • Sequence T, integer k • Measure of homogeneity H for segments of T • E.g., H(S) = |S| Var(S) • Find the division T = S1 S2 … Sk minimizing H(S1) + H(S2) + … + H(Sk) • Dynamic programming • (k,k)-segmentation: k segments with no relationship to each other; independent sources
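The dynamic programming step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-dimensional numeric sequence, uses the slide's example measure H(S) = |S| Var(S) with the population variance, and takes the objective to be the sum of H over the k segments. The function name `k_segmentation` and the half-open `(start, end)` boundary format are hypothetical choices made for this example.

```python
def k_segmentation(T, k):
    """Split T into k contiguous non-empty segments minimizing sum of |S| * Var(S).

    Returns (total_cost, bounds), where bounds is a list of half-open
    (start, end) index pairs. Runs in O(k * n^2) time.
    """
    n = len(T)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(T)")

    # Prefix sums of x and x^2 give each segment cost in O(1):
    # |S| * Var(S) = sum(x^2) - (sum(x))^2 / |S|.
    ps = [0] * (n + 1)
    ps2 = [0] * (n + 1)
    for idx, x in enumerate(T):
        ps[idx + 1] = ps[idx] + x
        ps2[idx + 1] = ps2[idx] + x * x

    def cost(i, j):
        """Cost of the segment T[i:j], with j > i."""
        s = ps[j] - ps[i]
        s2 = ps2[j] - ps2[i]
        return s2 - s * s / (j - i)

    INF = float("inf")
    # E[p][j] = best cost of splitting T[:j] into p segments.
    E = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    E[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                c = E[p - 1][i] + cost(i, j)
                if c < E[p][j]:
                    E[p][j] = c
                    back[p][j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    bounds = []
    j = n
    for p in range(k, 0, -1):
        i = back[p][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return E[k][n], bounds


total, bounds = k_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)
print(total, bounds)
```

The prefix sums keep each segment-cost lookup constant-time, so the triple loop dominates the running time.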
(k,h)-segmentation • We want to limit the number of different types of segments • Only h<k different types are allowed • Find the best segmentation of T into k segments by using only h different types of segments
Figure: a (6,3)-segmentation, with segments labelled Source 1, Source 3, Source 2
(k,h)-segmentation problem • Given sequence T • Find h sources w1, w2, …, wh • A decomposition of sequence T into k segments T = S1 S2 … Sk • Minimizing the sum of distances from each point t to the source wa(t) of the segment to which t belongs
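To make the objective concrete, the sketch below evaluates the cost of one candidate (k,h)-segmentation. It only scores a given solution and does not search for the best one. It assumes one-dimensional points and the L1 distance; the function name `kh_cost`, the half-open `(start, end)` segment format, and the `assignment` list (one source index per segment) are hypothetical names introduced for illustration.

```python
def kh_cost(T, bounds, sources, assignment):
    """Sum of L1 distances from each point to the source of its segment.

    T          : list of numbers (the sequence)
    bounds     : k half-open (start, end) pairs covering T in order
    sources    : h source values w_1..w_h
    assignment : for each segment, the index of its source in `sources`
    """
    if len(bounds) != len(assignment):
        raise ValueError("one source index is needed per segment")
    if any(not 0 <= a < len(sources) for a in assignment):
        raise ValueError("assignment refers to a missing source")

    total = 0.0
    for (start, end), a in zip(bounds, assignment):
        w = sources[a]
        total += sum(abs(t - w) for t in T[start:end])
    return total


# A (4,3)-segmentation: four segments, but only three distinct sources,
# because the first and third segments share source 0.
T = [1, 1, 5, 5, 1, 1, 9, 9]
bounds = [(0, 2), (2, 4), (4, 6), (6, 8)]
print(kh_cost(T, bounds, sources=[1.0, 5.0, 9.0], assignment=[0, 1, 0, 2]))
```

Sharing a source between non-adjacent segments is what distinguishes this from plain k-segmentation, where every segment gets its own representative.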
Results • (k,h)-segmentation problem is NP-hard for dimension d>1, for L1 and L2 metrics • Dimension d=1: complexity open • Simple approximation algorithms • d=1: 3-approximation for L1 • d=1: -approximation for L2 • d>1: (3+ε)-approximation for L1 for any ε>0 • d>1: (A+2)-approximation for L2, where A is the best approximation factor for k-means clustering • Very good performance in practice • The algorithms work for any generative model (not just reals with Lp metrics)
Example: onomastic data • Names of lakes in Finland • About 150,000 lakes • What are the main trends? • High-dimensional marked point process • Collaboration with Research Center for the Languages of Finland (Kotus) • Similar data analysis problems arise also in environmental sciences
Clustering on the basis of the names of lakes • Similarity with the names of lakes in Kangasala
Example: paleontological data • Given a matrix of occurrences of species in fossil sites • Ages of the fossil sites are not available • How to order the sites according to their age? • Background information: species arrive and vanish • Try to find an ordering that minimizes Lazarus events • Figure: 0-1 occurrence matrix of species A, B, C against time, with Lazarus events marked
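The quantity being minimized can be sketched in a few lines. The code assumes that a Lazarus event is a 0 lying between the first and last 1 of a species column once the sites are put in the candidate order, i.e. a site where the species is missing although it occurs both earlier and later. The function name `count_lazarus` and the toy matrix are hypothetical and serve only as an illustration.

```python
def count_lazarus(X, order):
    """Count Lazarus events of a site ordering.

    X     : list of rows, one per site, each a list of 0/1 per species
    order : permutation of site indices (oldest to youngest, or the reverse)
    """
    rows = [X[i] for i in order]
    events = 0
    for s in range(len(rows[0])):
        column = [row[s] for row in rows]
        if 1 not in column:
            continue  # species never observed: nothing to count
        first = column.index(1)
        last = len(column) - 1 - column[::-1].index(1)
        # Zeros strictly inside the species' observed lifetime.
        events += column[first:last + 1].count(0)
    return events


# Toy data: 4 sites (rows) by 3 species (columns).
X = [
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print(count_lazarus(X, [0, 1, 2, 3]))  # 2
print(count_lazarus(X, [3, 1, 2, 0]))  # 0
```

In the toy matrix the second ordering makes every species occupy a contiguous run of sites, so it has no Lazarus events, while the first ordering has two.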
Methods • Spectral ordering: form a Laplacian of the co-occurrence matrix, look at eigenvectors • Fragments of order: find short segments of orders which are not violated by observations • Other applications: text analysis, telecommunications
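One way to read the spectral ordering bullet is sketched below. The slide does not specify the details, so the following are assumptions: the similarity between two sites is the number of species they share, the Laplacian is the unnormalized L = D - S, and the sites are sorted by the eigenvector of the second-smallest eigenvalue (the Fiedler vector). To stay dependency-free, the eigenvector is approximated by power iteration on cI - L with the constant vector projected out. The function name `spectral_order` is hypothetical.

```python
def spectral_order(X, iters=500):
    """Order sites by an approximate Fiedler vector of the co-occurrence graph.

    X : list of rows, one per site, each a list of 0/1 per species.
    Returns a list of site indices. The direction (old to young or the
    reverse) is arbitrary, as with any eigenvector-based ordering.
    """
    n = len(X)
    # Site-by-site similarity: number of shared species, zero diagonal.
    S = [[0 if i == j else sum(a * b for a, b in zip(X[i], X[j]))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in S]
    # c bounds the largest Laplacian eigenvalue (Gershgorin), so cI - L
    # is positive semidefinite and power iteration applies.
    c = 2.0 * max(deg) if max(deg) > 0 else 1.0

    v = [i - (n - 1) / 2 for i in range(n)]  # start orthogonal to constants
    for _ in range(iters):
        # w = (cI - L) v, where L v = D v - S v
        w = [c * v[i] - deg[i] * v[i] + sum(S[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]  # project out the constant eigenvector
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return sorted(range(n), key=lambda i: v[i])


# Five sites whose true age order is s0..s4, given here shuffled as
# s2, s0, s4, s1, s3. Each species spans two consecutive sites.
X = [
    [0, 1, 1, 0],  # s2
    [1, 0, 0, 0],  # s0
    [0, 0, 0, 1],  # s4
    [1, 1, 0, 0],  # s1
    [0, 0, 1, 1],  # s3
]
print(spectral_order(X))
```

On this toy input the returned indices follow the true age order or its reverse. In practice a library eigen-solver would replace the power iteration, which converges slowly when the relevant eigenvalues are close together.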
Future research directions • Theory and practice • The combination of continuous and combinatorial methods • Concepts and algorithms for describing structure of sequences • Methods for pattern discovery in and modelling of spatiotemporal data • Theoretical models for data mining (such as inductive databases) • Foundational issues in pattern discovery (e.g., logical form of patterns and the difficulty in discovering them) • Publications, collaborations, software releases
Applications in the future • Genome structure and its relation to function • Linguistic applications: spatial and temporal variation in language • Ubiquitous computing and telecommunications applications • Paleontological and ecological applications
Mobile Computing Research at HIIT Kimmo Raatikainen Research Director Helsinki Institute for Information Technology kimmo.raatikainen@hiit.fi
Fuego Mission • To address the research challenges arising in mobile computing systems and applications of tomorrow. • Mobile computing will fulfil the vision of ubiquitous (invisible) computing, providing access and services anytime, anywhere, and anyhow. • The key research challenges are related to context-awareness, reconfigurability, adaptability, understanding user needs and experience, and personalization. • “Any technology distinguishable from magic is insufficiently advanced,” Gregory Benford
Present State • Some 20 researchers organised in two closely co-operating research groups • Mobile Computing Group (Prof. Kimmo Raatikainen) • User Experience Research Group (Prof. Martti Mäntylä) • Other senior researchers and post-docs: • Dr. Ken Rimey (software technologies, distributed computing) • Dr. Pekka Nikander, permanent visitor from Ericsson Research (security and privacy in Mobile Internet) • Dr. Timo Saari (user experience research, media science) • Dr. Jan Lindström (distributed data management, mobile data) • Other postdocs likely to be hired in 2004
Current Research Topics • Middleware for Mobile Wireless Internet – Fuego Core project • Mobile distributed event system • Mobile (XML-based) file system with intelligent synchronization • SOAP messaging over wireless (W3C: XML Binary Infoset) • Mobile Presence • Host Identity Protocol • Personal Distributed Information Storage – PDIS project • Synchronization-based peer-to-peer infrastructure for storage of structured XML data: PIM data, metadata for digital media • Context Recognition by User Situation Data Analysis – CONTEXT project • Bridge between User Experience Group at ARU and Adaptive Computing Systems Group at BRU • Software Architectures for Configurable Ubiquitous Systems – Sarcous project by SoberIT at HUT • Managing the large variety of software products
Targets to 2005-2010 – 1/3 • to enlarge and strengthen international co-operation • current: WWRF, UCB, Fraunhofer FOKUS • new: Japan, KCL/Mobile VCE, a European NoE, CMU, … • but not forgetting co-operation in Finland: • HUT, UHE, Tampere Univ Tech, Univ Oulu, UIAH, … • to contribute to software architecture for Wireless World • to address challenges due to personal networking • minimal differences between solution stacks for ad-hoc communities and networked infrastructure • peer-to-peer, device-to-device solutions
Figure labels: Not in primary focus but perhaps later (and other smart places) • Not in primary focus
Targets to 2005-2010 – 2/3 • to put more focus on infrastructure for context-awareness and dynamic (end-user) systems • context modelling: presentation, maintenance, sharing, protection, reasoning, and queries • decision rules for reconfiguration • reflective (self-aware) middleware for personal networking • Fault tolerance in Wireless World • traditional exception will be the usual case • compensations, delayed/delegated actions, … • Trust and privacy in Wireless World
Targets to 2005-2010 – 3/3 • user needs and novel application concepts • human factors of the Wireless World • basic psychosocial mechanisms • what makes a service use experience engaging and sustaining? • user-centric concept design (UCPCD) • process, methods, tools • novel application concepts based on context-awareness, other novel technologies • experience prototypes