Data Mining: Crossing the Chasm

Rakesh Agrawal, IBM Almaden Research Center

Thesis • The greatest challenge facing data mining is to make the transition from being an early market technology to mainstream technology • We have the opportunity to make this transition successful

Outline • Chasm in the technology adoption life cycle, à la Geoffrey Moore† • Experience with Quest/Intelligent Miner • Ideas for successful chasm crossing † Geoffrey A. Moore. Crossing the Chasm. Harper Business. http://www.chasmgroup.com

Technology Adoption Life Cycle • Innovators (Techies): Try it! • Early Adopters (Visionaries): Get ahead of the herd! • Early Majority (Pragmatists): Stick with the herd! • Late Majority (Conservatives): Hold on! • Laggards (Skeptics): No way! Psychographic profile of each group is different

Innovators: Technology Enthusiasts • Intrigued by any fundamental advance in technology • Like to alpha test new products • Can ignore the missing elements • Want access to top technologists • Want no-profit pricing (preferably free) Gatekeepers to early adopters

Early Adopters: Visionaries • Driven by vision of dramatic competitive advantage via revolutionary breakthroughs • Great imagination for strategic applications • Not so price-sensitive • Want rapid time to market • Demand high degree of customization Fund the development of early market

Early Majority: Pragmatists • Want sustainable productivity improvement through evolutionary change • Astute managers of mission-critical apps • Understand real-world issues and tradeoffs • Focus on proven applications; want to see the solution in production Bulwark of the mainstream market

Late Majority: Conservatives • Want to stay even with the competition • Risk averse • Price sensitive • Need completely pre-assembled solutions Extend technology life cycles

Laggards: Skeptics • Driven to maintain status quo • Good at debunking marketing hype • Disbelieve productivity-improvement arguments • Can be formidable opposition to early adoption of a technology Retard the development of high-tech markets

Crack in the curve [Figure: adoption curve with the chasm between the early market and the mainstream market] The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.

Visionaries vs. Pragmatists • Adventurous vs. prudent • First strike capability vs. staying power • Early buy-in vs. wait-and-see • State of the art vs. industry standard • Think big vs. manage expectations • Spend big vs. spend to budget

Is data mining following this curve? • Yes!!! • My personal viewpoint based on Quest/Intelligent Miner experience

Quest • Started as a skunkworks project in the early nineties • Inspired by needs articulated by industry visionaries: • Transaction data collected over a long period • Current tools/SQL don’t cut it • About ready to throw the data away

Approach • Examine “real” applications • Identify operations that cut across applications • Design fast, scalable algorithms for each operation • Develop applications by composing operations

Operations • New operations (completeness, scalability): associations, sequential patterns, similar time series • Adopted from statistics/learning (scalability): classification, clustering, deviations http://www.almaden.ibm.com/cs/quest
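As a rough illustration of the association operation, the sketch below does a level-wise search for frequent itemsets. It is a minimal teaching example, not the Quest or Intelligent Miner implementation; the function name `frequent_itemsets` and the sample baskets are assumptions made for illustration. One reading of "completeness" is that every itemset meeting the support threshold is reported, which is what this search does.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset that appears in at
    least `min_support` transactions (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and keep only
        # those whose every k-subset is itself frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market-basket data, for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
```

Calling `frequent_itemsets(BASKETS, 3)` returns the four single items plus the pair bread and milk; no other pair appears in three or more baskets.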

Bringing Quest to market • Visionaries who inspired Quest did not become first customers: • Wanted evidence that the technology “worked” • Frustrating attempts to interest major IBM customers: • Integration with existing applications • Too-far-out technology • Resistance from in-house analytic groups

First hits • Small information-based companies who provided data in exchange for free results • CIO who wanted to be seen as the technology pioneer in his industry • CIO who wanted the success story to feature in the company’s annual report Led to the formation of a group offering services using Quest

Characteristics of engagements • Mostly associations and sequential patterns • Completeness a big plus • Unanticipated uses • Feedback for further development

Into the product land • Formation of a small “out-of-plan” product group to productize Quest • Facilitated by a closet mathematician • Successes of the services group used for market validation • Continued development and infusion of technology

Intelligent Miner • Serious product • Integrates technologies from various groups • Fast, scalable, runs on multiple platforms • Several “early market” success stories http://www.software.ibm.com/data/iminer/

Are we in the chasm? • Perceived to be sophisticated technology, usable only by specialists • Long, expensive projects • Stand-alone, loosely-coupled with data infrastructures • Difficult to infuse into existing mission-critical applications

Chasm Crossing • Personal speculations on some technical challenges • Do not imply IBM research/product directions

XML-based Data Mining Standard (1) • Model Building: • A pair of standard DTDs for each operation • Interchangeable library of operator implementations [Figure: data specs and parameters (standard DTD) → operator library → model (standard DTD)] Ack: Mattos, Pirahesh, Schwenkries

XML-based Data Mining Standard (2) • Model Deployment: • Mapping XML object provides mapping between names and format in the model object and the data record • Model could have been developed on a different system [Figure: data record, model, and mapping (standard DTDs) → application library → result (standard DTD)]
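The deployment idea can be sketched as follows: a model document, a mapping document, and a data record go in, and a result comes out. Every element and attribute name here (`model`, `rule`, `mapping`, `field`, and so on) is hypothetical, since the slides do not give the DTDs, and the tiny rule-list model stands in for whatever a real operator would produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical model document, possibly built on a different system.
MODEL_XML = """
<model operation="classification">
  <rule field="age" op="lt" value="30" label="young"/>
  <rule field="age" op="ge" value="30" label="mature"/>
</model>
"""

# Hypothetical mapping document: model field names and types on one
# side, the application's record column names on the other.
MAPPING_XML = """
<mapping>
  <field model="age" record="CUST_AGE" type="int"/>
</mapping>
"""

CASTS = {"int": int, "float": float, "str": str}
OPS = {"lt": lambda a, b: a < b, "ge": lambda a, b: a >= b}


def apply_model(model_xml, mapping_xml, record):
    """Score one data record with the model, using the mapping to
    translate record names and formats into the model's vocabulary."""
    model = ET.fromstring(model_xml)
    mapping = ET.fromstring(mapping_xml)
    values = {}
    for f in mapping.findall("field"):
        cast = CASTS[f.get("type")]
        values[f.get("model")] = cast(record[f.get("record")])
    # The first rule whose condition holds decides the label.
    for rule in model.findall("rule"):
        field_value = values[rule.get("field")]
        threshold = type(field_value)(rule.get("value"))
        if OPS[rule.get("op")](field_value, threshold):
            return rule.get("label")
    return None
```

The application only needs to supply a mapping for its own column names (here `CUST_AGE`); the model document itself is untouched, which is what would let a model built elsewhere be deployed locally.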

Implications • Standard interfaces for application developers to incorporate data mining • Coupling with relational databases • mappings from DTDs to relational schemas • implementation using existing infrastructure

Data Mining Benchmarks • UC Irvine repository • Generating synthetic benchmarks modeled after real data sets is a hard problem • How to map names into meaningful literals • How to preserve empirical distributions Ack: Srikant, Ullman
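A very small part of the benchmark problem can be sketched in code: replace real names with aliases and draw synthetic values so that single-column frequencies follow the empirical distribution. This is only an assumption-laden illustration. The function name `synthesize_column` and the sample data are invented, the aliases are opaque rather than meaningful (which is exactly the hard part the slide points to), and correlations between columns are not preserved.

```python
import random
from collections import Counter


def synthesize_column(values, n, rng, prefix="item"):
    """Draw n synthetic values whose frequencies follow the empirical
    distribution of `values`, with real names replaced by aliases."""
    counts = Counter(values)
    # One stable alias per distinct real value, most frequent first.
    alias = {
        v: f"{prefix}_{i}" for i, (v, _) in enumerate(counts.most_common())
    }
    population = list(counts)
    weights = [counts[v] for v in population]
    # Weighted sampling with replacement from the observed values.
    sample = rng.choices(population, weights=weights, k=n)
    return [alias[v] for v in sample], alias


# Hypothetical "real" column, for illustration only.
REAL_COLUMN = ["cola"] * 70 + ["chips"] * 20 + ["salsa"] * 10
```

With a seeded generator such as `random.Random(0)`, the share of each alias in a large sample should land close to the share of the corresponding real value (about 70%, 20%, and 10% here).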

Auto-focus data mining • Automatic parameter tuning • Automatic algorithm selection (à la join method selection in database query optimization) Ack: Andreas Arning
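The analogy with join method selection suggests a cost-based chooser: estimate the cost of each candidate algorithm from statistics about the data and pick the cheapest. The sketch below assumes this reading. The statistics, the two algorithm names, and especially the cost formulas are made up for illustration and carry no empirical weight.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class DataStats:
    rows: int
    columns: int
    fits_in_memory: bool


def cost_in_memory(stats: DataStats) -> float:
    # Hypothetical: cheap per cell, but unusable if the data does not fit.
    if not stats.fits_in_memory:
        return float("inf")
    return float(stats.rows * stats.columns)


def cost_disk_scan(stats: DataStats) -> float:
    # Hypothetical: always applicable, ten times the per-cell cost.
    return 10.0 * stats.rows * stats.columns


COST_MODELS: Dict[str, Callable[[DataStats], float]] = {
    "in_memory": cost_in_memory,
    "disk_scan": cost_disk_scan,
}


def choose_algorithm(stats: DataStats,
                     cost_models: Dict[str, Callable[[DataStats], float]]) -> str:
    """Return the name of the algorithm with the lowest estimated cost,
    much as a query optimizer picks a join method."""
    return min(cost_models, key=lambda name: cost_models[name](stats))
```

The user never names an algorithm; the choice follows from the data statistics, and adding a new algorithm means registering one more cost function.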

Web: Greatest opportunity • Huge collection of data (e.g. Yahoo collecting ~50GB every day) • Universal digital distribution medium makes data mining results actionable in fundamentally new ways • But watch for privacy pitfall

Privacy-preserving data mining • Technical vs. legislated solutions • Implication for data mining algorithms when some fields of a data record have been fudged according to the user’s privacy sensitivity Ack: R. Srikant
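One possible reading of "fudged" fields is value perturbation: each user adds zero-mean random noise to a sensitive field, with more noise for higher privacy sensitivity. The sketch below assumes that reading; the function names, the uniform noise, and the `max_noise` parameter are illustrative choices, not something the slides specify. It shows why this has implications for algorithms: individual values become unreliable, while an aggregate such as the mean is still roughly recoverable.

```python
import random


def perturb(value, sensitivity, rng, max_noise=20.0):
    """Add zero-mean uniform noise scaled by the user's privacy
    sensitivity in [0, 1]; sensitivity 0 releases the true value."""
    spread = max_noise * sensitivity
    return value + rng.uniform(-spread, spread)


def perturb_records(values, sensitivities, rng):
    """Perturb each value according to its owner's sensitivity."""
    return [perturb(v, s, rng) for v, s in zip(values, sensitivities)]


def mean(xs):
    return sum(xs) / len(xs)
```

A mining algorithm that relies only on aggregates can work from the released values, whereas one that relies on exact per-record values would need to account for the noise.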

Personalization • Internet might provide for the first time tools necessary for users to capture information about themselves and to selectively release this information† • Will we be providing these tools? † John Hagel, Marc Singer. Net Worth. Harvard Business School Press.

What about Association Rules? • Very long patterns • Separating wheat from chaff • Principled introduction of domain knowledge
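A common way to start separating wheat from chaff is to score each rule with an interestingness measure. The sketch below computes support, confidence, and lift for a single rule; lift is one widely used measure and is an assumption here, not necessarily what the speaker had in mind. The function name `rule_metrics` is illustrative.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule
    antecedent -> consequent over a list of transactions."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n          # fraction of transactions with both sides
    confidence = n_both / n_a     # how often the consequent follows
    lift = confidence / (n_c / n) # confidence relative to the base rate
    return support, confidence, lift
```

A lift near 1 means the consequent is no more likely given the antecedent than it is overall, so the rule is probably chaff even if its support and confidence look respectable.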

What else? • Formal foundations of data mining

Summary • Closely couple data mining with database systems • Embed data mining into applications • Focus on web • Standard interfaces • Benchmarks • Auto-focusing • Personalization • Privacy

Concluding remarks • Data mining, a great technology • Combination of intriguing theoretical questions with large commercial interest in the technology • Poised for transitioning into mainstream technology • Will we rise to the challenge as a community?

Acknowledgments
