Profiting from Data Mining
Gio Wiederhold
November 2003
Steps needed to profit
• Obtaining relevant data
  • Always incomplete
• Extracting relationships
• Imputing causality
• Finding applicability
• Determining leverage points
• Inventing candidate actions
• Assessing likely outcomes and benefits
• Selecting action to be taken
• Measuring the outcome
• Collecting data for next round
[Diagram label: Model based]
Today's Problem: Disjointness
• Database administrators
  • Focus on data collection, organization, currency
• Analysts
  • Focus on slicing, dicing, relationships
• Middle managers
  • Focus on their costs, profits
• MBAs
  • Focus on business models, planning
• Executives
  • Must make decisions based on diverse inputs
1. Data Collection
Two choices
• (rare) Collect data specifically for analysis
  • allows careful design: model causes and effects
    Purchase = f(price, color, size, customer income, gender, ...)
  • costly
  • often small to make collection manageable
  • imposes delays
• (common) Use data collected for other purposes
  • take advantage of what is readily available
  • low cost
  • needs filtering, reformatting, integration
  • incomplete: rarely covers all causes / effects
  • biased: missing categories
    • only people with phones, cars; only people shopping in supermarkets
1a. Data Integration
Needed when sources have inadequate coverage
• data are in distinct DBs for
  • Prices, Number purchased
  • Customer segments (supermarket, stores, on-line)
• implies some expectations
  • append attributes where keys match: Joe
    • include semantic match: Joe = 012 34 567
  • append rows where key types match: customer
    • include semantic match: customer = owner
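The two integration operations on the slide above can be sketched in a few lines. This is a minimal illustration, not the author's tooling: the record layouts, the field names, and the second customer "Ann" are assumptions made up for the example; only the matches Joe = 012 34 567 and customer = owner come from the slide.

```python
# Hypothetical sources; every field name and value below is invented for illustration.
purchases = [{"customer": "Joe", "number_purchased": 3}]
customer_segments = {"012 34 567": {"segment": "on-line"}}

# Semantic match supplied by someone who knows both sources: Joe = 012 34 567.
name_to_id = {"Joe": "012 34 567"}


def append_attributes(rows, key_field, key_map, other):
    """Append attributes from `other` where keys match after semantic mapping."""
    merged = []
    for row in rows:
        key = key_map.get(row[key_field], row[key_field])
        extra = other.get(key, {})  # no match: keep the row, attributes stay missing
        merged.append({**row, **extra})
    return merged


def append_rows(tables, field_synonyms):
    """Append rows from several tables after renaming equivalent key types."""
    combined = []
    for table in tables:
        for row in table:
            combined.append({field_synonyms.get(k, k): v for k, v in row.items()})
    return combined


# Attributes appended where keys match (Joe = 012 34 567).
wide = append_attributes(purchases, "customer", name_to_id, customer_segments)

# Rows appended where key types match (customer = owner).
store_rows = [{"customer": "Joe", "channel": "supermarket"}]
online_rows = [{"owner": "Ann", "channel": "on-line"}]
tall = append_rows([store_rows, online_rows], {"owner": "customer"})
```

Rows without a matching key are kept with their attributes missing, which is one way the "always incomplete" character of collected data shows up after integration.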
2. Data analysis
• Find relationships
  • already known: ignore or adjust in next round
    • requires comparison with expert knowledge
    • now have quantification
  • unknown
    • uninteresting per expert
    • interesting per expert
3. Establish causality
• Already known: Prior Model
  • But is it complete, i.e., does it explain all effects?
• Analyze relationships
  • use expertise to decide direction
    • often obvious: "common world knowledge"
    • sometimes ambiguous: smoking / cancer / not-smoking
  • often major true cause not captured in data
    food color 10%, food price 20%, buyer gender 2%, unknown 75%
    guess: ethnicity, income
[Callouts: use temporal information; purchase of Chinese vs other food; invent surrogates: names, ZIP codes]
Establishing causality is risky
1. Is a Volvo a safe car? Mined: Volvos have fewer accidents
2. What causes accidents? Drivers!
3. Who buys Volvos? Careful drivers!
4. Must determine
   • effect of safe drivers
   • percentage of safe drivers overall
   • percentage of safe drivers with Volvos
5. How much of the accident rate is now explained?
   • The unexplained difference can be attributed to the car.
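Steps 4 and 5 above amount to a small piece of arithmetic, sketched here. All numbers are hypothetical and chosen only to make the calculation visible; the slide gives no rates or percentages.

```python
def expected_rate(share_safe, rate_safe, rate_other):
    """Accident rate expected from the driver mix alone."""
    return share_safe * rate_safe + (1.0 - share_safe) * rate_other


# Hypothetical inputs (accidents per 100 drivers per year).
rate_safe, rate_other = 2.0, 6.0   # effect of safe drivers
share_safe_overall = 0.5           # percentage of safe drivers overall
share_safe_volvo = 0.8             # percentage of safe drivers with Volvos
observed_volvo_rate = 2.5          # what mining reports for Volvos

overall_rate = expected_rate(share_safe_overall, rate_safe, rate_other)
driver_only_volvo_rate = expected_rate(share_safe_volvo, rate_safe, rate_other)

observed_gap = overall_rate - observed_volvo_rate             # Volvos vs. everyone
explained_by_drivers = overall_rate - driver_only_volvo_rate  # due to who buys them
attributed_to_car = observed_gap - explained_by_drivers       # unexplained remainder

print(f"gap {observed_gap:.2f}, drivers {explained_by_drivers:.2f}, "
      f"car {attributed_to_car:.2f}")
```

With these made-up inputs most of the gap is explained by who drives Volvos, and only the remainder can be credited to the car.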
To use results of data mining
• have to understand direction of relationships
• change cause → create effects
[Diagram "Model": controllable causes and external causes; interesting beneficial effects and side effects; some hidden, some captured by data]
4. Causes provide the leverage
Language of analyst / Language of modeling
• Many causes -- independent variables
  • A few may be controllable
  • Some may be controlled by our competition
  • Others are forces-of-nature
• Even more effects -- dependent variables
  • A few may be desired
  • Some may be disastrous
  • Many are poorly understood
• Intermediate effects
  • Provide a means for measuring effectiveness
  • Allow correction of actions taken
5. Planning & Assessment
Analyze Alternatives
• Current Capabilities
• Future Expectations
Process tasks:
• List resources
• Enumerate alternatives
• Prune alternatives
• Compare alternatives
Predict the future
[Diagram: timeline starting at "now"]
Prediction Requires Tools
[Cartoon: © E-mail this book, Alfred Knopf, 1997]
Simulations predict
1. Back-of-the-envelope
   • Common
   • Adequate if model is simple
   • Assumptions are easily forgotten after some time, not distinguished from data: "Why are we doing this?"
2. Spreadsheets
   • Most common computing tool
   • Specialist modeler can help
   • New, recent data can be pasted in
   • Awkward for the tree of future alternatives
3. Constructed to order
   • Costly, powerful technology
   • Specialist modelers required
   • Expressive simulation languages
   • Requires specialists to set up, run, and rerun with new data
Simulation results: likelihoods
[Figure: tree of alternatives along a time axis starting at "now", with a likelihood on each branch, for next-period alternatives and subsequent periods; uncertainty increases with time]
Simulation services
Wide variety, but common principle: Inputs → Model → Output (time, $, place, ...)
1. Spreadsheets
   • Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
   • may require HPC power for adequate response
3. Continuously executing: weather prediction
   • Search for best match (location, time)
4. Past simulation results collected for future use
   • Tables in a design handbook: materials
   • Typically sparse -- the dimension of the futures is too large
Perform interpolations or extrapolations to match query parameters
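Matching a query against a sparse table of stored results can be sketched as one-dimensional linear interpolation, with extrapolation beyond the table's ends. The table contents (load versus deflection) and the function name are assumptions for illustration; real result collections have many more dimensions.

```python
from bisect import bisect_left


def lookup(table, x):
    """Linearly interpolate (or extrapolate) y at x from sorted (x, y) pairs."""
    if len(table) < 2:
        raise ValueError("need at least two stored results")
    xs = [point[0] for point in table]
    # Pick the segment around x; clamp to the first or last segment to extrapolate.
    i = min(max(bisect_left(xs, x), 1), len(table) - 1)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical stored simulation results: (load in kg, deflection in mm).
past_results = [(100.0, 20.0), (200.0, 30.0), (400.0, 40.0)]

inside = lookup(past_results, 150.0)    # interpolated between two stored runs
beyond = lookup(past_results, 500.0)    # extrapolated past the last stored run
```

Extrapolated answers deserve less trust than interpolated ones, so a service built this way would probably want to report which of the two it returned.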
6. Specify Value of Effects
Still needed: Value of alternative outcomes
• Decision maker / owner input
• Benefits and Costs
• Potential Profit
• Correct for risk, and adjust to present value
[Figure: time axis (past, now, futures) with values at the end states: 1000, 2000, 5000, 1000, 0, -2000, -6000]
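"Correct for risk, and adjust to present value" can be read in several ways. The sketch below assumes the simplest one: discount a future value at a fixed rate per period and weight it by its probability. The discount rate, probability, and horizon are hypothetical; the slide does not specify a method.

```python
def present_value(value, periods_ahead, discount_rate):
    """Discount a future value back to now at a fixed rate per period."""
    return value / (1.0 + discount_rate) ** periods_ahead


def risk_adjusted_value(value, probability, periods_ahead, discount_rate):
    """Probability-weighted present value; one simple notion of risk correction."""
    return probability * present_value(value, periods_ahead, discount_rate)


# Hypothetical outcome: worth 5000, two periods ahead, 40% likely, 10% discount rate.
pv = present_value(5000.0, 2, 0.10)
adjusted = risk_adjusted_value(5000.0, 0.4, 2, 0.10)
print(f"present value {pv:.2f}, risk-adjusted {adjusted:.2f}")
```

A risk-averse decision maker might penalize negative outcomes more heavily than plain probability weighting does; that choice belongs to the owner input listed on the slide.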
Having it all together
• Relationships from analyses of past data
• Data representing the current state
• List of actionable alternatives
• Tree of subsequent alternatives
• Probabilities of those alternatives
• Values of the outcomes
• Ability to predict the likelihood of futures
[Figure: tree of alternatives with branch likelihoods and end-state values 1000, 2000, 5000, 1000, 0, -2000, -6000]
Vision: Putting it all together
Combine results mined from past data, current observations, and predictions into the future.
[Figure: decision maker and support specialists along a time axis]
Needed: Information Systems that also project seamlessly into the Futures
Support of decision-making requires dealing with the futures, as well as the past
• Databases deal well with the past
• Streaming sensors supply current status
• Spreadsheets, simulations deal with the likely futures
Future information systems should combine all these sources
[Figure: time axis: past, now, future]
Connecting it all
• Build super systems
  • Coherent, consistent
  • Expensive
  • Unmaintainable
  • Too many cooks:
    • Database folk
    • Data miners
    • Analysts
    • Planners
    • Simulation specialists
    • Decision makers
• Develop interfaces
  • Incremental
  • Composable as needed
  • Heterogeneous
  • Interfaces required (Metadata):
    • Database to miners: SQL
    • Mined results to analysts: XML?
    • Analysts to planners: ?
    • Planners to Simulations: ? SimQL
    • Decision makers: New tools!
Interfaces enable integration
New: SimQL to access Simulations
• Msg systems, Sensors: streaming data
• Databases and schemas, accessed via SQL or XML
• Simulations, accessed via SimQL and schema-compliant wrappers
[Figure: time axis: past, now, futures]
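The idea of a schema-compliant wrapper can be illustrated with a small sketch. It is not SimQL syntax and not the project's implementation: the class, the method names, the toy delivery model, and the (value, uncertainty) result pairs are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

Result = Dict[str, Tuple[float, float]]  # output name -> (value, uncertainty)


@dataclass
class SimulationWrapper:
    """Hypothetical wrapper exposing a simulation through a declared schema."""
    name: str
    inputs: Dict[str, float]      # input parameter -> default value
    outputs: Tuple[str, ...]      # names of the results the simulation reports
    run: Callable[[Dict[str, float]], Result]

    def schema(self) -> dict:
        """What a client may set and what it may ask for."""
        return {"name": self.name, "inputs": dict(self.inputs),
                "outputs": list(self.outputs)}

    def query(self, select: Iterable[str], **params: float) -> Result:
        """Run the simulation with defaults overridden by `params`."""
        select = list(select)
        unknown = (set(params) - set(self.inputs)) | (set(select) - set(self.outputs))
        if unknown:
            raise KeyError(f"not in schema of {self.name}: {sorted(unknown)}")
        results = self.run({**self.inputs, **params})
        return {name: results[name] for name in select}


def toy_delivery_model(p: Dict[str, float]) -> Result:
    """Stand-in simulation: travel time with a made-up 20% uncertainty."""
    hours = p["distance_km"] / p["speed_kmh"]
    return {"hours": (hours, 0.2 * hours)}


delivery = SimulationWrapper(
    name="delivery",
    inputs={"distance_km": 100.0, "speed_kmh": 60.0},
    outputs=("hours",),
    run=toy_delivery_model,
)
answer = delivery.query(["hours"], distance_km=300.0)
```

The client sees only the schema and the defaults, never the model internals, which mirrors how a database client sees a schema rather than the storage layout.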
SimQL proof-of-concept Implementation
[Architecture diagram. Users: Developer (development interaction), Customer (production interaction, query). Components: Parser; Help, Schema, Commands; Schema Manager; Query Manager; Metadata Manager; Metadata; Wrapped Simulations. Flows: filing of access specs, use of access specs, initiation and results of simulations, error reports]
Demonstration of SimQL
• Simple GUI
• common language requirements
Test Applications (behind wrappers)
• Engineering simulation
• Business planning spreadsheets
• Shipping location database
• Weather on the Internet
Information system use of simulation results
• Simulation results are mapped to alternative courses of action
• The information system should support the model driving the computation and recomputation of likelihoods
• Likelihoods change as "now" moves forward and eliminates earlier alternatives
[Figure: tree of alternatives along a time axis, with a probability on each branch]
The likelihoods multiply out to the end-effects; then their values can be applied to earlier nodes
[Figure: tree of next-period and subsequent-period alternatives along a time axis (past, now, future), with a probability on each branch and a value at each node. End-state values: 1000, 2000, 5000, 1000, 0, -6000, -3000. Values at earlier nodes include 1200, 66, 134, -1220, 1266, -1086]
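The roll-back described above can be sketched as a short recursion. The tree below is hypothetical and much smaller than the one in the figure, whose exact shape cannot be recovered from the transcript; the list-of-pairs representation is likewise an assumption.

```python
def expected_value(node):
    """Roll end-state values back to a node, weighting by branch probability.

    A node is either a number (an end-state value) or a list of
    (probability, child) pairs whose probabilities sum to 1.
    """
    if isinstance(node, (int, float)):
        return float(node)
    if abs(sum(p for p, _ in node) - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * expected_value(child) for p, child in node)


def end_state_likelihoods(node, path_probability=1.0):
    """Multiply probabilities along each path out to the end states."""
    if isinstance(node, (int, float)):
        return [(path_probability, float(node))]
    pairs = []
    for p, child in node:
        pairs.extend(end_state_likelihoods(child, path_probability * p))
    return pairs


# Hypothetical two-period tree of alternatives.
tree = [
    (0.5, [(0.4, 2000), (0.6, 1000)]),
    (0.5, [(0.5, 0), (0.5, -2000)]),
]

likelihoods = end_state_likelihoods(tree)
value_now = expected_value(tree)
values_next_period = [expected_value(child) for _, child in tree]
```

Here the first alternative rolls back to 1400 and the second to -1000, so the value at "now" is 200 before any action is chosen.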
A Pruned Bush
Recomputation is needed at the next time phase. Re-assess as time marches forward!
Inputs: Msgs, sensors, spreadsheets, other simulations, databases, ...
[Figure: the tree of alternatives after "now" has moved forward along the time axis (past, now, future); eliminated alternatives are pruned and some node values are shown as "?"]
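Moving "now" forward can be sketched as selecting the alternative that actually occurred, discarding its siblings, and recomputing from the new root. The dictionary representation, the branch labels, and all numbers are assumptions for illustration.

```python
def expected_value(node):
    """Probability-weighted value of a node; non-dict nodes are end-state values."""
    if not isinstance(node, dict):
        return float(node)
    return sum(p * expected_value(child) for p, child in node.values())


def advance_now(node, observed):
    """Prune the bush: keep only the subtree of the alternative that occurred."""
    if observed not in node:
        raise KeyError(f"unknown alternative: {observed}")
    _, subtree = node[observed]
    return subtree


# Hypothetical bush: label -> (probability, subtree or end-state value).
bush = {
    "demand up": (0.5, {"high": (0.4, 2000), "low": (0.6, 1000)}),
    "demand down": (0.5, {"high": (0.5, 0), "low": (0.5, -2000)}),
}

before = expected_value(bush)          # assessment made at the old "now"
pruned = advance_now(bush, "demand up")
after = expected_value(pruned)         # re-assessment at the new "now"
```

In practice new messages and sensor data would also revise the probabilities inside the surviving subtree; this sketch only drops the eliminated branches.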
Even the present needs SimQL
Not all data are current:
• Is the delivery truck in X?
• Is the right stuff on the truck?
• Will the crew be at X?
• Will the forces be ready to accept delivery?
[Figure: time axis (past, now, future) showing last recorded observations, simple simulations to extrapolate data, and the point-in-time for situational assessment]
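A "simple simulation to extrapolate data" can be as small as dead reckoning from the last recorded observation. The sketch below assumes a truck moving at constant speed along a single route; the function names, units, and numbers are hypothetical.

```python
def extrapolate_position(last_km, last_time_h, speed_kmh, now_h):
    """Estimate the current position from the last recorded observation."""
    if now_h < last_time_h:
        raise ValueError("'now' precedes the last observation")
    return last_km + speed_kmh * (now_h - last_time_h)


def is_at(target_km, estimate_km, tolerance_km):
    """Answer 'is the truck in X?' within a stated tolerance."""
    return abs(target_km - estimate_km) <= tolerance_km


# Hypothetical: last seen at km 120 at 09:00, moving 60 km/h; it is now 10:30.
estimate = extrapolate_position(120.0, 9.0, 60.0, 10.5)
truck_in_x = is_at(200.0, estimate, tolerance_km=15.0)
```

The older the last observation, the wider the tolerance should be, which is the same growth of uncertainty with time that the likelihood trees show for the future.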
Integrative information systems: research questions
• What human interfaces can support the decision maker?
• How to move seamlessly from the past to the future?
• What system interfaces are good now and stay adaptable?
• How can multiple futures be managed (indexed)?
• How can multiple futures be compared, selected?
• How should joint uncertainty be computed?
• How can the NOW point be moved automatically?
SimQL research questions
• How little of the model needs to be exposed?
• How can defaults be set rationally?
• How should expected execution cost be reported?
• How should uncertainty be reported?
• Are there differences among application areas that require different language structures?
• Are there differences among application areas that require different language features?
• How will the language interface support effective partitioning and distribution?
Moving to a Service Paradigm
Interfaces define service potentials
• Server is an independent contractor, defines service
• Client selects service, and specifies parameters
• Server's success depends on value provided
• Some form of payment is due for services
Databases are a current example. Simulations have the same potential.
Summary of SimQL
A new service for Decision Making:
• follows database paradigm
  • (by about 25 years)
• coherence in prediction
  • displacement of ad-hoc practices
• seamless information integration
  • single paradigm for decision makers
• simulation industry infrastructure
  • investment has a potential market
  • should follow database industry model: interfaces promote new industries
Summary: Today decision making support is disjoint; each community improves its area and ignores others
[Diagram: Databases, Planning Science, Simulation, Distribution. Do not interoperate; extensions for network support are also disjoint]
The decision maker has few tools
• organized support
  • Data integration
  • Databases: distributed, heterogeneous
• disjointed support
  • Intuition +
  • Spreadsheets
  • Planning of allocations
  • Other simulations
  • various point assessments
[Figure: time axis: past, now, future]
Coda: Put relevant work together and move on
Support integration of results mined from past data, current observations, and predictions about the futures.
[Diagram: Decision Maker; Real Information Systems; Databases; Data Mining; Modeling tools; Simulation; Human interfaces; Service interfaces; Support Services]