
Profiting from Data Mining

Gio Wiederhold, November 2003


Presentation Transcript


  1. Profiting from Data Mining. Gio Wiederhold, November 2003

  2. Steps needed to profit (model based) • Obtaining relevant data (always incomplete) • Extracting relationships • Imputing causality • Finding applicability • Determining leverage points • Inventing candidate actions • Assessing likely outcomes and benefits • Selecting the action to be taken • Measuring the outcome → Collecting data for the next round

  3. Today's Problem: Disjointness • Database administrators: focus on data collection, organization, currency • Analysts: focus on slicing, dicing, relationships • Middle managers: focus on their costs, profits • MBAs: focus on business models, planning • Executives: must make decisions based on diverse inputs

  4. 1. Data Collection: two choices • (rare) Collect data specifically for analysis: allows careful design; can model causes and effects, e.g. Purchase = f(price, color, size, customer income, gender, ...); costly; often small, to keep collection manageable; imposes delays • (common) Use data collected for other purposes: takes advantage of what is readily available; low cost; needs filtering, reformatting, integration; incomplete (rarely covers all causes and effects); biased, with missing categories (only people with phones or cars, only people shopping in supermarkets)

  5. 1a. Data Integration: needed when sources have inadequate coverage (implies some expectations) • Data sit in distinct DBs for: prices and number purchased; customer segments (supermarket, stores, on-line) • Append attributes where keys match (e.g. Joe), and include semantic matches (Joe = 012 34 567) • Append rows where key types match (e.g. customer), and include semantic matches (customer = owner)
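The two integration operations on the slide above (append attributes where keys match, append rows where key types match) can be sketched in a few lines. This is only an illustration: the products, quantities, field names, and the alias table are hypothetical and do not come from the presentation.

```python
# Hypothetical source tables; all names and numbers are made up.
prices = {"milk": 1.20, "bread": 2.50}                 # keyed by product
units_sold = {"milk": 340, "bread": 125, "tea": 18}    # keyed by product

def append_attributes(left, right):
    """Append attributes where keys match: keep keys found in both sources."""
    return {key: (left[key], right[key]) for key in left if key in right}

# Semantic match: one source says "owner" where another says "customer".
FIELD_ALIASES = {"owner": "customer"}

def normalize(row):
    """Rename aliased fields so rows from different sources line up."""
    return {FIELD_ALIASES.get(name, name): value for name, value in row.items()}

def append_rows(*sources):
    """Append rows where key types match, after applying semantic matches."""
    rows = []
    for source in sources:
        rows.extend(normalize(row) for row in source)
    return rows

store_rows = [{"customer": "Joe", "channel": "store"}]
online_rows = [{"owner": "Ann", "channel": "on-line"}]

joined = append_attributes(prices, units_sold)
stacked = append_rows(store_rows, online_rows)
print(joined)
print(stacked)
```

Note that "tea" drops out of the joined result because it has no price: this is the incomplete coverage the slide warns about, and it stays invisible unless unmatched keys are reported separately.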

  6. 2. Data analysis • Find relationships • Already known: ignore or adjust in next round; requires comparison with expert knowledge; now have quantification • Unknown: uninteresting per expert, or interesting per expert

  7. 3. Establish causality • Already known: prior model; but is it complete, i.e., does it explain all effects? • Analyze relationships: use expertise to decide direction • Often obvious ("common world knowledge") • Sometimes ambiguous (smoking → cancer → not-smoking): use temporal information • Often the major true cause is not captured in the data. Example, purchase of Chinese vs. other food: food color 10%, food price 20%, buyer gender 2%, unknown 75%; guess: ethnicity, income; invent surrogates: names, ZIP codes

  8. Establishing causality is risky • 1. Is a Volvo a safe car? Mined: Volvos have fewer accidents • 2. What causes accidents? Drivers! • 3. Who buys Volvos? Careful drivers! • 4. Must determine: effect of safe drivers; percentage of safe drivers overall; percentage of safe drivers with Volvos • 5. How much of the accident rate is now explained? The unexplained difference can be attributed to the car.
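Steps 4 and 5 of the Volvo example can be worked through numerically. The sketch below uses entirely hypothetical rates and shares (none of them come from the presentation) and assumes drivers fall into just two groups, safe and risky.

```python
def expected_rate(share_safe, rate_safe, rate_risky):
    """Accident rate expected from the driver mix alone (car has no effect)."""
    return share_safe * rate_safe + (1.0 - share_safe) * rate_risky

# Hypothetical inputs, in accidents per 100 drivers per year.
RATE_SAFE, RATE_RISKY = 2.0, 8.0   # effect of safe drivers
SHARE_SAFE_OVERALL = 0.5           # share of safe drivers overall
SHARE_SAFE_VOLVO = 0.8             # share of safe drivers among Volvo owners
OBSERVED_VOLVO_RATE = 3.0          # the mined result for Volvos

overall_rate = expected_rate(SHARE_SAFE_OVERALL, RATE_SAFE, RATE_RISKY)
volvo_rate_from_drivers = expected_rate(SHARE_SAFE_VOLVO, RATE_SAFE, RATE_RISKY)

observed_gap = overall_rate - OBSERVED_VOLVO_RATE          # what mining found
explained_by_drivers = overall_rate - volvo_rate_from_drivers
attributable_to_car = observed_gap - explained_by_drivers  # the unexplained rest

print(f"observed gap:         {observed_gap:.1f}")
print(f"explained by drivers: {explained_by_drivers:.1f}")
print(f"attributable to car:  {attributable_to_car:.1f}")
```

With these made-up inputs, most of the gap between Volvos and the average car comes from who drives them, and only the small remainder can be credited to the car itself.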

  9. To use results of data mining, one has to understand the direction of relationships: change causes → create effects. [Diagram: a model linking controllable causes and external causes to interesting beneficial effects and side effects; some elements are captured by data, others are hidden.]

  10. 4. Causes provide the leverage (language of analyst / language of modeling) • Many causes (independent variables): a few may be controllable; some may be controlled by our competition; others are forces of nature • Even more effects (dependent variables): a few may be desired; some may be disastrous; many are poorly understood • Intermediate effects: provide a means for measuring effectiveness; allow correction of actions taken

  11. 5. Planning & Assessment • Analyze alternatives: current capabilities, future expectations • Process tasks: list resources; enumerate alternatives; prune alternatives; compare alternatives • Predict the future

  12. Prediction Requires Tools [Image credit: © E-mail this book, Alfred Knopf, 1997]

  13. Simulations predict • 1. Back-of-the-envelope: common; adequate if the model is simple; assumptions are easily forgotten after some time and not distinguished from data ("Why are we doing this?") • 2. Spreadsheets: most common computing tool; a specialist modeler can help; new, recent data can be pasted in; awkward for the tree of future alternatives • 3. Constructed to order: costly, powerful technology; specialist modelers required; expressive simulation languages; requires specialists to set up, run, and rerun with new data

  14. Simulation results: likelihoods • Next-period alternatives and subsequent periods • Uncertainty increases with time [Figure: tree of alternatives branching from "now" along the time axis, each branch labeled with a likelihood.]

  15. Simulation services: wide variety, but a common principle: Inputs → Model → Output (time, $, place, ...) • 1. Spreadsheets: identify independent, controllable, and resulting values • 2. Execution specific to query (what-if assessment): may require HPC power for adequate response • 3. Continuously executing (weather prediction): search for best match (location, time) • 4. Past simulation results collected for future use (tables in a design handbook: materials): typically sparse, because the dimension of the futures is too large; perform inter- or extrapolations to match query parameters
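Item 4 above, answering a query from a sparse table of stored simulation results, can be illustrated with plain linear interpolation. This is a minimal sketch, assuming a single query parameter and a hypothetical table of (parameter, result) pairs; real tables of futures have many dimensions, and linear extrapolation beyond the table is only a rough guess.

```python
from bisect import bisect_left

def interpolate(table, x):
    """Estimate a result for parameter x from sorted (parameter, result) pairs.

    Uses the two stored runs that bracket x; outside the table it extends the
    nearest segment (extrapolation). Needs at least two stored runs.
    """
    params = [p for p, _ in table]
    i = bisect_left(params, x)
    i = min(max(i, 1), len(table) - 1)   # clamp to a valid segment
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical stored runs: parameter value -> simulated result.
past_runs = [(10.0, 1.0), (20.0, 3.0), (40.0, 4.0)]

print(interpolate(past_runs, 15.0))   # between the first two runs
print(interpolate(past_runs, 30.0))   # between the last two runs
print(interpolate(past_runs, 50.0))   # beyond the table: extrapolated
```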

  16. 6. Specify Value of Effects. Still needed: value of alternative outcomes • Decision maker / owner input • Benefits and costs • Potential profit • Correct for risk, and adjust to present value [Figure: timeline (past, now, futures) with outcome values 1000, 2000, 5000, 1000, 0, -2000, -6000.]
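One common way to "correct for risk, and adjust to present value" is to weight a future value by its likelihood and discount it over the number of periods. The sketch below assumes exactly that reading; the presentation does not say which risk correction it has in mind, and the discount rate and probability are hypothetical.

```python
def risk_adjusted_present_value(value, probability, discount_rate, periods):
    """Probability-weighted future value, discounted back to now."""
    return probability * value / (1.0 + discount_rate) ** periods

# Hypothetical case: an outcome worth 5000, two periods out,
# reached with likelihood 0.4, discounted at 10% per period.
pv = risk_adjusted_present_value(5000, 0.4, 0.10, 2)
print(round(pv, 2))
```

Negative outcomes are handled the same way, so a loss far in the future weighs less than the same loss next period.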

  17. Having it all together • Relationships from analyses of past data • Data representing the current state • List of actionable alternatives • Tree of subsequent alternatives • Probabilities of those alternatives • Values of the outcomes • Ability to predict the likelihood of futures [Figure: tree of alternatives with branch likelihoods and outcome values 1000, 2000, 5000, 1000, 0, -2000, -6000.]

  18. Vision: Putting it all together. Combine results mined from past data, current observations, and predictions into the future. [Figure: decision maker and support specialists along a time axis.]

  19. Needed: information systems that also project seamlessly into the futures. Support of decision-making requires dealing with the futures as well as the past • Databases deal well with the past • Streaming sensors supply current status • Spreadsheets and simulations deal with the likely futures • Future information systems should combine all these sources [Figure: time axis marked past, now, future.]

  20. Connecting it all: two options • Build super systems: coherent, consistent; expensive; unmaintainable; too many cooks (database folk, data miners, analysts, planners, simulation specialists, decision makers) • Develop interfaces: incremental; composable as needed; heterogeneous • Interfaces required: metadata; database to miners: SQL; mined results to analysts: XML?; analysts to planners: ?; planners to simulations: SimQL?; decision makers: new tools!

  21. Interfaces enable integration. New: SimQL to access simulations • Past: databases and schemas, accessed via SQL or XML • Now: message systems and sensors, streaming data • Futures: simulations, accessed via SimQL and schema-compliant wrappers

  22. SimQL proof-of-concept implementation [Architecture diagram: a developer (development interaction) and a customer (production interaction) send help, schema, and query commands to a parser; a schema manager, a query manager, and a metadata manager handle the filing and use of access specs, the initiation and results of the wrapped simulations, metadata, and error reports.]

  23. Demonstration of SimQL: simple GUI; common language requirements • Test applications, each behind a wrapper: engineering simulation; business planning spreadsheets; shipping location database; weather on the Internet

  24. Information system use of simulation results • Simulation results are mapped to alternative courses of action • The information system should support the model driving the computation and recomputation of likelihoods • Likelihoods change as "now" moves forward and eliminates earlier alternatives [Figure: tree of alternatives over time, each branch labeled with a probability.]

  25. The likelihoods multiply out to the end-effects; then their values can be applied to earlier nodes. [Figure: tree from "now" over next-period alternatives and subsequent periods, with branch probabilities, end-effect values (1000, 2000, 5000, 1000, 0, -6000, -3000), and values rolled back to earlier nodes.]
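The computation described on this slide can be sketched as a small recursive roll-back over a tree of alternatives. The tree below is hypothetical (its probabilities and values are not the ones in the slide's figure), and representing a node as a list of (probability, child) pairs is an assumption made only for this illustration.

```python
def end_effects(node, likelihood=1.0):
    """Multiply likelihoods out along each path; yield (likelihood, value)."""
    if not isinstance(node, list):          # a leaf holds an outcome value
        yield likelihood, float(node)
        return
    for probability, child in node:
        yield from end_effects(child, likelihood * probability)

def rolled_back_value(node):
    """Apply end values to earlier nodes: probability-weighted sum of children."""
    if not isinstance(node, list):
        return float(node)
    return sum(p * rolled_back_value(child) for p, child in node)

# Hypothetical two-period tree: each node is a list of (probability, child).
tree = [
    (0.5, [(0.5, 1000), (0.5, 2000)]),
    (0.5, [(0.25, 4000), (0.75, -2000)]),
]

for likelihood, value in end_effects(tree):
    print(likelihood, value)
print("value at now:", rolled_back_value(tree))
print("value of each next-period alternative:",
      [rolled_back_value(child) for _, child in tree])
```

Comparing the rolled-back values of the next-period alternatives is what lets a decision maker rank actions, rather than just looking at the single figure for "now".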

  26. A Pruned Bush: re-assess as time marches forward! Recomputation is needed at the next time phase, drawing on messages, sensors, spreadsheets, other simulations, databases, ... [Figure: the tree after "now" has moved forward one period; eliminated alternatives are pruned and the values of the remaining nodes must be recomputed.]
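Moving "now" forward can be sketched as replacing the tree by the branch that actually occurred and recomputing the likelihoods of the remaining end-effects. As before, the tree and its list-of-(probability, child) representation are hypothetical; a real system would also refresh the probabilities from new messages and sensor data, which this sketch does not attempt.

```python
def leaf_likelihoods(node, likelihood=1.0, path=()):
    """Return {path: likelihood} for every end-effect reachable from node."""
    if not isinstance(node, list):
        return {path: likelihood}
    result = {}
    for index, (probability, child) in enumerate(node):
        result.update(
            leaf_likelihoods(child, likelihood * probability, path + (index,))
        )
    return result

def advance_now(tree, realized_index):
    """Prune the bush: keep only the subtree of the alternative that occurred."""
    _, subtree = tree[realized_index]
    return subtree

# Hypothetical tree: each node is a list of (probability, child) pairs.
tree = [
    (0.5, [(0.5, 1000), (0.5, 2000)]),
    (0.5, [(0.25, 4000), (0.75, -2000)]),
]

before = leaf_likelihoods(tree)
after = leaf_likelihoods(advance_now(tree, 1))   # second alternative occurred
print(before)
print(after)
```

The surviving end-effects become more likely once the competing branch is eliminated, which is why values stored at earlier nodes go stale as soon as time advances.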

  27. Even the present needs SimQL. Not all data are current: • Is the delivery truck in X? • Is the right stuff on the truck? • Will the crew be at X? • Will the forces be ready to accept delivery? Simple simulations extrapolate data from the last recorded observations to the point in time of the situational assessment.
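The question "Is the delivery truck in X?" shows what such a simple simulation might look like: extrapolate the last recorded observation to the assessment time. The sketch assumes a truck moving at constant speed along a one-dimensional route; the function names and all numbers are hypothetical.

```python
def extrapolate_position(last_km, speed_kmh, observed_at_h, now_h):
    """Estimate the position along the route at time now_h (constant speed)."""
    return last_km + speed_kmh * (now_h - observed_at_h)

def has_reached(destination_km, last_km, speed_kmh, observed_at_h, now_h):
    """Answer 'is the truck in X?' from stale data plus a simple model."""
    position = extrapolate_position(last_km, speed_kmh, observed_at_h, now_h)
    return position >= destination_km

# Hypothetical last observation: km 120 at 09:00, moving at 60 km/h.
estimate = extrapolate_position(120.0, 60.0, 9.0, 10.5)
print(estimate)                                    # estimated km at 10:30
print(has_reached(200.0, 120.0, 60.0, 9.0, 10.5))  # X assumed to be at km 200
```

The answer is only as good as the constant-speed assumption, so an information system would ideally report it together with an uncertainty that grows with the age of the observation.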

  28. Integrative information systems: research questions • What human interfaces can support the decision maker? • How to move seamlessly from the past to the future? • What system interfaces are good now and stay adaptable? • How can multiple futures be managed (indexed)? • How can multiple futures be compared and selected? • How should joint uncertainty be computed? • How can the NOW point be moved automatically?

  29. SimQL research questions • How little of the model needs to be exposed? • How can defaults be set rationally? • How should expected execution cost be reported? • How should uncertainty be reported? • Are there differences among application areas that require different language structures? • Are there differences among application areas that require different language features? • How will the language interface support effective partitioning and distribution?

  30. Moving to a Service Paradigm: interfaces define service potentials • Server is an independent contractor and defines the service • Client selects the service and specifies parameters • Server’s success depends on the value provided • Some form of payment is due for services • Databases are a current example; simulations have the same potential.

  31. Summary of SimQL. A new service for decision making: • follows the database paradigm (by about 25 years) • coherence in prediction: displacement of ad-hoc practices • seamless information integration: single paradigm for decision makers • simulation industry infrastructure: investment has a potential market • should follow the database industry model: interfaces promote new industries

  32. Summary: Today decision-making support is disjoint; each community improves its area and ignores others • Databases, planning science, simulation, distribution: do not interoperate • Extensions for network support are also disjoint

  33. The decision maker has few tools • Past (organized support): data integration over distributed, heterogeneous databases • Future (disjointed support): intuition plus spreadsheets, planning of allocations, other simulations, various point assessments [Figure: time axis marked past, now, future.]

  34. Coda: Put relevant work together and move on. Support integration of results mined from past data, current observations, and predictions about the futures. [Figure: real information systems connect the decision maker, through human interfaces and service interfaces, to databases, data mining, modeling tools, and simulation support services.]
