The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining • Presentation By: Ebeid Soliman & Mason Schoolfield • Paper Authored By: Menzies & Kocaganeli – Lane Dept of CS/EE, WVU; Bird, Zimmerman, & Schulte – Microsoft Research
Motivation • This paper reflects the authors’ applied data mining work and their discussions with researchers and software engineering practitioners • Goal: document methods and experience from industrial practitioners • The principal question is: what characterizes the difference between academic and industrial data mining? • Motivation: successful data-mining projects in industry
Inductive Software Engineering • “A branch of software engineering that focuses on the delivery of data mining based software applications to users” • Understand user goals in order to inductively generate the models that matter most to the user • Industrial practitioners focus on users, whereas academic data mining research focuses on algorithms
Industrial Data Mining: 7 Principles • Users before algorithms • Plan for scale • Early feedback • Be open-minded • Do smart learning • Live with the data you have • Broad skill set, big toolkit
Users before algorithms • Guiding principle: users before algorithms • Mining algorithms are only useful if users fund their use in real-world applications
Users before algorithms • Hallmarks of good interaction meetings • Users bring senior management to the meetings • Users keep interrupting (you or each other) and debating your results, which indicates that they understand your explanation of the results and that your results touch on issues that concern them • Users begin to offer more data sources for analysis • Users invite you to their workspace to show how to do part of the analysis
Plan for scale: Knowledge Discovery in Databases (KDD) • KDD: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data • Repetition required • Figure: steps that compose the KDD process (Fayyad 1996)
Plan for scale • Most data mining is data pre-processing • Gaining access to databases in business groups is time-consuming • To ensure repeatability, automate as many KDD steps as possible • Data mining methods are rerun many times, in order to: answer user questions; enhance the data mining method or fix bugs; deploy to different user groups
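One way to picture "automate as many KDD steps as possible" is to express each step as a small function and run them in a fixed order, so the whole chain can be rerun whenever a user asks a new question or a bug is fixed. The sketch below is only an illustration: the step names and the fields `loc` and `defects` are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def select(rows: List[Row]) -> List[Row]:
    # Selection: keep only rows that carry the target field (hypothetical "defects").
    return [r for r in rows if r.get("defects") is not None]


def clean(rows: List[Row]) -> List[Row]:
    # Pre-processing: drop rows with a missing size measure (hypothetical "loc").
    return [r for r in rows if r.get("loc") is not None]


def transform(rows: List[Row]) -> List[Row]:
    # Transformation: derive a new attribute from the existing ones.
    return [{**r, "defect_density": r["defects"] / r["loc"]}
            for r in rows if r["loc"] > 0]


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    # Apply each step in order; rerunning repeats exactly the same chain.
    for step in steps:
        rows = step(rows)
    return rows


KDD_STEPS: List[Step] = [select, clean, transform]
```

Because the steps live in one list, adding or swapping a step changes every later rerun in the same way, which is the repeatability the slide asks for.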
Plan for scale • Observed phases • Scout: rapid prototyping; apply many methods to the data, explore a range of hypotheses, and gain user interest (get feedback) • Survey: experiment to find stable models, focusing on user goals • Build: integrate the models into a deployment framework suitable for the target user base • Team size doubles after scouting and doubles again after surveying, which has time implications!
Early feedback • Simplicity first: before conducting very elaborate studies, try applying very simple tools to gain rapid early feedback • Get feedback early and often • Discretize continuous attributes (to determine what is ignorable)
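As an example of a very simple tool, continuous attributes can be discretized with equal-width binning. This is a minimal sketch under the assumption that equal-width bins are acceptable; the slides do not say which discretizer the authors use, and the function name is made up for illustration.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int = 3) -> List[int]:
    """Map each value to a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant attribute carries no information; everything lands in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

An attribute whose bins all show the same mix of outcomes is a candidate for being ignored in later, more elaborate studies.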
Be open-minded • Avoid a fixed hypothesis • Avoid a fixed approach, particularly for data that has not been mined before • Initial results are important and can change the goals
Smart Learning • Inductive agents, human or otherwise, make errors • Don’t torture the data to meet preconceptions, but it can be ok to go “fishing” • Important outcomes are riding on your conclusions: check and validate! • Check the variance before concluding; an apparent effect may be statistical noise • Check conclusion stability against different sample sizes • Check conclusion support to avoid conclusions based on a small percentage of the data
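The variance, stability, and support checks above could be sketched as follows. This is an illustrative sketch, not the authors' procedure: it assumes the "conclusion" is a mean over a numeric attribute, and the function names, repeat count, and seed are arbitrary choices.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple


def stability_report(values: Sequence[float], sample_sizes: List[int],
                     repeats: int = 20, seed: int = 1) -> Dict[int, Tuple[float, float]]:
    """For each sample size, return (mean, standard deviation) of the sample means."""
    rng = random.Random(seed)
    report = {}
    for n in sample_sizes:
        # Redraw a sample of size n several times and record its mean each time.
        means = [statistics.mean(rng.sample(list(values), n)) for _ in range(repeats)]
        report[n] = (statistics.mean(means), statistics.stdev(means))
    return report


def support(rows: Sequence[object], condition: Callable[[object], bool]) -> float:
    """Fraction of rows that a conclusion actually rests on."""
    return sum(1 for r in rows if condition(r)) / len(rows)
```

A large spread at small sample sizes suggests the conclusion may be noise, and a low support value flags a conclusion that rests on only a small share of the data.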
Smart Learning • Prevent spurious conclusions by carefully controlling data collection and focusing on a small space of hypotheses (if you can) • Rule learners such as RIPPER and INDUCT check each rule against randomly generated alternatives (if the probabilities are the same, the rule can be deleted)
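The idea of checking a rule against randomly generated alternatives can be illustrated with a simple randomization test. This is not the actual RIPPER or INDUCT procedure; it is a sketch that assumes binary 0/1 labels and compares a rule's precision with that of random row selections of the same size. The names `beats_random`, `trials`, and `alpha` are hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple


def rule_precision(rows: Sequence[object], labels: Sequence[int],
                   rule: Callable[[object], bool]) -> Tuple[float, int]:
    """Return (fraction of covered rows labeled 1, number of covered rows)."""
    covered = [lab for row, lab in zip(rows, labels) if rule(row)]
    if not covered:
        return 0.0, 0
    return sum(covered) / len(covered), len(covered)


def beats_random(rows: Sequence[object], labels: Sequence[int],
                 rule: Callable[[object], bool],
                 trials: int = 1000, seed: int = 1, alpha: float = 0.05) -> bool:
    """True if random selections of the same size rarely match the rule's precision."""
    precision, n = rule_precision(rows, labels, rule)
    if n == 0:
        return False  # a rule that covers nothing has no support at all
    rng = random.Random(seed)
    as_good = 0
    for _ in range(trials):
        # A "random alternative": pick n rows at random, ignoring their attributes.
        sample = rng.sample(list(labels), n)
        if sum(sample) / n >= precision:
            as_good += 1
    return as_good / trials < alpha
```

A rule for which `beats_random` returns False does no better than chance selection and is a candidate for deletion.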
Live with the data you have • Collecting data comes at a cost! • Go mining with the data you have, not the data you hope to have at a later date • Remove spurious data: conduct instance or feature selection studies • 80 to 90% of rows, and all but the square root of the number of columns, can be deleted before the performance of the learned model is compromised • Be respectful but doubtful of all user-suggested domain hypotheses
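The row and column figures above translate into simple arithmetic. The helper below is a hypothetical sketch that only computes how much data would remain; it does not decide which rows or columns to keep, which is the job of the instance and feature selection studies.

```python
import math
from typing import Tuple


def pruning_budget(n_rows: int, n_cols: int, row_drop: float = 0.8) -> Tuple[int, int]:
    """Return (rows to keep, columns to keep) after dropping a share of the rows
    and keeping roughly the square root of the number of columns."""
    keep_rows = max(1, round(n_rows * (1 - row_drop)))
    keep_cols = max(1, round(math.sqrt(n_cols)))
    return keep_rows, keep_cols
```

For a table of 1000 rows and 100 columns, dropping 80% of the rows leaves 200 rows, and keeping the square root of the column count leaves 10 columns.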
Broad skill set, big toolkit • Try multiple inductive technologies • Inductive engineers must generate novel and insightful feedback for users, whereas researchers can work to perfect a single algorithm • Big ecology: use tools supported by a large ecosystem of developers who are constantly building new modules (e.g. R, WEKA, MATLAB)
What does this mean for Industry? • Implications for Project Management • Scouting takes weeks, Surveying takes months, and Building takes years • Implications for Training • Communications skills • Results briefing • Scripting
Research to help Industry • Research themes to benefit industrial data mining • Analysis patterns for inductive engineers (like design patterns for developers) • Design pattern for data miners • Optimizations of learning algorithms • Anomaly detectors • Business-aware learners
Final Notes • Conclusion: be user-focused and keep these principles in mind • Hopefully these generalities will be helpful • Share your experiences and knowledge so that Industrial Inductive Engineering can mature