
DATA ACQUISITION: FOCUSING ON THE CHALLENGE


Presentation Transcript


  1. DATA ACQUISITION: FOCUSING ON THE CHALLENGE Gerald J. Hahn and Necip Doganaksoy Adjunct Faculty, RPI GE Global Research Presentation to 2003 Quality & Productivity Research Conference IBM Watson Research Center May 2003

  2. THE OBVIOUS, THE EXPECTATION, THE REALITY AND THE CHALLENGES
  • The Obvious
    • Statistical analyses are based upon data (and assumptions about sampled populations, etc.)
    • Such analyses are only as good as the data upon which they are based
    • Bad data lead to more complex, less powerful or invalid analyses
  • The Expectation: Much attention is given to the data acquisition process in training practitioners and statisticians
  • The Reality: Little attention is generally given to the data acquisition process in training practitioners and statisticians
  • The Challenge: Move data acquisition to the front burner
    • Understand limitations of observational data
    • Develop disciplined process for data acquisition
    • Emphasize data acquisition at all levels of training

  3. PROBLEMS WITH OBSERVATIONAL DATA
  • Problems with “available” databases
    • Data obtained for purposes other than statistical analysis
    • Data reside in different databases
    • Data purging
  • Problems with observational data
    • Missing variables, values and events
    • Unrepresentative (non-random) observations
    • Loss of traceability
    • Loss of timeliness
    • Inconsistent or imprecise measurements
    • Correlated variables and limited variability
  • Observation from the trenches (Kati Illouz, GE): Data owners tend to be overly optimistic about their data
  • Key point (not always recognized by practitioners): The quality, rather than the quantity, of the data is what counts
  • Data inadequacies help define future information needs
  • Two types of situations
    • Routine operations (e.g., process monitoring)
    • Special investigations (e.g., process optimization)

  4. PROCESS FOR DATA ACQUISITION (DEUPM) (in the spirit of Six Sigma)
  • Proposed process:
    • D: Define the problem
    • E: Evaluate the available data
    • U: Understand data acquisition opportunities and limitations
    • P: Plan data acquisition and implement
    • M: Monitor, clean data, analyze and validate
  • Example: Demonstrate, in 6 months, ten-year reliability for new washing machine design
  • Basic idea: Disciplined, targeted plan for data acquisition

  5. D: DEFINE THE PROBLEM
  • Steps: Define
    • Specific questions to be answered and resulting actions
    • Population or process of interest
    • Data we would ideally like to have (Wayne Nelson)
  • Washing machine design example:
    • Stated objective: Show within 6 months and with 95% confidence that the following goals can be met (see the sketch after this slide):
      • 95% reliability in first year of operation
      • 90% reliability after five years
      • 80% reliability after ten years (“reliability” defined as no repair or servicing need)
    • Resulting actions: Proceed with full scale production (if validated)
    • Further question: How can we improve?
    • “Population”: 6 million machines to be built in next 5 years
    • Ideal data: Field repair and servicing needs for 6 million future machines
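To make the three goals concrete: slide 8 assumes a Weibull life distribution with shape parameter 2.5. The following minimal sketch (an illustration added here, not part of the original slides) calibrates the Weibull scale to the ten-year goal and shows that, under that assumption, the ten-year goal is the binding one, which is why it drives the test plan.

```python
# Illustration only: slide 8 assumes a Weibull life model with shape 2.5;
# calibrating its scale to the hardest goal, R(10 yr) = 0.80, shows the
# one- and five-year goals are then met with room to spare.
import math

BETA = 2.5  # Weibull shape parameter assumed on slide 8

def weibull_scale(t, r, beta=BETA):
    """Characteristic life eta such that R(t) = exp(-(t/eta)^beta) = r."""
    return t / (-math.log(r)) ** (1.0 / beta)

def reliability(t, eta, beta=BETA):
    return math.exp(-((t / eta) ** beta))

eta = weibull_scale(10.0, 0.80)  # ~18.2 years
for t, goal in [(1, 0.95), (5, 0.90), (10, 0.80)]:
    print(f"R({t:2d} yr) = {reliability(t, eta):.4f}  (goal: {goal})")
# R(1 yr) ~ 0.999 and R(5 yr) ~ 0.961, both above their goals.
```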

  6. E: EVALUATE THE AVAILABLE DATA
  • Steps:
    • Understand the process and its physical basis
    • Analyze existing data
    • Ask: Are the available data sufficient (and if not, what is needed)?
  • Washing machine design example
    • Participate in design reviews, FMEAs (Failure Mode and Effects Analysis), etc.
    • Analyze (one such analysis is sketched after this slide)
      • In-house test results on previous designs
      • Field data on previous designs
      • Component and sub-assembly test results (e.g., motor testing)
    • Conclusions
      • Previous design does not meet current reliability goals
      • Proposed new design corrects many past problems
      • Possible concern: Introduction of new failure modes
      • Component and sub-assembly test results look promising
      • No information about system performance in realistic environment; need such information
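As one illustration of the "analyze existing data" step, here is a minimal sketch of fitting a Weibull model to right-censored field data by maximum likelihood. All data values are made up for the example, and working on the log scale simply keeps the optimizer in the positive parameter range.

```python
# Hypothetical sketch: maximum likelihood fit of a Weibull to right-censored
# field data from the previous design. All data values below are invented.
import numpy as np
from scipy.optimize import minimize

fail = np.array([1.2, 2.8, 3.5, 4.1, 4.9, 6.3])  # failure times (years)
cens = np.full(20, 5.0)  # 20 machines still running at 5 years (censored)

def neg_loglik(log_params):
    beta, eta = np.exp(log_params)  # log scale keeps beta, eta > 0
    zf = (fail / eta) ** beta       # failures contribute log density
    zc = (cens / eta) ** beta       # survivors contribute log survival
    ll = np.sum(np.log(beta / eta) + (beta - 1) * np.log(fail / eta) - zf)
    return -(ll - np.sum(zc))

res = minimize(neg_loglik, x0=np.log([1.5, 10.0]), method="Nelder-Mead")
beta_hat, eta_hat = np.exp(res.x)
print(f"shape = {beta_hat:.2f}, scale = {eta_hat:.1f} years")
print(f"estimated R(10 yr) = {np.exp(-(10 / eta_hat) ** beta_hat):.3f}")
```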

  7. U: UNDERSTAND DATA ACQUISITION OPPORTUNITIES AND LIMITATIONS
  • Steps: Gain understanding of
    • Data acquisition process, measurement error, etc.
    • Limitations in data acquisition
    • Limitations in inferencing
  • Washing machine design example
    • Data acquisition: Conduct in-house accelerated cycling of washing machines (the acceleration arithmetic is sketched after this slide)
      • Simulate 3.5 years of operation per month
      • Evaluate weekly for failures
      • Take apart at end of test and measure degradation
    • Limitations in data acquisition
      • 6 months of testing
      • 36 available test stands
      • 3 prototype lots
    • Limitations in inferencing
      • Assume prototype lots are from “same population” as high volume production
      • Assume failures, etc. are cycle dependent
      • Assume realistic simulation of field environment
  • Conclusion: This is an analytic (not enumerative) study; statistical confidence bounds only partially capture uncertainty
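The acceleration arithmetic implied by the slide (the 3.5 years-per-month rate is from the slide; the rest follows directly from it):

```python
# Acceleration arithmetic from the slide's stated rate.
YEARS_PER_TEST_MONTH = 3.5  # simulated field years per month on the stand

test_months = 6
print(f"{test_months} test months simulate "
      f"{YEARS_PER_TEST_MONTH * test_months:.0f} field years")
# -> 21 field years, comfortably past the 10-year goal horizon

print(f"10 field years take {10 / YEARS_PER_TEST_MONTH:.1f} test months")
# -> ~2.9 months, so each unit's 10-year outcome is observed well
#    before the 6-month deadline
```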

  8. P: PLAN DATA ACQUISITION AND IMPLEMENT
  • Steps: Develop and evaluate specific strategy, including
    • Testing conditions or operational environment
    • Sample size and selection process
    • Assessment of sampling plan
    • Testing protocol
    • Pilot study
  • Washing machine example
    • Testing conditions: Run washing machines with full load of soiled towels, mixed with sand, wrapped in plastic bag
    • Sample size: 12 units each from 3 prototype lots
    • After 3 months
      • Remove 4 units from each of 3 lots and measure degradation
      • Replace with 12 units from 4th lot
    • After 6 months: To have 95% probability of demonstrating 80% reliability after 10 years in field with 95% confidence requires actual reliability to be 95%, or sample size of 96 if actual reliability is 90% (assuming Weibull distribution with shape parameter of 2.5); a simulation sketch of this kind of calculation follows this slide
    • Specify protocol, including high-precision measurements, definition of failure, data recording requirements, replacement of failed units, etc.
    • Pilot study: Three washing machines run for one week
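Sample-size statements like the one above can be checked by simulation, in the spirit of the "analytical and simulation approaches" of slide 11. The sketch below is a simplified stand-in, not the authors' calculation: it treats each unit's 10-year survival as a binomial outcome and uses a nonparametric Clopper-Pearson bound, which is weaker than the Weibull (shape 2.5) analysis the slide assumes, so its simulated power comes out somewhat below the slide's figures.

```python
# Simplified Monte Carlo sketch of demonstration-test power (not the
# authors' Weibull-based calculation; see the caveat above).
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(1)

def demo_power(n, true_r10, target=0.80, conf=0.95, n_sim=100_000):
    """P(lower confidence bound on 10-yr reliability >= target)."""
    # The accelerated test simulates ~21 field years, so each unit's
    # 10-year survival is fully observed: survivors ~ Binomial(n, true_r10).
    survivors = rng.binomial(n, true_r10, size=n_sim)
    # Clopper-Pearson lower confidence bound on the survival probability.
    lower = beta_dist.ppf(1 - conf, survivors, n - survivors + 1)
    return float(np.mean(lower >= target))

for n, r in [(36, 0.95), (96, 0.90)]:
    print(f"n = {n:3d}, true R(10 yr) = {r}: power ~ {demo_power(n, r):.2f}")
```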

  9. M: MONITOR, CLEAN DATA, ANALYZE AND VALIDATE
  • Steps:
    • Clean data as gathered
    • Monitor to ensure that process is being followed
    • Conduct preliminary analyses; determine whether process need be changed
    • Conduct final analysis
    • Validate: Propose appropriate validation testing
  • Washing machine design example
    • Clean data: Develop proactive checks for missing or inconsistent data that automatically query data provider (a sketch of such checks follows this slide)
    • Monitor: Continued involvement
    • Analyze failure data after 1 week, 1 month and 3 months; identify problems for correction
    • Do final analyses after 6 months (failure and degradation data)
    • Validate: Propose added programs:
      • Continue 6 of 36 units on test beyond 6 months
      • Beta test 100 machines with company employees and 60 in laundromats
      • Audit sample 6 production units each week: Test five for 1 week; one for 3 months
      • Develop system for capturing and analyzing field reliability data
  DISCIPLINED, TARGETED DATA ACQUISITION PROCESS
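A minimal sketch of what such proactive, in-process checks might look like. This is not the authors' system; the column names and status codes are hypothetical, chosen for the washing machine test setting.

```python
# Hypothetical sketch of proactive in-process data checks: each weekly
# upload is screened, and suspect records are routed back to the data
# provider for query. Column names and status codes are invented.
import pandas as pd

def screen_weekly_upload(df: pd.DataFrame) -> pd.DataFrame:
    """Return suspect records, each tagged with a reason for the query."""
    issues = []
    # Missing values in required fields.
    required = ["unit_id", "lot", "week", "cycles_run", "status"]
    missing = df[df[required].isna().any(axis=1)].copy()
    missing["issue"] = "missing required field"
    issues.append(missing)
    # Inconsistent values: a unit's cycle count must never decrease.
    df_sorted = df.sort_values(["unit_id", "week"])
    bad_cycles = df_sorted[
        df_sorted.groupby("unit_id")["cycles_run"].diff() < 0
    ].copy()
    bad_cycles["issue"] = "cycle count decreased"
    issues.append(bad_cycles)
    # Out-of-range values: status must be a known code.
    bad_status = df[~df["status"].isin(["running", "failed", "removed"])].copy()
    bad_status["issue"] = "unknown status code"
    issues.append(bad_status)
    return pd.concat(issues, ignore_index=True)

# Usage: suspects = screen_weekly_upload(weekly_df)
# then send `suspects` back to the data provider for resolution.
```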

  10. TEACHING DATA ACQUISITION: PROPOSAL
  • Preferred: Course in data acquisition as second required course in statistics for practitioners and aspiring statisticians
  • Compromise: Devote one third of one-semester introductory course to data acquisition
  • Industrial: Devote one third of short courses to data acquisition
  • In addition: Discuss data acquisition process and challenges in all data analysis examples
  P.S. Most courses on design of experiments and survey sampling cover only the tip of the iceberg and are offered to a limited audience

  11. PROPOSED COURSE IN DATA ACQUISITION: OUTLINE
  • Motivation: Need for good data and limitations of observational studies
  • Key concepts
    • Populations, sampling frames, processes, random (and other) samples
    • Analytic versus enumerative studies
    • Measurement error
  • Disciplined, targeted process for data acquisition (and examples)
  • Some formal approaches:
    • Design of experiments (including factorial, fractional factorial, response surface)
    • Survey sampling (including questionnaire construction, non-response problems)
  • Data acquisition systems (e.g., for SPC, field reliability, student performance assessment)
  • Some special studies and situations (e.g., life testing, dosage studies, attribute y’s)
  • Data acquisition as a learning process (Box et al.)
  • Graphical data analyses
  • Sample size determination: Analytical and simulation approaches
  • In-process data cleaning
  • Statistics in the news: Data acquisition considerations (Source: Laurie Snell, Chance News, www.dartmouth.edu/~chance/)
  • Student-generated studies and critiques (Source: Bill Hunter, 1977 American Statistician article “Some Ideas about Teaching Design of Experiments”)

  12. ELEVATOR SPEECH
  • We need to put the horse (data acquisition) before the CART (data analysis)
  • Specific proposals
    • Formal process for data acquisition
    • High focus on training, including required course on data acquisition
  • To analyze data is human; to plan to gather the right data is divine
  P.S.
  • For copies of slides, contact gerryhahn@yahoo.com
  • Comments based upon chapter from Statistics in the Corporate World—Connecting the Dots (tentative title), to be published in 2004 (we hope) by Wiley; your inputs invited!
