
Humans in the Loop: Testing for Human Factors in System Performance with Power


Presentation Transcript


    1. 1

    2. 2 Bottom Line Up Front Test design is not an art -- it is a science Talented scientists in the RAT enterprise, however, have limited knowledge in test design: alpha, beta, sigma, delta, p, & n Our decisions are too important to be left to professional opinion alone -- our decisions should be based on mathematical fact 53d Wg, AFOTEC, AFFTC, and AAC experience Teaching DOE as a sound test strategy is not enough -- leadership from USAF senior executives is key Purpose: USAF adopts experimental design as the default approach to test, wherever it makes sense*

    3. 3 Background -- Greg Hutto B.S. US Naval Academy, Engineering - Operations Analysis M.S. Stanford University, Operations Research USAF Officer -- TAWC Green Flag, AFOTEC Lead Analyst Consultant -- Booz Allen & Hamilton, Sverdrup Technology Mathematics Chief Scientist -- Sverdrup Technology Wing OA and DOE Champion -- 53rd Wing, now 46 Test Wing USAF Reserves -- Special Assistant for Test Methods (AFFTC/CT) and Master Instructor in DOE for USAF TPS Practitioner, Design of Experiments -- 18 Years

    4. 4 Overview Deep & Broad Tests The C-17 Rhyme Test What's the BEST method of test? DOE Fables for the HF Engineer How to deploy DOE to revolutionize test? Summary

    5. 5 A beer and a blemish 1906: W.S. Gosset, a Guinness chemist Draw a yeast culture sample -- how much yeast in this culture? Guess too little -- incomplete fermentation; too much -- bitter beer He wanted to get it right 1998: Mike Kelly, an engineer at a contact lens company Draw a sample from a 15K lot -- how many defective lenses? Guess too little -- mad customers; too much -- destroy good product He wanted to get it right

    6. 6 The central test challenge In all our testing we reach into the bowl (reality) and draw a sample of operational performance Consider a new cargo intercom comm system Suppose a historical 80% intelligibility rate Is the new intercom at least as good? We don't know in advance which bowl God hands us: The one where the system works, or The one where the system doesn't

    7. 7 AFFTC Modified Rhyme Test Problem: grade speech intelligibility for A/C comm 5 talker/listener pairs, each reading 50 words = 250 words Criterion: 80%+ recognition What do we think of this sample size?
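    One way to judge that sample size: under a simple binomial model (each of the 250 words scored correct/incorrect, with p the true intelligibility), we can check how the criterion behaves. A minimal Python sketch using scipy, assuming "pass" means at least 80% of 250 = 200 correct words (a reading of the criterion, not stated explicitly on the slide):

        from scipy.stats import binom

        n, crit = 250, 200   # 250 words; pass if >= 200 (80%) correct

        # A system exactly at the 80% spec fails roughly half the time,
        # because the cutoff sits right at the distribution's mean.
        print(binom.cdf(crit - 1, n, 0.80))   # ~0.49

        # A system degraded to 70% almost never sneaks past the cutoff.
        print(binom.sf(crit - 1, n, 0.70))    # well under 1%

    So 250 words has strong power against a large degradation, but a system sitting exactly at spec is a coin flip; the slides that follow make this trade explicit.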

    8. 8 Comm is key in C-17 Big Aircraft!

    9. 9 Start -- Blank Sheet of Paper Let's draw a sample of _n_ shots How many is enough to get it right? 3, because that's how much $/time we have 8, because I'm an 8-guy 10, because I'm challenged by fractions 30, because something good happens at 30! Let's start with 50 words and see

    10. 10 A false positive -- declaring Comm degraded (when it's not) -- α Suppose we fail the Comm system when it has 36 or fewer correct words We'll be wrong (on average) about 10% of the time We can tighten the criterion (fail on 37) at the cost of failing to field more good comm We can loosen the criterion (fail on 35) at the cost of missing real degradations Let's see how often we miss such degradations
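    The 10% figure is reproducible from the binomial model; a sketch (the fail-at-36-or-fewer rule is taken straight from the slide, and scipy's binom.cdf does the tail sum):

        from scipy.stats import binom

        # False-positive risk (alpha): the system truly meets the 80% spec,
        # yet by bad luck scores 36 or fewer of 50 words, so we fail it.
        print(binom.cdf(36, 50, 0.80))   # ~0.10 -- "about 10% of the time"

        # Moving the cutoff trades this error against the other kind:
        print(binom.cdf(37, 50, 0.80))   # tighter criterion, larger alpha
        print(binom.cdf(35, 50, 0.80))   # looser criterion, smaller alpha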

    11. 11 A false negative -- we field Comm (when it's degraded) -- β Use the failure criterion from the previous slide If we field Comm with 37 or more correct words, we fail to detect the degradation If comm has degraded, then with n = 50 words we're wrong about 35% of the time We can, again, tighten or loosen our criterion, but at the cost of increasing the other error
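    The same model gives the ~35% figure, assuming the degraded state is 70% intelligibility (the 10-point delta used two slides ahead):

        from scipy.stats import binom

        # False-negative risk (beta): intelligibility has dropped to 70%,
        # yet the system scores 37+ of 50 words, so we field it anyway.
        print(binom.sf(36, 50, 0.70))   # P(X >= 37) -- roughly a third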

    12. 12 We seek to balance our chance of errors Combining, we can trade one error for the other (α for β) We can also increase sample size to decrease our risks in testing These statements are not opinion -- they are mathematical fact, and an inescapable challenge in testing There are two other ways out: factorial designs and real-valued MOPs

    13. 13 A Drum Roll, Please For α = β = 10% and δ = 10% degradation in Correct Words, N = 121!
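    That N falls out of the standard normal-approximation sample-size formula for a one-sample proportion test; a sketch, assuming the baseline is p0 = 0.80 and the degraded state is p1 = 0.70:

        from math import ceil, sqrt
        from scipy.stats import norm

        def n_required(p0, p1, alpha, beta):
            # Smallest n for a one-sided binomial test of p0 vs p1 to hold
            # both error rates, via the usual normal approximation.
            z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
            num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
            return ceil((num / abs(p0 - p1)) ** 2)

        print(n_required(0.80, 0.70, 0.10, 0.10))   # 121 -- the drum roll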

    14. 14 Recap -- First Challenge Challenge 1: the effect of sample size on errors -- Depth of Test So -- it matters how many we do, and it matters what we measure There is a 2nd challenge -- Breadth of Test: searching the employment battlespace

    15. 15 Challenge 2: Broader -- How Do Designed Experiments Solve This? Statistician G.E.P. Box said: "All math models are false, but some are useful. All experiments are designed -- most, poorly."

    16. 16 Battlespace Conditions for the C-17 Comm Case Seeking a new transport comm system Operational Question: Does comm still perform at the 80% intelligibility level?

    17. 17 Challenge 2: Systematically Search the Relevant Battlespace Factorial (crossed) designs let us learn more from the same number of assets We can also use Factorials to reduce assets while maintaining confidence and power Or we can combine the two
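    For concreteness, a crossed design simply runs every combination of factor levels, so each trial informs every factor's effect at once. A minimal sketch with three hypothetical comm-test factors (the factor names are illustrative, not the C-17 test's actual conditions):

        from itertools import product

        # Hypothetical two-level factors for a comm intelligibility test.
        factors = {
            "station":  ["cockpit", "troop door"],
            "altitude": ["low", "high"],
            "noise":    ["engines idle", "engines max"],
        }

        # Full factorial: 2 x 2 x 2 = 8 runs, each contributing to the
        # estimate of all three factor effects simultaneously.
        for i, run in enumerate(product(*factors.values()), 1):
            print(i, dict(zip(factors, run)))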

    18. 18 Actual Comm Test -- A Success The system failed because of the troop door talkers An SME was required to analyze the implications and reasons for this When the troop door is factored out, the system performs at 83% This does not pass the system; it just identifies the source and magnitude of the problem
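    That is the payoff of crossing factors: the analysis can break intelligibility out by condition and isolate the weak cell. A sketch with purely hypothetical scores (the test's real data are not reproduced here):

        # Hypothetical per-run intelligibility fractions, keyed by talker
        # station; a by-station breakdown exposes which condition drags
        # the overall average below the 80% criterion.
        scores = {
            "cockpit":    [0.84, 0.86, 0.83],
            "loadmaster": [0.82, 0.85, 0.84],
            "troop door": [0.55, 0.60, 0.58],   # the problem cell
        }

        # Overall mean dips below 0.80 even though two of the three
        # stations individually exceed it.
        overall = [s for v in scores.values() for s in v]
        print("overall:", round(sum(overall) / len(overall), 2))
        for station, vals in scores.items():
            print(station, round(sum(vals) / len(vals), 2))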

    19. 19 Historical DOE Timeline

    20. 20 Many Strategies of Test have been tried... Which is best? Remember the large test space: 2*2*2*...*2 (nine two-level factors) = 2^9 = 512 combos
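    For scale, nine two-level factors make 512 distinct conditions, yet a classical fractional factorial covers all nine in 16 runs. A sketch (the 2^(9-5) construction below is a textbook resolution-III design, not one taken from this briefing):

        from itertools import product

        # Full space: nine factors, two levels each.
        print(len(list(product([-1, +1], repeat=9))))   # 512 combos

        # 2^(9-5) fraction: 16 runs of 4 base factors (a..d), with the
        # other five columns built as interaction products, so all nine
        # main effects remain estimable (aliased only with interactions).
        design = [(a, b, c, d, a*b*c, a*b*d, a*c*d, b*c*d, a*b*c*d)
                  for a, b, c, d in product([-1, +1], repeat=4)]
        print(len(design))   # 16 runs covering all nine factors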

    21. 21 Pros and Cons of Various Test Strategies DOE -- the only way we know to test broadly and deeply with power!

    22. 22 So How to Test? The Best Way: Scientifically We understand guidance, aero, mechanics, materials, physics, electromagnetics DOE introduces the Science of Test

    23. 23 B-1 Radar Mapping in One Mission (53d Wg)

    24. 24 It applies to our tests: DOE in 36+ operations over 18 years: IR Sensor Predictions; Ballistics 6 DOF Initial Conditions; Wind Tunnel fuze characteristics; Camouflaged Target JT&E ($30M); AC-130 40/105mm gunfire CEP evals; AMRAAM HWIL test facility validation; 60+ ECM development + RWR tests; GWEF Maverick sensor upgrades; 30mm Ammo over-age LAT testing; Contact lens plastic injection molding; 30mm gun DU/HEI accuracy (A-10C); GWEF ManPad Hit-point prediction; AIM-9X Simulation Validation; Link 16 and VHF/UHF/HF Comm tests; TF radar flight control system gain opt; New FCS software to cut C-17 PIO; AIM-9X+JHMCS Tactics Development; MAU 169/209 LGB fly-off and eval; Characterizing Seek Eagle Ejector Racks; SFW altimeter false alarm trouble-shoot; TMD safety lanyard flight envelope; Penetrator & reactive frag design; F-15C/F-15E Suite 4 + Suite 5 OFPs; PLAID Performance Characterization; JDAM, LGB weapons accuracy testing; Best Autonomous seeker algorithm; SAM Validation versus Flight Test; ECM development ground mounts (10s); AGM-130 Improved Data Link HF Test; TPS A-G WiFi characterization; MC/EC-130 flare decoy characterization; SAM simulation validation vs. live-fly; Targeting Pod TLE estimates; Chem CCA process characterization; Medical Oxy Concentration T&E; Multi-MDS Link 16 and Rover video test

    25. 25 Four DOE Stories for the HF TAG Human Performance: Rating Pilot-Induced Oscillation in the C-17; Requirements: SDB II Build-up SDD Shot Design; Acquisition: F-15E Suite 4E+ OFP Qualification; Test: Combining Digital-SIL-Live Simulations

    26. 26 C-17 PIO Test -- New Flight Control SW (Aug 05) Test Objective: the C-17 is notorious for PIO in tracking tasks New flight control software to improve behavior Difficult to measure -- pilot feel is critical (the "black ice" feeling)

    27. 27 Testing to Diverse Requirements: SDB II Shot Design (Fall 07) Test Objective: the SPO requests help -- are 46 shots the right N? Power analysis -- what can we learn? Consider Integrated Test with AFOTEC What are the variables? We do not know yet -- how can we plan? What management reserve?
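    The power-analysis question can be asked directly: given 46 shots, how large a degradation can the test reliably detect? A sketch reusing the same normal-approximation machinery, with a purely assumed 80% baseline (the slide does not state the actual requirement):

        from math import sqrt
        from scipy.stats import norm

        def detectable_drop(n, p0, alpha, beta):
            # Smallest drop from baseline p0 that n binary trials detect
            # with one-sided risks alpha and beta (normal approximation).
            z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
            for d in (k / 1000 for k in range(1, 500)):
                p1 = p0 - d
                if sqrt(n) * d >= z_a * sqrt(p0*(1-p0)) + z_b * sqrt(p1*(1-p1)):
                    return d

        # With 46 shots and 10% risks, only a drop of roughly 17 points
        # from an assumed 80% baseline is reliably detectable -- useful
        # context for deciding whether 46 is the "right N."
        print(detectable_drop(46, 0.80, 0.10, 0.10))   # ~0.17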

    28. 28 Acquisition: F-15E Strike Eagle Suite 4E+ (circa 2001-02) Test Objectives: Qualify the new OFP suite for Strike Eagles with new radar modes, smart weapons, Link 16, etc. The test must address dumb weapons, smart weapons, comm, sensors, nav, air-to-air, CAS, Interdiction, Strike, ferry, and refueling The Suite 3 test required 600+ sorties

    29. 29 Strike Weapons Delivery Cases -- Design Improved with DOE

    30. 30 Test: Reduce F-16 Ventral Fin Fatigue from Targeting Pod (Winter 07) Test Objective: blah

    31. 31 A Strategy to be the Best Using Design of Experiments Inform RAT leadership of statistical thinking for test Adopt the most powerful test strategy (DOE) Train & mentor total RAT teams -- a combo of AFIT, Center, & University Revise AF RAT policy and procedures Share these test improvements

    32. 32 53d Wing Policy Model: Test Deeply & Broadly with Power & Confidence From the 53d Wing Test Manager's Handbook: "While this [list of test strategies] is not an all-inclusive list, these are well suited to operational testing. The test design policy in the 53d Wing supplement to AFI 99-103 mandates that we achieve confidence and power across a broad range of combat conditions. After a thorough examination of alternatives, the DOE methodology using factorial designs should be used whenever possible to meet the intent of this policy."

    33. 33 We Train the Total Test Team -- but first, our Leaders OA/TE Practitioner Series: 10 sessions, 1 week each Reading--Lecture--Seatwork Basic Statistics Review (1 week) Random Variables and Distributions Descriptive & Inferential Statistics Thorough treatment of the t Test Applied DOE I and II (1 week each) Advanced undergraduate treatment Graduates know both how and why

    34. 34 DOE Initiative Status Gain AAC Commander endorsement and policy announcement Train and align leadership to support initiative Commence training technical testers Launch multiple quick-win pilot projects to show applicability Communicate the change at every opportunity Gain AAC leadership endorsement and align client SPOs Influence hardware contractors with demonstrations and suitable contract language Keep pushing the wheel to build momentum Confront nay-sayers and murmurers Institutionalize with promotions, policies, Wing structures and practices Roll out to USAF at large and our Army & Navy brethren

    35. 35 But Why Should You Pursue DOE? Why: It's the scientific, structured, objective way to build better tests DOE is faster, less expensive, and more informative than alternative methods Uniquely answers the deep-and-broad challenge: Confidence & Power across a broad battlespace Our less-experienced testers can reliably succeed Why not: "If DOE is so good, why haven't I heard of it before? Aren't these ideas new and unproven?"
