Ensuring customer acceptance of audio quality through subjective tests before production: test design, statistical reliability, conversational tasks, and assessment methods for reliable results.
Subjective Sound Quality Assessment of Mobile Phones for Production Support
Thorsten Drascher, Martin Schultes
Workshop on Wideband Speech Quality in Terminals and Networks: Assessment and Prediction, 8th and 9th June 2004, Mainz, Germany
Introduction
• The goal of the tests presented in this talk is to ensure customer acceptance of audio quality, backed by statistically validated data.
• Customers rate the sum of echo cancellation, noise reduction, automatic gain control, …
• This conflicts with the ancillary conditions of short time (no waste of production capacity) and low cost.
• Objective measurements correlate only to a limited degree with subjective sound perception.
• Therefore, subjective audio quality tests are executed before the release for unrestricted serial production.
• Former results were often not reliable, due to friendly users and too few tests to guarantee statistical validity.
Subjective Audio Quality Assessment, June 2004
Presentation Outline
• Test Design
• Laboratory or in-situ tests?
• Laboratory test design
• Conversational task
• Statistical reliability
• First Test Presentation
• Overall Quality
• Most Annoying Properties
• Discussion & Outlook
Test Design
Typical conversation situations for a mobile phone:
• Single talk
• Double talk
Two different test subject groups:
• Naive users
• Expert users
Different recommended test methods:
• Absolute category rating
• Comparison category rating
• Degradation category rating
• Threshold method
• Quantal-response detectability tests
Test Design (ctd.)
• Naive user tests will be carried out as single talk and double talk.
Test procedure:
• Naive user tests: absolute category rating of overall quality and collection of the most annoying properties.
• Trained user tests: comparison category rating of different parameter sets on the most annoying properties (with further parameter alteration in parallel).
• Evaluation: if the results are satisfying, release for unrestricted serial production; otherwise, repeat with altered parameters.
Laboratory or in-situ tests?
In-situ:
• Nothing is more real than reality
• More interesting for test subjects
• Large effort
• Difficult to control
• Time-intensive
Laboratory:
• Good control
• Small effort
• Reproducible conditions
• Easy control of environmental conditions
• Some effects have to be neglected
• Psychological influence of the laboratory environment on test results
• Laboratory tests are much more cost-effective than in-situ tests.
• But: how closely can reality be rebuilt in the laboratory?
• There should be at least one comparison between laboratory and in-situ tests.
Laboratory test design
• Reproducible playback of previously recorded environmental noises (car noise, babble noise, silence) as a diffuse sound field.
• Terminal A: fixed network, handheld, specified, silent office environment (e.g. according to ITU-T P.800).
• Terminal B: mobile phone or car kit under test.
• Single and double talk tests are carried out using different noise levels.
• Roles within the tests are interchanged.
• Rating interview with both test subjects.
Conversational Tasks
• Properties of short conversation test scenarios (SCTs):
• Typical conversation tasks: ordering pizza, booking a flight
• The conversation lasts about 2½ min, extended to about 4 min by the following interview
• SCTs are judged as natural by test subjects
Formal structure (caller / called person): greeting, enquiry, question, precision, offer, order, information, treatment of the order, discussion of open questions, farewell [S. Möller, 2000]
Statistical Reliability
• The moments of interest are the mean and the error of the mean.
• The error of the mean is a function of the standard deviation.
• Worst-case approximation: the error of the mean is maximised if the supreme and inferior ratings are each given with a relative frequency of 50%.
• An error of the mean of less than 10% of the rating interval width is then guaranteed after 30 tests.
• 30 tests of 4 min each result in an overall test duration of 2 hours.
• Tests with 3 different background noises at 3 different levels, plus a silent environment, over 2 different networks can be carried out in 40 h (1 week).
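The 30-test bound above can be reproduced with a short calculation. Assuming ratings are confined to an interval of width W and, in the worst case, half of the subjects pick the bottom of the scale and half pick the top, the standard deviation is W/2 and the standard error of the mean is W/(2·√n). This is a sketch of that reasoning; the deck does not spell out the formula.

```python
import math

def worst_case_sem_fraction(n: int) -> float:
    """Worst-case standard error of the mean, as a fraction of the
    rating interval width W: half the subjects rate 0, half rate W,
    so the standard deviation is W/2 and the SEM is W / (2 * sqrt(n))."""
    return 0.5 / math.sqrt(n)

# After 30 tests the worst-case error drops just below 10% of the
# interval width (0.5 / sqrt(30) is roughly 0.091).
print(worst_case_sem_fraction(30))
```

Note that 30 is close to the smallest n for which the bound holds: with 25 tests the worst-case fraction is exactly 10%, so a couple of tests fewer would already violate it.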
First Test Presentation
• Internal fair at the beginning of May
• Not representative; just "testing the test"
• Background: babble noise at ~70 dB(A)
• Terminal under test:
• Known to be too silent (not known to the test subjects or the experimenter)
• Development concluded
• Interview only for the mobile terminal user (19 subjects)
• Naive user tests with two questions:
• What is your opinion of the overall quality of the connection you have just been using?
• What were the most annoying properties of the connection you have just been using?
• Results given as:
• Numbers on a scale from 0 to 120
• Predefined answers without technical terms (adding new ones was possible)
Overall Quality
• Rating scale: Bad – Poor – Fair – Good – Excellent, recorded internally as 0 to 120; the numbers were invisible to the test subjects.
• Average overall rating: 74 ± 4, i.e. (62 ± 3)% of the rating interval width.
• The start value of 60 has the highest relative frequency.
• To compare the internal scale with standard MOS ratings, a normalisation is required.
Overall Quality (MOSc)
• MOSc: MOS rating intervals (1–5, Bad to Excellent) with the scale labels in the centre of each interval.
• Extreme value 5 was rated 5 times (>25%).
• Extreme value 1 was never assigned.
• Average overall rating: 3.8 ± 0.2, i.e. (70 ± 5)% of the rating interval width.
Overall Quality (MOSl)
• MOSl: MOS rating intervals (1–5, Bad to Excellent) with the scale labels at the lower end of each interval.
• The complete range is used.
• Extreme value 5 was rated twice.
• Average overall rating: 3.3 ± 0.2, i.e. (58 ± 5)% of the rating interval width.
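The two normalisations can be sketched as binning rules mapping the internal 0–120 scale onto MOS categories 1–5. This is only one plausible reading: it assumes that MOSc uses five equal 24-point bins with each label printed at the bin centre, while MOSl prints the labels at the ticks 0, 30, 60, 90 and 120 and assigns each rating to the highest label at or below it. The deck does not give the exact mapping, so the functions and bin edges below are assumptions for illustration.

```python
def mos_center_bin(x: float) -> int:
    """MOSc (assumed): five equal 24-point bins on the 0..120 scale,
    label printed at the centre of each bin; a rating falls into the
    bin that contains it."""
    return min(5, int(x // 24) + 1)

def mos_lower_bin(x: float) -> int:
    """MOSl (assumed): labels printed at the ticks 0, 30, 60, 90, 120;
    a rating is assigned to the highest label at or below it."""
    return min(5, int(x // 30) + 1)

# The same internal rating can land on different MOS values under the
# two schemes, e.g. a rating of 25:
print(mos_center_bin(25), mos_lower_bin(25))
```

Under this reading a rating of 25 is "Poor" (2) under MOSc but still "Bad" (1) under MOSl, which would be one way to explain why the full range is used under MOSl only; it also makes category 5 harder to reach under MOSl (x ≥ 120 rather than x ≥ 96), consistent with the lower count of top ratings.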
Most Annoying Properties
Predefined and added answers:
• My partner's voice was too silent
• Loud noise during the call
• I heard my own voice as an echo
• My partner's voice was reverberant
• My partner's voice sounded robotic
• I heard artificial sounds
• *My partner's voice sounded modulated
• *My partner's voice was too deep
• I heard my partner's voice as an echo
• My partner's voice was too loud
*) Properties added during the test
[Bar chart of response counts: 9, 8, 1, 1, 1, 1, 1, 1]
• About 50% of the test subjects regarded the partner's voice as too silent (known before, but not to the subjects or the experimenter).
• 7 of 8 test subjects regarded the environmental noise as an annoying property.
Discussion & Outlook
• A short, intensive subjective test method and a first test were presented.
• After ratings by 19 test subjects:
• the error of the mean overall quality was assessed at about 3% of the rating interval width
• the terminal's being too silent was statistically confirmed
• Questions and predefined answers have to be chosen very carefully.
• Normalising scale ratings to MOS is a non-trivial problem.
• Next steps:
• Comparison of laboratory and in-situ tests
• Tests of terminals and car kits currently in development