Evaluation of Objective Quality Estimators: Methods used with Voice Models & Implications for Video Testing. Leigh Thorpe Nortel CTO Services Group VQEG Ottawa Meeting, Sept 10-14, 2007. Overview. Evaluation goals and analysis approach Database characteristics
Evaluation of Objective Quality Estimators: Methods used with Voice Models & Implications for Video Testing
Leigh Thorpe, Nortel CTO Services Group
VQEG Ottawa Meeting, Sept 10-14, 2007
Overview
• Evaluation goals and analysis approach
• Database characteristics
• Subjective results: internal consistency
• Specific characteristics of measurement: resolution & performance on specific types of impairment
Evaluation of Measurement Models
Want to understand how well the model predicts quality as rated by users.
Need to assess performance against an evaluation database:
1. How close are the predictions for a set of test cases to the subjective ratings for those same cases?
2. Does the model differentiate neighbouring points in the correct direction?
Interested in three aspects of performance:
• Accuracy: is the model good at predicting the subjective rating?
• Resolution/Monotonicity:
Three methods of analysis
(1) Graphical: scatterplot and regression line
• plot subjective scores on the x-axis, objective measure on the y-axis
• the spread of dots shows visually how closely the variables track each other and how close their relationship is to the ideal (the main diagonal)
• by inspection, can see how subgroups behave compared to overall performance
(2) The correlation coefficient
• r, the Pearson Product-Moment Correlation, measures the strength of the linear relationship: the tendency for two variables to increase or decrease together
• does not indicate how close the values of the two variables are
• perfect correlation gives r = 1 or −1; no relationship gives r = 0
• the measurement units for the two variables may be the same or different
• the number of points and the dynamic range of the variables (difference from highest to lowest) will each affect the value of the correlation coefficient
(3) The Standard Error of Estimates (SEE)
• a measure of deviation of the dependent variable from its regression line
• can compute a score for subsets of the conditions tested
• SEE is a measure of deviation: smaller is better. The closer the points are to the line (the better the prediction), the smaller the SEE value.
• SEE is a measure of dispersion similar to standard deviation, and behaves like standard deviation
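As a rough illustration of methods (2) and (3), the sketch below computes r, the least-squares regression line, and SEE in plain Python. The sample scores and function names are hypothetical, not taken from the presentation. SEE is computed here as the RMS deviation from the regression line (dividing by n); this is an assumed convention, and some texts divide by n - 2 instead.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares regression of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def see(x, y, slope, intercept):
    """RMS deviation of y from the line (assumed convention: divide by n)."""
    squared = [(b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-condition scores (subjective on x, objective on y).
subjective = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
objective = [1.7, 1.9, 2.8, 2.9, 3.6, 3.8, 4.6]

r = pearson_r(subjective, objective)
slope, intercept = fit_line(subjective, objective)
see_value = see(subjective, objective, slope, intercept)
print(f"r = {r:.3f}, slope = {slope:.3f}, SEE = {see_value:.3f}")
```

Note that SEE comes out in the units of the objective measure, whereas r is unitless.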
Performance on subgroups of points: What correlation tells us
Computing the correlation coefficient for a subgroup can mislead us about how the subgroup relates to the overall group.
[Figure: scatterplot with a subgroup of points highlighted in red; labels r = 0.83 and r = 0.94]
The red points show a different relationship between the variables than is seen for the overall group. The correlation for those points tells us about their relationship to each other, but not to the rest of the data.
What SEE tells us
Analogous to a standard deviation, SEE is the square root of the average of squared deviations. It is the RMS deviation from the regression line for a given set of points. It can be calculated for any set of points with sufficient n, say n ≥ 6.
[Figure: scatterplot with regression line; deviations of two groups of points marked in yellow and red]
Compare two groups of points: SEE is smaller for the yellow deviations than for the red deviations.
SEE is in the same units as the variable for which it captures the variation. For this example, SEE has the units of y.
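The subgroup comparison described above can be sketched as follows: fit one regression line to all points, then compute the RMS deviation from that same line separately for each subgroup. The two subgroups and their values are hypothetical, chosen only so that one group sits close to the line and the other does not; each has n = 6.

```python
import math

def fit_line(x, y):
    """Least-squares regression of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def see(points, slope, intercept):
    """RMS deviation of the given (x, y) points from a fixed line."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical (subjective, objective) pairs for two subgroups.
tight = [(1, 1.1), (2, 1.9), (3, 3.1), (4, 3.9), (5, 5.1), (6, 5.9)]
loose = [(1, 1.5), (2, 1.5), (3, 3.5), (4, 3.5), (5, 5.5), (6, 5.5)]

# The line is fitted once, to the whole data set.
all_points = tight + loose
slope, intercept = fit_line([p[0] for p in all_points],
                            [p[1] for p in all_points])

# Each subgroup is then scored against that overall line.
see_tight = see(tight, slope, intercept)
see_loose = see(loose, slope, intercept)
see_all = see(all_points, slope, intercept)
print(f"tight: {see_tight:.2f}  loose: {see_loose:.2f}  all: {see_all:.2f}")
```

Because every subgroup is measured against the same overall line, the resulting values can be compared directly, which is not true of per-subgroup correlation coefficients.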
Evaluation Samples: The “Database”
The evaluation database consists of:
• a number of samples of the signal of interest
• a mean subjective rating for each sample
Ideally,
• the database should contain samples (test cases) covering the full range of types and levels of impairments that the model will encounter in usage conditions
• single database: all subjects have rated all test cases
• where multiple databases are used, there should be sufficient common test cases across the databases to show whether the subjective ratings line up
Criteria used for new Voice Quality Database
• Cover a broad range of impairment types and levels
  • different types of codecs, range of packet loss
  • background noise (for these cases, noise is in the reference)
  • combinations of these: coding, noise, packet loss, tandeming
• Two languages: English, French
• Multiple talkers
  • eight: four per language
• Include conditions that will challenge candidate methods
  • time warping (temporal shift) and noise reduction
• A large number of judgments to obtain stable scores
  • we used n = 60 for each sample
Effect of Truncating Quality Range
[Figure: two scatterplots; full-range database, r = 0.85; small-range database, r = 0.53]
This small-range database is simulated from the above by restricting the range of subjective values. (Care was taken in the simulation to keep the number of points about the same.) The range restriction reduced the correlation coefficient from 0.85 to 0.53.
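The effect of range restriction can be reproduced with a small simulation. This is a sketch under assumed conditions, not the simulation used for the slide: subjective scores are drawn uniformly, the objective score is the subjective score plus Gaussian noise of a fixed size, and both data sets have the same number of points. All parameter values are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(lo, hi, n, noise_sd, rng):
    """Subjective scores uniform in [lo, hi]; objective = subjective + noise."""
    subj = [rng.uniform(lo, hi) for _ in range(n)]
    obj = [s + rng.gauss(0.0, noise_sd) for s in subj]
    return subj, obj

rng = random.Random(2007)   # fixed seed so the run is repeatable
n = 2000                    # same number of points in both data sets
noise_sd = 0.6              # identical prediction error in both cases

full_subj, full_obj = simulate(1.0, 5.0, n, noise_sd, rng)
narrow_subj, narrow_obj = simulate(2.5, 3.5, n, noise_sd, rng)

r_full = pearson_r(full_subj, full_obj)
r_narrow = pearson_r(narrow_subj, narrow_obj)
print(f"full range: r = {r_full:.2f}; restricted range: r = {r_narrow:.2f}")
```

The prediction error is the same in both cases; only the spread of subjective scores differs, yet the correlation coefficient drops for the restricted range.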
Database details
Languages tested separately
• listeners were native speakers of the language heard
Samples 6 – 8 sec duration
• each made up of two unrelated sentences from the same talker
Four talkers per language; talkers crossed with conditions
• 1304 samples (326 x 4)
Test room ambient noise low
Presented at nominal telephone listening volume
Too many samples to complete in one session:
• samples were divided across four test sessions
• each session included one instance of each condition
• the four talkers were represented equally in all sessions
• therefore, every listener heard every test case, but not always with the same talker
Internal Consistency of Database: English
English Database: Internal Consistency (per-condition means, arbitrary split)
[Figure: scatterplot of per-condition means for the English samples, one half vs. other half; r = 0.995]
This is the upper limit of performance that can be detected with this database.
The variability of these samples indicates a resolution of about 0.25 MOS, as would be expected for n = 30 (i.e., half).
Internal Consistency of Database: French
[Figure: scatterplot of per-condition means for the French samples, one half vs. other half; r = 0.995]
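A split-half consistency check of the kind shown for the English and French data can be sketched as below. The ratings are simulated, not the Nortel data: 40 hypothetical conditions, 60 simulated listeners each, ratings on a 1 to 5 scale. The split into even- and odd-indexed listeners is one arbitrary choice among many.

```python
import math
import random

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_means(ratings):
    """ratings: one list of listener scores per condition.

    Returns per-condition means for two arbitrary halves
    (even-indexed and odd-indexed listeners).
    """
    half_a = [sum(r[0::2]) / len(r[0::2]) for r in ratings]
    half_b = [sum(r[1::2]) / len(r[1::2]) for r in ratings]
    return half_a, half_b

# Simulated ratings: each listener's score is the condition's assumed
# "true" quality plus noise, rounded and clipped to the 1..5 scale.
rng = random.Random(60)
true_quality = [1.5 + 3.0 * i / 39 for i in range(40)]
ratings = [
    [min(5, max(1, round(rng.gauss(q, 0.8)))) for _ in range(60)]
    for q in true_quality
]

half_a, half_b = split_half_means(ratings)
r_split = pearson_r(half_a, half_b)
print(f"split-half r = {r_split:.3f} (n = 30 per half)")
```

The split-half correlation gives a ceiling against which model correlations can be read: a model cannot be shown to agree with the subjective data more closely than the data agree with themselves.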
Correlation Coefficient (r) by Algorithm

Subj Data    Model A   Model B   Model C   Model D
English      0.93      0.92      0.85      0.90
French       0.90      0.90      0.78      0.87
Merged       0.91      0.90      0.82      0.83
Averaged*    0.93      0.92      0.84      0.91

* This is the correlation for French and English scores averaged together, not the average of the correlation coefficients!
Results for Model A
[Figure: scatterplot for Model A; r = 0.93]
The spread of these points shows that Model A can resolve subjective quality to no better than about 0.5 MOS.
Results for Model C
[Figure: scatterplot for Model C; r = 0.84]
This model shows a tendency to compress the range of its output score, relative to the subjective scores. There are a number of outliers in the lower left quadrant. The mid-range resolution is about 3/4 MOS.
Example of results for subgroups: SEE* values

Subgroup                     Model A   Model B   Model C   Model D
MNRU                         0.41      0.49      0.29      0.30
Codecs                       0.27      0.29      0.29      0.18
Random Packet Loss           0.23      0.31      0.23      0.29
Constrained Random PL        0.24      0.34      0.22      0.29
Bursty Packet Loss           0.16      0.23      0.32      0.33
Constrained Bursty PL        0.20      0.26      0.21      0.17
Combined Temporal Clipping   0.30      0.32      0.22      0.29
Noise                        0.22      0.32      0.41      0.27
Noise + Packet Loss          0.25      0.32      0.40      0.26
Noise Reduction              0.22      0.32      0.28      0.21
Overall                      0.23      0.30      0.30      0.26

* based on means across languages
What can we learn from the voice metric testing that can assist in evaluation of video metrics?
1. Ensure the use of a range of quality in the subjective test samples (next slide).
  • this can affect the correlation observed
2. Include all the impairments you are going to want to assess with the model, or that may be encountered in signals that pass through networks.
3. Within reason, any subjective metric can be used, as long as it is sufficiently sensitive to the variation in quality over the range used. It doesn’t need to be MOS.
4. Collect data from as many viewers as practicable
  • n > 30 if possible
5. Examine internal consistency of subjective ratings
6. Examine performance of the models on subgroups within the data
  • select a statistic that provides an unbiased result
  • (r is not unbiased in this application)
  • SEE statistic provides a credible alternative
7. Examine resolution and monotonicity
  • quantitative metrics??
Interpreting regression and correlation
[Figure: four scatterplots with regression lines, described below]
• Weak relationship: the points fall far from the line, and the cloud of points is about as long as it is wide. It looks as though a line in any direction would be as good.
• Strong relationship and the line is very similar to the diagonal: on average, the objective measure is closely tracking the subjective score. For MOS prediction, this is the most desirable result.
• Strong relationship, but the line is canted relative to the diagonal: the objective measure is using a smaller range than the subjective score. Note: the value of the correlation coefficient does not indicate whether the line tracks the diagonal.
• Deviation from linear: the objective measure follows the diagonal for the lower portion, but underestimates the quality of the conditions in the upper range. We can compute a regression line, but it will not account for the non-linearity. We could compute a best-fit curve, but there is no “correlation” statistic to indicate the strength of a non-linear relationship.
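The point that the correlation coefficient does not show whether the line tracks the diagonal can be checked numerically. In this sketch (hypothetical scores, hypothetical names), a second objective measure is built by compressing the first one toward the scale midpoint of 3.0 by a factor of 0.5. The correlation is unchanged, while the regression slope is halved.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical scores: 'tracking' follows the diagonal fairly closely.
subjective = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
tracking = [1.7, 1.9, 2.8, 2.9, 3.6, 3.8, 4.6]

# Same measure with its output range compressed toward the midpoint.
compressed = [3.0 + 0.5 * (y - 3.0) for y in tracking]

r_tracking = pearson_r(subjective, tracking)
r_compressed = pearson_r(subjective, compressed)
slope_tracking = regression_slope(subjective, tracking)
slope_compressed = regression_slope(subjective, compressed)

print(f"tracking:   r = {r_tracking:.3f}, slope = {slope_tracking:.3f}")
print(f"compressed: r = {r_compressed:.3f}, slope = {slope_compressed:.3f}")
```

This is why the slope of the regression line (or the scatterplot itself) needs to be inspected alongside r.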
Working with correlation (1)
Correlation coefficients cannot be averaged. Why not?
[Figure: scatterplots for Database A, Database B, and Databases A & B merged; labels r = 0.94, r = 0.93, r = 0.92, r = 0.65]
Correlation is not a linear process, and so the correlations cannot be treated with linear operations (like averaging).
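A small numerical sketch of why averaging fails: two hypothetical databases, one covering a wide quality range and one covering a narrow range, are correlated separately and then merged. The values are invented for illustration and are unrelated to the figures on the slide.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical database A: wide range of subjective quality.
a_subj = [1.0, 2.0, 3.0, 4.0, 5.0]
a_obj = [1.2, 1.8, 3.1, 4.1, 4.8]

# Hypothetical database B: narrow range around the middle of the scale.
b_subj = [2.8, 3.0, 3.2, 3.4, 3.6]
b_obj = [3.4, 2.9, 3.5, 3.0, 3.6]

r_a = pearson_r(a_subj, a_obj)
r_b = pearson_r(b_subj, b_obj)
mean_of_r = (r_a + r_b) / 2

# Correlation computed on the pooled points of both databases.
r_merged = pearson_r(a_subj + b_subj, a_obj + b_obj)

print(f"r_A = {r_a:.2f}, r_B = {r_b:.2f}")
print(f"mean of r = {mean_of_r:.2f}, merged r = {r_merged:.2f}")
```

The mean of the two coefficients and the coefficient for the merged data disagree substantially, so the merged value has to be computed from the pooled points rather than from the individual r values.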
Nortel Database: Summary of Impairment Conditions

Category                No. of Cases   Range of Quality
Clean                   2              High quality only
MNRU                    7              5 - 35 dBQ
Codecs                  7              G.711, G.729, AMR, tandem
Random Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
Constrained Random PL   22             same speech & mask for each codec
Bursty Packet Loss      54             1% - 10% PL, 10, 20, 30 ms packets
Constrained Bursty PL   22             same speech & mask for each codec
Temporal Clipping       21             15-60 ms clip, +/-80 ms shift, 120 ms mute
Noise                   33             20, 10, 0 dB SNR, Hoth, car, babble, street
Noise + Packet Loss     54             2%, 4%, random & bursty
Noise Reduction         48             good and poor noise reduction algorithm
Total                   326

326 cases x 4 talkers x 2 languages = 2608 test samples in the database
Results for Model A by subgroup: English