Rec. ITU-T P.1401 (01/2020) Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models
Summary
History
FOREWORD
Table of Contents
1 Scope
2 References
3 Definitions
4 Abbreviations and acronyms
5 Conventions
6 Subjective test and objective algorithms
     6.1 Aspects related to subjective testing
     6.2 Aspects related to objective algorithms
7 Evaluation framework
     7.1 Data preparation
          7.1.1 Test and validation databases
          7.1.2 Data cleansing
     7.2 Analysis types
     7.3 Prediction on a numerical quality scale
          7.3.1 Comparing MOS values of different experiments
          7.3.2 Scale calibration of objective quality models
          7.3.3 Performance evaluation of objective measures and compensation for the variance between subjective experiments
     7.4 Uncertainty of subjective results
     7.5 Statistical evaluation metrics
          7.5.1 Absolute prediction error (rmse)
          7.5.2 Residual error distribution and outlier ratio
               7.5.2.1 Residual error distribution
               7.5.2.2 Outlier ratio
          7.5.3 Pearson correlation coefficient
     7.6 Statistical significance evaluation
          7.6.1 Significance of the difference between the correlation coefficients
          7.6.2 Significance of the difference between the outlier ratios
          7.6.3 Statistically significant difference between probabilities of exhibiting errors below a pre-defined threshold
          7.6.4 Significance of the difference between the root mean square errors
          7.6.5 Significance test in the case of multiple comparisons
     7.7 Statistical evaluation in the context of subjective uncertainty: epsilon insensitive rmse and its statistical significance
     7.8 Statistical evaluation of the overall performance
          7.8.1 Database weighting
          7.8.2 Aggregated statistically significant distance measure (SSDM)
          7.8.3 Statistical significance of the aggregated SSDM
8 Guidance on algorithm selection
     8.1 Per experiment performance
     8.2 Overall figure of merit
     8.3 Worst performance cases
     8.4 Averaging statistical metrics across experiments
9 Special cases
     9.1 Evaluation of algorithms with more than one output
     9.2 Evaluation of algorithms against pre-defined minimum performance requirements
     9.3 Scenarios using prediction error as the unique statistical metric, with datasets having different numbers of samples
          9.3.1 Aggregated performance measure
          9.3.2 Statistical significance of the aggregated performance measure
10 Demonstration cases
Appendix I  Algorithm mapping to the subjective scale
Appendix II  The impact of the third order versus first order mapping
     II.1 Application of third order and first order mappings
     II.2 Gain of third order mapping
Appendix III  Confidence intervals calculation
     III.1 The standard deviation for file-based analysis
     III.2 The standard deviation for condition-based analysis
     III.3 Exceptional cases
Appendix IV  Normality test
Appendix V  Statistical significance of the rmse_tot* across all experiments
Bibliography