Measuring speech quality

Measuring the quality of voice communications
Of the various methods, the full reference (FR) method is the most widely used. It yields the most reliable results, but it requires a distortion-free reference file for comparison.
Full reference methods use algorithms that evaluate speech samples by simulating how the human ear perceives audio. Both the reference file and the degraded sample are processed in this way, then compared to determine the audible difference between them. That difference is passed through a cognitive model, which approximates the way the human brain would process such data. Lastly, an overall picture of voice quality is generated.
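The steps above can be illustrated with a deliberately simplified sketch. It is not PSQM, PAMS or PESQ: the frame length, the energy-in-decibels "ear" model, the linear mapping to a score and the 1-to-5 range are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def frame_energies_db(samples, frame_len=160):
    """Step 1 (toy ear model): energy of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # The small constant avoids log10(0) on silent frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def toy_full_reference_score(reference, degraded, frame_len=160):
    """Compare a degraded signal with its reference; return a score from 1.0 to 5.0."""
    ref_db = frame_energies_db(reference, frame_len)
    deg_db = frame_energies_db(degraded, frame_len)
    n = min(len(ref_db), len(deg_db))
    if n == 0:
        raise ValueError("signals are shorter than one frame")
    # Step 2: "audible difference", approximated by the mean absolute dB gap.
    disturbance = sum(abs(r - d) for r, d in zip(ref_db, deg_db)) / n
    # Step 3 (stand-in for the cognitive model): assumed linear mapping,
    # losing one point per 10 dB of average disturbance, floored at 1.0.
    return max(1.0, 5.0 - disturbance / 10.0)


# Usage: a 440 Hz tone sampled at 8 kHz, compared with an attenuated copy.
reference = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1600)]
degraded = [0.5 * x for x in reference]
print(toy_full_reference_score(reference, reference))
print(toy_full_reference_score(reference, degraded))
```

An identical copy of the reference produces no disturbance and therefore the top score, while attenuation or dropouts lower it. Real models replace each step with far more elaborate psychoacoustic processing, including time alignment of the two signals.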

The diagram below represents the full reference model:
Over the years, several models for measuring the quality of voice over IP have been developed. These include PSQM (Perceptual Speech Quality Measure), recommended by the ITU from 1996 to 2001, and PAMS (Perceptual Analysis Measurement System). PESQ (Perceptual Evaluation of Speech Quality), the currently recommended model, is an optimized combination of PAMS and PSQM.

The table below sets forth the scale defined by the ITU:

• the noise index corresponds to the quantity of additional data (in frequency) when the degraded file presents an offset,
• the loss index corresponds to the quantity of missing data when there is an offset with respect to the reference file,
• the offset index corresponds to the delay between utterances.
These three indicators are expressed as a percentage with respect to the reference file.
• a Newtest for Voice robot simulating real user calls on any type of voice network:
– Public Switched Telephone Network (PSTN)
– Global System for Mobile Communications (GSM)
– Voice over IP (VoIP)
• a classic Newtest robot equipped with a softphone like Skype or X-Lite.
