Interactive voice: that was then, this is now, what’s next?
A few years ago, analysts were predicting that speech would replace touch within a few years as the most common way of interacting with smart devices.
That was then, this is now, and it’s safe to say that speech hasn’t replaced touch just yet.
But the uses of voice are innumerable, and the popularity of voice-enabled devices and services has indeed been growing rapidly. To steer us through the many products available, articles compare apps and services – voice searches, VoIP, voice-controlled devices, speech recognition, text-to-speech (TTS) capabilities, etc.
Just last month, analysts again predicted that speech would become the primary mode of interaction with smart services, this time among ‘customers’ (i.e. affluent Western professionals aged 30-45) who increasingly prefer voice assistants to visiting the bank or shopping in person. Meanwhile, reviews of the big four assistants (Amazon’s Alexa, Apple’s Siri, Google Assistant, and Microsoft’s Cortana) find that they are still not equal to real-life challenges.
A catalyst for interactive voice innovation
What does the future hold? One recent development sparks the imagination: Mozilla’s release of its open-source speech recognition model and voice dataset at the end of November 2017. What makes this kind of publicly available technology and data so compelling? A quick look at how most voice-enabled services work helps to answer that question.
Speech-enabled devices and services: how they work
To be able to interact with users, a voice application must be able to listen and respond, as in any normal conversation. It should have “listening” (speech recognition) capabilities for input, as well as “speaking” (speech synthesis) capabilities for output.
These capabilities correspond to speech-to-text (cf. Mozilla’s DeepSpeech) and text-to-speech technologies. In other words, a voice application should be able to produce speech from text (read a text aloud), produce text from speech (e-mail dictation), and recognize spoken commands or questions and respond correctly (IVR, GPS, virtual assistants). The application may have to “learn” to recognize the speech of its individual user, but often a set of voice data is used to “train” machine-learning algorithms: the better the dataset, the better the app recognizes speech. This is why Mozilla’s Common Voice dataset is so valuable to innovation.
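To make the “recognize and respond” flow concrete, here is a minimal sketch of an IVR-style command handler. The recognizer is a deliberate stub standing in for a real speech-to-text engine (such as Mozilla’s DeepSpeech), and the command names and responses are hypothetical examples, not any real product’s API.

```python
# Minimal sketch of the "listen and respond" loop of a voice application.
# In a real system, audio would be fed to a speech-to-text engine; here a
# stub transcript stands in for the recognizer's output.

RESPONSES = {
    "check balance": "Your balance is being retrieved.",
    "transfer funds": "Please state the amount to transfer.",
}

def recognize(audio_transcript: str) -> str:
    """Stub recognizer: a real engine would turn an audio signal into text."""
    return audio_transcript.strip().lower()

def respond(command: str) -> str:
    """Map a recognized command to a spoken (or textual) response."""
    return RESPONSES.get(command, "Sorry, I did not understand that.")

# Example interaction
heard = recognize("Check balance")
print(respond(heard))  # -> Your balance is being retrieved.
```

The stub makes one design point visible: recognition (turning audio into text) and dialogue logic (choosing a response) are separate stages, which is why a better dataset for the first stage improves the whole application.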
To sum up with a prediction of our own, we can look forward in the coming years to a sharp rise in speech-enabled products and services with new speech interfaces.
Measuring end-user experience on voice applications
All of these future products and services will be tested by R&D and integration teams prior to release. But what happens when these products enter the real world? How well will they serve their users? These questions are already being addressed for current voice technologies, and we can look forward to doing the same for emerging and future ones.
Present-day mainstream voice apps are monitored by automated voice transactions which check both the “listening” and the “response” flows — for example, whether a bank’s customer interface correctly recognizes spoken input and correctly generates output (a voice response, an action, or a text). Measurements are also made of how long it takes to connect to the service, how long the server takes to respond, and so on, because voice application monitoring tools check the quality of the user’s experience on the system from end to end.
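A monitoring probe of the kind described above can be sketched as a synthetic transaction that is timed leg by leg. The service below is simulated, and the SLA threshold is a made-up illustrative value, not a figure from any real monitoring tool.

```python
import time

def simulated_service() -> str:
    """Stand-in for a real voice service; a probe would place an actual call."""
    time.sleep(0.05)  # simulated server processing delay
    return "Your balance is 100 euros."

def probe(max_response_s: float = 1.0) -> dict:
    """Run one synthetic transaction and time the server's response."""
    start = time.perf_counter()
    reply = simulated_service()
    elapsed = time.perf_counter() - start
    return {
        "response_ok": "balance" in reply,     # did the service answer correctly?
        "response_time_s": round(elapsed, 3),  # how long did it take?
        "within_sla": elapsed <= max_response_s,
    }

print(probe())
```

Running such a probe on a schedule, and alerting when `response_ok` or `within_sla` turns false, is the essence of the end-to-end monitoring described here: it checks both the correctness of the response and the responsiveness the user actually experiences.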
Diagnostics are generated from the results of proactive monitoring of all kinds of voice interfaces and services, like softphones, call centers, IVR, and other mainstream voice applications. The same approaches should be applied to future voice technologies because good end-user experience will continue to be a primary goal of the speech-enabled devices and services of tomorrow.