Dong Yu, of Microsoft Research Redmond, and Frank Seide, of Microsoft Research Asia, have extended the state of the art in real-time, speaker-independent automatic speech recognition, producing a deep-neural-network version of MAVIS. MAVIS (Microsoft Audio Video Indexing Service) indexes audio and video content so that you can search for a particular word or phrase across the entire collection and jump to the exact point where it was uttered. MAVIS is now being updated with the concepts and algorithms of deep-neural-network speech recognition. This is the first time any company has released a deep-neural-network (DNN)-based speech-recognition algorithm in a commercial product. Some have gone so far as to call it a breakthrough and the "Holy Grail" of speech recognition.
Artificial neural networks are mathematical models of low-level circuits in the human brain. Although the concept has been applied to speech recognition for more than 20 years, only in the past few years have computer scientists gained access to enough computing power to build such models at scale.
DNN speech-recognition systems are characterized by improved accuracy and reduced processing time. Demos have shown a 10-to-20 percent relative error reduction, along with about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models. Appealing as those numbers are, they are not the best part of the technique.
The best part is that the new approach does not require any sort of "speaker adaptation." In currently deployed systems, an audio file is recognized multiple times, and after each pass the recognizer "tunes" itself a little more closely to the specific speaker or speakers in the file, so that the next pass does better. That is obviously an expensive process, so the new algorithm should also bring down costs.
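To make the multi-pass adaptation loop concrete, here is a deliberately tiny sketch. Everything in it is hypothetical: real recognizers adapt rich acoustic models, not a single offset. Two toy "phone" classes sit at fixed feature means, an unknown speaker offset shifts every frame, and each recognition pass re-estimates that offset from its own decoding so the next pass decodes more accurately.

```python
import numpy as np

rng = np.random.default_rng(42)

CLASS_MEANS = np.array([-1.0, 1.0])   # two toy "phone" classes (assumption)
SPEAKER_OFFSET = 0.5                  # unknown to the recognizer

# Simulated audio: 200 frames, each a class mean shifted by the speaker
labels = rng.integers(0, 2, size=200)
frames = CLASS_MEANS[labels] + SPEAKER_OFFSET + rng.normal(0.0, 0.3, size=200)

def decode(frames, bias):
    """Classify each frame against the class means, given the current
    estimate of the speaker's feature offset."""
    return np.argmin(np.abs(frames[:, None] - bias - CLASS_MEANS), axis=1)

bias = 0.0
accuracies = []
for _ in range(3):                    # three recognition passes over the file
    decoded = decode(frames, bias)
    accuracies.append(float(np.mean(decoded == labels)))
    # "Tune" to the speaker: re-estimate the offset from this pass's output
    bias = float(np.mean(frames - CLASS_MEANS[decoded]))

print(accuracies)                     # accuracy per pass; later passes improve
```

The point of the sketch is the cost structure: the same audio is decoded three times, which is exactly the repeated work the DNN approach avoids.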
A speech recognizer is essentially a model of fragments of the sounds of speech. One example of such fragments is "phonemes," the roughly 30 or so pronunciation symbols used in a dictionary. State-of-the-art speech recognizers use shorter fragments, numbering in the thousands, called "senones." This research took a leap forward when the group proposed modeling those thousands of senones, much smaller acoustic-model building blocks, directly with DNNs.
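The idea of modeling senones directly can be sketched as a deep feed-forward network whose softmax output layer scores each acoustic frame against thousands of senone classes. This is a minimal, untrained illustration under assumed sizes (39 features per frame, 3,000 senones); it is not the MAVIS architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 39      # e.g. MFCC features per acoustic frame (assumption)
HIDDEN = 256         # hidden units per layer (assumption)
N_SENONES = 3000     # "thousands of senones," per the article

def init_layer(n_in, n_out):
    """Random weights and zero biases for one dense layer."""
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

# A "deep" network: two hidden layers, then a senone output layer
layers = [init_layer(N_FEATURES, HIDDEN),
          init_layer(HIDDEN, HIDDEN),
          init_layer(HIDDEN, N_SENONES)]

def senone_posteriors(frame):
    """Forward pass: one acoustic frame -> posterior distribution over senones."""
    h = frame
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)       # ReLU on hidden layers
    e = np.exp(h - h.max())              # numerically stable softmax
    return e / e.sum()

frame = rng.standard_normal(N_FEATURES)  # a stand-in acoustic frame
p = senone_posteriors(frame)
print(p.shape)                           # one probability per senone
```

The contrast with the older approach is in the output layer: instead of a separate Gaussian Mixture Model per senone, a single network emits posteriors over all senones at once for each frame.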
By modeling senones directly using DNNs, the system outperformed state-of-the-art conventional speech-recognition systems by more than 16%. This kind of service is critical in mobile scenarios, at call centers, and in web services for speech-to-speech translation.
Via: Microsoft Research