MAVIS Gets Deep-Neural-Network Speech Recognition Upgrade

Dong Yu of Microsoft Research Redmond and Frank Seide of Microsoft Research Asia have extended the state of the art in real-time, speaker-independent automatic speech recognition, and their work now powers a deep-neural-network-based MAVIS. MAVIS (Microsoft Audio Video Indexing Service) indexes audio and video content and lets you search for a particular word or phrase across the entire collection, then jump to the exact point where it was uttered. MAVIS is now being updated with the concepts and algorithms of deep-neural-network speech recognition, marking the first time any company has released a deep-neural-network (DNN)-based speech-recognition algorithm in a commercial product. Some have gone as far as calling it a breakthrough and the “Holy Grail of speech recognition.”


Artificial neural networks are mathematical models of low-level circuits in the human brain. Although the concept has been applied to speech recognition for more than 20 years, only in the last few years have computer scientists had access to enough computing power to build such models at a practical scale.
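As a rough illustration of the building block involved, here is a minimal sketch of a single neural-network layer: each “neuron” takes a weighted sum of its inputs and passes it through a nonlinearity, and a “deep” network simply stacks many such layers. The sizes and the sigmoid nonlinearity here are assumptions for illustration, not details of the research systems.

```python
import numpy as np

def layer(x, W, b):
    """One dense layer: sigmoid of a weighted sum of the inputs."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

rng = np.random.default_rng(0)
x = rng.standard_normal(40)          # e.g. a 40-dimensional acoustic feature vector
W = rng.standard_normal((512, 40))   # weights for 512 "neurons"
b = np.zeros(512)                    # one bias per neuron

h = layer(x, W, b)                   # 512 activations, each in (0, 1)
print(h.shape)                       # (512,)
```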

DNN-based speech recognition is characterized by improved accuracy and reduced processing time, so systems built on it gain on both fronts. Demos have shown a 10-to-20 percent relative reduction in error while using about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models. (The reduction is relative: a recognizer with a 25 percent word error rate, for example, would drop to roughly 20 to 22.5 percent.) Appealing as these numbers are, though, they are not the best part of the technique.

The best part is that the new approach does not require any sort of “speaker adaptation.” Current systems do: the audio file is recognized multiple times, and after each pass the recognizer “tunes” itself a little more closely to the specific speaker or speakers in the file, so that the next pass comes out better. This is obviously an expensive process, so the new algorithm should bring costs down.
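To make the contrast concrete, here is a hedged sketch of that multi-pass adaptation loop. The decode() and adapt() functions below are hypothetical stubs standing in for a real recognizer and its adaptation step; they are not part of MAVIS or any Microsoft API.

```python
# Sketch of multi-pass speaker adaptation. decode() and adapt() are
# hypothetical stubs, not a real recognizer API.

def decode(audio, model):
    """Stub: recognize the whole file with the current model."""
    return f"transcript (model adapted {model['passes']} time(s))"

def adapt(model, audio, hypothesis):
    """Stub: re-estimate the model toward this file's speaker(s)."""
    return {"passes": model["passes"] + 1}

def transcribe_with_adaptation(audio, passes=3):
    model = {"passes": 0}
    hypothesis = None
    for _ in range(passes):
        hypothesis = decode(audio, model)        # recognize the file...
        model = adapt(model, audio, hypothesis)  # ...then tune to the speaker
    return hypothesis

# The DNN-based recognizer described here would need only a single
# decode(audio, model) call, with no adapt() step.
print(transcribe_with_adaptation(audio=b"..."))
```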

A speech recognizer is, at its core, a model of fragments of the sounds of speech. One example of such fragments is “phonemes,” the roughly 30 or so pronunciation symbols used in a dictionary. State-of-the-art speech recognizers use much shorter fragments, numbering in the thousands, called “senones.” The leap forward in this research came when the group proposed modeling those thousands of senones, much smaller acoustic-model building blocks, directly with DNNs.
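The sketch below illustrates that idea: a deep stack of layers whose final output assigns a probability to each of thousands of senones for a single frame of audio, rather than to a few dozen phonemes. The layer sizes, senone count, and sigmoid activation are illustrative assumptions, not the published configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_features, n_hidden, n_senones = 40, 512, 9000   # illustrative sizes only

# Weight matrices for a deep stack of hidden layers, then a senone-sized output.
Ws = [rng.standard_normal((n_hidden, n_features)) * 0.01]
Ws += [rng.standard_normal((n_hidden, n_hidden)) * 0.01 for _ in range(2)]
W_out = rng.standard_normal((n_senones, n_hidden)) * 0.01

x = rng.standard_normal(n_features)   # one frame of acoustic features
h = x
for W in Ws:
    h = sigmoid(W @ h)                # three hidden layers of 512 units each
posteriors = softmax(W_out @ h)       # one probability per senone: P(senone | frame)
print(posteriors.shape)               # (9000,)
```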

By modeling senones directly with DNNs, the system outperformed state-of-the-art conventional speech-recognition systems by more than 16%. This kind of service is critical in mobile scenarios, at call centers, and in web services for speech-to-speech translation.

Via: #-Link-Snipped-#
