Coffee Room
Kaustubh Katdare • Mar 19, 2017

Lip-reading AI system from Oxford shows humans how to do it better

Have you ever tried to guess what a person on TV is saying while the volume is muted? It's hard! Well, a new artificially intelligent (AI) system developed by Oxford scientists can do it far better than human experts. While the experts could guess only 12% of the words correctly, the Oxford system got nearly 50% of the words right. Of course, the scientists trained the AI system by letting it watch thousands of hours of BBC news footage.

This newly developed AI system, called "Watch, Attend and Spell" (WAS), was developed in association with Google's DeepMind division. Oxford PhD student Joon Son Chung explained the complexity of the challenge: consider visually similar words like 'mat', 'cat', 'bat' and 'pat'. Humans make nearly identical mouth shapes for all of them, so it is extremely difficult to tell which word is being said.

To address this challenge, the scientists taught the system which words usually occur together. The system analyses the series of mouth shapes and makes informed guesses about the words that follow. The BBC provided content from its popular news programmes along with subtitles, and a neural network was then trained on this visual data to learn which words were being spoken and which mouth shapes they were associated with.
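As a rough illustration of that idea, here is a minimal Python sketch of how word co-occurrence statistics can resolve visually ambiguous words. This is not the actual WAS model (which is a neural network trained on BBC footage); the bigram counts and function names below are made up for the example.

```python
# Hypothetical illustration: picking between visually identical words
# ('mat', 'cat', 'bat', 'pat') using which words usually come together.
from collections import defaultdict

# Toy bigram counts: how often word B follows word A in a training corpus.
# These numbers are invented purely for demonstration.
bigram_counts = defaultdict(dict)
bigram_counts["the"] = {"cat": 50, "mat": 30, "bat": 10, "pat": 2}
bigram_counts["a"] = {"bat": 25, "cat": 20, "mat": 15, "pat": 5}

def disambiguate(previous_word, candidates):
    """Pick the candidate most likely to follow previous_word."""
    counts = bigram_counts.get(previous_word, {})
    return max(candidates, key=lambda w: counts.get(w, 0))

# The mouth shapes for these four words look almost the same, so the
# visual model alone cannot separate them; context breaks the tie.
ambiguous = ["mat", "cat", "bat", "pat"]
print(disambiguate("the", ambiguous))  # -> cat
print(disambiguate("a", ambiguous))    # -> bat
```

The real system learns a far richer version of this from thousands of hours of subtitled video, but the principle is the same: sequence context makes an otherwise ambiguous lip shape guessable.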

Enhanced lip-reading technology will allow for better accuracy and speed in speech-to-text services. Perhaps it could lead to real-time speech-to-text, where the system observes mouth movements and converts them into subtitles. Existing speech-to-text systems are not very effective in noisy environments; a lip-reading system would solve that problem effectively.

Lip-reading is very hard. According to Oxford computer science researchers, hearing-impaired lip readers achieve only about 52% accuracy, while Georgia Tech researchers found that only about 30% of speech is visible on the lips. With deep-learning systems that keep improving, the problem can be addressed to a much greater extent.

Refer to the original research paper published by the Oxford researchers for details of the study: "Lip Reading in the Wild"

Source: BBC
