Humans can isolate a single voice in a crowd and are usually good at doing so, but how about computers? Well... not so much. Just ask anyone trying to talk to a smart speaker at a house party. Google may have a surprisingly straightforward solution. Its researchers have developed a deep learning system that can pick out a specific voice by looking at people's faces when they are speaking. The team trained its neural network model to recognise individual people speaking by themselves and then created virtual "parties" to teach the AI how to isolate multiple voices into distinct audio tracks.
The results, which can be seen in the video below, are uncanny. Even when people are clearly trying to talk over each other (comedians, for example), the AI can generate a clean audio track for one person just by focusing on their face. That's true even if the person partially obscures their face with hand gestures or a microphone.
Google is exploring opportunities to use this feature in its products, and there are more than a few prime candidates. It is potentially ideal for video chat services, where it could help users understand someone talking in a crowded room. It could also be helpful for speech enhancement in video recordings. There are big implications for accessibility: it could lead to camera-linked hearing aids that boost the sound of whoever's in front of you, and to more effective closed captioning. There are potential privacy issues (it could be used for public eavesdropping), but it wouldn't be too difficult to limit the voice separation to people who've clearly given their consent.