Prudence Pitch

13:12 09/10/2019


Microsoft Research team reveals a system that automatically generates standup lecture videos from audio narration

(Tech) Building on earlier work, a Microsoft Research team this week laid out a technique that they claim improves the fidelity of audio-driven talking-head animations. Previous talking-head generation approaches required clean, relatively noise-free audio with a neutral tone, but the researchers say their method, which disentangles audio sequences into factors such as phonetic content and background noise, can generalize to noisy and emotionally rich data samples.



Underlying their proposed technique is a variational autoencoder (VAE) that learns latent representations. The VAE factorizes input audio sequences into separate representations that encode content, emotion, and other factors of variation. Based on the input audio, a sequence of content representations is sampled from the learned distribution and, along with input face images, is fed to a video generator that animates the face.
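To make the sampling step concrete, here is a minimal sketch of how a VAE typically draws latent vectors, assuming the standard reparameterization z = mean + sigma * eps. It is not the researchers' implementation: `toy_encoder` is a hypothetical stand-in for a trained network, and the latent sizes and variances are made up for illustration. Only the content latents are collected, mirroring the description above in which content representations are what the video generator receives.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Reparameterized draw: z = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def toy_encoder(audio_frame):
    """Hypothetical encoder: maps one audio frame to (mean, log_var) per factor.

    A real model would be a trained neural network; these statistics are
    placeholders chosen only so the example runs.
    """
    average = sum(audio_frame) / len(audio_frame)
    energy = sum(x * x for x in audio_frame) / len(audio_frame)
    return {
        "content": ([average, energy], [-2.0, -2.0]),
        "emotion": ([energy], [-1.0]),
    }

def content_sequence(audio_frames, seed=0):
    """Sample one content latent per audio frame; other factors are ignored."""
    rng = random.Random(seed)
    sequence = []
    for frame in audio_frames:
        mean, log_var = toy_encoder(frame)["content"]
        sequence.append(sample_latent(mean, log_var, rng))
    return sequence

# Three short "frames" of audio samples (made-up numbers).
frames = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
latents = content_sequence(frames)
print(len(latents), len(latents[0]))
```

In a full system, each sampled content vector would be passed, together with a face image, to the video generator; that part is omitted here because the article does not describe it in detail.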

The researchers sourced three data sets to train and test the VAE: GRID, an audiovisual corpus containing 1,000 recordings each from 34 talkers; CREMA-D, which consists of 7,442 clips from 91 ethnically diverse actors; and LRS3, a database of over 100,000 spoken sentences from TED videos. They fed GRID and CREMA-D to the model to teach it disentangled phonetic and emotional representations, and then evaluated the quality of the generated videos using two quantitative metrics: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM).
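For readers unfamiliar with PSNR, the sketch below computes it from its standard definition, 10 * log10(MAX^2 / MSE), for two grayscale frames given as nested lists. The function name `psnr` and the 8-bit peak value of 255 are assumptions made for this example; the article does not say how the team computed the metric.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized grayscale frames."""
    if len(reference) != len(generated) or any(
        len(r) != len(g) for r, g in zip(reference, generated)
    ):
        raise ValueError("frames must have the same shape")
    squared_errors = [
        (r - g) ** 2
        for row_r, row_g in zip(reference, generated)
        for r, g in zip(row_r, row_g)
    ]
    if not squared_errors:
        raise ValueError("frames must not be empty")
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

# A generated frame closer to the reference scores higher.
reference = [[100, 100], [100, 100]]
close = [[101, 100], [100, 100]]
far = [[120, 100], [100, 100]]
print(psnr(reference, close) > psnr(reference, far))
```

Higher PSNR means the generated frame is numerically closer to the ground-truth frame. SSIM, the second metric, instead compares local luminance, contrast, and structure, and is usually taken from an image-processing library rather than written by hand.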

The team says its approach performs on par with other methods on all metrics for clean, neutral spoken utterances. Moreover, they note that it performs consistently across the entire emotional spectrum and that it is compatible with all current state-of-the-art approaches for talking-head generation.

