Udacity Developed an AI to Turn Instructor’s Speech into Deepfake-Style Video, But Has No Plans to Use It

Deepfakes, fake videos created with deep learning and neural networks, have been circulating on the internet since 2017. Many have recently grown alarmed by their increasing sophistication and the resulting implications for the spread of misinformation. Udacity, the nanolearning provider, has recently sought to use similar technology to cut the time and effort needed to record and update online courses. On July 4, two former members of its AI development team, Byung-Hak Kim and Varun Ganapathi, released a paper on arXiv.org presenting work they conducted at Udacity, which they call LumièreNet.

The main purpose of LumièreNet is to allow instructors to make important updates to their courses without needing to head back into the studio, along with a video production team, to re-record course segments.

If successful, the application would instead allow instructors merely to record audio of the updated material. Using the previously recorded images and the instructor’s speech, LumièreNet could then recreate a video to use for the course.

LumièreNet Was Developed at Udacity, but the Company Has No Plans to Put It to Use

Since its development, Udacity has walked away from using LumièreNet. Kim and Ganapathi, furthermore, have left to start their own company.

“LumièreNet was developed by Byung-Hak Kim and Varun Ganapathi, two independent researchers and former Udacity employees,” a spokesperson said via email. “While it’s true that Hak and Varun developed and tested the algorithm using Udacity content (Udacity provided the videos and instructors gave consent), the algorithm neither belongs to Udacity nor do we use [it] to develop our content. Udacity’s classroom content is filmed at Udacity’s studio and built in collaboration with expert instructors.”

The researchers differ from others creating AI applications to make video replications of real people in one notable way: Kim and Ganapathi decided it would be simpler and better to approximate their video recreations instead of making super realistic copies.

They describe the problem they’re trying to solve as trying to “learn a function to map from an audio sequence recorded by an instructor to video sequence.”

To go about solving this, they write, “[W]e consider neural network modules as a randomly chosen instance of this problem based on this probabilistic generative model. A key advantage is that each neural network model represents a single factor which separates the influence of other networks that can be trained and improved independently.”

With the audio and image inputs, LumièreNet uses three neural network models: a bidirectional long short-term memory (BLSTM) network, a variational auto-encoder, and the generative adversarial network known as Pix2Pix.
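The modular idea can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy data shapes, and the order in which the stages are chained are assumptions made for illustration, and each stage is a trivial stand-in rather than a trained network. It only shows how three independently replaceable components can be composed into one audio-to-video function.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Image = List[List[float]]


def audio_to_pose_codes(audio_features: Sequence[Vector]) -> List[Vector]:
    """Hypothetical stand-in for the BLSTM: one compact code per audio frame."""
    # Placeholder logic: average each audio feature vector into a 1-D code.
    return [[sum(frame) / len(frame)] for frame in audio_features]


def decode_pose(code: Vector) -> Image:
    """Hypothetical stand-in for the auto-encoder's decoder: code -> tiny 'pose image'."""
    value = code[0]
    return [[value, value], [value, value]]


def render_frame(pose_image: Image) -> Image:
    """Hypothetical stand-in for Pix2Pix: pose image -> output frame."""
    # Placeholder logic: clamp every pixel into the range [0, 1].
    return [[min(1.0, max(0.0, v)) for v in row] for row in pose_image]


def synthesize_video(
    audio_features: Sequence[Vector],
    sequence_model: Callable[[Sequence[Vector]], List[Vector]] = audio_to_pose_codes,
    decoder: Callable[[Vector], Image] = decode_pose,
    renderer: Callable[[Image], Image] = render_frame,
) -> List[Image]:
    """Chain the three stages; any one can be swapped without touching the others."""
    codes = sequence_model(audio_features)
    return [renderer(decoder(code)) for code in codes]


# Two toy audio frames produce two toy video frames.
frames = synthesize_video([[0.25, 0.75], [2.0, 4.0]])
```

Because each stage is passed in as an argument, one component could in principle be retrained or replaced while the other two stay fixed, which is the advantage the authors describe.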

Watch a video demonstration of LumièreNet here.

The authors consider this a simplification of other research, such as the now-notorious 2017 study that deepfaked former President Barack Obama, which they cite in their paper.

The researchers also decided to approximate other key motions, such as blinking, and eye and hair movement.

To test out their deep learning function, the authors recreated a course video shoot. They brought in an instructor to record a 4+ hour video. They then took the audio and ran it through their application to compare results.

Initial Tests Went Well

In their paper, Kim and Ganapathi report positive findings on several attributes of their AI. It is especially successful when the instructor faces the camera head on. But it also struggled in other areas. Hands and fingers were especially difficult and tended to blur together when they were folded.

While this work was created to improve course production, the authors acknowledge in their introduction that they need to exercise caution using their AI for other reasons.

“Even though our approach is developed with primary intents to support agile video content development which is crucial in current online MOOC courses, we acknowledge there could be potential misuse of the technologies,” the authors write. “Nonetheless, we believe it is crucial synthesized video using our approach requires to indicate as synthetic and it is also imperative to obtain consent from the instructors across the entire production processes.”

Kim and Ganapathi left Udacity to form their own AI startup in April. Read the full paper here.

Featured Image: TechCrunch, Flickr.

Correction July 9: A previous version of this article stated that Ganapathi and Kim currently work at Udacity. The two researchers left the company to form their own startup in April. A previous version also indicated that Udacity owned the software. While LumièreNet was developed at the company, Udacity does not exercise ownership over the software.