Alibaba’s new AI system ‘EMO’ creates realistic talking and singing videos from photos

Alibaba’s Institute for Intelligent Computing has developed an AI system called EMO that animates a single portrait photo into a realistic talking or singing video. The system synthesizes video directly from audio, without relying on intermediate 3D models or facial landmarks. EMO is built on a diffusion model and was trained on a dataset of over 250 hours of talking head videos. It outperforms existing methods in video quality, identity preservation, and expressiveness. For singing, it generates mouth shapes and facial expressions that match the vocals. The length of the output video is determined by the length of the input audio, so EMO can produce videos of arbitrary duration. However, ethical concerns remain about potential misuse of the technology, and the researchers plan to explore methods for detecting synthetic videos.

