AI learns to reconstruct a song from a silent music video

A new artificial intelligence (AI) model can watch a silent video of a musician playing an instrument and reconstruct the piece being played. In the future, the technology could use body movements to reconstruct speech and other sounds.

Scientists at MIT have unveiled Foley Music, an AI system that generates music from silent videos of musicians playing instruments. They say the model works with a variety of musical instruments and outperforms several existing systems in both speed and performance.

The researchers believe that an AI model that creates music from human movements could be the basis for several applications, from automatically adding sound effects to videos to creating immersive virtual reality experiences. They note that people have a similar skill: for example, they can understand speech by reading a person's lips.



Foley Music tracks key points of the body (25 points) and fingers (20 points) as intermediate visual anchor points, which it uses to model body and hand movements. The system then translates these movements into musical notes, taking the volume of each note into account. In this way it can generate music for accordion, bass guitar, bassoon, cello, guitar, piano, ukulele, and other instruments.
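To make the idea of turning tracked key points into notes with a volume more concrete, here is a deliberately simplified Python sketch. It is not the researchers' model: the motion threshold, the mapping from hand position to pitch, and every name in it (`NoteEvent`, `motion_energy`, `frames_to_notes`) are assumptions made up for illustration. The only details taken from the article are the 25 body and 20 finger points and the fact that the output is a sequence of notes with a volume.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]   # (x, y), assumed normalized to the range 0..1
Frame = Sequence[Point]       # 25 body points followed by 20 finger points

NUM_BODY_POINTS = 25          # figure quoted in the article
NUM_FINGER_POINTS = 20        # figure quoted in the article


@dataclass
class NoteEvent:
    frame: int      # index of the video frame that triggered the note
    pitch: int      # MIDI-style pitch, 21 (A0) to 108 (C8)
    velocity: int   # MIDI-style loudness, 1 to 127


def motion_energy(prev: Frame, curr: Frame) -> float:
    """Mean Euclidean displacement of all key points between two frames."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(prev, curr):
        total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return total / len(curr)


def frames_to_notes(frames: Sequence[Frame],
                    threshold: float = 0.01,
                    max_energy: float = 0.1) -> List[NoteEvent]:
    """Toy heuristic (an assumption, not the published method):
    emit one note whenever the pose moves more than `threshold`."""
    events: List[NoteEvent] = []
    for i in range(1, len(frames)):
        energy = motion_energy(frames[i - 1], frames[i])
        if energy < threshold:
            continue  # pose is nearly still, so no note is emitted
        # Horizontal position of the fingers picks the pitch.
        fingers = frames[i][NUM_BODY_POINTS:]
        mean_x = sum(x for x, _ in fingers) / len(fingers)
        mean_x = min(max(mean_x, 0.0), 1.0)
        pitch = 21 + round(mean_x * 87)
        # Faster movement means a louder note.
        velocity = min(127, max(1, round(127 * energy / max_energy)))
        events.append(NoteEvent(frame=i, pitch=pitch, velocity=velocity))
    return events


if __name__ == "__main__":
    total_points = NUM_BODY_POINTS + NUM_FINGER_POINTS
    still = [(0.5, 0.5)] * total_points
    moved = [(0.58, 0.5)] * total_points
    for event in frames_to_notes([still, still, moved]):
        print(event)
```

A real system would learn this mapping from data rather than rely on a hand-written rule; the sketch only shows the shape of the input (a sequence of key-point frames) and of the output (timed notes with a loudness value).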

In their experiments, the researchers trained Foley Music on three datasets containing 1,000 music video clips in 11 categories. This let them assemble a corpus of videos of varying complexity: tutorials from the AtinPiano website, amateur videos from YouTube channels, concert excerpts, and other material.

The researchers fed 450 videos into the Foley Music system, then passed the generated music to scientists who evaluated the results. In some cases, the evaluators noted that “the music is like a cover from a quality band.”

Experts found that music generated by Foley Music is difficult to distinguish from actual recordings. What’s more, the AI can improve audio quality, semantic alignment, and timing.
