A new AI system generates one second of speech in 500 milliseconds. It was developed by Facebook engineers, who say their method is several tens of times faster than comparable systems.
Facebook has introduced a highly efficient AI-based system that quickly converts text to speech. It can run in real time on conventional processors. The researchers also described a new approach to collecting data; the system can produce one second of audio in 500 milliseconds.
Facebook will be able to produce high-quality voices without the need for specialized hardware. The company's specialists noted that the system achieved a 160-fold speedup compared with analogous systems. This makes it suitable even for devices with limited computing power.
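The figures quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the function name `real_time_factor` is hypothetical, and it assumes that the 160-fold speedup refers to the same measurement (time needed to synthesize one second of audio), which the article does not state explicitly.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    return synthesis_seconds / audio_seconds


# Reported figure: one second of audio produced in 500 milliseconds.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=1.0)

# Assumption: the 160x speedup applies to this same per-second synthesis time.
reported_speedup = 160
implied_baseline_seconds = 0.5 * reported_speedup

print(f"Real-time factor: {rtf}")
print(f"Implied time before the speedup: {implied_baseline_seconds} s per second of audio")
```

Under that assumption, a real-time factor of 0.5 means the system produces audio twice as fast as it is played back, while the slower starting point would have needed about 80 seconds per second of audio.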
The Facebook system consists of four parts, each of which handles a different aspect of speech: a linguistic front end, a prosody model, an acoustic model, and a neural vocoder.
The system converts the text into a sequence of linguistic features, such as the sentence type and units of sound that differ depending on the word in which they are used. The model also accounts for the characteristics of speaker and style: it can interpret and predict the rhythm, phrasing, and frequencies of speech.
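The four-stage structure described above can be pictured as a chain of functions, each feeding the next. The following is a toy sketch, not Facebook's implementation: every name, data shape, and number in it (letters standing in for units of sound, 80 ms per unit, 10 ms frames, 240 samples per frame) is a hypothetical placeholder chosen only to show how the stages connect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinguisticFeatures:
    units: List[str]      # stand-in for units of sound (here: just letters)
    sentence_type: str    # "question" or "statement"


def linguistic_front_end(text: str) -> LinguisticFeatures:
    # Placeholder: a real front end would use pronunciation rules or a lexicon.
    sentence_type = "question" if text.strip().endswith("?") else "statement"
    units = [ch for ch in text.lower() if ch.isalpha()]
    return LinguisticFeatures(units, sentence_type)


def prosody_model(features: LinguisticFeatures, style: str) -> List[float]:
    # Placeholder: predicts a duration in seconds per unit, scaled by style.
    base_seconds = 0.08
    scale = {"fast": 0.7, "soft": 1.1, "formal": 1.0}.get(style, 1.0)
    return [base_seconds * scale for _ in features.units]


def acoustic_model(features: LinguisticFeatures, durations: List[float],
                   frame_seconds: float = 0.01) -> List[List[float]]:
    # Placeholder: emits one tiny "spectral frame" per 10 ms of predicted duration.
    frames: List[List[float]] = []
    for unit, duration in zip(features.units, durations):
        n_frames = max(1, round(duration / frame_seconds))
        frames.extend([[float(ord(unit))]] * n_frames)
    return frames


def neural_vocoder(frames: List[List[float]], samples_per_frame: int = 240) -> List[float]:
    # Placeholder: returns silence of the right length instead of a waveform.
    return [0.0] * (len(frames) * samples_per_frame)


def synthesize(text: str, style: str = "formal") -> List[float]:
    features = linguistic_front_end(text)
    durations = prosody_model(features, style)
    frames = acoustic_model(features, durations)
    return neural_vocoder(frames)


formal_audio = synthesize("Hi", style="formal")
fast_audio = synthesize("Hi", style="fast")
print(len(formal_audio), len(fast_audio))
```

In this sketch the style only changes the predicted durations, so the "fast" rendering comes out shorter than the "formal" one. A real system would condition far more than timing on the style, but the data flow from text to features to frames to samples is the point being illustrated.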
Style embeddings allow the system to create new voice styles, such as “soft”, “fast”, and “formal”, and only a small amount of data is required to add them. According to Facebook, each style takes only 30 to 60 minutes of recordings, an order of magnitude less than the hours of recordings needed for a similar Amazon system.