Dec 14, 2021
Multi-Speaker Text-to-Speech Model for an AR/VR Startup in the Healthcare Industry
Completed


$50,000+
2-3 months
Austin, Texas, United States
2-5
Service categories
Service Lines
Artificial Intelligence
Software Development
Domain focus
Healthcare
Technology
Programming language
Python

Challenge

The client wanted to integrate a 3D animated avatar into their application that could synthesize speech in a voice cloned from an input audio clip. The avatar explains various medical procedures and terms to patients, so the generated voice had to sound realistic and articulate, and pronounce difficult medical terminology precisely. They reached out to BroutonLab to develop an MVP text-to-speech engine tailored to this use case, and the development was carried out within the time and budget constraints.

Solution

To train the models, open-source datasets were explored, pre-processed, and combined. A voice encoder, trained on a speaker verification task, extracts speaker embeddings from reference audio. For text-to-speech, an MIT-licensed TTS model based on diffusion probabilistic modeling was adapted: an encoder-decoder architecture with a score-based decoder produces mel spectrograms conditioned on the input text and the speaker embedding. A GAN-based vocoder, consisting of a fully convolutional generator and multiple discriminators, then converts the mel spectrograms into audio waveforms. The deliverable included a Flask server app containerized with Docker, exposing a simple REST API that manages access to the text-to-speech engine.
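The three-stage pipeline described above can be sketched in Python. The model calls below are hypothetical placeholders (random outputs with plausible shapes), not the trained networks; the embedding size, mel-band count, and hop length are assumed values chosen only to make the data flow concrete.

```python
import numpy as np

# Assumed hyperparameters for illustration only.
EMBED_DIM = 256      # speaker-embedding size (assumption)
N_MELS = 80          # mel bands (assumption)
HOP_LENGTH = 256     # vocoder upsampling factor (assumption)
SAMPLE_RATE = 22050

def encode_speaker(reference_audio: np.ndarray) -> np.ndarray:
    """Voice encoder: map a reference clip to a fixed-size speaker
    embedding (placeholder for the speaker-verification model)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(EMBED_DIM).astype(np.float32)

def diffusion_tts(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Score-based decoder: predict a mel spectrogram conditioned on the
    text and speaker embedding (placeholder: four frames per character)."""
    n_frames = max(1, len(text)) * 4
    rng = np.random.default_rng(1)
    return rng.standard_normal((N_MELS, n_frames)).astype(np.float32)

def gan_vocoder(mel: np.ndarray) -> np.ndarray:
    """Fully convolutional generator: upsample mel frames to a waveform
    (placeholder for the GAN-based vocoder)."""
    n_samples = mel.shape[1] * HOP_LENGTH
    rng = np.random.default_rng(2)
    return rng.uniform(-1.0, 1.0, n_samples).astype(np.float32)

def synthesize(text: str, reference_audio: np.ndarray) -> np.ndarray:
    """Full pipeline: reference clip -> embedding -> mel -> waveform."""
    embedding = encode_speaker(reference_audio)
    mel = diffusion_tts(text, embedding)
    return gan_vocoder(mel)

if __name__ == "__main__":
    reference = np.zeros(SAMPLE_RATE, dtype=np.float32)  # 1 s dummy clip
    audio = synthesize("An MRI scan is painless.", reference)
    print(audio.shape)
```

Keeping the stages behind separate functions like this mirrors the deployed design: the Flask REST endpoint only needs to call something like `synthesize`, so each model can be swapped or retrained independently.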

Results

The project was delivered on time and within budget as an MVP, showcasing what can be achieved in a short time with limited resources. Future work will consist of acquiring custom high-quality speech datasets to further improve the models.