Dec 14, 2021
Multi-Speaker Text-to-Speech Model for an AR/VR Startup in the Healthcare Industry
Completed


$50,000+
2-3 months
Austin, Texas, United States
2-5
Service categories
Service Lines
Artificial Intelligence
Software Development
Domain focus
Healthcare
Technology
Programming language
Python

Challenge

The client wanted to integrate a 3D animated avatar into their application that could synthesize speech in a voice cloned from an input audio clip. The avatar explains various medical procedures and terms to patients, so the generated voice had to sound realistic and articulate, and pronounce difficult medical terminology precisely. They reached out to BroutonLab to develop an MVP text-to-speech engine tailored to this use case, and the development was carried out within the time and budget constraints.

Solution

To train the models, open-source datasets were explored, pre-processed, and combined. A voice encoder, trained on a speaker verification task, extracts speaker embeddings from reference audio. For text-to-speech, an MIT-licensed TTS model based on diffusion probabilistic modeling was adapted: an encoder-decoder architecture with a score-based decoder produces mel spectrograms conditioned on the input text and the speaker embedding. A GAN-based vocoder, consisting of a fully convolutional generator and multiple discriminators, then converts the mel spectrograms into audio waveforms. The deliverable included a Flask server app containerized with Docker, exposing a simple REST API that manages access to the text-to-speech engine.
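The three-stage pipeline described above can be sketched in Python. The model calls below are hypothetical placeholders (random outputs with plausible shapes), not the trained networks; the embedding size, mel-band count, and hop length are assumed values chosen only to make the data flow concrete.

```python
import numpy as np

# Assumed hyperparameters for illustration only.
EMBED_DIM = 256      # speaker-embedding size (assumption)
N_MELS = 80          # mel bands (assumption)
HOP_LENGTH = 256     # vocoder upsampling factor (assumption)
SAMPLE_RATE = 22050

def encode_speaker(reference_audio: np.ndarray) -> np.ndarray:
    """Voice encoder: map a reference clip to a fixed-size speaker
    embedding (placeholder for the speaker-verification model)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(EMBED_DIM).astype(np.float32)

def diffusion_tts(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Score-based decoder: predict a mel spectrogram conditioned on the
    text and speaker embedding (placeholder: four frames per character)."""
    n_frames = max(1, len(text)) * 4
    rng = np.random.default_rng(1)
    return rng.standard_normal((N_MELS, n_frames)).astype(np.float32)

def gan_vocoder(mel: np.ndarray) -> np.ndarray:
    """Fully convolutional generator: upsample mel frames to a waveform
    (placeholder for the GAN-based vocoder)."""
    n_samples = mel.shape[1] * HOP_LENGTH
    rng = np.random.default_rng(2)
    return rng.uniform(-1.0, 1.0, n_samples).astype(np.float32)

def synthesize(text: str, reference_audio: np.ndarray) -> np.ndarray:
    """Full pipeline: reference clip -> embedding -> mel -> waveform."""
    embedding = encode_speaker(reference_audio)
    mel = diffusion_tts(text, embedding)
    return gan_vocoder(mel)

if __name__ == "__main__":
    reference = np.zeros(SAMPLE_RATE, dtype=np.float32)  # 1 s dummy clip
    audio = synthesize("An MRI scan is painless.", reference)
    print(audio.shape)
```

Keeping the stages behind separate functions like this mirrors the deployed design: the Flask REST endpoint only needs to call something like `synthesize`, so each model can be swapped or retrained independently.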

Results

The project was delivered on time and within budget as an MVP, showcasing what can be achieved in a short time with limited resources. Future work will consist of acquiring custom high-quality speech datasets to further improve the models.