When will we get "Stable Diffusion for voice"?
Never closes
We already have it
Later this year (2024)
2027 or later

I understand that modern text-to-speech uses AI to generate voice. But current TTS models are usually trained on only a single vocalist, and can only imitate the voice of the person they trained on. If you want a different voice, you have to use a different model, trained on a different vocalist.

As far as I can tell, there's no TTS software that can mix and match voices, to create new voices that aren't part of its training set. For example, you could ask the model to generate a "female chainsmoker with an Australian accent" and it would generate speech with that voice, even though there were no female chainsmoking Australians in its training data (there would of course be females, chainsmokers, and Australians in the training set, but not the intersection of all three).

I expect this technology will arrive soon, or perhaps it already has! Let me know if you have any leads.


Don't know which program they use, but the Project Lawful podcast on Spotify uses a great AI voice model. I've listened to about 77 hours of it, and there are only about 2 words they sometimes mispronounce.

Is there no way to delete comments?

@DavidFWatson I don't think so.

To the people saying we already have it, where?

@GG have you tried Suno? https://app.suno.ai/create/. Their models will generate either words or music

@DavidFWatson This is a voice cloner. As in, you can upload a voice clip of a 60-year-old, chain-smoking woman with an Australian accent, and then have it read text in her voice. It does not let you describe a voice you'd like it to read in and then create one. Ditto Suno. We have impressive voice generation, but we don't have what's being asked for in the original poll description.

@inaimathi Thank you for investigating TortoiseTTS for me. You are correct, I am not looking for a voice cloner. The question is about when we will have software that can generate a new type of voice based on a text description, not simply one that can imitate a specific voice it's been trained on.

@GG It wasn't really an investigation; I've just been doing work in this space and had it top of mind :p (see https://inaimathi.ca/posts/this-blog-is-now-a-podcast and https://github.com/inaimathi/catwalk/blob/master/tts.py). FWIW, I voted 2025.

@GG You can dismiss those products by saying they're just a 'voice cloner', but as you can see from the votes, most people see those as the same thing. TortoiseTTS is distributed with samples of the training data; you can just use those if you don't want to provide your own.

Stable Diffusion: text -> image

Voice Cloner: text + built-in or provided audio -> audio
What you're looking for:
'SD for voice': text voice description + text to read -> audio

Hopefully you can see why folks are confused about your question: if someone gave you TortoiseTTS behind an alias so that it always used built-in voice #3 or whatever, that'd be:
text -> audio
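
To make the signatures above concrete, here's a rough Python sketch. Everything in it is hypothetical: `Audio`, `voice_cloner`, and `sd_for_voice` are stand-in stubs I made up for illustration, not the API of TortoiseTTS, Suno, or any real library. It only shows how pinning the reference voice turns the cloner into something that looks like text -> audio from the outside.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable


@dataclass(frozen=True)
class Audio:
    """Stand-in for generated speech; just records what it was conditioned on."""
    voice: str
    text: str


def voice_cloner(text: str, reference_voice: str) -> Audio:
    # Hypothetical stub: text + built-in or provided audio -> audio
    return Audio(voice=f"clone of {reference_voice}", text=text)


def sd_for_voice(voice_description: str, text: str) -> Audio:
    # Hypothetical stub: text voice description + text to read -> audio
    return Audio(voice=f"new voice: {voice_description}", text=text)


# The "alias": fix the reference voice once, so callers only pass text.
aliased_cloner: Callable[[str], Audio] = partial(
    voice_cloner, reference_voice="built-in voice #3"
)

clip = aliased_cloner("Hello there")
described = sd_for_voice(
    "female chainsmoker with an Australian accent", "Hello there"
)
```

The aliased version has the same shape as Stable Diffusion (one text input, one output), but the voice still comes from a recorded sample rather than from a description, which is the distinction the question is drawing.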