Neural Voices
Neural voices are all the rage in the text-to-speech world. They are what sophisticated text-to-speech systems use to synthesize natural-sounding speech, and, as the name implies, they rely on neural networks to perform that synthesis. This article discusses neural voices and the role punctuation plays in neural speech synthesis.
Neural Networks
First, though, a quick word on neural networks.
I don’t want to spend too much time on them here, but neural networks, in short, are the computing systems that enable “deep learning” and other forms of machine learning, in which a system is fed massive amounts of data so that it can recognize and replicate patterns. With the right algorithms in place, feeding a deep learning system data allows it to “learn,” and the more it learns, the better it becomes at recognizing and replicating those patterns.
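If you want a hands-on feel for the “feed it data and it learns patterns” idea, here is a toy sketch using scikit-learn (purely illustrative; real speech systems train far larger networks on audio, not handwritten digits):

```python
# Toy illustration: a small neural network "learns" patterns from data.
# scikit-learn's MLPClassifier on handwritten digits stands in for pattern
# recognition in general -- real neural voice systems are vastly larger.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 neurons; more data and more training = better patterns.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

print("Pattern-recognition accuracy:", net.score(X_test, y_test))
```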
If you are interested to learn more, here are some links to check out:
- But what is a Neural Network? | Deep learning, chapter 1 – This video and its sequel provide a nice breakdown of neural networks.
- Explained: Neural Networks by MIT News – In case you’re interested in some of the history of neural networks.
- What’s a Neural Network – Another link to check out.
Back to Neural Voices
In the case of neural voices, neural networks replicate speech patterns and construct speech from phonemes (the most basic units of speech) along with other building blocks like tone. The resulting output sounds much more like human speech because, during synthesis, the system emulates the same or similar patterns it has “learned” from the speech samples provided in its training data.
Can you explain that last sentence?
Let me try to explain. Basically (and I mean at a basic conceptual level), neural speech systems are fed large amounts of training data: think of samples of speech paired with the corresponding string of text representing that speech. For example, the training data might include a key-value pair where the key is a string of text,
“Hey man, what’s going on?”
The value in that key-value pair is a recording of someone actually saying that phrase:
After comparing millions, if not billions, of training examples like the key-value pair above, the system identifies patterns in speech based on the text it is associated with. The neural system can then build speech on its own from a new string of text, using the building blocks at its disposal (phonemes, tones, etc.) and attempting to follow the patterns it identified during training.
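Conceptually, you can picture the training data like this (a purely hypothetical sketch; real corpora are stored and processed very differently, with phoneme alignments, speaker metadata, and so on):

```python
# Hypothetical, simplified picture of TTS training examples: each pairs a text
# string with audio of a human saying it. Real corpora carry far more detail.
training_examples = [
    {"text": "Hey man, what's going on?", "audio_file": "samples/greeting_001.wav"},
    {"text": "So, what are we doing today?", "audio_file": "samples/question_014.wav"},
    # ...millions more pairs...
]

for example in training_examples:
    # A real training loop would extract acoustic features from the audio and
    # adjust the network's weights so its synthesized speech matches them.
    print(f'Text: {example["text"]!r} -> Audio: {example["audio_file"]}')
```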
In reality, training models are much more complex, and trainers need to worry about handling bias, creating sustainable learning models, positive/negative reinforcement, and so on. This ai.googleblog article goes into some of these training concepts in its discussion of the neural networks behind Google Voice.
Finally, as is the case with most machine learning systems, designers and developers adjust the algorithms that the systems use to help the system achieve the desired result.
All the big players are doing it
Google, AWS, IBM, and Microsoft are all leveraging neural networks to create neural voices.
A couple of the articles above are a few years old now, so this isn’t new news, but it still has yet to fully reach the mainstream. As of writing this article and recently building type-recorder, the documentation for the APIs I use was only very recently developed; in fact, some of the SDK functions were still being built out! That means this stuff (text-to-speech) is only now being made available to the broader audience of developers to experiment with and build on, and that is exciting!
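To give a sense of what these APIs look like from a developer’s perspective, here is a minimal sketch using the Azure Cognitive Services Speech SDK for Python (the key, region, and voice name are placeholders and assumptions on my part; check the current SDK documentation for exact voice names):

```python
# Minimal sketch: synthesizing speech with a neural voice via the Azure Speech SDK.
# The subscription key, region, and voice name below are placeholders/assumptions.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-GuyNeural"  # assumed neural voice

# Write the synthesized audio to a file instead of the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="hello.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

result = synthesizer.speak_text_async("Hey man, what's going on?").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Wrote hello.wav")
```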
So… Punctuation?
Can you hear the differences between the above three recordings (from type-recorder 😉😎 using the en-US | Male | Neural voice)? Compare how each recording sounds to the punctuation of the text it was synthesized from. Can you spot a difference between how the first recording sounds and the third one? Listen to how “So” is said at the beginning of both.
As deep learning becomes more sophisticated, these complex neural systems use more features and building blocks to synthesize human-like speech, including punctuation and even the words themselves. Take a look at the second recording (the one without punctuation): even without punctuation, the system knows that the word “what” is a question word, so the synthetic speaker raises the tone of its voice at the end of the sentence (as an English-speaking human would do) to indicate that a question is being asked.
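If you want to experiment with this effect yourself, one approach (sketched below with the same assumed Azure setup as the earlier example; the phrases are made up for illustration) is to synthesize the same words with different punctuation and compare the recordings:

```python
# Synthesize the same words with different punctuation and compare the output.
# Assumes the same key, region, and neural voice as the earlier sketch.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-GuyNeural"  # assumed voice name

variants = {
    "punctuated.wav": "So, what are we doing today?",
    "no_punctuation.wav": "So what are we doing today",
    "exclamation.wav": "So, what are we doing today!",
}

for filename, text in variants.items():
    audio_config = speechsdk.audio.AudioOutputConfig(filename=filename)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )
    # Listen for differences in pacing and pitch, especially around "So" and "what".
    synthesizer.speak_text_async(text).get()
```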
I will say that when using neural voices on type-recorder, I do not hear the system differentiate the speech pattern much when I use an exclamation point as opposed to a period… at least it isn’t doing it yet. Hopefully, at least, you have a couple of tips to try as you create your own recordings!
What’s next?
I plan to create a separate blog post about this, but if you want to get really crazy, the Microsoft Azure Cognitive Services text-to-speech system allows you to specify the speaking style (e.g. newscast, cheerful, empathetic) for certain neural voices. AWS allows the same thing, and I’m sure that Google and IBM probably do too. That said, they only allow this feature for a small subset of their neural voices, which tells me it is a newer feature that requires some heavy implementation on their part. I’m excited for more to come in the world of synthetic speech!
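As a teaser for that post, here is a rough sketch of how the speaking-style feature looks with the Azure Speech SDK, which takes SSML containing an mstts:express-as element (the voice name and style are assumptions on my part; supported styles vary by voice, so check the service documentation):

```python
# Rough sketch: requesting a speaking style via SSML with the Azure Speech SDK.
# Voice name and style are assumptions; supported styles vary by neural voice.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      So, what are we doing today?
    </mstts:express-as>
  </voice>
</speak>
"""

# Plays on the default speaker unless an AudioOutputConfig is supplied,
# as in the earlier sketches.
synthesizer.speak_ssml_async(ssml).get()
```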
Thanks for reading!