Technology type-recorder

Neural Voices and Proper Punctuation

Leveraging Neural text-to-speech capabilities to generate life-like speech

Neural Voices

Neural voices.

Neural Voices (Using en-GB | Female | Neural)

Neural voices are all the rage in the text-to-speech world. Neural voices are what sophisticated text-to-speech systems use to synthesize natural sounding speech with computers. As the name implies, neural voices utilize neural networks to perform this synthesis. This article discusses neural voices and how punctuation plays a role in neural network speech synthesis.

Neural Networks

First, neural networks though.

I don’t want to spend too much time on them here, but neural networks, in short, are large computer systems… These systems enable “deep learning” and other forms of machine learning (where computer systems use exceptionally massive amounts of data to recognize and replicate patterns). With the right algorithms in place, feeding a deep learning system data allows the system to “learn,” and the more it learns, the smarter (better at recognizing and replicating patterns) it becomes.

If you are interested to learn more, here are some links to check out:

Back to Neural Voices

In the case of neural voices, neural networks replicate speech patterns and construct speech using phonemes, the most basic unit of speech, along with other building blocks like tone. These speech outputs by neural systems are able to sound much more like human speech because the neural computer systems emulate the same or similar patterns during speech synthesis that they “learn” from speech samples provided as part of learning/training data.

Can you explain that last sentence?

In an attempt to explain, let me explain. Basically (and I mean at a basic conceptual level), neural speech systems are fed large amounts of learning/training data: think of samples of speech paired with the corresponding string of text representing that speech. For example, the training data might include a key-value pair where the key is a string of text,

“Hey man, what’s going on?”

The value in that key-value pair is a recording of someone actually saying that phrase:

Hey man, what’s going on? (Using en-US | Female | Neural with Chat style)

After comparing millions if not billions if not trillions of pieces of training data like the key value pair above, the system is able to identify patterns of speech based on the text it is associated with. The neural system then builds speech on its own from a string of text using the building blocks at its disposal (phonemes, tones, etc.) attempting to use the pattern that it has identified from its training.

In reality, training models are much more complex, and trainers need to worry about handling bias, creating sustainable learning models, positive/negative reinforcement, and so on. This ai.googleblog article goes into some of these training concepts in its discussion of the neural networks behind Google Voice.

Finally, as is the case with most machine learning systems, designers and developers adjust the algorithms that the systems use to help the system achieve the desired result.

All the big players are doing it

Google, AWS, IBM, and Microsoft are all leveraging neural networks to create neural voices.

A couple of the articles above are some years old now, so this isn’t new news, but it still has yet to fully reach the mainstream. I know for a fact that as of writing this article and recently building type-recorder, the documentation for the APIs that I use are very recently developed – in fact, some of the SDK functions were not finished being built out yet! That means that this stuff (text-to-speech) is just now being made available to the broader audience of developers to experiment with and extrapolate on, and that is exciting!

So… Punctuation?

So… Punctuation. What about it? (Using: en-US | Male | Neural)
So Punctuation What about it (Using: en-US | Male | Neural)
So, punctuation. What about it? (Using: en-US | Male | Neural)

Can you hear the differences between the above three recordings (from type-recorder 😉😎 using the en-US | Male | Neural voice)? Compare how each recording sounds to the punctuation of the text it was synthesized from. Can you spot a difference between how the first recording sounds and the third one? Listen to how “So” is said at the beginning of both.

As deep learning has become and becomes more sophisticated, these complex neural systems utilize more features and building blocks to synthesize human-like speech including punctuation and even the words themselves. Take a look at the second recording (the one without punctuation); even without punctuation, the system knows that the word “what” is a question word, so the synthetic speaker raises the tone of its voice when “what” is present (as an English speaking human would do) at the end of the sentence to indicate that a question is being asked.

I will say, that when using neural voices on type-recorder, I do not see the system differentiate the speech pattern much when I use an exclamation point as opposed to a period… at least it isn’t doing it yet. Hopefully, at least, you have a couple of tips to try as you create your own recordings!

What’s next?

I plan to create a separate blog post about this, but if you want to get really crazy, Microsoft Azure Cognitive Services text-to-speech system does allow you to distinguish the speaking style (i.e. News Cast, Cheerful, Empathetic) for certain neural voices. AWS allows the same thing, and I’m sure that Google and IBM probably do too. That said, they only allow this feature for a small subset of their neural voices, which tells me it is a newer feature that requires some heavy implementation on their part. I’m excited for more to come in the world of synthetic speech!

Thanks for reading!

Thanks for reading! (Using: en-US | Female | Neural voice with a Cheerful style)

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.