
Neural Voices and Proper Punctuation

Neural Voices

Neural voices. (Using en-GB | Female | Neural)

Neural voices are all the rage in the text-to-speech world. They are what sophisticated text-to-speech systems use to synthesize natural-sounding speech with computers, and, as the name implies, they rely on neural networks to perform that synthesis. This article discusses neural voices and the role punctuation plays in neural network speech synthesis.

Neural Networks

First, though, a quick word on neural networks.

I don’t want to spend too much time on them here, but neural networks, in short, are layered computational models (typically run on large computer systems) that enable “deep learning” and other forms of machine learning, where computers use exceptionally large amounts of data to recognize and replicate patterns. With the right algorithms in place, feeding a deep learning system data allows the system to “learn,” and the more it learns, the smarter it becomes – that is, the better it gets at recognizing and replicating patterns.

If you are interested in learning more, here are some links to check out:

Back to Neural Voices

In the case of neural voices, neural networks replicate speech patterns and construct speech using phonemes, the most basic units of speech, along with other building blocks like tone. The resulting speech sounds much more like human speech because, during synthesis, the system emulates the same or similar patterns that it “learned” from the speech samples provided as part of its training data.

Can you explain that last sentence?

Let me try. Basically (and I mean at a basic conceptual level), neural speech systems are fed large amounts of learning/training data: think of samples of speech paired with the corresponding strings of text they represent. For example, the training data might include a key-value pair where the key is a string of text,

“Hey man, what’s going on?”

The value in that key-value pair is a recording of someone actually saying that phrase:

Hey man, what’s going on? (Using en-US | Female | Neural with Chat style)

After comparing millions, if not billions or trillions, of pieces of training data like the key-value pair above, the system is able to identify patterns of speech based on the text each sample is associated with. The neural system can then build speech on its own from a string of text, using the building blocks at its disposal (phonemes, tones, etc.) and attempting to follow the patterns it identified during training.
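
To make that idea a bit more concrete, here is a purely illustrative sketch (in Python) of what a handful of those key-value pairs might look like as data. The phrases and file names are made up, and real training corpora are vendor-specific and vastly larger and richer than this.

```python
# Purely illustrative: text (the "key") paired with a recording of someone saying it
# (the "value"). The file names are hypothetical; real corpora also carry metadata like
# speaker, language, and timing/alignment information.
training_pairs = [
    ("Hey man, what's going on?", "recordings/sample_000001.wav"),
    ("It's fine.",                "recordings/sample_000002.wav"),
    ("Hello. How is it going?",   "recordings/sample_000003.wav"),
    # ...millions more pairs...
]

for text, audio_path in training_pairs:
    # During training, the system aligns each text string with acoustic features
    # extracted from its recording so it can learn which sounds go with which text.
    print(f"{text!r} -> {audio_path}")
```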

In reality, training models are much more complex, and trainers need to worry about handling bias, creating sustainable learning models, positive/negative reinforcement, and so on. This ai.googleblog article goes into some of these training concepts in its discussion of the neural networks behind Google Voice.

Finally, as is the case with most machine learning systems, designers and developers adjust the algorithms that the systems use to help the system achieve the desired result.

All the big players are doing it

Google, AWS, IBM, and Microsoft are all leveraging neural networks to create neural voices.

A couple of the articles above are some years old now, so this isn’t new news, but it still has yet to fully reach the mainstream. I know for a fact that, as of writing this article and recently building type-recorder, the documentation for the APIs I use was developed very recently – in fact, some of the SDK functions were not fully built out yet! That means this stuff (text-to-speech) is only now being made available to the broader audience of developers to experiment with and build on, and that is exciting!

So… Punctuation?

So… Punctuation. What about it? (Using: en-US | Male | Neural)
So Punctuation What about it (Using: en-US | Male | Neural)
So, punctuation. What about it? (Using: en-US | Male | Neural)

Can you hear the differences between the three recordings above (from type-recorder 😉😎, using the en-US | Male | Neural voice)? Compare how each recording sounds with the punctuation of the text it was synthesized from. Can you spot a difference between how the first recording sounds and the third? Listen to how “So” is said at the beginning of each.
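
If you want to try a comparison like this from code rather than from the type-recorder UI, here is a minimal sketch using the Azure Cognitive Services Speech SDK for Python (one of the services that type-recorder builds on). The key, region, and voice name are placeholders, and I’m not claiming this is exactly what type-recorder does under the hood.

```python
# Minimal sketch: synthesize the same words with different punctuation and listen for
# differences in pauses and intonation. Requires `pip install azure-cognitiveservices-speech`
# and a valid Azure Speech key/region (placeholders below).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-GuyNeural"  # a male en-US neural voice

# With no audio config specified, synthesized audio plays through the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

variants = [
    "So… Punctuation. What about it?",
    "So Punctuation What about it",
    "So, punctuation. What about it?",
]

for text in variants:
    result = synthesizer.speak_text_async(text).get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Synthesized: {text!r}")
```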

As deep learning becomes more sophisticated, these complex neural systems utilize more features and building blocks to synthesize human-like speech, including punctuation and even the words themselves. Take a look at the second recording (the one without punctuation): even without punctuation, the system knows that the word “what” is a question word, so the synthetic speaker raises the tone of its voice at the end of the sentence (as an English-speaking human would) to indicate that a question is being asked.

I will say that, when using neural voices on type-recorder, I do not hear the system differentiate the speech pattern much when I use an exclamation point as opposed to a period… at least not yet. Hopefully, at the very least, you now have a couple of tips to try as you create your own recordings!

What’s next?

I plan to create a separate blog post about this, but if you want to get really crazy, the Microsoft Azure Cognitive Services text-to-speech system allows you to specify the speaking style (e.g. Newscast, Cheerful, Empathetic) for certain neural voices. AWS allows the same thing, and I’m sure that Google and IBM probably do too. That said, they only allow this feature for a small subset of their neural voices, which tells me it is a newer feature that requires some heavy implementation on their part. I’m excited for more to come in the world of synthetic speech!

Thanks for reading!

Thanks for reading! (Using: en-US | Female | Neural voice with a Cheerful style)

Say it with Style

Hello.

tl;dr: Check out the Test it Out subsection of the Set the Style section of this article for some cool stuff that you can do with type-recorder.

Neural Voices

If you didn’t read my post on Neural Voices and Punctuation, you can check it out here:

Neural Voices and Punctuation

If you don’t want to read it, though, it’s cool. The gist is that Neural Voices are synthetic voices powered by (usually large) computer systems that leverage Deep Learning and Artificial Intelligence to synthesize human-like (some might say “realistic”) speech. Computers are able to do this through speech pattern emulation: when generating speech from predefined text, the computer emulates speech patterns, using contextual clues like words and punctuation to decide what sounds to make and what tones to use when synthesizing the speech.

Neural Processing

This article by Microsoft on text-to-speech synthesis in .NET breaks down the ideas of deconstructing speech and artificially reconstructing it, and it highlights some of the concepts I mention in my previous article on this subject. Those concepts include the idea that speech is composed of basic building blocks (phonemes), and that computers can “learn” to construct speech from phonemes by deconstructing large volumes of training data – speech samples with defined text values.
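
As a toy illustration of that “building blocks” idea only, here is a sketch of the old-fashioned, concatenative way of gluing phoneme audio together. Real neural systems do not splice clips like this – they predict acoustic features for each phoneme in context – but the sketch shows what “speech is built from phonemes” means at its simplest. The lexicon and audio bytes below are entirely made up.

```python
# Toy, non-neural illustration of "speech is built from phonemes": look up a word's
# phonemes and concatenate a pre-recorded snippet for each one. Neural TTS instead
# predicts acoustic features for phonemes in context using a trained network.
PHONEME_LEXICON = {"hello": ["HH", "AH", "L", "OW"]}                  # hypothetical lexicon
PHONEME_AUDIO = {p: b"\x00" * 160 for p in ("HH", "AH", "L", "OW")}   # fake audio snippets

def naive_synthesize(word: str) -> bytes:
    """Concatenate per-phoneme audio for a single word (concatenative, not neural)."""
    return b"".join(PHONEME_AUDIO[p] for p in PHONEME_LEXICON[word])

audio = naive_synthesize("hello")
print(f"Built {len(audio)} bytes of (fake) audio for 'hello'")
```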

The amazing thing about these powerful cloud computing systems is that they can perform mind-boggling calculations at incredibly high speeds… they may not be able to handle computations as complex as the ones tasked to the fabled Deep Thought, but we are headed in the right direction, I think. The computers of today, though, are fast enough to, for example, give us the fastest route from New York to Los Angeles in seconds. There is a plethora of possible routes that you could take to get from New York to Los Angeles. A computer scientist might even say there are millions or more routes when you consider all of the possible turns you could make at every intersection. Most of these routes would be inefficient, having you drive into Canada and/or Mexico and probably up and down the continental United States along the way. The point is that, even after initially filtering out most of the poor options, there are so many potential routes that it would probably take a human a good few minutes, if not longer, to consider the options and pick the best one (especially if the human is considering things like construction and traffic). Knowing that a software program like Google Maps, when given two random points on a map, can determine the fastest route between them in a matter of seconds is amazing!

I digress here because I got caught up thinking about my algorithms courses from college and the shortest path problem (Dijkstra’s algorithm, anyone?). However, I also want to illustrate the point that there is a range of ways a written sentence may be read aloud when you consider things like prosody, tone, inflection, mood, and so on. There are multiple different “routes” one could take in reading the sentence (if you will allow that analogy). It is the job of the neural system to select the best possible “route” of speech to represent the text it is given.
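
Since Dijkstra’s algorithm came up, here is a minimal shortest-path sketch over a made-up road graph, just to ground the “pick the best route out of many” analogy. The cities and mileages are rough illustrations, not real driving data.

```python
# Minimal Dijkstra sketch: find the shortest known distance from a start node to every
# reachable node in a weighted graph. Edge weights below are rough, made-up mileages.
import heapq

def dijkstra(graph: dict, start: str) -> dict:
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_d = d + weight
            if new_d < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_d
                heapq.heappush(queue, (new_d, neighbor))
    return dist

roads = {
    "New York": {"Washington DC": 225, "Chicago": 790},
    "Washington DC": {"Chicago": 700},
    "Chicago": {"Los Angeles": 2015},
}
print(dijkstra(roads, "New York"))
# {'New York': 0, 'Washington DC': 225, 'Chicago': 790, 'Los Angeles': 2805}
```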

In many cases, though, having the text isn’t enough to accurately read it aloud and express its desired message… Think of all the times you’ve probably misinterpreted a text 📱… What does she mean by “It’s fine.”? People very often need additional information to accurately read text aloud and properly express the desired emotion and message.

Set the Style

Hello. How is it going?

Think of the different ways that this could be said.

Take the phrase above, “Hello. How is it going?” It’s simple, and you have probably already spoken it to yourself in your head. Yet, like many sentences, it can be said in multiple ways. How would you say it to someone who had recently suffered a great loss? Or how might you say it to a stranger compared to a good friend?

Having the context of the situation allows you to more accurately express yourself, and for that reason, modern text-to-speech system developers are allowing inputs for “style” or disposition. The big players like Microsoft, Google, and Amazon are enhancing their Neural text-to-speech capabilities so that developers can specify a style of speech and so that, ultimately, the text-to-speech systems can become evermore realistic and applicable.

Test it out

Want to get a look at how some of these styles compare? Check out the neural, female en-US-AriaNeural or zh-CN-XiaoxiaoNeural voices on type-recorder. You will be presented with a few style options to select from.

See the Style select list that is available for en-US-AriaNeural and zh-CN-XiaoxiaoNeural voices.

Let’s look at the phrase from earlier, “Hello. How is it going?” Using type-recorder, I’ve recorded the phrase with a “Cheerful” style and with an “Empathetic” style.

“Hello. How is it going?” – en-US-AriaNeural | Female | Cheerful
“Hello. How is it going?” – en-US-AriaNeural | Female | Empathetic

Hear the difference? The empathetic voice aligns more with how one might start to console a grieving friend, while the cheerful voice sounds like how you might say hi to your friends when you go to meet up for a drink.
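
For the curious, styles like these are typically requested through SSML. Here is a minimal sketch using the Azure Cognitive Services Speech SDK for Python and its <mstts:express-as> element; the key and region are placeholders, and, as before, this is illustrative rather than exactly what type-recorder runs.

```python
# Minimal sketch: synthesize the same phrase with two different speaking styles by
# wrapping it in SSML's <mstts:express-as> element (placeholder key/region below).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml_template = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="{style}">Hello. How is it going?</mstts:express-as>
  </voice>
</speak>
"""

for style in ("cheerful", "empathetic"):
    result = synthesizer.speak_ssml_async(ssml_template.format(style=style)).get()
    print(f"Style {style!r}: {result.reason}")
```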

Pretty interesting to think about.

Many systems these days are actually smart enough to select a style of speaking based on your interactions with them. Try telling Google or Alexa that you are sad and see how it responds. Compare that to how it responds when you tell it you are happy.

Thoughts? Do you see these stylistic variances in those or other settings with text-to-speech or synthetic speech systems? Leave your comments below.

Implying the Style moving forward

If you took a look at the styles available on type-recorder, you will have seen that there is a limited set of style options for a small set of voices (only two voices at the time of writing this article: en-US-AriaNeural and zh-CN-XiaoxiaoNeural). The blog post from Microsoft that I pointed to earlier is just over a month old. If you look at the Amazon documentation for Polly that I linked to earlier, you will see that they, too, only have a select set of styles available for a small group of voices. What does that mean? It means that these technologies are still well under development and only now being opened up to the general developer community. That, to me, is exciting because it means an opportunity for developers to build on this technology and innovate. More applications for this type of technology are within our grasp:

  • automated chat agents,
  • video game voices,
  • automated announcements in public spaces,
  • improved (more dynamically reactive) instructional tools,
  • virtual companions/assistants,
  • tools for the speech impaired,
  • and plenty of other applications that I’m sure I haven’t thought about.

Tying it back

Tying speaking styles back to the construction and synthesis of Neural Voices, these styles add another layer to the mix. On top of the original requirement to create neutral, natural-sounding speech output, neural text-to-speech systems now need to differentiate between the various tones and patterns used to express different emotions and to associate the appropriate patterns with the correct emotion/style. It’s going to require more data, training, and “learning,” but it is going to happen. Soon enough, we will start to see these systems incorporate a greater range of the speaking styles and emotions that we humans employ every day: anger, sarcasm, skepticism, nervousness, happiness, joy, elation, and so on.

Conclusion

I’m not sure that there is a truly profound conclusion here beyond the fact that text-to-speech, as complex and advanced as it is, continues to evolve. As computers become more comfortable and less fatiguing to interact with, I believe we will only talk to them more, the way we talk to people, to complete tasks, get information, or simply have a conversation… That is, until Neuralink embeds computers into our minds and we only need to think about what we want… Maybe we will end up communicating without speaking at all… Who knows!

Thanks for reading!