If you didn’t read my post on Neural Voices and Punctuation, you can check it out here:
If you don’t want to read it, though, it’s cool. The gist is that Neural Voices are synthetic voices powered by (usually large) computer systems. These computer systems use Neural Voices to synthesize human-like (some might say “realistic”) speech leveraging Deep Learning and Artificial Intelligence. Computers are able to do this through speech pattern emulation. So, when generating speech from predefined text, the computer emulates speech patterns using contextual clues like words and punctuation to decide what sounds to make and tones to use when synthesizing the speech.
This article by Microsoft on text-to-speech synthesis in .NET breaks down the ideas of deconstructing speech and artificially reconstructing speech, and it highlights some of the concepts that I mention in my previous article on this subject. Some of those concepts include the idea that speech is composed of basic building blocks (phonemes), and that computers can “learn” to construct speech with phonemes by deconstructing large volumes of training data – speech samples with defined text values.
The amazing thing about these powerful cloud computing systems is that they can perform mind-boggling calculations at incredibly high speeds… they maybe can’t handle computations as complex as the ones tasked to the fabled Deep Thought, but we are headed in the right direction, I think. The computers of today, though, are fast enough to, for example, give us the fastest route from New York to Los Angeles in seconds. There is a plethora of possible routes that you could take to get from New York to Los Angeles. A computer scientist might even say that there are millions or more routes when you consider all of the possible turns that you could make at every intersection. Most of these routes would be inefficient, having you drive into Canada and/or Mexico and probably up and down the continental United States on the way. The point is that even after initially filtering out most of the poor options, there are so many potential routes that it would probably take a human a good few minutes, if not longer, to consider the options and pick the best one (especially if the human is considering things like construction and traffic). Knowing that a software program like Google Maps, when given two random points on a map, can determine the fastest route between them in a matter of seconds is amazing!
I digress here because I got caught up thinking about my algorithms courses from college and the shortest path problem (Dijkstra’s algorithm, anyone?). However, I also want to illustrate the point that there is a range of ways that a written sentence may be read aloud when considering things like prosody, tone, inflection, mood, and so on. There are multiple different “routes” one could take when reading the sentence (if you will allow that analogy). It is the job of the neural system to select the best possible “route” of speech to represent the text that is given.
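For anyone who wants the shortest path idea made concrete, here is a minimal sketch of Dijkstra’s algorithm in Python. The road network, city names, and driving hours are all made up for illustration, and this is not a claim about how Google Maps or any real routing service works; those systems use far more elaborate techniques and live data.

```python
import heapq


def shortest_path(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if unreachable."""
    # Priority queue of (cost so far, current node, path taken to reach it).
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical toy road network; the weights are invented driving hours.
ROADS = {
    "New York": {"Chicago": 12, "Atlanta": 13},
    "Chicago": {"Denver": 15, "Dallas": 14},
    "Atlanta": {"Dallas": 12},
    "Denver": {"Los Angeles": 15},
    "Dallas": {"Los Angeles": 21},
    "Los Angeles": {},
}

cost, route = shortest_path(ROADS, "New York", "Los Angeles")
print(cost, route)
```

The algorithm always expands the cheapest partial route found so far, which is why it never has to enumerate the millions of bad options one by one.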
In many cases, though, having the text isn’t enough to accurately read it aloud and express its desired message… Think of all the times you’ve probably misinterpreted a text 📱… What does she mean by “It’s fine.”? People very often need additional information to accurately read text aloud and properly express the desired emotion and message.
Set the Style
Hello, how is it going? Think of the different ways that this could be said.
Take the phrase above, “Hello. How is it going?” It’s simple, and you probably have already spoken it to yourself in your head. Yet, it, like many sentences, can be said in multiple ways. How would you say those sentences if you were talking to someone who had recently suffered a great loss? Or how might you say it to a stranger compared to a good friend?
Having the context of the situation allows you to more accurately express yourself, and for that reason, modern text-to-speech system developers are allowing inputs for “style” or disposition. The big players like Microsoft, Google, and Amazon are enhancing their Neural text-to-speech capabilities so that developers can specify a style of speech and so that, ultimately, the text-to-speech systems can become ever more realistic and applicable.
- Amazon Polly – NTTS Speaking Styles
- Azure Cognitive Services – Voice Styles – There are some cool samples in this post that illustrate the power of speaking styles – you can test out these and other samples on type-recorder.
- Expressive Speech Synthesis with Tacotron (Google) – This post gets into the idea of introducing style and various speaking factors in TTS.
Test it out
Want to get a look at how some of these styles compare? Check out the neural, female en-US-AriaNeural or zh-CN-XiaoxiaoNeural voices on type-recorder. You will be presented with a few style options to select from.
Let’s look at the phrase from earlier, “Hello. How is it going?” Using type-recorder, I’ve recorded the phrase with a “Cheerful” style and with an “Empathetic” style.
Hear the difference? The empathetic voice aligns more closely with how one might start to console a grieving friend, while the cheerful voice sounds like how you might say hi to your friends when you go to meet up for a drink.
Pretty interesting to think about.
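If you are curious what specifying a style can look like for a developer, here is a rough sketch in Python. It only builds an SSML string using the mstts:express-as element that Azure uses for speaking styles, as I understand the format; it does not call any speech service. The function name build_styled_ssml is my own invention, and the exact namespaces and style names should be checked against the current Azure documentation before relying on them.

```python
from xml.sax.saxutils import escape


def build_styled_ssml(text, voice="en-US-AriaNeural", style="cheerful"):
    """Wrap text in SSML that asks the given voice to speak in the given style.

    Hypothetical helper for illustration; it only formats a string.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        # escape() protects characters such as & and < in the spoken text.
        f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'
        "</voice>"
        "</speak>"
    )


# Same words, two different dispositions.
cheerful = build_styled_ssml("Hello. How is it going?", style="cheerful")
empathetic = build_styled_ssml("Hello. How is it going?", style="empathetic")
print(cheerful)
print(empathetic)
```

The two strings differ only in the style attribute, which mirrors the idea above: the text stays the same and the requested disposition changes how it is read.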
Many systems these days are actually smart enough to select a style of speaking based on your interactions with them. Try telling Google or Alexa that you are sad and see how it responds. Compare that to how it responds when you tell it you are happy.
Thoughts? Do you see these stylistic variances in those or other settings with text-to-speech or synthetic speech systems? Leave your comments below.
Implying the Style moving forward
If you decided to take a look at the styles available on type-recorder, you would have seen that there are a limited number of style options for a small set of voices (only two voices at the time of writing this article: en-US-AriaNeural and zh-CN-XiaoxiaoNeural). The blog post from Microsoft that I pointed to earlier is only just over a month old now. If you look at the Amazon documentation for Polly that I linked to earlier, you will see that they too have only a select set of styles available for a small group of voices. What does that mean? It means that these technologies are still well under development and only now being opened up to the general developer community. That, to me, is exciting because it means an opportunity for developers to build on this technology and innovate. More applications for this type of technology are within our grasp:
- automated chat agents,
- video game voices,
- automated announcements in public spaces,
- improved (more dynamically reactive) instructional tools,
- virtual companions/assistants,
- tools for the speech impaired,
- and plenty of other applications that I’m sure I haven’t thought about.
Tying it back
Tying speaking styles back to the construction and synthesis of Neural Voices, these styles add another layer to the mix. Neural text-to-speech systems are now required to differentiate between the various tones and patterns used to express different emotions and associate the appropriate patterns with the correct emotion/style, on top of the original requirement to create neutral, natural-sounding speech output. It’s going to require more data, training, and “learning”, but it is going to happen. Soon enough, we will start to see these systems incorporating a greater range of the speaking styles and emotions that we humans employ every day: anger, sarcasm, skepticism, nervousness, happiness, joy, elation, and so on.
I’m not sure that there is truly a profound conclusion here beyond the fact that text-to-speech, as complex and advanced as it is, is continuing to evolve. As computers become more comfortable and less fatiguing to interact with, I believe that we will only continue to talk to computers like people to complete tasks, get information, or simply have a conversation… That is, until Neuralink embeds computers into our minds and we only need to think about what we want… Maybe we will end up communicating without speaking… Who knows!
Thanks for reading!