We’ve created this page as a living resource to supplement our research paper on TTS voice evaluation, published at the ACM Conference on Human Factors in Computing Systems (CHI) 2020.

Listening to long-form content is increasingly common, and the voice reading that content plays a significant role in shaping the user experience. In our work, we developed a method for evaluating the quality of voices for listening to long-form content. The paper reports on a large-scale evaluation of 21 voices (18 TTS, 3 human). Below, you’ll find clips of those voices and additional information beyond what appeared in the paper. We may expand our original analysis with additional voices in the future, and we welcome community contributions that use our method to evaluate other voices.

Listen to the TTS voices

In our research study, each participant listened to one of 21 voices reading the article “Reduce Your Stress in Two Minutes a Day.”

Here are brief clips of each of those voices reading the first two sentences of the article:

“Bill Rielly had it all: a degree from West Point, an executive position at Microsoft, strong faith, a great family life, and plenty of money. He even got along well with his in-laws!”

Voice Name Clip
Android UK Male
Google A
Google C
Human 1
Human 2
Human 3
Judy GL1
Judy GL2
Judy W1
Judy W2
LJ Speech
Mac Default
Nancy 1
Nancy 2
Polly Joanna
Polly Matthew
Polly Sally
Voicery Nichole
Windows Zira
Windows David

Technical features

What, more “objectively,” may have contributed to the observed differences in listeners’ preferences? Our focus is primarily on informing end users of TTS voices rather than speech synthesis engineers. Still, we present the technical features below as one more input for selecting synthesized voices, particularly voices whose model details are known but for which no listening test has been conducted.

One important word of caution: our TTS recordings were generated through end-user APIs, and without direct contact with the developers it is difficult to identify the synthesis models or technical features of the voices with certainty. Except where indicated, the synthesis approach listed is the best guess of experts in the speech synthesis field.

The F0 and intensity values below were measured with Praat on the clips above, in which each voice reads the first two sentences of the article (each clip is about 10 seconds long).

| Voice Name | Average F0 (Hz) | Average Intensity (dB) | Synthesis model | Source |
| --- | --- | --- | --- | --- |
| UK Male | 116.8 | 67.1 | TBD | |
| Google A | 133.2 | 74.7 | WaveNet | https://cloud.google.com/text-to-speech/docs/wavenet |
| Google C | 160.7 | 75.1 | WaveNet | https://cloud.google.com/text-to-speech/docs/wavenet |
| Human 1 | 126.9 | 68.1 | N/A | N/A |
| Human 2 | 186.7 | 71.9 | N/A | N/A |
| Human 3 | 119.6 | 67.4 | N/A | N/A |
| iOS | 166.3 | 77.5 | TBD | |
| Judy GL1 | 188.7 | 76.5 | Tacotron + Griffin-Lim | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| Judy GL2 | 197.3 | 72.7 | Tacotron2 + Griffin-Lim | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| Judy W1 | 187.3 | 76.9 | Tacotron + WaveRNN | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| Judy W2 | 195.5 | 78.0 | Tacotron2 + WaveRNN | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| LJ Speech | 215.4 | 73.4 | Tacotron + Griffin-Lim | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| Mac Default | 113.6 | 65.6 | TBD | |
| Nancy 1 | 197.7 | 75.2 | Tacotron + Griffin-Lim | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| Nancy 2 | 189.0 | 75.9 | Tacotron2 + WaveRNN | https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results |
| Polly Joanna | 155.3 | 72.6 | TBD | |
| Polly Matthew | 99.6 | 72.8 | TBD | |
| Polly Sally | 192.2 | 73.1 | TBD | |
| Voicery Nichole | 194.0 | 68.2 | TBD | |
| Windows Zira | 176.9 | 66.1 | TBD | |
| Windows David | 91.9 | 66.7 | TBD | |
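
For readers who want a rough feel for what the two acoustic measures above capture, here is a minimal sketch in plain Python. It is not Praat's algorithm and it is not the procedure we used: Praat tracks pitch frame by frame and skips unvoiced stretches, whereas this sketch runs a single naive autocorrelation over the whole signal. The function names, the 75–500 Hz search range, and the synthetic sine tone standing in for a voice clip are all illustrative assumptions. The 2e-5 Pa reference follows the convention Praat documents for intensity, and the sketch assumes samples are expressed in pascals.

```python
import math

# Assumption: samples are in pascals and intensity is reported in dB
# relative to 2e-5 Pa, the reference Praat documents for intensity.
REF_PRESSURE = 2e-5


def synth_tone(freq_hz, duration_s=0.25, rate=16000, amp=0.02):
    """Hypothetical stand-in for a voice clip: a pure sine tone."""
    n = int(duration_s * rate)
    return [amp * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def mean_intensity_db(samples):
    """Mean energy of the whole signal, in dB relative to REF_PRESSURE."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square / REF_PRESSURE ** 2)


def estimate_f0(samples, rate=16000, fmin=75.0, fmax=500.0):
    """Naive F0 estimate: the lag with the highest normalized autocorrelation.

    The search is limited to lags corresponding to fmin..fmax (illustrative
    bounds). Real speech needs framing and voicing decisions, omitted here.
    """
    min_lag = int(rate / fmax)
    max_lag = int(rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        terms = len(samples) - lag
        # Average product of the signal with a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag] for i in range(terms)) / terms
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag


if __name__ == "__main__":
    tone = synth_tone(120.0)
    print(f"F0 estimate: {estimate_f0(tone):.1f} Hz")
    print(f"Mean intensity: {mean_intensity_db(tone):.1f} dB")
```

On the synthetic 120 Hz tone, the estimate should land close to 120 Hz, limited by the whole-sample lag resolution, and the intensity should come out near 57 dB. Values for real recordings will differ from Praat's because of the simplifications noted above.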

Did we get something wrong? If you were involved in the development of any of these voices or notice an error, please let us know by filing an issue or submitting a pull request so we can correct it. We’d appreciate it!

Cite our work

BibTeX coming soon!