Choice of Voices

We’ve created this page as a living resource to supplement our research paper on TTS voice evaluation, published at the ACM Conference on Human Factors in Computing Systems (CHI) 2020.

Listening to long-form content is increasingly common, and the voice reading that content plays a significant role in shaping the user experience. In our work, we developed a method for evaluating the quality of voices for listening to long-form content. The paper reports on a large-scale evaluation of 21 voices (18 TTS, 3 human). Below, you’ll find clips of those voices and additional information beyond what appeared in the paper. We may also expand our original analysis to include additional voices in the future, and we welcome community contributions that use our method to evaluate other voices.

Listen to the TTS voices

In our research study, each participant listened to one of 21 voices reading the article “Reduce Your Stress in Two Minutes a Day.”

Here are brief clips of each of those voices reading the first two sentences of the article:

“Bill Rielly had it all: a degree from West Point, an executive position at Microsoft, strong faith, a great family life, and plenty of money. He even got along well with his in-laws!”

Voice Name	Clip
Android UK Male
Google A
Google C
Human 1
Human 2
Human 3
iOS
Judy GL1
Judy GL2
Judy W1
Judy W2
LJ Speech
Mac Default
Nancy 1
Nancy 2
Polly Joanna
Polly Matthew
Polly Sally
Voicery Nichole
Windows Zira
Windows David

Technical features

Which more “objective” properties of the voices may have contributed to the observed differences in listeners’ preferences? Although our focus is primarily on informing end users of TTS voices rather than speech synthesis engineers, we present the information below as an additional aid for selecting synthesized voices, particularly voices whose model details are known but for which no listening test has been conducted.

One important word of caution: our TTS recordings were generated through end-user APIs and, without direct contact with the developers, it is difficult to identify the synthesis models or technical features of the voices with absolute certainty. Please note that, except where indicated, the synthesis approach listed is the best guess of experts in the speech synthesis field.

The F0 and intensity values below were measured with Praat on the clips above, in which each voice reads the first two sentences of the article (each clip is roughly 10 seconds long).
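To make the two measurements more concrete, here is a minimal sketch of what an average F0 and an intensity value represent. It does not reproduce Praat's analysis: it uses a simplified autocorrelation peak search on a synthetic sine tone rather than a real voice clip, and the function names, the 8 kHz sample rate, the 75 to 500 Hz search range, and the 2e-5 reference level are all assumptions chosen for illustration.

```python
import math

SPL_REFERENCE = 2e-5  # assumed reference level for the dB value (illustrative)


def synth_tone(freq_hz, sample_rate=8000, duration_s=0.1, amplitude=0.1):
    """Generate a sine tone as a stand-in for a real voice clip."""
    n_samples = int(sample_rate * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 in Hz as the lag with the highest mean autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n_terms = len(samples) - lag
        score = sum(samples[i] * samples[i + lag] for i in range(n_terms)) / n_terms
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def intensity_db(samples, reference=SPL_REFERENCE):
    """Return the RMS level of the samples in dB relative to `reference`."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / reference)


tone = synth_tone(125.0)
print(f"F0 = {estimate_f0(tone, 8000):.1f} Hz")
print(f"Intensity = {intensity_db(tone):.1f} dB")
```

For the 125 Hz test tone, the F0 estimate should land at about 125 Hz. Real speech is harder: a practical pitch tracker also has to detect unvoiced frames and avoid octave errors, which this sketch makes no attempt to handle.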

Voice Name	Average F0 (Hz)	Average Intensity (dB)	Synthesis model	Source
Android UK Male	116.8	67.1	TBD
Google A	133.2	74.7	WaveNet	https://cloud.google.com/text-to-speech/docs/wavenet
Google C	160.7	75.1	WaveNet	https://cloud.google.com/text-to-speech/docs/wavenet
Human 1	126.9	68.1	N/A	N/A
Human 2	186.7	71.9	N/A	N/A
Human 3	119.6	67.4	N/A	N/A
iOS	166.3	77.5	TBD
Judy GL1	188.7	76.5	Tacotron + Griffin Lim	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
Judy GL2	197.3	72.7	Tacotron2 + Griffin Lim	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
Judy W1	187.3	76.9	Tacotron + WaveRNN	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
Judy W2	195.5	78.0	Tacotron2 + WaveRNN	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
LJ Speech	215.4	73.4	Tacotron + Griffin Lim	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
Mac Default	113.6	65.6	TBD
Nancy 1	197.7	75.2	Tacotron + Griffin Lim	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
Nancy 2	189.0	75.9	Tacotron2 + WaveRNN	https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results
Polly Joanna	155.3	72.6	TBD
Polly Matthew	99.6	72.8	TBD
Polly Sally	192.2	73.1	TBD
Voicery Nichole	194.0	68.2	TBD
Windows Zira	176.9	66.1	TBD
Windows David	91.9	66.7	TBD
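As one way to use the table above when choosing a voice, the sketch below loads its values into a list and filters by average F0 or by whether a synthesis model is listed. The values are copied from the table; the helper names and the tuple layout are hypothetical and only meant as an illustration.

```python
# (voice name, average F0 in Hz, average intensity in dB, synthesis model)
# Values copied from the table above; "TBD" and "N/A" are kept as listed.
VOICES = [
    ("Android UK Male", 116.8, 67.1, "TBD"),
    ("Google A", 133.2, 74.7, "WaveNet"),
    ("Google C", 160.7, 75.1, "WaveNet"),
    ("Human 1", 126.9, 68.1, "N/A"),
    ("Human 2", 186.7, 71.9, "N/A"),
    ("Human 3", 119.6, 67.4, "N/A"),
    ("iOS", 166.3, 77.5, "TBD"),
    ("Judy GL1", 188.7, 76.5, "Tacotron + Griffin Lim"),
    ("Judy GL2", 197.3, 72.7, "Tacotron2 + Griffin Lim"),
    ("Judy W1", 187.3, 76.9, "Tacotron + WaveRNN"),
    ("Judy W2", 195.5, 78.0, "Tacotron2 + WaveRNN"),
    ("LJ Speech", 215.4, 73.4, "Tacotron + Griffin Lim"),
    ("Mac Default", 113.6, 65.6, "TBD"),
    ("Nancy 1", 197.7, 75.2, "Tacotron + Griffin Lim"),
    ("Nancy 2", 189.0, 75.9, "Tacotron2 + WaveRNN"),
    ("Polly Joanna", 155.3, 72.6, "TBD"),
    ("Polly Matthew", 99.6, 72.8, "TBD"),
    ("Polly Sally", 192.2, 73.1, "TBD"),
    ("Voicery Nichole", 194.0, 68.2, "TBD"),
    ("Windows Zira", 176.9, 66.1, "TBD"),
    ("Windows David", 91.9, 66.7, "TBD"),
]


def voices_in_f0_range(voices, low_hz, high_hz):
    """Return names of voices whose average F0 is in [low_hz, high_hz], lowest first."""
    matches = [v for v in voices if low_hz <= v[1] <= high_hz]
    return [v[0] for v in sorted(matches, key=lambda v: v[1])]


def voices_with_known_model(voices):
    """Return names of voices that have a synthesis model listed in the table."""
    return [v[0] for v in voices if v[3] not in ("TBD", "N/A")]


print(voices_in_f0_range(VOICES, 90, 135))
print(voices_with_known_model(VOICES))
```

Because each value comes from a single clip of roughly 10 seconds, a filter like this is best treated as a rough first pass rather than a substitute for listening to the voices.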

Did we get something wrong? If you were involved in the development of any of these voices or notice an error, please let us know by filing an issue or submitting a pull request so we can correct it. We’d appreciate it!

Cite our work

BibTeX coming soon!

Choice of Voices

A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content
