We’ve created this page as a living resource to supplement our research paper on TTS voice evaluation, published at the ACM Conference on Human Factors in Computing Systems (CHI) 2020.
Listening to long-form content is increasingly common, and the voice reading that content plays a significant role in shaping the user experience. In our work, we developed a method for evaluating the quality of voices for listening to long-form content. The paper reports on a large-scale evaluation of 21 voices (18 TTS, 3 human). Below, you’ll find examples of those voice clips and additional information beyond what appeared in the paper. We may also expand our original analysis by including additional voices in the future, and we welcome community contributions that use our method to evaluate other voices.
Listen to the TTS voices
In our research study, each participant listened to one of 21 voices reading the article “Reduce Your Stress in Two Minutes a Day.”
Here are brief clips of each of those voices reading the first two sentences of the article:
“Bill Rielly had it all: a degree from West Point, an executive position at Microsoft, strong faith, a great family life, and plenty of money. He even got along well with his in-laws!”
|Android UK Male|
What, more “objectively,” may have contributed to the observed differences in listeners’ preferences? While our focus is primarily on informing end users of TTS voices rather than speech synthesis engineers, we present the acoustic and model information below as another factor that may help in selecting synthesized voices, particularly voices whose model details are known but for which no listening test has been conducted.
One important word of caution: our TTS recordings were generated through end-user APIs and, without direct contact with the developers, it is difficult to identify the synthesis models or technical features of the voices with absolute certainty. Please note that, except where indicated, the synthesis approach listed is based on the best guess of experts in the speech synthesis field.
The F0 and intensity values below were measured using Praat on the clips above, in which each voice reads the first two sentences of the article (~10 seconds each).
|Voice Name||Average F0 (Hz)||Average Intensity (dB)||Synthesis model||Source|
|Judy GL1||188.7||76.5||Tacotron + Griffin Lim||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
|Judy GL2||197.3||72.7||Tacotron2 + Griffin Lim||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
|Judy W1||187.3||76.9||Tacotron + WaveRNN||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
|Judy W2||195.5||78.0||Tacotron2 + WaveRNN||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
|LJ Speech||215.4||73.4||Tacotron + Griffin Lim||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
|Nancy 1||197.7||75.2||Tacotron + Griffin Lim||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
|Nancy 2||189.0||75.9||Tacotron2 + WaveRNN||https://github.com/mozilla/TTS/wiki/Mean-Opinion-Score-Results|
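For readers unfamiliar with the two acoustic columns above, the following is a minimal sketch of what an average F0 and an average intensity represent. It is not Praat's algorithm and not the procedure used for the table: Praat analyzes many short windowed frames and averages F0 over voiced frames only, whereas this sketch takes a single autocorrelation peak over a synthetic tone. The function names, the 75–500 Hz search range, and the 2e-5 reference (which assumes samples are expressed in pascals) are illustrative assumptions.

```python
import math

def mean_intensity_db(samples, reference=2e-5):
    """Mean intensity in dB relative to `reference` (assumed: samples in pascals)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square / reference ** 2)

def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Single-frame F0 estimate: the lag with the largest raw autocorrelation.

    Simplified on purpose: no windowing, no voicing decision, no octave-error
    handling. The search range is a hypothetical default.
    """
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a voiced frame: a 200 Hz sine, 0.1 s at 16 kHz.
SAMPLE_RATE = 16000
tone = [0.1 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]

f0 = estimate_f0(tone, SAMPLE_RATE)
intensity = mean_intensity_db(tone)
print(f"F0 ~ {f0:.1f} Hz, intensity ~ {intensity:.1f} dB")
```

On real speech, a single-frame estimate like this is unreliable, so it should be read as an explanation of the quantities rather than a way to reproduce the values in the table.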
Did we get something wrong? If you were involved in the development of any of these voices or notice an error, please let us know by filing an issue or submitting a pull request so we can correct it. We’d appreciate it!
Cite our work
BibTeX coming soon!