Prototyping Dialogue with Google Text-to-Speech

Prototyping scenes that rely on recorded voices is a challenging and time-consuming task. That's why we've automated the process using Google Text-to-Speech.

For a narrative-driven game like Arctic Awakening, it's important for our team at GoldFire Studios to be able to quickly get a feel for the flow and pacing of a scene. We do this well before we head into the recording studio with our voice actors, so we need a placeholder to stand in for the real recorded dialogue.

Developers have a few options for placeholder dialogue, including timed subtitles and scratch audio recorded by programmers (the aural equivalent of "programmer art"). We tried both of these early in development on Arctic Awakening before settling on a workflow using Google Cloud Text-to-Speech, which has proved a significant time-saver and given great results. Better still, our use case has fit within the product's free tier.

The Text-to-Speech product is a Cloud API that delivers natural-sounding voice clips from the text strings you provide. Besides the content itself, you can specify a language, one of several preset voices that determine the character of the speech, and a gender. The language code allows for different accents as well, for instance American, British, Indian, or Australian English.
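
To make this concrete, here's a minimal sketch (not taken from our codebase) of a single request using Google's official Node.js client, @google-cloud/text-to-speech. The sample line of text and output filename are placeholders; the voice is one of the presets from our configuration further down.

// Minimal sketch: synthesize one line of text with the official Node.js client.
// Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key file.
const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');

async function synthesizeSample() {
  const client = new textToSpeech.TextToSpeechClient();
  const [response] = await client.synthesizeSpeech({
    input: {text: 'This is a placeholder line of dialogue.'},
    // languageCode selects the language and accent; name selects a preset voice.
    voice: {languageCode: 'en-US', name: 'en-US-Wavenet-F', ssmlGender: 'FEMALE'},
    audioConfig: {audioEncoding: 'MP3'},
  });
  await fs.promises.writeFile('sample.mp3', response.audioContent, 'binary');
}

synthesizeSample().catch(console.error);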

We already had a database with the data we needed to get started (the line itself and the character who said it), so plugging that into Google's API was relatively quick and painless. Here's a snippet of the code from our dialogue management platform, StoryDB (which we'll talk more about in a later post):

// Assumed setup for this snippet: Node's fs module and Google's official
// client library. checkProjectAccess, getLines, req, res, projectId, and
// ids all come from the surrounding StoryDB route handler.
const fs = require('fs');
const textToSpeech = require('@google-cloud/text-to-speech');
const textToSpeechClient = new textToSpeech.TextToSpeechClient();

// Configuration for which voice goes with which character.
// List of voices available here: https://cloud.google.com/text-to-speech/docs/voices
const voices = {
  Alfie: {languageCode: 'en-US', name: 'en-US-Standard-I', ssmlGender: 'MALE'},
  Kai: {languageCode: 'en-US', name: 'en-US-Wavenet-B', ssmlGender: 'MALE'},
  Donovan: {languageCode: 'en-US', name: 'en-US-Wavenet-J', ssmlGender: 'MALE'},
  ATC: {languageCode: 'en-US', name: 'en-US-Standard-G', ssmlGender: 'FEMALE'},
  default: {languageCode: 'en-US', name: 'en-US-Wavenet-F', ssmlGender: 'FEMALE'},
};

checkProjectAccess(req.session.uid)
  .then(() => fs.promises.mkdir(`static/clips/${projectId}`, {recursive: true}))
  .then(() => getLines(ids))
  .then(async (ls) => {
    const generateLine = async (l) => {
      const input = {text: l.caption};
      const voice = voices[l.character] || voices.default;
      const audioConfig = {audioEncoding: 'LINEAR16', speakingRate: 1.25};

      // Perform the text-to-speech request and write the audio content to file.
      const [response] = await textToSpeechClient.synthesizeSpeech({input, voice, audioConfig});
      await fs.promises.writeFile(`static/clips/${projectId}/${l.id}.wav`, response.audioContent, 'binary');
    };

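    // allSettled so one failed clip doesn't abort the whole batch.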
    await Promise.allSettled(ls.map(generateLine));

    res.end();
  });
StoryDB's backend is written in JavaScript and runs on Node.js.

And here's a sample of what gets generated:

[Audio sample: a generated clip of placeholder dialogue]

From there, we jump into our game engine (Unity) and run a script which automates importing the line metadata and the audio clips. At that point, the clips themselves as well as the subtitles are ready to be used in a scene!
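
The Unity import script itself is C# and beyond the scope of this post, but to illustrate the handoff, here's a hypothetical sketch of the StoryDB side of it: writing a lines.json manifest next to the generated clips so an engine-side script can pair each audio file with its caption and character. The manifest name and fields here are illustrative, not our actual format.

// Hypothetical: write a manifest alongside the generated clips so an
// engine-side import script can pair each clip with its metadata.
const fs = require('fs');

function writeManifest(projectId, lines) {
  const manifest = lines.map((l) => ({
    id: l.id,               // matches the clip filename: `${l.id}.wav`
    character: l.character,
    caption: l.caption,     // doubles as the subtitle text
  }));
  return fs.promises.writeFile(
    `static/clips/${projectId}/lines.json`,
    JSON.stringify(manifest, null, 2)
  );
}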

We're really happy with the results, and if it sounds useful for your project, we'd definitely encourage you to give it a try. If you have any questions, drop a comment below and we'll answer to the best of our ability. We'll also be covering StoryDB in more detail in a future post.