Version: 4.55.1

Text To Speech

This guide covers the core concepts of how Text To Speech works in Lens Studio and Snapchat. Check out the Text To Speech Template for examples of using Text To Speech, and the 2D Animated TTS Template for examples of using phonemes in Lens Studio!

Currently, TTS supports US English with two voices, six different voice styles for each voice, and the ability to tweak the pace of TTS speech playback. Phoneme Info supports 12 different mouth shapes for 2D animation. With the Auto Voice Style Selector, the voice style changes based on the context of the text.

Text To Speech

Text To Speech Module

The main asset used for Text To Speech is the Text To Speech Module. To add it to your Lens, in the Resources panel choose + -> Text To Speech Module.

Connect the Script

In the Resources panel, select + -> Script. Then create an object in the scene. With the object selected, add a Script Component in the Inspector panel by selecting + Add Component -> Script. Click the + Add Script field and select the script resource we just created.

In the script, create references to the Text To Speech Module and an Audio Component:

// @input Asset.TextToSpeechModule tts {"label": "Text To Speech"}
// @input Component.AudioComponent audio

Audio Component

Next, we'll add an Audio Component to play the synthesized audio from the Text To Speech Module. Go back to the Inspector panel, then click + Add Component and select Audio Component.

TTS generates an AudioTrackAsset, which can be attached to an Audio Component as the Audio Track asset to play. For more information about the Audio Component, please check out Audio Component and the AudioComponent API.
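
For reference, the assignment looks like the sketch below; the complete flow, including where audioTrackAsset comes from, is shown later in this guide.

// audioTrackAsset is provided by the TTS completion callback (see later in this guide)
script.audio.audioTrack = audioTrackAsset;
script.audio.play(1);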

Attach to the script

Attach the Text To Speech Module and the Audio Component to the corresponding inputs of the script we just wrote.

Options

We use options to configure Text To Speech. To create options:

var options = TextToSpeech.Options.create();

Voice Name

You can define the voice name with options. TTS supports two voice names: Sasha and Sam. The default voice is TextToSpeech.VoiceNames.Sasha.

options.voiceName = TextToSpeech.VoiceNames.Sasha;

Voice Style

You can define the voice style with options. TTS supports six voice styles for both Sasha and Sam.

options.voiceStyle = TextToSpeech.VoiceStyles.One;

Automatic Voice Style Selector

You can also use the Automatic Voice Style Selector. With the Auto Style Selector, the voice style changes based on the context of the text.

options.voiceStyle = TextToSpeech.VoiceStyles.Auto;

Voice Pace

You can define the voice pace with options. TTS supports the following playback speeds: 75 = 0.75x, 100 = 1x, 125 = 1.25x, 150 = 1.5x.

options.voicePace = 100;
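
Putting it together, the options above can be combined before synthesizing. A minimal sketch using values from this guide:

// A consolidated options setup (the chosen values are illustrative)
var options = TextToSpeech.Options.create();
options.voiceName = TextToSpeech.VoiceNames.Sam; // or TextToSpeech.VoiceNames.Sasha
options.voiceStyle = TextToSpeech.VoiceStyles.Auto; // or a fixed style such as TextToSpeech.VoiceStyles.One
options.voicePace = 125; // 75 = 0.75x, 100 = 1x, 125 = 1.25x, 150 = 1.5x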

Define functions for Text To Speech Callbacks

OnTTSCompleteHandler: called once the audio generation is completed. It receives four parameters: the Audio Track Asset, WordInfos, PhonemeInfos, and the Voice Style.

var onTTSCompleteHandler = function (audioTrackAsset, wordInfos, phonemeInfos, voiceStyle) {
  ...
};

OnTTSErrorHandler: called if there is an error. It receives the error code and its description.

var onTTSErrorHandler = function (error, description) {
  print('Error: ' + error + ' Description: ' + description);
};

Generate speech (Audio Track Asset) from a given text

var text = 'show me you love cats, without telling me you love cats!';
script.tts.synthesize(text, options, onTTSCompleteHandler, onTTSErrorHandler);

Text input supports English only. Non-English characters will be stripped.
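
Since non-English characters are stripped automatically, you may want to warn users before synthesizing. A hypothetical pre-check (the regular expression and message are illustrative, not part of the TTS API):

// Hypothetical pre-check: warn if the input contains characters outside basic Latin,
// since they will be stripped from the synthesized speech.
if (/[^\x00-\x7F]/.test(text)) {
  print('Warning: non-English characters will be stripped from the TTS input.');
}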

Play TTS Audio

Once the audio generation has successfully completed, the OnTTSCompleteHandler will be called. From it we can get the TTS Audio Track Asset and play it with the Audio Component.

var onTTSCompleteHandler = function (
  audioTrackAsset,
  wordInfos,
  phonemeInfos,
  voiceStyle
) {
  print('TTS Success');
  script.audio.audioTrack = audioTrackAsset;
  script.audio.play(1);
};

Now save the script and reset the Preview panel. We should then see “TTS Success” in the Logger panel, as well as hear the TTS audio playing.

WordInfos

In addition to the TTS Audio Track Asset, we can also get word infos: timing details for how the words are pronounced by the synthesized voice.

var onTTSCompleteHandler = function (
  audioTrackAsset,
  wordInfos,
  phonemeInfos,
  voiceStyle
) {
  print('TTS Success');
  script.audio.audioTrack = audioTrackAsset;
  script.audio.play(1);
  for (var i = 0; i < wordInfos.length; i++) {
    print(
      'word: ' +
        wordInfos[i].word +
        ', startTime: ' +
        wordInfos[i].startTime.toString() +
        ', endTime: ' +
        wordInfos[i].endTime.toString()
    );
  }
};

Now save the script, and reset the Preview panel. We can then see the word infos in the Logger panel.

word: the word the synthesized audio was generated for. Because text might be expanded during the synthesis process, there can be a slight variation between the input text and the words returned.

startTime and endTime: the times, in milliseconds, at which the word starts and ends in the audio.
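
These timings can drive simple word-level captions during playback. A minimal sketch, assuming an additional (hypothetical) Text Component input named captionText and a script-level copy of wordInfos:

// Hypothetical caption example; captionText is an extra input, not part of this guide's setup
// @input Component.Text captionText
var words = []; // set this to wordInfos inside onTTSCompleteHandler

script.createEvent('UpdateEvent').bind(function () {
  var timeMs = script.audio.position * 1000; // current playback time in milliseconds
  for (var i = 0; i < words.length; i++) {
    if (timeMs >= words[i].startTime && timeMs <= words[i].endTime) {
      script.captionText.text = words[i].word;
      break;
    }
  }
});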

PhonemeInfos

In the script, create a reference to an Image Component:

// @input Component.Image image

In the Objects panel, click on the + and select Screen Image. Attach the Screen Image to the Script.

//@ui {"widget":"group_start", "label":"Animation Textures"}
// @input Asset.Texture neutralTexture
// @input Asset.Texture ahTexture
// @input Asset.Texture dTexture
// @input Asset.Texture eeTexture
// @input Asset.Texture fTexture
// @input Asset.Texture lTexture
// @input Asset.Texture mTexture
// @input Asset.Texture ohTexture
// @input Asset.Texture rTexture
// @input Asset.Texture sTexture
// @input Asset.Texture uhTexture
// @input Asset.Texture wOoTexture
//@ui {"widget":"group_end"}

Here are the 12 mouth shapes as a reference.

Next we attach different mouth shape textures to the texture fields. Currently Phoneme supports 12 different mouth shapes.

Let’s go back to the script to animate the textures based on phoneme info.

// Store the mouth shape textures in an array (the index order matches the v2i map below)
var textures = [
  script.neutralTexture,
  script.wOoTexture,
  script.wOoTexture,
  script.dTexture,
  script.eeTexture,
  script.fTexture,
  script.lTexture,
  script.mTexture,
  script.ohTexture,
  script.rTexture,
  script.sTexture,
  script.wOoTexture,
  script.ahTexture,
  script.ahTexture,
  script.uhTexture,
  script.uhTexture,
];

//Store Phoneme Info
var timeline = [];

//TTS Audio State
var ttsStart = false;

//Current Phoneme Count
var currentPhonemeCount = 0;

// Map each phoneme (and punctuation character) to a mouth shape name
var c2v = {
  '!': 'neutral', '?': 'neutral', '.': 'neutral', ',': 'neutral', ' ': 'neutral',
  '{@B}': 'm', '{@CH}': 's', '{@D}': 'd', '{@DH}': 'd', '{@DX}': 'oo1',
  '{@EL}': 'l', '{@EM}': 'm', '{@EN}': 'd', '{@F}': 'f', '{@G}': 'd',
  '{@HH}': 'e', '{@H}': 'oo', '{@JH}': 's', '{@K}': 'd', '{@L}': 'l',
  '{@M}': 'm', '{@N}': 'd', '{@NG}': 'd', '{@NX}': 'd', '{@P}': 'm',
  '{@Q}': 'd', '{@R}': 'r', '{@S}': 's', '{@SH}': 's', '{@T}': 'd',
  '{@TH}': 'l', '{@V}': 'f', '{@W}': 'o', '{@WH}': 'o', '{@Y}': 'l',
  '{@Z}': 's', '{@ZH}': 's',
  '{@AA0}': 'u1', '{@AE0}': 'e', '{@AH0}': 'u1', '{@AO0}': 'a1', '{@AW0}': 'o',
  '{@AX0}': 'oo1', '{@AXR0}': 'r', '{@AY0}': 'e', '{@EH0}': 'e', '{@ER0}': 'e',
  '{@EY0}': 'e', '{@IH0}': 'u1', '{@IX0}': 'e', '{@IY0}': 'u1', '{@OW0}': 'o',
  '{@OY0}': 'o', '{@UH0}': 'oo1', '{@UW0}': 'u1', '{@UX0}': 'u1',
  '{@AA1}': 'u1', '{@AE1}': 'e', '{@AH1}': 'u1', '{@AO1}': 'a1', '{@AW1}': 'o',
  '{@AX1}': 'oo1', '{@AXR1}': 'r', '{@AY1}': 'e', '{@EH1}': 'e', '{@ER1}': 'e',
  '{@EY1}': 'e', '{@IH1}': 'u1', '{@IX1}': 'e', '{@IY1}': 'u1', '{@OW1}': 'o',
  '{@OY1}': 'o', '{@UH1}': 'oo1', '{@UW1}': 'u1', '{@UX1}': 'u1',
  '{@AA2}': 'u1', '{@AE2}': 'e', '{@AH2}': 'u1', '{@AO2}': 'a1', '{@AW2}': 'o',
  '{@AX2}': 'oo1', '{@AXR2}': 'r', '{@AY2}': 'e', '{@EH2}': 'e', '{@ER2}': 'e',
  '{@EY2}': 'e', '{@IH2}': 'u1', '{@IX2}': 'e', '{@IY2}': 'u1', '{@OW2}': 'o',
  '{@OY2}': 'o', '{@UH2}': 'oo1', '{@UW2}': 'u1', '{@UX2}': 'u1',
};

// Map each mouth shape name to an index into the textures array
var v2i = {
  neutral: 0, oo1: 1, oo2: 2, d: 3, e: 4, f: 5, l: 6, m: 7,
  o: 8, r: 9, s: 10, oo: 11, a1: 12, a2: 13, u1: 14, u2: 15,
};
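
// Illustrative helper (our own, not part of the TTS API): resolve a phoneme string
// to an index into the textures array, falling back to the neutral mouth shape.
// For example, c2v maps '{@AA1}' to 'u1', and v2i maps 'u1' to index 14.
function phonemeToTextureIndex(phoneme) {
  var shapeName = c2v[phoneme];
  var index = v2i[shapeName];
  return index === undefined ? 0 : index;
}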

var onTTSCompleteHandler = function (
  audioTrackAsset,
  wordInfos,
  phonemeInfos,
  voiceStyle
) {
  print('TTS Success');
  script.audio.audioTrack = audioTrackAsset;
  script.audio.play(1);
  for (var i = 0; i < wordInfos.length; i++) {
    print(
      'word: ' +
        wordInfos[i].word +
        ', startTime: ' +
        wordInfos[i].startTime.toString() +
        ', endTime: ' +
        wordInfos[i].endTime.toString()
    );
  }
  // Store Phoneme Infos in the timeline
  timeline = [];
  for (var i = 0; i < phonemeInfos.length; i++) {
    timeline[i] = {
      char: phonemeInfos[i].phoneme,
      startTime: phonemeInfos[i].startTime,
      endTime: phonemeInfos[i].endTime,
    };
  }
  // Set the TTS Audio play state to true
  ttsStart = true;

  // Reset the current phoneme count
  currentPhonemeCount = 0;
};

// Offset (in milliseconds) before the end of the last phoneme at which we reset
// the mouth; the exact value is a tuning choice.
var endTimeOffset = 100;

// Update Event triggered every frame.
script.createEvent('UpdateEvent').bind(function (eventData) {
  if (!ttsStart) {
    return;
  }
  // Current TTS Audio playback time in milliseconds
  var currentTime = script.audio.position * 1000;
  // If TTS Audio has started to play
  if (currentTime > 0) {
    // Loop through the timeline array; if the current playback time is between a
    // phoneme's start and end time, set the texture mapped to that phoneme,
    // then move on to the next phoneme
    for (var i = currentPhonemeCount; i < timeline.length; i++) {
      if (
        currentTime >= timeline[i].startTime &&
        currentTime <= timeline[i].endTime
      ) {
        print('-->' + timeline[i].char);
        var currentChar = timeline[i].char;
        // c2v maps the phoneme (or punctuation) to a mouth shape name,
        // and v2i maps that name to an index into the textures array
        var texture = v2i[c2v[currentChar]];
        if (texture === undefined) {
          texture = 0;
        }
        script.image.mainPass.baseTex = textures[texture];
        currentPhonemeCount++;
        break;
      }
    }
  }
  // If it is almost the end of the TTS Audio, reset the texture to the neutral
  // texture and reset the TTS Audio state
  if (currentTime >= timeline[timeline.length - 1].endTime - endTimeOffset) {
    script.image.mainPass.baseTex = textures[v2i['neutral']];
    ttsStart = false;
  }
});

Now save the script and reset the Preview panel. We can then see the phoneme infos in the Logger panel, as well as the animated mouth.

Previewing Your Lens

You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.

Don’t forget to turn on the sound on your device!
