Version: 5.x
Supported on
Snapchat
Spectacles

Text To Speech

This guide covers the core concepts of how Text To Speech (TTS) works in Lens Studio and Snapchat.

Currently, TTS supports US English with two voices, six voice styles for each voice, and the ability to adjust the pace of TTS speech playback. Phoneme Info supports 12 different mouth shapes for 2D animation. With the Auto Voice Style Selector, the voice style changes based on the context of the text.

This guide will teach you how to set up an example from scratch.

Check out a ready-to-go Text To Speech Example in the Lens Studio Asset Library. Import the asset to your project, create a new Orthographic Camera, and place the prefab under it.

Also try the 2D Animated TTS sample project, available on the Lens Studio home page, for examples of using Phoneme Info in Lens Studio.


Scripting

In the Asset Browser panel, select + -> JavaScript File. Drag the script from the Asset Browser panel to the Scene Hierarchy panel to create a scene object with the script attached.

Double-click the script to open it for editing, then add the following line to reference the Text To Speech Module:

const TTS = require('LensStudio:TextToSpeechModule');

Audio Component

Next, let's add an Audio Component to play the synthesized audio from the Text To Speech Module.

  • Go back to the Inspector panel, click Add Component, and select Audio Component.

  • Add a new script input to reference this component:

    // @input Component.AudioComponent audio
  • Set the script input with the Audio Component you just added.

TTS will generate an AudioTrackAsset, which can be played using the Audio Component we just created.

Options

We use the options object to configure the Text To Speech Module. To create options:

var options = TextToSpeech.Options.create();

Voice Name

You can set the voice name in options. TTS supports two voices: Sasha and Sam. The default voice is TextToSpeech.VoiceNames.Sasha.

options.voiceName = TextToSpeech.VoiceNames.Sasha;

Voice Style

You can set the voice style in options. TTS supports six voice styles for both Sasha and Sam.

options.voiceStyle = TextToSpeech.VoiceStyles.One;

Automatic Voice Style Selector

You can also use the Automatic Voice Style Selector. With the Auto Style Selector, the voice style changes based on the context of the text.

options.voiceStyle = TextToSpeech.VoiceStyles.Auto;

Voice Pace

You can set the voice pace in options. TTS supports four playback speeds: 75 = 0.75x, 100 = 1x, 125 = 1.25x, 150 = 1.5x.

options.voicePace = 100;
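Only these four pace values are supported, so if the pace comes from a slider or another dynamic source, a small helper can snap the requested value to the nearest supported one before assigning it to options.voicePace. This is a sketch of a hypothetical helper, not part of the TTS API:

```javascript
// The four playback speeds the TTS module supports
var SUPPORTED_PACES = [75, 100, 125, 150];

// Hypothetical helper: snap an arbitrary requested pace to the
// nearest supported value.
function nearestSupportedPace(requested) {
    var best = SUPPORTED_PACES[0];
    var bestDist = Math.abs(requested - best);
    for (var i = 1; i < SUPPORTED_PACES.length; i++) {
        var dist = Math.abs(requested - SUPPORTED_PACES[i]);
        if (dist < bestDist) {
            best = SUPPORTED_PACES[i];
            bestDist = dist;
        }
    }
    return best;
}

// e.g. options.voicePace = nearestSupportedPace(130); // snaps to 125
```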

Callbacks

Let's define functions to handle successful and unsuccessful audio synthesis.

onTTSCompleteHandler: called once audio generation completes. It receives four parameters: the Audio Track Asset, Word Infos, Phoneme Infos, and Voice Style.

var onTTSCompleteHandler = function (audioTrackAsset, wordInfos, phonemeInfos, voiceStyle) {
    // ...
};

onTTSErrorHandler: called if an error occurs. It receives the error code and its description.

var onTTSErrorHandler = function (error, description) {
    print('Error: ' + error + ' Description: ' + description);
};

Generate AudioTrack Asset

var text = 'show me you love cats, without telling me you love cats!';
TTS.synthesize(text, options, onTTSCompleteHandler, onTTSErrorHandler);

The text input supports English text only; non-English characters will be stripped.
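Because stripping happens during synthesis, on-screen text can end up differing from the spoken audio. The exact server-side stripping rules are not documented, so the helper below is only a rough local approximation to preview what is likely to survive:

```javascript
// Sketch: approximate the stripping the TTS service applies by keeping
// basic Latin letters, digits, spaces, and common punctuation, then
// collapsing any leftover runs of whitespace.
function stripNonEnglish(text) {
    return text
        .replace(/[^A-Za-z0-9 .,!?'"-]/g, '')
        .replace(/\s+/g, ' ')
        .trim();
}
```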

Play Audio

Once the audio generation completes successfully, onTTSCompleteHandler is called with the TTS Audio Track Asset, which we can then play with the Audio Component.

var onTTSCompleteHandler = function (
    audioTrackAsset,
    wordInfos,
    phonemeInfos,
    voiceStyle
) {
    print('TTS Success');
    script.audio.audioTrack = audioTrackAsset;
    script.audio.play(1);
};

Now save the script and reset the Preview panel. You should see "TTS Success" in the Logger panel and hear the TTS audio playing.

Word Infos

In addition to the TTS Audio Track Asset, we also get word infos: timing details for how the words are pronounced by the synthesized voice.

var onTTSCompleteHandler = function (
    audioTrackAsset,
    wordInfos,
    phonemeInfos,
    voiceStyle
) {
    print('TTS Success');
    script.audio.audioTrack = audioTrackAsset;
    script.audio.play(1);
    for (var i = 0; i < wordInfos.length; i++) {
        print(
            'word: ' +
                wordInfos[i].word +
                ', startTime: ' +
                wordInfos[i].startTime.toString() +
                ', endTime: ' +
                wordInfos[i].endTime.toString()
        );
    }
};

Now save the script and reset the Preview panel. You should see the word infos in the Logger panel.

wordInfos contains the words the synthesized audio was generated for. Because the text may be expanded during synthesis, there can be slight variation between the input text and the words returned.

startTime and endTime indicate, in milliseconds, when the word starts and ends in the audio.
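As a sketch of how these timings can be used, the hypothetical helper below returns the index of the word being spoken at a given playback position, e.g. to highlight words caption-style. Note that wordInfos timings are in milliseconds, while the Audio Component's playback position is in seconds:

```javascript
// Return the index of the word being spoken at the given playback
// position (in seconds), or -1 if no word is active at that moment.
// wordInfos entries carry startTime/endTime in milliseconds, as
// returned by the TTS module.
function currentWordIndex(wordInfos, positionSeconds) {
    var ms = positionSeconds * 1000;
    for (var i = 0; i < wordInfos.length; i++) {
        if (ms >= wordInfos[i].startTime && ms <= wordInfos[i].endTime) {
            return i;
        }
    }
    return -1;
}
```

You could call this from an UpdateEvent with script.audio.position to drive a caption highlight.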

Phoneme Infos

The TTS module allows you to obtain additional information about phonemes, enabling the creation of talking animations. Currently, Phoneme Info supports 12 different mouth shapes. Let's set up a simple talking animation using a set of images. To do this:

  • In the script add an Image Component input:

    // @input Component.Image image
  • Add a new Screen Image in the Scene Hierarchy panel and set the Image input of our script component:

  • Add Texture inputs to the script:

    // @ui {"widget":"group_start", "label":"Animation Textures"}
    // @input Asset.Texture neutralTexture
    // @input Asset.Texture ahTexture
    // @input Asset.Texture dTexture
    // @input Asset.Texture eeTexture
    // @input Asset.Texture fTexture
    // @input Asset.Texture lTexture
    // @input Asset.Texture mTexture
    // @input Asset.Texture ohTexture
    // @input Asset.Texture rTexture
    // @input Asset.Texture sTexture
    // @input Asset.Texture uhTexture
    // @input Asset.Texture wOoTexture
    // @ui {"widget":"group_end"}

    Prepare 12 mouth shape texture files. Here is the reference set we'll be using:

  • Import textures to Lens Studio and set script inputs with corresponding mouth shape textures.

Let's go back to the script to animate the textures based on phoneme info. Copy the following code and add it to your script:

// Store textures in an array; the order matches the indices in v2i below
var textures = [
    script.neutralTexture, // 0: neutral
    script.wOoTexture, // 1: oo1
    script.wOoTexture, // 2: oo2
    script.dTexture, // 3: d
    script.eeTexture, // 4: e
    script.fTexture, // 5: f
    script.lTexture, // 6: l
    script.mTexture, // 7: m
    script.ohTexture, // 8: o
    script.rTexture, // 9: r
    script.sTexture, // 10: s
    script.wOoTexture, // 11: oo
    script.ahTexture, // 12: a1
    script.ahTexture, // 13: a2
    script.uhTexture, // 14: u1
    script.uhTexture, // 15: u2
];

//Store Phoneme Info
var timeline = [];

//TTS Audio State
var ttsStart = false;

//Current Phoneme Count
var currentPhonemeCount = 0;

// Map each phoneme to a viseme (mouth shape) name
var c2v = {
    // Punctuation and spaces map to the neutral mouth shape
    '!': 'neutral', '?': 'neutral', '.': 'neutral', ',': 'neutral', ' ': 'neutral',
    // Consonants
    '{@B}': 'm', '{@CH}': 's', '{@D}': 'd', '{@DH}': 'd', '{@DX}': 'oo1',
    '{@EL}': 'l', '{@EM}': 'm', '{@EN}': 'd', '{@F}': 'f', '{@G}': 'd',
    '{@HH}': 'e', '{@H}': 'oo', '{@JH}': 's', '{@K}': 'd', '{@L}': 'l',
    '{@M}': 'm', '{@N}': 'd', '{@NG}': 'd', '{@NX}': 'd', '{@P}': 'm',
    '{@Q}': 'd', '{@R}': 'r', '{@S}': 's', '{@SH}': 's', '{@T}': 'd',
    '{@TH}': 'l', '{@V}': 'f', '{@W}': 'o', '{@WH}': 'o', '{@Y}': 'l',
    '{@Z}': 's', '{@ZH}': 's',
    // Vowels, stress level 0
    '{@AA0}': 'u1', '{@AE0}': 'e', '{@AH0}': 'u1', '{@AO0}': 'a1', '{@AW0}': 'o',
    '{@AX0}': 'oo1', '{@AXR0}': 'r', '{@AY0}': 'e', '{@EH0}': 'e', '{@ER0}': 'e',
    '{@EY0}': 'e', '{@IH0}': 'u1', '{@IX0}': 'e', '{@IY0}': 'u1', '{@OW0}': 'o',
    '{@OY0}': 'o', '{@UH0}': 'oo1', '{@UW0}': 'u1', '{@UX0}': 'u1',
    // Vowels, stress level 1
    '{@AA1}': 'u1', '{@AE1}': 'e', '{@AH1}': 'u1', '{@AO1}': 'a1', '{@AW1}': 'o',
    '{@AX1}': 'oo1', '{@AXR1}': 'r', '{@AY1}': 'e', '{@EH1}': 'e', '{@ER1}': 'e',
    '{@EY1}': 'e', '{@IH1}': 'u1', '{@IX1}': 'e', '{@IY1}': 'u1', '{@OW1}': 'o',
    '{@OY1}': 'o', '{@UH1}': 'oo1', '{@UW1}': 'u1', '{@UX1}': 'u1',
    // Vowels, stress level 2
    '{@AA2}': 'u1', '{@AE2}': 'e', '{@AH2}': 'u1', '{@AO2}': 'a1', '{@AW2}': 'o',
    '{@AX2}': 'oo1', '{@AXR2}': 'r', '{@AY2}': 'e', '{@EH2}': 'e', '{@ER2}': 'e',
    '{@EY2}': 'e', '{@IH2}': 'u1', '{@IX2}': 'e', '{@IY2}': 'u1', '{@OW2}': 'o',
    '{@OY2}': 'o', '{@UH2}': 'oo1', '{@UW2}': 'u1', '{@UX2}': 'u1',
};
// Map each viseme name to its index in the textures array
var v2i = {
    neutral: 0, oo1: 1, oo2: 2, d: 3, e: 4, f: 5, l: 6, m: 7,
    o: 8, r: 9, s: 10, oo: 11, a1: 12, a2: 13, u1: 14, u2: 15,
};

var onTTSCompleteHandler = function (
    audioTrackAsset,
    wordInfos,
    phonemeInfos,
    voiceStyle
) {
    print('TTS Success');
    script.audio.audioTrack = audioTrackAsset;
    script.audio.play(1);
    for (var i = 0; i < wordInfos.length; i++) {
        print(
            'word: ' +
                wordInfos[i].word +
                ', startTime: ' +
                wordInfos[i].startTime.toString() +
                ', endTime: ' +
                wordInfos[i].endTime.toString()
        );
    }
    // Store Phoneme Infos to the timeline
    timeline = [];
    for (var i = 0; i < phonemeInfos.length; i++) {
        timeline[i] = {
            char: phonemeInfos[i].phoneme,
            startTime: phonemeInfos[i].startTime,
            endTime: phonemeInfos[i].endTime,
        };
    }
    // Set the TTS Audio play state to true
    ttsStart = true;

    // Reset the current phoneme count
    currentPhonemeCount = 0;
};

// Offset (in ms) before the end of the audio at which we reset to neutral
var endTimeOffset = 100;

// Update Event, triggered every frame
script.createEvent('UpdateEvent').bind(function (eventData) {
    if (!ttsStart) {
        return;
    }
    // Current TTS Audio playback time in milliseconds
    var currentTime = script.audio.position * 1000;
    // If TTS Audio has started to play
    if (currentTime > 0) {
        // Walk the timeline; if the current playback time falls between a
        // phoneme's start and end time, show the matching texture, then
        // advance to the next phoneme
        for (var i = currentPhonemeCount; i < timeline.length; i++) {
            if (
                currentTime >= timeline[i].startTime &&
                currentTime <= timeline[i].endTime
            ) {
                print('-->' + timeline[i].char);
                // Map the phoneme to a viseme name, then to a texture index
                var textureIndex = v2i[c2v[timeline[i].char]];
                if (textureIndex === undefined) {
                    textureIndex = 0; // fall back to the neutral texture
                }
                script.image.mainPass.baseTex = textures[textureIndex];
                currentPhonemeCount++;
                break;
            }
        }
    }
    // Near the end of the TTS Audio, reset to the neutral texture and
    // clear the TTS Audio state
    if (currentTime >= timeline[timeline.length - 1].endTime - endTimeOffset) {
        script.image.mainPass.baseTex = textures[v2i['neutral']];
        ttsStart = false;
    }
});

Now save the script and reset the Preview panel. You should see the phoneme infos in the Logger panel, as well as the animated mouth.
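The Update handler above scans the timeline linearly from currentPhonemeCount, which works well for forward-only playback. If you ever need random access into the timeline (for example, after seeking), a binary search over the sorted phoneme times is one possible alternative. This is a sketch, not part of the TTS API:

```javascript
// Sketch: binary-search the phoneme timeline for the entry active at a
// given playback time (in ms). Assumes entries are sorted by startTime,
// which matches the order phonemeInfos are returned in. Returns the
// entry's index, or -1 if the time falls in a gap between phonemes.
function findPhonemeIndex(timeline, timeMs) {
    var lo = 0;
    var hi = timeline.length - 1;
    while (lo <= hi) {
        var mid = (lo + hi) >> 1;
        if (timeMs < timeline[mid].startTime) {
            hi = mid - 1;
        } else if (timeMs > timeline[mid].endTime) {
            lo = mid + 1;
        } else {
            return mid;
        }
    }
    return -1;
}
```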

Please check out the Animated TTS example!

Previewing Your Lens

You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.

Don’t forget to turn on the sound on your device!
