Text To Speech
This guide covers the core concepts of how Text To Speech works in Lens Studio and Snapchat. Check out the Text To Speech Template for examples of using Text To Speech, and the 2D Animated TTS Template for examples of using Phoneme in Lens Studio!
Currently TTS supports US English with two voices, six voice styles for each voice, and the ability to tweak the pace of TTS speech playback. Phoneme Info supports 12 different mouth shapes for 2D animation. With the Automatic Voice Style Selector, the voice style changes based on the context.
Text To Speech
Text To Speech Module
The main asset used for Text To Speech is the Text To Speech Module. We can find it in the Asset Browser panel. To add it to your Lens, in the Asset Browser panel choose + -> Text To Speech Module.
Connect the Script
In the Asset Browser panel, select + -> Script. Then create an object in the scene. With the object selected, add a Script Component in the Inspector panel: select + Add Component -> Script. Click the + Add Script field and select the script resource we just created.
In the script, create references to the Text To Speech Module and an Audio Component:
// @input Asset.TextToSpeechModule tts {"label": "Text To Speech"}
// @input Component.AudioComponent audio
Audio Component
Next we'll add an Audio Component to play the synthesized audio from the Text To Speech module. Go back to the Inspector panel, click + Add Component, and select Audio Component.
TTS generates an AudioTrackAsset, which can be attached to an Audio Component as its Audio Track asset to play. For more information, check out the Audio Component guide and the AudioComponent API.
Attach to the script
Attach the Text To Speech Module and the Audio Component to the Script Component's inputs to connect them to the script we wrote.
Options
Options are used to configure Text To Speech. To create an options object:
var options = TextToSpeech.Options.create();
Voice Name
You can define the voice name with options. TTS supports two voice names: Sasha and Sam. The default voice is TextToSpeech.VoiceNames.Sasha.
options.voiceName = TextToSpeech.VoiceNames.Sasha;
Voice Style
You can define the voice styles with options. TTS supports six voice styles for Sasha and Sam.
options.voiceStyle = TextToSpeech.VoiceStyles.One;
Automatic Voice Style Selector
You can also use the Automatic Voice Style Selector. With the Auto Style Selector, the voice style changes based on the context.
options.voiceStyle = TextToSpeech.VoiceStyles.Auto;
Voice Pace
You can define the voice pace with options. TTS supports four playback speeds: 75 = 0.75x, 100 = 1x, 125 = 1.25x, 150 = 1.5x.
options.voicePace = 100;
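Since only these four pace values are supported, arbitrary speed multipliers need to be mapped onto one of them before being assigned. A small plain-JavaScript helper (hypothetical, not part of the TTS API) that snaps a multiplier to the nearest supported pace:

```javascript
// Supported TTS pace values (75 = 0.75x ... 150 = 1.5x)
var SUPPORTED_PACES = [75, 100, 125, 150];

// Snap an arbitrary speed multiplier (e.g. 1.3) to the nearest supported pace
function toSupportedPace(multiplier) {
  var target = multiplier * 100;
  var best = SUPPORTED_PACES[0];
  for (var i = 1; i < SUPPORTED_PACES.length; i++) {
    if (Math.abs(SUPPORTED_PACES[i] - target) < Math.abs(best - target)) {
      best = SUPPORTED_PACES[i];
    }
  }
  return best;
}
```

For example, `options.voicePace = toSupportedPace(1.3);` sets the pace to 125 (1.25x).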
Define functions for Text To Speech Callbacks
OnTTSCompleteHandler: called once the audio generation is completed. It receives four parameters: the Audio Track Asset, WordInfos, PhonemeInfos, and the Voice Style.
var onTTSCompleteHandler = function(audioTrackAsset, wordInfos, phonemeInfos, voiceStyle) {
...
};
OnTTSErrorHandler: called if there is an error. It receives the error code and its description.
var onTTSErrorHandler = function (error, description) {
print('Error: ' + error + ' Description: ' + description);
};
Generate speech (an AudioTrackAsset) from a given text:
var text = 'show me you love cats, without telling me you love cats!';
script.tts.synthesize(text, options, onTTSCompleteHandler, onTTSErrorHandler);
Text input supports English only. Non-English characters will be stripped.
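Because non-English characters are stripped, you may want to pre-filter user input yourself so the synthesized audio matches what is shown on screen. A minimal sketch; the printable-ASCII regex here is an assumption, not the service's exact stripping rule:

```javascript
// Remove characters outside the printable ASCII range before synthesis.
// NOTE: an approximation — the TTS service applies its own filtering.
function stripNonEnglish(text) {
  return text.replace(/[^\x20-\x7E]/g, '');
}
```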
Play TTS Audio
Once the audio generation successfully completes, the OnTTSCompleteHandler will be called with the TTS Audio Track Asset, which we can then play with the Audio Component.
var onTTSCompleteHandler = function (
audioTrackAsset,
wordInfos,
phonemeInfos,
voiceStyle
) {
print('TTS Success');
script.audio.audioTrack = audioTrackAsset;
script.audio.play(1);
};
Now save the script and reset the Preview panel. We can then see "TTS Success" in the Logger panel, as well as hear the TTS audio playing.
WordInfos
In addition to the TTS Audio Track Asset, we can also get word infos: timing details for how the words are pronounced by the synthesized voice.
var onTTSCompleteHandler = function (
audioTrackAsset,
wordInfos,
phonemeInfos,
voiceStyle
) {
print('TTS Success');
script.audio.audioTrack = audioTrackAsset;
script.audio.play(1);
for (var i = 0; i < wordInfos.length; i++) {
print(
'word: ' +
wordInfos[i].word +
', startTime: ' +
wordInfos[i].startTime.toString() +
', endTime: ' +
wordInfos[i].endTime.toString()
);
}
};
Now save the script, and reset the Preview panel. We can then see the word infos in the Logger panel.
The word field holds the words the synthesized audio was generated for (text may be expanded during synthesis, so there can be a slight variation between the input text and the words returned). The startTime and endTime fields give the time, in milliseconds, at which the word starts and ends in the audio.
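With start and end times in milliseconds, you can derive things like how long each word is voiced. A sketch using mocked data in the shape the callback delivers (the mock values are illustrative, not real TTS output):

```javascript
// Compute the duration (ms) of each word from a wordInfos-shaped array
function wordDurations(wordInfos) {
  return wordInfos.map(function (info) {
    return { word: info.word, duration: info.endTime - info.startTime };
  });
}

// Mocked data in the shape the OnTTSCompleteHandler receives
var mockWordInfos = [
  { word: 'show', startTime: 0, endTime: 250 },
  { word: 'me', startTime: 250, endTime: 400 },
];
```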
PhonemeInfos
In the script, create a reference to an Image Component:
// @input Component.Image image
In the Scene Hierarchy panel, click + and select Screen Image. Attach the Screen Image to the Script.
//@ui {"widget":"group_start", "label":"Animation Textures"}
// @input Asset.Texture neutralTexture
// @input Asset.Texture ahTexture
// @input Asset.Texture dTexture
// @input Asset.Texture eeTexture
// @input Asset.Texture fTexture
// @input Asset.Texture lTexture
// @input Asset.Texture mTexture
// @input Asset.Texture ohTexture
// @input Asset.Texture rTexture
// @input Asset.Texture sTexture
// @input Asset.Texture uhTexture
// @input Asset.Texture wOoTexture
//@ui {"widget":"group_end"}
Here are the 12 mouth shapes as a reference.
Next we attach different mouth shape textures to the texture fields. Currently Phoneme supports 12 different mouth shapes.
Let’s go back to the script to animate the textures based on phoneme info.
//Store textures to a texture array
var textures = [
script.neutralTexture,
script.wOoTexture,
script.wOoTexture,
script.dTexture,
script.eeTexture,
script.fTexture,
script.lTexture,
script.mTexture,
script.ohTexture,
script.rTexture,
script.sTexture,
script.wOoTexture,
script.ahTexture,
script.ahTexture,
script.uhTexture,
script.uhTexture,
];
//Store Phoneme Info
var timeline = [];
//TTS Audio State
var ttsStart = false;
//Current Phoneme Count
var currentPhonemeCount = 0;
//Offset (in ms) before the end of the TTS audio at which to reset the mouth texture; tune to taste
var endTimeOffset = 100;
//Map phoneme info to the texture index
var c2v = {
'!': 'neutral',
'?': 'neutral',
'.': 'neutral',
',': 'neutral',
' ': 'neutral',
'{@B}': 'm',
'{@CH}': 's',
'{@D}': 'd',
'{@DH}': 'd',
'{@DX}': 'oo1',
'{@EL}': 'l',
'{@EM}': 'm',
'{@EN}': 'd',
'{@F}': 'f',
'{@G}': 'd',
'{@HH}': 'e',
'{@H}': 'oo',
'{@JH}': 's',
'{@K}': 'd',
'{@L}': 'l',
'{@M}': 'm',
'{@N}': 'd',
'{@NG}': 'd',
'{@NX}': 'd',
'{@P}': 'm',
'{@Q}': 'd',
'{@R}': 'r',
'{@S}': 's',
'{@SH}': 's',
'{@T}': 'd',
'{@TH}': 'l',
'{@V}': 'f',
'{@W}': 'o',
'{@WH}': 'o',
'{@Y}': 'l',
'{@Z}': 's',
'{@ZH}': 's',
'{@AA0}': 'u1',
'{@AE0}': 'e',
'{@AH0}': 'u1',
'{@AO0}': 'a1',
'{@AW0}': 'o',
'{@AX0}': 'oo1',
'{@AXR0}': 'r',
'{@AY0}': 'e',
'{@EH0}': 'e',
'{@ER0}': 'e',
'{@EY0}': 'e',
'{@IH0}': 'u1',
'{@IX0}': 'e',
'{@IY0}': 'u1',
'{@OW0}': 'o',
'{@OY0}': 'o',
'{@UH0}': 'oo1',
'{@UW0}': 'u1',
'{@UX0}': 'u1',
'{@AA1}': 'u1',
'{@AE1}': 'e',
'{@AH1}': 'u1',
'{@AO1}': 'a1',
'{@AW1}': 'o',
'{@AX1}': 'oo1',
'{@AXR1}': 'r',
'{@AY1}': 'e',
'{@EH1}': 'e',
'{@ER1}': 'e',
'{@EY1}': 'e',
'{@IH1}': 'u1',
'{@IX1}': 'e',
'{@IY1}': 'u1',
'{@OW1}': 'o',
'{@OY1}': 'o',
'{@UH1}': 'oo1',
'{@UW1}': 'u1',
'{@UX1}': 'u1',
'{@AA2}': 'u1',
'{@AE2}': 'e',
'{@AH2}': 'u1',
'{@AO2}': 'a1',
'{@AW2}': 'o',
'{@AX2}': 'oo1',
'{@AXR2}': 'r',
'{@AY2}': 'e',
'{@EH2}': 'e',
'{@ER2}': 'e',
'{@EY2}': 'e',
'{@IH2}': 'u1',
'{@IX2}': 'e',
'{@IY2}': 'u1',
'{@OW2}': 'o',
'{@OY2}': 'o',
'{@UH2}': 'oo1',
'{@UW2}': 'u1',
'{@UX2}': 'u1',
};
var v2i = {
neutral: 0,
oo1: 1,
oo2: 2,
d: 3,
e: 4,
f: 5,
l: 6,
m: 7,
o: 8,
r: 9,
s: 10,
oo: 11,
a1: 12,
a2: 13,
u1: 14,
u2: 15,
};
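The two maps are chained: c2v turns a phoneme string into a mouth-shape name, and v2i turns that name into a texture index. The lookup can be isolated into a small resolver that falls back to the neutral texture for unknown phonemes; this sketch uses an excerpt of the maps above for illustration:

```javascript
// Excerpts of the full c2v / v2i maps shown above
var c2v = { '!': 'neutral', '{@M}': 'm', '{@S}': 's', '{@AA1}': 'u1' };
var v2i = { neutral: 0, m: 7, s: 10, u1: 14 };

// Resolve a phoneme string to a texture index, defaulting to neutral (0)
function phonemeToTextureIndex(phoneme) {
  var shape = c2v[phoneme];
  var index = shape !== undefined ? v2i[shape] : undefined;
  return index !== undefined ? index : 0;
}
```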
var onTTSCompleteHandler = function (
audioTrackAsset,
wordInfos,
phonemeInfos,
voiceStyle
) {
print('TTS Success');
script.audio.audioTrack = audioTrackAsset;
script.audio.play(1);
for (var i = 0; i < wordInfos.length; i++) {
print(
'word: ' +
wordInfos[i].word +
', startTime: ' +
wordInfos[i].startTime.toString() +
', endTime: ' +
wordInfos[i].endTime.toString()
);
}
//Store Phoneme Infos to Timeline
timeline = [];
for (var i = 0; i < phonemeInfos.length; i++) {
timeline[i] = {
char: phonemeInfos[i].phoneme,
startTime: phonemeInfos[i].startTime,
endTime: phonemeInfos[i].endTime,
};
}
//Set TTS Audio Start to Play State to be true
ttsStart = true;
//Reset Current Phoneme Count
currentPhonemeCount = 0;
};
// Update Event triggered every frame.
script.createEvent('UpdateEvent').bind(function (eventData) {
if (!ttsStart) {
return;
}
// Current TTS Audio playback time in milliseconds
var currentTime = script.audio.position * 1000;
// If TTS Audio starts to play
if (currentTime > 0) {
// Loop through the timeline array, if the current playback time is between phoneme start and end time, set the texture to map the current phoneme, then move to the next phoneme
for (var i = currentPhonemeCount; i < timeline.length; i++) {
if (
currentTime >= timeline[i].startTime &&
currentTime <= timeline[i].endTime
) {
print('-->' + timeline[i].char);
var currentChar = timeline[i].char;
// Map the phoneme to a mouth-shape name, then to a texture index;
// fall back to the neutral texture (index 0) for unknown phonemes
var texture = v2i[c2v[currentChar]];
if (texture === undefined) {
texture = 0;
}
script.image.mainPass.baseTex = textures[texture];
currentPhonemeCount++;
break;
}
}
}
// If it is almost at the end of the TTS audio, reset the texture to the neutral texture and reset the TTS audio state
if (currentTime >= timeline[timeline.length - 1].endTime - endTimeOffset) {
script.image.mainPass.baseTex = textures[v2i['neutral']];
ttsStart = false;
}
});
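The heart of the update loop is the timeline lookup: given the current playback time in milliseconds and a cursor into the timeline, find the phoneme whose interval contains that time. It can be isolated and exercised on its own with mocked data (the mock values are illustrative):

```javascript
// Find the index of the phoneme whose [startTime, endTime] interval
// contains currentTime, scanning forward from the cursor; -1 if none.
function findActivePhoneme(timeline, currentTime, cursor) {
  for (var i = cursor; i < timeline.length; i++) {
    if (
      currentTime >= timeline[i].startTime &&
      currentTime <= timeline[i].endTime
    ) {
      return i;
    }
  }
  return -1;
}

// Mocked timeline in the shape built inside onTTSCompleteHandler
var mockTimeline = [
  { char: '{@SH}', startTime: 0, endTime: 120 },
  { char: '{@OW1}', startTime: 120, endTime: 300 },
];
```

Scanning forward from a cursor (rather than from index 0) mirrors the update loop above: each frame picks up where the last one left off, so the whole timeline is walked once per playback.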
Now save the script and reset the Preview panel. We can then see the phoneme infos in the Logger panel, as well as the animated mouth.
Previewing Your Lens
You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.
Don’t forget to turn on the sound on your device!