Speech Recognition
In this guide, we will go over some of the bigger concepts that underlie how Speech Recognition works in Lens Studio and Snapchat. This guide is script heavy, but don't worry: you can also take a look at the Speech Recognition Template, Voice UI Template, and Sentiment Analyzer Template.
VoiceML
VoiceML allows you to incorporate transcription, keyword detection, and voice command detection based on basic natural language understanding into your Lenses. These capabilities can serve as a trigger for a visual effect or any other behavior.
- Transcription: transcribes speech and returns a transcript. This can be done in real time (Pre-Capture) or during recording.
- Keyword Detection: allows you to define a list of keywords and detects them based on the Keyword Classifier.
- Voice Navigation Command Detection: detects a predefined list of voice navigation commands based on basic natural language processing.
- Sentiment Analyzer: analyzes sentiment based on basic natural language processing.
- Transcription and keyword detection are available for English, Spanish, French, and German. Voice Navigation Command Detection is only available for English. Transcription limitations include, for example, new names for things, slang words, or acute accents.
- Do not play sound or speech from the Lens while activating the microphone to capture sound.
- Try to avoid background noise and far device distance while activating the microphone to capture sound.
- If the microphone is muted for more than two minutes, the transcription won't continue after unmuting, and you'll need to reset the Preview panel to enable it.
- If you were previously logged in to MyLenses and are having trouble seeing the preview in Lens Studio, log out from MyLenses and log in again. Here is how to logout and login to MyLenses.
Overview
This guide will walk through the usage of the VoiceML module. You can find the complete script that utilizes transcription, keyword detection, voice navigation command detection as well as sentiment analysis at the bottom of this page.
VoiceML Module
The main asset used for VoiceML is the VoiceML Module. To add one, in the Asset Browser panel select + -> VoiceML Module. With the VoiceML Module we can configure all the settings for VoiceML.
At the bottom of the Preview panel, click the microphone button. Test with your voice and watch the blue vertical volume meter to make sure you are not muted.
Connect the Script
In the Asset Browser panel, select + -> Script. Then create an object in the scene. With the object selected, add a Script Component in the Inspector panel: select + Add Component -> Script. Click the + Add Script field and select the script resource we just created.
In the script, to create a reference to VoiceMLModule:
//@input Asset.VoiceMLModule vmlModule {"label": "Voice ML Module"}
Options
We use ListeningOptions to configure the VoiceML Module. To create the options:
var options = VoiceML.ListeningOptions.create();
Define functions for Voice Callbacks
onListeningEnabledHandler: will get called when the microphone is enabled. We recommend calling the startListening method here; calling startListening before the microphone is enabled will result in an error.
When previewing your project in Lens Studio, this is called as you press the microphone icon described above. On a phone, tap the screen to trigger this function.
var onListeningEnabledHandler = function () {
script.vmlModule.startListening(options);
};
onListeningDisabledHandler: will get called when the microphone is disabled.
var onListeningDisabledHandler = function () {
script.vmlModule.stopListening();
};
onListeningErrorHandler: will get called when there is an error in the transcription.
var onListeningErrorHandler = function (eventErrorArgs) {
print(
'Error: ' + eventErrorArgs.error + ' desc: ' + eventErrorArgs.description
);
};
onUpdateListeningEventHandler: will get called when transcription is successful.
var onUpdateListeningEventHandler = function(eventArgs) {
...
};
Add callbacks to the VoiceML Module:
//VoiceML Callbacks
script.vmlModule.onListeningUpdate.add(onUpdateListeningEventHandler);
script.vmlModule.onListeningError.add(onListeningErrorHandler);
script.vmlModule.onListeningEnabled.add(onListeningEnabledHandler);
script.vmlModule.onListeningDisabled.add(onListeningDisabledHandler);
In the Preview panel, the microphone button is not simulated on the screen. When we reset the preview, On Listening Enabled will be triggered automatically once VoiceML is initialized. To learn how to use the microphone button, preview the Lens in Snapchat; please check the details in Previewing Your Lens.
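While testing, it can be handy to surface the listening state on screen. Here is a minimal sketch, not part of the original setup: it assumes a Screen Text object is wired to a hypothetical statusText input, and it simply extends the two handlers defined above.
//Optional debugging aid (assumption: a Screen Text object is assigned to this input)
//@input Component.Text statusText
var onListeningEnabledHandler = function () {
  script.statusText.text = 'Listening...';
  script.vmlModule.startListening(options);
};
var onListeningDisabledHandler = function () {
  script.statusText.text = 'Microphone off';
  script.vmlModule.stopListening();
};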
Transcription
With VoiceML, we can recognize speech and transcribe it to text. Right after we create the options, we need to define the basic settings for transcription.
Basic Settings for Transcription
SpeechRecognizer:
options.speechRecognizer = VoiceMLModule.SpeechRecognizer.Default;
Language settings for English, Spanish, French and German:
options.languageCode = 'en_US';
options.languageCode = 'es_MX';
options.languageCode = 'fr_FR';
options.languageCode = 'de_DE';
To enable transcription:
//General Option for Transcription
options.shouldReturnAsrTranscription = true;
Here we can also get a live but less accurate transcription before the final transcription arrives. To enable this setting:
options.shouldReturnInterimAsrTranscription = true;
We can then get the transcription result in the onUpdateListeningEventHandler:
var onUpdateListeningEventHandler = function (eventArgs) {
if (eventArgs.transcription.trim() == '') {
return;
}
print('Transcription: ' + eventArgs.transcription);
if (!eventArgs.isFinalTranscription) {
return;
}
print('Final Transcription: ' + eventArgs.transcription);
};
Reset Preview. Speak with the Microphone button enabled in the Preview panel and see the result in the Logger.
VoiceML can transcribe Standard English words. Its limitations include, for example, new names for things or slang words. Here is the current language model dictionary.
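If you want to show the transcription in the Lens rather than only printing it, here is a minimal sketch: the transcriptionText input is an assumption (a Screen Text object you add and wire up yourself), not something created by this guide.
//Sketch: display interim and final transcriptions in a Screen Text (assumed input)
//@input Component.Text transcriptionText
var onUpdateListeningEventHandler = function (eventArgs) {
  if (eventArgs.transcription.trim() == '') {
    return;
  }
  //Interim results update the text live while the user speaks;
  //the final result simply overwrites them when it arrives.
  script.transcriptionText.text = eventArgs.transcription;
  if (eventArgs.isFinalTranscription) {
    print('Final Transcription: ' + eventArgs.transcription);
  }
};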
Speech Context
We can also add speech contexts to the transcription and boost some words for specific transcription scenarios. Use this when transcribing words which are rarer and aren't picked up well enough by VoiceML: the higher the boost value, the more likely the word is to appear in the transcription.
To add speech context:
//Speech Context
var phrasesOne = ["yellow","blue", ...];
var boostValueOne = 5;
var phrasesTwo = ["black", ...];
var boostValueTwo = 5;
options.addSpeechContext(phrasesOne,boostValueOne);
options.addSpeechContext(phrasesTwo,boostValueTwo);
...
The range for the boost value is 1-10; we recommend starting with 5 and adjusting if needed (the higher the value, the more likely the word is to appear in the transcription).
Notice that the phrases should be made of lowercase a-z letters and should be within the vocabulary.
When an OOV (out of vocabulary) phrase is added to the Speech Context, the onListeningErrorHandler will be called. Here we take the random phrase az zj as an example.
//Speech Context
var phrasesOne = ["az zj", "maize", ...];
var boostValueOne = 5;
options.addSpeechContext(phrasesOne,boostValueOne);
Reset Preview. Speak with the Microphone button enabled in the Preview panel. We can then see the error message in the Logger.
Keyword Detection
Now that you have learned how to use basic transcription, let's move on to Keyword Detection.
Keyword
For each keyword, we need to define a list of aliases. If these aliases are in the transcription results, the keyword detection will be triggered.
To define keywords:
//Keyword
var keywordOne = "Yellow";
var categoryAliasesOne = ["orange", "yellow", "maize", "light yellow", ...];
var keywordTwo = "Black";
var categoryAliasesTwo = ["black",...];
...
Here let's take the keyword Yellow as an example. We define the keyword Yellow, then define a list of aliases for it: orange, yellow, maize, and light yellow. If any of these aliases appear in the transcription results, the keyword Yellow will be detected. Aliases give us the ability to expand the subset of words that should return Yellow if needed to serve a specific Lens experience.
When using keyword detection, the Snap engine will try to mitigate small transcription errors such as plurals instead of singulars or similar sounding words (ship/sheep, etc.). Rather than relying on this, use multiple aliases to cover the different ways users might say the same thing, like cinema, movies, and film.
Notice that an alias can contain more than one word, e.g. a short phrase.
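For instance, a hypothetical Greeting keyword (not part of this guide) could mix single words and short phrases as aliases:
//Hypothetical example: aliases can be short phrases
var keywordGreeting = 'Greeting';
var categoryAliasesGreeting = ['hello', 'hey there', 'good morning', 'good evening'];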
NLP Keyword Model
After we define all the keywords as needed, we need to define the NLP (Natural Language Processing) Keyword Model and add the keyword groups to it:
//NLPKeywordModel
var nlpKeywordModel = VoiceML.NlpKeywordModelOptions.create();
nlpKeywordModel.addKeywordGroup(keywordOne,categoryAliasesOne);
nlpKeywordModel.addKeywordGroup(keywordTwo,categoryAliasesTwo);
...
Finally, add the NLP Keyword Model to the options:
options.nlpModels =[nlpKeywordModel, ...];
To parse the Keyword Responses, we can use a parseKeywordResponses function:
var parseKeywordResponses = function (keywordResponses) {
var keywords = [];
var code = '';
for (var kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
var keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
var keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
keywords.push(keywordResponse.keywords[keywordsIterator]);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
keywordResponse.status.code.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};
We will get the keyword results in the onUpdateListeningEventHandler. Add the following code to the onUpdateListeningEventHandler after the transcription section:
//Keyword Results
var keywordResponses = eventArgs.getKeywordResponses();
var keywords = parseKeywordResponses(keywordResponses);
if (keywords.length > 0) {
  var keywordResponseText = '';
  for (var kIterator = 0; kIterator < keywords.length; kIterator++) {
    keywordResponseText += keywords[kIterator] + '\n';
    //code to run
    ...
  }
  print('Keyword: ' + keywordResponseText);
}
Error Code for Keyword Responses
There are a few error codes which the NLP models (either keyword or command detection) might return:
- #SNAP_ERROR_INDECISIVE: if no keyword was detected
- #SNAP_ERROR_NONVERBAL: if we don't think the audio input was really a human talking
- #SNAP_ERROR_SILENCE: if the silence was too long
- Anything starting with #SNAP_ERROR_: errors that are not currently defined in this document and should be ignored
::: tip If more than one keyword was spoken in a single utterance, we will return a list of all keywords. :::
In the example below, we will add a new function to the script to get error messages.
var getErrorMessage = function (response) {
var errorMessage = '';
switch (response) {
case '#SNAP_ERROR_INDECISIVE':
errorMessage = 'indecisive';
break;
case '#SNAP_ERROR_NONVERBAL':
errorMessage = 'non verbal';
break;
case '#SNAP_ERROR_SILENCE':
errorMessage = 'too long silence';
break;
default:
if (response.includes('#SNAP_ERROR')) {
errorMessage = 'general error';
} else {
errorMessage = 'unknown error';
}
}
return errorMessage;
};
Then add error checking to the parseKeywordResponses function.
var parseKeywordResponses = function (keywordResponses) {
var keywords = [];
var code = '';
for (var kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
var keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
var keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
var keyword = keywordResponse.keywords[keywordsIterator];
///////New code for Error Checking///////
if (keyword.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(keyword);
print('Keyword Error: ' + errorMessage);
break;
}
/////////////////////////////////////
keywords.push(keyword);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
keywordResponse.status.code.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};
We also recommend ignoring all return values starting with #SNAP_ERROR, as Snap is likely to add additional error codes in the future.
Reset Preview. Try to say the keywords with the Microphone button enabled in the Preview panel. We can then see the keyword results in the Logger.
If the keyword result includes #SNAP_ERROR, try to repeat the word and add the word to the Speech Context for better results.
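As a simple illustration of reacting to a detected keyword, here is a minimal sketch; the yellowEffect input and the handleKeywords helper are assumptions for this example, not part of the VoiceML API.
//Hypothetical example: enable a SceneObject when the Yellow keyword is detected
//@input SceneObject yellowEffect
var handleKeywords = function (keywords) {
  for (var kIterator = 0; kIterator < keywords.length; kIterator++) {
    if (keywords[kIterator] === 'Yellow') {
      script.yellowEffect.enabled = true;
    }
  }
};
Call handleKeywords(keywords) inside onUpdateListeningEventHandler, right where the //code to run comment is marked above.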
Check out how to trigger visual effects when keywords are detected in Speech Recognition Template!
Voice Command Detection
Voice Command Detection is based on basic natural language processing on top of transcription. We use the NLP Intent Model for voice command detection.
Currently only English is supported.
Voice Navigation Command
The VOICE_ENABLED_UI NLP Intent Model supports a list of voice commands: "next", "back", "left", "right", "up", "down", "different", "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth".
A Voice Navigation Command is different from a Keyword: we don't have to say the exact word to trigger the voice command. Take back as an example. We can say "go back", "go to the previous one", or "the one before", etc.
To define the VOICE_ENABLED_UI NLP Intent Model:
//Command
var navigationNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('VOICE_ENABLED_UI');
navigationNlpIntentModel.possibleIntents = [
'next',
'back',
'left',
'right',
'up',
'down',
'different',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth',
'seventh',
'eighth',
'ninth',
'tenth',
];
Here we can also send a shorter list of intents to enable a smaller command set, for example if we only need next and back.
navigationNlpIntentModel.possibleIntents = ['next', 'back'];
To add NLP Intent Model to options:
options.nlpModels = [navigationNlpIntentModel];
We can use NLPKeywordModel and NLPIntentModel at the same time as needed.
options.nlpModels = [nlpKeywordModel, navigationNlpIntentModel];
Check out how to trigger visual effects when navigation voice commands are detected in Voice UI Template!
Emotion Classifier
The EMOTION_CLASSIFIER NLP Intent Model supports 6 categories: "anger", "disgust", "fear", "joy", "sadness", "surprise".
To define the EMOTION_CLASSIFIER NLP Intent Model:
var emotionNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('EMOTION_CLASSIFIER');
To add "EMOTION_CLASSIFIER" NLP Intent Model to options:
options.nlpModels = [emotionNlpIntentModel];
We can use multiple NLPIntentModels at the same time as needed.
options.nlpModels = [navigationNlpIntentModel, emotionNlpIntentModel];
Yes No Classifier
The YES_NO_CLASSIFIER NLP Intent Model supports two categories: "positive_intent" and "negative_intent".
To define the YES_NO_CLASSIFIER NLP Intent Model:
var yesNoNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('YES_NO_CLASSIFIER');
To add the YES_NO_CLASSIFIER NLP Intent Model to the options:
options.nlpModels = [
navigationNlpIntentModel,
emotionNlpIntentModel,
yesNoNlpIntentModel,
];
Check out how to trigger visual effects when emotions or positive and negative intents are detected in the Sentiment Analyzer Template!
To parse the Command Responses, we can use a parseCommandResponses function:
var parseCommandResponses = function (commandResponses) {
var commands = [];
var code = '';
for (var iIterator = 0; iIterator < commandResponses.length; iIterator++) {
var commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
var command = commandResponse.intent;
commands.push(command);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
commandResponse.status.code.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};
We will get the command results in the onUpdateListeningEventHandler. Add the following code to the onUpdateListeningEventHandler:
//Command Results
var commandResponses = eventArgs.getIntentResponses();
var commands = parseCommandResponses(commandResponses);
if (commands.length > 0) {
  var commandResponseText = '';
  for (var iIterator = 0; iIterator < commands.length; iIterator++) {
    commandResponseText += commands[iIterator] + '\n';
    //code to run
    ...
  }
  print('Commands: ' + commandResponseText);
}
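To act on a detected command, you can branch on the parsed values. Here is a minimal sketch; goToNextItem and goToPreviousItem are placeholder functions standing in for your own Lens logic, not part of the VoiceML API.
//Hypothetical example: branch on parsed navigation commands
var goToNextItem = function () {
  print('Navigate to the next item');
};
var goToPreviousItem = function () {
  print('Navigate to the previous item');
};
var handleCommands = function (commands) {
  for (var iIterator = 0; iIterator < commands.length; iIterator++) {
    switch (commands[iIterator]) {
      case 'next':
        goToNextItem();
        break;
      case 'back':
        goToPreviousItem();
        break;
    }
  }
};
Call handleCommands(commands) where the //code to run comment is marked above.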
Error Code for Command Responses
There are a few error codes which the NLP models (either keyword or command detection) might return:
- #SNAP_ERROR_INDECISIVE: if no command was detected
- #SNAP_ERROR_NONVERBAL: if we don't think the audio input was really a human talking
- #SNAP_ERROR_SILENCE: if the silence was too long
- Anything starting with #SNAP_ERROR_: errors that are not currently defined in this document and should be ignored
We can only detect one voice command per NLPIntentModel at a time.
We will reuse the getErrorMessage function here and add error checking to the parseCommandResponses function.
var parseCommandResponses = function (commandResponses) {
var commands = [];
var code = '';
for (var iIterator = 0; iIterator < commandResponses.length; iIterator++) {
var commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
var command = commandResponse.intent;
///////New code for Error Checking///////
if (command.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(command);
print('Command Error: ' + errorMessage);
break;
}
/////////////////////////////////////
commands.push(command);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
commandResponse.status.code.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};
We also recommend ignoring all return values starting with #SNAP_ERROR, as Snap is likely to add additional error codes in the future.
Reset Preview. Try to say any navigation voice command with the Microphone button enabled in the Preview panel. We can then see the voice command results in the Logger.
If the command result includes #SNAP_ERROR, try to repeat the command or add the command to the Speech Context for better results.
Here is the full script.
//@input Asset.VoiceMLModule vmlModule {"label": "Voice ML Module"}
var options = VoiceML.ListeningOptions.create();
options.speechRecognizer = VoiceMLModule.SpeechRecognizer.Default;
//Language Option
// Language Code: "en_US", "es_MX", "fr_FR", "de_DE"
options.languageCode = 'en_US';
//General Option
options.shouldReturnAsrTranscription = true;
options.shouldReturnInterimAsrTranscription = true;
//Speech Context
var phrasesOne = ['yellow'];
var boostValueOne = 5;
options.addSpeechContext(phrasesOne, boostValueOne);
//Keyword
var keywordOne = 'yellow';
var categoryAliasesOne = ['yellow', 'maize', 'orange'];
var keywordTwo = 'black';
var categoryAliasesTwo = ['black'];
//NLPKeywordModel
var nlpKeywordModel = VoiceML.NlpKeywordModelOptions.create();
nlpKeywordModel.addKeywordGroup(keywordOne, categoryAliasesOne);
nlpKeywordModel.addKeywordGroup(keywordTwo, categoryAliasesTwo);
//Command
var navigationNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('VOICE_ENABLED_UI');
navigationNlpIntentModel.possibleIntents = [
  'next',
  'back',
  'left',
  'right',
  'up',
  'down',
  'different',
  'first',
  'second',
  'third',
  'fourth',
  'fifth',
  'sixth',
  'seventh',
  'eighth',
  'ninth',
  'tenth',
];
var emotionNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('EMOTION_CLASSIFIER');
var yesNoNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('YES_NO_CLASSIFIER');
options.nlpModels = [
nlpKeywordModel,
navigationNlpIntentModel,
emotionNlpIntentModel,
yesNoNlpIntentModel,
];
var onListeningEnabledHandler = function () {
script.vmlModule.startListening(options);
};
var onListeningDisabledHandler = function () {
script.vmlModule.stopListening();
};
var getErrorMessage = function (response) {
var errorMessage = '';
switch (response) {
case '#SNAP_ERROR_INDECISIVE':
errorMessage = 'indecisive';
break;
case '#SNAP_ERROR_NONVERBAL':
errorMessage = 'non verbal';
break;
case '#SNAP_ERROR_SILENCE':
errorMessage = 'too long silence';
break;
default:
if (response.includes('#SNAP_ERROR')) {
errorMessage = 'general error';
} else {
errorMessage = 'unknown error';
}
}
return errorMessage;
};
var parseKeywordResponses = function (keywordResponses) {
var keywords = [];
var code = '';
for (var kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
var keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
var keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
var keyword = keywordResponse.keywords[keywordsIterator];
if (keyword.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(keyword);
print('Keyword Error: ' + errorMessage);
break;
}
keywords.push(keyword);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
keywordResponse.status.code.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};
var parseCommandResponses = function (commandResponses) {
var commands = [];
var code = '';
for (var iIterator = 0; iIterator < commandResponses.length; iIterator++) {
var commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
var command = commandResponse.intent;
if (command.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(command);
print('Command Error: ' + errorMessage);
break;
}
commands.push(commandResponse.intent);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
commandResponse.status.code.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};
var onUpdateListeningEventHandler = function (eventArgs) {
if (eventArgs.transcription.trim() == '') {
return;
}
print('Transcription: ' + eventArgs.transcription);
if (!eventArgs.isFinalTranscription) {
return;
}
print('Final Transcription: ' + eventArgs.transcription);
//Keyword Results
var keywordResponses = eventArgs.getKeywordResponses();
var keywords = parseKeywordResponses(keywordResponses);
if (keywords.length > 0) {
var keywordResponseText = '';
for (var kIterator = 0; kIterator < keywords.length; kIterator++) {
keywordResponseText += keywords[kIterator] + '\n';
}
print('Keywords:' + keywordResponseText);
}
//Command Results
var commandResponses = eventArgs.getIntentResponses();
var commands = parseCommandResponses(commandResponses);
if (commands.length > 0) {
var commandResponseText = '';
for (var iIterator = 0; iIterator < commands.length; iIterator++) {
commandResponseText += commands[iIterator] + '\n';
}
print('Commands: ' + commandResponseText);
}
};
var onListeningErrorHandler = function (eventErrorArgs) {
print(
'Error: ' + eventErrorArgs.error + ' desc: ' + eventErrorArgs.description
);
};
//VoiceML Callbacks
script.vmlModule.onListeningUpdate.add(onUpdateListeningEventHandler);
script.vmlModule.onListeningError.add(onListeningErrorHandler);
script.vmlModule.onListeningEnabled.add(onListeningEnabledHandler);
script.vmlModule.onListeningDisabled.add(onListeningDisabledHandler);
Previewing Your Lens
You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.
Once the Lens is pushed to Snapchat, you will see the hint: TAP TO TURN ON VOICE CONTROL. Tap the screen to start VoiceML and OnListeningEnabled will be triggered. Tap the screen again to stop VoiceML and OnListeningDisabled will be triggered.