Version: 5.x

Speech Recognition

In this guide, we will go over some of the bigger concepts that underlie how Speech Recognition works in Lens Studio and Snapchat. This guide is script heavy, but don't worry: you can also take a look at the Speech Recognition Template, Voice UI Template, and Sentiment Analyzer Template.

VoiceML

VoiceML allows you to incorporate transcription, keyword detection, and voice command detection based on basic natural language understanding into your Lenses. These capabilities can serve as a trigger for a visual effect or any other behavior.

  • Transcription: transcribes speech and returns a transcript. This can be done in real time (Pre-Capture) or during recording.
  • Keyword Detection: lets you define a list of keywords and detects them based on the Keyword Classifier.
  • Voice Navigation Command Detection: detects a predefined list of voice navigation commands based on basic natural language processing.
  • Sentiment Analyzer: analyzes sentiment based on basic natural language processing.
  • Transcription and keyword detection are available for English, Spanish, French, and German. Voice Navigation Command Detection is only available for English. Transcription limitations include, for example, new names for things, slang words, or acute accents.
  • Do not play sound or speech from the Lens while the microphone is activated to capture sound.
  • Try to avoid background noise and keep the device close while the microphone is activated to capture sound.
  • If the microphone is muted for more than two minutes, the transcription won't continue after unmuting, and you'll need to reset the Preview panel to enable it again.
  • If you were previously logged in to MyLenses and are having trouble seeing the preview in Lens Studio, log out of MyLenses and log in again.

Here is how to log out of and log in to MyLenses.

Overview

This guide will walk through the usage of the VoiceML module. You can find the complete script that utilizes transcription, keyword detection, voice navigation command detection as well as sentiment analysis at the bottom of this page.

VoiceML Module

The main asset used for VoiceML is the VoiceML Module. To add it, in the Asset Browser panel, select + -> VoiceML Module. With the VoiceML Module we can configure all the settings for VoiceML.

At the bottom of the Preview panel, click the microphone button. Test with your voice and watch the blue vertical volume meter to make sure you are not muted.

Connect the Script

In the Asset Browser panel, select + -> Script. Then create an object in the scene. With the object selected, add a Script Component in the Inspector panel: select + Add Component -> Script. Click the + Add Script field and select the script resource we just created.

In the script, to create a reference to VoiceMLModule:

//@input Asset.VoiceMLModule vmlModule {"label": "Voice ML Module"}

Options

We use listening options to configure the VoiceML Module. To create the options:

var options = VoiceML.ListeningOptions.create();

Define functions for Voice Callbacks

onListeningEnabledHandler: will be called when the microphone is enabled. We recommend calling the startListening method here; calling startListening before the microphone is enabled will result in an error.

When previewing your project in Lens Studio, this will be called when you press the microphone icon described above. On a phone, tap the screen to trigger this function.

var onListeningEnabledHandler = function () {
script.vmlModule.startListening(options);
};

onListeningDisabledHandler: will be called when the microphone is disabled.

var onListeningDisabledHandler = function () {
script.vmlModule.stopListening();
};

onListeningErrorHandler: will be called when there is an error in the transcription.

var onListeningErrorHandler = function (eventErrorArgs) {
print(
'Error: ' + eventErrorArgs.error + ' desc: ' + eventErrorArgs.description
);
};

onUpdateListeningEventHandler: will be called when transcription is successful.

var onUpdateListeningEventHandler = function(eventArgs) {
...
};

Add callbacks to VoiceML Module:

//VoiceML Callbacks
script.vmlModule.onListeningUpdate.add(onUpdateListeningEventHandler);
script.vmlModule.onListeningError.add(onListeningErrorHandler);
script.vmlModule.onListeningEnabled.add(onListeningEnabledHandler);
script.vmlModule.onListeningDisabled.add(onListeningDisabledHandler);

In the Preview panel, the on-screen microphone button is not simulated. When we reset the preview, On Listening Enabled will be triggered automatically once VoiceML is initialized. To learn how to use the microphone button, preview the Lens in Snapchat; see the details in Previewing Your Lens.

Transcription

With VoiceML, we can recognize speech and transcribe it to text. Right after we create the options, we need to define the basic settings for transcription.

Basic Settings for Transcription

SpeechRecognizer:

options.speechRecognizer = VoiceMLModule.SpeechRecognizer.Default;

Language settings for English, Spanish, French, and German (set the one you need):

options.languageCode = 'en_US';
options.languageCode = 'es_MX';
options.languageCode = 'fr_FR';
options.languageCode = 'de_DE';
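
If you would rather pick the language from the Inspector than edit the script, one option is to expose the language code as a script input. The snippet below is a minimal sketch of that idea; the input name and labels are illustrative, not part of the VoiceML API:

// Hypothetical script input for choosing the transcription language in the Inspector.
//@input string transcriptionLanguage = "en_US" {"widget":"combobox", "values":[{"label":"English", "value":"en_US"}, {"label":"Spanish", "value":"es_MX"}, {"label":"French", "value":"fr_FR"}, {"label":"German", "value":"de_DE"}]}

options.languageCode = script.transcriptionLanguage;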

To enable transcription:

//General Option for Transcription
options.shouldReturnAsrTranscription = true;

Here we can also get a live, but less accurate, transcription before we get the final transcription. To enable this setting:

options.shouldReturnInterimAsrTranscription = true;

We can then get the transcription result in the onUpdateListeningEventHandler:

var onUpdateListeningEventHandler = function (eventArgs) {
if (eventArgs.transcription.trim() == '') {
return;
}
print('Transcription: ' + eventArgs.transcription);

if (!eventArgs.isFinalTranscription) {
return;
}
print('Final Transcription: ' + eventArgs.transcription);
};

Reset Preview. Speak with the Microphone button enabled in the Preview panel and see the result in the Logger.
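
To show the transcription inside the Lens instead of only printing it, a common approach is to write it to a Text component. This is a minimal sketch, assuming a Text component input named transcriptionText that you would add yourself:

// Hypothetical Text component input used to display the transcription on screen.
//@input Component.Text transcriptionText

var onUpdateListeningEventHandler = function (eventArgs) {
  if (eventArgs.transcription.trim() == '') {
    return;
  }
  // Show the live (interim) transcription while the user is still speaking.
  script.transcriptionText.text = eventArgs.transcription;
  if (!eventArgs.isFinalTranscription) {
    return;
  }
  // The final transcription stays on screen until the next utterance.
  script.transcriptionText.text = eventArgs.transcription;
};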

VoiceML can transcribe Standard English words. Its limitations include, for example, new names for things or slang words. Here is the current language model dictionary.

Speech Context

We can also add speech contexts to the transcription to boost some words for specific transcription scenarios. Use this when transcribing rarer words that aren't picked up well enough by VoiceML; the higher the boost value, the more likely the word is to appear in the transcription.

To add speech context:

//Speech Context
var phrasesOne = ["yellow","blue", ...];
var boostValueOne = 5;
var phrasesTwo = ["black", ...];
var boostValueTwo = 5;
options.addSpeechContext(phrasesOne,boostValueOne);
options.addSpeechContext(phrasesTwo,boostValueTwo);
...

The boost value ranges from 1 to 10. We recommend starting with 5 and adjusting if needed (the higher the value, the more likely the word will appear in the transcription).

Notice that the phrases should be made of lowercase a-z letters and should be within the vocabulary.

When an OOV (out of vocabulary) phrase is added to the Speech Context, the onListeningErrorHandler will be called. Here we take the random string az zj as an example.

//Speech Context
var phrasesOne = ["az zj", "maize", ...];
var boostValueOne = 5;
options.addSpeechContext(phrasesOne,boostValueOne);

Reset Preview. Speak with the Microphone button enabled in the Preview panel. We can then see the error message in the Logger.

Keyword Detection

Now that you have learned how to use basic transcription, let's move on to Keyword Detection.

Keyword

For each keyword, we need to define a list of aliases. If these aliases are in the transcription results, the keyword detection will be triggered.

To define keywords:

//Keyword
var keywordOne = "Yellow";
var categoryAliasesOne = ["orange","yellow","maize","light yellow"...];
var keywordTwo = "Black";
var categoryAliasesTwo = ["black",...];
...

Here let’s take the keyword Yellow as an example. We can define the keyword, Yellow. Then we can define a list of aliases for Yellow. Here we have orange, yellow, maize and light yellow. If these aliases are in the transcription results, the keyword Yellow will be detected. Aliases give us the ability to expand the subset of words that should return Yellow if needed to serve a specific Lens experience.

When using keyword detection, the Snap engine will try to mitigate small transcription errors such as plurals instead of singulars or similar-sounding words (ship/sheep, etc.). Instead of adding aliases for those, use multiple aliases to cover how different users might say the same thing in different ways, like cinema, movies, and film.

Notice that aliases can consist of more than one word, i.e. short phrases.
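
For instance, a keyword meant to catch users talking about movies might look like the sketch below; the keyword and aliases are illustrative only. The group would then be added to the NLP Keyword Model described in the next section.

// Illustrative keyword group: the keyword "Movie" is returned whenever any of
// these aliases, including multi-word phrases, appear in the transcription.
var keywordMovie = "Movie";
var categoryAliasesMovie = ["movie", "movies", "film", "cinema", "watch a movie"];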

NLP Keyword Model

After we define all the keywords as needed, we need to create the NLP (Natural Language Processing) Keyword Model and add the keyword groups to it:

//NLPKeywordModel
var nlpKeywordModel = VoiceML.NlpKeywordModelOptions.create();
nlpKeywordModel.addKeywordGroup(keywordOne,categoryAliasesOne);
nlpKeywordModel.addKeywordGroup(keywordTwo,categoryAliasesTwo);
...

Finally, add NLP Keyword Model to options:

options.nlpModels =[nlpKeywordModel, ...];

To parse the Keyword Responses, we can use the parseKeywordResponses function:

var parseKeywordResponses = function (keywordResponses) {
var keywords = [];
var code = '';
for (var kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
var keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
var keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
keywords.push(keywordResponse.keywords[keywordsIterator]);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
keywordResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};

We will get the keyword results in the onUpdateListeningEventHandler. Add the following code to the onUpdateListeningEventHandler after the transcription section.

//Keyword Results
var keywordResponses = eventArgs.getKeywordResponses();
var keywords = parseKeywordResponses(keywordResponses);
if (keywords.length > 0) {
var keywordResponseText = '';
for (var kIterator = 0; kIterator < keywords.length; kIterator++) {
keywordResponseText += keywords[kIterator] + '\n';
//code to run
...
}
print('Keyword: ' + keywordResponseText);
}
};

Error Code for Keyword Responses

There are a few error codes which the NLP models (either keyword or command detection) might return:

  • #SNAP_ERROR_INDECISIVE: if no keyword was detected
  • #SNAP_ERROR_NONVERBAL: if we don't think the audio input was really a human talking
  • #SNAP_ERROR_SILENCE: if the silence was too long
  • Anything starting with #SNAP_ERROR_: errors that are not currently defined in this document; these should be ignored

Tip: If more than one keyword was spoken in a single utterance, we will return a list of all keywords.

In the example below, we will add a new function to the script to get error messages.

var getErrorMessage = function (response) {
var errorMessage = '';
switch (response) {
case '#SNAP_ERROR_INDECISIVE':
errorMessage = 'indecisive';
break;
case '#SNAP_ERROR_NONVERBAL':
errorMessage = 'non verbal';
break;
case '#SNAP_ERROR_SILENCE':
errorMessage = 'too long silence';
break;
default:
if (response.includes('#SNAP_ERROR')) {
errorMessage = 'general error';
} else {
errorMessage = 'unknown error';
}
}
return errorMessage;
};

Then add error checking to the parseKeywordResponses function.

var parseKeywordResponses = function (keywordResponses) {
var keywords = [];
var code = '';
for (var kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
var keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
var keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
var keyword = keywordResponse.keywords[keywordsIterator];
///////New code for Error Checking///////
if (keyword.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(keyword);
print('Keyword Error: ' + errorMessage);
break;
}
/////////////////////////////////////
keywords.push(keyword);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
keywordResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};

We also recommend ignoring all return values starting with #SNAP_ERROR, as Snap is likely to add additional error codes in the future.

Reset Preview. Try to say the keywords with the Microphone button enabled in the Preview panel. We can then see the keyword results in the Logger.

If the keyword result includes #SNAP_ERROR, try to repeat the word and add the word to the Speech Context for better results.

Check out how to trigger visual effects when keywords are detected in Speech Recognition Template!
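
As a rough sketch of the idea, a detected keyword could toggle objects in the scene. The SceneObject inputs and the handleKeyword helper below are assumptions for illustration; you would call handleKeyword(keywords[kIterator]) in place of the //code to run comment inside onUpdateListeningEventHandler.

// Hypothetical scene objects toggled when their keyword is detected.
//@input SceneObject yellowEffect
//@input SceneObject blackEffect

var handleKeyword = function (keyword) {
  if (keyword == 'Yellow') {
    script.yellowEffect.enabled = true;
    script.blackEffect.enabled = false;
  } else if (keyword == 'Black') {
    script.blackEffect.enabled = true;
    script.yellowEffect.enabled = false;
  }
};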

Voice Command Detection

Voice Command Detection is based on basic natural language processing on top of transcription. We use the NLP Intent Model for voice command detection.

Currently only English is supported.

Voice Navigation Command

The VOICE_ENABLED_UI NLP Intent Model supports a list of voice commands: "next", "back", "left", "right", "up", "down", "different", "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth".

Voice Navigation Command is different from Keyword detection: we don't have to say the exact word to trigger the voice command. Take back as an example. We can say "go back", "go to the previous one", or "the one before", etc.

To define VOICE_ENABLED_UI NLP Intent Model:

//Command
var navigationNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('VOICE_ENABLED_UI');
navigationNlpIntentModel.possibleIntents = [
'next',
'back',
'left',
'right',
'up',
'down',
'different',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth',
'seventh',
'eighth',
'ninth',
'tenth',
];

Here we can also send a shorter list of intents to enable a smaller command set, for example if we only need next and back:

navigationNlpIntentModel.possibleIntents = ['next', 'back'];

To add NLP Intent Model to options:

options.nlpModels = [navigationNlpIntentModel];

We can use NLPKeywordModel and NLPIntentModel at the same time as needed.

options.nlpModels = [nlpKeywordModel, navigationNlpIntentModel];

Check out how to trigger visual effects when navigation voice commands are detected in Voice UI Template!
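
To give a sense of how these commands could drive a Lens, the sketch below cycles through a list of scene objects on next and back. The pages input and the handleCommand helper are illustrative assumptions; you would call handleCommand for each command returned in the update handler (see the Command Results snippet later in this guide).

// Hypothetical list of scene objects to page through with voice navigation.
//@input SceneObject[] pages
var currentPage = 0;

var showPage = function (index) {
  for (var i = 0; i < script.pages.length; i++) {
    // Only the selected page stays enabled.
    script.pages[i].enabled = (i == index);
  }
};

var handleCommand = function (command) {
  if (command == 'next') {
    currentPage = (currentPage + 1) % script.pages.length;
  } else if (command == 'back') {
    currentPage = (currentPage - 1 + script.pages.length) % script.pages.length;
  }
  showPage(currentPage);
};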

Emotion Classifier

The EMOTION_CLASSIFIER NLP Intent Model supports 6 categories: "anger", "disgust", "fear", "joy", "sadness", "surprise".

To define EMOTION_CLASSIFIER NLP Intent Model:

var emotionNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('EMOTION_CLASSIFIER');

To add "EMOTION_CLASSIFIER" NLP Intent Model to options:

options.nlpModels = [emotionNlpIntentModel];

We can use multiple NLPIntentModels at the same time as needed.

options.nlpModels = [navigationNlpIntentModel, emotionNlpIntentModel];

Yes No Classifier

The YES_NO_CLASSIFIER NLP Intent Model supports two categories: "positive_intent", "negative_intent".

To define YES_NO_CLASSIFIER NLP Intent Model:

var yesNoNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('YES_NO_CLASSIFIER');

To add YES_NO_CLASSIFIER NLP Intent Model to options:

options.nlpModels = [
navigationNlpIntentModel,
emotionNlpIntentModel,
yesNoNlpIntentModel,
];

Check out how to trigger visual effects when emotions or positive and negative intents are detected in the Sentiment Analyzer Template!

To parse the Command Responses, we can use the parseCommandResponses function:

var parseCommandResponses = function (commandResponses) {
var commands = [];
var code = '';
for (var iIterator = 0; iIterator < commandResponses.length; iIterator++) {
var commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
var command = commandResponse.intent;
commands.push(command);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
commandResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};

We will get the command results in the onUpdateListeningEventHandler. Add the following code to the onUpdateListeningEventHandler:

//Command Results
var commandResponses = eventArgs.getIntentResponses();
var commands = parseCommandResponses(commandResponses);
if (commands.length > 0) {
var commandResponseText = '';
for (var iIterator = 0; iIterator < commands.length; iIterator++) {
commandResponseText += commands[iIterator] + '\n';
//code to run
...
}
print('Commands: ' + commandResponseText);
}
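
Navigation commands and classifier intents all come back through the same intent responses, so one simple way to react to the emotion and yes/no results is a lookup table. The mapping below is purely illustrative; you could call handleClassifierIntent for each entry in the commands array.

// Illustrative mapping from classifier intents to a reaction.
var reactions = {
  joy: 'Happy reaction',
  sadness: 'Comforting reaction',
  anger: 'Calming reaction',
  positive_intent: 'User agreed',
  negative_intent: 'User declined',
};

var handleClassifierIntent = function (intent) {
  if (reactions[intent]) {
    print('Reaction: ' + reactions[intent]);
  }
};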

Error Code for Command Responses

There are a few error codes which the NLP models (either keyword or command detection) might return:

  • #SNAP_ERROR_INDECISIVE: if no command was detected
  • #SNAP_ERROR_NONVERBAL: if we don't think the audio input was really a human talking
  • #SNAP_ERROR_SILENCE: if the silence was too long
  • Anything starting with #SNAP_ERROR_: errors that are not currently defined in this document; these should be ignored

We can only detect one voice command per NLPIntentModel at a time.

We will reuse the getErrorMessage function here and add error checking to the parseCommandResponses function.

var parseCommandResponses = function (commandResponses) {
var commands = [];
var code = '';
for (var iIterator = 0; iIterator < commandResponses.length; iIterator++) {
var commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
var command = commandResponse.intent;
///////New code for Error Checking///////
if (command.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(command);
print('Command Error: ' + errorMessage);
break;
}
/////////////////////////////////////
commands.push(command);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
commandResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};

We also recommend ignoring all return values starting with #SNAP_ERROR, as Snap is likely to add additional error codes in the future.

Reset Preview. Try to say any Navigation voice command with the Microphone button enabled in the Preview panel. We can then see the voice command results in the Logger.

If the command result includes #SNAP_ERROR, try to repeat the command or add the command to the Speech Context for better results.

Here is the full script.

//@input Asset.VoiceMLModule vmlModule {"label": "Voice ML Module"}
var options = VoiceML.ListeningOptions.create();
options.speechRecognizer = VoiceMLModule.SpeechRecognizer.Default;
//Language Option
// Language Code: "en_US", "es_MX", "fr_FR", "de_DE"
options.languageCode = 'en_US';
//General Option
options.shouldReturnAsrTranscription = true;
options.shouldReturnInterimAsrTranscription = true;
//Speech Context
var phrasesOne = ['yellow'];
var boostValueOne = 5;
options.addSpeechContext(phrasesOne, boostValueOne);
//Keyword
var keywordOne = 'yellow';
var categoryAliasesOne = ['yellow', 'maize', 'orange'];
var keywordTwo = 'black';
var categoryAliasesTwo = ['black'];
//NLPKeywordModel
var nlpKeywordModel = VoiceML.NlpKeywordModelOptions.create();
nlpKeywordModel.addKeywordGroup(keywordOne, categoryAliasesOne);
nlpKeywordModel.addKeywordGroup(keywordTwo, categoryAliasesTwo);
//Command
var navigationNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('VOICE_ENABLED_UI');
navigationNlpIntentModel.possibleIntents = [
'next',
'back',
'left',
'right',
'up',
'down',
'different',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth',
'seventh',
'eighth',
'ninth',
'tenth',
];
var emotionNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('EMOTION_CLASSIFIER');
var yesNoNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('YES_NO_CLASSIFIER');
options.nlpModels = [
nlpKeywordModel,
navigationNlpIntentModel,
emotionNlpIntentModel,
yesNoNlpIntentModel,
];
var onListeningEnabledHandler = function () {
script.vmlModule.startListening(options);
};
var onListeningDisabledHandler = function () {
script.vmlModule.stopListening();
};
var getErrorMessage = function (response) {
var errorMessage = '';
switch (response) {
case '#SNAP_ERROR_INDECISIVE':
errorMessage = 'indecisive';
break;
case '#SNAP_ERROR_NONVERBAL':
errorMessage = 'non verbal';
break;
case '#SNAP_ERROR_SILENCE':
errorMessage = 'too long silence';
break;
default:
if (response.includes('#SNAP_ERROR')) {
errorMessage = 'general error';
} else {
errorMessage = 'unknown error';
}
}
return errorMessage;
};
var parseKeywordResponses = function (keywordResponses) {
var keywords = [];
var code = '';
for (var kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
var keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
var keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
var keyword = keywordResponse.keywords[keywordsIterator];
if (keyword.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(keyword);
print('Keyword Error: ' + errorMessage);
break;
}
keywords.push(keyword);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
keywordResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};
var parseCommandResponses = function (commandResponses) {
var commands = [];
var code = '';
for (var iIterator = 0; iIterator < commandResponses.length; iIterator++) {
var commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
var command = commandResponse.intent;
if (command.includes('#SNAP_ERROR')) {
var errorMessage = getErrorMessage(command);
print('Command Error: ' + errorMessage);
break;
}
commands.push(commandResponse.intent);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
commandResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};
var onUpdateListeningEventHandler = function (eventArgs) {
if (eventArgs.transcription.trim() == '') {
return;
}
print('Transcription: ' + eventArgs.transcription);
if (!eventArgs.isFinalTranscription) {
return;
}
print('Final Transcription: ' + eventArgs.transcription);
//Keyword Results
var keywordResponses = eventArgs.getKeywordResponses();
var keywords = parseKeywordResponses(keywordResponses);
if (keywords.length > 0) {
var keywordResponseText = '';
for (var kIterator = 0; kIterator < keywords.length; kIterator++) {
keywordResponseText += keywords[kIterator] + '\n';
}
print('Keywords:' + keywordResponseText);
}
//Command Results
var commandResponses = eventArgs.getIntentResponses();
var commands = parseCommandResponses(commandResponses);
if (commands.length > 0) {
var commandResponseText = '';
for (var iIterator = 0; iIterator < commands.length; iIterator++) {
commandResponseText += commands[iIterator] + '\n';
}
print('Commands: ' + commandResponseText);
}
};
var onListeningErrorHandler = function (eventErrorArgs) {
print(
'Error: ' + eventErrorArgs.error + ' desc: ' + eventErrorArgs.description
);
};
//VoiceML Callbacks
script.vmlModule.onListeningUpdate.add(onUpdateListeningEventHandler);
script.vmlModule.onListeningError.add(onListeningErrorHandler);
script.vmlModule.onListeningEnabled.add(onListeningEnabledHandler);
script.vmlModule.onListeningDisabled.add(onListeningDisabledHandler);

Previewing Your Lens

You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.

Once the Lens is pushed to Snapchat, you will see the hint: TAP TO TURN ON VOICE CONTROL. Tap the screen to start VoiceML and OnListeningEnabled will be triggered. Tap the screen again to stop VoiceML and OnListeningDisabled will be triggered.
