Speech Recognition
In this guide we will go over some of the concepts that underlie how Speech Recognition works in Lens Studio and Snapchat.
This guide is script heavy and explains how to set up an experience from scratch. Take a look at the Speech Recognition, Voice UI, and Sentiment Analyzer examples for ready-to-go setups.
VoiceML
The VoiceML Module allows you to incorporate transcription, keyword detection, and voice command detection based on basic natural language understanding into your Lenses. These capabilities can serve as triggers for various visual effects or other behaviors.
- Transcription: Transcribes speech and returns a transcript. This can be done in real time (Pre-Capture) or during recording.
- Keyword Detection: Allows users to define a list of keywords and detects them based on the Keyword Classifier.
- Voice Navigation Command Detection: Detects a predefined list of voice navigation commands based on basic natural language processing.
- Sentiment Analyzer: Analyzes sentiment based on basic natural language processing.
Overview
This guide will walk through the usage of the VoiceML Module. You can find the complete script that utilizes transcription, keyword detection, voice navigation command detection as well as sentiment analysis at the bottom of this page.
Scripting Example
Create a new JavaScript file in the Asset Browser and drag it to the scene. Open the script for editing and add the following line:
const voiceMLModule = require('LensStudio:VoiceMLModule');
Listening Options
The Listening Options object allows you to configure the VoiceML Module. To create options:
let options = VoiceML.ListeningOptions.create();
Set speech recognizer options:
options.speechRecognizer = VoiceMLModule.SpeechRecognizer.Default;
Set the language to one of the following:
options.languageCode = 'en_US';
options.languageCode = 'es_MX';
options.languageCode = 'fr_FR';
options.languageCode = 'de_DE';
Start and Stop Listening
Pass the options object as an argument to the startListening function:
voiceMLModule.startListening(options);
Or call stopListening to stop:
voiceMLModule.stopListening();
Adding Callbacks
The VoiceML Module enables adding callbacks for specific voice events:
- onListeningEnabled: called when the microphone input is enabled.
- onListeningDisabled: called when the microphone input is disabled.
- onListeningUpdate: called if transcription was successful.
- onListeningError: called in case an error occurs.
In the Preview panel, the microphone button is not simulated on the screen. When we reset the preview, onListeningEnabled is triggered automatically once VoiceML is initialized. To learn how to use the microphone button, preview the Lens in Snapchat.
let onListeningEnabledHandler = function () {
  voiceMLModule.startListening(options);
};
let onListeningDisabledHandler = function () {
  voiceMLModule.stopListening();
};
let onListeningErrorHandler = function (eventErrorArgs) {
print(
'Error: ' + eventErrorArgs.error + ' desc: ' + eventErrorArgs.description
);
};
let onUpdateListeningEventHandler = function (eventArgs) {
// process transcription
};
voiceMLModule.onListeningEnabled.add(onListeningEnabledHandler);
voiceMLModule.onListeningDisabled.add(onListeningDisabledHandler);
voiceMLModule.onListeningError.add(onListeningErrorHandler);
voiceMLModule.onListeningUpdate.add(onUpdateListeningEventHandler);
Transcription
To enable transcription:
options.shouldReturnAsrTranscription = true;
We can also get a live but less accurate transcription before the final transcription arrives. To enable this setting:
options.shouldReturnInterimAsrTranscription = true;
Let's modify the onUpdateListeningEventHandler function to process the transcription result:
let onUpdateListeningEventHandler = function (eventArgs) {
if (eventArgs.transcription.trim() == '') {
return;
}
print('Transcription: ' + eventArgs.transcription);
if (!eventArgs.isFinalTranscription) {
return;
}
print('Final Transcription: ' + eventArgs.transcription);
};
Reset Preview. Speak with the Microphone button enabled in the Preview panel and see the result in the Logger.
Speech Context
Incorporating speech contexts into transcriptions can enhance the accuracy of specific words by increasing their likelihood of being recognized. This is particularly useful for transcribing less common words that may not be effectively captured by VoiceML. The probability of a word appearing in the transcription is directly proportional to the boost value assigned, with higher values leading to greater likelihood.
To add speech context:
//set speech context
let phrasesOne = ["yellow", "blue", ...];
let boostValueOne = 5;
let phrasesTwo = ["black"];
let boostValueTwo = 5;
options.addSpeechContext(phrasesOne, boostValueOne);
options.addSpeechContext(phrasesTwo, boostValueTwo);
To effectively utilize this feature, consider the following:
- The boost value can range from 1 to 10, with 10 representing the strongest increase in likelihood. It is advisable to start with a default value of 5 and adjust as necessary, depending on the requirements of your specific transcription scenario.
- This approach is particularly beneficial in scenarios where precise recognition of uncommon words is critical; a short boost-tuning sketch follows this list.
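As a rough sketch (the phrases and boost values below are purely illustrative assumptions, not recommendations from Snap), common words can stay at the starting boost while a harder-to-recognize word gets a stronger one:
// Illustrative boost tuning: adjust the values for your own Lens
let commonPhrases = ["yellow", "blue"];
let trickyPhrases = ["maize"];
// Start common words at the suggested default boost of 5
options.addSpeechContext(commonPhrases, 5);
// Raise the boost if a less common word keeps getting missed
options.addSpeechContext(trickyPhrases, 8);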
When an OOV (out of vocabulary) phrase is added to the Speech Context, the onListeningError handler will be called. Here we take the random phrase az zj as an example.
// Speech Context Example
let phrasesOne = ["az zj",”maize”];
let boostValueOne = 5;
options.addSpeechContext(phrasesOne,boostValueOne);
Reset Preview. Speak with the Microphone button enabled in the Preview panel. We can then see the error message in the Logger.
Keyword Detection
Speech Recognition also enables keyword detection.
For each keyword, it is possible to define a list of aliases. When these aliases appear in the transcription results, the keyword detection feature is activated.
Here let's take the keyword Yellow as an example. We define the keyword Yellow, then define a list of aliases for it: orange, yellow, maize, and light yellow. If any of these aliases appear in the transcription results, the keyword Yellow will be detected. Aliases give us the ability to expand the subset of words that should return Yellow if needed to serve a specific Lens experience.
// Keyword Detection Example
let keywordOne = "Yellow";
let categoryAliasesOne = ["orange", "yellow", "maize", "light yellow"];
let keywordTwo = "Black";
let categoryAliasesTwo = ["black",...];
When using keyword detection, the Snap engine will try to mitigate small transcription errors such as plurals instead of singulars or similar-sounding words (ship/sheep, etc.). Instead of relying on this, use multiple aliases to cover the different ways users might say the same thing, such as cinema, movies, and film.
Aliases may contain more than one word, i.e. short phrases, as sketched below.
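For instance, a single keyword group could cover the cinema case above, with a short phrase as one of the aliases (the keyword and aliases here are purely illustrative):
// Illustrative keyword group covering several ways of saying the same thing
let keywordCinema = "Cinema";
let categoryAliasesCinema = ["cinema", "movies", "film", "movie theater"];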
After we define all the keywords we need, we define the NLP (Natural Language Processing) Keyword Model and add each keyword group to it:
//NLPKeywordModel
let nlpKeywordModel = VoiceML.NlpKeywordModelOptions.create();
nlpKeywordModel.addKeywordGroup(keywordOne,categoryAliasesOne);
nlpKeywordModel.addKeywordGroup(keywordTwo,categoryAliasesTwo);
...
Finally, add the NLP Keyword Model to options:
options.nlpModels = [nlpKeywordModel, ...];
To parse the keyword responses, we can use a parseKeywordResponses function:
let parseKeywordResponses = function (keywordResponses) {
let keywords = [];
let code = '';
for (let kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
let keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
let keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
keywords.push(keywordResponse.keywords[keywordsIterator]);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
            keywordResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};
We will get the keyword results in the onUpdateListeningEventHandler. Add the following code to the onUpdateListeningEventHandler after the transcription section:
//Keyword Results
let keywordResponses = eventArgs.getKeywordResponses();
let keywords = parseKeywordResponses(keywordResponses);
if (keywords.length > 0) {
  let keywordResponseText = "";
  for (let kIterator = 0; kIterator < keywords.length; kIterator++) {
    keywordResponseText += keywords[kIterator] + "\n";
    //code to run
    ...
  }
  print("Keyword:" + keywordResponseText);
}
Error Codes for Keyword Responses
- #SNAP_ERROR_INDECISIVE: if no keyword is detected.
- #SNAP_ERROR_NONVERBAL: if we don't think the audio input was really a human talking.
- #SNAP_ERROR_SILENCE: if there is an extended silence.
- Anything else starting with #SNAP_ERROR_: errors that are not currently defined in this document and should be ignored.
::: tip If more than one keyword was spoken in a single utterance, we will return a list of all keywords. :::
In the example below, we will add a new function to the script to get error messages.
let getErrorMessage = function (response) {
let errorMessage = '';
switch (response) {
case '#SNAP_ERROR_INDECISIVE':
errorMessage = 'indecisive';
break;
case '#SNAP_ERROR_NONVERBAL':
errorMessage = 'non verbal';
break;
case '#SNAP_ERROR_SILENCE':
errorMessage = 'too long silence';
break;
default:
if (response.includes('#SNAP_ERROR')) {
errorMessage = 'general error';
} else {
errorMessage = 'unknown error';
}
}
return errorMessage;
};
Then add error checking to the parseKeywordResponses function:
let parseKeywordResponses = function (keywordResponses) {
let keywords = [];
let code = '';
for (let kIterator = 0; kIterator < keywordResponses.length; kIterator++) {
let keywordResponse = keywordResponses[kIterator];
switch (keywordResponse.status.code) {
case 0:
code = 'OK';
for (
let keywordsIterator = 0;
keywordsIterator < keywordResponse.keywords.length;
keywordsIterator++
) {
let keyword = keywordResponse.keywords[keywordsIterator];
///////New code for Error Checking///////
if (keyword.includes('#SNAP_ERROR')) {
let errorMessage = getErrorMessage(keyword);
print('Keyword Error: ' + errorMessage);
break;
}
/////////////////////////////////////
keywords.push(keyword);
}
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
            keywordResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return keywords;
};
We also recommend ignoring all return values starting with #SNAP_ERROR, as Snap is likely to add additional error codes in the future.
Reset Preview. Try to say the keywords with the Microphone button enabled in the Preview panel. We can then see the keyword results in the Logger.
If the keyword result includes #SNAP_ERROR, try to repeat the word and add the word to the Speech Context for better results.
Check out how to trigger visual effects when keywords are detected in the Speech Recognition example!
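As a rough idea of what that can look like (the targetObject script input and the check below are hypothetical and not taken from the example project), a detected keyword can simply toggle a scene object inside the keyword loop of onUpdateListeningEventHandler:
// Hypothetical script input, declared at the top of the script
//@input SceneObject targetObject

// Inside the keyword loop, react to a specific keyword
if (keywords[kIterator] == "Yellow") {
  // Show the object when the "Yellow" keyword is detected
  script.targetObject.enabled = true;
}
The same pattern works for the command and emotion intents described below: compare the returned string against the intent you care about and drive the visual effect from it.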
Voice Command Detection
Voice Command Detection is based on basic natural language processing on top of transcription. We use the NLP Intent Model for voice command detection.
Currently only English is supported.
Voice Navigation Command
The VOICE_ENABLED_UI NLP Intent Model supports a list of voice commands: "next", "back", "left", "right", "up", "down", "different", "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth".
A voice navigation command is different from a keyword: we don't have to say the exact word to trigger the command. Take back as an example. We can say "go back", "go to the previous one", or "the one before", etc.
To define the VOICE_ENABLED_UI NLP Intent Model:
//Command
let navigationNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('VOICE_ENABLED_UI');
navigationNlpIntentModel.possibleIntents = [
'next',
'back',
'left',
'right',
'up',
'down',
'different',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth',
'seventh',
'eighth',
'ninth',
'tenth',
];
Here we can also send a shorter list of intents to enable a smaller command set, for example if we only need next and back:
navigationNlpIntentModel.possibleIntents = ['next', 'back'];
To add NLP Intent Model to options:
options.nlpModels = [navigationNlpIntentModel];
We can use NLPKeywordModel and NLPIntentModel at the same time as needed.
options.nlpModels = [nlpKeywordModel, navigationNlpIntentModel];
Check out how to trigger visual effects when navigation voice commands are detected in the Voice UI example!
Emotion Classifier
The EMOTION_CLASSIFIER NLP Intent Model supports six categories: anger, disgust, fear, joy, sadness, and surprise.
To define the EMOTION_CLASSIFIER NLP Intent Model:
let emotionNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('EMOTION_CLASSIFIER');
To add "EMOTION_CLASSIFIER" NLP Intent Model to options:
options.nlpModels = [emotionNlpIntentModel];
We can use multiple NLPIntentModels at the same time as needed.
options.nlpModels = [navigationNlpIntentModel, emotionNlpIntentModel];
Yes No Classifier
The YES_NO_CLASSIFIER NLP Intent Model supports two categories: "positive_intent" and "negative_intent".
To define the YES_NO_CLASSIFIER NLP Intent Model:
let yesNoNlpIntentModel =
VoiceML.NlpIntentsModelOptions.create('YES_NO_CLASSIFIER');
To add the YES_NO_CLASSIFIER NLP Intent Model to options:
options.nlpModels = [
navigationNlpIntentModel,
emotionNlpIntentModel,
yesNoNlpIntentModel,
];
Check out how to trigger visual effects when emotions or positive and negative intents are detected in the Sentiment Analyzer Example!
To parse the command responses, we can use a parseCommandResponses function:
let parseCommandResponses = function (commandResponses) {
let commands = [];
let code = '';
for (let iIterator = 0; iIterator < commandResponses.length; iIterator++) {
let commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
let command = commandResponse.intent;
commands.push(command);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
            commandResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};
We will get the command results in the onUpdateListeningEventHandler. Add the following code to the onUpdateListeningEventHandler:
//Command Results
let commandResponses = eventArgs.getIntentResponses();
let commands = parseCommandResponses(commandResponses);
if (commands.length > 0) {
  let commandResponseText = "";
  for (let iIterator = 0; iIterator < commands.length; iIterator++) {
    commandResponseText += commands[iIterator] + "\n";
    //code to run
    ...
  }
  print("Commands: " + commandResponseText);
}
Error Codes for Command Responses
There are a few error codes which the NLP models (either keyword or command detection) might return:
- #SNAP_ERROR_INDECISIVE: if no command is detected.
- #SNAP_ERROR_NONVERBAL: if we don't think the audio input was really a human talking.
- #SNAP_ERROR_SILENCE: if there is too long a silence.
- Anything else starting with #SNAP_ERROR_: errors that are not currently defined in this document and should be ignored.
Only one voice command can be detected per NLPIntentModel at a time.
We will reuse the getErrorMessage function here and add error checking to the parseCommandResponses function:
let parseCommandResponses = function (commandResponses) {
let commands = [];
let code = '';
for (let iIterator = 0; iIterator < commandResponses.length; iIterator++) {
let commandResponse = commandResponses[iIterator];
switch (commandResponse.status.code) {
case 0:
code = 'OK';
let command = commandResponse.intent;
///////New code for Error Checking///////
if (command.includes('#SNAP_ERROR')) {
let errorMessage = getErrorMessage(command);
print('Command Error: ' + errorMessage);
break;
}
/////////////////////////////////////
commands.push(command);
break;
case 1:
code = 'ERROR';
print(
'Status Code: ' +
code +
' Description: ' +
            commandResponse.status.description
);
break;
default:
print('Status Code: No Status Code');
}
}
return commands;
};
We also recommend ignoring all return values starting with #SNAP_ERROR, as Snap is likely to add additional error codes in the future.
Reset Preview. Try to say any navigation voice command with the Microphone button enabled in the Preview panel. We can then see the voice command results in the Logger.
If the command result includes #SNAP_ERROR, try to repeat the command or add the command to the Speech Context for better results.
Limitations
Transcription and Keyword Detection are available for the English, Spanish and German languages.
Voice Command Detection is only available for the English language.
Transcription limitations include things such as new names, slang words, or acute accents.
Please keep the following in mind when working with Sentiment Analyzer:
- Do not play sound or speech from the Lens while activating the microphone to capture sound.
- Try to avoid background noise and holding the device far away while activating the microphone to capture sound.
- If the microphone is muted for more than two minutes, the transcription won't continue after unmuting, and you’ll need to reset the Preview panel to enable it.
- If you were previously logged in to MyLenses and are having trouble seeing the preview in Lens Studio, log out of MyLenses and log back in.
Here is how to log out and log in to MyLenses.
Previewing Your Lens
You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.
Once the Lens is pushed to Snapchat, you will see the hint: TAP TO TURN ON VOICE CONTROL. Tap the screen to start VoiceML; the onListeningEnabled event will be triggered. Press the button again to stop VoiceML; the onListeningDisabled event will be triggered.