Version: 5.x

Speech Recognition

Speech Recognition is available in the Lens Studio Asset Library. Import the asset into your project, create a new Orthographic Camera, and place the prefab under it.

The Speech Recognition template demonstrates how to use Speech Recognition to add transcription, keyword detection, and voice navigation command detection (based on basic natural language understanding) to your Lenses. The template contains several helpers that you can use to create voice experiences without scripting.

To learn more about Speech Recognition concepts and scripting, check out the Speech Recognition Guide. To learn more about voice UI, check out the Voice UI Template.

Guide

The template has two examples that show how to use Speech Recognition:

  • Speech Transcription Example: transcribes speech and returns live and final transcription results.
  • Keyword Detection Example: lets you define a list of keywords and detects them on top of transcription using the Keyword Classifier.

Keep the following in mind when using Speech Recognition:

  • Transcription and keyword detection are currently available for English, Spanish, French, and German. Transcription limitations include, for example, new names for things, slang words, and acute accents.
  • Do not play sound or speech from the Lens while the microphone is active and capturing sound.
  • Avoid background noise and holding the device far away while the microphone is active and capturing sound.
  • If the microphone is muted for more than two minutes, transcription won't continue after unmuting, and you'll need to reset the Preview panel to enable it again.
  • If you were previously logged in to MyLenses and are having trouble seeing the preview in Lens Studio, log out of MyLenses and log in again.

Here is how to log out of and log in to MyLenses.

When we open the template, we can find both examples in the Scene Hierarchy panel.

VoiceML Module

The main asset used for Speech Recognition is the VoiceML Module. We can find it in the Asset Browser panel. We attach it to the script in each example to configure settings for Speech Recognition.

At the bottom of the Preview panel, click the microphone button. Speak to see the blue vertical volume meter in action and make sure you are not muted. Then try saying anything to see the transcription and scene objects react to the voice events.

This template comes with a preview video named Preview with Audio. This preview video contains audio for Lens Studio to run VoiceML on so that you can test the template without using your microphone. However, the audio itself will not be played to your computer's speaker.

Transcription Example

Now let’s take a look at the transcription example. In this example, we transcribe speech and render live and final transcription results with Screen Text objects. We also use a voice-reactive 2D Listening Animation as an example of how to trigger visuals with Voice Events.

When the voice event On Listening Enabled is successfully called, the Listening Icon pops up. Now try speaking into the microphone. The icon animates while in listening mode, pauses when the final transcription results arrive, and turns red when there is an error.

Click on Transcription Example in the Scene Hierarchy panel. In the Inspector panel, we can see that SpeechRecognition.js is used in the Script component. This is the main script used for all the examples. Now let’s go through the details.

Notice that in the first section we attach the VoiceML Module to the Speech Recognition Script component; it configures Speech Recognition and provides voice input in Lens Studio.
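If you want to reference the VoiceML Module from your own script rather than through the template's Inspector fields, a minimal sketch looks like this (the input name vmlModule is just an example, not a name used by the template):

```js
// Attach this script to a scene object and assign the VoiceML Module
// from the Asset Browser to the input below.
//@input Asset.VoiceMLModule vmlModule

// Log once the microphone is enabled and the module is ready to listen.
script.vmlModule.onListeningEnabled.add(function () {
    print('VoiceML is ready to listen.');
});
```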

Basic Settings for Transcription

Now let’s go through some basic settings for transcription in the next section.

  • Transcription: final transcription.
  • Live Transcription: live and slightly less accurate transcription before we get the final, more accurate transcription.

Try changing these settings to see the difference.
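In scripting terms, these two checkboxes roughly correspond to the transcription flags on VoiceML.ListeningOptions. A minimal sketch, assuming a script input named vmlModule (a hypothetical name):

```js
//@input Asset.VoiceMLModule vmlModule

// Configure transcription: final results plus faster, less accurate live results.
var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;        // "Transcription"
options.shouldReturnInterimAsrTranscription = true; // "Live Transcription"

// Start listening once the microphone is enabled.
script.vmlModule.onListeningEnabled.add(function () {
    script.vmlModule.startListening(options);
});
```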

Transcription Text

With Transcription Text enabled, you can output the transcription result to a Screen Text object.

In this example, the Screen Text object is under Orthographic Camera->Transcription Example UI->Transcription Text.
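A scripting equivalent sketch: the onListeningUpdate event delivers both live and final transcriptions, which can be written to a Text component. It assumes listening was started as in the previous sketch, and the input name transcriptionText is hypothetical:

```js
//@input Asset.VoiceMLModule vmlModule
//@input Component.Text transcriptionText

script.vmlModule.onListeningUpdate.add(function (eventArgs) {
    if (eventArgs.transcription.trim() === '') {
        return;
    }
    // Show live results immediately; final results overwrite them.
    script.transcriptionText.text = eventArgs.transcription;
    if (eventArgs.isFinalTranscription) {
        print('Final transcription: ' + eventArgs.transcription);
    }
});
```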

Speech Context

We can also add speech contexts to the transcription to boost certain words for specific transcription scenarios. Use this when transcribing rarer words that aren’t picked up well enough by Snap; the higher the boost value, the more likely the word is to appear in the transcription.

With useSpeechContext setting enabled, we can then attach the Speech Contexts object to it.
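If you prefer to set up speech contexts in code, the equivalent call on the listening options is addSpeechContext, which takes a list of phrases and a boost value. A minimal sketch extending the options from the earlier example:

```js
var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;

// Boost rarer phrases so they are more likely to appear in the transcription.
// Phrases should be lowercase, in-vocabulary words; boost is in the 1-10 range.
options.addSpeechContext(['soup', 'cookie', 'starve'], 5);
```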

In the Scene Hierarchy panel, click on the Transcription Example -> Speech Contexts object. In the Inspector panel, we can see that a Speech Context Script component is attached to the object!

Add New Phrase to Speech Context

With Speech Context Script Component, we can add words to the phrases and set a boost value for the phrases. To add a new word, click on the Add Value field and input a new word you want to add.

Note that phrases should consist only of lowercase a-z letters and should be within the vocabulary.

Out of Vocabulary

When an OOV (out of vocabulary) phrase is added to the Speech Context, the On Error Triggered voice event fires and we see the error message in the Logger. Here we take a random word, “az zj”, as an example.

Try this by resetting the Lens in the Preview Panel, then speaking with the Microphone button enabled. We can then see the error message in the Logger.

Add New Speech Context

Or we can add a new Speech Context Script Component with a different boost value.

The range for the boost value is 1-10. We recommend starting with 5 and adjusting as needed (the higher the value, the more likely the word is to appear in the transcription).

Voice Events

We will skip the Use Keyword and Use Command sections for now. Let's take a look at Edit Behaviors. If we enable Edit Behaviors, we can attach Behavior Scripts to different voice events and trigger different visuals. There are five voice events:

  • On Listening Enabled: triggered when the microphone is enabled.
  • On Listening Disabled: triggered when the microphone is disabled.
  • On Listening Triggered: triggered when we change back to listening mode.
  • On Error Triggered: triggered when there is an error in the transcription.
  • On Final Transcription Triggered: triggered when a final (full) transcription result is returned, rather than a partial one.

In the Lens Studio Preview, the microphone button is not simulated on the screen. When we reset the preview, On Listening Enabled is triggered automatically once Speech Recognition is initialized. To learn how to use the microphone button, preview the Lens in Snapchat! Please check the details in Previewing Your Lens.
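These callbacks mirror the events exposed by the VoiceML Module in scripting. A minimal sketch of registering handlers (the vmlModule input name is hypothetical and the handler bodies are placeholders):

```js
//@input Asset.VoiceMLModule vmlModule

var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;

script.vmlModule.onListeningEnabled.add(function () {
    // Microphone enabled: start the listening session.
    script.vmlModule.startListening(options);
});

script.vmlModule.onListeningDisabled.add(function () {
    // Microphone disabled: stop the listening session.
    script.vmlModule.stopListening();
});

script.vmlModule.onListeningError.add(function (eventErrorArgs) {
    // Transcription error, e.g. an out-of-vocabulary speech context phrase.
    print('Error: ' + eventErrorArgs.error + ' - ' + eventErrorArgs.description);
});
```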

Debug Message for Voice Events

If we enable Debug in the Voice Event Callbacks, we can see each Voice Event printed in the Logger when it is triggered.

Send Behavior Triggers with Voice Events

For each Voice Event Callback, we can assign multiple Behavior Scripts. Click the Add Value field to attach new Behavior Scripts.

Here in this example, we are attaching different behavior scripts to change the visuals for the listening icon and screen texts.
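Behavior scripts can also be driven from code through custom triggers. A minimal sketch, assuming a Behavior script somewhere in the scene is set to respond to a Custom Trigger named LISTENING_ENABLED_TRIGGER (a hypothetical name, not one used by the template):

```js
//@input Asset.VoiceMLModule vmlModule

script.vmlModule.onListeningEnabled.add(function () {
    // Fires every Behavior whose trigger is set to
    // Custom Trigger = "LISTENING_ENABLED_TRIGGER" (hypothetical name).
    if (global.behaviorSystem) {
        global.behaviorSystem.sendCustomTrigger('LISTENING_ENABLED_TRIGGER');
    }
});
```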

Behavior

In the Scene Hierarchy panel, we have a Transcription Example->Behaviors object. It has all the behavior scripts used in this example. Click on each object. In the Inspector panel, we can then see the details of the Behavior Script.

Taking the On Listening Enabled Behavior object as an example, we use a Behavior Script to enable the Listening Icon.

Keyword Detection Example

Now that you have learned how to use basic transcription, let’s take a look at the second example, Keyword Detection. Here we can trigger behaviors based on different keywords detected in the transcription!

Let’s disable the first example, enable the Keyword Detection Example, and Reset Preview!

When the voice event On Listening Enabled is successfully called, the Listening Icon pops up. Now try speaking into the microphone. With this example we can say:

  • “hungry/starve/I am hungry” to enable the food objects.
  • “breakfast” / “I had breakfast” / “I went out for breakfast” to trigger a VFX effect with any food texture.
  • “soup”/“dip” to trigger a VFX effect with soup texture.
  • “cookie”/“I eat cookie this morning” to trigger a VFX effect with cookie texture.
  • etc.

Now click on the Keyword Detection Example. In the Inspector panel, we can see that it continues to use the Speech Recognition Script component. Since keyword detection is based on transcription, the Transcription setting is enabled.

We can also enable the Live Transcription and Transcription Text as needed!

With the Live Transcription setting enabled, we get a faster response. Without it, we only get keyword responses from the final, more accurate transcription.

Speech Context

Click on Keyword Detection Example -> SpeechContexts. Notice that we are using a different list of Speech Contexts here, boosting words like “hungry”, “starve”, “soup”, and “cookie”, which will be used in Keyword Detection!

Use Keyword

Now go back to the Speech Recognition Script component. Here we can see that Use Keyword is enabled for keyword detection!

Keywords Parent Object

Notice here we need to attach a scene object as the Keywords Parent Object - Keyword Detection Example -> Keywords Object.

Keyword

Keywords Object has a list of children, each with a Keyword Script component. Click on each child object; in the Inspector panel, we can see the details of the Keyword Script component, which is used to configure basic settings for the keyword.

Notice here any enabled objects under Keywords Object with Keyword Script Component attached will be added as a keyword.

Alias

Let’s take the keyword “Hungry” as an example. We define the keyword “Hungry” and then a list of aliases for it: “hungry”, “starve”, and “I am hungry”. If any of these aliases appear in the transcription results, the keyword “Hungry” is detected. Aliases give us the ability to expand the set of words that should return Hungry, if needed, to serve a specific Lens experience.

When using keyword detection, VoiceML will try to mitigate small transcription errors such as plurals instead of singulars or similar-sounding words (ship/sheep, etc.). You should still use multiple keywords to cover how different users might say the same thing in different ways, like “cinema”, “movies”, “film”.

Note that aliases can be more than one word, e.g. short phrases or sentences.
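In scripting, keywords and their aliases map to an NLP keyword model attached to the listening options. A minimal sketch of the “Hungry” keyword from this example, based on the VoiceML scripting API:

```js
var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;

// One keyword group: the keyword "Hungry" and its aliases.
var keywordModel = VoiceML.NlpKeywordModelOptions.create();
keywordModel.addKeywordGroup('Hungry', ['hungry', 'starve', 'I am hungry']);
options.nlpModels = [keywordModel];
```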

Add New Alias

To add a new alias, click on the Add Value field.

Add New Keyword

To add a new keyword, in the Scene Hierarchy panel, duplicate any keyword object under Keywords and modify the settings. Or add a new scene object under Keywords Object and attach Keyword.js to the Script Component.

Send Behavior Triggers with Keyword Detection

In the next section, with Send Triggers enabled, we can then attach multiple Behavior Scripts to the On Keyword Triggered. If a keyword is detected, all the Behaviors attached to the Keyword will be triggered. To attach a new Behavior Script, click on the Add Value field.

Here let’s take the keyword “Hungry” as an example. When “Hungry” is detected, we will trigger Set Object Scale Behavior Script. In the Scene Hierarchy panel, click on the Keyword Detection Example->Behaviors->Set Object Scale Behavior Object.

Behavior and Tween

In the Inspector panel, let’s take a look at the first Script component: we use a Behavior Script to trigger the tween animation TWEEN_SETSCALE_UP on Keyword Detection Example -> Head Binding -> BitmojiFood -> Second Script Component.

Here with TweenColor Script Component, we can scale up the food 3D model.

Error Code for Keyword Responses

There are a few error codes which the NLP models (for either keyword or command detection) might return:

  • #SNAP_ERROR_INDECISIVE: no keyword was detected.
  • #SNAP_ERROR_NONVERBAL: the audio input doesn’t appear to be a human talking.
  • #SNAP_ERROR_SILENCE: the silence lasted too long.
  • Anything else starting with #SNAP_ERROR_: errors that are not currently defined in this document and should be ignored.

Several keywords can be detected in the same utterance, but in this template only one random keyword will trigger a behavior.

Now try resetting the Preview panel and, with the Microphone button enabled, saying words that are not in the keyword list. We can then see the keyword error messages in the Logger.
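A sketch of how keyword responses, including the #SNAP_ERROR_ codes above, can be inspected inside the onListeningUpdate callback. It assumes listening was started with the keyword model from the earlier sketch; the vmlModule input name is hypothetical:

```js
//@input Asset.VoiceMLModule vmlModule

script.vmlModule.onListeningUpdate.add(function (eventArgs) {
    var keywordResponses = eventArgs.getKeywordResponses();
    for (var i = 0; i < keywordResponses.length; i++) {
        var keywords = keywordResponses[i].keywords;
        for (var j = 0; j < keywords.length; j++) {
            var keyword = keywords[j];
            if (keyword.indexOf('#SNAP_ERROR') === 0) {
                // Undefined #SNAP_ERROR_ codes should simply be ignored.
                print('Keyword error: ' + keyword);
                continue;
            }
            print('Keyword detected: ' + keyword);
        }
    }
});
```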

Keyword Screen Text

After we set up all the keywords, let’s get back to the Speech Recognition Script component. With Keyword Text enabled, the detected keyword is written to the Screen Text component under Orthographic Camera -> Keyword Detection UI -> NLP Keyword Text Object.

Voice Event - On Keyword Detected

In the example, when a keyword is detected, you might notice that we also trigger two tween animations:

  1. TWEEN_SETSCREENTRANS_DETECTED - Orthographic Camera -> Keyword Detection UI - Safe Render Region -> NLP Keyword Text Object -> First Script Component.
  2. TWEEN_SETCOLOR_DETECTED - Orthographic Camera -> Keyword Detection UI - Safe Render Region -> NLP Keyword Text Object -> Second Script Component.

We also pause the Listening Icon animation.

Here we are using a new Voice Event - On Keyword Detected. Let’s go back to the main Speech Recognition Script Component under Keyword Detection Example Object.

Notice that with Use Keyword and Edit Voice Event Callback enabled, a new field, On Keyword Detected, is added to the Voice Event Callbacks. As with the other Voice Events, we can add a list of Behavior Scripts that will be triggered when any keyword is detected.

Previewing Your Lens

You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.

Once the Lens is pushed to Snapchat, you will see the hint: TAP TO TURN ON VOICE CONTROL. Tap the screen to start VoiceML and trigger OnListeningEnabled. Tap again to stop VoiceML and trigger OnListeningDisabled.
