Version: 5.x

Speech Recognition

Speech Recognition is available in the Lens Studio Asset Library. Import the asset into your project, create a new Orthographic Camera, and place the prefab under it.

The Speech Recognition template demonstrates how to use Speech Recognition to add transcription, keyword detection, and voice navigation command detection (based on basic natural language understanding) to your Lenses. The template contains several helpers that you can use to create voice experiences without scripting.

To learn more about Speech Recognition concepts and scripting, check out the Speech Recognition Guide. To learn more about voice UI, check out the Voice UI Template.

Guide

The template has two examples that show how to use Speech Recognition:

  • Speech Transcription Example: transcribes speech and returns live and final transcription results.
  • Keyword Detection Example: lets you define a list of keywords and detects them on top of transcription using the Keyword Classifier.

Keep the following in mind when using Speech Recognition:

  • Transcription and keyword detection are currently available for English, Spanish, French, and German. Transcription limitations include, for example, new names for things, slang words, and acute accents.
  • Do not play sound or speech from the Lens while the microphone is active and capturing sound.
  • Avoid background noise and holding the device far away while the microphone is active and capturing sound.
  • If the microphone is muted for more than two minutes, transcription won't continue after unmuting, and you'll need to reset the Preview panel to enable it again.
  • If you were previously logged in to MyLenses and are having trouble seeing the preview in Lens Studio, log out of MyLenses and log in again.

Here is how to log out of and log in to MyLenses.

When we open the template, we can find both examples in the Scene Hierarchy panel.

VoiceML Module

The main asset used for Speech Recognition is the VoiceML Module. We can find it in the Asset Browser panel. We attach it to the script in each example to configure settings for Speech Recognition.

At the bottom of the Preview panel, click the microphone button. Speak to see the blue vertical volume meter in action and make sure you are not muted. Then try saying anything to see the transcription and scene objects react to the voice events.

This template comes with a preview video named Preview with Audio. This preview video contains audio for Lens Studio to run VoiceML on so that you can test the template without using your microphone. However, the audio itself will not be played to your computer's speaker.

Transcription Example

Now let’s take a look at the transcription example. In this example, we transcribe speech and render live and final transcription results with Screen Text objects. We also use a voice-reactive 2D Listening Animation as an example of how to trigger visuals with Voice Events.

When the voice event On Listening Enabled is successfully called, the Listening Icon pops up. Now try speaking into the microphone. The icon animates while in listening mode, pauses when the final transcription results arrive, and turns red when there is an error.

Click on Transcription Example in the Scene Hierarchy panel. In the Inspector panel, we can see that SpeechRecognition.js is used in the Script component. This is the main script used for all the examples. Now let’s go through the details.

Notice that in the first section we attach the VoiceML Module to the Speech Recognition Script component; it configures Speech Recognition and provides voice input in Lens Studio.
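If you want to reference the VoiceML Module from your own script rather than through the template's Inspector fields, a minimal sketch looks like this (the input name vmlModule is just an example, not a name used by the template):

```js
// Attach this script to a scene object and assign the VoiceML Module
// from the Asset Browser to the input below.
//@input Asset.VoiceMLModule vmlModule

// Log once the microphone is enabled and the module is ready to listen.
script.vmlModule.onListeningEnabled.add(function () {
    print('VoiceML is ready to listen.');
});
```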

Basic Settings for Transcription

Now let’s go through some basic settings for transcription in the next section.

  • Transcription: final transcription.
  • Live Transcription: live and slightly less accurate transcription before we get the final, more accurate transcription.

Try changing these settings to see the difference.
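In scripting terms, these two checkboxes roughly correspond to the transcription flags on VoiceML.ListeningOptions. A minimal sketch, assuming a script input named vmlModule (a hypothetical name):

```js
//@input Asset.VoiceMLModule vmlModule

// Configure transcription: final results plus faster, less accurate live results.
var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;        // "Transcription"
options.shouldReturnInterimAsrTranscription = true; // "Live Transcription"

// Start listening once the microphone is enabled.
script.vmlModule.onListeningEnabled.add(function () {
    script.vmlModule.startListening(options);
});
```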

Transcription Text

With Transcription Text enabled, you can output the transcription result to a Screen Text object.

In this example, the Screen Text object is under Orthographic Camera->Transcription Example UI->Transcription Text.
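A scripting equivalent sketch: the onListeningUpdate event delivers both live and final transcriptions, which can be written to a Text component. It assumes listening was started as in the previous sketch, and the input name transcriptionText is hypothetical:

```js
//@input Asset.VoiceMLModule vmlModule
//@input Component.Text transcriptionText

script.vmlModule.onListeningUpdate.add(function (eventArgs) {
    if (eventArgs.transcription.trim() === '') {
        return;
    }
    // Show live results immediately; final results overwrite them.
    script.transcriptionText.text = eventArgs.transcription;
    if (eventArgs.isFinalTranscription) {
        print('Final transcription: ' + eventArgs.transcription);
    }
});
```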

Speech Context

We can also add speech contexts to the transcription to boost certain words for specific transcription scenarios. Use this when transcribing rarer words that aren’t picked up well enough by Snap; the higher the boost value, the more likely the word is to appear in the transcription.

With useSpeechContext setting enabled, we can then attach the Speech Contexts object to it.
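If you prefer to set up speech contexts in code, the equivalent call on the listening options is addSpeechContext, which takes a list of phrases and a boost value. A minimal sketch extending the options from the earlier example:

```js
var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;

// Boost rarer phrases so they are more likely to appear in the transcription.
// Phrases should be lowercase, in-vocabulary words; boost is in the 1-10 range.
options.addSpeechContext(['soup', 'cookie', 'starve'], 5);
```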

In the Scene Hierarchy panel, click on the Transcription Example -> Speech Contexts object. In the Inspector panel, we can see that a Speech Context Script component is attached to the object!

Add New Phrase to Speech Context

With Speech Context Script Component, we can add words to the phrases and set a boost value for the phrases. To add a new word, click on the Add Value field and input a new word you want to add.

Note that phrases should consist only of lowercase a-z letters and should be within the vocabulary.

Out of Vocabulary

When an OOV (out of vocabulary) phrase is added to the Speech Context, the On Error Triggered voice event fires and we see the error message in the Logger. Here we take a random word, “az zj”, as an example.

Try this by resetting the Lens in the Preview Panel, then speaking with the Microphone button enabled. We can then see the error message in the Logger.

Add New Speech Context

Or we can add a new Speech Context Script Component with a different boost value.

The range for the boost value is 1-10. We recommend starting with 5 and adjusting as needed (the higher the value, the more likely the word is to appear in the transcription).

Voice Events

We will skip the Use Keyword and Use Command sections for now. Let's take a look at Edit Behaviors. If we enable Edit Behaviors, we can attach Behavior Scripts to different voice events and trigger different visuals. There are five voice events:

  • On Listening Enabled: triggered when the microphone is enabled.
  • On Listening Disabled: triggered when the microphone is disabled.
  • On Listening Triggered: triggered when we change back to listening mode.
  • On Error Triggered: triggered when there is an error in the transcription.
  • On Final Transcription Triggered: triggered when a final (full) transcription result is returned, rather than a partial one.

In the Lens Studio Preview, the microphone button is not simulated on the screen. When we reset the preview, On Listening Enabled is triggered automatically once Speech Recognition is initialized. To learn how to use the microphone button, preview the Lens in Snapchat! Please check the details in Previewing Your Lens.
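These callbacks mirror the events exposed by the VoiceML Module in scripting. A minimal sketch of registering handlers (the vmlModule input name is hypothetical and the handler bodies are placeholders):

```js
//@input Asset.VoiceMLModule vmlModule

var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;

script.vmlModule.onListeningEnabled.add(function () {
    // Microphone enabled: start the listening session.
    script.vmlModule.startListening(options);
});

script.vmlModule.onListeningDisabled.add(function () {
    // Microphone disabled: stop the listening session.
    script.vmlModule.stopListening();
});

script.vmlModule.onListeningError.add(function (eventErrorArgs) {
    // Transcription error, e.g. an out-of-vocabulary speech context phrase.
    print('Error: ' + eventErrorArgs.error + ' - ' + eventErrorArgs.description);
});
```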

Debug Message for Voice Events

If we enable Debug in the Voice Event Callbacks, we can see each Voice Event printed in the Logger when it is triggered.

Send Behavior Triggers with Voice Events

For each Voice Event Callback, we can assign multiple Behavior Scripts. Click the Add Value field to attach new Behavior Scripts.

Here in this example, we are attaching different behavior scripts to change the visuals for the listening icon and screen texts.
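Behavior scripts can also be driven from code through custom triggers. A minimal sketch, assuming a Behavior script somewhere in the scene is set to respond to a Custom Trigger named LISTENING_ENABLED_TRIGGER (a hypothetical name, not one used by the template):

```js
//@input Asset.VoiceMLModule vmlModule

script.vmlModule.onListeningEnabled.add(function () {
    // Fires every Behavior whose trigger is set to
    // Custom Trigger = "LISTENING_ENABLED_TRIGGER" (hypothetical name).
    if (global.behaviorSystem) {
        global.behaviorSystem.sendCustomTrigger('LISTENING_ENABLED_TRIGGER');
    }
});
```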

Behavior

In the Scene Hierarchy panel, we have a Transcription Example->Behaviors object. It has all the behavior scripts used in this example. Click on each object. In the Inspector panel, we can then see the details of the Behavior Script.

Taking the On Listening Enabled Behavior object as an example, we use a Behavior Script to enable the Listening Icon.

Keyword Detection Example

Now that you have learned how to use basic transcription, let’s take a look at the second example, Keyword Detection. Here we can trigger behaviors based on different keywords detected in the transcription!

Let’s disable the first example, enable the Keyword Detection Example, and Reset Preview!

When the voice event On Listening Enabled is successfully called, the Listening Icon pops up. Now try speaking into the microphone. With this example we can say:

  • “hungry/starve/I am hungry” to enable the food objects.
  • “breakfast” / “I had breakfast” / “I went out for breakfast” to trigger a VFX effect with any food texture.
  • “soup”/“dip” to trigger a VFX effect with soup texture.
  • “cookie”/“I eat cookie this morning” to trigger a VFX effect with cookie texture.
  • etc.

Now click on the Keyword Detection Example. In the Inspector panel, we can see that it continues to use the Speech Recognition Script component. Since keyword detection is based on transcription, the Transcription setting is enabled.

We can also enable the Live Transcription and Transcription Text as needed!

With the Live Transcription setting enabled, we get a faster response. Without it, we only get keyword responses from the final, more accurate transcription.

Speech Context

Click on Keyword Detection Example -> SpeechContexts. Notice that we are using a different list of Speech Contexts here, boosting words like “hungry”, “starve”, “soup”, and “cookie”, which will be used in Keyword Detection!

Use Keyword

Now go back to the Speech Recognition Script component. Here we can see that Use Keyword is enabled for keyword detection!

Keywords Parent Object

Notice here we need to attach a scene object as the Keywords Parent Object - Keyword Detection Example -> Keywords Object.

Keyword

Keywords Object has a list of children, each with a Keyword Script component. Click on each child object; in the Inspector panel, we can see the details of the Keyword Script component, which is used to configure basic settings for the keyword.

Notice here any enabled objects under Keywords Object with Keyword Script Component attached will be added as a keyword.

Alias

Let’s take the keyword “Hungry” as an example. We define the keyword “Hungry” and then a list of aliases for it: “hungry”, “starve”, and “I am hungry”. If any of these aliases appear in the transcription results, the keyword “Hungry” is detected. Aliases give us the ability to expand the set of words that should return Hungry, if needed, to serve a specific Lens experience.

When using keyword detection, VoiceML will try to mitigate small transcription errors such as plurals instead of singulars or similar-sounding words (ship/sheep, etc.). You should still use multiple keywords to cover how different users might say the same thing in different ways, like “cinema”, “movies”, “film”.

Note that aliases can be more than one word, e.g. short phrases or sentences.
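In scripting, keywords and their aliases map to an NLP keyword model attached to the listening options. A minimal sketch of the “Hungry” keyword from this example, based on the VoiceML scripting API:

```js
var options = VoiceML.ListeningOptions.create();
options.shouldReturnAsrTranscription = true;

// One keyword group: the keyword "Hungry" and its aliases.
var keywordModel = VoiceML.NlpKeywordModelOptions.create();
keywordModel.addKeywordGroup('Hungry', ['hungry', 'starve', 'I am hungry']);
options.nlpModels = [keywordModel];
```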

Add New Alias

To add a new alias, click on the Add Value field.

Add New Keyword

To add a new keyword, in the Scene Hierarchy panel, duplicate any keyword object under Keywords and modify the settings. Or add a new scene object under Keywords Object and attach Keyword.js to the Script Component.

Send Behavior Triggers with Keyword Detection

In the next section, with Send Triggers enabled, we can then attach multiple Behavior Scripts to the On Keyword Triggered. If a keyword is detected, all the Behaviors attached to the Keyword will be triggered. To attach a new Behavior Script, click on the Add Value field.

Here let’s take the keyword “Hungry” as an example. When “Hungry” is detected, we will trigger Set Object Scale Behavior Script. In the Scene Hierarchy panel, click on the Keyword Detection Example->Behaviors->Set Object Scale Behavior Object.

Behavior and Tween

In the Inspector panel, let’s take a look at the first Script component: we use a Behavior Script to trigger the tween animation TWEEN_SETSCALE_UP on Keyword Detection Example -> Head Binding -> BitmojiFood -> Second Script Component.

Here with TweenColor Script Component, we can scale up the food 3D model.

Error Code for Keyword Responses

There are a few error codes which the NLP models (for either keyword or command detection) might return:

  • #SNAP_ERROR_INDECISIVE: no keyword was detected.
  • #SNAP_ERROR_NONVERBAL: the audio input doesn’t appear to be a human talking.
  • #SNAP_ERROR_SILENCE: the silence lasted too long.
  • Anything else starting with #SNAP_ERROR_: errors that are not currently defined in this document and should be ignored.

Several keywords can be detected in the same utterance, but in this template only one random keyword will trigger a behavior.

Now try resetting the Preview panel and, with the Microphone button enabled, saying words that are not in the keyword list. We can then see the keyword error messages in the Logger.
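A sketch of how keyword responses, including the #SNAP_ERROR_ codes above, can be inspected inside the onListeningUpdate callback. It assumes listening was started with the keyword model from the earlier sketch; the vmlModule input name is hypothetical:

```js
//@input Asset.VoiceMLModule vmlModule

script.vmlModule.onListeningUpdate.add(function (eventArgs) {
    var keywordResponses = eventArgs.getKeywordResponses();
    for (var i = 0; i < keywordResponses.length; i++) {
        var keywords = keywordResponses[i].keywords;
        for (var j = 0; j < keywords.length; j++) {
            var keyword = keywords[j];
            if (keyword.indexOf('#SNAP_ERROR') === 0) {
                // Undefined #SNAP_ERROR_ codes should simply be ignored.
                print('Keyword error: ' + keyword);
                continue;
            }
            print('Keyword detected: ' + keyword);
        }
    }
});
```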

Keyword Screen Text

After we set up all the keywords, let’s get back to the Speech Recognition Script component. With Keyword Text enabled, the detected keyword is written to the Screen Text component under Orthographic Camera -> Keyword Detection UI -> NLP Keyword Text Object.

Voice Event - On Keyword Detected

In the example, when a keyword is detected, you might notice that we also trigger two tween animations:

  1. TWEEN_SETSCREENTRANS_DETECTED - Orthographic Camera -> Keyword Detection UI - Safe Render Region -> NLP Keyword Text Object -> First Script Component.
  2. TWEEN_SETCOLOR_DETECTED - Orthographic Camera -> Keyword Detection UI - Safe Render Region -> NLP Keyword Text Object -> Second Script Component.

We also pause the Listening Icon animation.

Here we are using a new Voice Event - On Keyword Detected. Let’s go back to the main Speech Recognition Script Component under Keyword Detection Example Object.

Notice that with Use Keyword and Edit Voice Event Callback enabled, a new field, On Keyword Detected, is added to the Voice Event Callbacks. As with the other Voice Events, we can add a list of Behavior Scripts that will be triggered when any keyword is detected.

Previewing Your Lens

You’re now ready to preview your Lens! To preview your Lens in Snapchat, follow the Pairing to Snapchat guide.

Once the Lens is pushed to Snapchat, you will see the hint: TAP TO TURN ON VOICE CONTROL. Tap the screen to start VoiceML and trigger OnListeningEnabled. Tap again to stop VoiceML and trigger OnListeningDisabled.
