
Optical Character Recognition

Through Snap’s partnership with OpenCV, we are bringing you a training notebook that allows you to train ML models that can later be brought into Lens Studio and used to recognize text in the camera view. This unlocks new, unconventional triggers for Lens experiences and empowers Lens Developers to build engaging utility Lenses.

Text Recognition Overview

Optical Character Recognition (OCR) is a critical component in Computer Vision and Artificial Intelligence. Its applications span a variety of sectors, including document digitization, automated data entry, and license plate recognition. Unlike typical image processing tasks, OCR identifies, extracts, and digitizes written or printed characters from images or documents. This technology bridges the gap between the physical text world and the digital realm, allowing computers to understand and utilize written text within images.

Training Code

This inference notebook shows you how to convert the powerful and lightweight PaddleOCR model and deploy it using SnapML.

We've also prepared a separate notebook with instructions on launching the PaddleOCR training scripts.

We did not train the model on our side; the notebook reproduces the authors' instructions, so we can't guarantee that your training results will match the authors' pre-trained model that we used.

You may run the provided notebooks in Google Colab. Select Upload Notebook in the top menu and upload the notebook files, or provide a GitHub link.

The snapml-templates repo contains these and several other training notebooks.

Template Walkthrough

Optical Character Recognition Script

The main functionality is encapsulated in the OpticalCharacterRecognition.js script, which takes care of building and configuring the ML models and provides a public API for other scripts.

Right-click the OCR Controller in the Objects panel and export it to bring it into another project and start building your own experience. You can refer to the API section to see how to use it.

Let’s take a look at the overall process of text recognition that happens under the hood, which will help you understand script inputs and API functions.

Character Detection Overview

First, the Detector model is run on the input image to classify each of its pixels and determine whether it belongs to a text block. This produces something similar to a segmentation mask, which is then processed in JavaScript to calculate the bounding boxes of the lines of text. This process is relatively cheap and allows us to detect text on every frame in real time.
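As a rough illustration of that post-processing step, here is a greatly simplified sketch, not the template's actual code: it thresholds a detector mask stored as a plain Float32Array and computes a single bounding box around all "text" pixels, whereas the real script groups pixels into one box per line of text.

// Simplified sketch only (assumption: the mask is a row-major Float32Array
// of size width * height with values in [0, 1]). The template's real
// post-processing produces one box per detected line of text instead.
function maskToBoundingBox(mask, width, height, threshold) {
    var minX = width, minY = height, maxX = -1, maxY = -1;
    for (var y = 0; y < height; y++) {
        for (var x = 0; x < width; x++) {
            if (mask[y * width + x] > threshold) {
                if (x < minX) minX = x;
                if (x > maxX) maxX = x;
                if (y < minY) minY = y;
                if (y > maxY) maxY = y;
            }
        }
    }
    // Return null when no pixel passed the threshold
    return maxX < 0 ? null : { minX: minX, minY: minY, maxX: maxX, maxY: maxY };
}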

In the next step, the input texture is cropped around each of the detected rectangles and passed to the second machine learning model, which classifies the characters and outputs them as a list of strings.

This process is a bit more costly, since the ML model runs for every cropped rectangle.

You can still experiment with running it on every frame, but consider limiting the number of detections.
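For example, a minimal sketch of that idea (the ocrController input and the maxDetections cap are assumptions for this example) could run detection every frame but only pass a few boxes on to recognition:

// Sketch only: run detection every frame, recognize a limited number of boxes.
// The ocrController input and the maxDetections cap are assumptions for this example.
// @input Component.ScriptComponent ocrController
// @input int maxDetections = 3

script.createEvent('UpdateEvent').bind(function () {
    if (!script.ocrController.isInitialized()) {
        return;
    }
    var rects = script.ocrController.getDetectionBoxes();
    // Cap how many rectangles are sent to the (more expensive) recognition model
    var limited = rects.slice(0, script.maxDetections);
    var lines = script.ocrController.getDetectedText(limited);
    print('Recognized: ' + lines.join(' | '));
});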

Script Inputs

  • ML Model: an ML asset exported from the training notebook.
  • Input Texture: the input texture processed by the detector and classifier.
  • Max Detections: the maximum number of text blocks that can be detected.
  • Confidence: the probability threshold for a detection.
  • Extend Width: a multiplier that extends the rectangle on the long side.
  • Extend Height: a multiplier that extends the rectangle on the short side.

The detection model returns a tight bounding box around the text line, which may lead to incorrect recognition results. Snapchat uses the Extend Width and Extend Height input parameters to extend the rectangle on the long and short sides respectively (see the sketch after this list).

  • Min Side: used to filter out bounding boxes that are too small, based on the length of the short side. The units of this parameter are based on the detector output size, which in our case is 640 by 640 px.
  • ML Model: the machine learning model for character recognition.
  • Crop Texture: a screen crop texture used to crop the piece of the input texture containing a text line.
  • Confidence: the probability threshold for character recognition.
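To make the role of Extend Width and Extend Height concrete, here is a small illustrative sketch; the rectangle format with left/right/bottom/top fields is an assumption for this example, not the template's actual data structure. It widens a tight box around its center:

// Illustration only: extend a tight bounding box around its center.
// Assumes a plain rectangle object; the template's internal format may differ.
function extendRect(rect, extendWidth, extendHeight) {
    var centerX = (rect.left + rect.right) / 2;
    var centerY = (rect.bottom + rect.top) / 2;
    // For a line of text, the long side is typically its width
    var halfLong = ((rect.right - rect.left) / 2) * extendWidth;
    var halfShort = ((rect.top - rect.bottom) / 2) * extendHeight;
    return {
        left: centerX - halfLong,
        right: centerX + halfLong,
        bottom: centerY - halfShort,
        top: centerY + halfShort
    };
}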

Script API

This script is initialized on awake and provides a set of API functions, such as:

// Runs the detector ML model immediately
// Returns an array of detected boxes
script.getDetectionBoxes();

// Gets the text on the input texture in the provided rectangles
// Returns a list of text labels
script.getDetectedText(rects);

// Sets the input texture for processing
script.setInputTexture(texture);

// Gets the input texture that is being processed
script.getInputTexture();

// Whether the ML components are ready to process data
script.isInitialized();

Before diving into all the buttons and panels, you should take a look at the simplest example of API usage.

  1. Find the SimpleExample.js script in the Resources panel and drag it to the Objects panel. This will create a Scene Object with the mentioned script attached:
// Simple Example.js
// Version: 1.0.0
// Event: On Awake
// Description: Simplest example of the OCR controller api usage

// @input Component.ScriptComponent ocrController {"label" : "OCR Controller"}
// @ui {"widget":"separator"}
// @input Asset.Texture inputTexture

script.run = function () {
    if (!script.ocrController.isInitialized()) {
        print('OCR controller is not initialized yet');
        return;
    }
    // set input texture to process
    // for example device texture or an image picker texture
    script.ocrController.setInputTexture(script.inputTexture);

    // get rectangles of detected lines of text
    let rects = script.ocrController.getDetectionBoxes();
    // get text
    let lines = script.ocrController.getDetectedText(rects);
    // print results
    for (var i = 0; i < rects.length; i++) {
        Studio.log(
            i + '. Text: "' + lines[i] + '"' + ', Detected Rectangle ' + rects[i]
        );
    }
};

script.createEvent('TapEvent').bind(script.run);
  2. Set the script inputs with a reference to the Optical Character Recognition script and an input texture of your choice.

  3. If you tap on the screen, you should see output like this:

Customizing Template

By this point, you should know enough to start building your own OCR Lens. In the next few sections, you will see some of the best practices for building OCR-powered Lenses.

Suggested UX Flow

Using the API described in the previous section and some of the UI Widget Custom Components, our team created a more robust user flow for OCR. You can take a look at the main steps in the experience:

The expected user flow is:

  1. We calculate detection boxes on every frame on the input texture.

    1. The user has the option to select an image from the Media Picker.
  2. When the user taps the Recognize button, we call the text recognition API and display the detected text.

  3. The user can go back, or tap the Edit button to open the scroll view with the aggregated, editable text for easier reading or manipulation.

MainController.js

The MainController.js script takes care of the main UI flow of the Lens and calls the API functions of the OpticalCharacterRecognition controller described above.

There are a number of inputs located under the Edit Connections tab, such as:

  • Detection Screen: a scene object that contains the UI shown when we run the detection model.
  • Recognition Screen: a scene object that contains the UI shown when we run the recognition model.
  • Edit Screen: a scene object that contains the UI shown to display the resulting text. It contains a Scroll View component to allow scrolling the text and supports the Native Keyboard.

This script also provides a set of API functions that are called when you press the UI buttons included in each of the UI screens mentioned above.

script.showDetectionUI = () => setUIState(State.Detection);

script.showRecognitionUI = () => setUIState(State.Recognition);

script.showEditUI = () => setUIState(State.Edit);

script.detect = detect;

script.recognize = recognize;

For example, the Recognize Button located under the Recognition Screen scene object calls two API functions of the Main Controller script:
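As a rough, hypothetical sketch of how such a button hookup might look (the exact pairing of calls is an assumption, not confirmed by the template), a small helper script could forward the button press like this:

// Hypothetical helper, not part of the template: forwards a Recognize
// button press to the Main Controller API described above.
// @input Component.ScriptComponent mainController

script.onRecognizePressed = function () {
    // Assumed pairing: switch to the recognition UI, then run recognition
    script.mainController.showRecognitionUI();
    script.mainController.recognize();
};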

It also controls the graphical representation of the detections by instantiating and placing display objects over the detected text rectangles.

  • OCR Controller: a reference to the OpticalCharacterRecognition script component.
  • Display Object: the object to instantiate and place over each detected text rectangle.
  • Amount: the maximum number of objects to display.

This template uses an Overlay Camera and an Overlay render target to render quality UI elements in the Lens that do not get captured in the final Snap.

Detection Display

You can modify the example representation of detections.

We use this hierarchy setup to make sure the input image extends the Container scene object (using the Extents Target property of a screen transform), so detections are placed precisely even if the aspect ratio of the input image is different (for example, a custom texture or an image from the image picker).

Modify the children of the Detection Display scene object to customize the look, for example the material on the Image, the screen transform offsets, or the Text component settings.

DetectionDisplayController.js represents a single detection display. This script is coupled with MainController and provides several API functions used by it, such as:

script.show = show;
script.hide = hide;
script.setRectangle = setRectangle;
script.setText = setText;
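For example, here is a minimal sketch of how a controller might drive these functions, assuming a pre-instantiated pool of DetectionDisplayController components and the rects/lines arrays returned by the OCR controller API shown earlier:

// Sketch only: drive a pool of detection displays from detection results.
// `displays`, `rects`, and `lines` are assumed to be provided by the controller scripts.
function updateDisplays(displays, rects, lines) {
    for (var i = 0; i < displays.length; i++) {
        if (i < rects.length) {
            displays[i].setRectangle(rects[i]); // place over the detected text
            displays[i].setText(lines[i] || '');
            displays[i].show();
        } else {
            displays[i].hide(); // hide unused displays
        }
    }
}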

While keeping these functions exposed to the script API, you may modify this script to your liking and create entirely different visuals and behaviors for your detections.

Image Picker Functionality

By default, the template processes the Device Camera Texture, but there is an Image Picker button on the Recognition screen that allows you to override the input texture and grab an image from the camera roll to detect text on it.

You can enable or disable the Image Picker Controller scene object, since all of its functionality is encapsulated under it.

The ImagePickerButtonHelper.js script uses the OCR controller's setInputTexture API function to override the input to the ML models.

OCR Controller: a reference to the OpticalCharacterRecognition script.

Input Texture Image: this is an image used to display the input texture. We need this to display the original texture under the detections display, as well as to make sure the parent screen transform has the correct aspect ratio. The Input Texture image in this case makes sure its child screen transform (called Container) is exactly the same size as the image texture, based on its aspect and fill mode (by utilizing the Extents Target property of the image).

The Extents Target property is pretty useful. Check out the Expanding Text Input asset in the Asset Library for more use cases.

  • Picker Button: a UI Button that switches the OCR controller to process the media picker texture.
  • Media Picker: the Media Picker Texture.
  • Camera Feed Button: a UI Button that switches the OCR controller to process the device camera texture.

Similarly to MainController, this script also provides API functions that are called by UI buttons:

script.setDeviceTextureInput = setDeviceTextureInput;
script.setImagePickerInput = setImagePickerInput;
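As a hedged sketch of what these two functions might do under the hood (the texture input names below are assumptions for illustration), they could simply swap the texture passed to the OCR controller:

// Sketch only: swap the OCR controller's input texture between the device
// camera and the media picker. The input names are assumptions for this example.
// @input Component.ScriptComponent ocrController
// @input Asset.Texture deviceCameraTexture
// @input Asset.Texture mediaPickerTexture

function setDeviceTextureInput() {
    script.ocrController.setInputTexture(script.deviceCameraTexture);
}

function setImagePickerInput() {
    script.ocrController.setInputTexture(script.mediaPickerTexture);
}

script.setDeviceTextureInput = setDeviceTextureInput;
script.setImagePickerInput = setImagePickerInput;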

Expectations

  • Recognition works for English text only.
  • Works well on full frontal views with very little obstruction of the text, where the text is placed horizontally. License plates are a good example; front-facing billboards and book covers with horizontal text work well too.
  • Doesn’t work well when the text is slanted (for example, road signs that may be angled, or a book cover placed on a table at an angle).
  • Text that is very exaggerated or highly stylized also doesn’t work very well. If there are ornaments near the text, the model tends to capture those as well and tries to recognize them as special characters.
  • Sometimes the model is not able to recognize spaces, leading to confusing outputs.
  • Big chunks of small text are also hit or miss.

This template cannot be used with Remote API at the moment because of privacy restrictions.

Please refer to the guides below for additional information:
