
Recognize Text in Images with Firebase ML on iOS

You can use Firebase ML to recognize text in images. Firebase ML has both a general-purpose API suitable for recognizing text in images, such as the text of a street sign, and an API optimized for recognizing the text of documents.

Before you begin

  1. If you have not already added Firebase to your app, do so by following the steps in the getting started guide.
  2. Include Firebase ML in your Podfile:
    pod 'Firebase/MLVision'
    
    After you install or update your project's Pods, be sure to open your Xcode project using its .xcworkspace.
  3. In your app, import Firebase:

    Swift

    import Firebase

    Objective-C

    @import Firebase;
  4. If you have not already enabled Cloud-based APIs for your project, do so now:

    1. Open the Firebase ML APIs page of the Firebase console.
    2. If you have not already upgraded your project to a Blaze plan, click Upgrade to do so. (You will be prompted to upgrade only if your project isn't on the Blaze plan.)

      Only Blaze-level projects can use Cloud-based APIs.

    3. If Cloud-based APIs aren't already enabled, click Enable Cloud-based APIs.
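
If you followed the getting started guide, Firebase is already initialized once at app launch; the following minimal Swift sketch is only a reminder of what that typically looks like in a UIKit AppDelegate (your project may do this elsewhere):

import UIKit
import Firebase

@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {
    var window: UIWindow?

    func application(_ application: UIApplication,
                     didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        // Configure Firebase before calling any Firebase ML APIs.
        FirebaseApp.configure()
        return true
    }
}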

Now you are ready to start recognizing text in images.

Input image guidelines

  • For Firebase ML to accurately recognize text, input images must contain text that is represented by sufficient pixel data. Ideally, for Latin text, each character should be at least 16x16 pixels. For Chinese, Japanese, and Korean text, each character should be 24x24 pixels. For all languages, there is generally no accuracy benefit for characters to be larger than 24x24 pixels.

    So, for example, a 640x480 image might work well to scan a business card that occupies the full width of the image. To scan a document printed on letter-sized paper, a 720x1280 pixel image might be required (a rough size pre-check based on these guidelines is sketched after this list).

  • Poor image focus can hurt text recognition accuracy. If you aren't getting acceptable results, try asking the user to recapture the image.
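
As a rough illustration of these guidelines, the following Swift sketch performs a size pre-check before running recognition. The 1280-pixel threshold is an assumption drawn from the letter-sized-document example above, not a documented requirement; tune it for your content.

import UIKit

// Rough pre-check: returns true if the image's longer side has at least
// `minimumLongerSidePixels` pixels. If this check fails, prefer asking the
// user to recapture at a higher resolution rather than upscaling, since
// upscaling does not add the pixel data the recognizer needs.
func isLikelyLargeEnoughForTextRecognition(_ image: UIImage,
                                           minimumLongerSidePixels: CGFloat = 1280) -> Bool {
    let longerSidePixels = max(image.size.width, image.size.height) * image.scale
    return longerSidePixels >= minimumLongerSidePixels
}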


Recognize text in images

To recognize text in an image, run the text recognizer as described below.

1. Run the text recognizer

Pass the image as a UIImage or a CMSampleBufferRef to the VisionTextRecognizer's process(_:completion:) method:

  1. Get an instance of VisionTextRecognizer by calling cloudTextRecognizer:

    Swift

    let vision = Vision.vision()
    let textRecognizer = vision.cloudTextRecognizer()
    
    // Or, to provide language hints to assist with language detection:
    // See https://cloud.google.com/vision/docs/languages for supported languages
    let options = VisionCloudTextRecognizerOptions()
    options.languageHints = ["en", "hi"]
    let textRecognizer = vision.cloudTextRecognizer(options: options)
    

    Objective-C

    FIRVision *vision = [FIRVision vision];
    FIRVisionTextRecognizer *textRecognizer = [vision cloudTextRecognizer];
    
    // Or, to provide language hints to assist with language detection:
    // See https://cloud.google.com/vision/docs/languages for supported languages
    FIRVisionCloudTextRecognizerOptions *options =
            [[FIRVisionCloudTextRecognizerOptions alloc] init];
    options.languageHints = @[@"en", @"hi"];
    FIRVisionTextRecognizer *textRecognizer = [vision cloudTextRecognizerWithOptions:options];
    
  2. Create a VisionImage object using a UIImage or a CMSampleBufferRef.

    To use a UIImage:

    1. If necessary, rotate the image so that its imageOrientation property is .up (a Swift sketch for doing this follows these steps).
    2. Create a VisionImage object using the correctly-rotated UIImage. Do not specify any rotation metadata—the default value, .topLeft, must be used.

      Swift

      let image = VisionImage(image: uiImage)

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithImage:uiImage];

    To use a CMSampleBufferRef:

    1. Create a VisionImageMetadata object that specifies the orientation of the image data contained in the CMSampleBufferRef buffer.

      To get the image orientation:

      Swift

      func imageOrientation(
          deviceOrientation: UIDeviceOrientation,
          cameraPosition: AVCaptureDevice.Position
          ) -> VisionDetectorImageOrientation {
          switch deviceOrientation {
          case .portrait:
              return cameraPosition == .front ? .leftTop : .rightTop
          case .landscapeLeft:
              return cameraPosition == .front ? .bottomLeft : .topLeft
          case .portraitUpsideDown:
              return cameraPosition == .front ? .rightBottom : .leftBottom
          case .landscapeRight:
              return cameraPosition == .front ? .topRight : .bottomRight
          case .faceDown, .faceUp, .unknown:
              return .leftTop
          }
      }

      Objective-C

      - (FIRVisionDetectorImageOrientation)
          imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                                 cameraPosition:(AVCaptureDevicePosition)cameraPosition {
        switch (deviceOrientation) {
          case UIDeviceOrientationPortrait:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationLeftTop;
            } else {
              return FIRVisionDetectorImageOrientationRightTop;
            }
          case UIDeviceOrientationLandscapeLeft:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationBottomLeft;
            } else {
              return FIRVisionDetectorImageOrientationTopLeft;
            }
          case UIDeviceOrientationPortraitUpsideDown:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationRightBottom;
            } else {
              return FIRVisionDetectorImageOrientationLeftBottom;
            }
          case UIDeviceOrientationLandscapeRight:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationTopRight;
            } else {
              return FIRVisionDetectorImageOrientationBottomRight;
            }
          default:
            return FIRVisionDetectorImageOrientationTopLeft;
        }
      }

      Then, create the metadata object:

      Swift

      let cameraPosition = AVCaptureDevice.Position.back  // Set to the capture device you used.
      let metadata = VisionImageMetadata()
      metadata.orientation = imageOrientation(
          deviceOrientation: UIDevice.current.orientation,
          cameraPosition: cameraPosition
      )

      Objective-C

      FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
      AVCaptureDevicePosition cameraPosition =
          AVCaptureDevicePositionBack;  // Set to the capture device you used.
      metadata.orientation =
          [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                       cameraPosition:cameraPosition];
    2. Create a VisionImage object using the CMSampleBufferRef object and the rotation metadata:

      Swift

      let image = VisionImage(buffer: sampleBuffer)
      image.metadata = metadata

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:sampleBuffer];
      image.metadata = metadata;
  3. Then, pass the image to the process(_:completion:) method:

    Swift

    textRecognizer.process(image) { result, error in
      guard error == nil, let result = result else {
        // ...
        return
      }
    
      // Recognized text
    }
    

    Objective-C

    [textRecognizer processImage:image
                      completion:^(FIRVisionText *_Nullable result,
                                   NSError *_Nullable error) {
      if (error != nil || result == nil) {
        // ...
        return;
      }
    
      // Recognized text
    }];
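
If your source UIImage might not already have an .up orientation (for example, a photo taken with the camera or picked from the photo library), you can re-render it upright before creating the VisionImage, as mentioned in step 2 above. A minimal Swift sketch; re-drawing the image applies its orientation transform to the pixel data:

import UIKit

// Returns a copy of the image rendered with .up orientation, which is what
// VisionImage expects when no rotation metadata is specified.
func uprightImage(from image: UIImage) -> UIImage {
    guard image.imageOrientation != .up else { return image }
    let renderer = UIGraphicsImageRenderer(size: image.size)
    return renderer.image { _ in
        image.draw(in: CGRect(origin: .zero, size: image.size))
    }
}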
    

2. Extract text from blocks of recognized text

If the text recognition operation succeeds, it will return a VisionText object. A VisionText object contains the full text recognized in the image and zero or more VisionTextBlock objects.

Each VisionTextBlock represents a rectangular block of text and contains zero or more VisionTextLine objects. Each VisionTextLine object contains zero or more VisionTextElement objects, which represent words and word-like entities (dates, numbers, and so on).

For each VisionTextBlock, VisionTextLine, and VisionTextElement object, you can get the text recognized in the region and the bounding coordinates of the region.

For example:

Swift

let resultText = result.text
for block in result.blocks {
    let blockText = block.text
    let blockConfidence = block.confidence
    let blockLanguages = block.recognizedLanguages
    let blockCornerPoints = block.cornerPoints
    let blockFrame = block.frame
    for line in block.lines {
        let lineText = line.text
        let lineConfidence = line.confidence
        let lineLanguages = line.recognizedLanguages
        let lineCornerPoints = line.cornerPoints
        let lineFrame = line.frame
        for element in line.elements {
            let elementText = element.text
            let elementConfidence = element.confidence
            let elementLanguages = element.recognizedLanguages
            let elementCornerPoints = element.cornerPoints
            let elementFrame = element.frame
        }
    }
}

Objective-C

NSString *resultText = result.text;
for (FIRVisionTextBlock *block in result.blocks) {
  NSString *blockText = block.text;
  NSNumber *blockConfidence = block.confidence;
  NSArray<FIRVisionTextRecognizedLanguage *> *blockLanguages = block.recognizedLanguages;
  NSArray<NSValue *> *blockCornerPoints = block.cornerPoints;
  CGRect blockFrame = block.frame;
  for (FIRVisionTextLine *line in block.lines) {
    NSString *lineText = line.text;
    NSNumber *lineConfidence = line.confidence;
    NSArray<FIRVisionTextRecognizedLanguage *> *lineLanguages = line.recognizedLanguages;
    NSArray<NSValue *> *lineCornerPoints = line.cornerPoints;
    CGRect lineFrame = line.frame;
    for (FIRVisionTextElement *element in line.elements) {
      NSString *elementText = element.text;
      NSNumber *elementConfidence = element.confidence;
      NSArray<FIRVisionTextRecognizedLanguage *> *elementLanguages = element.recognizedLanguages;
      NSArray<NSValue *> *elementCornerPoints = element.cornerPoints;
      CGRect elementFrame = element.frame;
    }
  }
}
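
For example, to visualize the detected regions you could stroke each block's frame onto the source image. The following is a minimal Swift sketch; it assumes the VisionImage was created from a UIImage with .up orientation, so the returned frames are in that image's coordinate space:

import UIKit
import Firebase

// Draws each recognized block's frame onto a copy of the source image.
func drawBlockFrames(on image: UIImage, result: VisionText) -> UIImage {
    let renderer = UIGraphicsImageRenderer(size: image.size)
    return renderer.image { context in
        image.draw(at: .zero)
        context.cgContext.setStrokeColor(UIColor.red.cgColor)
        context.cgContext.setLineWidth(2)
        for block in result.blocks {
            context.cgContext.stroke(block.frame)
        }
    }
}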



Recognize text in images of documents

To recognize the text of a document, configure and run the document text recognizer as described below.

The document text recognition API, described below, provides an interface that is intended to be more convenient for working with images of documents. However, if you prefer the interface provided by the sparse text API, you can use it instead to scan documents by configuring the cloud text recognizer to use the dense text model.
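
For example, a minimal Swift sketch of that configuration (it assumes your SDK version exposes the modelType option on VisionCloudTextRecognizerOptions):

// Use the dense text model, which is tuned for images of documents, with the
// sparse text recognition API.
let options = VisionCloudTextRecognizerOptions()
options.modelType = .dense
let textRecognizer = Vision.vision().cloudTextRecognizer(options: options)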

To use the document text recognition API:

1. Run the text recognizer

Pass the image as a UIImage or a CMSampleBufferRef to the VisionDocumentTextRecognizer's process(_:completion:) method:

  1. Get an instance of VisionDocumentTextRecognizer by calling cloudDocumentTextRecognizer:

    Swift

    let vision = Vision.vision()
    let textRecognizer = vision.cloudDocumentTextRecognizer()
    
    // Or, to provide language hints to assist with language detection:
    // See https://cloud.google.com/vision/docs/languages for supported languages
    let options = VisionCloudDocumentTextRecognizerOptions()
    options.languageHints = ["en", "hi"]
    let textRecognizer = vision.cloudDocumentTextRecognizer(options: options)
    

    Objective-C

    FIRVision *vision = [FIRVision vision];
    FIRVisionDocumentTextRecognizer *textRecognizer = [vision cloudDocumentTextRecognizer];
    
    // Or, to provide language hints to assist with language detection:
    // See https://cloud.google.com/vision/docs/languages for supported languages
    FIRVisionCloudDocumentTextRecognizerOptions *options =
            [[FIRVisionCloudDocumentTextRecognizerOptions alloc] init];
    options.languageHints = @[@"en", @"hi"];
    FIRVisionDocumentTextRecognizer *textRecognizer = [vision cloudDocumentTextRecognizerWithOptions:options];
    
  2. Create a VisionImage object using a UIImage or a CMSampleBufferRef.

    To use a UIImage:

    1. If necessary, rotate the image so that its imageOrientation property is .up.
    2. Create a VisionImage object using the correctly-rotated UIImage. Do not specify any rotation metadata—the default value, .topLeft, must be used.

      Swift

      let image = VisionImage(image: uiImage)

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithImage:uiImage];

    To use a CMSampleBufferRef:

    1. Create a VisionImageMetadata object that specifies the orientation of the image data contained in the CMSampleBufferRef buffer.

      To get the image orientation:

      Swift

      func imageOrientation(
          deviceOrientation: UIDeviceOrientation,
          cameraPosition: AVCaptureDevice.Position
          ) -> VisionDetectorImageOrientation {
          switch deviceOrientation {
          case .portrait:
              return cameraPosition == .front ? .leftTop : .rightTop
          case .landscapeLeft:
              return cameraPosition == .front ? .bottomLeft : .topLeft
          case .portraitUpsideDown:
              return cameraPosition == .front ? .rightBottom : .leftBottom
          case .landscapeRight:
              return cameraPosition == .front ? .topRight : .bottomRight
          case .faceDown, .faceUp, .unknown:
              return .leftTop
          }
      }

      Objective-C

      - (FIRVisionDetectorImageOrientation)
          imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                                 cameraPosition:(AVCaptureDevicePosition)cameraPosition {
        switch (deviceOrientation) {
          case UIDeviceOrientationPortrait:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationLeftTop;
            } else {
              return FIRVisionDetectorImageOrientationRightTop;
            }
          case UIDeviceOrientationLandscapeLeft:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationBottomLeft;
            } else {
              return FIRVisionDetectorImageOrientationTopLeft;
            }
          case UIDeviceOrientationPortraitUpsideDown:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationRightBottom;
            } else {
              return FIRVisionDetectorImageOrientationLeftBottom;
            }
          case UIDeviceOrientationLandscapeRight:
            if (cameraPosition == AVCaptureDevicePositionFront) {
              return FIRVisionDetectorImageOrientationTopRight;
            } else {
              return FIRVisionDetectorImageOrientationBottomRight;
            }
          default:
            return FIRVisionDetectorImageOrientationTopLeft;
        }
      }

      Then, create the metadata object:

      Swift

      let cameraPosition = AVCaptureDevice.Position.back  // Set to the capture device you used.
      let metadata = VisionImageMetadata()
      metadata.orientation = imageOrientation(
          deviceOrientation: UIDevice.current.orientation,
          cameraPosition: cameraPosition
      )

      Objective-C

      FIRVisionImageMetadata *metadata = [[FIRVisionImageMetadata alloc] init];
      AVCaptureDevicePosition cameraPosition =
          AVCaptureDevicePositionBack;  // Set to the capture device you used.
      metadata.orientation =
          [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                       cameraPosition:cameraPosition];
    2. Create a VisionImage object using the CMSampleBufferRef object and the rotation metadata:

      Swift

      let image = VisionImage(buffer: sampleBuffer)
      image.metadata = metadata

      Objective-C

      FIRVisionImage *image = [[FIRVisionImage alloc] initWithBuffer:sampleBuffer];
      image.metadata = metadata;
  3. Then, pass the image to the process(_:completion:) method:

    Swift

    textRecognizer.process(image) { result, error in
      guard error == nil, let result = result else {
        // ...
        return
      }
    
      // Recognized text
    }
    

    Objective-C

    [textRecognizer processImage:image
                      completion:^(FIRVisionDocumentText *_Nullable result,
                                   NSError *_Nullable error) {
      if (error != nil || result == nil) {
        // ...
        return;
      }
    
      // Recognized text
    }];
    

2. Extract text from blocks of recognized text

If the text recognition operation succeeds, it will return a VisionDocumentText object. A VisionDocumentText object contains the full text recognized in the image and a hierarchy of objects that reflects the structure of the recognized document: each VisionDocumentTextBlock contains VisionDocumentTextParagraph objects, each paragraph contains VisionDocumentTextWord objects, and each word contains VisionDocumentTextSymbol objects.

For each VisionDocumentTextBlock, VisionDocumentTextParagraph, VisionDocumentTextWord, and VisionDocumentTextSymbol object, you can get the text recognized in the region and the bounding coordinates of the region.

For example:

Swift

let resultText = result.text
for block in result.blocks {
    let blockText = block.text
    let blockConfidence = block.confidence
    let blockRecognizedLanguages = block.recognizedLanguages
    let blockBreak = block.recognizedBreak
    let blockCornerPoints = block.cornerPoints
    let blockFrame = block.frame
    for paragraph in block.paragraphs {
        let paragraphText = paragraph.text
        let paragraphConfidence = paragraph.confidence
        let paragraphRecognizedLanguages = paragraph.recognizedLanguages
        let paragraphBreak = paragraph.recognizedBreak
        let paragraphCornerPoints = paragraph.cornerPoints
        let paragraphFrame = paragraph.frame
        for word in paragraph.words {
            let wordText = word.text
            let wordConfidence = word.confidence
            let wordRecognizedLanguages = word.recognizedLanguages
            let wordBreak = word.recognizedBreak
            let wordCornerPoints = word.cornerPoints
            let wordFrame = word.frame
            for symbol in word.symbols {
                let symbolText = symbol.text
                let symbolConfidence = symbol.confidence
                let symbolRecognizedLanguages = symbol.recognizedLanguages
                let symbolBreak = symbol.recognizedBreak
                let symbolCornerPoints = symbol.cornerPoints
                let symbolFrame = symbol.frame
            }
        }
    }
}

Objective-C

NSString *resultText = result.text;
for (FIRVisionDocumentTextBlock *block in result.blocks) {
  NSString *blockText = block.text;
  NSNumber *blockConfidence = block.confidence;
  NSArray<FIRVisionTextRecognizedLanguage *> *blockRecognizedLanguages = block.recognizedLanguages;
  FIRVisionTextRecognizedBreak *blockBreak = block.recognizedBreak;
  CGRect blockFrame = block.frame;
  for (FIRVisionDocumentTextParagraph *paragraph in block.paragraphs) {
    NSString *paragraphText = paragraph.text;
    NSNumber *paragraphConfidence = paragraph.confidence;
    NSArray<FIRVisionTextRecognizedLanguage *> *paragraphRecognizedLanguages = paragraph.recognizedLanguages;
    FIRVisionTextRecognizedBreak *paragraphBreak = paragraph.recognizedBreak;
    CGRect paragraphFrame = paragraph.frame;
    for (FIRVisionDocumentTextWord *word in paragraph.words) {
      NSString *wordText = word.text;
      NSNumber *wordConfidence = word.confidence;
      NSArray<FIRVisionTextRecognizedLanguage *> *wordRecognizedLanguages = word.recognizedLanguages;
      FIRVisionTextRecognizedBreak *wordBreak = word.recognizedBreak;
      CGRect wordFrame = word.frame;
      for (FIRVisionDocumentTextSymbol *symbol in word.symbols) {
        NSString *symbolText = symbol.text;
        NSNumber *symbolConfidence = symbol.confidence;
        NSArray<FIRVisionTextRecognizedLanguage *> *symbolRecognizedLanguages = symbol.recognizedLanguages;
        FIRVisionTextRecognizedBreak *symbolBreak = symbol.recognizedBreak;
        CGRect symbolFrame = symbol.frame;
      }
    }
  }
}
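
If you need the text in a particular shape rather than the full result.text string, you can walk the hierarchy yourself. The following minimal Swift sketch flattens a VisionDocumentText into one line per paragraph; it ignores recognizedBreak, so hyphenation and exact spacing are only approximated:

import Firebase

// Flattens the document hierarchy into plain text, joining words with spaces
// and emitting one line per paragraph.
func paragraphLines(from document: VisionDocumentText) -> [String] {
    var lines: [String] = []
    for block in document.blocks {
        for paragraph in block.paragraphs {
            let words = paragraph.words.map { $0.text }
            lines.append(words.joined(separator: " "))
        }
    }
    return lines
}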
