Apple Vision framework – Text extraction from image

Introduction

Apple's Vision framework gives iOS and macOS apps a built-in way to perform OCR, meaning text recognition from images, without sending every image to a remote service. For most modern Apple-platform apps, the standard entry point is VNRecognizeTextRequest.

The workflow is straightforward once the pieces are clear: provide an image, create a recognition request, run it through a request handler, and read the recognized strings from the resulting observations. The real work is in handling orientation, tuning accuracy, and processing results in a way that fits your app.

The Core OCR Flow in Vision

Vision text extraction usually follows these steps:

obtain an image as a CGImage, CIImage, or pixel buffer
create a VNRecognizeTextRequest
configure recognition settings
execute the request with VNImageRequestHandler
read VNRecognizedTextObservation results

Here is a minimal Swift example that extracts lines of text from a UIImage:

swift

import UIKit
import Vision

enum OCRError: Error {
    case missingCGImage
}

// Returns one string per recognized line, using the top candidate of each observation.
func recognizeText(from image: UIImage) async throws -> [String] {
    // Vision works on the underlying CGImage, which carries no orientation of its own.
    guard let cgImage = image.cgImage else {
        throw OCRError.missingCGImage
    }

    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US"]
    request.usesLanguageCorrection = true

    // Pass the UIImage orientation explicitly so Vision reads the text upright.
    let handler = VNImageRequestHandler(
        cgImage: cgImage,
        orientation: CGImagePropertyOrientation(image.imageOrientation),
        options: [:]
    )

    // perform(_:) is synchronous and blocks until recognition finishes.
    try handler.perform([request])

    let observations = request.results ?? []
    return observations.compactMap { observation in
        observation.topCandidates(1).first?.string
    }
}

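The example above relies on an initializer that CGImagePropertyOrientation does not provide out of the box. To bridge UIImage.Orientation into CGImagePropertyOrientation, the orientation type Vision expects, add a small helper: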

swift

import ImageIO
import UIKit

extension CGImagePropertyOrientation {
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up: self = .up
        case .down: self = .down
        case .left: self = .left
        case .right: self = .right
        case .upMirrored: self = .upMirrored
        case .downMirrored: self = .downMirrored
        case .leftMirrored: self = .leftMirrored
        case .rightMirrored: self = .rightMirrored
        @unknown default: self = .up
        }
    }
}

Understanding the Main Request Options

VNRecognizeTextRequest has a few settings that matter immediately.

recognitionLevel controls the speed and quality tradeoff:

.fast is useful for quick scanning or real-time scenarios
.accurate is better for receipts, documents, and small text

recognitionLanguages helps Vision choose the right language model. If you know the text is English, French, or another specific language, set that explicitly instead of relying on broad detection.

usesLanguageCorrection tells Vision to prefer plausible words over raw character guesses. That often improves document OCR, but it can hurt if you are scanning product codes, serial numbers, or intentionally unusual identifiers.

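Vision also returns a bounding box with each observation, expressed in normalized coordinates (0 to 1) with the origin at the lower-left corner of the image. That matters when you want to draw highlights over detected text, let users tap extracted regions, or preserve layout information.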
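Because Vision reports bounding boxes in normalized coordinates with a lower-left origin, drawing them in UIKit usually requires scaling and a vertical flip. The sketch below shows one way to do that with plain geometry; the function name topLeftRect(forNormalized:imageSize:) is hypothetical, and it assumes the target view displays the image at the given size without extra scaling or cropping.

```swift
import Foundation

/// Converts a normalized rectangle (origin at the lower left, values in 0...1)
/// into pixel coordinates with a top-left origin, as UIKit uses.
/// Hypothetical helper, not part of Vision.
func topLeftRect(forNormalized box: CGRect, imageSize: CGSize) -> CGRect {
    let width = box.size.width * imageSize.width
    let height = box.size.height * imageSize.height
    let x = box.origin.x * imageSize.width
    // Flip the y-axis: Vision measures from the bottom edge, UIKit from the top.
    let y = (1 - box.origin.y - box.size.height) * imageSize.height
    return CGRect(x: x, y: y, width: width, height: height)
}

let rect = topLeftRect(
    forNormalized: CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25),
    imageSize: CGSize(width: 800, height: 400)
)
// rect has x 200, y 100, width 400, height 100
```

In an app, the input would be observation.boundingBox. Vision also ships VNImageRectForNormalizedRect for the scaling step, but it does not flip the y-axis for you.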

Processing Results

Each VNRecognizedTextObservation may contain multiple text candidates ranked by confidence. For a simple app, taking the first candidate is usually fine. If you are building a scanner for forms or labels, reading several candidates can help when you want to apply your own validation logic.

Example:

swift

// `request` is a VNRecognizeTextRequest that has already been performed.
for observation in request.results ?? [] {
    let candidates = observation.topCandidates(3)
    for candidate in candidates {
        print(candidate.string, candidate.confidence)
    }
}

This is especially useful when the OCR output must match a known pattern, such as an invoice number or a license plate format. In that case, the top candidate is not always the one your app should accept: a lower-ranked candidate that fits the pattern can be the better choice.
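The pattern-matching idea can be sketched without Vision at all, since it only needs the candidate strings. The helper name firstCandidate(matching:in:) and the invoice format below are assumptions made up for illustration.

```swift
import Foundation

/// Returns the first candidate that matches `pattern`, or nil when none does.
/// Candidates are expected in Vision's order, most confident first.
/// Hypothetical helper, not part of Vision.
func firstCandidate(matching pattern: String, in candidates: [String]) -> String? {
    candidates.first { candidate in
        candidate.range(of: pattern, options: .regularExpression) != nil
    }
}

// Hypothetical invoice format: "INV-" followed by exactly six digits.
let invoicePattern = "^INV-[0-9]{6}$"
let candidateStrings = ["1NV-004217", "INV-004217", "INV-0042I7"]
let accepted = firstCandidate(matching: invoicePattern, in: candidateStrings)
// accepted is "INV-004217", the second-ranked candidate
```

With real observations, the candidate list would come from observation.topCandidates(3).map { $0.string }.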

Working with Camera Frames

For live capture, you usually feed Vision frames from AVCaptureVideoDataOutput. The API shape is similar, but instead of a CGImage, you pass a pixel buffer into the handler.

That design lets you build camera-based features such as:

scanning labels in real time
reading a document before taking a photo
showing highlighted text boxes while the user moves the camera
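As a rough sketch of the pixel-buffer path, a sample buffer delegate can hand each frame to Vision. The class name FrameTextRecognizer and its onText callback are hypothetical, and the fixed .right orientation assumes the back camera with the device held in portrait; a real app would derive the orientation from the capture connection and device state.

```swift
import AVFoundation
import Vision

/// Hypothetical delegate that runs text recognition on camera frames.
final class FrameTextRecognizer: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let onText: ([String]) -> Void

    init(onText: @escaping ([String]) -> Void) {
        self.onText = onText
        super.init()
    }

    func captureOutput(
        _ output: AVCaptureOutput,
        didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        let request = VNRecognizeTextRequest()
        request.recognitionLevel = .fast // favor speed for live frames

        // Assumption: back camera, portrait device, so frames arrive rotated.
        let handler = VNImageRequestHandler(
            cvPixelBuffer: pixelBuffer,
            orientation: .right,
            options: [:]
        )

        do {
            try handler.perform([request])
            let lines = (request.results ?? []).compactMap {
                $0.topCandidates(1).first?.string
            }
            onText(lines) // called on the delegate queue, not the main thread
        } catch {
            // Skip frames that fail; the next frame gets another chance.
        }
    }
}
```

The delegate is invoked on whatever queue you pass to setSampleBufferDelegate(_:queue:), so UI updates triggered from onText still need to hop to the main thread.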

When you do this, control request frequency. Running full OCR on every frame can overwhelm the device, so most apps sample frames or cancel outdated requests when a new frame arrives.
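One simple way to sample frames is a time-based gate that accepts a frame only if enough time has passed since the last accepted one. The FrameThrottle type below is a hypothetical sketch of that idea, not a Vision or AVFoundation API, and the half-second interval is an arbitrary example value.

```swift
import Foundation

/// Hypothetical gate that lets a frame through only when at least
/// `minimumInterval` seconds have passed since the last accepted frame.
struct FrameThrottle {
    let minimumInterval: TimeInterval
    private var lastAccepted: TimeInterval?

    init(minimumInterval: TimeInterval) {
        self.minimumInterval = minimumInterval
    }

    /// Returns true when the frame at `timestamp` (in seconds) should be processed.
    mutating func shouldProcess(at timestamp: TimeInterval) -> Bool {
        if let last = lastAccepted, timestamp - last < minimumInterval {
            return false
        }
        lastAccepted = timestamp
        return true
    }
}

var throttle = FrameThrottle(minimumInterval: 0.5)
let decisions = [0.0, 0.1, 0.4, 0.5, 0.75, 1.0].map { throttle.shouldProcess(at: $0) }
// decisions is [true, false, false, true, false, true]
```

In a capture delegate, the timestamp could come from CMSampleBufferGetPresentationTimeStamp(sampleBuffer).seconds, with frames dropped early whenever shouldProcess(at:) returns false.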

Common Pitfalls

The most frequent issue is wrong image orientation. If you skip the orientation parameter or map it incorrectly, OCR quality drops sharply because the model is effectively reading rotated text.

Another common mistake is doing OCR on the main thread. Text recognition can be expensive, especially with .accurate, so it should run off the UI thread and update the interface only when results are ready.
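A minimal sketch of that pattern with Grand Central Dispatch is shown below. The helper name runInBackground and its callbackQueue parameter are hypothetical; the closure passed as work stands in for the blocking Vision call, such as creating a handler and calling perform.

```swift
import Dispatch

/// Runs `work` on a background queue and delivers the outcome on
/// `callbackQueue` (the main queue by default). Hypothetical helper.
func runInBackground<T>(
    _ work: @escaping () throws -> T,
    callbackQueue: DispatchQueue = .main,
    completion: @escaping (Result<T, Error>) -> Void
) {
    DispatchQueue.global(qos: .userInitiated).async {
        // Capture either the value or the thrown error.
        let result = Result(catching: work)
        callbackQueue.async {
            completion(result)
        }
    }
}
```

In an app, work would wrap the recognition call and completion would update labels or overlays, which is safe because the default callback queue is the main queue.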

Developers also often assume that an OCR failure means Vision is broken when the actual problem is input quality. Low contrast, motion blur, tiny text, or aggressive image compression can all ruin recognition. Cropping to the text area and using a clearer source image often helps more than changing code.

Finally, do not overcorrect with language settings. If your app scans mixed-language content or identifiers, forcing one language and enabling language correction can replace valid strings with dictionary words that look "more likely" to the model.
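For identifier-heavy content, a request configured along these lines may be a reasonable starting point. The function name makeIdentifierTextRequest is hypothetical, and whether .accurate is the right level depends on your latency budget.

```swift
import Vision

/// Hypothetical factory for a request tuned to codes and serial numbers.
func makeIdentifierTextRequest() -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    // Keep raw character guesses instead of nudging them toward dictionary words.
    request.usesLanguageCorrection = false
    return request
}
```

Pairing this with your own validation of the recognized strings tends to work better than relying on language correction for strings that are not words.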

Summary

VNRecognizeTextRequest is the standard Vision API for extracting text from images.
The basic flow is image input, request configuration, handler execution, and observation parsing.
recognitionLevel, recognitionLanguages, and usesLanguageCorrection strongly affect results.
Correct orientation handling is essential for reliable OCR.
For camera-based OCR, run recognition asynchronously and avoid processing every frame at full accuracy.