Computer Vision in Azure

Exam weight: 15–20%

Study roadmap for this topic:

  • Image classification

  • Object detection

  • Optical Character Recognition (OCR)

  • Face detection and analysis

Computer Vision

As the name suggests, computer vision is the ability of AI to “see”: to interpret images and identify the objects in them.
For this to be possible, models must be trained using large volumes of images.

Computer vision enables AI to classify images and detect objects. Vision models can also be combined with generative AI to create multimodal models.

Scenarios where computer vision is used

  • Automatic captioning or tag generation for photos

  • Visual search

  • Inventory level monitoring

  • Security video monitoring

  • Authentication using facial recognition

  • Robotics and autonomous vehicles

The term computer vision refers to the techniques AI software uses to process images, videos, or live camera streams and extract relevant information from them.
These techniques have evolved significantly over the years.

Image classification

Image classification is one of the most common computer vision tasks: the model assigns a single label to a whole image.
To do this, the model is trained on correctly labeled images so it can learn the visual features that distinguish each class.
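As a minimal sketch of the final step of classification, assume a trained model has already produced one raw score per class for an image. The class names and scores below are made up for illustration, and no real model is involved. The sketch only shows how scores become probabilities and how the predicted label is chosen.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels and raw model scores for one image.
class_names = ["apple", "banana", "orange"]
raw_scores = [2.0, 0.5, 0.1]

probabilities = softmax(raw_scores)
best_index = probabilities.index(max(probabilities))
predicted_class = class_names[best_index]

print(predicted_class)
```

The class with the highest probability is reported as the label for the whole image.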

Object detection

Imagine a supermarket where products are automatically identified at checkout.
In this case, object detection is used. Detection models examine multiple regions of an image to identify objects and their locations.
Each detected object is returned with the coordinates of a bounding box that marks its location.
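The shape of a detection result can be sketched roughly as follows. The field names, the confidence values, and the (left, top, width, height) pixel convention are assumptions for this example and do not come from any specific Azure response.

```python
# Hypothetical detections for one checkout image. The field names and the
# (left, top, width, height) pixel convention are assumptions for this sketch.
detections = [
    {"label": "apple",  "confidence": 0.94, "box": (40, 60, 120, 110)},
    {"label": "banana", "confidence": 0.88, "box": (200, 80, 180, 90)},
    {"label": "orange", "confidence": 0.31, "box": (10, 10, 50, 50)},
]

def box_corners(box):
    """Convert (left, top, width, height) to (left, top, right, bottom)."""
    left, top, width, height = box
    return (left, top, left + width, top + height)

def confident_detections(items, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in items if d["confidence"] >= threshold]

for det in confident_detections(detections):
    print(det["label"], box_corners(det["box"]))
```

Filtering by a confidence threshold is a common way to discard weak detections before acting on them.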

Semantic segmentation

Semantic segmentation is a finer-grained alternative to object detection.
Instead of drawing a box around each object, it assigns every pixel in the image a label for the object it belongs to.
The result is a mask that shows precisely which pixels each object covers.
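A segmentation result can be pictured as a matrix the same size as the image, holding one class id per pixel. The sketch below uses a tiny hand-written mask and hypothetical class ids to show how per-pixel labels can be summarized.

```python
from collections import Counter

# Hypothetical class ids: 0 = background, 1 = apple, 2 = banana.
class_names = {0: "background", 1: "apple", 2: "banana"}

# A 4x6 segmentation mask: one class id per pixel.
mask = [
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]

def pixels_per_class(mask):
    """Count how many pixels were assigned to each class."""
    counts = Counter(pixel for row in mask for pixel in row)
    return {class_names[class_id]: n for class_id, n in counts.items()}

pixel_counts = pixels_per_class(mask)
print(pixel_counts)
```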

Contextual image analysis

Multimodal models go beyond locating objects: they can interpret the content of an image and describe it in natural language.
For example, a model can identify that an image shows a person eating an apple.

Images and image processing

Many people may not realize this, but to a computer an image is essentially a matrix of numeric values, one per pixel.
In an 8-bit grayscale image, each pixel has a value between 0 (black) and 255 (white).

For example, a matrix filled with 0 except for a block of 255 in the middle represents a white square on a black background.
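To make the idea concrete, here is a small sketch of a grayscale image stored as a matrix. The 7x7 size and the white square are arbitrary choices for this example. The `render` helper is only a text-based way to look at the pixels.

```python
# A 7x7 grayscale image: 0 is black, 255 is white.
image = [
    [0, 0,   0,   0,   0, 0, 0],
    [0, 0,   0,   0,   0, 0, 0],
    [0, 0, 255, 255, 255, 0, 0],
    [0, 0, 255, 255, 255, 0, 0],
    [0, 0, 255, 255, 255, 0, 0],
    [0, 0,   0,   0,   0, 0, 0],
    [0, 0,   0,   0,   0, 0, 0],
]

def render(image):
    """Draw the image as text: '#' for bright pixels, '.' for dark ones."""
    return "\n".join(
        "".join("#" if pixel > 127 else "." for pixel in row) for row in image
    )

print(render(image))
```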

Color images

Color images add a third dimension: three channels, Red, Green, and Blue (RGB).

An image like this consists of three layers of matrices, one for each color channel, and the color of each pixel is the combination of its three channel values.

  • Purple squares are represented by:

    • Red: 150

    • Green: 0

    • Blue: 255

  • Yellow squares are represented by:

    • Red: 255

    • Green: 255

    • Blue: 0
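The channel values listed above can be sketched in code. The 2x2 checkerboard layout is an assumption for this example. The point is that a color image splits into one matrix per channel.

```python
# Colors from the text, as (red, green, blue) values in the 0-255 range.
PURPLE = (150, 0, 255)
YELLOW = (255, 255, 0)

# A 2x2 color image: rows of pixels, each pixel an (R, G, B) tuple.
image = [
    [PURPLE, YELLOW],
    [YELLOW, PURPLE],
]

def channel(image, index):
    """Extract one color channel as a 2D matrix (0=red, 1=green, 2=blue)."""
    return [[pixel[index] for pixel in row] for row in image]

red, green, blue = (channel(image, i) for i in range(3))

print("red:  ", red)
print("green:", green)
print("blue: ", blue)
```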

Filters

One technique used to modify image colors and highlight patterns is the use of filters, which alter pixel values to create visual effects.

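A filter is defined by a small matrix of weights called a kernel. The kernel is slid across the entire image, and at each position the weighted sum of the pixels under it becomes one value in a new matrix, which produces the new visual effect.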
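The following is a minimal sketch of applying a filter, assuming a 3x3 Laplace-style kernel (a common edge-highlighting choice, picked here for illustration) and no padding, so the output is slightly smaller than the input. Results are clamped to the 0-255 pixel range.

```python
# A 3x3 Laplace-style kernel, commonly used to highlight edges.
KERNEL = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

def apply_filter(image, kernel):
    """Slide the kernel over the image (no padding) and return the new matrix."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    result = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(max(0, min(255, total)))  # clamp to the valid pixel range
        result.append(row)
    return result

# 5x5 grayscale image: a white 3x3 square on a black background.
image = [
    [0,   0,   0,   0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0,   0,   0,   0, 0],
]

filtered = apply_filter(image, KERNEL)
for row in filtered:
    print(row)
```

In the output, the outline of the square stays bright while its uniform interior drops to 0, which is why this kind of filter is described as highlighting edges.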

For a more detailed explanation of how filters are applied to images, refer to the official Microsoft Learn module listed under Related links.

Convolutional Neural Networks (CNNs)

Applying filters is something we already do in everyday image editing.
In computer vision, however, filters are used to extract visual features that a model can learn from.

CNNs are deep learning architectures, loosely inspired by how the brain processes images, that learn their own filters to identify patterns such as shapes, edges, colors, and textures.

Below is an overview of how models are trained to identify bananas, apples, and oranges:

  1. You provide images along with their labels (e.g., apple, banana, orange).

  2. The network applies filters to the image, like lenses highlighting important features.

  3. Each filter generates a feature map that shows where edges, details, and patterns appear in the image.

  4. These features are progressively reduced and refined across layers. The network then predicts what object it sees and adjusts its weights when it makes mistakes.

  5. After many training cycles, the model becomes good at recognizing new images.
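Step 4 above says features are "progressively reduced". One common way to do this, assumed here because the post does not name a specific method, is to zero out negative filter responses (ReLU) and then keep only the strongest value in each 2x2 block (max pooling). The feature map values are made up for illustration.

```python
def relu(feature_map):
    """Replace negative responses with 0, keeping only positive activations."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Halve width and height by keeping the largest value in each 2x2 block."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            block = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map produced by one filter.
feature_map = [
    [ 3, -1,  0,  2],
    [-2,  5,  1, -4],
    [ 0,  1, -3,  6],
    [ 4, -2,  2,  1],
]

reduced = max_pool_2x2(relu(feature_map))
print(reduced)
```

The weight adjustment that happens when the network makes mistakes (training) is not shown here. This sketch covers only the reduction of one feature map.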

Vision Transformers and multimodal models

Transformers, originally created for Natural Language Processing (NLP), are now also used in computer vision, where they work very differently from CNNs.

Vision Transformers (ViT) use techniques similar to NLP, but applied to image data.
Instead of text tokens, the transformer divides the image into patches of pixel values, which are then converted into vectors.

An attention mechanism then relates each patch vector to the others, creating a multidimensional map of relationships between parts of the image and enabling much deeper contextual understanding.
For example, the visual features of a hat end up closely associated with features commonly related to the head.
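The first step described above, dividing an image into patches and turning each patch into a vector, can be sketched as follows. The 4x4 image and the patch size of 2 are arbitrary choices for this example. A real Vision Transformer would go on to project these vectors with learned weights and relate them to one another, which is not shown.

```python
def split_into_patches(image, patch_size):
    """Cut a 2D image into non-overlapping square patches, flattened to vectors.

    Assumes the image height and width are multiples of patch_size.
    """
    patches = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            patch = [
                image[top + i][left + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 grayscale image whose pixel values are simply 0..15 for readability.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

patches = split_into_patches(image, patch_size=2)
for patch in patches:
    print(patch)
```

Each printed list plays the role that a token plays in a text model: one unit of input for the transformer.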

Image generation

The same multimodal architecture used to interpret images can also be used to generate them from a text prompt. Most modern image generators do this with a technique called diffusion.

The process works as follows:

  • The model receives a prompt

  • It identifies the required visual elements

  • It starts from random noise

  • It iteratively transforms the noise until the final image is generated

Example prompt:

“A dog carrying a stick in its mouth”
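The loop structure of that process can be illustrated with a toy sketch. This is not a real diffusion model. In a real system, a trained neural network guided by the prompt predicts what to remove at each step. Here a known target image stands in for that network, and the function name, step count, and step size are all made up for illustration.

```python
import random

def toy_denoise(target, steps=50, seed=0):
    """Start from random noise and move a little toward `target` at each step.

    The known target replaces the trained, prompt-guided network that a real
    diffusion model would use to decide each refinement.
    """
    rng = random.Random(seed)
    image = [rng.uniform(0, 255) for _ in target]  # begin with pure noise
    for _ in range(steps):
        # Remove a fraction of the remaining difference at every iteration.
        image = [pixel + 0.2 * (goal - pixel) for pixel, goal in zip(image, target)]
    return image

# A tiny 4-pixel "image" standing in for the picture described by the prompt.
target = [0, 255, 255, 0]
result = toy_denoise(target)
print([round(p) for p in result])
```

The takeaway is only the shape of the process: many small refinement steps turn noise into a coherent result.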

Optical Character Recognition (OCR)

OCR is a capability that detects and extracts text from images.
Today, smartphones already perform this task by simply pointing the camera at a document.

Face detection and analysis

Face detection identifies regions in an image that contain human faces and returns the coordinates of a bounding box around each detected face.

The Azure AI Face service can detect faces and return multiple attributes, such as:

  • Accessories

  • Blur

  • Exposure

  • Glasses

  • Head pose

  • Mask

  • Noise

  • Occlusion

  • Recognition quality (high / medium / low)

Related links

https://learn.microsoft.com/training/modules/introduction-computer-vision/
https://learn.microsoft.com/training/modules/get-started-computer-vision-azure/4-face-service

Official exercise

https://learn.microsoft.com/training/modules/introduction-computer-vision/5b-exercise

Next post:

4 — Natural Language Processing (NLP) in Azure
