Azure's Powerful Computer Vision Tools

Exam weight: 15–20%

Study roadmap for this topic:

Image classification
Object detection
Optical Character Recognition (OCR)
Face detection and analysis

Computer Vision

As the name suggests, computer vision is the ability of AI to “see”: to interpret images and identify the objects in them.
For this to be possible, models must be trained on large volumes of images.

Computer vision enables AI to classify images, detect objects, and even combine vision models with generative AI to create multimodal models.

Scenarios where computer vision is used

Automatic captioning or tag generation for photos
Visual search
Inventory level monitoring
Security video monitoring
Authentication using facial recognition
Robotics and autonomous vehicles

The term computer vision refers to the techniques AI software uses to process images, videos, or live camera streams and extract relevant information from them.
This type of visual processing has evolved significantly over the years.

Image classification

Image classification is one of the most commonly used computer vision solutions.
To achieve this, a model must be trained with correctly labeled images so it can learn how to detect each class, as shown in the example:
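
To make the idea of learning from labeled examples concrete, here is a minimal sketch of a nearest-centroid classifier. It is an illustration only, not what Azure uses: it assumes each photo has already been reduced to its average (R, G, B) color, and the training values, `train`, and `predict` are all hypothetical.

```python
def train(examples):
    """Average the feature vectors of each label (one centroid per class)."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Hypothetical training data: average (R, G, B) color of each labeled photo.
training_data = [
    ((200, 30, 40), "apple"),
    ((180, 20, 30), "apple"),
    ((240, 220, 60), "banana"),
    ((230, 210, 50), "banana"),
    ((250, 150, 20), "orange"),
    ((240, 140, 30), "orange"),
]

centroids = train(training_data)
print(predict(centroids, (235, 215, 70)))  # closest to the banana examples
```

Real image classifiers learn far richer features than average color, but the workflow is the same: labeled examples in, a model that maps new images to a class out.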

Object detection

Imagine a supermarket where products are automatically identified at checkout.
In this case, object detection is used. Detection models examine multiple regions of an image to identify objects and their locations.
Once detected, objects are represented by the coordinates of a bounding box:
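
As a rough illustration of what a detection result might look like, the sketch below stores each object as a label, a confidence score, and a bounding box. The `BoundingBox` class, the left/top/width/height layout, and the sample values are assumptions made for this example, not the exact response format of any Azure service.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates (origin at the top-left corner)."""
    left: int
    top: int
    width: int
    height: int

    def corners(self):
        """Return (x_min, y_min, x_max, y_max)."""
        return (self.left, self.top, self.left + self.width, self.top + self.height)

    def area(self):
        """Number of pixels covered by the box."""
        return self.width * self.height


# Hypothetical detections for a checkout image: (label, confidence, box).
detections = [
    ("apple", 0.93, BoundingBox(left=40, top=60, width=120, height=110)),
    ("banana", 0.88, BoundingBox(left=200, top=80, width=180, height=90)),
]

for label, confidence, box in detections:
    print(label, confidence, box.corners(), box.area())
```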

Semantic segmentation

Semantic segmentation is a more fine-grained alternative to object detection.
Instead of drawing a box around each object, it assigns every pixel in the image a label corresponding to the object it belongs to.
This results in a much more precise understanding of where objects are located within the image:
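
A small sketch can show the difference from a bounding box. Here a segmentation result is assumed to be a mask with one class id per pixel; the class ids, the tiny 4x6 mask, and the helper functions are hypothetical.

```python
# Hypothetical class ids: 0 = background, 1 = apple, 2 = banana.
CLASS_NAMES = {0: "background", 1: "apple", 2: "banana"}

# One label per pixel, same height and width as the image.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]


def pixel_counts(mask):
    """Count how many pixels belong to each class id."""
    counts = {}
    for row in mask:
        for class_id in row:
            counts[class_id] = counts.get(class_id, 0) + 1
    return counts


def bounding_box(mask, class_id):
    """Smallest box (x_min, y_min, x_max, y_max) containing every pixel of class_id."""
    xs = [x for row in mask for x, value in enumerate(row) if value == class_id]
    ys = [y for y, row in enumerate(mask) for value in row if value == class_id]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


for class_id, count in sorted(pixel_counts(mask).items()):
    print(CLASS_NAMES[class_id], count, bounding_box(mask, class_id))
```

A bounding box can always be derived from a mask, as `bounding_box` does, but not the other way around, which is why the mask carries more precise location information.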

Contextual image analysis

Multimodal models can both locate objects in an image and interpret the scene as a whole, describing it in natural language.
For example, a model can identify that an image shows a person eating an apple:

Images and image processing

Many people may not realize this yet, but an image is essentially a matrix of numeric values.
Each pixel has a value between 0 and 255 (in the case of grayscale images).

For example, the matrix below:

Represents the corresponding image:
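
As a separate illustration (not the figure from the original post), a tiny grayscale image can be written directly as a matrix of numbers. The 5x5 values and the `to_ascii` helper below are made up for this sketch; the helper only maps brightness to characters so the "image" can be seen in a terminal.

```python
# Illustrative 5x5 grayscale image: 0 is black, 255 is white.
gray_image = [
    [0,   0,   0,   0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 128, 255, 0],
    [0, 255, 255, 255, 0],
    [0,   0,   0,   0, 0],
]


def to_ascii(image, shades=" .:*#"):
    """Render each pixel as a character, darker values first."""
    rows = []
    for row in image:
        rows.append("".join(shades[value * (len(shades) - 1) // 255] for value in row))
    return "\n".join(rows)


print(to_ascii(gray_image))
```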

Color images

Color images are three-dimensional and contain three channels: Red, Green, and Blue (RGB).

An image like this consists of three layers of pixel values (one matrix per color channel), so every pixel is described by three numbers.

Purple squares are represented by:
- Red: 150
- Green: 0
- Blue: 255
Yellow squares are represented by:
- Red: 255
- Green: 255
- Blue: 0
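
The purple and yellow values above can be used to build a small color image and split it back into its three channels. This is a minimal sketch; the 2x2 layout and the `split_channels` helper are illustrative choices.

```python
PURPLE = (150, 0, 255)   # (red, green, blue)
YELLOW = (255, 255, 0)

# A 2x2 color image: each pixel is an (R, G, B) triple.
color_image = [
    [PURPLE, YELLOW],
    [YELLOW, PURPLE],
]


def split_channels(image):
    """Return three 2D matrices: the red, green, and blue layers."""
    return tuple(
        [[pixel[channel] for pixel in row] for row in image]
        for channel in range(3)
    )


red, green, blue = split_channels(color_image)
print("red:  ", red)
print("green:", green)
print("blue: ", blue)
```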

Filters

One technique used to modify image colors and highlight patterns is the use of filters, which alter pixel values to create visual effects.

A filter is applied across the entire image, generating a new matrix of values and producing a new visual effect.
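
The sketch below shows one way a filter can be applied, assuming a 3x3 kernel that slides over a grayscale image and produces a weighted sum at each position. The Laplacian-style kernel is a common edge-highlighting choice picked for this example, and `apply_filter` is a hypothetical helper. No padding is used, so the result is two pixels smaller in each dimension.

```python
# Edge-highlighting kernel (an assumption for this example).
LAPLACIAN = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def apply_filter(image, kernel):
    """Slide the kernel over the image and clamp each result to 0..255."""
    k = len(kernel)
    result = []
    for y in range(len(image) - k + 1):
        row = []
        for x in range(len(image[0]) - k + 1):
            total = sum(
                image[y + dy][x + dx] * kernel[dy][dx]
                for dy in range(k)
                for dx in range(k)
            )
            row.append(max(0, min(255, total)))
        result.append(row)
    return result


# A 5x5 image that is black on the left and white on the right.
edge_image = [[0, 0, 255, 255, 255] for _ in range(5)]

for row in apply_filter(edge_image, LAPLACIAN):
    print(row)
```

Flat regions cancel out to 0, while the column where black meets white lights up, which is how this kind of filter highlights edges.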

For a more detailed explanation of how filters are applied to images, refer to the official documentation.

Convolutional Neural Networks (CNNs)

Applying filters is something we already use in everyday image editing.
However, in computer vision, filters are used to detect high-level visual patterns.

CNNs are deep learning architectures, loosely inspired by how the human brain processes images, that learn to identify patterns such as shapes, edges, colors, and textures.

Below is an overview of how models are trained to identify bananas, apples, and oranges:

You provide images along with their labels (e.g., apple, banana, orange).
The network applies filters to the image, like lenses highlighting important features.
Each filter generates a feature map that detects edges, details, and patterns.
These features are progressively reduced and refined across layers.
The network then predicts what object it sees and adjusts its weights when it makes mistakes.
After many training cycles, the model becomes good at recognizing new images.
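
A full training loop is too long for a short example, but the forward steps (apply a filter, build a feature map, then reduce it) can be sketched. The code below assumes a fixed vertical-edge kernel, a ReLU activation, and 2x2 max pooling as the reduction step; in a real CNN the kernel values are learned during training, and all names here are hypothetical.

```python
def feature_map(image, kernel):
    """Slide the kernel over the image (no padding) and apply ReLU."""
    k = len(kernel)
    out = []
    for y in range(len(image) - k + 1):
        row = []
        for x in range(len(image[0]) - k + 1):
            total = sum(
                image[y + dy][x + dx] * kernel[dy][dx]
                for dy in range(k)
                for dx in range(k)
            )
            row.append(max(0, total))  # ReLU: keep only positive responses
        out.append(row)
    return out


def max_pool(fmap, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    return [
        [
            max(fmap[y + dy][x + dx] for dy in range(size) for dx in range(size))
            for x in range(0, len(fmap[0]) - size + 1, size)
        ]
        for y in range(0, len(fmap) - size + 1, size)
    ]


# Responds where pixel values increase from left to right.
VERTICAL_EDGE = [[-1, 0, 1]] * 3

# 6x6 image: dark on the left, bright in the last two columns.
cnn_input = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

fmap = feature_map(cnn_input, VERTICAL_EDGE)
pooled = max_pool(fmap)
print(fmap)
print(pooled)
```

The 6x6 input becomes a 4x4 feature map and then a 2x2 summary that still records where the edge was found.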

Vision Transformers and multimodal models

Transformers, originally created for Natural Language Processing (NLP), are now also used in computer vision, where they work very differently from CNNs.

Vision Transformers (ViT) use techniques similar to NLP, but applied to image data.
Instead of text tokens, the transformer divides the image into patches of pixel values, which are then converted into vectors.
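
The patching step can be sketched in a few lines. This assumes a single-channel image and non-overlapping square patches; `split_into_patches` is a hypothetical helper. A real Vision Transformer would then typically pass each flattened patch through a learned projection and add position information, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Cut a 2D image into non-overlapping patches and flatten each into a vector."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# 4x4 image whose pixel values are simply 0..15, to make the patches easy to follow.
vit_image = [[row * 4 + col for col in range(4)] for row in range(4)]

patches = split_into_patches(vit_image, 2)
for patch in patches:
    print(patch)
```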

This creates a multidimensional map of relationships between parts of the image, enabling much deeper contextual understanding.
For example, a hat is visually associated with features commonly related to the head.

Image generation

The same multimodal architecture used to interpret images can also be used to generate them, through a technique called diffusion.

The process works as follows:

The model receives a prompt
It identifies the required visual elements
It starts from random noise
It iteratively transforms the noise until the final image is generated
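
The last two steps can be illustrated with a toy loop. This is not a diffusion model: the `toy_denoiser` below is a placeholder that nudges pixels toward a fixed, hypothetical target, whereas a real model uses a trained neural network guided by the prompt. The sketch only shows the shape of the process, starting from random noise and refining it step by step.

```python
import random


def toy_denoiser(image, target, strength=0.2):
    """Placeholder for a trained network: move each pixel a little toward the target."""
    return [pixel + strength * (goal - pixel) for pixel, goal in zip(image, target)]


def generate(target, steps=30, seed=0):
    """Start from random noise and refine it over a fixed number of steps."""
    rng = random.Random(seed)
    image = [rng.uniform(0, 255) for _ in target]  # random noise
    history = [image]
    for _ in range(steps):
        image = toy_denoiser(image, target)
        history.append(image)
    return history


def distance(a, b):
    """Total absolute difference between two pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical pixels standing in for "what the prompt asks for".
toy_target = [0, 255, 255, 0]

history = generate(toy_target)
print("start:", round(distance(history[0], toy_target), 1))
print("end:  ", round(distance(history[-1], toy_target), 3))
```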

Example prompt:

“A dog carrying a stick in its mouth”

Optical Character Recognition (OCR)

OCR is a capability that detects and extracts text from images.
Today, smartphones already perform this task when you simply point the camera at a document.

Face detection and analysis

Face detection identifies the regions of an image that contain human faces and returns the coordinates of a bounding box around each detected face.

Azure can detect faces and return multiple attributes, such as:

Accessories
Blur
Exposure
Glasses
Head pose
Mask
Noise
Occlusion
Recognition quality (high / medium / low)
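
To show how such attributes might be used once they are returned, the sketch below keeps only the faces whose recognition quality is good enough for a later step. The `DetectedFace` class, its field names, and the sample values are hypothetical and do not mirror the actual Azure response schema.

```python
from dataclasses import dataclass


@dataclass
class DetectedFace:
    """Hypothetical container for a few of the attributes listed above."""
    box: tuple                 # (left, top, width, height) in pixels
    wearing_mask: bool
    recognition_quality: str   # "high", "medium", or "low"


QUALITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def usable_faces(faces, minimum="medium"):
    """Keep faces whose recognition quality is at least the given level."""
    threshold = QUALITY_ORDER[minimum]
    return [face for face in faces if QUALITY_ORDER[face.recognition_quality] >= threshold]


# Made-up detection results for one image.
faces = [
    DetectedFace(box=(30, 40, 80, 80), wearing_mask=False, recognition_quality="high"),
    DetectedFace(box=(150, 35, 60, 60), wearing_mask=True, recognition_quality="low"),
    DetectedFace(box=(240, 50, 70, 70), wearing_mask=False, recognition_quality="medium"),
]

for face in usable_faces(faces):
    print(face.box, face.recognition_quality)
```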

Official exercise

https://learn.microsoft.com/training/modules/introduction-computer-vision/5b-exercise

Next post:

4 — Natural Language Processing (NLP) in Azure
