Computer Vision in Azure

Exam weight: 15–20%
Study roadmap for this topic:
Image classification
Object detection
Optical Character Recognition (OCR)
Face detection and analysis
Computer Vision
As the name suggests, computer vision is the ability of AI to “see” and identify objects and images.
For this to be possible, models must be trained using large volumes of images.
Computer vision enables AI to classify images, detect objects, and even combine vision models with generative AI to create multimodal models.
Scenarios where computer vision is used
Automatic captioning or tag generation for photos
Visual search
Inventory level monitoring
Security video monitoring
Authentication using facial recognition
Robotics and autonomous vehicles
The term computational visual search refers to techniques used by AI software that process images, videos, or live camera streams to extract relevant information.
This type of visual processing has evolved significantly over the years.
Image classification
Image classification is one of the most commonly used computer vision solutions.
To achieve this, a model must be trained with correctly labeled images so it can learn how to detect each class, as shown in the example:

Object detection
Imagine a supermarket where products are automatically identified at checkout.
In this case, object detection is used. Detection models examine multiple regions of an image to identify objects and their locations.
Once detected, objects are represented by the coordinates of a bounding box:

Semantic segmentation
Semantic segmentation is a more advanced approach to object detection.
In this technique, each pixel in the image is assigned a label corresponding to the object it belongs to.
This results in a much more precise understanding of where objects are located within the image:

Contextual image analysis
Multimodal models are capable of locating and semantically interpreting an image, describing it in natural language.
For example, a model can identify that an image shows a person eating an apple:

Images and image processing
Many people may not realize this yet, but an image is essentially a matrix of numeric values.
Each pixel has a value between 0 and 255 (in the case of grayscale images).
For example, the matrix below:

Represents the corresponding image:

Color images
Color images are three-dimensional and contain three channels: Red, Green, and Blue (RGB).

An image like this consists of three layers of vectors, one for each color channel.
Purple squares are represented by:
Red: 150
Green: 0
Blue: 255
Yellow squares are represented by:
Red: 255
Green: 255
Blue: 0
Filters
One technique used to modify image colors and highlight patterns is the use of filters, which alter pixel values to create visual effects.
A filter is applied across the entire image, generating a new matrix of values and producing a new visual effect.
For a more detailed explanation of how filters are applied to images, refer to the official documentation here.
Convolutional Neural Networks (CNNs)
Applying filters is something we already use in everyday image editing.
However, in computer vision, filters are used to detect high-level visual patterns.
CNNs are deep learning architectures designed to simulate how the human brain processes images by identifying patterns such as shapes, edges, colors, and textures.
Below is an overview of how models are trained to identify bananas, apples, and oranges:

You provide images along with their labels (e.g., apple, banana, orange).
The network applies filters to the image, like lenses highlighting important features.
Each filter generates a feature map that detects edges, details, and patterns.
These features are progressively reduced and refined across layers. The network then predicts what object it sees and adjusts its weights when it makes mistakes.
After many training cycles, the model becomes good at recognizing new images.
Vision Transformers and multimodal models
Transformers, originally created for Natural Language Processing (NLP), are now also used in computer vision — and they work very differently from CNNs.
Vision Transformers (ViT) use techniques similar to NLP, but applied to image data.
Instead of text tokens, the transformer divides the image into patches of pixel values, which are then converted into vectors.
This creates a multidimensional map of relationships between parts of the image, enabling much deeper contextual understanding.
For example, a hat is visually associated with features commonly related to the head.


Image generation
The same multimodal architecture used to interpret images can also be used to generate them, through a technique called diffusion.
The process works as follows:
The model receives a prompt
It identifies the required visual elements
It starts from random noise
It iteratively transforms the noise until the final image is generated
Example prompt:
“A dog carrying a stick in its mouth”

Optical Character Recognition (OCR)
OCR is a capability that detects and extracts text from images.
Today, smartphones already perform this task by simply pointing the camera at a document.
Face detection and analysis
Face Detection is a service that identifies regions in an image containing human faces and returns the coordinates of a bounding box around each detected face.

Azure can detect faces and return multiple attributes, such as:
Accessories
Blur
Exposure
Glasses
Head pose
Mask
Noise
Occlusion
Recognition quality (high / medium / low)
Related links
Official exercise
https://learn.microsoft.com/training/modules/introduction-computer-vision/5b-exercise
Next post:
4 — Natural Language Processing (NLP) in Azure



