Computer Vision in Azure AI solutions
Source: My personal notes from Course AI-102T00-A: Develop AI solutions in Azure - Microsoft Learn with labs from Exercises for Develop vision solutions in Azure
Image Analysis
Section titled “Image Analysis”Use cases: create captions for image, image tagging, detect and locate objects and people
Existing models will respond with JSON with the image information. Response will include information like entity and the boundary box of the entity in the image.
Access can be using an SDK or API to submit images and get responses.
Example of Analyze Image
Section titled “Example of Analyze Image”Source: Analyze Image - Analyze Image - REST API (Azure Azure AI Services) | Microsoft Learn
Operation extracts a set of visual features based on the image content. Two input methods are supported
- Uploading an image
- Give an image URL
Within your request, there is an optional parameter to allow you to choose which features to return. By default, image categories are returned in the response. A successful response will be returned in JSON. If the request failed, the response will contain an error code and a message.
Sample request and response
{ "url": "{url}"}Sample Response
---
{ "categories": [ { "name": "abstract_", "score": 0.00390625 }, { "name": "people_", "score": 0.83984375, "detail": { "celebrities": [ { "name": "Satya Nadella", "faceRectangle": { "left": 597, "top": 162, "width": 248, "height": 248 }, "confidence": 0.999028444 } ] } }, { "name": "building_", "score": 0.984375, "detail": { "landmarks": [ { "name": "Forbidden City", "confidence": 0.9829016923904419 } ] } } ], "adult": { "isAdultContent": false, "isRacyContent": false, "isGoryContent": false, "adultScore": 0.0934349000453949, "racyScore": 0.06861349195241928, "goreScore": 0.012872257380997575 }, "tags": [ { "name": "person", "confidence": 0.9897908568382263 }, { "name": "man", "confidence": 0.9449388980865479 }, { "name": "outdoor", "confidence": 0.938492476940155 }, { "name": "window", "confidence": 0.8951393961906433 }, { "name": "pangolin", "confidence": 0.7250059783791661, "hint": "mammal" } ], "description": { "tags": [ "person", "man", "outdoor", "window", "glasses" ], "captions": [ { "text": "Satya Nadella sitting on a bench", "confidence": 0.48293603002174407 } ] }, "requestId": "0dbec5ad-a3d3-4f7e-96b4-dfd57efe967d", "metadata": { "width": 1500, "height": 1000, "format": "Jpeg" }, "modelVersion": "2021-04-01", "faces": [ { "age": 44, "gender": "Male", "faceRectangle": { "left": 593, "top": 160, "width": 250, "height": 250 } } ], "color": { "dominantColorForeground": "Brown", "dominantColorBackground": "Brown", "dominantColors": [ "Brown", "Black" ], "accentColor": "873B59", "isBWImg": false }, "imageType": { "clipArtType": 0, "lineDrawingType": 0 }, "objects": [ { "rectangle": { "x": 0, "y": 0, "w": 50, "h": 50 }, "object": "tree", "confidence": 0.9, "parent": { "object": "plant", "confidence": 0.95 } } ], "brands": [ { "name": "Pepsi", "confidence": 0.857, "rectangle": { "x": 489, "y": 79, "w": 161, "h": 177 } }, { "name": "Coca-Cola", "confidence": 0.893, "rectangle": { "x": 216, "y": 55, "w": 171, "h": 372 } } ]}Text in Images
Section titled “Text in Images”Use cases: get information from text like addresses, identifiers, numbers, photographer, license plates, and digitize notes
Optical character recognition (OCR) recognizes text and structure. Responses from the service include blocks and text and information for that block.
Example solution: A camera periodically uploads images to blob storage. On an event, the code checks for an update, extracts entities and responds.
Example of Operation: Recognize Printed Text
Section titled “Example of Operation: Recognize Printed Text”Sample response. Regions (blocks) have bounding boxes. Boxes have lines
and in the lines are words. Words are sets of text like the example
image A GOAL WITHOUT
Response is provided in json with the following structure:
- Metadata
- Regions
- Bounding boxes
- Lines
- Words
- Text
- Words
- Lines
- Bounding boxes
Each bounding box provides the coordinates in the image of the content (lines, words, text).
{ "language": "en", "textAngle": -2.0000000000000338, "orientation": "Up", "regions": [ { "boundingBox": "462,379,497,258", "lines": [ { "boundingBox": "462,379,497,74", "words": [ { "boundingBox": "462,379,41,73", "text": "A" }, { "boundingBox": "523,379,153,73", "text": "GOAL" }, { "boundingBox": "694,379,265,74", "text": "WITHOUT" } ] }, { "boundingBox": "565,471,289,74", "words": [ { "boundingBox": "565,471,41,73", "text": "A" }, { "boundingBox": "626,471,150,73", "text": "PLAN" }, { "boundingBox": "801,472,53,73", "text": "IS" } ] }, { "boundingBox": "519,563,375,74", "words": [ { "boundingBox": "519,563,149,74", "text": "JUST" }, { "boundingBox": "683,564,41,72", "text": "A" }, { "boundingBox": "741,564,153,73", "text": "WISH" } ] } ] } ], "modelVersion": "2021-04-01"}Facial Recognition: Detect, Analyze And Recognize Faces
Section titled “Facial Recognition: Detect, Analyze And Recognize Faces”Use cases: detect and locate faces, analyze facial features.
Using Azure AI face services requires approval due to sensitivity and are used for facial recognition solutions. For features, see Computer Vision (CV) Concepts and in Azure - Computer Vision (CV) Concepts and in Azure
Accessing Azure AI Vision Face resources can be done using Face SDK to get detection and face attributes and features responses:
- Detect faces
- Face attribute analysis (head pose, glasses, mask, other visual attributes and accessories)
- Facial landmarks locations
- Face comparison
- Facial recognition (specific individuals)
- Facial liveness (detect real stream)
Example response for an image containing a single face:
[ { 'faceRectangle': {'top': 174, 'left': 247, 'width': 246, 'height': 246} 'faceAttributes': { 'headPose':{'pitch': 3.7, 'roll': -7.7, 'yaw': -20.9}, 'accessories': [ {'type': 'glasses', 'confidence': 1.0} ], 'occlusion':{'foreheadOccluded': False, 'eyeOccluded': False, 'mouthOccluded': False} } }]Classify images and detect objects
Section titled “Classify images and detect objects”Use case: classify images (categorization) and detect object(s) and classify them
Classification can be:
- Multi class where each image is tagged with 1 class label from several, for example apple from [apple, orange, pineapple, banana].
- Multi label where each image can be tagged with multiple classes, for example fruit bowl with apple, orange, and banana
Example solutions: Food processing where image classification checks for “good” versions of a food product and remove “bad” versions. Medical imaging checks if a disease is present or not.
See Research Data - Mendeley Data for sample data sets to play with.
Azure AI Custom Vision
Section titled “Azure AI Custom Vision”Models can be trained with custom classifications.
Generative AI and Vision
Section titled “Generative AI and Vision”Multimodal generative AI with mixed media
Section titled “Multimodal generative AI with mixed media”A Multimodal generative AI model responds to prompts and returns created content. Prompts can include text, speech, and images and typically include a text part and media part.
Examples of models for multimodal generation:
Microsoft Phi-4-multimodal-instruct OpenAI gpt-4.1 OpenAI gpt-4.1-mini
Generate images with AI
Section titled “Generate images with AI”Uses prompt to create images. Example models that have the ability:
- OpenAI gpt-image-1 series of models.
- Black Forest Labs FLUX series of models.
Image generation can have responsible AI implications like other generative AI with malicious use.
Custom Vision
Section titled “Custom Vision”Object detection with Azure AI Custom Vision can handle:
- Train a custom model based on your own training images - Custom Vision training resource
- Create predictions from new images based on your trained model - Custom Vision prediction resource
Portal and SDK access allow training for custom image classification and object detection. Use includes:
Image labelling:
- Image classification: adds tags that apply to the whole image
- Object detection: bounding boxes for objects in image
Domains for Custom Vision
Section titled “Domains for Custom Vision”Domains are a starting point for a project optimized for specific use cases:
Image Classification
Section titled “Image Classification”- General, and General A1 (accuracy, large data) and A2 (accuracy, faster inference)
- Food, for example restaurant menus, dishes, fruits
- Landmarks, both natural and artificial
- Retail, for example shopping catalogue, shopping website, different shop items
- Compact - optimized for real time, edge devices
Object Detection
Section titled “Object Detection”- General
- Logo
- Products on Shelves
- Compact
The Azure Video Indexer service helps extract information from videos:
- Facial recognition - detecting the presence of individual people in the image. This requires Limited Access approval.
- Optical character recognition - reading text in the video.
- Speech transcription - creating a text transcript of spoken dialog in the video.
- Topics - identification of key topics discussed in the video.
- Sentiment - analysis of how positive or negative segments within the video are.
- Labels - label tags that identify key objects or themes throughout the video.
- Content moderation - detection of adult or violent themes in the video.
- Scene segmentation - a breakdown of the video into its constituent scenes.
The service includes predefined models to recognize celebrities, do OCR ,and transcription. Creating custom models is supported for recognizing other:
- People - known images of certain people
- Language - like specific terms
- Brands - identify products, projects, organizations
Exercise: Analyze Images
Section titled “Exercise: Analyze Images”Create Azure AI vision resource and using the SDK, submit images for captions, entity location, get objects, tagging, and people identification.
Exercise: Detect and analyze faces
Section titled “Exercise: Detect and analyze faces”Create a face service and using the SDK, submit images for face detection and boundary box location.
Exercise: Generate images with AI (DALL-E or Flux)
Section titled “Exercise: Generate images with AI (DALL-E or Flux)”Use the OpenAI DALL-E generative AI model like dall-e-3 or
flux.2-pro to create images and OpenAI Python SDK to create a simple
app to generate images based on prompts.
Exercise: Develop a vision-enabled chat app
Section titled “Exercise: Develop a vision-enabled chat app”Use a generative AI model to generate responses to prompts that include images. The app provides AI assistance with fresh produce in a grocery store by using Microsoft Foundry and the OpenAI SDK.
Exercise: Generate video with Sora in Microsoft Foundry
Section titled “Exercise: Generate video with Sora in Microsoft Foundry”Sora is an AI model from OpenAI that creates realistic and imaginative video scenes from text instructions. The model can generate a wide range of video content, including realistic scenes, animations, and special effects. It supports several video resolutions and durations, and can also use reference images and remix existing videos.
- Deploy the Sora model
- Generate video content using the Microsoft Foundry portal
- Create app that generates videos from images, polls for completion status, and remixes existing videos.
Exercise: Analyze images with Azure Content Understanding (ACU)
Section titled “Exercise: Analyze images with Azure Content Understanding (ACU)”ACU in Foundry Tools is a capability available in Microsoft AI Foundry that uses generative AI to analyze and interpret different types of unstructured content, including documents, images, audio, and video.
Using AI models with content can generate structured outputs following a user defined schema. These structured outputs make it easier to integrate extracted information into automation, analytics, and search workflows.
- Use Azure Content Understanding to analyze images and generate structured descriptions that help classify and index visual content, making it easier to locate relevant images and integrate them into search systems.
- Create and use an image analyzer in the Content Understanding Studio web interface
- Run the analyzer on sample images and review the generated
descriptions that can be used as metadata for indexing and search such
as
descriptionandtags. AI-generated image descriptions help make visual content searchable.