Skip to content

Azure Data Fundamentals

Source: Course with Microsoft trainer Michelle Xie

  • Module 1: Explore core data concepts
  • Module 2: Explore relational data in Azure
  • Module 3: Explore non-relational data in Azure
  • Module 4: Explore modern data warehouse analytics

Data = Collection of facts, numbers, descriptions, objects, stored in a structured (e.g. tables), semi-structured (e.g. JavaScript Object Notation JSON), unstructured way (e.g. media, visuals)

  • Online Transactional Processing (OLTP) - discrete units of work
    • Normalization improves I/O, but not efficient for reporting
  • Online Analytical Processing (OLAP) - users query data
    • Data source (e.g. on premise, cloud, SaaS) > ingestion > storage > processing > visualization
      • Ingestion - no aggregation or heavy calculations
      • Processing - Clean and transformation
        • Time/date/metric conversions, summarize/aggregation, standardized, missing values added, de-duplication
        • Possible Azure services:
          • Functions
          • Databricks
          • Cognitive services
  • Ingestion can be batch or streaming
  • Extract, Transform, Load (ETL)
    • Can be continuous pipeline, good for low dependency of items
    • Sensitive data can be removed
  • Extract, Load, Transform (ELT)
    • Data is stored first
    • Good for complex model or processing that data depending on multiple items in database
    • Appropriate for cloud models
  • Pipelines
    • Azure Data Factory - pipelines to build data flows and handle ETL / ELT
  • Data Visualization
    • Should raise questions for business, support decision making, see history/trends
    • Implemented in PowerBI
  • Analytics, using statistics
    • Descriptive - what is happening
    • Diagnostic - root cause, anomalies
    • Predictive - what is likely to happen based on trends and patterns
    • Prescriptive - actions to take for a goal
    • Cognitive - self-learning, get conclusions based on existing analytics
  • Administrator
    • Database Management
    • Data Security
    • Backups
    • User Access
    • Monitors performance
    • Tools: Azure Data Studio, SQL Server Management Studio, Azure Portal/CLI
  • Engineer
    • Pipelines, processing
    • Ingestion
    • Prepare data for analytics
    • Tools: Azure Synapse Studio (portal, Azure Data Factory, SQL/Spark pools), SQL Server Management Studio, Azure Portal/CLI
  • Analyst
    • Provides insights into the data
    • Visual Reporting
    • Modeling Data for Analysis
    • Combines data for visualization and analysis
    • Tools: PowerBI server/desktop, report builder
  • Tables, using a model
    • Data is stored in a table
    • Table consists of rows and columns
    • All rows have same # of columns
    • Each column is defined by a datatype
  • Primary and Foreign keys
    • Primary - unique key in table
    • Foreign key - reference a primary key from another table
  • Index
    • Like index at end of book to quickly find pages for a topic
    • Optimizes search queries for faster data retrieval
    • Reduces the amount of data pages that need to be read to retrieve the data in a SQL Statement
    • Data is retrieved by joining tables together in a query
    • Consumes storage space and CRUD operations needs index maintenance and can impact DB operations.
      • Infrequently accessed but often queried data, indexes can speed up operations
      • Frequently CRUD tables, but infrequently queried data, indexes slow down operations
  • View
    • Custom table consisting of data from other tables
  • Entities
    • Non-relational data, no schema on data. Non-relational collections:
      • Can have multiple entities in same collection or container with different fields
      • Have different non-tabular schema
      • Are often defined by labeling each field with the name it represents (key-value pairs)
    • Queries can be made more complex as all entities must be parsed for non-standard fields, e.g. parse for unique key value
    • Example: Azure Cosmos DB
  • Non-relational data, semi-structured data
    • Data structure is defined within the actual data by fields. Format/file types include:
      • JSON
      • AVRO - Apache, row based, header has structure, body is JSON, good for data compression
      • ORC - data in columns rather than rows, Apache Hive, strip of data with an index, footer has statistical information on strip of data
      • Parquet - Cloudera/Twitter, columnar data format, metadata describing rows, chunk of data, designed for nested data types and good for encoding and compression
  • NoSQL
    • Loosely means non-relational data
    • Categories:
      • Key-value stores
        • Tool must parse contents and return
        • Usually restricted to insert and delete, updates are made in memory of client
        • Good for streaming large data amounts
        • Can be implemented by Azure Cosmos DB
      • Document based
        • Each document has unique ID
          • Good for queries on ID and indexed fields of documents and for streaming
        • Document has all data for an entity that in a relational data would be spread across multiple tables
        • May be stored as JSON, YAML, XML, binary JSON, text, and others
        • Good for data flexibility and frequent requirements changes
        • Can be implemented by Azure Cosmos DB via core SQL API
      • Column family databases
        • e.g. ORC, Parquet files. Similar to row based tables
        • Tabular data but with column families
          • E.g. person database with address information and family information
        • Popular is Apache Cassandra, implemented by in Azure Cosmos DB Cassandra API
      • Graph databases
        • Stores entities centric around relationships
        • Good for:
          • OLTP apps with highly correlated data
          • Easy updates to single or many objects
          • Flexible data modelling
          • Data requirements that evolve
          • Hierarchical data structures
            • e.g. people organization charts
        • In SQL would be expensive to query with recursion and joins.
        • Implemented as Azure Cosmos DB Gremlin API
  • Can be stored in Binary Live Object (BLOB) storage
  • Does not naturally contain fields:
    • Examples: video, audio, media streams, documents
    • Often used to extract data form and categorize or identify ”structures” to produce file’s metadata
    • Frequently used in combination with Machine Learning or Cognitive Services capabilities to “extract data” by using:
      • Text Analytics
      • Sentiment Analysis with Cognitive APIs
      • Vision API
  • SQL Server on Azure VMs - IaaS
    • OS level access
    • Higher administration time and dedicated hosting costs
    • Features:
      • SQL, OS versions, Windows, Linux, SSRS and SQL server services
      • Lift and shift to Azure
      • Licensing and edition flexibility
      • Memory, storage optimization
      • VNET integration with on premise, Advanced data security
      • High availability and Disaster Recovery (HADR) services
  • Azure SQL Managed Instance (MI) - PaaS
    • Compatibility with SQL server and VNET support
    • Control of SQL instance, multiple databases
    • Automates monitoring, backup
  • Azure SQL Database - PaaS
    • Good for latest SQL versions, variable loads, scalable loads
    • Features:
      • Elastic compute, storage
      • Private link supported
      • High availability 99.995% and 5 second RPO and 30 second RTO
      • Point in time restore
      • Most cost efficient
      • Integrates with Azure security services
  • Azure database for MySQL, Maria DB - PaaS
  • PaaS is lower cost and administration, but less control and customization (scripts, modules). If customization needs, use IaaS
  • PostgreSQL is the popular database for modern apps
    • Good for geometric data
    • Query language: pgSQL
    • Community version
    • Deployment options: single server, hyperscale (HA and scalable)
    • Compatible with PostgreSQL admin tool
  • MySQL for LAMP stack
    • Free community edition
    • Standard and Enterprise editions give performance, security improvements
  • MariaDB is community fork of MySQL with focus on user community
    • Re-written MySQL
    • Compatible with Oracle DB
    • Supports temporal data (history)
  • Fully managed services, integration with Azure services
  • High availability and low total cost of ownership
  • Configurable and automated performance, scaling, security, and compliance

In Azure Portal:

  • Basics > network connectivity > additional settings (data source, collation, time zone, advanced security and threat protection ) > tags > review and create
    • Basics = subscription, resource group, service config, name, admin credentials, region, pools, compute & storage, pricing
    • Network connectivity = publie/private, VNET, firewall, connection type
      • Azure SQL uses 1433
      • Need to add IP addresses of database clients
        • 0.0.0.0 allows all Azure services
        • Range 0.0.0.0 to 255.255.255.255 opens to all internet
        • Connection policy modes:
          • Redirect
            • Service tags can allow connectivity from certain Azure services
            • Recommended with policy restriction on connectors
          • Proxy
            • Connections proxies via Azure proxy gateway
            • All packets flows via gateway
          • Default:
            • Redirect all client connections in Azure (low latency)
            • Proxy all external connections
  • Network
    • Firewall rules
    • Virtual Network Rules
    • Private Link, private endpoint - Private IP from VNET
    • Can use Azure Gateway or ExpressRoute
    • VNET pairing to group VNETs and their resources
  • Authentication and Access Control
    • Integration with Azure AD
    • Server-level principle for logical server for DB
    • Support SSO, token based authentication, MFA
  • Role Based Access Control
    • Control Azure operations through Azure RBAC
    • Scoping in subscription, resource group, resource
    • Security principle - service user with role
  • Azure DB read replicas
    • Supports replication from master server to up to 5 replicas
    • Use cases: replica for analytics, DR in different region, good for read only workloads not CRUD, not optimized for write
    • Billed for storage
  • SQL is a standard language for use with relational databases

  • SQL standards are maintained by ANSI and ISO

  • Proprietary RDBMS systems have their own extensions/variants of SQL such as T—SQL (Microsoft), PL/SQL (Oracle), pgSQL (PostgreSQL)

  • SQL Statement Types

    • Data Manipulation Language - Used to query and manipulate data
      • SELECT - It retrieves data from a database
      • INSERT - It inserts data into a table
      • UPDATE - It updates existing data within a table
      • DELETE - It deletes all records from a table, the space for the records remain
      • MERGE - UPSERT operation (insert or update)
      • CALL - It calls a PL/SQL or Java subprogram
      • EXPLAIN PLAN - It explains the access path to data
      • LOCK TABLE - It controls concurrency
    • Data Definition Language - Used to define database objects
      • CREATE - used to create objects in the database
      • ALTER - used to alters the structure of the database
      • DROP - used to delete objects from the database
      • TRUNCATE - used to remove all records from a table, including all spaces allocated for the records are removed
      • COMMENT - used to add comments to the data dictionary
      • RENAME - used to rename an object
    • Data Control Language - used to control privilege in databases.
      • SYSTEM - creating a session, table, etc. are all types of system privilege.
      • OBJECT - any command or query to work on tables comes under object privilege. DCL is used to define two commands. These are:
      • GRANT - It gives user access privileges to a database.
      • REVOKE - It takes back permissions from the user.
  • Microsoft SQL
    • Azure Portal: On Azure SQL resource page, use query editor to execute queries in the web interface
    • SSMS
    • SQL Server Data Tools
    • Azure Data Studio
    • SQL CMD
    • Azure CLI/cloud shell
  • PostgreSQL
    • PG admin
  • MySQL
    • Azure cloud shell
    • MySQL workbench
  • Key-value store in Azure Storage account
  • Key = like row’s main key
  • Value = “columns”, but flexible
  • Example: key = customer ID, value = all data about that customer
  • No relationships, primary keys, stored procedures. Data is de-normalized.
  • Good for IoT systems for quick storage
  • Stored under Azure storage account
  • Block blobs - data stored as blocks, good for infrequent large discrete binary objects
  • Page blobs - collection fixed sized 512 byte pages, good for virtual disks for VMs
  • Append blobs - block blob to optimize append operations
  • Hot tier - frequent access, e.g. web access
  • Cool tier - infrequent access
  • Archive tier - slow retrieval in order of hours
  • Geo-replication supported
  • File shares
  • Can synchronize with local file shares
  • Geo-replication and encrypted is supported
  • Partitions of documents, document database, partition contains documents that share a partition key and are related to each other
  • Documents can refer to Azure Blob storage
  • Automatic indexes of document IDs, document fields. Allows queries of fields.
  • Configured as storage containers
  • Azure Portal: Data Explorer in resource allows CRUD on database like create databases, manage containers, queries
  • APIs and NoSQL functionality:
    • SQL
    • MongoDB
    • Gremlin (graph)
    • Table
    • Cassandra (columnar)
    • Allows switching between Cosmos features and other DB functionalities
  • Web, retail, marketing. Less than 10ms response time
  • Gaming - cloud storage of in-game stats, social media, leader boards = millisecond response times
  • Internet of Things (IoT) - ingest sensor real time data
  • Azure Portal
    • Provisioned throughput = dedicated resources, serverless = on demand and good for development
    • Multi-region writes = multiple master database around world
    • Data encryption = Microsoft manager or bring you own key
  • Azure Command Line Interface (CLI)
  • PowerShell
  • ARM templates (JSON code)
  • Access tiers hot/cool can be selected
  • Data lake storage gen2 - allows SA to be a data lake

Import data to Azure Cosmos DB from sources like:

  • JSON files
  • MongoDB
  • SQL Server
  • CSV files
  • Azure Table storage
  • Amazon DynamoDB
  • HBase
  • Azure Cosmos containers
  • User: password + option multifactor authentication (MFA), authentication token given to be used with application
  • Updates are asynchronous

Within a single region, Cosmos DB uses a cluster of servers. This approach helps to improve scalability and availability.

  1. Eventual
  2. Consistent Prefix - changes in order
  3. Session - changes in order
  4. Bounded Staleness - lag in read and write
  5. Strong - high latency, best consistency
  • Choose consistency to match requirements of application
  • Core SQL API allows dialect of SQL, like SELECT statements
SELECT <select_list>
[FROM <optional_from_specification)]
[WHERE <optional_filter_condition>]
[ORDER BY <optional_sort_specification>]
[JOIN <optional_join_specification>]
-- Example where c is a container
SELECT * FROM c
-- Display JSON data from container
SELECT * FROM c
WHERE c.Country == "Japan"
-- Dislay JSON data from records where field has Japan
-- Aggregation functions such as SUM, Average, Min, Max
SELECT COUNT(*) FROM Products p
SELECT SUM(p.quantity) FROM Products p
WHERE p.expired = 9
SELECT AVG(p.price) AS ‘Average Price’
FROM Products p
SELECT p1.ID, p.Name, p1.Description,
p1.Price FROM Products p1
WHERE p1.Price = (SELECT MIN(p2.Price) FROM
Product p2)
  • Manage BLOBs into containers, for example images container
  • Configure access: anonymous, public, authenticated, SAS token (with expiry dates)
  • SAS tokens can be configured for files
  • Create with access tier, storage quota
  • Manage files:
    • Desktop
    • az copy CLI command - specific source file, destination container/file

Ingest:

  • Azure Data Factory (ADF)
    • Data integration service to automate data moving and transformation
  • Databricks
    • Apache Spark based platform
      • Spark supports multiple languages (e.g. Python, Java, Scala, C#, R)
    • Uses Azure security
    • Integrates with Azure and PowerBI
    • Can process from different data sources

Model, & Serve:

  • Azure Synapse Analytics
    • Analytics engine
    • Synapse SQL, pipelines, Link, Studio, Spark
    • Transform data for analytics processing, generate reports

Visualize:

  • PowerBI can do visuals, ad hoc queries
  • Azure Active Directory = authentication for PowerBI and Analysis Services

Store:

  • Azure Data Lake Storage
    • Repository of data
    • Organizes data into directories
    • Support POSIX and RBAC permissions
    • Compatible with Hadoop distributed file system
  • Distributed processing compute Spark processing engine
  • Apache Kafka (messages), Hadoop (storage), Hive (query) supported
  • ADF - heterogenous ingestion
    • Linked server - data store, databricks
    • Trigger > Pipeline > Dataset
    • Uses parameters, integration runtime, control flow
  • PolyBase - file based data sources made to look like tables
  • SQL Server Integration Services (SSIS) - heterogenous - load, mine data, files

Example: Ingest data in Synapse for analytics processing

Section titled “Example: Ingest data in Synapse for analytics processing”
  • Create a storage account
    • Create a new file share, hot tier
    • Upload file
  • Create a ADF
  • Create an Azure Synapse Analytics workspace
    • Add SQL pool
    • At resource in Azure portal, open Synapse studio
      • Data > Databases
        • New SQL script, create a table
  • In ADF, set up new copy data
    • Select source as Azure File Storage and the file
    • Set destination as Synapse
    • Set up data to table mapping