Azure Dedicated SQL Pool (SQL DW) Basics

1. Loading sample tables from Adventure Works LT

1.1. This is a quick way to put some data into the dedicated SQL pool for the purpose of experimenting and learning

1.2. Pre-requisite: provision Azure SQL Database with sample database

1.2.1. To save costs, choose the Basic tier, which is limited to a 2 GB max size

1.3. Open Synapse Studio and create a linked service for the SQL Database holding the sample database

1.4. Create a pipeline to copy SalesLT.Customer table data into dedicated SQL pool

1.4.1. In Source settings, create a new dataset that uses the SQL Database linked service and points to the SalesLT.Customer table

1.4.1.1. In Sink settings, create a new dataset for Azure Synapse dedicated SQL pool, choose PolyBase as the copy method and select the option to auto-create the table

1.4.1.1.1. In Settings, enable the staging option, and using your integrated ADLS Gen2 storage account, set a staging path

1.5. Validate the pipeline

1.5.1. Gotcha! You might get a validation error if you don't specify a table name in the Azure Synapse dedicated SQL pool sink dataset

1.5.1.1. If required, just go back to the dataset, click the Edit checkbox and enter a table name for the dedicated SQL pool table, such as Staging.AW_Customer

1.6. Debug pipeline

1.7. Create a SQL script and run a simple Select statement against the new staging table to verify its contents
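
For example, a minimal check, assuming the staging table name Staging.AW_Customer used above:

SELECT TOP 10 * FROM Staging.AW_Customer;
SELECT COUNT(*) AS [RowCount] FROM Staging.AW_Customer;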

1.8. Repeat Copy activity process for any other tables you want (e.g. SalesLT.SalesOrderHeader)

2. Architecture

2.1. Dedicated SQL Pool has MPP architecture

2.1.1. MPP = Massively Parallel Processing

2.2. MPP systems provide practically limitless capacity for scaling out to meet increasing workload demands

2.3. MPP systems contrast with traditional SMP (Symmetric Multiprocessing) systems

2.3.1. SMP systems can be thought of as a single server with a single host OS, memory, local storage, etc.

2.3.1.1. To meet increased workload demands, SMP systems can be scaled up (adding more RAM, processors, etc.), but this always has a limit

2.3.2. MPP systems consist of multiple servers forming a clustered system, each having its own OS and resources

2.3.2.1. To increase capacity for higher workloads, we just add nodes (servers) to our cluster, in a scaling out process

2.4. Dedicated SQL Pool has a multi-node architecture consisting of a control node, multiple compute nodes and separate ADLS Gen2 storage

2.4.1. Applications connect to control node and issue T-SQL commands

2.4.2. The control node hosts the distributed query engine and optimises queries for parallel execution before distributing work to the compute nodes to do their work in parallel

2.4.3. Data Movement Service (DMS) is an internal system-level service that moves data between nodes as required for their workloads

2.5. Distributions

2.5.1. A distribution is the basic unit of storage and processing for parallel queries that run on distributed data

2.5.2. When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel

2.5.3. Each of the 60 smaller queries runs on one of the data distributions

2.5.4. Each Compute node manages one or more of the 60 distributions

2.5.4.1. On a cluster scaled out to max capacity, each compute node manages 1 distribution

2.5.4.2. On the smallest possible cluster, one compute node manages all 60 distributions

2.5.5. You can think of the 60 distributions as 60 databases, each holding the same set of tables but where each table is divided into 60 parts based on a distribution method

2.5.5.1. When inserting data into a table, rows are allocated to a distribution based on a distribution method specified in the original CREATE TABLE statement

2.5.5.1.1. The distribution is specified as part of the WITH clause in a CREATE TABLE statement

2.5.6. There are two main distribution methods: HASH and ROUND ROBIN (a syntax sketch follows these notes)

2.5.6.1. Hash

2.5.6.1.1. A hash distributed table can deliver the highest query performance for joins and aggregations on large tables

2.5.6.1.2. To shard data into a hash-distributed table, a hash function is used to deterministically assign each row to one distribution

2.5.6.1.3. The main gotcha with Hash distributed tables is data skew, which can result in table data being very unevenly distributed across the 60 distributions

2.5.6.2. Round Robin

2.5.6.2.1. A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads

2.5.6.2.2. A round-robin distributed table distributes data evenly across the table but without any further optimization

2.5.6.2.3. A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially

2.5.6.2.4. It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables

2.5.6.2.5. The main gotcha with Round Robin distributed tables is excessive data movement (via DMS)

2.5.6.2.6. Round robin is the default distribution if you do not specify one within your CREATE TABLE statement

2.5.6.2.7. Round robin can be specified explicitly in the WITH clause of a CREATE TABLE statement, or omitted from the WITH clause, in which case it is applied by default
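
A minimal syntax sketch of both methods (the table and column names here are hypothetical, not from the sample data above):

-- Hash distributed: a hash of ProductKey deterministically assigns each row to one of the 60 distributions
CREATE TABLE dbo.FactSales
(
    ProductKey int NOT NULL,
    OrderDateKey int NOT NULL,
    SalesAmount money NOT NULL
)
WITH ( DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX );

-- Round robin distributed: rows are spread evenly across the distributions with no placement guarantees
CREATE TABLE dbo.StageSales
(
    ProductKey int NOT NULL,
    OrderDateKey int NOT NULL,
    SalesAmount money NOT NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );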

2.6. Replicated tables

2.6.1. To mitigate performance issues caused by data movement, we have the option of replicated tables, which involves caching a full copy of a given table on each compute node

2.6.2. A replicated table provides the fastest query performance for small tables

2.6.3. Replicating a table removes the need to transfer data among compute nodes before a join or aggregation

2.6.4. The main gotcha with replicated tables is overhead on all table write operations and more storage consumed for all the different copies of the table

2.6.4.1. For this reason, replicated tables are only appropriate for relatively small tables <2GB in size

2.6.4.2. Most commonly replicated tables will be dimension tables in a data warehouse

2.6.5. We specify a table for replication via the WITH clause in a CREATE TABLE statement, using the DISTRIBUTION option

2.6.5.1. Example:

2.6.5.1.1. CREATE TABLE dbo.DimDate
(
    DateKey int NOT NULL,
    FullDateAlternateKey date NOT NULL,
    DayNameOfWeek nvarchar(10) NOT NULL,
    CalendarQuarter tinyint NOT NULL,
    CalendarYear smallint NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (DateKey)
);

2.6.6. The replication of the table to all compute nodes is only triggered when the table in question is used in a query, as prior to that it will exist with a round robin distribution

2.6.6.1. So, on the first query involving a replicated table, there will be data movement to copy the table to all compute nodes; thereafter there will be no data movement with respect to that table

3. Indexation

3.1. When considering indexation strategy for tables, we have 3 options for primary indexes: heap, clustered and clustered columnstore

3.1.1. Heap is generally the right choice for staging or temporary tables and small tables with small lookups

3.1.2. Clustered index is generally the right choice for tables up to 100 million rows

3.1.3. Clustered columnstore index is the right choice for tables over 100 million rows

3.1.3.1. Clustered columnstore is the default index applied if none is specified in the WITH clause of the CREATE TABLE statement

3.1.4. Syntax

3.1.4.1. Clustered columnstore index (explicit syntax)

3.1.4.1.1. CREATE TABLE myTable ( id int NOT NULL, lastName varchar(20), zipCode varchar(6) ) WITH ( CLUSTERED COLUMNSTORE INDEX );

3.1.4.2. Clustered (rowstore) index

3.1.4.2.1. CREATE TABLE myTable ( id int NOT NULL, lastName varchar(20), zipCode varchar(6) ) WITH ( CLUSTERED INDEX (id) );

3.1.4.3. Heap (no index)

3.1.4.3.1. CREATE TABLE myTable ( id int NOT NULL, lastName varchar(20), zipCode varchar(6) ) WITH ( HEAP );

3.1.4.4. Non-clustered (rowstore) index

3.1.4.4.1. CREATE INDEX zipCodeIndex ON myTable (zipCode);

4. PolyBase and external tables

4.1. PolyBase uses external tables to read from or write to external data sources using T-SQL, and to combine external and internal (i.e. normal) tables in familiar T-SQL queries

4.2. Using the CREATE EXTERNAL TABLE statement, we can import data from Hadoop, Azure Blob storage or ADLS Gen2 without needing an ETL tool

4.2.1. For an external table, only the table metadata is stored, along with basic statistics about the file or folder referenced in Azure Data Lake, Hadoop, or Azure Blob storage

4.2.2. No actual data is moved or stored when external tables are created

4.2.3. For best performance, if the external data source driver supports a three-part name, it is strongly recommended to provide the three-part name

4.2.3.1. This means database_name.schema_name.table_name after CREATE EXTERNAL TABLE

4.2.4. Text, nText and XML are not supported data types for columns in external tables for dedicated SQL pool

4.3. Example: leverage PolyBase to import delimited data file(s) stored in ADLS Gen2 into a SQL table

4.3.1. CREATE EXTERNAL TABLE [dbo].[DimProduct_external]
(
    [ProductKey] [int] NOT NULL,
    [ProductLabel] nvarchar NULL,
    [ProductName] nvarchar NULL
)
WITH
(
    LOCATION = '/DimProduct/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

CREATE TABLE [dbo].[DimProduct]
WITH ( DISTRIBUTION = HASH([ProductKey]) )
AS
SELECT * FROM [dbo].[DimProduct_external];

4.3.1.1. Note that the CTAS statement leverages PolyBase to import the data into a dedicated SQL pool table from where we can leverage the performance benefits of the pool's clustered columnstore indexation and MPP architecture

4.3.1.2. We can also reference the external table directly in T-SQL queries without importing the data if we want to

4.3.1.3. Reject parameters determine how PolyBase will handle dirty records it retrieves from the external data source

4.3.1.3.1. A data record is considered 'dirty' if its actual data types or the number of columns don't match the column definitions of the external table

4.3.1.3.2. REJECT_VALUE is a literal value, not a percentage

4.3.1.3.3. The PolyBase query will fail when the number of rejected rows exceeds reject_value

4.3.1.3.4. So in this example, the PolyBase query underpinning the CTAS statement will fail if the number of dirty records exceeds 0

4.3.1.3.5. You can also specify a directory location on the external data source to write the rejected rows and corresponding error file, which is passed via the REJECTED_ROW_LOCATION option

4.3.1.4. The LOCATION option can specify either a file or a folder path

4.3.1.4.1. If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its subfolders

4.3.1.4.2. Just like Hadoop, PolyBase doesn't return hidden folders, and it also doesn't return files whose names begin with an underscore (_) or a period (.)

4.4. CREATE EXTERNAL TABLE has the same pre-requisites as CETAS in order to create and use external tables

4.5. In ad-hoc query scenarios, such as a SELECT from an external table, PolyBase stores the rows that are retrieved from the external data source in a temporary table

4.5.1. After the query completes, PolyBase removes and deletes the temporary table. No permanent data is stored in SQL tables

4.6. Permissions required in dedicated SQL pool to create external tables and import that external data in via CTAS:

4.6.1. CREATE TABLE

4.6.2. ALTER ANY SCHEMA

4.6.3. ALTER ANY EXTERNAL DATA SOURCE

4.6.4. ALTER ANY EXTERNAL FILE FORMAT

4.7. Permissions required to create the master key, database scoped credential and external data source in dedicated SQL pool:

4.7.1. CONTROL DATABASE

5. Statistics

5.1. Originally in Azure SQL DW (at least around Jan 2019), automatic statistics on tables were not supported, so you had to create and update them manually; the AUTO_CREATE_STATISTICS option is now enabled at the database level by default

5.1.1. The following statements trigger the creation of statistics:

5.1.1.1. SELECT

5.1.1.2. INSERT-SELECT

5.1.1.3. CTAS

5.1.1.4. UPDATE

5.1.1.5. DELETE

5.1.1.6. EXPLAIN

5.1.1.6.1. when the statement contains a join or a predicate is detected

5.1.2. Statistics are not automatically created on temporary or external tables

5.1.3. When automatic statistics are created, they'll take the form: WA_Sys<8 digit column id in Hex>_<8 digit table id in Hex>

5.1.4. You can view already created stats by running the DBCC SHOW_STATISTICS command

5.1.4.1. DBCC SHOW_STATISTICS (<table_name>, <target>)
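
For example, using the hypothetical table and statistics names that appear in the statistics-creation examples later in this section:

DBCC SHOW_STATISTICS ('dbo.table1', col1_stats);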

5.2. Even with automatic statistics creation enabled at the database level, it is very important to update statistics; generally speaking, this should happen after each batch load

5.2.1. If you find it is taking too long to update all of your statistics, you may want to try to be more selective about which columns need frequent statistics updates

5.2.1.1. You will gain the most benefit by having updated statistics on columns involved in joins, columns used in the WHERE clause and columns found in GROUP BY

5.2.1.2. One best practice is to update statistics on date columns each day as new dates are added

5.3. Statistics management

5.3.1. The data load is when tables most frequently change their size, distribution of values, or both

5.3.2. As such, the load process is a logical place to implement some management processes

5.3.3. Guiding principles:

5.3.3.1. Ensure that each loaded table has at least one statistics object updated

5.3.3.1.1. This process updates the table size (row count and page count) information as part of the statistics update

5.3.3.2. Focus on columns participating in JOIN, GROUP BY, ORDER BY, and DISTINCT clauses

5.3.3.3. Consider updating "ascending key" columns such as transaction dates more frequently because these involve adding new values every day, which won't appear in the statistics histogram until an update

5.3.3.4. Consider updating static distribution columns less frequently

5.3.3.5. Remember, each statistic object is updated in sequence

5.3.3.5.1. Simply implementing UPDATE STATISTICS <TABLE_NAME> isn't always ideal, especially for wide tables with lots of statistics objects

5.4. Troubleshooting bad query performance

5.4.1. Out of date statistics on a table and/or one or more columns that are important for a given SQL query can lead to the query optimizer generating a sub-optimal execution plan

5.4.2. One of the first questions to ask when you're troubleshooting a query is, "Are the statistics up to date?"

5.4.2.1. Bear in mind that "old" statistics objects (i.e. last updated a relatively long time ago) for columns with static values and distributions can still represent "up to date" statistics

5.4.2.2. The following query can be useful (consider altering the WHERE clause if you rely mostly on auto statistics)

5.4.2.2.1. SELECT sm.[name] AS [schema_name],
       tb.[name] AS [table_name],
       co.[name] AS [stats_column_name],
       st.[name] AS [stats_name],
       STATS_DATE(st.[object_id], st.[stats_id]) AS [stats_last_updated_date]
FROM sys.objects ob
JOIN sys.stats st ON ob.[object_id] = st.[object_id]
JOIN sys.stats_columns sc ON st.[stats_id] = sc.[stats_id] AND st.[object_id] = sc.[object_id]
JOIN sys.columns co ON sc.[column_id] = co.[column_id] AND sc.[object_id] = co.[object_id]
JOIN sys.types ty ON co.[user_type_id] = ty.[user_type_id]
JOIN sys.tables tb ON co.[object_id] = tb.[object_id]
JOIN sys.schemas sm ON tb.[schema_id] = sm.[schema_id]
WHERE st.[user_created] = 1;

5.5. Creating statistics

5.5.1. Generally speaking, it's a good idea to leave auto statistics creation enabled on all the tables, but bear in mind that dedicated SQL pool will only create statistics objects on single columns as and when they appear in a query

5.5.2. We can manually create statistics objects on a table using this syntax:

5.5.2.1. CREATE STATISTICS [statistics_name] ON [schema_name].[table_name]([column_name]);

5.5.3. By default, a sampling rate of 20% of the table will be used in order to create a statistics object (a histogram of values)

5.5.3.1. Example:

5.5.3.1.1. CREATE STATISTICS col1_stats ON dbo.table1 (col1);

5.5.3.2. Note: for tables with more than 1 billion rows, a sampling of two percent is recommended

5.5.4. We can force a new statistics object to scan the whole table by adding the FULLSCAN option

5.5.4.1. CREATE STATISTICS col1_stats ON dbo.table1 (col1) WITH FULLSCAN;

5.5.5. We can specify a specific sampling size by adding the SAMPLE option

5.5.5.1. CREATE STATISTICS col1_stats ON dbo.table1 (col1) WITH SAMPLE = 50 PERCENT;

5.5.6. Filtered statistics can only be created manually and these can be useful to aid queries on partitioned tables

5.5.6.1. CREATE STATISTICS stats_col1 ON table1(col1) WHERE col1 > '2000101' AND col1 < '20001231';

5.5.7. Multi-column statistics can only be created manually and these can be useful when applied to correlated columns (think hierarchies) that are accessed at the same time in a query

5.5.7.1. CREATE STATISTICS stats_2cols ON table1 (product_category, product_sub_category) WHERE product_category > '2000101' AND product_category < '20001231' WITH SAMPLE = 50 PERCENT;

5.6. Updating statistics

5.6.1. We can update statistics on a single statistics object

5.6.1.1. Example:

5.6.1.1.1. UPDATE STATISTICS [dbo].[table1] ([stats_col1]);

5.6.1.2. By updating specific statistics objects, you can minimize the time and resources required to manage statistics

5.6.1.3. This action requires some thought for selecting the best statistics objects to update

5.6.2. We can update all statistics on a whole table

5.6.2.1. UPDATE STATISTICS dbo.table1;

5.6.2.2. Just remember that it updates all statistics on the table, prompting more work than is necessary

5.6.2.2.1. If performance isn't an issue, this method is the easiest and most complete way to guarantee that statistics are up to date

5.6.2.2.2. If the table is large and has many columns and many statistics, it might be more efficient to update individual statistics based on need

6. Scale Compute

6.1. The architecture of dedicated SQL pool (formerly SQL DW) separates storage and compute, allowing each to scale independently

6.2. You can scale compute to meet performance demands independent of data storage, and you can also pause and resume compute resources

6.3. Compute performance is measured in Data Warehouse Units (DWUs), which is an abstraction that represents a level of performance that combines CPU, memory, and IO

6.4. Scaling out or back can be done via the Portal, T-SQL, PowerShell or the REST API

6.4.1. DW100c is the minimum performance level, which provides a single compute node

6.4.1.1. Note: DW100c represents 100 DWUs

6.4.2. DW30000c is the current maximum performance level, which provides 60 compute nodes

6.5. To perform a scale out/back operation, dedicated SQL pool starts by killing active queries and rolling back open transactions, then detaches storage from compute, adds/removes compute nodes, and re-attaches storage to the compute layer

6.6. Scaling out/back with T-SQL

6.6.1. The service objective setting contains the number of data warehouse units for the dedicated SQL pool

6.6.1.1. For example, if your data warehouse database is named mySampleDataWarehouse, you can run the following T-SQL to check the current DWU performance level

6.6.1.1.1. SELECT db.name [Database],
       ds.edition [Edition],
       ds.service_objective [Service Objective]
FROM sys.database_service_objectives ds
JOIN sys.databases db ON ds.database_id = db.database_id
WHERE db.name = 'mySampleDataWarehouse';

6.6.1.2. You can scale out/back using a simple ALTER DATABASE statement, for example:

6.6.1.2.1. ALTER DATABASE mySampleDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW300c');

6.6.1.3. To check the status of your scale out/back operation, you can use sys.dm_operation_status, for example:

6.6.1.3.1. SELECT * FROM sys.dm_operation_status WHERE resource_type_desc = 'Database' AND major_resource_id = 'mySampleDataWarehouse'

7. Security

7.1. Connection security

7.1.1. Connection Security refers to how you restrict and secure connections to your database using firewall rules and connection encryption

7.1.2. Dedicated SQL pool (formerly SQL DW) uses server-level IP firewall rules

7.1.2.1. Azure SQL Server is a logical server, not a physical one, acting as a container for databases that you apply common settings to, such as security settings

7.1.2.2. In the context of dedicated SQL pool that is provisioned as part of Synapse, the logical server is the Synapse workspace, so it's the Synapse workspace where server-level IP firewall rules are set

7.1.2.3. It doesn't support database-level IP firewall rules

7.1.3. Connections to your dedicated SQL pool (formerly SQL DW) are encrypted by default

7.1.3.1. Attempts to modify connection settings to disable encryption are ignored

7.2. Authentication

7.2.1. Authentication refers to how you prove your identity when connecting to the database

7.2.2. Dedicated SQL pool (formerly SQL DW) currently supports SQL authentication with a username and password, and Azure Active Directory authentication

7.2.3. To create a SQL Server Authenticated user, connect to the master database on your server with your server admin login and create a new server login

7.2.3.1. It's a good idea to also create a user in the master database

7.2.3.1.1. Creating a user in master allows a user to log in using tools like SSMS without specifying a database name

7.2.3.1.2. It also allows them to use the object explorer to view all databases on a server

7.2.3.1.3. Example:
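
A minimal sketch (the login name, user name and password are hypothetical); run the first two statements in master, then create the user again in the dedicated SQL pool database before granting permissions:

-- In the master database
CREATE LOGIN ApplicationLogin WITH PASSWORD = 'Str0ng_Passw0rd!';
CREATE USER ApplicationUser FOR LOGIN ApplicationLogin;

-- In the dedicated SQL pool database
CREATE USER ApplicationUser FOR LOGIN ApplicationLogin;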

7.2.4. To give a user permission to perform additional operations such as creating logins or creating new databases, assign the user to the Loginmanager and dbmanager roles in the master database

7.2.5. For AAD authentication, you must first set an existing AAD user as the SQL Active Directory admin, which is done via Synapse workspace

7.2.5.1. Azure Active Directory authentication with Azure Synapse requires using contained database users based on an Azure AD identity

7.2.5.2. A contained database user does not have a login in the master database, and maps to an identity in Azure AD that is associated with the database

7.2.5.2.1. The Azure AD identity can be either an individual user account or a group

7.2.5.3. Database users (with the exception of administrators) cannot be created using the Azure portal

7.2.5.3.1. Azure roles are not propagated to the database

7.2.5.3.2. Access permission must be granted directly in the database using Transact-SQL statements

7.2.5.4. To create a contained database user for an AAD user or group, connect to the dedicated SQL pool using AAD

7.2.5.4.1. Typically you will make this AAD-authenticated connection using the SQL Active Directory admin user, but it can be any user that has been granted the ALTER ANY USER permission on the database

7.2.5.4.2. The syntax for creating a contained database user that maps to an AAD user, group or registered application is:
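
A minimal sketch of that syntax (the user and group names below are hypothetical):

CREATE USER [<Azure_AD_principal_or_group>] FROM EXTERNAL PROVIDER;

-- For example:
CREATE USER [alice@contoso.com] FROM EXTERNAL PROVIDER;   -- individual AAD user
CREATE USER [DataAnalysts] FROM EXTERNAL PROVIDER;        -- AAD group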

7.2.5.5. Removing the Azure Active Directory administrator for the server prevents any Azure AD authentication user from connecting to the server

7.3. Authorization

7.3.1. Authorization refers to what you can do within a database once you are authenticated and connected

7.3.2. Authorization privileges are determined by role memberships and permissions

7.3.3. Use sp_addrolemember stored proc to assign database users membership of database roles

7.3.3.1. Example:

7.3.3.1.1. EXEC sp_addrolemember 'db_datareader', 'ApplicationUser'; -- allows ApplicationUser to read data
EXEC sp_addrolemember 'db_datawriter', 'ApplicationUser'; -- allows ApplicationUser to write data

7.3.4. The server admin account is a member of db_owner, which has authority to do anything within the database

7.3.5. More granular permissions can be set using the GRANT statement

7.3.5.1. Example:

7.3.5.1.1. --CREATE SCHEMA Test
GRANT SELECT ON SCHEMA::Test TO ApplicationUser;

7.3.6. All database-level authorization for a dedicated SQL pool is added/removed via T-SQL statements - Azure RBAC plays no part in database-level permissions, as it only relates to resource provisioning and configuration, not data access

7.4. Encryption

7.4.1. Transparent Data Encryption (TDE) helps protect against the threat of malicious activity by encrypting and decrypting your data at rest

7.4.2. When you encrypt your database, associated backups and transaction log files are encrypted without requiring any changes to your applications

7.4.3. TDE encrypts the storage of an entire database by using a symmetric key called the database encryption key

7.4.4. TDE is not enabled by default in dedicated SQL pool - enable it via the pool's Security settings in the Azure Portal

7.5. Threat Detection

7.5.1. Advanced Threat Protection is part of the Azure Defender for SQL offering, which detects anomalous activities indicating unusual and potentially harmful attempts to access or exploit databases

7.5.2. Users receive an alert upon suspicious database activities, potential vulnerabilities, and SQL injection attacks, as well as anomalous database access and query patterns

7.5.3. Advanced Threat Protection integrates its alerts with Azure Security Center; the alerts include details of the suspicious activity and recommended actions for investigating and mitigating the threat

7.5.4. Off by default (as it's an optional service with an associated cost), you enable Azure Defender for SQL via the Synapse workspace

7.6. Auditing

7.6.1. Auditing for dedicated SQL pool tracks database events and writes them to an audit log in your Azure storage account, Log Analytics workspace, or Event Hubs

7.6.2. Helps you maintain regulatory compliance, understand database activity, and gain insight into discrepancies and anomalies that could indicate business concerns or suspected security violations

7.6.3. Auditing limitations

7.6.3.1. Hierarchical namespace for Azure Data Lake Storage Gen2 storage account is currently not supported

7.6.3.1.1. A regular Azure Blob Storage account is fine as a destination for the audit logs

7.6.3.2. Enabling auditing on a paused dedicated SQL pool is not supported - to enable auditing, resume compute

7.6.4. Audit logs are written to Append Blobs in an Azure Blob storage on your Azure subscription

7.6.5. Audit logs are in .xel format and can be opened by using SQL Server Management Studio (SSMS)

7.6.6. When using Azure AD Authentication, failed login records will not appear in the SQL audit log

7.6.6.1. To view failed login audit records, you need to visit the Azure Active Directory portal, which logs details of these events

7.6.7. An auditing policy can be defined for a specific database or as a default server policy in Azure

7.6.7.1. Auditing can be enabled at the dedicated SQL pool (database) level via Security in the Portal

7.6.7.2. Auditing can be enabled at the Synapse workspace (server) level via Security in the Portal

8. Now part of Synapse Analytics

8.1. Synapse Analytics is a new (as of January 2021) unified analytics platform that integrates multiple existing (and previously disparate) technologies, including Azure SQL Data Warehouse, Azure SQL (serverless), Spark Pools (for Databricks-style notebooks), Data Factory, ADLS Gen2 storage, etc.

8.2. Microsoft is currently (Jan 2021) providing these dedicated SQL pools via two separate services:

8.2.1. Dedicated SQL pools (formerly SQL DW)

8.2.1.1. Old mode where the SQL DW lives in an Azure SQL DB Server that can be shared with regular Azure SQL databases

8.2.1.2. Internally it is a resource of type Microsoft.Sql

8.2.2. Azure Synapse Analytics - Dedicated SQL pool

8.2.2.1. SQL DW database inside a workspace

8.2.2.2. Internally it is a resource of type Microsoft.Synapse

8.2.2.3. This mind map is based on looking at the dedicated SQL pool from inside Synapse workspace, but functionality is the same

9. Provisioning decisions

9.1. Synapse Analytics Workspace

9.1.1. Basics

9.1.1.1. Choose subscription and resource group

9.1.1.2. Set workspace name, region and ADLS Gen2 account to integrate with Synapse workspace

9.1.1.2.1. e.g. workspace name = argento-synapse-workspace

9.1.1.2.2. e.g. filesystem name = synapse

9.1.2. Security

9.1.2.1. Set SQL Admin username and password

9.1.2.1.1. Defines admin credentials for the Synapse workspace SQL pools

9.1.2.2. Set workspace encryption options

9.1.2.2.1. There is a (non-default) option to enable double encryption for the workspace

9.1.2.2.2. The Synapse workspace will automatically get a system-assigned managed identity created in Azure Active Directory

9.1.2.2.3. There is a (default) option to allow pipelines to access SQL pools

9.1.3. Networking

9.1.3.1. Set networking options

9.1.3.1.1. There is a (default) option to allow connections from all IP addresses

9.1.3.1.2. There is a (non default) option to enable a managed virtual network

9.1.4. Your starting point for a newly provisioned Synapse Analytics workspace will be a single built-in serverless SQL pool

9.1.4.1. I believe the idea of the serverless SQL pool is to allow you to query structured files stored in ADLS Gen2 using T-SQL and pay just for the compute consumption triggered by those queries

9.1.4.2. A new Synapse Analytics workspace will have no dedicated SQL pools or Apache Spark pools - these must be added as required

9.2. Dedicated SQL pool

9.2.1. Basics

9.2.1.1. Set dedicated SQL pool name and performance level

9.2.1.1.1. Naming options are currently restrictive (no special characters and limited to 15 chars) - I went with ArgentoSQLPool

9.2.1.1.2. The lowest possible performance level, to minimise costs, is DW100c (approximately £1.24 per hour)

9.2.2. Additional Settings

9.2.2.1. Choose whether or not to use existing data and set collation

9.2.2.1.1. The default option for "Use existing data" is "None" - the other two options (Backup and Restore point) depend on you having previously run a dedicated SQL pool or SQL Data Warehouse

9.2.2.1.2. The default collation is SQL_Latin1_General_CP1_CI_AS

10. CTAS & CETAS

10.1. Create Table As Select (CTAS)

10.1.1. CTAS is a more customizable version of the SELECT...INTO statement

10.1.2. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement

10.1.3. CTAS is the simplest and fastest way to create and insert data into a table with a single command

10.1.3.1. CTAS is fast because it is minimally logged and a parallel operation that leverages the MPP architecture

10.1.4. Example:

10.1.4.1. CREATE TABLE [dbo].[FactInternetSales_new]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM [dbo].[FactInternetSales];

10.1.5. You can also use CTAS to work around a number of the unsupported features

10.1.5.1. Unsupported operations in dedicated SQL pool include ANSI JOINs on UPDATE & DELETE statements, and the MERGE statement

10.1.5.2. Try to think "CTAS first." Solving a problem by using CTAS is generally a good approach, even if you're writing more data as a result

10.1.5.2.1. For example, if your use case would normally involve updating table A via a join to tables B and C, break this into three statements as follows:
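
A minimal sketch of that pattern, grouped into three steps (the table and column names are hypothetical):

-- 1. CTAS a new version of TableA, with the "updated" values coming from the join
CREATE TABLE dbo.TableA_new
WITH ( DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT a.ProductKey,
       b.NewPrice AS Price,   -- the value an UPDATE ... JOIN would have written
       a.OrderQuantity
FROM dbo.TableA a
JOIN dbo.TableB b ON a.ProductKey = b.ProductKey
JOIN dbo.TableC c ON b.CategoryKey = c.CategoryKey;

-- 2. Swap the new table in place of the old one
RENAME OBJECT dbo.TableA TO TableA_old;
RENAME OBJECT dbo.TableA_new TO TableA;

-- 3. Drop (or keep for rollback) the old copy
DROP TABLE dbo.TableA_old;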

10.1.6. CTAS is recommended as the best way to load external data into dedicated SQL pool using Polybase

10.2. Create External Table As Select (CETAS)

10.2.1. Creates an external table and then exports, in parallel, the results of a Transact-SQL SELECT statement to Hadoop or Azure Blob storage

10.2.2. Example:

10.2.2.1. CREATE EXTERNAL TABLE hdfsCustomer
WITH
(
    LOCATION = '/pdwdata/customer.tbl',
    DATA_SOURCE = customer_ds,
    FILE_FORMAT = pipe_delimited
)
AS
SELECT * FROM dbo.dimCustomer;

10.2.2.1.1. Pre-requisites:
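
These are the database master key, a database scoped credential (when the storage is not public), the external data source and the external file format referenced in the statement above. A minimal sketch (the credential name, storage account and secret values are hypothetical; customer_ds and pipe_delimited match the example above):

-- Master key (once per database) to protect the credential
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Str0ng_Passw0rd!';

-- Credential used to access the storage account
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- External data source pointing at the storage location
CREATE EXTERNAL DATA SOURCE customer_ds
WITH ( TYPE = HADOOP,
       LOCATION = 'abfss://synapse@mystorageaccount.dfs.core.windows.net',
       CREDENTIAL = ADLSCredential );

-- External file format describing the delimited files
CREATE EXTERNAL FILE FORMAT pipe_delimited
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' ) );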

11. Partitioning

11.1. In addition to distributions (which divide each table into 60 buckets/databases), we also have table partitioning available

11.2. In most cases, table partitions are created on a date column

11.3. Partitioning is supported on all dedicated SQL pool table types, including clustered columnstore, clustered index, and heap

11.4. Partitioning is also supported on all distribution types, including both hash and round robin distributed tables

11.5. The primary benefit of partitioning in dedicated SQL pool is to improve the efficiency and performance of loading data by use of partition deletion, switching and merging

11.5.1. In most cases data is partitioned on a date column that is closely tied to the order in which the data is loaded into the SQL pool

11.5.2. Partition switching can be used to quickly remove or replace a section of a table, and this operation benefits from the avoidance of transaction logging

11.6. Partitioning can also be used to improve query performance

11.6.1. A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions

11.6.1.1. This method of filtering can avoid a full table scan and only scan a smaller subset of data

11.6.2. With the introduction of clustered columnstore indexes, the predicate elimination performance benefits are less beneficial, but in some cases there can be a benefit to queries

11.7. Creating a table with too many partitions can hurt performance under some circumstances, especially so for clustered columnstore tables

11.7.1. When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition

11.7.2. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed

11.7.2.1. Remember that before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases

11.8. Dedicated SQL pool introduces a way to define partitions that is simpler than SQL Server

11.8.1. Partitioning functions and schemes are not used in dedicated SQL pool as they are in SQL Server

11.8.2. example:

11.8.2.1. CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey] int NOT NULL,
    [OrderDateKey] int NOT NULL,
    [OrderQuantity] smallint NOT NULL,
    [SalesAmount] money NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
                (20000101, 20010101, 20020101, 20030101, 20040101, 20050101) )
);

11.8.2.1.1. Note how the partition is defined in the WITH clause of the CREATE TABLE statement; combined with the ease of CTAS, this makes it straightforward to implement partition switching on massive (multi-billion-row) fact tables

11.9. Partition switching

11.9.1. Dedicated SQL pool supports partition splitting, merging, and switching

11.9.1.1. Each of these functions is executed using the ALTER TABLE statement

11.9.2. The most efficient method to split a partition that already contains data is to use a CTAS statement

11.9.2.1. Imagine FactInternetSales is defined as above, contains data in its 7th partition (>= 20050101), and we want to split that partition so the new year commences with 20060101 (a T-SQL sketch follows these steps)

11.9.2.1.1. 1. Use CTAS to create an empty version of fact table with the same partitioning

11.9.2.1.2. 2. Switch the 7th partition from fact table into our temporary table (created by CTAS)

11.9.2.1.3. 3. Split the 7th partition in fact table to include the new period (20060101, which will be partition 8)

11.9.2.1.4. 4. Use CTAS to create a 2nd temporary table from the first, filtering it to the period for partition 7 (year 2005)

11.9.2.1.5. 5. Switch the 7th partition from the 2nd temporary table back in to the original fact table, which now has a full partition 7 (for year 2005) and a new empty partition 8 (ready for year 2006)

11.9.2.1.6. 6. Drop temporary tables

11.9.2.1.7. 7. Update statistics on fact table
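
A T-SQL sketch of these steps, reusing the FactInternetSales definition above (the temporary table names are hypothetical):

-- 1. Empty copy of the fact table with the same distribution, index and partitioning
CREATE TABLE dbo.FactInternetSales_out
WITH ( CLUSTERED COLUMNSTORE INDEX,
       DISTRIBUTION = HASH([ProductKey]),
       PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
                   (20000101, 20010101, 20020101, 20030101, 20040101, 20050101) ) )
AS SELECT * FROM dbo.FactInternetSales WHERE 1 = 2;

-- 2. Switch partition 7 (>= 20050101) out of the fact table
ALTER TABLE dbo.FactInternetSales SWITCH PARTITION 7 TO dbo.FactInternetSales_out PARTITION 7;

-- 3. Split the now-empty partition 7 to add the 20060101 boundary
ALTER TABLE dbo.FactInternetSales SPLIT RANGE (20060101);

-- 4. Re-create the 2005 data against the new partition layout
CREATE TABLE dbo.FactInternetSales_in
WITH ( CLUSTERED COLUMNSTORE INDEX,
       DISTRIBUTION = HASH([ProductKey]),
       PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
                   (20000101, 20010101, 20020101, 20030101, 20040101, 20050101, 20060101) ) )
AS SELECT * FROM dbo.FactInternetSales_out
   WHERE OrderDateKey >= 20050101 AND OrderDateKey < 20060101;

-- 5. Switch the 2005 data back into partition 7 of the fact table
ALTER TABLE dbo.FactInternetSales_in SWITCH PARTITION 7 TO dbo.FactInternetSales PARTITION 7;

-- 6. Drop the temporary tables
DROP TABLE dbo.FactInternetSales_out;
DROP TABLE dbo.FactInternetSales_in;

-- 7. Update statistics on the fact table
UPDATE STATISTICS dbo.FactInternetSales;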

11.10. Partition sizing

11.10.1. When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition

11.10.2. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed

11.10.2.1. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases

11.10.3. Any partitioning added to a table is in addition to the distributions created behind the scenes

11.10.4. If a fact table is given 36 monthly partitions, and given that dedicated SQL pool has 60 distributions, then when the table is full you would expect it to hold 1 million x 60 x 36 rows, which is roughly 2.1 billion rows

11.10.4.1. If your "full" table size falls short of this 2.1 billion row mark, consider using coarser partitioning ranges to reduce the number of partitions

11.11. Partitioning source control (e.g. Git) idea

11.11.1. Idea is to create fact tables as partitioned but with no partition values

11.11.2. You have a T-SQL deployment script that is scheduled for automated execution within your deployment pipeline, which defines the data partitions and runs some dynamic SQL to alter the fact table(s) and split the ranges appropriately

12. Clustered columnstore indexes

12.1. Columnstore indexes are the standard for storing and querying large data warehousing fact tables

12.2. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage

12.2.1. You can also achieve gains up to 10 times the data compression over the uncompressed data size

12.3. Clustered columnstore indexes define the way the entire table is stored on disk, and we can understand this at a high level by imagining a process happening against our common understanding of a table

12.3.1. 1. Imagine a sales fact table with a number of columns and further imagine that this table holds over 100 million rows

12.3.1.1. 2. The table is horizontally divided into rowgroups of 1 million rows each

12.3.1.1.1. 3. Each rowgroup is divided vertically by column, and each of these separate column storage structures within each rowgroup is called a segment

12.4. When a query occurs against a table with a clustered columnstore index, the query optimizer can realise performance gains via rowgroup elimination and segment elimination

12.4.1. Only the segments from specific rowgroups are retrieved from disk into the buffer pool to satisfy the query

12.5. To reduce fragmentation of the column segments and improve performance, the columnstore index might store some data temporarily in a clustered index called a deltastore, plus a B-tree list of IDs for deleted rows

12.5.1. A deltastore is a collection of one or more delta rowgroups, and there is only one deltastore per clustered columnstore index

12.5.1.1. A delta rowgroup is a clustered B-tree index that's used only with columnstore indexes

12.5.1.1.1. It improves columnstore compression and performance by storing rows until the number of rows reaches a threshold (1,048,576 rows), at which point they are moved into the columnstore

12.5.1.1.2. When a delta rowgroup reaches the maximum number of rows, it transitions from an OPEN to CLOSED state

12.5.1.1.3. A background process called the tuple-mover automatically detects CLOSED delta rowgroups and moves each of them into a new COMPRESSED rowgroup in the columnstore

12.5.2. To return the correct query results, the clustered columnstore index combines query results from both the columnstore and the deltastore

12.5.3. During a large bulk load, most of the rows go directly to the columnstore without passing through the deltastore

12.5.3.1. Some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup, which is 102,400 rows

12.5.3.1.1. As a result, the final rows go to the deltastore instead of the columnstore

12.5.3.2. For small bulk loads with less than 102,400 rows, all of the rows go directly to the deltastore

12.6. Having too many small rowgroups decreases the columnstore index quality

12.6.1. This can happen when a lot of rows originally stored in the columnstore have been deleted

12.6.2. Up to and including SQL Server 2017 (14.x), a reorganize operation is required to merge smaller COMPRESSED rowgroups; it follows an internal threshold policy that determines how to remove deleted rows and combine compressed rowgroups

12.6.2.1. Starting with SQL Server 2019 (15.x), a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted

12.6.2.2. Not sure what applies to dedicated SQL pool - it may require a reorganize operation at the time of writing (Jan 2021), but one can imagine it will benefit from the background merge task at some point

12.6.3. sys.dm_db_column_store_row_group_physical_stats is the main DMV needed to check what's going on with the rowgroups in your clustered columnstore index

12.6.3.1. A good idea is to create a view over this DMV to check the current quality of your clustered columnstore index; a simplified sketch follows
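
A minimal sketch of such a check, written against the DMV named above (dedicated SQL pool may expose the per-node variant, sys.dm_pdw_nodes_db_column_store_row_group_physical_stats, so adjust the DMV name as needed):

SELECT sm.name AS schema_name,
       tb.name AS table_name,
       rg.partition_number,
       rg.state_desc,
       COUNT(*)             AS rowgroup_count,
       SUM(rg.total_rows)   AS total_rows,
       SUM(rg.deleted_rows) AS deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats rg
JOIN sys.tables tb ON rg.object_id = tb.object_id
JOIN sys.schemas sm ON tb.schema_id = sm.schema_id
GROUP BY sm.name, tb.name, rg.partition_number, rg.state_desc
ORDER BY sm.name, tb.name, rg.partition_number;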

13. Workload management

13.1. Running mixed workloads can pose resource challenges on busy systems, and ways are needed to separate certain activities (such as ETL processes) that have SLAs

13.2. A data warehouse workload refers to all operations that transpire in relation to a data warehouse

13.2.1. The entire process of loading data into the warehouse

13.2.2. Performing data warehouse analysis and reporting

13.2.3. Managing data in the data warehouse

13.2.4. Exporting data from the data warehouse

13.3. The performance capacity of a data warehouse is determined by the data warehouse units

13.4. Resource classes

13.4.1. In the past, for Synapse SQL in Azure Synapse you managed the query performance through resource classes

13.4.2. Resource classes enable compute resource assignment (CPU, memory, I/O) to a query based on the user's database role membership

13.4.2.1. Resource classes alone do not provide the required governance for managing workloads and SLAs, which is why workload classification, importance and isolation were introduced

13.4.3. There are 8 static resource classes that are implemented with these pre-defined database roles:

13.4.3.1. staticrc10

13.4.3.1.1. Queries running under users with this role membership always consume 1 concurrency slot across all service objectives (i.e. DW100c to DW30000c)

13.4.3.2. staticrc20

13.4.3.2.1. Queries running under users with this role membership always consume 2 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.3.3. staticrc30

13.4.3.3.1. Queries running under users with this role membership always consume 4 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.3.4. staticrc40

13.4.3.4.1. Queries running under users with this role membership consume between 4 and 8 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.3.5. staticrc50

13.4.3.5.1. Queries running under users with this role membership consume between 4 and 16 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.3.6. staticrc60

13.4.3.6.1. Queries running under users with this role membership consume between 4 and 32 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.3.7. staticrc70

13.4.3.7.1. Queries running under users with this role membership consume between 4 and 64 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.3.8. staticrc80

13.4.3.8.1. Queries running under users with this role membership consume between 4 and 128 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.4. There are 4 dynamic resource classes that are implemented with these pre-defined database roles:

13.4.4.1. smallrc

13.4.4.1.1. Queries running under users with this role membership consume between 1 and 36 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.4.2. mediumrc

13.4.4.2.1. Queries running under users with this role membership consume between 1 and 120 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.4.3. largerc

13.4.4.3.1. Queries running under users with this role membership consume between 1 and 264 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.4.4. xlargerc

13.4.4.4.1. Queries running under users with this role membership consume between 1 and 840 concurrency slots across all service objectives (i.e. DW100c to DW30000c)

13.4.5. Operations that are typically resource-heavy, such as SELECT, INSERT, DELETE and CREATE INDEX, are governed by resource classes

13.4.6. DDL and sys admin operations always run via smallrc even when run under users with membership of higher resource classes

13.4.7. Users can be members of multiple resource classes (see the membership example after these notes)

13.4.7.1. Dynamic resource classes take precedence over static resource classes

13.4.7.2. Larger resource classes take precedence over smaller resource classes
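
Since resource classes are just pre-defined database roles, membership is managed with sp_addrolemember / sp_droprolemember (the user name here is hypothetical):

-- Run heavy loads for this user with a larger memory grant
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Revert the user to the default (smallrc) behaviour
EXEC sp_droprolemember 'largerc', 'LoadUser';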

13.5. Concurrency slots

13.5.1. Concurrency slots are a convenient way to track the resources available for query execution; they are like tickets that you purchase to reserve seats at a concert because seating is limited

13.5.2. The number of concurrency slots available depends on DWUs (service objective) + resource class

13.5.2.1. For static resource classes you can have up to 128 concurrent queries and with dynamic resource classes the concurrent query limit is 32

13.6. Workload classification and groups

13.6.1. Workload groups replace resource classes

13.6.1.1. The old built-in resource class database roles remain but these have also become reserved names for pre-defined workload groups

13.6.1.1.1. Additionally, we can create custom workload groups using the CREATE WORKLOAD GROUP statement

13.6.1.1.2. The idea is to support resource classes for backward compatibility but once you start defining workload classifiers, you are expected to drop the user mappings to these database roles using sp_droprolemember

13.6.1.2. With workload groups we get to reserve minimum resources expressed as a percentage of total capacity

13.6.1.3. Query assignment to workload groups is based on workload classification, which is defined via the CREATE WORKLOAD CLASSIFIER statement (a syntax sketch follows these notes)

13.6.1.3.1. We must specify the WORKLOAD_GROUP (which will either need to be one of the legacy resource class names or a custom one created via the CREATE WORKLOAD GROUP statement)

13.6.1.3.2. We must also specify the MEMBERNAME option as a minimum when creating a workload classification

13.6.1.3.3. Other options for workload classification:
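
A minimal sketch of a custom workload group plus a classifier that routes a (hypothetical) loading user into it; the group name, percentages and importance are illustrative only:

-- Reserve a minimum share of resources for ETL work
CREATE WORKLOAD GROUP ETLGroup
WITH ( MIN_PERCENTAGE_RESOURCE = 20,
       CAP_PERCENTAGE_RESOURCE = 60,
       REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5 );

-- Route requests from LoadUser into that group with high importance
CREATE WORKLOAD CLASSIFIER ETLClassifier
WITH ( WORKLOAD_GROUP = 'ETLGroup',
       MEMBERNAME = 'LoadUser',
       IMPORTANCE = HIGH );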

13.6.2. Workload isolation reserves resources for a workload group

13.6.2.1. Resources reserved in a workload group are held exclusively for that workload group to ensure execution

14. Backups

14.1. Snapshots & restore points

14.1.1. Use dedicated SQL pool restore points (created via snapshots) to recover or copy your data warehouse to a previous state in the primary region

14.1.2. To recover from a data corruption issue or to provision copies of the SQL pool for other environments

14.1.3. A data warehouse snapshot creates a restore point you can leverage to recover or copy your data warehouse to a previous state

14.1.4. Since dedicated SQL pool is a distributed system, a data warehouse snapshot consists of many files that are located in Azure storage

14.1.5. Snapshots capture incremental changes from the data stored in your data warehouse

14.1.6. A data warehouse restore is a new data warehouse that is created from a restore point of an existing or deleted data warehouse

14.1.7. The snapshot feature automatically creates restore points throughout the day

14.1.7.1. However, if you frequently pause compute, be aware that the snapshot feature is also paused, so you may miss automatic restore points, which increases your RPO window

14.1.7.1.1. RPO = Recovery Point Objective: the window of tolerable data loss, measured back from the time of a failure that caused data loss/corruption; it is tied to the frequency of backups

14.1.7.2. You can also create user-defined restore points

14.1.7.2.1. This feature enables you to manually trigger snapshots to create restore points of your data warehouse before and after large modifications

14.1.7.2.2. It's a good idea to create one before pausing compute, for example when you have just completed a batch load into the dedicated SQL pool

14.1.7.2.3. You can have a maximum number of 42 user-defined restore points, after which you must delete restore points before creating more

14.1.7.2.4. You can trigger snapshots to create user-defined restore points through PowerShell or the Azure portal

14.1.7.3. For each available snapshot you will be able to select any point of time within a time window for restoring your copy of the dedicated SQL pool

14.1.8. Snapshots are retained for 7 days and then automatically deleted

14.1.8.1. You cannot change this retention period

14.1.8.2. Dedicated SQL pool deletes a restore point when it hits the 7-day retention period and when there are at least 42 total restore points (including both user-defined and automatic)

14.1.8.2.1. If you use the pause compute feature a lot, this will limit the number of automatic restore points created, thereby keeping your total number of snapshots below 42, which means snapshots can persist beyond the 7 day window under such circumstances

14.1.8.2.2. The absolute maximum possible number of available snapshots at any one time is 84: 42 automatic (which would naturally happen in 7 day window if compute is never paused) + 42 user-defined

14.1.9. Restoring a copy of dedicated SQL pool from a restore point (automatic or user-defined) can be initiated in the portal via the dedicated SQL pool Overview window, and the option to create user-defined restore points is available from the same location

14.2. Geo-backup & restore

14.2.1. Use data warehouse geo-redundant backups to restore to a different geographical region

14.2.2. For disaster recovery scenario

14.2.3. A geo-backup is created once per day to a paired data center

14.2.4. The RPO for a geo-restore is 24 hours

14.2.5. You can restore the geo-backup to a server in any other region where dedicated SQL pool is supported

14.2.6. A geo-backup ensures you can restore data warehouse in case you cannot access the restore points in your primary region

14.2.7. If you do not require geo-backups for your dedicated SQL pool, you can disable them and save on disaster recovery storage costs

14.2.7.1. Geo-backups are enabled by default, and this can be changed in the portal via the dedicated SQL pool's Settings | Geo-backup policy page

14.2.7.1.1. Bear in mind that geo-backups will not be taken when compute is paused

14.2.7.2. As an alternative to geo-backup, you can also create a user-defined restore point and restore from the newly created restore point to a new data warehouse in a different region

14.2.7.2.1. Once you have restored, you have the data warehouse online and can pause it indefinitely to save compute costs

14.2.7.2.2. The paused database incurs storage charges at the Azure Premium Storage rate