Azure Databricks Basics

1. Databricks

1.1. The commercial enterprise behind Apache Spark

1.2. Founded by the original team behind Apache Spark - several of the original researchers/developers are still involved with Databricks

1.3. Databricks the company has several products, one of which is named the Databricks Unified Analytics Platform (UAP)

1.3.1. When we talk about Azure Databricks we are talking about the Databricks UAP that is hosted in Azure

1.3.2. Databricks UAP is also available via Amazon Web Services (AWS)

1.3.3. Databricks UAP is a cloud-based environment for hosting Apache Spark

1.3.4. Databricks UAP has its own data lake called Delta Lake

1.3.4.1. Delta Lake is a transactionally consistent data lake

1.4. Databricks is a fully supported third-party service in Azure, not just a marketplace offering

1.4.1. Microsoft has worked closely with Databricks on this

1.4.1.1. Azure Data Factory data flows are built on Databricks UAP (clusters are spun up in the background to provide the compute resource for data flows)

1.5. Databricks has strong features for machine learning but it is more than that and actually has all the elements needed to deliver a modern data warehouse, including strong data transformation abilities and SQL query support

2. Provisioning a new Databricks service

2.1. All Services | Azure Databricks

2.1.1. Create Azure Databricks service

2.1.1.1. 1. Select Subscription

2.1.1.1.1. e.g. Pay-as-you-go

2.1.1.2. 2. Select Resource group

2.1.1.2.1. e.g. argento-internal-training

2.1.1.3. 3. Set Workspace name

2.1.1.3.1. e.g. argento-databricks

2.1.1.4. 4. Select Location

2.1.1.4.1. e.g. UK South

2.1.1.5. 5. Select Pricing tier

2.1.1.5.1. e.g. Standard

2.1.1.6. 6. (Optional) Deploy Azure Databricks in your own virtual network (VNet)

2.1.1.7. 7. (Review + Create) Click Create

2.1.1.7.1. Consider the option to create an ARM template if you intend to spin the workspace up and decommission it over and over, or to promote it to a downstream environment

2.2. When you provision an Azure Databricks service, you don't yet have a cluster - these are launched via the workspace

3. Launch Databricks workspace

3.1. To start Databricks development, you need to launch a workspace for your provisioned Azure Databricks service, and this will launch in a separate browser tab, much like it does for Azure Data Factory

3.2. Azure Databricks | Launch Workspace

3.2.1. Click Explore the Quickstart Tutorial

3.2.1.1. Follow the instructions in the notebook

3.2.1.1.1. The Quickstart Tutorial notebook has SQL as its default language

3.2.1.1.2. The cells of the notebook demonstrate the following concepts

4. Databricks components

4.1. Workspace

4.1.1. Home for all other Databricks objects (e.g. notebooks, clusters, tables, etc.)

4.1.2. We need to create at least one workspace before creating anything else

4.1.3. Workspaces may be split by environment (DEV, TEST, PROD, etc.) or by teams, for example

4.2. Clusters

4.2.1. Databricks builds Spark clusters for us (which is a great benefit vs manual provisioning of Apache Spark on-prem)

4.2.1.1. We define a few things for Databricks to build the cluster:

4.2.1.1.1. Size of each machine (driver and workers)

4.2.1.1.2. Version of Databricks Runtime

4.2.1.1.3. Range for auto-scaling worker nodes

4.2.2. Each cluster has one driver and one or more workers

4.2.2.1. Workers are also known as executors

4.2.2.2. Driver controls jobs and sends commands to workers

4.2.2.3. Drivers should be powerful VMs in production clusters

4.2.2.3.1. Generally the driver will be the busiest node in a typical cluster and the first one you should look to scale up for busy clusters

4.2.3. By default clusters terminate after 120 minutes of inactivity

4.2.3.1. This can be changed, and you can also disable auto-termination entirely so that the cluster never terminates due to inactivity

4.2.3.2. The longer a cluster runs, the more cost accrues, so keeping it running 24x7 should be reserved for Production environments with workloads continuous enough to justify it

4.2.4. Provisioning a new cluster

4.2.4.1. 1. Set Cluster Name

4.2.4.1.1. e.g. argento-db-cluster1

4.2.4.2. 2. Select Cluster Mode

4.2.4.2.1. Options:

4.2.4.3. 3. Select Pool

4.2.4.3.1. Default choice is None

4.2.4.3.2. Pools are a way to reduce cluster start-up time

4.2.4.4. 4. Select Databricks Runtime Version

4.2.4.4.1. Runtimes with "ML" in the name will come with machine learning modules built in

4.2.4.4.2. GPU in the runtime name means cluster nodes will use GPU-accelerated VMs

4.2.4.5. 5. Set Autopilot Options

4.2.4.5.1. Enable autoscaling

4.2.4.5.2. Terminate after X minutes of inactivity

4.2.4.6. 6. Select configuration for Worker and Driver machines

4.2.4.6.1. The default is General Purpose (Standard), intended to provide a balanced system (i.e. a sensible balance of memory, processor cores, etc.)

4.2.4.6.2. Standard General Purpose uses SSD for disk; General Purpose (HDD) uses magnetic disk, reducing VM cost but making I/O operations slower

4.2.4.6.3. Memory optimized provides configs with balance tilted towards memory, which is good when your workloads demand more memory intensive operations

4.2.4.6.4. Storage optimized provides configs with balance tilted towards storage capacity, which is good when your workloads demand more I/O intensive operations

4.2.4.6.5. Compute optimized provides configs with balance tilted towards CPU capacity, which is good when your workloads demand more CPU intensive operations

4.2.4.6.6. GPU accelerated provides configs that are only available when you choose one of the GPU runtimes

4.2.4.7. 7. Click Create Cluster

4.2.4.7.1. Normally takes around 5 mins to provision a standard cluster

4.2.4.7.2. Optional Advanced Config options include:

4.2.4.8. For automation, you can use the JSON option, and adapt it with parameters

4.3. Notebooks

4.3.1. Provide our primary means of interaction with Databricks clusters

4.3.2. We write our code in notebooks and also create our documentation here

4.3.3. Control operations are made through notebooks - there is no console interface

4.3.4. Supports 4 primary languages: Scala, Python, R, SQL

4.3.4.1. Supports some secondary languages too

4.3.4.2. We can switch languages inside a single notebook

4.4. Libraries

4.4.1. Databricks allows you to incorporate third-party libraries in your clusters

4.4.1.1. Options include:

4.4.1.1.1. Build and load custom JAR files

4.4.1.1.2. Import WHL packages

4.4.1.1.3. R libraries

4.4.1.1.4. PyPI

4.4.1.1.5. CRAN (for R)

4.5. Folders

4.5.1. Use to organise notebooks and other resources

4.6. Jobs

4.6.1. Databricks jobs allow you to schedule automated tasks such as data loads

4.6.2. Job clusters tend to be less expensive than interactive clusters, so it's better to perform regular tasks as scheduled jobs instead of running them manually

4.6.3. When you create a new job, you can tie it to an existing notebook

4.7. Data

4.7.1. When we use SQL CREATE TABLE in a Databricks notebook, this creates a table, which is a structured collection of data

4.7.2. A database in Databricks is just a collection of tables

4.7.3. See attached for screenshot of diamonds table created via the Quickstart notebook

4.7.4. Interestingly, data exists independently of clusters and therefore attracts storage costs independently of clusters

4.7.4.1. However, data can only be accessed via an active cluster

4.7.5. When you select Data, you get an option to Add Data (as well as view existing databases and tables)

4.7.5.1. You can create new tables via a local file upload (to DBFS)

4.7.5.1.1. DBFS is the Databricks File System

4.7.5.1.2. DBFS follows the same concept of distributed storage as HDFS (Hadoop's distributed file system)

4.7.5.1.3. Uploaded local files go into /FileStore/tables/ (in DBFS)

4.7.5.2. You can create new tables from files already loaded to DBFS

4.7.5.3. You can create tables from other data sources

4.7.5.3.1. Azure Blob Storage and Azure Data Lake are two popular choices here

4.7.5.3.2. If your data source is a database of some sort (e.g. SQL Database), then you should use the JDBC option

5. Databricks File System (DBFS)

5.1. Distributed file system

5.2. Patterned after Hadoop Distributed File System (HDFS)

5.3. Built on top of blob storage

5.3.1. Azure Blob Storage

5.3.2. Amazon S3

5.4. Supports additional blob storage mount points

5.4.1. Allows you to store your data in Blob storage and reference it via a DBFS mount point, avoiding the need to copy the data into DBFS (see the sketch below)
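
As a rough sketch only, mounting a Blob storage container from a Python notebook looks something like this; the storage account, container, mount point and secret scope/key names are placeholders.

# Account, container, mount point and secret names are illustrative placeholders.
dbutils.fs.mount(
  source = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
  mount_point = "/mnt/mydata",
  extra_configs = {"fs.azure.account.key.mystorageaccount.blob.core.windows.net":
                   dbutils.secrets.get(scope = "my-scope", key = "storage-account-key")})

# Files in the container now appear under the DBFS mount point.
display(dbutils.fs.ls("/mnt/mydata"))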

6. Deleting data

6.1. Remember that any files uploaded to DBFS or tables created using the UI or a notebook will continue to attract storage costs even when your cluster is terminated or deleted

6.2. Dropping tables

6.2.1. %sql

6.2.2. DROP TABLE <table_name>;

6.3. Listing DBFS files

6.3.1. dbutils.fs.ls('directory/sub-directory')

6.4. Deleting DBFS files

6.4.1. dbutils.fs.rm('directory/sub-directory/file-name')
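
A slightly fuller Python sketch (the paths are placeholders); the second argument to dbutils.fs.rm makes the delete recursive, which is needed for directories.

# Remove a single uploaded file.
dbutils.fs.rm('/FileStore/tables/ratings.csv')

# Remove a directory and everything in it - True enables recursive delete.
dbutils.fs.rm('/FileStore/tables/old-uploads/', True)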

7. Visualisations and Dashboards

7.1. By default, DataFrames are presented in Notebooks as tables, but 11 other visualisations are available

7.1.1. Pivot

7.1.1.1. Presents data in pivot table format

7.1.1.2. Good for looking at data aggregations (sum, average, count, etc) by two dimensions

7.1.1.2.1. If you put two or more attributes on a row or column, the values will just be concatenated - you won't get hierarchical navigation like you would with Excel

7.1.2. Bar & Pie charts

7.1.2.1. Good for categorical data - i.e. aggregations spread out by a relatively small number of categories for the purpose of comparing those categories

7.1.2.2. Bar charts are generally considered to be preferable to pie charts as they hold greater analytic value

7.1.3. Line charts

7.1.3.1. Good for looking at aggregated data over time

7.1.4. Scatter Plot charts

7.1.4.1. Good for looking at correlations between two numeric data points in a set (e.g. how correlated is diamond price to diamond size)

7.1.5. Histogram charts

7.1.5.1. Good when you are looking at a single numeric measure and want to see which values (within "bins") occur most or least often, and how the values are spread across the total range

7.1.5.1.1. Bins are equal sized sub-ranges that divide the total range of values from minimum to maximum value

7.1.6. Box Plot charts

7.1.6.1. Good for comparing distributions of some measure (e.g. exam score) by some useful attribute (e.g. year)

7.1.6.2. Visualises percentiles in a "box" - e.g. 25th percentile, 50th percentile, 75th percentile

7.1.6.2.1. "Whiskers" either side of box show the "normal" distribution to min and max

7.1.6.2.2. "Outliers" are data points outside of normal distribution (statistical anomalies)

7.1.7. Map charts

7.1.7.1. Good for visualising geographical data

7.1.8. You can also import additional visuals using Python libraries (e.g. bokeh) or R (ggplot)
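
To show where these visualisations appear, here is a minimal Python sketch (it assumes the diamonds table from the Quickstart tutorial exists): display() renders the DataFrame as a table with the chart picker beneath it, from which the bar, pie, scatter and other chart types above can be selected.

# Aggregate the Quickstart diamonds table (assumed to exist).
df = spark.sql("SELECT cut, AVG(price) AS avg_price FROM diamonds GROUP BY cut")

# display() shows a table plus the chart picker for switching visualisation type.
display(df)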

8. Languages and Libraries

8.1. Supported languages

8.1.1. Scala

8.1.1.1. Pros

8.1.1.1.1. Terse, well-supported language (it is the native language for Spark)

8.1.1.1.2. Best support for native Spark packages

8.1.1.2. Cons

8.1.1.2.1. Least known outside of Spark

8.1.1.2.2. No support for high-concurrency clusters

8.1.2. Python

8.1.2.1. Pros

8.1.2.1.1. Most popular "core" language

8.1.2.1.2. Great support in data science, especially neural networks

8.1.2.1.3. Extensive set of libraries for general-purpose computing

8.1.2.2. Cons

8.1.2.2.1. Often the slowest language

8.1.3. SQL

8.1.3.1. Pros

8.1.3.1.1. Most common language

8.1.3.1.2. Spark has great SQL support

8.1.3.1.3. Separate Catalyst engine can optimize query performance

8.1.3.2. Cons

8.1.3.2.1. SQL is a domain-specific language

8.1.3.2.2. SQL is often a "secondary" language

8.1.4. R

8.1.4.1. Pros

8.1.4.1.1. Excellent support for data science work

8.1.4.1.2. Extensive third-party ecosystem

8.1.4.1.3. Key language

8.1.4.2. Cons

8.1.4.2.1. R is a domain-specific language

8.1.4.2.2. Not all packages support multi-core processing, much less multi-server

8.2. Libraries

8.2.1. External libraries can be added in 5 different ways:

8.2.1.1. Upload files (from your local file system)

8.2.1.2. Load libraries from DBFS (server-side file system of Databricks)

8.2.1.3. Grab Python libraries from PyPi

8.2.1.4. Grab Scala libraries from Maven coordinates

8.2.1.5. Grab R libraries from CRAN

8.2.2. Library modes

8.2.2.1. This refers to the scope of a library that you add to Databricks

8.2.2.2. Workspace-level libraries

8.2.2.2.1. Make libraries available to all users and all clusters

8.2.2.2.2. Put libraries in the Shared folder for multi-user support

8.2.2.2.3. Put libraries in specific Users folder for single-user libraries

8.2.2.2.4. Libraries persist even if you delete all your clusters

8.2.2.3. Cluster-level libraries

8.2.2.3.1. Added to running clusters; the cluster's Libraries tab features an Install New button for this

8.2.2.3.2. Libraries will persist if you stop a cluster (and later restart) but will be lost if you delete the cluster

8.2.2.4. Notebook-level libraries

8.2.2.4.1. Certain Python libraries can be installed on a per notebook basis

8.2.2.4.2. Other libraries installed via notebooks will be installed as cluster-level libraries

8.2.3. Creating libraries

8.2.3.1. New shared workspace library

8.2.3.1.1. Library creation options

8.2.3.2. New user-specific workspace library

8.2.3.3. New library in an R notebook (i.e. notebook scoped library)

8.2.3.3.1. This example installs an R library named "gapminder" in a notebook attached to a cluster that did not already have this library installed

8.2.3.3.2. Note that the attached cluster did already have the benford.analysis library installed, which is why the notebook only includes the library() reference call rather than the install.packages()

8.2.3.3.3. Note that installing a library from a notebook means it will be installed on the attached cluster but the library will not show up on the cluster as one of its installed external libraries

8.2.3.4. New library in a Python notebook (i.e. notebook scoped library)

8.2.3.4.1. The example (in screenshot) was used to demonstrate the notebook import feature

8.2.3.4.2. What you actually need to do to install a notebook-scoped library is use the %pip magic or the %conda magic

8.2.3.4.3. Note: if you use Python import without the library being pre-installed on the cluster, or without a preceding cell that installs it via the %pip or %conda magic, or the dbutils command, you will get an error to the effect that the library is not found
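
A minimal sketch of the %pip route (beautifulsoup4 is an arbitrary PyPI package chosen purely for illustration). The first cell contains only the magic:

%pip install beautifulsoup4

and a later cell in the same notebook can then import it, without the library appearing on the cluster's Libraries tab:

import bs4
print(bs4.__version__)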

8.2.3.5. Best practices

8.2.3.5.1. Consider different clusters for different use cases

8.2.3.5.2. Adding too many external libraries to a cluster can have a substantial impact on the time it takes to spin the cluster up

8.2.3.5.3. Some use cases may require older versions of a particular library, and this will require separate clusters, as you cannot have two different versions of the same external library installed simultaneously on a single cluster

8.2.4. Library management

8.2.4.1. Moving libraries to different folders

8.2.4.1.1. It's easy for libraries stored in DBFS to become messy in terms of file system organisation, so we can move them

8.2.4.1.2. For example, we can create a Libraries sub-folder under Shared and move a bunch of libraries to this

8.2.4.1.3. Moving a library will not affect existing cluster installations

8.2.4.2. Deleting libraries

8.2.4.2.1. Removing external libraries from a cluster is done by selecting and clicking the Uninstall option

8.2.4.2.2. "Uninstalling" is slightly misleading as what happens is that the library is removed from the list of external libraries

8.2.4.2.3. When the cluster is restarted, the external library will no longer be installed

8.2.4.3. Upgrading libraries

8.2.4.3.1. There are no automated updates for libraries

8.2.4.3.2. Basically you need to remove the library and then recreate it (e.g. from PyPI)

9. Integration with Azure Data Factory

9.1. ADF supports the following pipeline activities for Databricks

9.1.1. Notebook

9.1.2. Jar

9.1.3. Python

9.2. To enable ADF to connect to Databricks, we need to create an access token for it in Databricks

9.2.1. In Databricks workspace, click on user icon (top right) and choose User Settings

9.2.2. In User Settings we have the option to generate a new access token

9.2.3. When generating new access token, we give a description (of token's purpose) and an expiry time in days

9.2.3.1. Default expiry period is 90 days but you can change this

9.2.3.2. A good description will identify the service or application that will use the access token for authentication and authorization when connecting to the Databricks service

9.2.4. The new token must be copied immediately as it will no longer be available after the dialog window is closed

9.3. In ADF, we need to create a linked service for Databricks

9.3.1. When creating a linked service, Databricks is one of the options under Compute

9.3.2. In the config properties for the new Databricks linked service, we paste in the secret access token value from the token we generated earlier in Databricks

9.4. In ADF, we create a pipeline with a Databricks Notebook activity

9.4.1. In configuring the Databricks Notebook activity, we bind it to the linked service

9.4.2. In configuring the Databricks Notebook activity we bind it to a notebook hosted in our Databricks workspace

9.5. We can test the pipeline via a manual trigger

9.5.1. The pipeline can be monitored in ADF

9.5.2. We can also observe the job cluster spinning up in Databricks

10. Connecting Databricks to Power BI

10.1. The supported method for this is to connect to Databricks tables as a source for Power BI

10.1.1. This requires an active, running Databricks cluster

10.2. In the Power BI Get Data dialog, there is an option for Azure Databricks under the Azure group

10.3. You will be prompted to enter the server host name and HTTP path

10.3.1. These values can be retrieved via the Databricks UI

10.4. You will be prompted for authentication, and one of the options is to use a (Databricks) personal access token

10.4.1. We can also go with Azure Active Directory authentication

10.5. When you connect, you can see all available Databricks tables and you can select any for loading

10.6. Once connected, you can build out Power BI reports sourced from Azure Databricks tables

11. Databricks Secrets API

11.1. Secrets are sensitive information, such as passwords, access tokens and connection strings, that we don't want to appear in plain text within Databricks notebooks

11.2. Secrets for use in Databricks notebooks can be stored in one of two places: Databricks Secrets or Azure Key Vault

11.2.1. Using the Databricks Secrets API, we can create, retrieve and delete secrets stored in Databricks

11.2.1.1. We can also read secrets from Azure Key Vault (which has its own UI via Azure Portal, plus a separate REST API that can also be used via separate PowerShell module)

11.3. Databricks hosted secrets can have access control lists (ACLs) configured for premium tier workspaces

11.3.1. Permissions: Manage, Write, Read

11.4. Creating a secret in Azure Key Vault

11.4.1. Very easy to provision Azure Key Vault and add a new secret, such as the access token for an ADLS Gen2 storage account

11.5. Creating a secret using Databricks Secrets API and PowerShell

11.5.1. In order to access the Databricks Secrets API and successfully invoke commands, you need to authenticate

11.5.2. Authentication to the Databricks Secrets API can be made using Databricks personal access tokens

11.5.3. PowerShell console

11.5.3.1. In order to use the Databricks Secrets API via PowerShell, you'll need to install and import the DatabricksPS module and ensure the execution policy is set to RemoteSigned

11.5.3.1.1. Once this is taken care of via PowerShell run as Administrator, you should be able to run commands via non-Admin PowerShell session

11.5.3.2. Create a couple of variables, one for the Databricks personal access token and one for the API root URL

11.5.3.2.1. PowerShell variables start with $ and are not case sensitive

11.5.3.3. Set-DatabricksEnvironment

11.5.3.3.1. This is always the first cmdlet to call in DatabricksPS as it establishes the authenticated connection to the Databricks API

11.5.3.4. Get-DatabricksSecretScope

11.5.3.4.1. Check for the existence of any secret scopes hosted in Databricks and list them out

11.5.3.5. Add-DatabricksSecretScope

11.5.3.5.1. Create a new secret scope hosted in Databricks

11.5.3.5.2. Note that to avoid the prompt seen in the screenshot, I should have included parameter -InitialManagePrincipal "users"

11.5.3.6. Add-DatabricksSecret and Get-DatabricksSecret

11.5.3.6.1. Add a new secret to a specific Databricks secret scope and then list out the secrets in that scope to verify it's been added

11.5.3.6.2. If you reference a scope that is Azure Key Vault backed, you'll get an error because the Databricks Secrets API only permits read-only operations with Azure Key Vault

11.5.3.7. Link an Azure Key Vault instance to Databricks secret scopes

11.5.3.7.1. Rather than via PowerShell, this is done via a hidden part of the Azure Databricks UI

11.5.3.8. Remove-DatabricksSecret

11.5.3.9. Remove-DatabricksSecretScope

11.6. Accessing secrets in a notebook

11.6.1. We declare a variable and assign it a value using dbutils.secrets.get()

11.6.1.1. We pass two string arguments for dbutils.secrets.get(): scope and key

11.6.1.2. See attached example

11.6.1.2.1. Note that variable assignment syntax is identical between Scala and Python apart from Scala requiring the keyword "val" as a prefix

11.6.1.2.2. Note that the secret held in a variable returns "[REDACTED]" if you try to print it

11.6.1.2.3. If someone has access to the Databricks workspace and it is Standard tier (meaning all users have full access, with no finer-grained control available), they can work around secret redaction by using a simple for loop to print the secret character by character (see the sketch below)
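
A minimal Python sketch of the behaviour described above; the scope and key names are placeholders.

# Scope and key names are illustrative placeholders.
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")

print(storage_key)        # the notebook output shows [REDACTED], not the value

# The Standard-tier workaround mentioned above: print one character at a time.
for ch in storage_key:
    print(ch)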

12. Apache Spark

12.1. Technology that Databricks is built on

12.2. Free, open source project that implements distributed, in-memory clusters

12.2.1. Part of Apache Hadoop ecosystem

12.2.2. Originally developed at UC Berkeley; it rapidly became one of the largest open source projects in the world

12.3. Spark DataSets

12.3.1. introduced with Spark 2.0

12.3.2. These are strongly-typed RDDs

12.3.2.1. Type can be a class

12.3.2.1.1. e.g. Employee or Invoice

12.3.2.2. Can be a collection of primitive data types

12.3.2.2.1. e.g. string, int, etc.

12.3.2.3. Can be a single data type

12.4. Spark DataFrames

12.4.1. DataSets with named columns

12.4.2. Structure starts to resemble a table

12.4.3. Key data structure for Spark SQL

12.5. Spark SQL

12.5.1. ANSI-compliant SQL statements can be run on a Spark cluster using DataFrames

12.5.2. Uses its own cost-based optimizer called Catalyst

12.5.2.1. This is separate and distinct from Spark's own cost-based optimizer that it uses when converting Scala (or Python, Java) code into an execution plan
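
A minimal Python sketch tying these pieces together (the sample data is invented): a DataFrame with named columns is registered as a view, and Spark SQL, optimised by Catalyst, queries it.

# Invented sample data purely for illustration.
sales = spark.createDataFrame(
    [("UK", 100), ("UK", 250), ("DE", 75)],
    ["country", "amount"])

# Registering the DataFrame as a view makes it queryable from Spark SQL.
sales.createOrReplaceTempView("sales")

display(spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country"))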

12.6. Spark Streaming

12.6.1. Based on the concept of Discretized Streams, which build RDDs over very small windows (typically milliseconds) and process each one as an independent batch

12.6.1.1. Known as the microbatch approach to streaming, which makes streaming easier to understand

13. Resilient Distributed Datasets (RDDs)

13.1. Dataset means an unstructured collection of data

13.1.1. Don't confuse with Spark DataSets

13.2. Distributed means the data is spread across multiple nodes and worked on concurrently, allowing linear scaling

13.3. Resilient means the driver (control server) notices when an executor (processing node) fails to respond and orders another executor to take over

13.4. RDDs are immutable

13.4.1. It is not possible to modify an existing RDD; you have to create a new RDD from the first one in order to make modifications

13.4.2. Immutability is important to facilitate concurrency, bearing in mind that there will be multiple executors working on the same RDD concurrently
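
A small Python sketch of immutability in practice: transformations such as map() never change the original RDD, they return a new one.

# sc (the SparkContext) is available by default in a Databricks notebook.
numbers = sc.parallelize(range(10))

# map() returns a brand new RDD; 'numbers' itself is untouched.
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())   # [0, 1, 2, ... 9] - unchanged
print(doubled.collect())   # [0, 2, 4, ... 18]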

14. Use cases for Databricks

14.1. Relational database vs Apache Spark (or Hadoop) for processing large data sets

14.1.1. Seeks

14.1.1.1. Relational databases are fast to retrieve specific records within large datasets when configured correctly

14.1.1.2. Spark is slow on single point lookup within large datasets, with higher overhead for such operations

14.1.2. Scans

14.1.2.1. Relational databases are adequate on scan speed for large datasets, and can generally handle 10s of millions of rows quite well, when configured properly

14.1.2.2. Spark delivers great scan speed on large distributed datasets

14.1.3. Memory

14.1.3.1. Relational databases are typically memory limited with performance degrading as data size breaches memory size, resulting in excessive disk I/O

14.1.3.2. Spark uses memory across a cluster of machines

14.1.4. Clusters

14.1.4.1. Relational databases are typically bound to a single machine

14.1.4.1.1. There are clustering options for relational databases but they are more complicated to set up, limited in cluster size and fundamentally the technology is not designed for highly scalable clustering

14.1.4.2. In Spark, clusters are fundamental and it's very easy to scale out by adding nodes to a cluster

14.1.5. Scaling

14.1.5.1. Relational databases are hard to scale out and typically are scaled up in response to work load growth

14.1.5.2. Spark nodes can be scaled up too, but typically you will scale out

14.2. Apache Hadoop vs Apache Spark for processing large data sets

14.2.1. Hadoop is based on MapReduce concept, which is an algorithm for performing computations across a cluster

14.2.1.1. Hadoop's implementation of MapReduce is slow because it involves reading from disk, doing some work, writing results to disk and repeating this cycle over and over

14.2.1.1.1. Apache Tez is a related project in the Hadoop ecosystem that optimises Hadoop workloads by building and using Directed Acyclic Graphs (DAGs) that reduce disk activity for MapReduce-style jobs

14.2.2. Spark uses in-memory compute

14.2.2.1. Spark still uses MapReduce concept but all in memory rather than on disk

14.2.3. Spark supports interactive analysis, including streaming data, whereas Hadoop is fundamentally suited to batch analysis only (i.e. jobs that run for hours and produce reduced data sets for analysis or reports)

14.3. Use cases for relational databases in favour of Databricks

14.3.1. Transactional processing requiring fast writes and fast point lookups

14.3.2. Single source of truth systems that require ACID properties

14.3.2.1. Systems that have legal or regulatory requirements for correct data and guaranteed zero transaction loss

14.3.3. Frequently changing data

14.3.3.1. Spark does not support updates - you can delete data and re-insert it, but this is not efficient if you have requirements for doing this frequently

14.3.4. Pre-calculated reports

14.3.4.1. This refers to the ease and familiarity of connecting relational databases to access reporting data using tools like Excel

14.3.4.2. Spark can work in conjunction with relational databases on this

14.3.4.2.1. You may have Spark prepare the report data but copy that into a relational database for the report client application to access

14.4. Use cases for Spark in favour of relational databases

14.4.1. Analytical systems

14.4.1.1. When the purpose of the system is analytical, Spark excels

14.4.2. Batch reporting systems

14.4.2.1. Spark is also good for batch processing of data to prepare for reporting

14.4.2.1.1. The reporting data prepared by Spark is often landed in a relational database for downstream consumption

14.4.3. Error tolerant systems that can afford small percentage loss of data without causing a serious issue

14.4.3.1. A good example would be web activity data - losing a small number of clickthrough records, for example, would not harm the overall value of the analytic system for understanding website usage

14.4.4. Stable data sets

14.4.4.1. Best for batch data insertion with limited deletion and no modification

14.4.5. ELT approach

14.4.5.1. When the goal is to extract data from source systems as fast as possible with zero transformation work and then hand the work of processing that data off to a separate process that does not impact the source systems at all, Spark excels

14.4.5.2. For the load and transform phases, the work is parallelizable

14.5. Spark vs Azure Synapse Analytics

14.5.1. Synapse enables you to perform transformations using SQL and has the same benefits of being highly scalable for massively parallel workload distribution

14.5.1.1. The SQL available in Synapse is limited to a subset of the language due to the parallel nature of the solution

14.5.1.2. Spark enables you to mix languages in order to perform much more complex data transformations than would be possible using Synapse

14.6. Spark vs SSIS or ADF (for data movement)

14.6.1. SSIS has the benefit of having many developers readily available but it requires an on-prem server or an IaaS VM in Azure

14.6.2. ADF has many pros as an ELT/ETL tool when your target is in Azure, and can be expected to become increasingly feature rich

14.6.2.1. Data flows in ADF are actually based on Databricks clusters that spin up in the background, however the transformations available in ADF data flows are more limited than those available from a dedicated Databricks solution

14.7. Databricks provides a tool for both data engineering and data science

14.7.1. Data engineering is all about building ELT/ETL data pipelines to move data and/or clean data

14.7.1.1. Languages of choice are Scala, Python and SQL

14.7.2. Data science is all about the analysis of clean data produced by data engineering

14.7.2.1. Focus is on building models and applying algorithms to those data models for solving business problems, and making results available for loading into downstream systems (e.g. a data warehouse)

14.7.2.2. Languages of choice are R, Python and SQL

14.8. Spark Streaming

14.8.1. Analysing fast-moving data via near real-time ELT (microbatches)

14.8.2. Great for:

14.8.2.1. Data pipelines for real-time dashboards

14.8.2.2. Business process triggers (e.g. flagging bad orders)

14.8.2.3. Anomaly detection

14.8.2.4. Generating recommendations
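
A minimal Python sketch of the micro-batch idea, using the newer Structured Streaming API and its built-in rate test source (the query name is arbitrary); the memory sink is for experimentation only, real pipelines would write to Delta, a dashboard feed, etc.

# A test source that emits rows at a fixed rate.
stream_df = (spark.readStream
             .format("rate")
             .option("rowsPerSecond", 10)
             .load())

# Each micro-batch is appended to an in-memory table that other cells can query.
query = (stream_df.writeStream
         .format("memory")
         .queryName("rate_demo")
         .outputMode("append")
         .start())

# In a later cell: display(spark.sql("SELECT COUNT(*) FROM rate_demo"))
# Call query.stop() when finished.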

15. Databricks pricing

15.1. Estimates available via the Azure pricing calculator

15.2. Total cost of ownership includes:

15.2.1. VM instances

15.2.1.1. Driver VM + Worker VMs

15.2.1.2. You pay for all the time these VMs remain provisioned and allocated to you for exclusive use

15.2.1.2.1. Even when a cluster is idle, you pay for the VMs

15.2.1.2.2. Terminating the cluster halts the VM charges

15.2.1.2.3. If you need your VMs to be up most of the time across an average month, you can save substantially on Pay-as-you-go charges by opting for the 1- or 3-year reserved pricing model

15.2.2. Databricks Units (DBUs)

15.2.2.1. Pricing per unit depends on workload and cluster type

15.2.2.2. Each VM configuration carries a DBU rating that reflects its relative performance capability and an associated hourly cost

15.2.2.3. DBU charges apply when workloads are running, not when clusters remain idle

15.2.3. Data storage

15.2.3.1. This is a general Azure cost, not Databricks specific

15.2.3.2. The more data you store for Databricks to process regularly, the more VM capacity you will typically provision to process it, so indirectly larger data volumes mean larger VM costs

15.2.4. Network bandwidth

15.2.4.1. This is a general Azure cost, not Databricks specific

15.2.4.2. Azure charges for egress, which occurs when data is transferred between Azure regions or outside of Azure

15.3. Region also affects price

15.3.1. I was surprised to note that the price estimate changed simply by toggling between regions: West Europe was lower and UK South higher

15.4. See attached example of Databricks pricing calculator estimate

15.5. Databricks SKUs

15.5.1. Choose your SKU based on workload type and environment; different SKUs attract different DBU hourly charge rates

15.5.2. Two SKU categories: Standard vs Premium

15.5.2.1. Standard is cheaper than premium

15.5.2.2. Premium gives you access to more advanced security features, including role-based access control (RBAC)

15.5.3. Three workload categories:

15.5.3.1. Data Engineering Light

15.5.3.1.1. Least expensive DBU charge, supports running pre-compiled modules

15.5.3.2. Data Engineering

15.5.3.2.1. Mid DBU charge, supports interactive notebook execution, ML and Delta Lake scenarios

15.5.3.3. Data Analytics

15.5.3.3.1. Most expensive DBU charge, supports everything including full collaboration features, Power BI integration, etc.

15.5.3.4. Remember that Apache Spark examples will work in Databricks

15.5.3.5. It is not clear to me that you can "choose" these workload categories via cluster configuration

15.5.3.5.1. I suspect that out of the box you have all the features: when you run processes interactively in Databricks you attract the DBU charges associated with Data Analytics, but automated jobs that run repeatedly attract lower DBU charges depending on their nature

16. Loading data

16.1. Supported data sources include:

16.1.1. Manual upload from local machine and files already in DBFS

16.1.1.1. Delimited, ORC and Parquet files

16.1.1.1.1. Note: fixed width files not supported

16.1.1.2. More formats are supported too, including JSON, Avro and Binary files

16.1.1.2.1. Binary files look potentially interesting as a catch-all for more complex text-based files, such as complex JSON files

16.1.2. Azure Blob Storage

16.1.3. Azure Data Lake Storage

16.1.4. Kafka

16.1.5. Redis

16.1.6. (Almost) anything supporting JDBC

16.2. Supported structures

16.2.1. DataFrames

16.2.1.1. Prerequisite structure for using Spark SQL

16.2.1.2. This is the "traditional" Spark data structure

16.2.1.3. Has a table-like structure with rows and columns, including column headers

16.2.1.4. Lots of examples on Internet

16.2.2. Delta Lake

16.2.2.1. Supports ACID transactions

16.2.2.2. Can combine streaming and batch data

16.2.2.3. Can enforce schema or allow schema drift, depending on your use case

16.2.2.4. Supports data versioning

16.2.3. Temporary Views

16.2.3.1. Exist for the duration of the notebook session

16.2.3.2. Useful for isolated data structures

16.2.3.3. Global temporary views are designed for "applications" made up of multiple notebooks, all of which can share the global temporary view for the duration of the application (see the sketch below)
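
A minimal Python sketch of the difference (it assumes the Quickstart diamonds table exists): the plain temporary view is visible only to the current session, while the global temporary view is exposed to other notebooks on the same cluster via the global_temp database.

df = spark.table("diamonds")

# Session-scoped: visible only to this notebook's session.
df.createOrReplaceTempView("diamonds_tmp")

# Global: other notebooks on the same cluster can read it via global_temp.
df.createOrReplaceGlobalTempView("diamonds_gtmp")

display(spark.sql("SELECT COUNT(*) AS row_count FROM global_temp.diamonds_gtmp"))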

16.2.4. Views

16.2.4.1. Permanent, accessible by all notebooks and persist across all sessions

16.2.4.2. Metadata only, the data is not materialized (i.e. no copy of the data referenced by the view is made)

16.2.5. Tables

16.2.5.1. Data is permanently materialized

16.2.5.2. Typically stored in Hive metastore

16.2.5.3. Allows inserts only

16.2.5.3.1. If you require updates or deletes to a Databricks table, you need to drop and recreate the table (i.e. clear it down and re-insert new content from scratch)

16.2.6. Delta Tables

16.2.6.1. Tables created with keywords USING DELTA

16.2.6.2. Allows Inserts, Updates and Deletes

16.2.6.3. Stores version history and supports temporal table queries

16.2.6.4. Delta tables are based on Delta Lake, an open source project that originated at Databricks

16.2.6.5. Delta tables support the relational database concept of primary and foreign keys, but these are informational and not enforced constraints

16.2.6.5.1. Catalyst (the query optimizer for Spark SQL) uses primary and foreign keys for producing execution plans
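
A minimal sketch in a Python notebook (the events table and its columns are invented for illustration) showing the USING DELTA keywords, in-place updates and deletes, and the version history / temporal query features mentioned above.

spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, status STRING) USING DELTA")
spark.sql("INSERT INTO events VALUES (1, 'new'), (2, 'new')")

# Unlike plain tables, Delta tables accept updates and deletes in place.
spark.sql("UPDATE events SET status = 'processed' WHERE id = 1")
spark.sql("DELETE FROM events WHERE id = 2")

# Version history and time travel back to the first version of the table.
display(spark.sql("DESCRIBE HISTORY events"))
display(spark.sql("SELECT * FROM events VERSION AS OF 0"))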

16.3. Import data process

16.3.1. Data | Add Data | Create New Table | Upload File

16.3.1.1. Browse and navigate to local file for upload

16.3.1.1.1. The file will be uploaded into DBFS

16.3.1.1.2. For training demo, we uploaded ratings.csv from MovieLens database, which is 0.7GB

16.3.1.2. Once the file is uploaded, you can use it to create a table using the UI or a notebook (a notebook sketch follows below)

16.3.1.2.1. Create table with UI
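
For the notebook route, a rough Python sketch (the file name follows the ratings.csv training example above; adjust the path to whatever was actually uploaded).

# /FileStore/tables/ is the default DBFS location for UI uploads.
ratings = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/FileStore/tables/ratings.csv"))

# Save as a metastore table so it appears under Data and is usable from SQL.
ratings.write.saveAsTable("ratings")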

16.3.2. Bringing in data from Azure Blob Storage

16.3.2.1. Upload CSV file to Azure Blob Storage using Azure Storage Explorer

16.3.2.1.1. Data | Add Data | Create New Table | Other Data Sources
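
As an alternative to the UI route (and to the mount approach sketched earlier), the file can be read straight from Blob storage in a Python notebook; the account, container, file and secret names here are placeholders.

account = "mystorageaccount"
container = "mycontainer"

# Supply the storage account key (ideally from a secret scope) for direct wasbs access.
spark.conf.set(
    f"fs.azure.account.key.{account}.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"))

df = (spark.read
      .option("header", "true")
      .csv(f"wasbs://{container}@{account}.blob.core.windows.net/sales.csv"))

display(df)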

16.3.3. Bringing in data from Azure SQL Server

16.3.3.1. Data | Add Data | Create New Table | Other Data Sources

16.3.3.1.1. Set Connector to JDBC
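
A rough Python sketch of the JDBC route for Azure SQL; the server, database, table, user and secret names are all placeholders.

# JDBC connection string for an Azure SQL database (placeholder names).
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.SalesOrders")
      .option("user", "etl_user")
      .option("password", dbutils.secrets.get(scope="my-scope", key="sql-password"))
      .load())

# Optionally persist the result as a Databricks table.
df.write.saveAsTable("sales_orders")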

17. Notebooks

17.1. Notebooks provide the integrated development environment (IDE) for developing in Databricks, and they also provide the means to document your development

17.1.1. You can combine code, images and documentation in one document

17.2. Notebooks help solve the replication problem

17.2.1. Scripts are usually developed iteratively, and during their lifecycle they will transfer from one developer to another

17.2.2. When script handover occurs, a common issue is that it fails to run for the new developer and you have the "worked on my machine" scenario

17.2.2.1. Notebooks really help to avoid the "worked on my machine" scenario

17.2.3. Notebooks hold state like REPLs

17.2.3.1. REPL stands for Read-Evaluate-Print-Loop

17.2.3.1.1. The idea is that you read code interactively, evaluate it, print the results and loop back to the next code block to read

17.2.3.2. Applied to notebooks, each code block (cell) is read, evaluated and results displayed in the notebook, and this process is repeatable for each cell in the notebook

17.2.3.2.1. Furthermore, each code block execution can change the state of variables that other code blocks in the notebook can access

17.2.3.3. Even after the cluster has been terminated and you've logged out of Azure, you still get to see all the results of cell executions (e.g. schema details, tabular query results, visualisations, etc.)

17.2.3.4. Via the Clear menu, you have the option to clear state or results, or both, and re-run the notebook end to end

17.3. Documentation is enabled using the Markdown or HTML syntax

17.3.1. Markdown

17.3.1.1. Start cell with %md

17.3.1.2. Press ESC to switch from Edit mode to Command mode, which renders and displays your markdown in the block

17.3.1.3. # <some text>

17.3.1.3.1. Heading 1 font

17.3.1.3.2. Repeat # up to 6 times to get Headings 1 to 6

17.3.1.3.3. examples:

17.3.1.4. *<some text>*

17.3.1.4.1. Italicise text

17.3.1.4.2. example:

17.3.1.5. **<some text>**

17.3.1.5.1. Bold text

17.3.1.5.2. example:

17.3.1.6. ***<some text>***

17.3.1.6.1. Bold + italics

17.3.1.6.2. example:

17.3.1.7. >

17.3.1.7.1. block quote

17.3.1.8. `<some text>`

17.3.1.8.1. Inline code

17.3.1.8.2. example:

17.3.1.8.3. Note: requires backtick, not single quote

17.3.1.8.4. Use 3 backticks for multiline code

17.3.1.9. -, + or *

17.3.1.9.1. bullet list

17.3.1.10. ---

17.3.1.10.1. Section divider line

17.3.1.11. 1.

17.3.1.11.1. numbered list

17.3.1.12. <>

17.3.1.12.1. Create a mailto link

17.3.1.13. Tables can be created using vertical pipes, with dashes on second line to indicate first line is a header for table

17.3.1.13.1. See attached example

17.3.2. Note: to use HTML tags in a cell, you also need to start cell with %md

17.3.2.1. Links with the <a> tag

17.3.2.1.1. See attached example

17.3.2.2. (Web) Images with the <img> tag

17.3.2.2.1. See attached example

17.4. When you create a new blank notebook, you choose its default language

17.4.1. Magics are commands that let you change the code in a cell to a language other than the notebook's default

17.4.1.1. You reference magics at top of cell with %lang

17.4.1.1.1. %python

17.4.1.1.2. %r

17.4.1.1.3. %scala

17.4.1.1.4. %sql

17.4.1.1.5. %sh

17.4.1.1.6. %fs

17.4.1.1.7. %md

17.5. When you choose the Databricks Standard SKU, the Permissions menu will be greyed out for notebooks

17.5.1. You can check your SKU (pricing tier) via Azure Portal

17.6. Comments can be added, which works in a similar way to Microsoft Word

17.6.1. Comments attach to cells and could be a perfect place for peer review comments

17.7. Keyboard shortcuts

17.7.1. You can see a list of these any time by clicking the keyboard icon

17.7.2. ESC

17.7.2.1. With a cell selected and in Edit mode, this switches to Command mode

17.7.2.1.1. For cells with Markdown content (i.e. started with %md), this will trigger the markdown to render

17.7.2.1.2. For cells with code, this simply exits Edit mode and displays the code

17.7.3. Shortcuts are grouped by two modes: Edit mode and Command mode

17.7.3.1. Edit mode

17.7.3.1.1. When you are inside a cell, editing its contents

17.7.3.1.2. Press ENTER to switch selected cell to Edit mode

17.7.3.1.3. CTRL-ENTER

17.7.3.1.4. CTRL-ALT-D

17.7.3.1.5. CTRL-ALT-X

17.7.3.1.6. CTRL-ALT-C

17.7.3.1.7. CTRL-ALT-V

17.7.3.1.8. CTRL-ALT-P

17.7.3.1.9. CTRL-ALT-N

17.7.3.1.10. CTRL-ALT-F

17.7.3.1.11. ALT-ENTER

17.7.3.1.12. SHIFT-ENTER

17.7.3.1.13. CTRL-]

17.7.3.1.14. CTRL-[

17.7.3.1.15. CTRL-/

17.7.3.2. Command mode

17.7.3.2.1. When you have taken a step back from editing and are thinking about code execution, notebook navigation, or making structural changes (adding/deleting cells, etc.)

17.7.3.2.2. Press ESC to switch current cell being edited to Command mode

17.7.3.2.3. CTRL-ENTER

17.7.3.2.4. D-D

17.7.3.2.5. Shift-D-D

17.7.3.2.6. X

17.7.3.2.7. C

17.7.3.2.8. V

17.7.3.2.9. SHIFT-V

17.7.3.2.10. A

17.7.3.2.11. B

17.7.3.2.12. O

17.7.3.2.13. SHIFT-M

17.7.3.2.14. CTRL-ALT-F

17.7.3.2.15. SHIFT-ENTER

17.7.3.2.16. Z

18. Dashboards

18.1. This is a notebook feature that allows you to build custom dashboards presenting tables and visualisations from your notebook

18.2. Create a new dashboard in your notebook

18.2.1. View | New dashboard

18.2.1.1. 1. Set name of dashboard

18.2.1.1.1. e.g. Test Dashboard

18.2.1.2. 2. Arrange items on the dashboard

18.2.1.2.1. You get all your markdown cells and all the code cell results that return tables and other visualisations

18.2.1.2.2. You can resize tiles or delete them

18.2.1.2.3. For tables and chart tiles, you can click the settings icon on the tile and set a title for the tile

18.2.1.2.4. You can change the dashboard width in order to fit more tiles horizontally

18.2.1.3. 3. Click Present Dashboard

18.2.1.3.1. The dashboard is presented in full-screen mode

18.3. You can create multiple dashboards per notebook

19. Scheduling jobs

19.1. Scheduling jobs from notebooks

19.1.1. Databricks automatically uses non-interactive clusters for scheduled jobs

19.1.1.1. This means it won't use any clusters you may have provisioned yourself

19.1.1.2. The benefit of this is cost saving - the job clusters attract lower costs than the interactive clusters

19.1.1.3. The clusters auto terminate once notebook completes

19.1.2. From inside a notebook, click the Schedule button to create a schedule

19.1.3. Once a schedule has been created, the job appears in the Jobs list

19.1.3.1. When the status icon is green, this indicates the job is active and running right now

19.1.4. You can drill into job results when a job fails

19.1.4.1. In my case, my first job failed because the default size of my job cluster was 8 worker nodes and that exceeded my quota

19.1.4.1.1. This is easily remedied by editing the job and then the job cluster to reduce the number of worker nodes

19.1.4.2. If the notebook has a dependency on an external library, it won't be sufficient for that library to be installed on your interactive cluster(s), which is another common cause of job failure

19.1.4.2.1. This is fixed by editing the job and clicking the option to add dependent libraries

19.2. You can create multiple schedules for a single notebook

19.2.1. A typical use case for this is when a notebook uses parameters and you want different scheduled runs to use different parameter values

19.3. You can delete jobs from the Jobs list if they are no longer needed

19.4. You can edit jobs and remove their schedule

19.4.1. This means the job no longer runs automatically and can instead be run manually via the UI

19.4.2. This also means that the job can be invoked using a REST API call

19.4.2.1. Think Data Factory pipeline invoking a Databricks job at the appropriate moment in a daily batch process
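
A minimal Python sketch of such an invocation using the Jobs API run-now action; the workspace URL, the access token placeholder and the job id are illustrative.

import requests

base_url = "https://uksouth.azuredatabricks.net/api/2.0"
headers = {"Authorization": "Bearer <databricks_access_token>"}

# Trigger the (unscheduled) job on demand; 123 is a placeholder job id.
response = requests.post(f"{base_url}/jobs/run-now",
                         headers=headers,
                         json={"job_id": 123})
print(response.status_code, response.json())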

19.5. Databricks saves information about job runs for 60 days

19.6. Scheduling jobs outside of notebooks

19.6.1. Databricks allows you to schedule and execute packaged Java code for Spark, packaged in a JAR file

19.6.2. Under Jobs, you have options to set JAR or configure spark-submit

19.6.2.1. I think you use the set JAR dialog when the JAR file does not yet exist in DBFS and the configure spark-submit if it's already in DBFS

19.6.3. See attached example we used in the PragmaticWorks course

19.6.4. Once created, the job can be run like any other job, either via a schedule or manually, and results reviewed via the logs

20. Databricks REST API

20.1. Databricks has multiple REST APIs, including the following:

20.1.1. Clusters API

20.1.1.1. Some of the actions supported via the API:

20.1.1.1.1. List

20.1.1.1.2. Get

20.1.1.1.3. Create

20.1.1.1.4. Edit

20.1.1.1.5. Start / Restart

20.1.1.1.6. Terminate / Delete

20.1.2. DBFS API

20.1.2.1. Some of the supported actions via the API:

20.1.2.1.1. List

20.1.2.1.2. Mkdirs

20.1.2.1.3. Create

20.1.2.1.4. Delete

20.1.2.1.5. Move

20.1.2.1.6. Read

20.1.3. Jobs API

20.1.3.1. Some of the supported actions via the API:

20.1.3.1.1. Create

20.1.3.1.2. List

20.1.3.1.3. Get

20.1.3.1.4. Delete

20.1.3.1.5. Run Now

20.1.3.1.6. Runs List / Get / Export / Cancel

20.1.4. Libraries API

20.1.4.1. Some of the supported actions via the API:

20.1.4.1.1. All Cluster Statuses

20.1.4.1.2. Cluster Status

20.1.4.1.3. Install

20.1.4.1.4. Uninstall

20.1.5. Secrets API

20.1.5.1. Actions pertaining to:

20.1.5.1.1. Secret scopes

20.1.5.1.2. Secrets

20.1.5.1.3. ACLs

20.1.6. Workspace API

20.1.6.1. Some of the supported actions via the API:

20.1.6.1.1. Delete

20.1.6.1.2. Export

20.1.6.1.3. Get Status

20.1.6.1.4. Import

20.1.6.1.5. List

20.1.6.1.6. Mkdirs

20.2. All of these APIs allow us to interact with Databricks without using the UI

20.2.1. This becomes useful for automation scenarios

20.3. Access Databricks REST API using Postman

20.3.1. Postman is a user friendly tool for using REST APIs, and is free for personal use

20.3.2. Start by creating a new access token for Postman in Databricks

20.3.3. After starting up Postman, create a new request and configure the Authorization

20.3.3.1. Set Type to Bearer token

20.3.3.2. Paste in the Databricks access token

20.3.4. Prepare URL with REST API call and click Send button (HTTP GET method)

20.3.4.1. The base URL can be either of the following:

20.3.4.1.1. <region>.azuredatabricks.net

20.3.4.1.2. <instance>.azuredatabricks.net

20.3.4.2. To direct the request to the REST API, you append "/api/2.0" to the base URL

20.3.4.2.1. See link for confirmation, as version of API may change over time

20.3.4.2.2. The final part of the URL identifies the required API and the required action

20.3.4.3. Here's another example using get action in clusters API

20.3.4.3.1. Note that the "?" following get introduces what are known as "query parameters"

20.3.5. The HTTP GET method can only be used for certain API actions that just return info (in JSON format)

20.3.5.1. Other API actions require the HTTP POST method

20.3.5.1.1. Unlike the GET method, the POST method always requires a Body

20.3.5.1.2. Here's an example using create action in clusters API
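
The same calls can of course be made from any HTTP client; here is a rough Python sketch mirroring the Postman GET example (region-based base URL and token placeholder as above).

import requests

base_url = "https://uksouth.azuredatabricks.net/api/2.0"
headers = {"Authorization": "Bearer <databricks_access_token>"}

# GET with no body, equivalent to the Postman list action for the clusters API.
response = requests.get(f"{base_url}/clusters/list", headers=headers)

for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])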

20.4. Access Databricks REST API using PowerShell

20.4.1. There is a community-driven PowerShell module by Gerhard Brueckl that interacts with the Databricks APIs

20.4.1.1. Run PowerShell as Administrator and run the following commands to set things up:

20.4.1.1.1. Install-Module DatabricksPS

20.4.1.1.2. Import-Module DatabricksPS

20.4.1.2. In the PowerShell gallery, under package details, you can see the full list of functions available for the DatabricksPS module

20.4.2. We need a Databricks access token, which we store in a variable using PowerShell

20.4.2.1. Run this command in PowerShell:

20.4.2.1.1. $accessToken = "<databricks_access_token>"

20.4.3. The final bit of setup before running cmdlets that invoke the Databricks REST API is to set the Databricks environment

20.4.3.1. Run these commands in PowerShell:

20.4.3.1.1. $apiRootUrl = "https://uksouth.azuredatabricks.net"

20.4.3.1.2. Set-DatabricksEnvironment -AccessToken $accessToken -ApiRootUrl $apiRootUrl

20.4.4. Examples:

20.4.4.1. Get-DatabricksCluster

20.4.4.1.1. Note that this only lists the interactive clusters, but if you want to see all the job clusters too, you can run this command:

20.4.4.1.2. Repeated use of the <Tab> in PowerShell is really useful for seeing what the options are both in the context of cmdlets and parameters

20.4.4.2. Add-DatabricksCluster

20.4.4.2.1. 4 parameters are mandatory:

20.4.4.3. Stop-DatabricksCluster

20.4.4.3.1. -ClusterID parameter (string) is required

20.4.4.4. Export-DatabricksWorkspaceItem

20.4.4.4.1. Using this function, we can export notebooks to a local file system

20.4.4.4.2. 3 required parameters:

20.4.4.5. Get-DatabricksFSItem

20.4.4.5.1. Using this function we can browse the DBFS for our environment

20.4.4.5.2. -Path

20.4.4.5.3. -ChildItems

20.4.4.6. Upload-DatabricksFSFile

20.4.4.6.1. Using this function we can upload files to DBFS from the local file system

20.4.4.6.2. 2 required parameters:

20.4.4.7. Remove-DatabricksFSItem

20.4.4.7.1. Using this function we can delete files and directories in DBFS

20.4.4.7.2. -Path

21. Databricks security

21.1. There are a number of features that are supported only when the provisioned Azure Databricks service is tied to the premium pricing tier

21.2. Admin console

21.2.1. Users

21.2.1.1. The Admin and Allow cluster creation options are checked but inaccessible for standard pricing tier

21.2.1.2. When you add users they must first exist in Azure Active Directory

21.2.1.2.1. You can add users that don't exist in AAD but they won't be able to access Azure Databricks until they are set up in AAD

21.2.2. Groups

21.2.2.1. When creating a new group, you can add either existing users or groups as members

21.2.2.2. You can set entitlements for the group, which only specifies whether or not group members can create clusters

21.2.2.2.1. For the standard tier, these settings are available but essentially useless I believe because under standard pricing all added users have the Admin role and permission for cluster creation

21.2.2.3. When using group hierarchies (groups added to groups) bear in mind that the entitlements of the parent group automatically apply down to the child

21.2.2.3.1. So an apparent "deny" permission (i.e. create cluster permission turned off) at child group level is overridden by an "allow" (cluster creation) permission at the parent group level

21.2.3. Workspace Storage

21.2.3.1. Deleted notebooks, folders, libraries, and experiments are recoverable from the trash for 30 days

21.2.3.1.1. Clicking Purge for workspace storage allows you to "empty the trash" permanently and stop paying for the storage of that trash, but you will no longer be able to recover its contents

21.2.3.2. Notebook revision history is automatically maintained, which is great but the more revision history is built up for a notebook, the more that adds to storage cost

21.2.3.2.1. Clicking Purge for revision history in combination with choosing a timeframe allows you to stay on top of revision history and get rid of everything permanently that was captured outside of selected timeframe

21.2.3.3. Clusters automatically maintain event logs, driver logs and metric snapshots, even for terminated clusters, which builds up over time and consumes storage in the background

21.2.3.3.1. Clicking Purge for cluster logs permanently gets rid of all cluster logs (event and driver) and metric snapshots

21.2.4. Access Control

21.2.4.1. 3 out of 4 options are disabled and can only be enabled when the premium pricing tier is selected

21.2.4.1.1. Workspace access control (premium only - always disabled in standard tier)

21.2.4.1.2. Cluster, pool and jobs access control (premium only - always disabled in standard tier)

21.2.4.1.3. Table access control (premium tier only - always disabled in standard tier)

21.2.4.2. Personal access tokens is enabled by default and setting is controllable via standard and premium tiers

21.2.5. Advanced

21.2.5.1. In the PragmaticWorks training course, there was only one option here relating to enabling a runtime for genomics but when I was studying this in Oct 2020 the options here expanded greatly to 12

21.2.5.1.1. It appears that all advanced options are available in standard tier

21.2.5.1.2. A few notable things that can be controlled here include the following (all enabled by default):

21.2.6. Global Init Scripts

21.2.6.1. This option did not even exist when PragmaticWorks created their Databricks training course

21.2.6.2. Global init scripts run on all cluster nodes launched in your workspace.

21.2.6.3. They can help you to enforce consistent cluster configurations across your workspace in a safe, visible, and secure manner.

21.3. Azure Active Directory integration

21.3.1. Enables Single Sign On (SSO) for Databricks

21.3.2. Conditional access is an AAD feature that allows you to restrict access to Databricks workspaces based upon location, requiring multi-factor authentication, etc.

21.3.2.1. We could restrict user access to Databricks to a particular VNet this way, I think

21.3.2.2. Conditional access is only available via AAD Premium (not Standard)

21.4. System for Cross-domain Identity Management (SCIM)

21.4.1. This is an open standard for automating user provisioning

21.4.2. Databricks REST API includes a SCIM API that enables you to programmatically create, update and delete users and groups

21.5. Role Based Access Control (RBAC)

21.5.1. RBAC is available across many Azure services, including Databricks, for Identity Access Management (IAM)

21.5.2. RBAC only applies to Databricks when you choose the Premium tier

21.6. Implementing Table Access Control

21.6.1. After enabling Table Access Control via the Admin console (premium tier option only), the next step for setting up the access is to provision a cluster

21.6.1.1. To support table access control, the cluster must be provisioned with the "High Concurrency" cluster mode (not Standard)

21.6.1.1.1. The advanced option for enabling table access control is only visible after it's already been enabled via the Admin console

21.6.2. Once you have a secure cluster provisioned (i.e. a High Concurrency cluster with table access control enabled), you can run security-related commands in SQL, for example (a sketch follows at the end of this section)

21.6.2.1. Example of changing table owner via SQL statement in a notebook

21.6.2.1.1. Note that this command can only work if the notebook is attached to a "secure" cluster

21.6.2.2. Example of granting a user Select access to a table

21.6.2.3. Example of denying a user access to a table

21.6.2.4. Example of granting a group Select access to a table

21.6.2.4.1. Group needs to exist under Groups in the Admin console

21.6.2.5. Example of granting a user Select permission to a database

21.6.2.6. Any attempt to run Scala or R code will fail when the notebook is connected to a cluster that supports table access control
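
A hedged sketch of the kinds of statements referenced in the attached examples; the user and group names are placeholders, and these would typically be run in %sql cells on the secure cluster (spark.sql is used here only to keep the sketch in Python).

# Change table owner, then grant/deny access (placeholder principals).
spark.sql("ALTER TABLE default.diamonds OWNER TO `someone@example.com`")
spark.sql("GRANT SELECT ON TABLE default.diamonds TO `someone@example.com`")
spark.sql("DENY SELECT ON TABLE default.diamonds TO `someone@example.com`")
spark.sql("GRANT SELECT ON DATABASE default TO `analysts`")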

21.7. BYO VNET

21.7.1. Stands for Bring Your Own Virtual Network

21.7.2. This is a feature that allows you to migrate your Databricks workspace into a customer-managed Vnet in Azure

21.7.2.1. You can then use another feature called VNet Peering to pair the Databricks VNet with another Azure VNet that hosts a VNet Gateway, which in turn enables secure access to on-prem data

21.7.2.2. You can grant access to specific Azure endpoints (e.g. ADLS Gen2)

21.7.2.3. You can use custom DNS

21.8. Azure AD Passthrough

21.8.1. This feature allows you to connect directly to ADLS Gen1 or Gen2 storage from Databricks using the same AAD credentials used to access Azure Databricks

21.8.1.1. Only High Concurrency clusters allowed

21.8.1.1.1. No support for Scala, only Python, SQL and R

21.8.1.2. Passthrough clusters default to using Data Lake storage, not DBFS

21.8.1.2.1. You need to update Spark Config to allow DBFS access