1. Databricks
1.1. The commercial enterprise behind Apache Spark
1.2. Developed by the original team behind Apache Spark - several of the original researchers/developers are still involved with Databricks
1.3. Databricks the company has several products, one of which is named the Databricks Unified Analytics Platform (UAP)
1.3.1. When we talk about Azure Databricks we are talking about the Databricks UAP that is hosted in Azure
1.3.2. Databricks UAP is also available via Amazon Web Services (AWS)
1.3.3. Databricks UAP is a cloud-based environment for hosting Apache Spark
1.3.4. Databricks UAP has its own data lake called Delta Lake
1.3.4.1. Delta Lake is a transactionally consistent data lake
1.4. Databricks is a fully supported third-party service in Azure, not just a marketplace offering
1.4.1. Microsoft has worked closely with Databricks on this
1.4.1.1. Azure Data Factory data flows are built on Databricks UAP (clusters are spun up in the background to provide the compute resource for data flows)
1.5. Databricks has strong features for machine learning, but it is more than that: it has all the elements needed to deliver a modern data warehouse, including strong data transformation abilities and SQL query support
2. Provisioning a new Databricks service
2.1. All Services | Azure Databricks
2.1.1. Create Azure Databricks service
2.1.1.1. 1. Select Subscription
2.1.1.1.1. e.g. Pay-as-you-go
2.1.1.2. 2. Select Resource group
2.1.1.2.1. e.g. argento-internal-training
2.1.1.3. 3. Set Workspace name
2.1.1.3.1. e.g. argento-databricks
2.1.1.4. 4. Select Location
2.1.1.4.1. e.g. UK South
2.1.1.5. 5. Select Pricing tier
2.1.1.5.1. e.g. Standard
2.1.1.6. 6. (Optional) Deploy Azure Databricks in your own virtual network (VNet)
2.1.1.7. 7. (Review + Create) Click Create
2.1.1.7.1. Consider the option of creating an ARM template if you intend to spin the cluster up and decommission it repeatedly, or if you want to promote it to a downstream environment
2.2. When you provision an Azure Databricks service, you don't yet have a cluster - these are launched via the workspace
3. Launch Databricks workspace
3.1. To start Databricks development, you need to launch a workspace for your provisioned Azure Databricks service, and this will launch in a separate browser tab, much like it does for Azure Data Factory
3.2. Azure Databricks | Launch Workspace
3.2.1. Click Explore the Quickstart Tutorial
3.2.1.1. Follow the instructions in the notebook
3.2.1.1.1. The Quickstart Tutorial notebook has SQL as its default language
3.2.1.1.2. The cells of the notebook demonstrate the following concepts
4. Databricks components
4.1. Workspace
4.1.1. Home for all other Databricks objects (e.g. notebooks, clusters, tables, etc.)
4.1.2. We need to create at least one workspace before creating anything else
4.1.3. Workspaces may be split by environment (DEV, TEST, PROD, etc.) or by teams, for example
4.2. Clusters
4.2.1. Databricks builds Spark clusters for us (which is a great benefit vs manual provisioning of Apache Spark on-prem)
4.2.1.1. We define a few things for Databricks to build the cluster:
4.2.1.1.1. Size of each machine (driver and workers)
4.2.1.1.2. Version of Databricks Runtime
4.2.1.1.3. Range for auto-scaling worker nodes
4.2.2. Each cluster has one driver and one or more workers
4.2.2.1. Workers are also known as executors
4.2.2.2. Driver controls jobs and sends commands to workers
4.2.2.3. Drivers should be powerful VMs in production clusters
4.2.2.3.1. Generally the driver will be the busiest node in a typical cluster and the first node you should look to scale up for busy clusters
4.2.3. By default clusters terminate after 120 minutes of inactivity
4.2.3.1. This can be changed, and you can also turn auto-termination off entirely so that the cluster never terminates due to inactivity
4.2.3.2. The longer a cluster runs, the more cost accrues, so keeping it running 24x7 should be reserved for Production environments with workloads continuous enough to justify it
4.2.4. Provisioning a new cluster
4.2.4.1. 1. Set Cluster Name
4.2.4.1.1. e.g. argento-db-cluster1
4.2.4.2. 2. Select Cluster Mode
4.2.4.2.1. Options:
4.2.4.3. 3. Select Pool
4.2.4.3.1. Default choice is None
4.2.4.3.2. Pools are a way to reduce cluster start-up time
4.2.4.4. 4. Select Databricks Runtime Version
4.2.4.4.1. Runtimes with "ML" in the name will come with machine learning modules built in
4.2.4.4.2. GPU in runtime name means cluster nodes will use GPU instead of CPU
4.2.4.5. 5. Set Autopilot Options
4.2.4.5.1. Enable autoscaling
4.2.4.5.2. Terminate after X minutes of inactivity
4.2.4.6. 6. Select configuration for Worker and Driver machines
4.2.4.6.1. Default will be General Purpose Standard, which is intended to provide balanced systems (i.e. a sensible balance of memory, processor cores, etc.)
4.2.4.6.2. Standard General Purpose uses SSD for disk; General Purpose (HDD) uses magnetic disk, reducing VM cost but with slower I/O operations
4.2.4.6.3. Memory optimized provides configs with balance tilted towards memory, which is good when your workloads demand more memory intensive operations
4.2.4.6.4. Storage optimized provides configs with balance tilted towards storage capacity, which is good when your workloads demand more I/O intensive operations
4.2.4.6.5. Compute optimized provides configs with balance tilted towards CPU capacity, which is good when your workloads demand more CPU intensive operations
4.2.4.6.6. GPU accelerated provides configs that are only available when you choose one of the GPU runtimes
4.2.4.7. 7. Click Create Cluster
4.2.4.7.1. Normally takes around 5 mins to provision a standard cluster
4.2.4.7.2. Optional Advanced Config options include:
4.2.4.8. For automation, you can use the JSON option, and adapt it with parameters
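As a sketch of what that JSON option looks like when parameterised: the field names below follow the Databricks Clusters API, but the cluster name, runtime string, VM size and autoscale values are examples only.

```python
import json

# Hypothetical cluster definition; field names follow the Databricks
# Clusters API, values are examples only.
cluster_spec = {
    "cluster_name": "argento-db-cluster1",
    "spark_version": "7.3.x-scala2.12",      # a Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",       # Azure VM size for the nodes
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,          # the default inactivity timeout
}

# Parameterise by overriding fields per environment before submitting:
dev_spec = {**cluster_spec,
            "cluster_name": "argento-db-cluster1-dev",
            "autoscale": {"min_workers": 1, "max_workers": 2}}

payload = json.dumps(dev_spec)
```

The same base definition can then be reused across DEV/TEST/PROD by swapping only the overridden fields.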
4.3. Notebooks
4.3.1. Provide our primary means of interaction with Databricks clusters
4.3.2. We write our code in notebooks and also create our documentation here
4.3.3. Control operations are made through notebooks - there is no console interface
4.3.4. Supports 4 primary languages: Scala, Python, R, SQL
4.3.4.1. Supports some secondary languages too
4.3.4.2. We can switch languages inside a single notebook
4.4. Libraries
4.4.1. Databricks allows you to incorporate third-party libraries in your clusters
4.4.1.1. Options include:
4.4.1.1.1. Build and load custom JAR files
4.4.1.1.2. Import WHL packages
4.4.1.1.3. R libraries
4.4.1.1.4. PyPI
4.4.1.1.5. CRAN (for R)
4.5. Folders
4.5.1. Use to organise notebooks and other resources
4.6. Jobs
4.6.1. Databricks jobs allow you to schedule automated tasks such as data loads
4.6.2. Job clusters tend to be less expensive than interactive clusters, so it's better to perform regular tasks as scheduled jobs instead of running them manually
4.6.3. When you create a new job, you can tie it to existing notebook
4.7. Data
4.7.1. When we use SQL CREATE TABLE in a Databricks notebook, this creates a table, which is a structured collection of data
4.7.2. A database in Databricks is just a collection of tables
4.7.3. See attached for screenshot of diamonds table created via the Quickstart notebook
4.7.4. Interestingly, data exists independently of clusters and therefore attracts storage costs independently of clusters
4.7.4.1. However, data can only be accessed via an active cluster
4.7.5. When you select Data, you get an option to Add Data (as well as view existing databases and tables)
4.7.5.1. You can create new tables via a local file upload (to DBFS)
4.7.5.1.1. DBFS is the Databricks File System
4.7.5.1.2. DBFS follows the same concept of distributed storage as HDFS (Hadoop's distributed file system)
4.7.5.1.3. Uploaded local files go into /FileStore/tables/ (in DBFS)
4.7.5.2. You can create new tables from files already loaded to DBFS
4.7.5.3. You can create tables from other data sources
4.7.5.3.1. Azure Blob Storage and Azure Data Lake are two popular choices here
4.7.5.3.2. If your data source is a database of some sort (e.g. SQL Database), then you should use the JDBC option
5. Databricks File System (DBFS)
5.1. Distributed file system
5.2. Patterned after Hadoop Distributed File System (HDFS)
5.3. Built on top of blob storage
5.3.1. Azure Blob Storage
5.3.2. Amazon S3
5.4. Supports additional blob storage mount points
5.4.1. Allows you to store your data in Blob storage with a DBFS mount reference, avoiding the need to copy the data into DBFS
6. Deleting data
6.1. Remember that any files uploaded to DBFS or tables created using the UI or a notebook will continue to attract storage costs even when your cluster is terminated or deleted
6.2. Dropping tables
6.2.1. %sql
6.2.2. DROP TABLE <table_name>;
6.3. Listing DBFS files
6.3.1. dbutils.fs.ls('directory/sub-directory')
6.4. Deleting DBFS files
6.4.1. dbutils.fs.rm('directory/sub-directory/file-name')
7. Visualisations and Dashboards
7.1. By default, DataFrames are presented in Notebooks as tables, but 11 other visualisations are available
7.1.1. Pivot
7.1.1.1. Presents data in pivot table format
7.1.1.2. Good for looking at data aggregations (sum, average, count, etc) by two dimensions
7.1.1.2.1. If you put two or more attributes on a row or column, the values will just be concatenated - you won't get hierarchical navigation like you would with Excel
7.1.2. Bar & Pie charts
7.1.2.1. Good for categorical data - i.e. aggregations spread out by a relatively small number of categories for the purpose of comparing those categories
7.1.2.2. Bar charts are generally considered to be preferable to pie charts as they hold greater analytic value
7.1.3. Line charts
7.1.3.1. Good for looking at aggregated data over time
7.1.4. Scatter Plot charts
7.1.4.1. Good for looking at correlations between two numeric data points in a set (e.g. how correlated is diamond price to diamond size)
7.1.5. Histogram charts
7.1.5.1. Good for examining a single numeric measure when you want to see which value ranges ("bins") occur most or least, and the pattern of values across the total range
7.1.5.1.1. Bins are equal sized sub-ranges that divide the total range of values from minimum to maximum value
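A toy illustration of that binning idea in plain Python (not Databricks chart code): the total range from minimum to maximum is divided into equal-width bins and each value is counted into one of them.

```python
# Toy illustration of histogram binning: divide the range [min, max] into
# equal-sized bins and count how many values fall into each.
def histogram(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value lands in the last bin rather than overflowing it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Range is 1..10, so the three bins are [1,4), [4,7) and [7,10]:
print(histogram([1, 2, 2, 3, 8, 9, 10], 3))  # → [4, 0, 3]
```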
7.1.6. Box Plot charts
7.1.6.1. Good for comparing distributions of some measure (e.g. exam score) by some useful attribute (e.g. year)
7.1.6.2. Visualises percentiles in a "box" - e.g. 25th percentile, 50th percentile, 75th percentile
7.1.6.2.1. "Whiskers" either side of box show the "normal" distribution to min and max
7.1.6.2.2. "Outliers" are data points outside of normal distribution (statistical anomalies)
7.1.7. Map charts
7.1.7.1. Good for visualising geographical data
7.1.8. You can also import additional visuals using Python libraries (e.g. bokeh) or R (ggplot)
8. Languages and Libraries
8.1. Supported languages
8.1.1. Scala
8.1.1.1. Pros
8.1.1.1.1. Terse, well-supported language (it is the native language for Spark)
8.1.1.1.2. Best support for native Spark packages
8.1.1.2. Cons
8.1.1.2.1. Least known outside of Spark
8.1.1.2.2. No support for high-concurrency clusters
8.1.2. Python
8.1.2.1. Pros
8.1.2.1.1. Most popular "core" language
8.1.2.1.2. Great support in data science, especially neural networks
8.1.2.1.3. Extensive set of libraries for general-purpose computing
8.1.2.2. Cons
8.1.2.2.1. Often the slowest language
8.1.3. SQL
8.1.3.1. Pros
8.1.3.1.1. Most common language
8.1.3.1.2. Spark has great SQL support
8.1.3.1.3. Separate Catalyst engine can optimize query performance
8.1.3.2. Cons
8.1.3.2.1. SQL is a domain-specific language
8.1.3.2.2. SQL is often a "secondary" language
8.1.4. R
8.1.4.1. Pros
8.1.4.1.1. Excellent support for data science work
8.1.4.1.2. Extensive third-party ecosystem
8.1.4.1.3. Key language
8.1.4.2. Cons
8.1.4.2.1. R is a domain-specific language
8.1.4.2.2. Not all packages support multi-core processing, much less multi-server
8.2. Libraries
8.2.1. External libraries can be added in 5 different ways:
8.2.1.1. Upload files (from your local file system)
8.2.1.2. Load libraries from DBFS (server-side file system of Databricks)
8.2.1.3. Grab Python libraries from PyPi
8.2.1.4. Grab Scala libraries from Maven coordinates
8.2.1.5. Grab R libraries from CRAN
8.2.2. Library modes
8.2.2.1. This refers to the scope of a library that you add to Databricks
8.2.2.2. Workspace-level libraries
8.2.2.2.1. Make libraries available to all users and all clusters
8.2.2.2.2. Put libraries in the Shared folder for multi-user support
8.2.2.2.3. Put libraries in specific Users folder for single-user libraries
8.2.2.2.4. Libraries persist even if you delete all your clusters
8.2.2.3. Cluster-level libraries
8.2.2.3.1. Add to running clusters, the Libraries tab features an Install New button for this
8.2.2.3.2. Libraries will persist if you stop a cluster (and later restart) but will be lost if you delete the cluster
8.2.2.4. Notebook-level libraries
8.2.2.4.1. Certain Python libraries can be installed on a per notebook basis
8.2.2.4.2. Other libraries installed via notebooks will be installed as cluster-level libraries
8.2.3. Creating libraries
8.2.3.1. New shared workspace library
8.2.3.1.1. Library creation options
8.2.3.2. New user-specific workspace library
8.2.3.3. New library in an R notebook (i.e. notebook scoped library)
8.2.3.3.1. This example installs an R library named "gapminder" in a notebook attached to a cluster that did not already have this library installed
8.2.3.3.2. Note that the attached cluster did already have the benford.analysis library installed, which is why the notebook only includes the library() call rather than install.packages()
8.2.3.3.3. Note that installing a library from a notebook means it will be installed on the attached cluster but the library will not show up on the cluster as one of its installed external libraries
8.2.3.4. New library in a Python notebook (i.e. notebook scoped library)
8.2.3.4.1. The example (in screenshot) was used to demonstrate the notebook import feature
8.2.3.4.2. What you actually need to install a notebook scoped library is to use the %pip magic or the %conda magic
8.2.3.4.3. Note: if you use Python import without the library being pre-installed on the cluster, or without a preceding cell that installs it via the %pip or %conda magic, or the dbutils command, you will get an error to the effect that the library is not found
8.2.3.5. Best practices
8.2.3.5.1. Consider different clusters for different use cases
8.2.3.5.2. Adding too many external libraries to a cluster can have a substantial impact on time it takes to spin cluster up
8.2.3.5.3. Some use cases may require older versions of a particular library; this will require separate clusters, as you cannot have two different versions of the same external library installed simultaneously on a single cluster
8.2.4. Library management
8.2.4.1. Moving libraries to different folders
8.2.4.1.1. It's easy for libraries stored in DBFS to become messy in terms of file system organisation, so we can move them
8.2.4.1.2. For example, we can create a Libraries sub-folder under Shared and move a bunch of libraries to this
8.2.4.1.3. Moving library will not affect existing cluster installations
8.2.4.2. Deleting libraries
8.2.4.2.1. Removing external libraries from a cluster is done by selecting and clicking the Uninstall option
8.2.4.2.2. "Uninstalling" is slightly misleading as what happens is that the library is removed from the list of external libraries
8.2.4.2.3. When the cluster is restarted, the external library will no longer be installed
8.2.4.3. Upgrading libraries
8.2.4.3.1. There are no automated updates for libraries
8.2.4.3.2. Basically you need to remove the library and then recreate it (e.g. from PyPI)
9. Integration with Azure Data Factory
9.1. ADF supports the following pipeline activities for Databricks
9.1.1. Notebook
9.1.2. Jar
9.1.3. Python
9.2. To enable ADF to connect to Databricks, we need to create an access token for it in Databricks
9.2.1. In Databricks workspace, click on user icon (top right) and choose User Settings
9.2.2. In User Settings we have the option to generate a new access token
9.2.3. When generating new access token, we give a description (of token's purpose) and an expiry time in days
9.2.3.1. Default expiry period is 90 days but you can change this
9.2.3.2. A good description will identify the service or application that will use the access token for authentication and authorization when connecting to the Databricks service
9.2.4. The new token must be copied immediately as it will no longer be available after the dialog window is closed
9.3. In ADF, we need to create a linked service for Databricks
9.3.1. When creating a linked service, Databricks is one of the options under Compute
9.3.2. In the config properties for the new Databricks linked service, we paste in the secret access token value from the token we generated earlier in Databricks
9.4. In ADF, we create a pipeline with a Databricks Notebook activity
9.4.1. In configuring the Databricks Notebook activity, we bind it to the linked service
9.4.2. In configuring the Databricks Notebook activity we bind it to a notebook hosted in our Databricks workspace
9.5. We can test the pipeline via a manual trigger
9.5.1. The pipeline can be monitored in ADF
9.5.2. We can also observe the job cluster spinning up in Databricks
10. Connecting Databricks to Power BI
10.1. The supported method for this is to connect to Databricks tables as a source for Power BI
10.1.1. This requires an active, running Databricks cluster
10.2. In the Power BI Get Data dialog, there is an option for Azure Databricks under the Azure group
10.3. You will be prompted to enter the server host name and HTTP path
10.3.1. These values can be retrieved via the Databricks UI
10.4. You will be prompted for authentication, and one of the options is to use a (Databricks) personal access token
10.4.1. We can also go with Azure Active Directory authentication
10.5. When you connect, you can see all available Databricks tables and you can select any for loading
10.6. Once connected, you can build out Power BI reports sourced from Azure Databricks tables
11. Databricks Secrets API
11.1. Secrets are sensitive information, such as passwords, access tokens and connection strings, that we don't want to appear in plain text within Databricks notebooks
11.2. Secrets for use in Databricks notebooks can be stored in one of two places: Databricks Secrets or Azure Key Vault
11.2.1. Using the Databricks Secrets API, we can create, retrieve and delete secrets stored in Databricks
11.2.1.1. We can also read secrets from Azure Key Vault (which has its own UI via Azure Portal, plus a separate REST API that can also be used via separate PowerShell module)
11.3. Databricks hosted secrets can have access control lists (ACLs) configured for premium tier workspaces
11.3.1. Permissions: Manage, Write, Read
11.4. Creating a secret in Azure Key Vault
11.4.1. Very easy to provision Azure Key Vault and add a new secret, such as the access token for an ADLS Gen2 storage account
11.5. Creating a secret using Databricks Secrets API and PowerShell
11.5.1. In order to access the Databricks Secrets API and successfully invoke commands, you need to authenticate
11.5.2. Authentication to the Databricks Secrets API can be made using Databricks personal access tokens
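Outside of PowerShell, the same personal access token authenticates direct REST calls: it is sent as a Bearer token in the Authorization header. A minimal sketch using only the Python standard library, with the scopes/list endpoint from the Databricks Secrets REST API; the workspace URL and token value are made up, and the request is constructed but not actually sent.

```python
import urllib.request

# Hypothetical workspace URL and token; a real token comes from User Settings.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapi-EXAMPLE-TOKEN"

# Personal access tokens are passed as a Bearer token in the Authorization
# header; /api/2.0/secrets/scopes/list lists the existing secret scopes.
req = urllib.request.Request(
    url=f"{workspace_url}/api/2.0/secrets/scopes/list",
    headers={"Authorization": f"Bearer {token}"},
)

# urllib.request.urlopen(req) would perform the call; not executed here.
```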
11.5.3. PowerShell console
11.5.3.1. In order to use the Databricks Secrets API via PowerShell, you'll need to install and import the DatabricksPS module and ensure the execution policy is set to RemoteSigned
11.5.3.1.1. Once this is taken care of in a PowerShell session run as Administrator, you should be able to run commands in a non-Admin PowerShell session
11.5.3.2. Create a couple of variables, one for the Databricks personal access token and one for the API root URL
11.5.3.2.1. PowerShell variables start with $ and are not case sensitive
11.5.3.3. Set-DatabricksEnvironment
11.5.3.3.1. This is always the first cmdlet to call in DatabricksPS as it establishes the authenticated connection to the Databricks API
11.5.3.4. Get-DatabricksSecretScope
11.5.3.4.1. Check for the existence of any secrets stored in the Databricks secret scope and list those out
11.5.3.5. Add-DatabricksSecretScope
11.5.3.5.1. Create a new secret scope hosted in Databricks
11.5.3.5.2. Note that to avoid the prompt seen in the screenshot, I should have included the parameter -InitialManagePrincipal "users"
11.5.3.6. Add-DatabricksSecret and Get-DatabricksSecret
11.5.3.6.1. Add a new secret to a specific Databricks secret scope and then list out the secrets in that scope to verify it's been added
11.5.3.6.2. If you reference a scope that is Azure Key Vault backed, you'll get an error because the Databricks Secrets API only permits read-only operations with Azure Key Vault
11.5.3.7. Link an Azure Key Vault instance to Databricks secret scopes
11.5.3.7.1. Rather than via PowerShell, this is done via a hidden part of the Azure Databricks UI
11.5.3.8. Remove-DatabricksSecret
11.5.3.9. Remove-DatabricksSecretScope
11.6. Accessing secrets in a notebook
11.6.1. We declare a variable and assign it a value using dbutils.secrets.get()
11.6.1.1. We pass two string arguments for dbutils.secrets.get(): scope and key
11.6.1.2. See attached example
11.6.1.2.1. Note that variable assignment syntax is identical between Scala and Python apart from Scala requiring the keyword "val" as a prefix
11.6.1.2.2. Note that the secret held in a variable returns "[REDACTED]" if you try to print it
11.6.1.2.3. If someone has access to the Databricks workspace and it's Standard tier (where all users have full access and there is no finer-grained access control), they can work around secret redaction by using a simple for loop to enumerate the secret, character by character
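A minimal sketch of the retrieval call: dbutils is only available inside a Databricks notebook, so it is stubbed here purely to make the snippet self-contained, and the scope and key names are hypothetical.

```python
# dbutils is provided by the Databricks runtime inside a notebook; it is
# stubbed here only so this snippet runs outside Databricks.
class _SecretsStub:
    def get(self, scope, key):
        return "s3cr3t-value"  # a real call returns the stored secret value

class _DbUtilsStub:
    secrets = _SecretsStub()

dbutils = _DbUtilsStub()

# In a notebook the real call looks the same (scope and key are examples):
storage_key = dbutils.secrets.get(scope="argento-scope", key="adls-access-key")

# In a notebook, print(storage_key) shows "[REDACTED]" rather than the value.
```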
12. Apache Spark
12.1. Technology that Databricks is built on
12.2. Free, open source project that implements distributed, in-memory clusters
12.2.1. Part of Apache Hadoop ecosystem
12.2.2. Originally developed at the University of California, Berkeley, and rapidly became one of the largest open source projects in the world
12.3. Spark DataSets
12.3.1. Introduced with Spark 2.0
12.3.2. These are strongly-typed RDDs
12.3.2.1. Type can be a class
12.3.2.1.1. e.g. Employee or Invoice
12.3.2.2. Can be a collection of primitive data types
12.3.2.2.1. e.g. string, int, etc.
12.3.2.3. Can be a single data type
12.4. Spark DataFrames
12.4.1. DataSets with named columns
12.4.2. Structure starts to resemble a table
12.4.3. Key data structure for Spark SQL
12.5. Spark SQL
12.5.1. ANSI-compliant SQL statements can be run on a Spark cluster using DataFrames
12.5.2. Uses its own cost-based optimizer called Catalyst
12.5.2.1. Catalyst also optimises DataFrame and Dataset operations written in Scala, Python or Java, not just SQL - only low-level RDD code bypasses it
12.6. Spark Streaming
12.6.1. Based on concept of Discretized Streams, which build RDDs over very small windows (typically milliseconds) and process each as independent batch
12.6.1.1. Known as the microbatch approach to streaming, which makes streaming easier to understand
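A toy illustration of the microbatch idea in plain Python (not Spark code): slice an incoming stream into small windows and process each window as an independent batch. Real Discretized Streams window by time (typically milliseconds); this sketch windows by count purely for simplicity.

```python
# Toy illustration of microbatching: slice an event stream into fixed-size
# windows and process each window as an independent batch, the way Spark
# Streaming builds an RDD per (time) window.
def microbatches(events, batch_size):
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

stream = [3, 1, 4, 1, 5, 9, 2, 6]

# Each batch is processed independently - here, just summed.
totals = [sum(batch) for batch in microbatches(stream, 3)]
print(totals)  # → [8, 15, 8]
```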
13. Resilient Distributed Datasets (RDDs)
13.1. Dataset is unstructured collection of data
13.1.1. Don't confuse with Spark DataSets
13.2. Distributed means data is shared across multiple nodes and worked on concurrently, allowing linear scale
13.3. Resilient means the driver (control server) notices when an executor (processing node) fails to respond and orders another executor to take over
13.4. RDDs are immutable
13.4.1. It is not possible to modify an existing RDD; you have to create a new RDD from the first one in order to make modifications
13.4.2. Immutability is important to facilitate concurrency, bearing in mind that there will be multiple executors working on the same RDD concurrently
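A plain-Python analogy for that immutability (not Spark code): a transformation never mutates its source, it builds a new dataset, so concurrent readers of the original are always safe.

```python
# Toy analogy for RDD immutability: a "map" transformation leaves the
# source collection untouched and produces a brand new one, much like
# rdd2 = rdd1.map(...) in Spark.
rdd1 = (1, 2, 3, 4)                   # tuples are immutable, like RDDs
rdd2 = tuple(x * 10 for x in rdd1)    # transformation yields a new dataset

print(rdd1)  # → (1, 2, 3, 4)  (unchanged)
print(rdd2)  # → (10, 20, 30, 40)
```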
14. Use cases for Databricks
14.1. Relational database vs Apache Spark (or Hadoop) for processing large data sets
14.1.1. Seeks
14.1.1.1. Relational databases are fast to retrieve specific records within large datasets when configured correctly
14.1.1.2. Spark is slow on single point lookup within large datasets, with higher overhead for such operations
14.1.2. Scans
14.1.2.1. Relational databases are adequate on scan speed for large datasets, and can generally handle 10s of millions of rows quite well, when configured properly
14.1.2.2. Spark delivers great scan speed on large distributed datasets
14.1.3. Memory
14.1.3.1. Relational databases are typically memory limited with performance degrading as data size breaches memory size, resulting in excessive disk I/O
14.1.3.2. Spark uses memory across a cluster of machines
14.1.4. Clusters
14.1.4.1. Relational databases are typically bound to a single machine
14.1.4.1.1. There are clustering options for relational databases but they are more complicated to set up, limited in cluster size and fundamentally the technology is not designed for highly scalable clustering
14.1.4.2. In Spark, clusters are fundamental and it's very easy to scale out by adding nodes to a cluster
14.1.5. Scaling
14.1.5.1. Relational databases are hard to scale out and typically are scaled up in response to workload growth
14.1.5.2. Spark nodes can be scaled up too, but typically you will scale out
14.2. Apache Hadoop vs Apache Spark for processing large data sets
14.2.1. Hadoop is based on MapReduce concept, which is an algorithm for performing computations across a cluster
14.2.1.1. Hadoop's implementation of MapReduce is slow because it involves reading from disk, doing some work, writing results to disk and repeating this cycle over and over
14.2.1.1.1. Apache Tez is a related Apache project that optimises Hadoop workloads by building and using Directed Acyclic Graphs (DAGs) to reduce disk activity for MapReduce jobs
14.2.2. Spark uses in-memory compute
14.2.2.1. Spark still uses MapReduce concept but all in memory rather than on disk
14.2.3. Spark supports interactive analysis, including streaming data, whereas Hadoop is fundamentally suited to batch analysis only (i.e. jobs that run for hours and produce reduced data sets for analysis or reports)
14.3. Use cases for relational databases in favour of Databricks
14.3.1. Transactional processing requiring fast writes and fast point lookups
14.3.2. Single source of truth systems that require ACID properties
14.3.2.1. Systems that have legal or regulatory requirements for correct data with no transaction loss guaranteed
14.3.3. Frequently changing data
14.3.3.1. Core Spark does not support updates - you can delete data and re-insert it, but this is not efficient if you need to do it frequently (Delta Lake adds update support)
14.3.4. Pre-calculated reports
14.3.4.1. This refers to the ease and familiarity of connecting relational databases to access reporting data using tools like Excel
14.3.4.2. Spark can work in conjunction with relational databases on this
14.3.4.2.1. You may have Spark prepare the report data but copy that into a relational database for the report client application to access
14.4. Use cases for Spark in favour of relational databases
14.4.1. Analytical systems
14.4.1.1. When the purpose of system is analytical, Spark excels in this area
14.4.2. Batch reporting systems
14.4.2.1. Spark is also good for batch processing of data to prepare for reporting
14.4.2.1.1. The reporting data prepared by Spark is often landed in a relational database for downstream consumption
14.4.3. Error tolerant systems that can afford small percentage loss of data without causing a serious issue
14.4.3.1. Good example would be web activity data - losing a small number of transactions relating to website clickthrough data for example would not harm the overall value of the analytic system for understanding website usage
14.4.4. Stable data sets
14.4.4.1. Best for batch data insertion with limited deletion and no modification
14.4.5. ELT approach
14.4.5.1. When the goal is to extract data from source systems as fast as possible with zero transformation work and then hand the work of processing that data off to a separate process that does not impact the source systems at all, Spark excels
14.4.5.2. For the load and transform phases, the work is parallelizable
14.5. Spark vs Azure Synapse Analytics
14.5.1. Synapse enables you to perform transformations using SQL and has the same benefits of being highly scalable for massively parallel workload distribution
14.5.1.1. The SQL available in Synapse is limited to a subset of the language due to the parallel nature of the solution
14.5.1.2. Spark enables you to mix languages in order to perform much more complex data transformations than would be possible using Synapse
14.6. Spark vs SSIS or ADF (for data movement)
14.6.1. SSIS has the benefit of having many developers readily available but it requires an on-prem server or an IaaS VM in Azure
14.6.2. ADF has many pros as an ELT/ETL tool when your target is in Azure, and can be expected to become increasingly feature rich
14.6.2.1. Data flows in ADF are actually based on Databricks clusters that spin up in the background, however the transformations available in ADF data flows are more limited than those available from a dedicated Databricks solution
14.7. Databricks provides a tool for both data engineering and data science
14.7.1. Data engineering is all about building ELT/ETL data pipelines to move data and/or clean data
14.7.1.1. Languages of choice are Scala, Python and SQL
14.7.2. Data science is all about the analysis of clean data produced by data engineering
14.7.2.1. Focus is on building models and applying algorithms to those data models for solving business problems, and making results available for loading into downstream systems (e.g. a data warehouse)
14.7.2.2. Languages of choice are R, Python and SQL
14.8. Spark Streaming
14.8.1. Analysing fast-moving data via near real-time ELT (microbatches)
14.8.2. Great for:
14.8.2.1. Data pipelines for real-time dashboards
14.8.2.2. Business process triggers (e.g. flagging bad orders)
14.8.2.3. Anomaly detection
14.8.2.4. Generating recommendations
15. Databricks pricing
15.1. Estimates available via the Azure pricing calculator
15.2. Total cost of ownership includes:
15.2.1. VM instances
15.2.1.1. Driver VM + Worker VMs
15.2.1.2. You pay for all the time these VMs remain provisioned and allocated to you for exclusive use
15.2.1.2.1. Even when a cluster is idle, you pay for the VMs
15.2.1.2.2. Terminating the cluster halts the VM charges
15.2.1.2.3. If you need your VMs to be up most of the time across an average month, you can save substantially on Pay-as-you-go charges by opting for the 1 or 3 year reserved pricing model
15.2.2. Databricks Units (DBUs)
15.2.2.1. Pricing per unit depends on workload and cluster type
15.2.2.2. Each VM configuration carries a DBU rating that reflects its relative performance capability and an associated hourly cost
15.2.2.3. DBU charges apply when workloads are running, not when clusters remain idle
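A worked example of how the two meters combine; every rate below is hypothetical rather than a quoted Azure price, and it assumes (per the notes above) that VM charges run for the whole provisioned time while DBU charges apply only while work runs.

```python
# Hypothetical rates - not quoted Azure prices.
vm_rate_per_hour = 0.60    # per VM-hour while provisioned
dbu_rating = 0.75          # DBUs per node-hour for this VM size
dbu_rate_per_hour = 0.40   # price per DBU-hour for the chosen SKU/workload

nodes = 1 + 4              # 1 driver + 4 workers
provisioned_hours = 10     # VMs are billed for the whole time provisioned
busy_hours = 6             # DBUs are billed only while workloads run

vm_cost = nodes * provisioned_hours * vm_rate_per_hour      # 5 * 10 * 0.60
dbu_cost = nodes * busy_hours * dbu_rating * dbu_rate_per_hour  # 5 * 6 * 0.75 * 0.40
total = vm_cost + dbu_cost
print(round(total, 2))  # → 39.0 (30.0 VM + 9.0 DBU)
```

The split makes the earlier point concrete: an idle-but-provisioned cluster still accrues the VM portion, which is why auto-termination matters.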
15.2.3. Data storage
15.2.3.1. This is a general Azure cost, not Databricks specific
15.2.3.2. The more data you store for Databricks to process regularly, the more VMs you will typically provision to process that data; so, indirectly, the bigger your data for processing, the bigger your VM costs will be
15.2.4. Network bandwidth
15.2.4.1. This is a general Azure cost, not Databricks specific
15.2.4.2. Azure charges for egress, which occurs when data is transferred between Azure regions or outside of Azure
15.3. Region also affects price
15.3.1. I was surprised to note that the price estimate was lower simply when toggling between regions: West Europe (lower price) vs UK South (higher price)
15.4. See attached example of Databricks pricing calculator estimate
15.5. Databricks SKUs
15.5.1. Choose your SKU based on workload type and environment; different SKUs attract different hourly DBU charge rates
15.5.2. Two SKU categories: Standard vs Premium
15.5.2.1. Standard is cheaper than premium
15.5.2.2. Premium gives you access to more advanced security features, including role-based access control (RBAC)
15.5.3. Three workload categories:
15.5.3.1. Data Engineering Light
15.5.3.1.1. Least expensive DBU charge, supports running pre-compiled modules
15.5.3.2. Data Engineering
15.5.3.2.1. Mid DBU charge, supports interactive notebook execution, ML and Delta Lake scenarios
15.5.3.3. Data Analytics
15.5.3.3.1. Most expensive DBU charge, supports everything including full collaboration features, Power BI integration, etc.
15.5.3.4. Remember that Apache Spark examples will work in Databricks
15.5.3.5. It is not clear to me that you can "choose" these workload categories via cluster configuration
15.5.3.5.1. I suspect that out of the box you have all the features, and when you run processes interactively in Databricks you attract the DBU charges associated with Data Analytics, but you can develop automated jobs that run repeatedly and those job runs attract lower DBU charges depending on their nature
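Putting the VM and DBU elements together, total compute cost can be sketched as below. All rates here are hypothetical placeholders; real prices come from the Azure pricing calculator and vary by region, SKU and workload type.

```python
# Rough monthly cost model for an interactive Databricks cluster.
# All rates below are HYPOTHETICAL placeholders - look up current
# VM and DBU prices in the Azure pricing calculator for your region.

VM_RATE_PER_HOUR = 0.50      # assumed pay-as-you-go rate per VM (driver or worker)
DBU_RATE = 0.40              # assumed price per DBU for the chosen SKU/workload
DBU_PER_VM_HOUR = 1.5        # assumed DBU rating of the chosen VM size

def estimate_monthly_cost(workers, provisioned_hours, busy_hours):
    """VM charges accrue for every provisioned hour (even idle);
    DBU charges accrue only while workloads are actually running."""
    vms = workers + 1  # driver + workers
    vm_cost = vms * provisioned_hours * VM_RATE_PER_HOUR
    dbu_cost = vms * busy_hours * DBU_PER_VM_HOUR * DBU_RATE
    return vm_cost + dbu_cost

# e.g. 4 workers, provisioned 200 h/month but busy only 120 h
print(round(estimate_monthly_cost(4, 200, 120), 2))
```

The split between `provisioned_hours` and `busy_hours` is the key point: terminating idle clusters reduces the first term, while only the second term attracts DBU charges.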
16. Loading data
16.1. Supported data sources include:
16.1.1. Manual upload from local machine and files already in DBFS
16.1.1.1. Delimited, ORC and Parquet files
16.1.1.1.1. Note: fixed width files not supported
16.1.1.2. More formats are supported too, including JSON, Avro and Binary files
16.1.1.2.1. The binary file format looks potentially interesting as a catch-all for more complex text-based files, such as complex JSON files
16.1.2. Azure Blob Storage
16.1.3. Azure Data Lake Storage
16.1.4. Kafka
16.1.5. Redis
16.1.6. (Almost) anything supporting JDBC
16.2. Supported structures
16.2.1. DataFrames
16.2.1.1. Prerequisite structure for using Spark SQL
16.2.1.2. This is the "traditional" Spark data structure
16.2.1.3. Has a table-like structure with rows and columns, including column headers
16.2.1.4. Lots of examples on Internet
16.2.2. Delta Lake
16.2.2.1. Supports ACID transactions
16.2.2.2. Can combine streaming and batch data
16.2.2.3. Can enforce schema or allow schema drift, depending on your use case
16.2.2.4. Supports data versioning
16.2.3. Temporary Views
16.2.3.1. Exist for duration of notebook
16.2.3.2. Useful for isolated data structures
16.2.3.3. Global temporary views designed for "applications" based on multiple notebooks, all of which can share the global temporary view for the duration of the application
16.2.4. Views
16.2.4.1. Permanent, accessible by all notebooks and persist across all sessions
16.2.4.2. Metadata only, the data is not materialized (i.e. no copy of the data referenced by the view is made)
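The temporary view, global temporary view and permanent view variants above can be sketched in Spark SQL (the `ratings` table and column names are assumed examples):

```sql
-- Session-scoped: visible only in this notebook's session
CREATE OR REPLACE TEMPORARY VIEW ratings_tmp AS
SELECT * FROM ratings WHERE rating >= 4;

-- Global temporary views live in the global_temp schema and are shared
-- by all notebooks attached to the cluster, for the life of the cluster
CREATE OR REPLACE GLOBAL TEMPORARY VIEW ratings_shared AS
SELECT * FROM ratings;

SELECT * FROM global_temp.ratings_shared LIMIT 10;

-- Permanent view: metadata only, persisted in the metastore,
-- visible to all notebooks and sessions
CREATE OR REPLACE VIEW ratings_top AS
SELECT userId, movieId FROM ratings WHERE rating = 5;
```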
16.2.5. Tables
16.2.5.1. Data is permanently materialized
16.2.5.2. Typically stored in Hive metastore
16.2.5.3. Allows inserts only
16.2.5.3.1. If you require updates or deletes to a Databricks table, you need to drop and recreate table (i.e. clear it down and re-insert new content from scratch)
16.2.6. Delta Tables
16.2.6.1. Tables created with keywords USING DELTA
16.2.6.2. Allows Inserts, Updates and Deletes
16.2.6.3. Stores version history and supports temporal table queries
16.2.6.4. Delta Lake, which underpins Delta tables, is an open source project, but it originated at Databricks and is most tightly integrated there
16.2.6.5. Delta tables support the relational database concept of primary and foreign keys, but these are informational only, not enforced constraints
16.2.6.5.1. Catalyst (the query optimizer for Spark SQL) uses primary and foreign keys for producing execution plans
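A minimal Delta table sketch showing `USING DELTA` plus the update, delete and time-travel abilities described above (the `movie_ratings` table and its columns are assumed examples):

```sql
-- Delta table: created with USING DELTA, supports updates/deletes and time travel
CREATE TABLE IF NOT EXISTS movie_ratings (
  userId  INT,
  movieId INT,
  rating  DOUBLE
) USING DELTA;

-- Unlike plain tables, Delta tables allow in-place updates and deletes
UPDATE movie_ratings SET rating = 5.0 WHERE userId = 1 AND movieId = 1;
DELETE FROM movie_ratings WHERE rating < 1.0;

-- Time travel: query an earlier version of the table
SELECT * FROM movie_ratings VERSION AS OF 0;
```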
16.3. Import data process
16.3.1. Data | Add Data | Create New Table | Upload File
16.3.1.1. Browse and navigate to local file for upload
16.3.1.1.1. The file will be uploaded into DBFS
16.3.1.1.2. For training demo, we uploaded ratings.csv from MovieLens database, which is 0.7GB
16.3.1.2. Once file uploaded, you can use it to create a table using the UI or a notebook
16.3.1.2.1. Create table with UI
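Creating the table from a notebook instead of the UI might look like the following Spark SQL; the DBFS path below assumes the default upload location and may differ in your workspace:

```sql
-- ratings.csv was uploaded to DBFS via Data | Add Data; path is an assumed example
CREATE TABLE IF NOT EXISTS ratings
USING CSV
OPTIONS (
  path 'dbfs:/FileStore/tables/ratings.csv',
  header 'true',       -- first row contains column names
  inferSchema 'true'   -- let Spark detect column types (extra pass over the data)
);
```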
16.3.2. Bringing in data from Azure Blob Storage
16.3.2.1. Upload CSV file to Azure Blob Storage using Azure Storage Explorer
16.3.2.1.1. Data | Add Data | Create New Table | Other Data Sources
16.3.3. Bringing in data from Azure SQL Server
16.3.3.1. Data | Add Data | Create New Table | Other Data Sources
16.3.3.1.1. Set Connector to JDBC
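As a sketch, registering an Azure SQL table over JDBC can also be done in Spark SQL; the server, database, table and credentials below are all placeholders:

```sql
-- JDBC-backed table definition; values in angle brackets are placeholders
CREATE TABLE IF NOT EXISTS sales_orders
USING JDBC
OPTIONS (
  url 'jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>',
  dbtable 'dbo.SalesOrders',
  user '<username>',
  password '<password>'
);
```

In practice the credentials would be better kept in a secret scope than inline in a notebook.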
17. Notebooks
17.1. Notebooks provide the integrated development environment (IDE) for developing in Databricks, and they also provide the means to document your development
17.1.1. You can combine code, images and documentation in one document
17.2. Notebooks help solve the replication problem
17.2.1. Scripts are usually developed iteratively, and during their lifecycle they will transfer from one developer to another
17.2.2. When script handover occurs, a common issue is that the script fails to run for the new developer: the "worked on my machine" scenario
17.2.2.1. Notebooks really help to avoid the "worked on my machine" scenario
17.2.3. Notebooks hold state like REPLs
17.2.3.1. REPL stands for Read-Evaluate-Print-Loop
17.2.3.1.1. The idea is that code is read interactively, evaluated, the results printed, and the loop then returns for the next block of code
17.2.3.2. Applied to notebooks, each code block (cell) is read, evaluated and results displayed in the notebook, and this process is repeatable for each cell in the notebook
17.2.3.2.1. Furthermore, each code block execution can change the state of variables that other code blocks in the notebook can access
17.2.3.3. Even after the cluster has been terminated and you've logged out of Azure, you still get to see all the results of cell executions (e.g. schema details, tabular query results, visualisations, etc.)
17.2.3.4. Via the Clear menu, you have the option to clear state or results, or both, and re-run the notebook end to end
17.3. Documentation is enabled using the Markdown or HTML syntax
17.3.1. Markdown
17.3.1.1. Start cell with %md
17.3.1.2. Press ESC to switch from Edit mode to Command mode, which renders and displays your markdown in the block
17.3.1.3. # <some text>
17.3.1.3.1. Heading 1 font
17.3.1.3.2. Repeat # up to 6 times to get Headings 1 to 6
17.3.1.3.3. examples:
17.3.1.4. *<some text>*
17.3.1.4.1. Italicise text
17.3.1.4.2. example:
17.3.1.5. **<some text>**
17.3.1.5.1. Bold text
17.3.1.5.2. example:
17.3.1.6. ***<some text>***
17.3.1.6.1. Bold + italics
17.3.1.6.2. example:
17.3.1.7. >
17.3.1.7.1. block quote
17.3.1.8. `<some text>`
17.3.1.8.1. Inline code
17.3.1.8.2. example:
17.3.1.8.3. Note: requires backtick, not single quote
17.3.1.8.4. Use 3 backticks for multiline code
17.3.1.9. -, + or *
17.3.1.9.1. bullet list
17.3.1.10. ---
17.3.1.10.1. Section divider line
17.3.1.11. 1.
17.3.1.11.1. numbered list
17.3.1.12. <some email address>
17.3.1.12.1. Wrapping an email address in angle brackets creates a mailto link (a URL in angle brackets becomes a regular hyperlink)
17.3.1.13. Tables can be created using vertical pipes, with dashes on second line to indicate first line is a header for table
17.3.1.13.1. See attached example
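For reference alongside the attached example, a minimal markdown table cell looks like this (column names are placeholders):

```
%md
| Column A | Column B |
| -------- | -------- |
| value 1  | value 2  |
| value 3  | value 4  |
```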
17.3.2. Note: to use HTML tags in a cell, you also need to start cell with %md
17.3.2.1. Links with the <a> tag
17.3.2.1.1. See attached example
17.3.2.2. (Web) Images with the <img> tag
17.3.2.2.1. See attached example
17.4. When you create a new blank notebook, you choose its default language
17.4.1. Magics are commands that let you change the code in a cell to a language other than the notebook's default
17.4.1.1. You reference magics at top of cell with %lang
17.4.1.1.1. %python
17.4.1.1.2. %r
17.4.1.1.3. %scala
17.4.1.1.4. %sql
17.4.1.1.5. %sh
17.4.1.1.6. %fs
17.4.1.1.7. %md
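For example, in a notebook whose default language is Python, a single cell can be switched to SQL with the %sql magic (the `ratings` table is an assumed example):

```
%sql
-- This cell runs as Spark SQL even though the notebook's default language is Python
SELECT COUNT(*) AS rating_count FROM ratings
```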
17.5. When you choose the Databricks Standard SKU, the Permissions menu will be greyed out for notebooks
17.5.1. You can check your SKU (pricing tier) via Azure Portal
17.6. Comments can be added, which works in a similar way to Microsoft Word
17.6.1. Comments attach to cells and could be a perfect place for peer review comments
17.7. Keyboard shortcuts
17.7.1. You can see a list of these any time by clicking the keyboard icon
17.7.2. ESC
17.7.2.1. With cell selected and in Edit mode, this switches to Command mode
17.7.2.1.1. For cells with Markdown content (i.e. started with %md), this will trigger the markdown to render
17.7.2.1.2. For cells with code, this simply exits Edit mode and displays the code
17.7.3. Shortcuts are grouped by two modes: Edit mode and Command mode
17.7.3.1. Edit mode
17.7.3.1.1. When you are inside a cell, editing its contents
17.7.3.1.2. Press ENTER to switch selected cell to Edit mode
17.7.3.1.3. CTRL-ENTER
17.7.3.1.4. CTRL-ALT-D
17.7.3.1.5. CTRL-ALT-X
17.7.3.1.6. CTRL-ALT-C
17.7.3.1.7. CTRL-ALT-V
17.7.3.1.8. CTRL-ALT-P
17.7.3.1.9. CTRL-ALT-N
17.7.3.1.10. CTRL-ALT-F
17.7.3.1.11. ALT-ENTER
17.7.3.1.12. SHIFT-ENTER
17.7.3.1.13. CTRL-]
17.7.3.1.14. CTRL-[
17.7.3.1.15. CTRL-/
17.7.3.2. Command mode
17.7.3.2.1. When you have taken a step back from editing and are thinking about code execution, notebook navigation, or making structural changes (adding/deleting cells, etc.)
17.7.3.2.2. Press ESC to switch current cell being edited to Command mode
17.7.3.2.3. CTRL-ENTER
17.7.3.2.4. D-D
17.7.3.2.5. Shift-D-D
17.7.3.2.6. X
17.7.3.2.7. C
17.7.3.2.8. V
17.7.3.2.9. SHIFT-V
17.7.3.2.10. A
17.7.3.2.11. B
17.7.3.2.12. O
17.7.3.2.13. SHIFT-M
17.7.3.2.14. CTRL-ALT-F
17.7.3.2.15. SHIFT-ENTER
17.7.3.2.16. Z
18. Dashboards
18.1. This is a notebook feature, which allows you to develop custom dashboards that allow you to present tables and visualisations from your notebook
18.2. Create a new dashboard in your notebook
18.2.1. View | New dashboard
18.2.1.1. 1. Set name of dashboard
18.2.1.1.1. e.g. Test Dashboard
18.2.1.2. 2. Arrange items on the dashboard
18.2.1.2.1. You get all your markdown cells and all the code cell results that return tables and other visualisations
18.2.1.2.2. You can resize tiles or delete them
18.2.1.2.3. For tables and chart tiles, you can click the settings icon on the tile and set a title for the tile
18.2.1.2.4. You can change the dashboard width in order to fit more tiles horizontally
18.2.1.3. 3. Click Present Dashboard
18.2.1.3.1. Your screen presents dashboard in full screen mode
18.3. You can create multiple dashboards per notebook
19. Scheduling jobs
19.1. Scheduling jobs from notebooks
19.1.1. Databricks automatically uses non-interactive clusters for scheduled jobs
19.1.1.1. This means it won't use any clusters you may have provisioned yourself
19.1.1.2. The benefit of this is cost saving - the job clusters attract lower costs than the interactive clusters
19.1.1.3. The clusters auto terminate once notebook completes
19.1.2. From inside a notebook, click the Schedule button to create a schedule
19.1.3. Once a schedule has been created, the job appears in the Jobs list
19.1.3.1. When the status icon is green, this indicates the job is active and running right now
19.1.4. You can drill into job results when a job fails
19.1.4.1. In my case, my first job failed because the default size of my job cluster was 8 worker nodes and that exceeded my quota
19.1.4.1.1. This is easily remedied by editing the job and then the job cluster to reduce the number of worker nodes
19.1.4.2. If the notebook has a dependency on an external library, it won't be sufficient for that library to be installed on your interactive cluster(s), which is another common cause of job failure
19.1.4.2.1. This is fixed by editing the job and clicking the option to add dependent libraries
19.2. You can create multiple schedules for a single notebook
19.2.1. Typical use case for this is when a notebook uses parameters and you want different scheduled runs to use different parameter values
19.3. You can delete jobs from the Jobs list if they are no longer needed
19.4. You can edit jobs and remove their schedule
19.4.1. This means the job will only run on demand
19.4.2. On-demand runs can be triggered manually via the UI or by invoking the job using a REST API call
19.4.2.1. Think Data Factory pipeline invoking a Databricks job at the appropriate moment in a daily batch process
19.5. Databricks saves information about job runs for 60 days
19.6. Scheduling jobs outside of notebooks
19.6.1. Databricks allows you to schedule and execute packaged Java code for Spark, packaged in a JAR file
19.6.2. Under Jobs, you have options to set JAR or configure spark-submit
19.6.2.1. I think you use the set JAR dialog when the JAR file does not yet exist in DBFS, and configure spark-submit if it's already in DBFS
19.6.3. See attached example we used in the PragmaticWorks course
19.6.4. Once created, the job can be run like any other job, either via a schedule or manually, and results reviewed via the logs
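A sketch of what such a job definition might look like as a Jobs API 2.0 "create" payload; the JAR path, main class name, cron expression and cluster sizing below are all assumed examples, not values from the course:

```python
import json

# Sketch of a Jobs API 2.0 "create" payload for a scheduled JAR job.
# The JAR path, class name and cron schedule are HYPOTHETICAL examples.
job_spec = {
    "name": "nightly-jar-job",
    "new_cluster": {                      # job cluster: spun up per run, auto-terminates
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,                 # keep within your subscription's core quota
    },
    "libraries": [{"jar": "dbfs:/FileStore/jars/my-spark-app.jar"}],
    "spark_jar_task": {"main_class_name": "com.example.Main"},
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # 02:00 daily
        "timezone_id": "Europe/London",
    },
}

print(json.dumps(job_spec, indent=2))
```

This is the same structure the UI builds for you; POSTing it to the Jobs API achieves the same result programmatically.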
20. Databricks REST API
20.1. Databricks has multiple REST APIs, including the following:
20.1.1. Clusters API
20.1.1.1. Some of the actions supported via the API:
20.1.1.1.1. List
20.1.1.1.2. Get
20.1.1.1.3. Create
20.1.1.1.4. Edit
20.1.1.1.5. Start / Restart
20.1.1.1.6. Terminate / Delete
20.1.2. DBFS API
20.1.2.1. Some of the supported actions via the API:
20.1.2.1.1. List
20.1.2.1.2. Mkdirs
20.1.2.1.3. Create
20.1.2.1.4. Delete
20.1.2.1.5. Move
20.1.2.1.6. Read
20.1.3. Jobs API
20.1.3.1. Some of the supported actions via the API:
20.1.3.1.1. Create
20.1.3.1.2. List
20.1.3.1.3. Get
20.1.3.1.4. Delete
20.1.3.1.5. Run Now
20.1.3.1.6. Runs List / Get / Export / Cancel
20.1.4. Libraries API
20.1.4.1. Some of the supported actions via the API:
20.1.4.1.1. All Cluster Statuses
20.1.4.1.2. Cluster Status
20.1.4.1.3. Install
20.1.4.1.4. Uninstall
20.1.5. Secrets API
20.1.5.1. Actions pertaining to:
20.1.5.1.1. Secret scopes
20.1.5.1.2. Secrets
20.1.5.1.3. ACLs
20.1.6. Workspace API
20.1.6.1. Some of the supported actions via the API:
20.1.6.1.1. Delete
20.1.6.1.2. Export
20.1.6.1.3. Get Status
20.1.6.1.4. Import
20.1.6.1.5. List
20.1.6.1.6. Mkdirs
20.2. All of these APIs allow us to interact with Databricks without using the UI
20.2.1. This becomes useful for automation scenarios
20.3. Access Databricks REST API using Postman
20.3.1. Postman is a user friendly tool for using REST APIs, and is free for personal use
20.3.2. Start by creating a new access token for Postman in Databricks
20.3.3. After starting up Postman, create a new request and configure the Authorization
20.3.3.1. Set Type to Bearer token
20.3.3.2. Paste in the Databricks access token
20.3.4. Prepare URL with REST API call and click Send button (HTTP GET method)
20.3.4.1. The base URL can be either of the following:
20.3.4.1.1. <region>.azuredatabricks.net
20.3.4.1.2. <instance>.azuredatabricks.net
20.3.4.2. To direct the request to the REST API, you append "/api/2.0" to the base URL
20.3.4.2.1. See link for confirmation, as version of API may change over time
20.3.4.2.2. The final part of the URL identifies the required API and the required action
20.3.4.3. Here's another example using get action in clusters API
20.3.4.3.1. Note that the "?" following get introduces what are known as "query parameters"
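The same GET call can be reproduced outside Postman. This sketch just builds the URL and headers; the region, cluster id and token placeholder are assumptions, and the actual send is left commented out because it needs a live workspace:

```python
# Building a Clusters API "get" call; query parameters follow the "?".
# The workspace region and cluster id are HYPOTHETICAL examples.
from urllib.parse import urlencode

base_url = "https://uksouth.azuredatabricks.net"
endpoint = "/api/2.0/clusters/get"
params = {"cluster_id": "0131-113240-abc123"}

url = f"{base_url}{endpoint}?{urlencode(params)}"
headers = {"Authorization": "Bearer <databricks_access_token>"}  # personal access token

print(url)
# To actually send the request (needs the requests package and a valid token):
# import requests
# response = requests.get(url, headers=headers)
# print(response.json())
```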
20.3.5. The HTTP GET method can only be used for certain API actions that just return info (in JSON format)
20.3.5.1. Other API actions require the HTTP POST method
20.3.5.1.1. Unlike the GET method, the POST method always requires a Body
20.3.5.1.2. Here's an example using create action in clusters API
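A sketch of a minimal "create" body for the Clusters API, sent as the POST Body; the node type, Spark version and autotermination value are example choices, not recommendations:

```python
import json

# Sketch of a Clusters API "create" body sent with HTTP POST.
# Node type, Spark version and autotermination are example choices.
create_body = {
    "cluster_name": "api-created-cluster",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "autotermination_minutes": 30,   # avoid paying for an idle cluster indefinitely
}

body = json.dumps(create_body)
print(body)
# POST this body to https://<region>.azuredatabricks.net/api/2.0/clusters/create
# with the same Bearer-token Authorization header used for GET requests.
```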
20.4. Access Databricks REST API using PowerShell
20.4.1. There is a community-driven PowerShell module by Gerhard Brueckl that interacts with the Databricks APIs
20.4.1.1. Run PowerShell as Administrator and run the following commands to set things up:
20.4.1.1.1. Install-Module DatabricksPS
20.4.1.1.2. Import-Module DatabricksPS
20.4.1.2. In the PowerShell gallery, under package details, you can see the full list of functions available for the DatabricksPS module
20.4.2. We need a Databricks access token, which we store in a variable using PowerShell
20.4.2.1. Run this command in PowerShell:
20.4.2.1.1. $accessToken = "<databricks_access_token>"
20.4.3. The final bit of setup before running cmdlets that invoke the Databricks REST API is to set the Databricks environment
20.4.3.1. Run these commands in PowerShell:
20.4.3.1.1. $apiRootUrl = "https://uksouth.azuredatabricks.net"
20.4.3.1.2. Set-DatabricksEnvironment -AccessToken $accessToken -ApiRootUrl $apiRootUrl
20.4.4. Examples:
20.4.4.1. Get-DatabricksCluster
20.4.4.1.1. Note that this only lists the interactive clusters, but if you want to see all the job clusters too, you can run this command:
20.4.4.1.2. Repeated use of the <Tab> key in PowerShell is really useful for seeing the available options, both for cmdlet names and parameters
20.4.4.2. Add-DatabricksCluster
20.4.4.2.1. 4 parameters are mandatory:
20.4.4.3. Stop-DatabricksCluster
20.4.4.3.1. -ClusterID parameter (string) is required
20.4.4.4. Export-DatabricksWorkspaceItem
20.4.4.4.1. Using this function, we can export notebooks to a local file system
20.4.4.4.2. 3 required parameters:
20.4.4.5. Get-DatabricksFSItem
20.4.4.5.1. Using this function we can browse the DBFS for our environment
20.4.4.5.2. -Path
20.4.4.5.3. -ChildItems
20.4.4.6. Upload-DatabricksFSFile
20.4.4.6.1. Using this function we can upload files to DBFS from the local file system
20.4.4.6.2. 2 required parameters:
20.4.4.7. Remove-DatabricksFSItem
20.4.4.7.1. Using this function we can delete files and directories in DBFS
20.4.4.7.2. -Path
21. Databricks security
21.1. There are a number of features that are supported only when the provisioned Azure Databricks service is tied to the premium pricing tier
21.2. Admin console
21.2.1. Users
21.2.1.1. The Admin and Allow cluster creation options are checked but inaccessible for standard pricing tier
21.2.1.2. When you add users they must first exist in Azure Active Directory
21.2.1.2.1. You can add users that don't exist in AAD but they won't be able to access Azure Databricks until they are set up in AAD
21.2.2. Groups
21.2.2.1. When creating a new group, you can add either existing users or groups as members
21.2.2.2. You can set entitlements for the group, which only specifies whether or not group members can create clusters
21.2.2.2.1. For the standard tier, I believe these settings are available but essentially useless, because under standard pricing all added users have the Admin role and permission for cluster creation
21.2.2.3. When using group hierarchies (groups added to groups) bear in mind that the entitlements of the parent group automatically apply down to the child
21.2.2.3.1. So an apparent "deny" permission (i.e. create cluster permission turned off) at child group level is overridden by an "allow" (cluster creation) permission at the parent group level
21.2.3. Workspace Storage
21.2.3.1. Deleted notebooks, folders, libraries, and experiments are recoverable from the trash for 30 days
21.2.3.1.1. Clicking Purge for workspace storage allows you to "empty the trash" permanently and stop paying for the storage of that trash, but you will no longer be able to recover its contents
21.2.3.2. Notebook revision history is automatically maintained, which is great but the more revision history is built up for a notebook, the more that adds to storage cost
21.2.3.2.1. Clicking Purge for revision history in combination with choosing a timeframe allows you to stay on top of revision history and get rid of everything permanently that was captured outside of selected timeframe
21.2.3.3. Clusters automatically maintain event logs, driver logs and metric snapshots, even for terminated clusters, which builds up over time and consumes storage in the background
21.2.3.3.1. Clicking Purge for cluster logs permanently gets rid of all cluster logs (event and driver) and metric snapshots
21.2.4. Access Control
21.2.4.1. 3 out of 4 options are disabled and can only be enabled when the premium pricing tier is selected
21.2.4.1.1. Workspace access control (premium only - always disabled in standard tier)
21.2.4.1.2. Cluster, pool and jobs access control (premium only - always disabled in standard tier)
21.2.4.1.3. Table access control (premium tier only - always disabled in standard tier)
21.2.4.2. Personal access tokens is enabled by default and setting is controllable via standard and premium tiers
21.2.5. Advanced
21.2.5.1. In the PragmaticWorks training course, there was only one option here relating to enabling a runtime for genomics but when I was studying this in Oct 2020 the options here expanded greatly to 12
21.2.5.1.1. It appears that all advanced options are available in standard tier
21.2.5.1.2. A few notable things that can be controlled here include the following (all enabled by default):
21.2.6. Global Init Scripts
21.2.6.1. This option did not even exist when PragmaticWorks created their Databricks training course
21.2.6.2. Global init scripts run on all cluster nodes launched in your workspace.
21.2.6.3. They can help you to enforce consistent cluster configurations across your workspace in a safe, visible, and secure manner.
21.3. Azure Active Directory integration
21.3.1. Enables Single Sign On (SSO) for Databricks
21.3.2. Conditional access is an AAD feature that allows you to restrict access to Databricks workspaces based upon location, requiring multi-factor authentication, etc.
21.3.2.1. We could restrict user access to Databricks to a particular VNet this way, I think
21.3.2.2. Conditional access is only available via AAD Premium (not Standard)
21.4. System for Cross-domain Identity Management (SCIM)
21.4.1. This is an open standard for automating user provisioning
21.4.2. Databricks REST API includes a SCIM API that enables you to programmatically create, update and delete users and groups
21.5. Role Based Access Control (RBAC)
21.5.1. RBAC is available across many Azure services, including Databricks, for Identity Access Management (IAM)
21.5.2. RBAC only applies to Databricks when you choose the Premium tier
21.6. Implementing Table Access Control
21.6.1. After enabling Table Access Control via the Admin console (premium tier option only), the next step for setting up the access is to provision a cluster
21.6.1.1. To support table access control, the cluster must be provisioned with the "High Concurrency" cluster mode (not Standard)
21.6.1.1.1. The advanced option for enabling table access control is only visible after it's already been enabled via the Admin console
21.6.2. Once you have a secure cluster provisioned (i.e. a high concurrency cluster with table access control enabled) this allows you to run security related commands in SQL, for example
21.6.2.1. Example of changing table owner via SQL statement in a notebook
21.6.2.1.1. Note that this command can only work if the notebook is attached to a "secure" cluster
21.6.2.2. Example of granting a user Select access to a table
21.6.2.3. Example of denying a user access to a table
21.6.2.4. Example of granting a group Select access to a table
21.6.2.4.1. Group needs to exist under Groups in the Admin console
21.6.2.5. Example of granting a user Select permission to a database
21.6.2.6. Any attempt to run Scala or R code will fail when the notebook is connected to a cluster that supports table access control
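Pulling the examples above together, the security commands might look like the following in a SQL notebook cell attached to a secure cluster; all user, group, table and database names are placeholders:

```sql
-- Requires a High Concurrency cluster with table access control enabled;
-- names below are PLACEHOLDERS
ALTER TABLE movie_ratings OWNER TO `admin@example.com`;

-- Grant and deny access for individual users
GRANT SELECT ON TABLE movie_ratings TO `user@example.com`;
DENY SELECT ON TABLE movie_ratings TO `other.user@example.com`;

-- Grant access to a group (must exist under Groups in the Admin console)
GRANT SELECT ON TABLE movie_ratings TO `data-analysts`;

-- Grant access at database level
GRANT SELECT ON DATABASE default TO `user@example.com`;
```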
21.7. BYO VNET
21.7.1. Stands for Bring Your Own Virtual Network
21.7.2. This is a feature that allows you to migrate your Databricks workspace into a customer-managed VNet in Azure
21.7.2.1. You can then use another feature called VNet Peering to pair the Databricks VNet to another Azure VNet that hosts a VNet Gateway, which in turn enables secure access to on-prem data
21.7.2.2. You can grant access to specific Azure endpoints (e.g. ADLS Gen2)
21.7.2.3. You can use custom DNS
21.8. Azure AD Passthrough
21.8.1. This feature allows you to connect directly to ADLS Gen1 or Gen2 storage from Databricks using the same AAD credentials used to access Azure Databricks
21.8.1.1. Only High Concurrency clusters allowed
21.8.1.1.1. No support for Scala, only Python, SQL and R
21.8.1.2. Passthrough clusters default to using Data Lake storage, not DBFS
21.8.1.2.1. You need to update Spark Config to allow DBFS access