Azure Cosmos DB & NoSQL Basics

1. Azure Storage Accounts

1.1. Azure Storage Accounts support multiple storage services:

1.1.1. Azure Table Storage

1.1.1.1. Azure Table storage is a service that stores non-relational structured data (also known as structured NoSQL data) in the cloud, providing a key/attribute store with a schemaless design

1.1.1.1.1. This makes Azure Table Storage a NoSQL datastore

1.1.1.2. Because Table storage is schemaless, it's easy to adapt your data as the needs of your application evolve

1.1.1.3. Access to Table storage data is fast and cost-effective for many types of applications, and is typically lower in cost than traditional SQL for similar volumes of data

1.1.1.4. You can use Table storage to store flexible datasets like user data for web applications, address books, device information, or other types of metadata your service requires

1.1.1.5. You can store any number of entities in a table, and a storage account may contain any number of tables, up to the capacity limit of the storage account

1.1.1.6. Note: Microsoft now pushes the Azure Cosmos DB Table API as the preferred alternative to Azure Table Storage

1.1.1.6.1. This is considered the premium offering for Azure Table Storage, and it delivers superior performance

1.1.1.7. Use cases:

1.1.1.7.1. Storing TBs of structured data capable of serving web scale applications

1.1.1.7.2. For datasets that don't need complex joins, foreign keys or stored procedures

1.1.1.7.3. Supports a clustered index (on PartitionKey + RowKey) for rapid data queries

1.1.1.7.4. Data access using OData protocol and LINQ queries via the SDK

1.1.1.8. Table storage concepts

1.1.1.8.1. Within each Azure storage account, you can create multiple tables

1.1.1.8.2. Each conceptual row within a table is referred to as an entity

1.1.1.8.3. Each entity can hold up to 252 custom properties as key-value pairs, in addition to the three system properties (PartitionKey, RowKey and Timestamp)

1.1.1.8.4. URL access is in the format: http://<storage account>.table.core.windows.net/<table>

1.1.1.9. Table storage SLA

1.1.1.9.1. Microsoft provides an SLA for the performance of all its storage services

1.1.1.9.2. Actual response times are expected to be much lower than the SLA targets

1.1.1.9.3. At time of writing (April 2021):

1.1.2. Azure Blob Storage

1.1.2.1. Azure Blob storage is Microsoft's object storage solution for the cloud

1.1.2.2. Blob storage is optimized for storing massive amounts of unstructured data

1.1.2.3. It is very widely used by other Azure services

1.1.2.4. Use cases:

1.1.2.4.1. Serving documents to a web browser

1.1.2.4.2. Storing data for backup and restore

1.1.2.4.3. Streaming video and audio

1.1.2.4.4. Storing data for analysis

1.1.2.4.5. Storing log files

1.1.2.4.6. Storing files for distributed access

1.1.3. Azure File Storage

1.1.3.1. Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol or Network File System (NFS) protocol

1.1.3.2. Can be used to replace on-premises file servers

1.1.3.3. Supports lift-and-shift scenarios for moving on-premises applications to the cloud

1.1.4. Azure Queue Storage

1.1.4.1. Azure Queue Storage is a service for storing large numbers of messages

1.1.4.2. You access messages from anywhere in the world via authenticated calls using HTTP or HTTPS

1.1.4.3. A queue message can be up to 64 KB in size

1.1.4.4. A queue may contain millions of messages, up to the total capacity limit of a storage account

1.1.4.5. Queues are commonly used to create a backlog of work to process asynchronously

1.1.4.5.1. Example: a website that converts uploaded bitmap files to JPEG can add each client request to a queue; the converter process reads from the queue and handles those requests asynchronously, preventing the website from being overwhelmed by too many synchronous requests
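
A minimal Python sketch of this producer/consumer pattern, using the azure-storage-queue SDK (the connection string, queue name and message payload are placeholders):

    from azure.storage.queue import QueueClient

    # Connect to (and create) the work queue
    queue = QueueClient.from_connection_string("<connection-string>", queue_name="convert-requests")
    queue.create_queue()

    # Producer: the website enqueues a conversion request and returns immediately
    queue.send_message('{"blob": "uploads/photo.bmp", "target": "jpeg"}')

    # Consumer: the converter process drains the queue at its own pace
    for msg in queue.receive_messages():
        print("processing", msg.content)
        queue.delete_message(msg)  # remove the message once it has been handled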

1.2. Azure Storage Accounts are accessible from anywhere in the world using HTTP or HTTPS

1.3. Azure Storage Accounts are durable, highly available, secure and massively scalable

1.4. There are 5 account types (2 of which are legacy):

1.4.1. General-purpose v1

1.4.1.1. Legacy basic storage account, use v2 instead

1.4.2. General-purpose v2

1.4.2.1. Basic storage account type for blobs, files, queues and tables

1.4.3. BlockBlobStorage

1.4.3.1. Blob-only storage accounts with premium performance

1.4.4. FileStorage

1.4.4.1. Files-only storage accounts with premium performance

1.4.5. Blob storage

1.4.5.1. Legacy blob-only storage, use General-purpose v2 instead

1.5. Storage Account Security

1.5.1. Management Security

1.5.1.1. Operations that affect the storage account itself, such as the permission to create or delete a storage account

1.5.1.2. RBAC roles are used for this

1.5.1.2.1. Users, groups and applications can be assigned RBAC roles for specific storage accounts

1.5.2. Data Access Security

1.5.2.1. All data access is blocked by default

1.5.2.2. Storage Account Keys grant complete access to all data in the storage account

1.5.2.3. Shared Access Signatures (SAS) give required permissions for a limited period of time

1.5.2.3.1. This is the recommended way to grant data access

1.5.2.3.2. SAS tokens can be generated via the Azure Portal or programmatically

1.5.2.3.3. They are typically signed using the storage account key

1.5.2.3.4. Once generated, you get both the token and a URI that can be shared with the user or application that needs access

1.5.2.3.5. As well as restricting the permissions and the time period, you can limit the IP addresses that can use the SAS token and enforce HTTPS only (i.e. no HTTP)
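
As a hedged illustration, here is how a read-only, time-limited, IP-restricted, HTTPS-only SAS for a single blob could be generated with the Python azure-storage-blob SDK (account, container, blob and key values are placeholders):

    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import generate_blob_sas, BlobSasPermissions

    sas_token = generate_blob_sas(
        account_name="mystorageacct",
        container_name="reports",
        blob_name="q1.pdf",
        account_key="<account-key>",               # a SAS is signed with the storage account key
        permission=BlobSasPermissions(read=True),  # read-only
        expiry=datetime.now(timezone.utc) + timedelta(hours=1),  # limited time period
        ip="203.0.113.0-203.0.113.255",            # restrict the caller IP range
        protocol="https",                          # HTTPS only (no HTTP)
    )

    # Share the token appended to the blob URI
    url = f"https://mystorageacct.blob.core.windows.net/reports/q1.pdf?{sas_token}"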

1.5.3. Encryption for data in transit

1.5.3.1. HTTPS is recommended when calling the REST APIs or accessing objects in storage

1.5.3.1.1. HTTPS ensures encryption of all data in transit

1.5.3.1.2. When securing access to data in a storage account using SAS tokens, you can apply the option to enforce use of HTTPS

1.5.4. Encryption for data at rest

1.5.4.1. Storage Service Encryption (SSE)

1.5.4.1.1. This is automatically enabled for all storage accounts and ensures that all data is securely encrypted at rest

1.5.4.1.2. SSE cannot be manually disabled

1.5.4.2. Client-side encryption

1.5.4.2.1. This involves programmatically encrypting data in the client application and then sending it across the wire

1.5.4.2.2. This can be done using the Azure Storage SDK

1.5.5. Auditing and monitoring access

1.5.5.1. We can add/enable diagnostic settings in our storage account via the Azure Portal, which can log storage read, write and delete events

1.5.6. CORS

1.5.6.1. Stands for Cross-Origin Resource Sharing

1.5.6.2. When a web browser gets a web page from a web server, that page might include resources from domains other than that of the web server

1.5.6.2.1. Any requests made from the page to a domain other than the one that served the page are called cross-origin HTTP requests

1.5.6.2.2. For security reasons, web browsers will not allow this type of action unless it is processed correctly using CORS

1.5.6.2.3. Under CORS, the web browser first makes a pre-flight request to the second web server to check which origins, methods and headers it will accept

1.5.6.3. Without CORS settings enabled on the storage account service to define allowed origins (domains), methods and headers, all cross-origin HTTP requests will be refused by the storage account service
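
A sketch of enabling a CORS rule on the Blob service with the Python azure-storage-blob SDK (the origin, header and connection-string values are assumptions for illustration):

    from azure.storage.blob import BlobServiceClient, CorsRule

    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Allow one web app origin to issue cross-origin GET requests for blobs
    rule = CorsRule(
        allowed_origins=["https://www.contoso.com"],
        allowed_methods=["GET"],
        allowed_headers=["x-ms-*"],
        exposed_headers=["*"],
        max_age_in_seconds=3600,  # how long the browser may cache the pre-flight response
    )
    service.set_service_properties(cors=[rule])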

2. Azure Cosmos DB overview

2.1. Azure Cosmos DB is Microsoft's NoSQL database

2.1.1. Azure Cosmos DB is a fully managed platform-as-a-service (PaaS)

2.1.2. To begin using Azure Cosmos DB, you first create an Azure Cosmos account in your Azure subscription, then create databases, containers and items under it

2.1.3. The Azure Cosmos account is the fundamental unit of global distribution and high availability

2.1.4. Your Azure Cosmos account has a unique DNS name, and you can manage the account by using the Azure portal, the Azure CLI or the language-specific SDKs

2.2. Key concepts to understand:

2.2.1. Global distribution and multi-homing

2.2.1.1. To globally distribute your data and throughput across multiple Azure regions, you can add Azure regions to (and remove them from) your account at any time

2.2.1.2. You can configure your account to have either a single write region or multiple write regions

2.2.1.3. Your application is aware of the nearest region and can send requests to that region

2.2.1.3.1. The nearest region is identified without any configuration changes

2.2.1.3.2. When a region is added or removed, the connection string stays the same

2.2.1.4. How multi-homing APIs work

2.2.1.4.1. Imagine that we have added two regions to our Cosmos DB account: West US and East US

2.2.1.4.2. We've deployed two Azure App services for our website, one in the West US region and the other in the East US region

2.2.1.4.3. To handle routing for our client requests, we add a Traffic Manager

2.2.1.4.4. When a client browser request to our website is made from LA, the Traffic Manager handles this and knows where that client request originated from, automatically directing the request to the App service in West US

2.2.1.4.5. The App service in West US sends its Cosmos DB data operations via the multi-homing API, which automatically routes them to the nearest Cosmos DB region (West US in this case)

2.2.2. Data consistency levels

2.2.2.1. Data Consistency concept

2.2.2.1.1. Imagine that an application user in West US updates a customer record (changes their credit score in this example)

2.2.2.1.2. Suppose, for this example, that global replication between the Cosmos DB regions takes 5 seconds

2.2.2.1.3. Imagine a second user in East US initiates a read query on the same customer record just 2 seconds after it was updated by the 1st user

2.2.2.1.4. The question is: what credit score should the 2nd user see?

2.2.2.2. Most distributed databases ask you to choose between two extreme consistency levels:

2.2.2.2.1. strong

2.2.2.2.2. eventual

2.2.2.3. Cosmos DB offers 5 consistency levels, including the two extremes: strong, bounded staleness, session, consistent prefix and eventual

2.2.2.3.1. The concept of consistent prefix can be understood in the context of multiple writes to a single record: readers never see those writes out of order

2.2.2.3.2. For bounded staleness, you set an acceptable staleness window as a time interval (say, 10 seconds) or a number of versions by which reads may lag behind writes

2.2.2.3.3. Session is the default consistency level

2.2.3. Time-to-live (TTL)

2.2.3.1. The idea of TTL is to set an expiry time on Cosmos DB data items so that these items are automatically purged from the database when they expire

2.2.3.1.1. TTL is set in seconds

2.2.3.1.2. TTL is counted from the last modified time

2.2.3.2. In Azure Cosmos DB, you can configure Time to Live (TTL) at the container level and then override it at the item level

2.2.3.2.1. You can configure TTL for a container by using Azure portal or the language-specific SDKs

2.2.3.2.2. Item level TTL overrides can be configured by using the SDKs
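
A minimal sketch with the Python azure-cosmos SDK, assuming hypothetical database and container names: the container TTL is set via default_ttl, and a single item overrides it with its own ttl property:

    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("<account-uri>", credential="<account-key>")
    db = client.create_database_if_not_exists("appdb")

    # Container-level TTL: items expire 30 days after their last modified time
    container = db.create_container_if_not_exists(
        id="sessions",
        partition_key=PartitionKey(path="/userId"),
        default_ttl=30 * 24 * 3600,  # seconds
    )

    # Item-level override: this item expires after just 1 hour
    container.upsert_item({"id": "s1", "userId": "u1", "ttl": 3600})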

2.2.4. Data partitioning

2.2.4.1. An Azure Cosmos DB database is a unit of management for a set of containers

2.2.4.1.1. A database consists of a set of schema-agnostic containers

2.2.4.2. A logical partition consists of a set of items that have the same partition key

2.2.4.2.1. For example, in a container that holds data about food nutrition, all items contain a foodGroup property; if foodGroup is the partition key, the items with the value 'Vegetables' form one logical partition

2.2.4.2.2. When new items are added to a container, new logical partitions are transparently created by the system

2.2.4.3. Internally, one or more logical partitions are mapped to a single physical partition

2.2.4.3.1. Physical partitions are distributed internally among several machines

2.2.4.3.2. Throughput for a container is divided evenly among physical partitions

2.2.4.4. Best practices:

2.2.4.4.1. Pick partition key that doesn't result in "hot spots" (i.e. we're looking for values that have a relatively even distribution)

2.2.4.4.2. Choose partition key that has a wide range of values (again, even distribution being an important characteristic)

2.2.4.4.3. Favour partition keys that appear frequently as a filter in queries (see the Python sketch at the end of this section)

2.2.4.4.4. A single logical partition has a limit of 20GB storage (increased from original limit of 10GB)

2.2.4.5. Batch updates can be made against all data in a logical partition as an atomic transaction
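
To illustrate why filter-friendly partition keys matter, here is a hedged Python sketch (account, database and container names are hypothetical) of a query scoped to a single logical partition, which avoids a cross-partition fan-out:

    from azure.cosmos import CosmosClient

    client = CosmosClient("<account-uri>", credential="<account-key>")
    container = client.get_database_client("nutritiondb").get_container_client("foods")

    # The partition_key argument confines the query to one logical partition
    items = container.query_items(
        query="SELECT c.id, c.description FROM c WHERE c.foodGroup = @fg",
        parameters=[{"name": "@fg", "value": "Vegetables"}],
        partition_key="Vegetables",
    )
    for item in items:
        print(item)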

2.2.5. Multiple APIs

2.2.5.1. Cosmos DB provides 5 different APIs

2.2.5.2. SQL API

2.2.5.2.1. This is also known as the core API

2.2.5.2.2. It is the default API and is recommended for all new applications that do not require the graph database model

2.2.5.3. MongoDB API

2.2.5.3.1. This API is the choice when you want to migrate an existing application that uses MongoDB to Cosmos DB

2.2.5.4. Table API

2.2.5.4.1. This API is the choice when you want to migrate an existing application that uses Azure Table Storage to Cosmos DB

2.2.5.5. Cassandra API

2.2.5.5.1. This API is the choice when you want to migrate an existing application that uses Cassandra to Cosmos DB

2.2.5.6. Gremlin API

2.2.5.6.1. This API is the choice when you need a graph database model

2.3. Security

2.3.1. RBAC to grant permission to users, groups or applications

2.3.2. Firewall to limit clients who can access Cosmos DB

2.3.3. Cosmos DB can also be deployed into a VNet and use Network Security Groups (NSGs) to reduce the attack surface

2.3.3.1. This means we can eliminate the public endpoint for Cosmos DB in favour of a private endpoint, and effectively lock down Cosmos DB for private corporate applications

2.3.4. CORS

2.3.4.1. Websites and web applications that need to make calls to Cosmos DB can be whitelisted using CORS

2.3.5. Read-only and read-write keys

2.3.5.1. Like storage account keys but applicable to a Cosmos DB account

2.3.5.2. As suggested by the name, clients with an active read-only key will only be granted read access to Cosmos DB; they will need a read-write key in order to insert, update or delete any data

3. Cosmos DB MongoDB API

3.1. MongoDB is a cross-platform document-oriented database program

3.1.1. It is a NoSQL database

3.1.2. Documents are stored in a JSON-like format (BSON)

3.2. Cosmos DB implements the wire protocols of common NoSQL databases, including MongoDB

3.2.1. This allows existing SDKs, drivers and tools of MongoDB databases to interact with Cosmos DB as if it were a MongoDB database

3.2.2. You can migrate your MongoDB application to Cosmos DB while preserving most of its logic

3.3. Cosmos DB supports MongoDB 3.6 at the time of writing (April 2021)

3.4. In the MongoDB API, containers are referred to as collections (same as for the SQL API)

3.5. Why would you migrate from MongoDB to Cosmos DB?

3.5.1. Get financially backed SLAs for the NoSQL APIs powered by Cosmos DB

3.5.2. Global distribution with multi-master replication

3.5.2.1. Multi-master replication is a method of database replication which allows data to be stored by a group of computers, and updated by any member of the group

3.5.2.1.1. All members are responsive to client data queries

3.5.2.1.2. The multi-master replication system is responsible for propagating the data modifications made by each member to the rest of the group and resolving any conflicts that might arise between concurrent changes made by different members

3.5.3. Elastically scale the provisioned throughput and storage for your Cosmos databases

3.5.4. Pay only for the throughput and storage you need

3.6. Pre-migration steps

3.6.1. Create an Azure Cosmos DB account with the MongoDB API

3.6.2. Estimate the throughput needed for your workload

3.6.2.1. This is measured in Request Units (RUs) per second in Cosmos DB

3.6.2.1.1. RU is an abstraction of physical resources (memory, CPU and IOPS)

3.6.2.1.2. RU is a very similar concept to DTU in Azure SQL Database

3.6.2.2. RU consumption is impacted by several variables or factors

3.6.2.2.1. The size of an item affects the number of RUs consumed to read or write each item

3.6.2.2.2. The frequency of CRUD (create, read, update, delete) operations

3.6.2.2.3. Complexity of queries

3.6.2.3. Microsoft provides a handy capacity calculator for Cosmos DB for estimating your RU requirements and the cost for this
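
Beyond the calculator, you can measure real RU costs per operation. A sketch using the Python azure-cosmos SDK (names are hypothetical; reading the x-ms-request-charge response header via the semi-internal client_connection attribute is a common pattern, not a formal API guarantee):

    from azure.cosmos import CosmosClient

    client = CosmosClient("<account-uri>", credential="<account-key>")
    container = client.get_database_client("appdb").get_container_client("orders")

    container.read_item(item="order-1", partition_key="customer-42")

    # The service reports the RU cost of the last operation in a response header
    charge = container.client_connection.last_response_headers["x-ms-request-charge"]
    print(f"point read cost: {charge} RUs")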

3.6.3. Pick an optimal partition key for your data

3.6.3.1. The idea here is for your choice of partition key to lead to a relatively even distribution of the data in a container across logical partitions

3.6.4. Understand the indexing policy that you can set on your data

3.6.4.1. Azure Cosmos DB indexes all your data fields upon ingestion under the default indexing policy

3.6.4.2. It is recommended to turn off indexing when doing bulk load operations, such as a data migration

3.6.4.2.1. This can be done by changing the indexing policy for the container from consistent to none

3.7. Migration considerations

3.7.1. Azure Database Migration Service (DMS) migrates MongoDB collections with unique indexes

3.7.1.1. Bear in mind that the unique indexes must be created on empty target containers before the migration

3.7.1.1.1. You cannot create unique indexes on containers that already include data

3.7.1.1.2. To retrospectively add a unique index to a container, you will need to create a new target container and perform a bulk load operation from old to new container, then drop old & rename new

3.7.1.2. Azure DMS is a more sophisticated, enterprise-grade service for data migrations compared to the Azure Cosmos DB Data Migration Tool

3.7.1.3. To use DMS, you must first ensure that the Microsoft.DataMigration resource provider is registered for your subscription

3.7.2. After performing the data migration, you should test your application by changing the connection string from the old MongoDB one to the new Cosmos DB connection

3.7.2.1. Note that your application does not need to have any of the Azure Cosmos DB SDK imported; it should simply work using the MongoDB SDK and a change of connection string
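
For example, a Python app using pymongo needs nothing Cosmos-specific; only the connection string changes (shown here in the general Cosmos DB MongoDB API format, with placeholder account name and key):

    import pymongo

    # Same MongoDB driver code as before; only the connection string is new
    client = pymongo.MongoClient(
        "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/"
        "?ssl=true&replicaSet=globaldb&retrywrites=false"
    )
    db = client["appdb"]
    print(db["customers"].find_one({"lastName": "Andersen"}))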

4. Cosmos DB Gremlin (Graph) API

4.1. What is a graph data model and why do we need it?

4.1.1. Real world data is naturally connected

4.1.2. Traditional data modelling focuses on entities not relationships

4.1.3. For many applications, there is a need to model both entities and relationships

4.1.3.1. For example, most large/successful social networks use graph databases at their core

4.1.4. Graph databases persist relationships in the storage layer

4.1.4.1. This leads to highly efficient graph retrieval operations

4.1.5. Graph databases are included in the NoSQL or non-relational category because there is no dependency on a schema or constrained data model

4.1.6. There are different types of graph database

4.1.6.1. Cosmos DB supports Property Graphs via its Gremlin API

4.1.6.1.1. A Property Graph is a structure composed of vertices and edges

4.1.6.1.2. Both vertex and edge can have properties

4.1.6.1.3. A simple example of a property graph: the relationships between two people, a cellphone and a laptop

4.1.6.1.4. Vertices: represent discrete entities, such as a person or a device

4.1.6.1.5. Edges: represent relationships between vertices, such as 'knows' or 'owns'

4.1.6.1.6. Properties: key-value attributes that can be attached to both vertices and edges

4.1.7. Use cases

4.1.7.1. Social networks

4.1.7.2. Recommendation engines

4.1.7.3. Geospatial (maps)

4.1.7.4. Internet of Things

4.2. Querying graph databases

4.2.1. The querying process for a graph database is often referred to as traversing graphs

4.2.2. Gremlin, part of the Apache TinkerPop project, is a graph traversal language

4.2.2.1. Gremlin is to graph databases what SQL is to relational databases

4.2.2.2. We use Gremlin to add, update, delete and query graph items

4.3. Cosmos DB Gremlin API is a fully managed graph database

4.3.1. By choosing Cosmos DB as the platform for your graph database, you get the benefit of elastic scaling for throughput and storage, multi-homing, automatic indexing and tunable consistency levels

4.4. In the Gremlin API, containers are referred to as graphs

4.5. Queries can be made using the Azure Cosmos DB SDK in a language of your choice (e.g. Python)

4.5.1. Alternatively, we can use an application called the Gremlin Console or use the Azure Portal (Data Explorer)

4.5.1.1. Example Gremlin queries:

4.5.1.1.1. g.V() (returns all vertices)

4.5.1.1.2. g.V().count() (counts all vertices)

4.5.1.1.3. g.E() (returns all edges)

4.5.1.1.4. g.V('john') (returns the vertex with id 'john')

4.5.1.1.5. g.V('john').out('knows').hasLabel('person') (returns the person vertices that 'john' knows)

4.5.2. Note: when creating new vertices, it is mandatory to pass a non-null value for whatever property you originally choose as the partition key
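
A hedged sketch of submitting Gremlin queries from Python with the gremlinpython driver, following the pattern in Microsoft's quickstarts (account, database, graph and partition key names are assumptions):

    from gremlin_python.driver import client, serializer

    gremlin = client.Client(
        "wss://<account>.gremlin.cosmos.azure.com:443/",
        "g",
        username="/dbs/graphdb/colls/people",  # /dbs/<database>/colls/<graph>
        password="<account-key>",
        message_serializer=serializer.GraphSONSerializersV2d0(),
    )

    # New vertices must set the partition key property (assumed here to be /partitionKey)
    query = ("g.addV('person').property('id','mary')"
             ".property('partitionKey','mary').property('age', 39)")
    print(gremlin.submit(query).all().result())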

4.6. It is also possible to query your Cosmos DB Gremlin API using SQL

5. What is NoSQL?

5.1. NoSQL is a broad category for data persistence layers (databases) that are not formed by tabular, relational data structures

5.2. NoSQL databases are characterized by having no fixed schema

5.2.1. This increases the flexibility for data storage and speeds up the process of capturing data to make it available for querying in a persistence layer, making it well suited as the basis for modern big data applications

5.2.2. Responsibility for data quality and integrity rests with the big data application developer because the NoSQL database will not ensure it on its own

5.3. NoSQL databases lend themselves to rapid, virtually unlimited horizontal scaling and very fast query performance

5.3.1. Horizontal scaling is much easier than for relational SQL databases and can be done using relatively cheap, commodity hardware

5.4. Primary NoSQL data structures:

5.4.1. Key-value

5.4.1.1. Key-value stores pair keys and values using a hash table

5.4.2. Columnar

5.4.2.1. a.k.a. wide-column or column-family; columnar databases efficiently store sparse data across rows and are advantageous when querying specific columns in the database

5.4.3. Document

5.4.3.1. Document databases extend the concept of the key-value database by organising entire documents into groups called collections

5.4.3.2. They support nested key-value pairs and allow queries on any attribute within a document

5.4.3.3. JSON is a common example of the document format

5.4.4. Graph

5.4.4.1. Graph databases use a model based on nodes and edges to represent interconnected data

5.4.4.2. Offer simplified storage and navigation through complex relationships

5.5. It is a common misunderstanding that NoSQL databases are only for storing non-relational data

5.5.1. NoSQL can be a very good choice for storing relational, tabular data

5.5.2. The key drivers for choosing NoSQL arise when the workload calls for:

5.5.2.1. Cheaper storage

5.5.2.2. Massive scale

5.5.2.3. Low latency

5.5.2.4. Global distribution

5.6. Despite the name, NoSQL databases often support SQL queries

6. Azure and NoSQL

6.1. As part of its NoSQL PaaS offering, Azure provides two key services:

6.1.1. Azure Storage

6.1.1.1. This provides the cheap, distributed storage for all types of data

6.1.1.2. In addition, Azure Data Lake Gen2 is built on top of Azure Storage

6.1.2. Azure Cosmos DB

6.1.2.1. Previously known as Azure DocumentDB

6.1.2.2. A fully managed NoSQL database service for modern app development, providing multiple APIs

6.1.2.2.1. For storing JSON data, we should choose the SQL API or the MongoDB API

6.1.2.2.2. For storing key-value pairs, we should choose the Table API

6.1.2.2.3. For storing data as wide columns, we should choose the Cassandra API

6.1.2.2.4. For storing data as graphs, we should use the Gremlin API

7. Cosmos DB Table API

7.1. Why would you migrate from Azure Tables to Cosmos DB Table API?

7.1.1. Before you create a new Azure Table storage account, Microsoft lets you know that the premium service for Azure Tables is actually the Azure Cosmos DB Table API

7.1.2. Cosmos DB Table API offers the following that you don't get with Azure Tables

7.1.2.1. Global distribution

7.1.2.2. Dedicated throughput

7.1.2.3. Single-digit ms latencies

7.1.2.4. Guaranteed high availability with SLAs

7.1.2.5. Automatic secondary indexing

7.1.2.5.1. With Azure Tables you only get an index on PartitionKey + RowKey

7.1.2.6. Multiple consistency level options

7.1.2.6.1. Azure Tables provides only strong consistency for a primary region and eventual for a secondary region

7.2. In the Table API, containers are referred to as tables

7.3. Queries can be made using the REST API that references the endpoint, the table name and the combination of PartitionKey + RowKey

7.3.1. https://<mytableendpoint>/People(PartitionKey='Harp',RowKey='Walter')

7.3.2. Note that PartitionKey + RowKey uniquely identifies entities

7.3.2.1. Remember that entities in this context are logically equivalent to rows in a SQL table

7.3.3. The OData syntax is supported for more advanced queries

7.3.3.1. https://<mytableapi-endpoint>/People()?$filter=PartitionKey%20eq%20'Smith'%20and%20Email%20eq%20'[email protected]'

7.3.3.2. OData stands for the Open Data Protocol, which is an open standard that defines best practice for building and consuming RESTful APIs

7.4. The other way we can interact with our Cosmos DB is via the Azure SDK of the language of our choice (e.g. Python)
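
A brief sketch with the Python azure-data-tables SDK, which talks to both Azure Table storage and the Cosmos DB Table API (connection string, table and entity values are placeholders):

    from azure.data.tables import TableClient

    table = TableClient.from_connection_string("<connection-string>", table_name="People")

    # Point read: PartitionKey + RowKey uniquely identifies an entity
    entity = table.get_entity(partition_key="Harp", row_key="Walter")
    print(entity)

    # OData-style filter query, mirroring the REST examples above
    for e in table.query_entities("PartitionKey eq 'Smith'"):
        print(e)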

7.5. To migrate Azure Table data into Azure Cosmos DB Table API, we have a couple of options

7.5.1. We can use the Azure Cosmos DB Data Migration Tool

7.5.2. We can use AzCopy

7.6. After performing the data migration, you should test your application by changing the connection string from the old Azure Tables one to the new Cosmos DB connection

7.6.1. Note that your application does not need to have any of the Azure Cosmos DB SDK imported; it should simply work using the Azure Tables SDK and a change of connection string

8. Cosmos DB Cassandra API

8.1. Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database

8.2. Wide column store

8.2.1. A wide column store is a type of NoSQL database

8.2.2. Unlike a relational database, the names and format of the columns can vary from row to row in the same table

8.2.3. Other examples of wide column store databases are Apache HBase and Azure Tables

8.3. Cosmos DB Cassandra API can be used as the data store for apps written for Apache Cassandra

8.4. Existing Cassandra applications using CQLv4-compliant drivers can communicate with the Cosmos DB Cassandra API

8.5. Switch from Apache Cassandra to Cosmos DB Cassandra API just by updating the connection string

8.6. We can interact with the Cosmos DB Cassandra API using CQL inside the hosted shell in the Cosmos DB Data Explorer (not to be confused with the Azure Data Explorer service)

8.6.1. CQL = Cassandra Query Language

8.6.2. Those who come from a Cassandra background can use the Cassandra-based tools and Cassandra client drivers they are likely already familiar with

8.6.3. We can also use the Azure Cosmos DB Cassandra API SDK in multiple languages (including .NET and Python)

8.7. In the Cassandra API, containers are referred to as tables

8.7.1. All tables must be created inside a keyspace, which is akin to a namespace

8.7.1.1. Think of a keyspace as being similar to a SQL Server database schema

8.7.1.2. Throughput can be defined at the keyspace level
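
A hedged Python sketch using the open-source cassandra-driver against the Cosmos DB Cassandra API endpoint (account name, keyspace and table are assumptions; Cosmos DB listens on port 10350 and requires TLS):

    import ssl
    from cassandra.cluster import Cluster
    from cassandra.auth import PlainTextAuthProvider

    ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ssl_context.check_hostname = False
    ssl_context.verify_mode = ssl.CERT_NONE  # quick sketch only; verify certificates in production

    auth = PlainTextAuthProvider(username="<account>", password="<account-key>")
    cluster = Cluster(["<account>.cassandra.cosmos.azure.com"], port=10350,
                      auth_provider=auth, ssl_context=ssl_context)
    session = cluster.connect()

    # Tables live inside a keyspace (throughput can be provisioned at this level)
    session.execute("CREATE KEYSPACE IF NOT EXISTS shop WITH replication = "
                    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
    session.execute("CREATE TABLE IF NOT EXISTS shop.orders (id int PRIMARY KEY, total float)")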

9. Azure Data Lake Storage Gen2

9.1. ADLS is a hyper-scale repository for big data analytic workloads

9.1.1. Capture data of any size, type and ingestion speed in one single place for analytics

9.1.2. Can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs

9.1.3. Is tuned for performance for data analytics scenarios

9.1.4. Includes enterprise-grade capabilities: security, manageability, scalability, reliability and high availability

9.2. ADLS is in its 2nd generation

9.2.1. Gen2 converges the capabilities of Azure Blobs and Azure Data Lake Storage Gen1

9.2.1.1. Gen1 features such as file system semantics, file level security and scale

9.2.1.2. Low-cost, high availability and disaster recovery from Azure Blob storage

9.2.2. ADLS Gen2 is built on top of Azure Blob storage

9.3. ADLS Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure

9.4. ADLS Gen2 supports hundreds of gigabits of throughput, and allows you to manage massive amounts of data

9.5. ADLS Gen2 enables a hierarchical namespace for Blob storage

9.5.1. Paths use slashes to form a directory structure; with the hierarchical namespace enabled these are real directories, rather than a naming convention that merely mimics one

9.6. ADLS Gen2 is Hadoop compatible, which means you can manage data in the same way as in Hadoop HDFS

9.7. ADLS Gen2 supports POSIX permissions and ACL permissions

9.7.1. Both POSIX and ACL are standards for setting file system permissions

9.7.1.1. POSIX comes from the UNIX world

9.7.1.2. ACL comes from the Windows world

9.7.2. POSIX permissions allow you to set permissions only for the Owner, one Group, and Others

9.7.3. ACLs give you the additional option to set permissions for multiple individuals and multiple groups for a shared item

9.7.3.1. ACLs also have more types of permissions
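
A sketch of setting POSIX-style permissions plus an extra named-user ACL entry on a directory with the Python azure-storage-file-datalake SDK (account, filesystem, directory and object-id values are placeholders):

    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential="<account-key>",
    )
    fs = service.get_file_system_client("datalake")

    # Create a directory, then grant owner/group/other POSIX bits plus one named user
    directory = fs.create_directory("raw/sales")
    directory.set_access_control(acl="user::rwx,group::r-x,other::---,user:<object-id>:r-x")
    print(directory.get_access_control())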

9.8. ADLS Gen2 is cost-effective because it offers low-cost storage capacity and transactions

9.9. We can use multiple tools for ADLS Gen2 file system operations

9.9.1. Azure Storage Explorer

9.9.1.1. There is a desktop version and also a version integrated into Azure Portal for the storage account

9.9.2. AzCopy

9.9.2.1. Command line tool

9.9.2.2. Example: copy a file from your local Windows file system to an ADLS Gen2 account using AzCopy

9.9.2.2.1. set ACCOUNT_NAME=<your_account_name>

9.9.2.2.2. set ACCOUNT_KEY=<your_Account_key>

9.9.2.2.3. azcopy.exe cp "<local path to file(s)>" "https://<storage_account_name>.dfs.core.windows.net/<your_filesystem_name>/<subfolder_name>" --recursive=true

9.9.2.2.4. Note: azcopy.exe cp command option --recursive enables copying of any local subfolders to ADLS Gen2

9.9.3. Azure Data Factory

9.9.4. REST API

9.9.5. DistCP tool

9.9.5.1. This is a command line tool that's part of the Apache Hadoop ecosystem

10. Cosmos DB SQL (Core) API

10.1. The SQL API realises containers as collections, and the items of the collections are documents in JSON format

10.2. One way we can interact with our Cosmos DB is using Data Explorer in the Azure Portal

10.2.1. We can pop out Data Explorer to full screen by clicking the icon in the top right (in between Settings and Feedback)

10.2.2. We can query collections using SQL

10.2.2.1. This can be a simple SELECT with a WHERE clause

10.2.2.1.1. SELECT * FROM Families f WHERE f.lastName = 'Andersen'

10.2.2.2. To reference data points contained within nested JSON structures, we use dot notation (e.g. address.state)

10.2.2.2.1. SELECT * FROM Families f WHERE f.address.state = 'NY'

10.2.2.3. We can also return new JSON structures by expressing a column as JSON, via the syntax: SELECT {"alias1":fieldX, "alias2":fieldY}

10.2.2.3.1. SELECT {"id":f.id, "city":f.address.city} AS family FROM Families f

10.2.2.4. Support for the GROUP BY clause was added in 2019, so we can do counts, such as:

10.2.2.4.1. SELECT COUNT(1) AS Families, c.address.city FROM c GROUP BY c.address.city

10.3. The other way we can interact with our Cosmos DB is via the Azure SDK of the language of our choice (e.g. Python)
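
The same SQL shown above can be issued through the Python SDK; a minimal sketch with hypothetical database and container names (cross-partition enumeration is opted into explicitly):

    from azure.cosmos import CosmosClient

    client = CosmosClient("<account-uri>", credential="<account-key>")
    container = client.get_database_client("familydb").get_container_client("Families")

    results = container.query_items(
        query="SELECT * FROM Families f WHERE f.address.state = 'NY'",
        enable_cross_partition_query=True,  # the filter is not on the partition key
    )
    for family in results:
        print(family["id"], family["address"]["city"])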

10.4. When creating a new container, you have the option of adding a unique key

10.4.1. Although the Portal allows you to add a unique key, this option is currently (April 2021) only available at the moment of creating a container; viewing the unique key constraint afterwards can be done using the SDK, but adding one after creation is not supported (you would need to create a new container and migrate the data)

10.4.2. The unique key imposes a constraint at the container + partition level

10.4.2.1. Note: the designated unique key can have repeating values within a container, as long as each value appears at most once per logical partition
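
A sketch of declaring a unique key at container-creation time with the Python SDK (database, container, partition key and unique key path are assumptions):

    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("<account-uri>", credential="<account-key>")
    db = client.create_database_if_not_exists("appdb")

    # The unique key policy can only be supplied when the container is created
    container = db.create_container_if_not_exists(
        id="users",
        partition_key=PartitionKey(path="/tenantId"),
        unique_key_policy={"uniqueKeys": [{"paths": ["/email"]}]},
    )
    # Two items may share an /email value only if they live in different
    # logical partitions (different /tenantId values)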

10.5. Microsoft provides a free tool called the Azure Cosmos DB Data Migration Tool

10.5.1. This tool is designed for small data migrations

10.5.1.1. For larger data migrations, you may want to consider Azure Data Factory amongst other possible options, depending on your scenario

10.5.2. The tool comes in two flavours, one GUI-based and the other is command line

10.5.3. The tool runs on Windows and requires Microsoft .NET Framework 4.5.1 or higher

10.5.4. The tool supports migrations into Cosmos DB for all of the following data sources:

10.5.4.1. JSON files, MongoDB, SQL Server, CSV files, Azure Table storage, Amazon DynamoDB, HBase and Azure Cosmos containers

11. Azure Data Explorer

11.1. Azure Data Explorer is a fast and highly scalable data exploration service for log and telemetry data

11.1.1. It's a good choice if you have terabytes of logs and telemetry data to explore

11.1.2. It's also a good choice if you need to ingest and explore data from a mixture of Azure Event Hub, IoT Hub and Blob Storage

11.2. Using Azure Data Explorer we can:

11.2.1. Ingest data from IoT devices, websites, logs and more

11.2.2. Store this data in a highly scalable database

11.2.3. Query the data using Kusto (KQL)

11.2.3.1. This is the same query language used with Azure Log Analytics

11.2.4. Explore structured and unstructured data

11.3. Note: Azure Data Explorer is not the same thing as Cosmos DB Data Explorer

11.4. Azure Data Explorer workflow

11.4.1. 3 data sources are supported (all Azure services):

11.4.1.1. IoT Hub

11.4.1.2. Event Hub

11.4.1.3. Blob storage

11.4.2. You provision a cluster, which is fully managed

11.4.2.1. Inside each cluster, you create 1 or more databases

11.4.3. You ingest data from any one of the 3 supported data sources into the database(s)

11.4.4. Analytic queries are made by users or applications using the Kusto query language

11.4.5. Outbound integration to Azure is supported

11.4.5.1. For example, we can consume data from an Azure Data Explorer (ADX) database into Azure Logic Apps and Data Factory

11.4.6. We can also visualise data in an ADX database using Power BI

11.5. ADX clusters can be provisioned via the Azure Portal and also programmatically via a number of options:

11.5.1. Azure CLI

11.5.2. PowerShell

11.5.3. SDK (e.g. .NET, Python)

11.5.4. ARM templates

11.6. Kusto Query Language (KQL) is a querying language similar to SQL but different enough to require some dedicated learning time

11.6.1. Kusto operators:

11.6.1.1. Syntax is a data source (usually a table name), optionally followed by one or more pairs of the pipe character and some tabular operator

11.6.1.2. Count

11.6.1.2.1. StormEvents | count

11.6.1.3. Filter

11.6.1.3.1. StormEvents | where StartTime > datetime(2007-02-01) and StartTime < datetime(2007-03-01) | where EventType == 'Flood' and State == 'CALIFORNIA' | project StartTime, EndTime , State , EventType , EpisodeNarrative

11.6.1.4. Take n

11.6.1.4.1. StormEvents | take 5 | project StartTime, EndTime, EventType, State, EventNarrative

11.6.1.5. Sort

11.6.1.5.1. Note: some operators enable asc/desc sort via the keyword by, which means there are two different ways to return sorted results

11.6.1.5.2. StormEvents | top 5 by StartTime desc | project StartTime, EndTime, EventType, State, EventNarrative

11.6.1.5.3. StormEvents | sort by StartTime desc | take 5 | project StartTime, EndTime, EventType, EventNarrative

11.6.1.6. Aggregation (using summarize)

11.6.1.6.1. StormEvents | summarize event_count = count() by State

11.6.1.7. Render results as charts

11.6.1.7.1. StormEvents | summarize event_count=count(), mid = avg(BeginLat) by State | sort by mid | where event_count > 1800 | project State, event_count | render columnchart
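
These queries can also be run programmatically. A hedged sketch with the Python azure-kusto-data package, authenticating via the Azure CLI (cluster URL and database name are placeholders; StormEvents is the sample table used above):

    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://<cluster>.<region>.kusto.windows.net"
    )
    client = KustoClient(kcsb)

    response = client.execute("Samples", "StormEvents | summarize event_count = count() by State")
    for row in response.primary_results[0]:
        print(row["State"], row["event_count"])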

11.7. ADX has a dedicated Power BI connector and also a Microsoft Excel connector

11.7.1. Using these connectors, data can be analysed using visualisations available in those tools

11.7.2. ADX also provides an ODBC and JDBC connector, which enables integrations with third party tools

11.7.2.1. For example, ADX can be integrated with Tableau and Qlik using the ODBC connector

11.8. Integrations with other Azure services (apart from Power BI)

11.8.1. Azure Data Factory

11.8.2. Microsoft Flow (now Power Automate)

11.8.3. Logic Apps

11.8.4. Apache Spark

11.8.5. Azure Databricks

11.8.6. CI/CD using Azure DevOps