Integrations

1. Other Data Warehouses

1.1. ClickHouse - On-prem, installed anywhere

1.1.1. Architecture

1.1.1.1. Control vs abstraction of compute

1.1.1.1.1. No

1.1.1.2. Supported cloud infrastructure

1.1.1.2.1. On-prem, can be installed anywhere

1.1.1.3. Isolated tenancy – option for dedicated resources

1.1.1.3.1. On-prem, hence always single tenant

1.1.1.4. Separation of storage and compute

1.1.1.4.1. On-prem, users need to provision compute

1.1.2. Scalability

1.1.2.1. Elasticity – Scaling for higher concurrency

1.1.2.1.1. On-prem with dedicated clusters, no Elasticity. Scaling is manual and migration to a bigger/smaller cluster is needed.

1.1.2.2. Elasticity – Scaling for larger data volumes and faster queries

1.1.2.2.1. On-prem with dedicated clusters, no Elasticity. Scaling is manual and migration to a bigger/smaller cluster is needed.

1.1.3. Performance

1.1.3.1. Indexes

1.1.3.1.1. Primary indexes, skipping indexes, MergeTree indexes, join indexes

1.1.3.2. Compute tuning

1.1.3.2.1. On-prem, self-managed HW

1.1.3.3. Storage format

1.1.3.3.1. Columnar, supports sorted, compressed, encoded & sparsely indexed files.
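
The sorted, sparsely indexed layout described above can be sketched in a few lines. This is a toy model, not ClickHouse's implementation: it uses a granule size of 4 (ClickHouse's default index_granularity is 8192) and invented data.

```python
# Toy sparse primary index over a sorted column, in the spirit of
# ClickHouse's MergeTree layout. Only the first key of each granule is
# kept in memory; a range query then reads only the granules whose key
# interval overlaps the requested range.

GRANULE = 4  # rows per granule (ClickHouse's default is 8192)

def build_sparse_index(sorted_keys):
    """Record (row offset, first key) for every granule."""
    return [(i, sorted_keys[i]) for i in range(0, len(sorted_keys), GRANULE)]

def granules_for_range(index, n_rows, lo, hi):
    """Row ranges whose granule may contain keys in [lo, hi]."""
    selected = []
    for pos, (start, first_key) in enumerate(index):
        end = index[pos + 1][0] if pos + 1 < len(index) else n_rows
        next_first = index[pos + 1][1] if pos + 1 < len(index) else float("inf")
        # A granule's keys lie in [first_key, next_first); keep it if that
        # interval overlaps the query range.
        if first_key <= hi and next_first > lo:
            selected.append((start, end))
    return selected

keys = list(range(0, 40, 2))          # 20 sorted rows
index = build_sparse_index(keys)
print(granules_for_range(index, len(keys), 10, 13))   # only rows 4..8 are read
```

Because the index stores one entry per granule rather than per row, it stays small enough to sit in memory even for very large tables, which is the point of the sparse design.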

1.1.3.4. Table-level partition & pruning techniques

1.1.3.4.1. Partitioning and Merge Tree Indexes

1.1.3.5. Result cache

1.1.3.5.1. No

1.1.3.6. Warm cache (SSD)

1.1.3.6.1. Yes, at indexed data-range level granularity.

1.1.3.7. Support for Semi-structured data & JSON functions within SQL

1.1.3.7.1. Yes, including Lambda expressions
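
As a hedged illustration of what "JSON functions plus lambda expressions" looks like in practice: JSONExtractString, JSONExtract, and the higher-order arrayMap function are real ClickHouse SQL functions, but the table and column names below (events, payload) are invented.

```python
# A ClickHouse query combining JSON extraction functions with a lambda
# expression, built as a plain string for illustration.
query = """
SELECT
    JSONExtractString(payload, 'user') AS user,
    arrayMap(x -> x * 2, JSONExtract(payload, 'scores', 'Array(Int64)')) AS doubled
FROM events
"""

print(query.strip())
```

The `x -> x * 2` lambda is applied element-wise to the array extracted from the JSON column, all inside a single SQL statement.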

1.1.4. Use Cases

1.1.4.1. Low-latency dashboards

1.1.4.1.1. Sub-second load times at TB scale

1.1.4.2. Enterprise BI

1.1.4.2.1. Limited integrations with Enterprise BI ecosystem tools.

1.1.4.3. Data Apps (Customer-facing, low latency, high concurrency)

1.1.4.3.1. – Sub-second load times at TB scale.

1.1.4.3.2. – Supports hundreds of concurrent queries on a single cluster.

1.1.4.4. Ad hoc

1.1.4.4.1. – Performance is dependent on predefined indexing.

1.1.4.4.2. – Coupled storage/compute means single Ad-Hoc query can easily hog cluster.

1.2. Druid - Installed anywhere

1.2.1. Architecture

1.2.1.1. Control vs abstraction of compute

1.2.1.1.1. No

1.2.1.2. Supported cloud infrastructure

1.2.1.2.1. Can be installed anywhere

1.2.1.3. Isolated tenancy – option for dedicated resources

1.2.1.3.1. Single tenant

1.2.1.4. Separation of storage and compute

1.2.1.4.1. Complex configuration of the compute tier, with multiple role-specific nodes

1.2.1.4.2. Configurable node count

1.2.1.4.3. Configurable compute types (virtual machines or Kubernetes deployment)

1.2.2. Scalability

1.2.2.1. Elasticity – Scaling for higher concurrency

1.2.2.1.1. Concurrency on Druid depends on many factors, including segment size, number of cores, and memory size.

1.2.2.2. Elasticity – Scaling for larger data volumes and faster queries

1.2.2.2.1. Scale-up of nodes requires careful planning and downtime. Addition of new nodes for scale-out is possible.

1.2.3. Performance

1.2.3.1. Indexes

1.2.3.1.1. Compressed bitmap indexes for data access and roll-ups to manage aggregations.
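
The compressed bitmap indexes mentioned above can be sketched with plain integers used as bitsets. This shows only the uncompressed idea; Druid's real indexes use compressed bitmap formats (CONCISE/Roaring), and the row data here is invented.

```python
# Toy bitmap index: one bitmap per dimension value, ANDed together to
# answer multi-dimension filters without scanning rows.

def build_bitmap_index(rows, column):
    index = {}
    for row_id, row in enumerate(rows):
        index.setdefault(row[column], 0)
        index[row[column]] |= 1 << row_id   # set this row's bit
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row ids."""
    row_id, out = 0, []
    while bitmap:
        if bitmap & 1:
            out.append(row_id)
        bitmap >>= 1
        row_id += 1
    return out

rows = [
    {"country": "NL", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "NL", "device": "desktop"},
    {"country": "NL", "device": "mobile"},
]
by_country = build_bitmap_index(rows, "country")
by_device = build_bitmap_index(rows, "device")
# country = 'NL' AND device = 'mobile' is a single bitwise AND:
print(matching_rows(by_country["NL"] & by_device["mobile"]))   # rows 0 and 3
```

The attraction is that conjunctive filters become cheap bitwise operations over the bitmaps rather than row scans.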

1.2.3.2. Compute tuning

1.2.3.2.1. On-prem, self-managed HW. Druid requires infrastructure management and leverages commonly available instance types.

1.2.3.3. Storage format

1.2.3.3.1. Columnar storage format with time based sorting

1.2.3.4. Table-level partition & pruning techniques

1.2.3.4.1. Restrictive time-based partitioning; can also partition on other secondary columns.

1.2.3.5. Result cache

1.2.3.5.1. Ability to support caching on broker (Set to “off” by default)

1.2.3.6. Warm cache (SSD)

1.2.3.6.1. Yes, at much larger segment level granularity

1.2.3.7. Support for Semi-structured data & JSON functions within SQL

1.2.3.7.1. Flattening JSON or translating it to arrays prior to loading is recommended; there is no support for JSON parsing at query runtime.

1.2.4. Use Cases

1.2.4.1. Low-latency dashboards

1.2.4.1.1. Sub-second load times at GB scale

1.2.4.2. Enterprise BI

1.2.4.2.1. Limited integrations with Enterprise ecosystem tools

1.2.4.3. Data Apps (Customer-facing, low latency, high concurrency)

1.2.4.3.1. Sub-second load times at GB scale

1.2.4.3.2. – Ability to support higher concurrency, but requires complex infrastructure management and scaling

1.2.4.4. Ad hoc

1.2.4.4.1. Dependent on pre-defined roll-ups and bitmap indexing

1.3. Firebolt - AWS only

1.3.1. Architecture

1.3.1.1. Control vs abstraction of compute

1.3.1.1.1. Yes

1.3.1.2. Supported cloud infrastructure

1.3.1.2.1. AWS only

1.3.1.3. Isolated tenancy – option for dedicated resources

1.3.1.3.1. – Multi-tenant metadata layer

1.3.1.3.2. – Isolated tenancy for compute & storage per client

1.3.1.4. Separation of storage and compute

1.3.1.4.1. – Configurable cluster size (1-128 nodes)

1.3.1.4.2. – Configurable compute types

1.3.2. Scalability

1.3.2.1. Elasticity – Scaling for higher concurrency

1.3.2.1.1. A single engine can handle hundreds of concurrent queries; adding more engines is manual.

1.3.2.2. Elasticity – Scaling for larger data volumes and faster queries

1.3.2.2.1. Granular cluster resize with node types and number of nodes.

1.3.3. Performance

1.3.3.1. Indexes

1.3.3.1.1. Primary indexes, aggregating indexes, join indexes

1.3.3.2. Compute tuning

1.3.3.2.1. Isolated control over the number of nodes, with the ability to tune for more/less CPU/RAM/SSD on a node

1.3.3.3. Storage format

1.3.3.3.1. Columnar, sorted & compressed & sparsely indexed storage (code named “F3”)

1.3.3.4. Table-level partition & pruning techniques

1.3.3.4.1. User-defined table-level partitions are optional.

1.3.3.4.2. Data is automatically sorted, compressed and indexed into F3 format. Pruning at indexed data-range level, which is significantly smaller than partitions or micro-partitions.

1.3.3.5. Result cache

1.3.3.5.1. No

1.3.3.6. Warm cache (SSD)

1.3.3.6.1. Yes, at indexed data-range level granularity.

1.3.3.7. Support for Semi-structured data & JSON functions within SQL

1.3.3.7.1. Yes, including Lambda expressions

1.3.4. Use Cases

1.3.4.1. Low-latency dashboards

1.3.4.1.1. Sub-second load times at TB scale

1.3.4.2. Enterprise BI

1.3.4.2.1. Newer product with a narrower Enterprise DW feature set

1.3.4.3. Data Apps (Customer-facing, low latency, high concurrency)

1.3.4.3.1. – Sub-second load times at TB scale.

1.3.4.3.2. – Supports hundreds of concurrent queries on a single cluster

1.3.4.4. Ad hoc

1.3.4.4.1. – Performance is dependent on predefined indexing.

1.3.4.4.2. – Decoupled storage/compute architecture allows to spin up ad-hoc resources

1.3.5. Sportscar - Not popular but quick

1.4. Comparisons.pdf

1.5. Apache Hadoop

1.5.1. • Apache Hadoop was the first major solution for distributed storage

1.5.2. • Became synonymous with big data

1.5.3. • Built for local storage, not processing

1.6. What are people looking for?

1.6.1. • New hardware and software for collecting, processing, storing, and analyzing data

1.6.2. • Integrate innovations with current systems

1.6.3. • Multicloud and containers

1.7. Comparisons

1.7.1. Snowflake vs Teradata

1.7.1.1. Teradata vs Snowflake: Mode of Operation

1.7.1.1.1. Teradata relies on proprietary hardware and software components installed on-premises for optimal use of its services; its cloud offering exists but is less widely used than its proprietary hardware and software. Snowflake, by contrast, runs entirely in the cloud: the data, the software, and the SQL client used to access the warehouse are all stored and run on cloud infrastructure. There are no hardware or software installations, configurations, or maintenance; data management, upgrades, and SQL tuning are handled by Snowflake, which uses AWS, Azure, or GCP hardware plus its proprietary layer to manage resources and users.

1.7.1.2. Teradata vs Snowflake: Size and Capacity

1.7.1.2.1. Teradata operates with a fixed size and capacity; if the need arises for more capacity, you will need to purchase additional hardware and upgrade the system, thereby restructuring it.

1.7.1.2.2. Snowflake comes with unlimited storage and compute size, as it offers a cloud service that can be scaled automatically at any time.

1.7.1.3. Teradata vs Snowflake: Indexes

1.7.1.3.1. Teradata uses primary, secondary, and join indexes. On Snowflake, there is no such thing as a secondary or join index.

1.7.1.4. Teradata vs Snowflake: Collection of Statistics

1.7.1.4.1. The collection of statistics on Teradata is done by the user: you have to instruct Teradata to carry out the operation. Snowflake collects the required statistics on its own, without the user having to do anything.

1.7.1.5. Teradata vs Snowflake: Access to Data

1.7.1.5.1. Teradata uses hashing to access data stored within its system, while Snowflake stores data in micro-partitions; within each micro-partition the data is stored in compressed columnar form. Each micro-partition has metadata, and access is gained by looking up that metadata.
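
The metadata lookup described for Snowflake can be sketched as min/max pruning. This is a toy model with invented partition names and ranges, not Snowflake's actual metadata service.

```python
# Toy micro-partition pruning: each partition keeps min/max metadata per
# column, and a range predicate is answered by scanning only partitions
# whose [min, max] interval overlaps the requested range.

def prune(partitions, column, lo, hi):
    return [
        p["name"]
        for p in partitions
        if p["meta"][column]["min"] <= hi and p["meta"][column]["max"] >= lo
    ]

partitions = [
    {"name": "mp1", "meta": {"ts": {"min": 0,  "max": 9}}},
    {"name": "mp2", "meta": {"ts": {"min": 10, "max": 19}}},
    {"name": "mp3", "meta": {"ts": {"min": 20, "max": 29}}},
]
print(prune(partitions, "ts", 12, 22))   # only mp2 and mp3 are scanned
```

The decision is made entirely from metadata, so partitions that cannot match are never read at all.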

1.7.1.6. Teradata vs Snowflake: Workload Management

1.7.1.6.1. Teradata offers sophisticated workload management and partition systems: any virtual partition can use CPU resources that are not needed by other partitions.

1.7.1.6.2. Snowflake uses the concept of a virtual warehouse to separate the workload and manages it for you.

1.7.1.7. Teradata vs Snowflake: Data Distribution

1.7.1.7.1. Teradata is a shared-nothing architecture: each Teradata node works independently, and nodes do not share their disks.

1.7.1.7.2. Snowflake is not a shared-nothing architecture; rather, the computing resources have access to shared data.

1.7.1.8. Teradata vs Snowflake: APIs and Other Access Methods

1.7.1.8.1. Teradata has the following APIs and access methods: .NET Client API, HTTP REST, JDBC, JMS Adapter, ODBC, OLE DB.

1.7.1.8.2. Snowflake has the following APIs and access methods: CLI Client, JDBC, ODBC.

1.7.1.9. Teradata vs Snowflake: Supported Programming Languages

1.7.1.9.1. The following programming languages are supported by Teradata: C, C++, Cobol, Java (JDBC-ODBC), Perl, PL/1, Python, R, and Ruby.

1.7.1.9.2. The following programming languages are supported by Snowflake: JavaScript, Node.js, and Python.

1.7.2. User's comments

1.7.2.1. From my perspective as a data warehouse 'feeder' at Confluent: those who are deeply in a cloud and have workloads that are fairly consistent but not overly complex will tend to just go to Redshift or BQ. Databricks is playing a different game with the delta lake and is way more useful for semi-structured data (which is an increasing field for sure), and Snowflake is popular with those who have a ton of data to crunch with ad-hoc needs. I don't see a lot of Firebolt out in the wild. Ultimately it's like buying a car: do you want 7 seats to carry around a big family, or something really fast with only 2? Firebolt is pretty fair in their comparisons because they want to be the MX5 of data warehouses; they'll leave the SUVs to someone else.

2. Overview

3. Certified Data Warehouses

3.1. Amazon Athena

3.1.1. Athena - AWS

3.1.1.1. Architecture

3.1.1.1.1. Control vs abstraction of compute

3.1.1.1.2. Supported cloud infrastructure

3.1.1.1.3. Isolated tenancy – option for dedicated resources

3.1.1.1.4. Separation of storage and compute

3.1.1.2. Scalability

3.1.1.2.1. Elasticity – Scaling for higher concurrency

3.1.1.2.2. Elasticity – Scaling for larger data volumes and faster queries

3.1.1.3. Performance

3.1.1.3.1. Indexes

3.1.1.3.2. Compute tuning

3.1.1.3.3. Storage format

3.1.1.3.4. Table-level partition & pruning techniques

3.1.1.3.5. Result cache

3.1.1.3.6. Warm cache (SSD)

3.1.1.3.7. Support for Semi-structured data & JSON functions within SQL

3.1.1.4. Use Cases

3.1.1.4.1. Low-latency dashboards

3.1.1.4.2. Enterprise BI

3.1.1.4.3. Data Apps (Customer-facing, low latency, high concurrency)

3.1.1.4.4. Ad hoc

3.2. Amazon Redshift

3.2.1. Redshift - AWS

3.2.1.1. Architecture

3.2.1.1.1. Control vs abstraction of compute

3.2.1.1.2. Supported cloud infrastructure

3.2.1.1.3. Isolated tenancy – option for dedicated resources

3.2.1.1.4. Separation of storage and compute

3.2.1.2. Scalability

3.2.1.2.1. Elasticity – Scaling for higher concurrency

3.2.1.2.2. Elasticity – Scaling for larger data volumes and faster queries

3.2.1.3. Performance

3.2.1.3.1. Indexes

3.2.1.3.2. Compute tuning

3.2.1.3.3. Storage format

3.2.1.3.4. Table-level partition & pruning techniques

3.2.1.3.5. Result cache

3.2.1.3.6. Warm cache (SSD)

3.2.1.3.7. Support for Semi-structured data & JSON functions within SQL

3.2.1.4. Use Cases

3.2.1.4.1. Low-latency dashboards

3.2.1.4.2. Enterprise BI

3.2.1.4.3. Data Apps (Customer-facing, low latency, high concurrency)

3.2.1.4.4. Ad hoc

3.2.1.5. Popular

3.2.2. Amazon Redshift: The first widely popular (and readily available) cloud data warehouse, Amazon Redshift sits on top of Amazon Web Services (AWS) and leverages source connectors to pipe data from raw data sources into relational storage. Redshift’s columnar storage structure and parallel processing make it ideal for analytic workloads.

3.3. Snowflake

3.3.1. Snowflake - AWS, Azure, Google Cloud

3.3.1.1. Snowflake Architecture

3.3.1.1.1. Snowflake’s architecture is a combination of the traditional shared-disk and shared-nothing database architectures. It consists of nodes that access a central data repository, as in a shared-disk architecture, but also has nodes in a cluster where each node stores a portion of the entire data set locally, using MPP to process queries. This combined approach offers the simplicity of a shared-disk architecture with the performance and benefits of a shared-nothing architecture.
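
The combination described above (a central repository all nodes can read, with each node processing its own slice) can be sketched as a toy MPP aggregation. This is an illustration of the idea only, not Snowflake's engine.

```python
# Toy hybrid of shared-disk and shared-nothing MPP: every node can see the
# central repository, but each node aggregates only its own slice, and the
# coordinator combines the partial results.

CENTRAL_STORE = list(range(100))  # the central data repository

def node_partial_sum(node_id, n_nodes):
    local_slice = CENTRAL_STORE[node_id::n_nodes]  # this node's portion
    return sum(local_slice)

def mpp_sum(n_nodes=4):
    # The coordinator fans the work out and sums the partial results.
    return sum(node_partial_sum(i, n_nodes) for i in range(n_nodes))

print(mpp_sum())   # same answer as a single-node sum over the whole store
```

Whatever the node count, the combined partial results equal the single-node answer, which is what lets compute scale independently of the shared storage.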

3.3.1.2. Architecture

3.3.1.2.1. Control vs abstraction of compute

3.3.1.2.2. Supported cloud infrastructure

3.3.1.2.3. Isolated tenancy – option for dedicated resources

3.3.1.2.4. Separation of storage and compute

3.3.1.3. Scalability

3.3.1.3.1. Elasticity – Scaling for higher concurrency

3.3.1.3.2. Elasticity – Scaling for larger data volumes and faster queries

3.3.1.4. Performance

3.3.1.4.1. Indexes

3.3.1.4.2. Compute tuning

3.3.1.4.3. Storage format

3.3.1.4.4. Table-level partition & pruning techniques

3.3.1.4.5. Result cache

3.3.1.4.6. Warm cache (SSD)

3.3.1.4.7. Support for Semi-structured data & JSON functions within SQL

3.3.1.5. Use Cases

3.3.1.5.1. Low-latency dashboards

3.3.1.5.2. Enterprise BI

3.3.1.5.3. Data Apps (Customer-facing, low latency, high concurrency)

3.3.1.5.4. Ad hoc

3.3.1.6. Popular with those who have a ton of data to crunch with ad-hoc needs.

3.3.2. Snowflake: Unlike Redshift or GCP, which rely on their proprietary clouds to operate, Snowflake’s cloud data warehousing capabilities are powered by AWS, Google, Azure, and other public cloud infrastructure. Unlike Redshift, Snowflake allows users to pay separate fees for compute and storage, making the data warehouse a great option for teams looking for a more flexible pay structure.

3.4. Databricks

3.4.1. Databricks - AWS, Azure, Google Cloud (Datalake)

3.4.1.1. Architecture

3.4.1.1.1. Control vs abstraction of compute

3.4.1.1.2. Supported cloud infrastructure

3.4.1.1.3. Isolated tenancy – option for dedicated resources

3.4.1.1.4. Separation of storage and compute

3.4.1.2. Scalability

3.4.1.2.1. Elasticity – Scaling for higher concurrency

3.4.1.2.2. Elasticity – Scaling for larger data volumes and faster queries

3.4.1.3. Performance

3.4.1.3.1. Indexes

3.4.1.3.2. Compute tuning

3.4.1.3.3. Storage format

3.4.1.3.4. Table-level partition & pruning techniques

3.4.1.3.5. Result cache

3.4.1.3.6. Warm cache (SSD)

3.4.1.3.7. Support for Semi-structured data & JSON functions within SQL

3.4.1.4. Use Cases

3.4.1.4.1. Low-latency dashboards

3.4.1.4.2. Enterprise BI

3.4.1.4.3. Data Apps (Customer-facing, low latency, high concurrency)

3.4.1.4.4. Ad hoc

3.4.1.5. Different from the others: built around a data lake

3.5. Google BigQuery

3.5.1. BigQuery - Google Cloud

3.5.1.1. Architecture

3.5.1.1.1. Control vs abstraction of compute

3.5.1.1.2. Supported cloud infrastructure

3.5.1.1.3. Isolated tenancy – option for dedicated resources

3.5.1.1.4. Separation of storage and compute

3.5.1.2. Scalability

3.5.1.2.1. Elasticity – Scaling for higher concurrency

3.5.1.2.2. Elasticity – Scaling for larger data volumes and faster queries

3.5.1.3. Performance

3.5.1.3.1. Indexes

3.5.1.3.2. Compute tuning

3.5.1.3.3. Storage format

3.5.1.3.4. Table-level partition & pruning techniques

3.5.1.3.5. Result cache

3.5.1.3.6. Warm cache (SSD)

3.5.1.3.7. Support for Semi-structured data & JSON functions within SQL

3.5.1.4. Use Cases

3.5.1.4.1. Low-latency dashboards

3.5.1.4.2. Enterprise BI

3.5.1.4.3. Data Apps (Customer-facing, low latency, high concurrency)

3.5.1.4.4. Ad hoc

3.5.1.5. Popular

4. Certified ETL tools

4.1. Informatica

4.2. SSIS (SQL Server Integration Services)

4.3. IBM DataStage

4.4. Informatica IICS

4.5. Matillion

5. Certified Raw Data Input

5.1. CSV

5.2. Excel

5.3. JSON

5.4. XML

6. Uncertified ETL Tools

6.1. Apache NiFi

6.2. Talend

7. Storage

7.1. S3

7.2. Google Cloud Storage

7.3. Microsoft Azure Blob Storage

7.4. Hadoop HDFS

8. Formats

8.1. JSON

8.2. Apache Parquet

8.3. Apache Avro

8.4. Apache Hudi

8.5. Delta Lake

9. Data Lakehouse Technologies

9.1. Performance SQL

9.1.1. Presto

9.1.2. Apache Spark

9.2. Schema

9.2.1. Parquet

9.3. ACID (Atomicity, Consistency, Isolation, and Durability)

9.3.1. Apache Hudi

9.3.2. Delta Lake

9.3.3. LakeFS

9.4. Managed

9.4.1. Databricks

9.4.2. Amazon Athena

10. Sort-of Certified Drivers

10.1. Amazon Aurora

10.1.1. Amazon Aurora is a relational database service

10.1.2. MySQL compatible

10.1.3. Fundamentally redesigning relational database storage for cloud environments

10.1.4. Aurora provides users with performance metrics, such as query throughput and latency.[12] It provides fast database cloning.[13]

10.1.5. Aurora is available as part of the Amazon Relational Database Service (RDS).

10.2. Amazon DynamoDB

10.2.1. Amazon DynamoDB is a fully managed proprietary NoSQL database service that supports key–value and document data structures and is offered by Amazon.com as part of the Amazon Web Services portfolio.

10.2.2. DynamoDB exposes a similar data model to, and derives its name from, Dynamo, but has a different underlying implementation.

10.2.3. DynamoDB differs from other Amazon services by allowing developers to purchase a service based on throughput, rather than storage. If Auto Scaling is enabled, then the database will scale automatically.[8]
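
The key–value/document model described above can be sketched as an in-memory table. MiniTable and its item layout are invented for illustration; real DynamoDB access goes through the AWS SDK, and this ignores throughput, persistence, and distribution entirely.

```python
# Toy key-value/document table in the DynamoDB style: items are addressed
# by a partition key plus a sort key, and a query returns every item in
# one partition ordered by sort key.

class MiniTable:
    def __init__(self):
        self._items = {}

    def put_item(self, pk, sk, item):
        self._items[(pk, sk)] = item

    def get_item(self, pk, sk):
        return self._items.get((pk, sk))

    def query(self, pk):
        # All items sharing a partition key, ordered by sort key.
        return [v for (p, s), v in sorted(self._items.items()) if p == pk]

orders = MiniTable()
orders.put_item("user#1", "order#2", {"total": 5})
orders.put_item("user#1", "order#1", {"total": 3})
orders.put_item("user#2", "order#1", {"total": 7})
print(orders.query("user#1"))   # both of user#1's orders, sorted by sort key
```

The `pk`/`sk` split mirrors why DynamoDB pricing centers on throughput: every access pattern is a direct key lookup or a single-partition range read.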

10.3. Apache Impala

10.3.1. Hadoop-based

10.3.2. Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.[2] Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012

10.3.3. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.

10.3.4. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.

10.4. Apache Spark

10.4.1. Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

10.4.2. Apache Spark has built-in support for Scala, Java, R, and Python with 3rd party support for the .NET CLR,[30] Julia,[31] and more.

10.4.3. Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics.

10.4.4. Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames, which provides support for structured and semi-structured data.

10.5. Sybase (now part of SAP)

10.5.1. Sybase, Inc. was an enterprise software and services company that produced software to manage and analyze information in relational databases.

10.6. Avro

10.6.1. What???? Apache Avro™ is a data serialization system.

10.6.2. More

10.6.2.1. Avro provides:

10.6.2.1.1. Rich data structures.

10.6.2.1.2. A compact, fast, binary data format.

10.6.2.1.3. A container file, to store persistent data.

10.6.2.1.4. Remote procedure call (RPC).

10.6.2.1.5. Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

10.6.2.2. Schemas

10.6.2.2.1. Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

10.6.2.2.2. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

10.6.2.2.3. When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.

10.6.2.2.4. Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
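
Because Avro schemas are plain JSON, they can be handled with any JSON library. The record schema below is a made-up example; actually serializing data against it would require an Avro implementation (for instance the official avro package or fastavro).

```python
import json

# A made-up Avro record schema, expressed as JSON per the Avro spec:
# a record with a required string field and a nullable int field.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": ["null", "int"], "default": null}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)
print(schema["name"], [f["name"] for f in schema["fields"]])
```

The `["null", "int"]` union is Avro's idiom for an optional field, and since the schema is just data, tooling can inspect or evolve it without any code generation.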

10.7. Azure Cosmos DB

10.7.1. NoSQL(ish) MultiModel Database Service.

10.7.2. Uses lots of APIs

10.7.2.1. E.g.

10.7.2.1.1. API – Internal mapping – Compatibility status and remarks

10.7.2.1.2. SQL (Core) – Containers / Items

10.7.2.1.3. MongoDB – Collections / Documents – Compatible with wire protocol version 6 and server version 3.6 of MongoDB.[5]

10.7.2.1.4. Gremlin – Graphs / Nodes and edges – Compatible with version 3.2 of the Gremlin specification.

10.7.2.1.5. Cassandra – Table / Row – Compatible with version 4 of the Cassandra Query Language (CQL) wire protocol.

10.7.2.1.6. Azure Table Storage – Table / Item

10.7.2.1.7. etcd – Key / Value – Compatible with version 3 of etcd.[6]

10.7.2.2. SQL

10.7.2.2.1. SQL API

10.7.2.2.2. The SQL API lets clients create, update and delete containers and items. Items can be queried with a read-only, JSON-friendly SQL dialect.[7] As Cosmos DB embeds a JavaScript engine, the SQL API also enables:

10.7.2.2.3. Stored procedures. Functions that bundle an arbitrarily complex set of operations and logic into an ACID-compliant transaction. They are isolated from changes made while the stored procedure is executing and either all write operations succeed or they all fail, leaving the database in a consistent state. Stored procedures are executed in a single partition. Therefore, the caller must provide a partition key when calling into a partitioned collection. Stored procedures can be used to make up for the lack of certain functionality. For instance, the lack of aggregation capability is made up for by the implementation of an OLAP cube as a stored procedure in the open sourced documentdb-lumenize[8] project.
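
The all-or-nothing behaviour described above can be sketched as a snapshot-and-commit loop. This only illustrates the ACID property; it is not how the Cosmos DB engine executes JavaScript stored procedures.

```python
import copy

# Apply a batch of write operations against a snapshot; commit only if
# every operation succeeds, otherwise discard all of them.
def run_transaction(db, operations):
    snapshot = copy.deepcopy(db)
    try:
        for op in operations:
            op(snapshot)
    except Exception:
        return db                 # any failure: database unchanged
    db.clear()
    db.update(snapshot)           # success: all writes land together
    return db

db = {"a": 1}
run_transaction(db, [lambda d: d.update(b=2)])
print(db)                         # {'a': 1, 'b': 2}

def failing(d):
    raise ValueError("second write fails")

run_transaction(db, [lambda d: d.update(c=3), failing])
print(db)                         # unchanged: {'a': 1, 'b': 2}
```

The first write of the failed batch never becomes visible, which is exactly the "either all write operations succeed or they all fail" guarantee in the text.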

10.7.2.2.4. Triggers. Functions that get executed before or after specific operations (like on a document insertion for example) that can either alter the operation or cancel it. Triggers are only executed on request.

10.7.2.2.5. User-defined functions (UDF). Functions that can be called from and augment the SQL query language making up for limited SQL features.

10.7.2.2.6. The SQL API is exposed as a REST API, which itself is implemented in various SDKs that are officially supported by Microsoft and available for .NET Framework, .NET,[9] Node.js (JavaScript), Java and Python.
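
The read-only, JSON-friendly SQL dialect mentioned above looks roughly like the query string below; the container alias `c` is the dialect's convention, while the documents and field names are invented. The Python function mimics what such a query returns over plain JSON documents.

```python
# A Cosmos-style query over JSON documents, plus a plain-Python
# equivalent of its WHERE clause. Documents and fields are illustrative.
QUERY = "SELECT * FROM c WHERE c.city = 'Delft'"

def where_city(docs, city):
    return [d for d in docs if d.get("city") == city]

docs = [
    {"id": "1", "city": "Delft"},
    {"id": "2", "city": "Utrecht"},
]
print(where_city(docs, "Delft"))   # only the first document matches
```

The dialect stays read-only: inserts, updates, and deletes go through the container/item operations or stored procedures rather than SQL statements.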

10.8. Microsoft Azure SQL Server Database

10.8.1. Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the database management functions such as upgrading, patching, backups, and monitoring without user involvement. Azure SQL Database is always running on the latest stable version of the SQL Server database engine and patched OS with 99.99% availability.

10.8.2. platform as a service (PaaS)

10.9. Azure Synapse Analytics

10.9.1. Azure Synapse Analytics is a limitless analytics service

10.9.2. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, transform, manage and serve data for immediate BI and machine learning needs.

10.10. Microsoft Azure Table Storage

10.10.1. Stores NoSQL tables, still structured data.

10.10.2. More

10.10.2.1. Azure Table storage stores large amounts of structured data. The service is a NoSQL datastore which accepts authenticated calls from inside and outside the Azure cloud. Azure tables are ideal for storing structured, non-relational data. Common uses of Table storage include:

10.10.2.1.1. Storing TBs of structured data capable of serving web scale applications

10.10.2.1.2. Storing datasets that don't require complex joins, foreign keys, or stored procedures and can be denormalized for fast access

10.10.2.1.3. Quickly querying data using a clustered index

10.10.2.1.4. Accessing data using the OData protocol and LINQ queries with WCF Data Service .NET Libraries

10.10.2.1.5. Azure Table storage is a service that stores non-relational structured data (also known as structured NoSQL data) in the cloud, providing a key/attribute store with a schemaless design. Because Table storage is schemaless, it's easy to adapt your data as the needs of your application evolve. Access to Table storage data is fast and cost-effective for many types of applications, and is typically lower in cost than traditional SQL for similar volumes of data.

10.11. Cassandra

10.12. Couchbase

10.13. Cloudera

10.14. Teradata

10.14.1. Teradata - On prem (plus cloud, not as often used)

10.14.1.1. Teradata has the following features:

10.14.1.1.1. • Unlimited Parallelism: The database system of Teradata is based on Massively Parallel Processing (MPP) architecture. The MPP architecture divides the workload on the system evenly across the system by splitting tasks among its processes and runs them in parallel hence ensuring that each task is completed swiftly. Teradata also uses an optimizer that is designed to be parallel in its function, therefore, enhancing Teradata’s reputation as a parallel processing system.

10.14.1.1.2. • Connectivity: Teradata connects to channel-attached systems like mainframes and network attached-based systems. Teradata also supports the usage of standard SQL to connect to data stored in tables and has several extension capabilities.

10.14.1.1.3. • Shared Nothing Architecture: The type of architecture that Teradata uses is called Shared Nothing Architecture, that is, each Teradata node works independently with its Access Module Processors (AMPs) as they do not share their disks.

10.14.1.1.4. • Scalability: The Teradata system is highly scalable and can be scaled up to about 2048 nodes; the capacity of the system is increased simply by increasing the number of AMPs.

10.14.1.1.5. • Automatic Distribution: Teradata has an automatic distribution system that shares data evenly to the disks without any human interference.

10.14.1.1.6. • Utility: Teradata has a wide range of uses and is suitable for any type of user, be it organizations, enterprises, or private application users. It can handle various tasks such as import and export to and from other database systems.

10.14.1.2. Teradata Architecture

10.14.1.2.1. Teradata acts as a single data store that accepts a large number of concurrent requests from multiple client applications and executes them in parallel along with load distribution among several users. Teradata’s architecture is made up of the following components:

10.14.1.2.2. Access Module Processor (AMP) – It is called a virtual processor and its purpose is to store and retrieve data. When a client wants to store data, the parsing engine sends the records to BYNET which in turn sends the row to the target AMP, then the AMP stores this record on its disk. For retrieval, when a client runs a query to get records, the parsing engine sends a request to BYNET which in turn sends a retrieval message to the AMPs. The AMP searches the disk in parallel to identify the record for forwarding to BYNET and from BYNET the record is sent to the parsing engine and then the user.
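
The store/retrieve path described above can be sketched as hash routing. The hash function and AMP count here are invented; Teradata's actual row-hash algorithm and BYNET messaging are considerably more involved.

```python
import hashlib

# Route rows to AMPs by hashing the primary index value, so both storage
# and retrieval go straight to the owning AMP's local disk.
N_AMPS = 4
amps = [dict() for _ in range(N_AMPS)]   # each AMP's local storage

def amp_for(pi_value):
    digest = hashlib.md5(str(pi_value).encode()).hexdigest()
    return int(digest, 16) % N_AMPS

def store(pi_value, row):
    amps[amp_for(pi_value)][pi_value] = row

def retrieve(pi_value):
    return amps[amp_for(pi_value)].get(pi_value)

store(42, {"name": "example"})
print(retrieve(42))                      # {'name': 'example'}
print(sum(len(a) for a in amps))         # 1: only the owning AMP holds it
```

Because the same hash is computed on both paths, a primary-index lookup never has to ask more than one AMP, which is also why Teradata's automatic distribution needs no human interference.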

10.15. Workday

10.16. denodo

10.16.1. Data Virtualisation

10.17. Exasol

10.18. Google Cloud Spanner

10.19. Greenplum

10.20. Vertica

10.21. Cloudant

10.22. IBM DB2

10.23. Salesforce

10.24. SAP HANA

10.25. ServiceNow

10.26. Kafka

10.27. MariaDB

10.28. MarkLogic

10.29. Microsoft Dynamics 365

10.30. Microsoft SQL Server

10.31. MongoDB

10.32. MySQL

10.33. Netezza

10.34. Oracle

11. Certified BI (Business Intelligence) Integrations

11.1. Tableau

11.2. PowerBI

11.3. Looker

11.4. MicroStrategy

12. Certified ERP/CRM Integrator

12.1. SAP

12.2. Oracle

12.3. Microsoft Dynamics 365

12.4. Salesforce

12.5. PeopleSoft

12.6. JD Edwards

12.7. Siebel

13. Certified Data Lake Storage

13.1. Amazon S3

13.1.1. Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface

13.1.2. Amazon S3 manages data with an object storage architecture[6] which aims to provide scalability, high availability, and low latency with high durability.[3]

13.1.3. Requests are authorized using an access control list associated with each bucket, and buckets support versioning,[10] which is disabled by default.[11] Since buckets are typically the size of an entire file system mount in other systems, this access control scheme is very coarse-grained. In other words, unique access controls cannot be associated with individual files.

13.1.4. Types

13.1.4.1. Amazon S3 Standard is the default. It is general purpose storage for frequently accessed data.

13.1.4.2. Amazon S3 Standard-Infrequent Access (Standard-IA) is designed for less frequently accessed data, such as backups and disaster recovery data.

13.1.4.3. Amazon S3 One Zone-Infrequent Access (One Zone-IA) performs like the Standard-IA, but stores data only in one availability zone.

13.1.4.4. Amazon S3 Intelligent-Tiering moves objects automatically to a more cost-efficient storage class.

13.1.4.5. Amazon S3 on Outposts brings storage to installations not hosted by Amazon.

13.1.4.6. Amazon S3 Glacier Instant Retrieval is a low-cost storage for rarely accessed data, but which still requires rapid retrieval.

13.1.4.7. Amazon S3 Glacier Flexible Retrieval is also a low-cost option for long-lived data; it offers three retrieval speeds, ranging from minutes to hours.

13.1.4.8. Amazon S3 Glacier Deep Archive is the lowest-cost option, intended for long-term archives that are rarely retrieved.
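The classes above trade storage cost against access frequency and retrieval speed. A hypothetical helper makes the trade-off concrete; the thresholds are invented for illustration and are not AWS guidance, though the returned strings match real S3 `StorageClass` API constants.

```python
# Hypothetical chooser among the S3 storage classes described above,
# based on read frequency and required retrieval latency.
# Thresholds are made up for illustration, not AWS sizing guidance.

def pick_storage_class(reads_per_month: float, max_retrieval_hours: float) -> str:
    if reads_per_month >= 1:
        return "STANDARD"              # frequently accessed data
    if max_retrieval_hours < 0.1:
        return "GLACIER_IR"            # Glacier Instant Retrieval
    if max_retrieval_hours <= 12:
        return "GLACIER"               # Glacier Flexible Retrieval
    return "DEEP_ARCHIVE"              # cheapest, slowest to restore

print(pick_storage_class(30, 0))    # STANDARD
print(pick_storage_class(0.1, 24))  # DEEP_ARCHIVE
```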

13.2. Azure Data Lake Storage

13.2.1. Azure Data Lake is a scalable data storage and analytics service.

13.2.2. Azure Data Lake Store - Users can store structured, semi-structured or unstructured data produced from applications including social networks, relational data, sensors, videos, web apps, mobile or desktop devices. A single Azure Data Lake Store account can store trillions of files, where a single file can be greater than a petabyte in size.

13.2.3. Using Data Lake Analytics, users can develop and run parallel data transformation and processing programs in U-SQL, a query language that combines SQL with C#.

13.2.4. It is based on COSMOS, Microsoft's internal big-data storage and analytics platform.

13.3. In the data lake vs data warehouse debate, consider that data lakes are the do-it-yourself version of a data warehouse, allowing data engineering teams to pick and choose the metadata, storage, and compute technologies they want to use depending on the needs of their systems. Common data lake technologies include the metadata, compute, and framework options listed below.
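The pick-and-choose idea can be sketched as a stack assembled from independently chosen layers. The component names below are examples drawn from this document, not recommendations.

```python
# A data lake "stack" assembled from independently chosen layers, as the
# note above describes: metadata, storage, and compute are swappable.
from dataclasses import dataclass

@dataclass
class LakeStack:
    metadata: str   # e.g. Hive Metastore, AWS Glue, Unity Catalog
    storage: str    # e.g. Amazon S3, Azure Data Lake Storage
    compute: str    # e.g. Spark, Presto, Hive

# One possible combination; any layer can be swapped without the others.
stack = LakeStack(metadata="AWS Glue", storage="Amazon S3", compute="Spark")
print(stack)
```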

14. Unknown

14.1. DAX

14.2. Adverity

14.2.1. https://www.adverity.com/

14.3. Snaplogic

14.3.1. SnapLogic is a commercial software company that provides Integration Platform as a Service (iPaaS) tools for connecting Cloud data sources, SaaS applications and on-premises business software applications.

14.3.1.1. SnapLogic's Elastic Integration Platform consists of an Integration Cloud, prebuilt connectors called Snaps and a Snaplex for data processing in the cloud or behind the firewall. The company's products have been referred to as targeting the Internet of Things marketplace for connecting data, applications and devices.

14.3.1.2. The Integration Cloud approaches big data integration through the following tools:

14.3.1.3. Designer: An HTML5-based user interface for specifying and building integration workflows, called pipelines.

14.3.1.4. Manager: Controls and monitors the performance of SnapLogic orchestrations and administers the lifecycle of data and process flows.

14.3.1.5. Dashboards: Provides visibility into the health of integrations, including performance, reliability, and utilization.

14.3.1.6. The Snaplex is a self-upgrading, elastic execution grid that streams data between applications, databases, files, social and big data sources. The Snaplex can run in the cloud, behind the firewall and on Hadoop.

14.4. Azure Data Factory

14.4.1. SQL. Fully managed. Serverless

14.4.1.1. Integrate all of your data with Azure Data Factory – a fully managed, serverless data integration service.

14.4.1.2. Visually integrate data sources with more than 90 built-in, maintenance-free connectors at no added cost.

14.4.1.3. Easily construct ETL and ELT processes code-free in an intuitive environment or write your own code.

14.4.1.4. Then deliver integrated data to Azure Synapse Analytics to unlock business insights.

14.5. Unknown

14.5.1. Trino

14.5.2. Apache Flink

14.5.3. Presto

14.5.4. Hive

14.5.5. Apache Spark

15. What is this

15.1. Databricks has enabled users to add structure and metadata via the Unity Catalog and Delta Lake, while Snowflake introduced Apache Iceberg tables to bring the reliability and simplicity of SQL tables and to make it possible for engines like Apache Spark, Trino, Apache Flink, Presto, and Hive to safely work with the same tables at the same time.

15.2. What!?

15.2.1. They also recently introduced Snowpark Python, a native Python experience with a pandas- and PySpark-like API for data manipulation without the need to write verbose SQL. In the other direction, Spark SQL can help turn languages like Python, R, and Scala into SQL commands.
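The two styles mentioned above can be illustrated with a stdlib-only analogy: the same filter expressed as a raw SQL string and as Python-level expressions. Here `sqlite3` stands in for the warehouse engine; this is not the Snowpark or Spark SQL API itself.

```python
import sqlite3

# Stdlib analogy for the note above. sqlite3 plays the role of the
# warehouse engine; the table and column names are invented examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.5), (2, 120.0), (3, 55.0)])

# SQL style: the logic lives in a SQL string.
sql_rows = conn.execute(
    "SELECT id FROM orders WHERE amount > 50 ORDER BY id").fetchall()

# Dataframe style: the same logic as Python expressions; a library like
# Snowpark would translate such expressions into SQL for you.
all_rows = conn.execute("SELECT id, amount FROM orders").fetchall()
py_rows = sorted(oid for oid, amount in all_rows if amount > 50)

print(sql_rows)  # [(2,), (3,)]
print(py_rows)   # [2, 3]
```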

16. Metadata:

16.1. Hive

16.1.1. Apache Hive

16.1.1.1. Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.

16.1.1.2. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

16.1.1.3. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem and Alluxio.

16.1.1.4. More

16.1.1.4.1. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.

16.1.1.4.2. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.

16.1.1.4.3. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL with schema on read, and transparently converts queries to MapReduce, Apache Tez and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0. Other features of Hive include:

16.1.1.4.4. Different storage types such as plain text, RCFile, HBase, ORC, and others.

16.1.1.4.5. Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution.

16.1.1.4.6. Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.

16.1.1.4.7. Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.

16.1.1.4.8. SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.

16.1.1.4.9. By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases like MySQL can optionally be used.

16.1.1.4.10. The first four file formats supported in Hive were plain text, sequence file, optimized row columnar (ORC) format and RCFile. Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.
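"Schema on read", mentioned above, means the schema is applied when data is queried, not when it is written; raw files stay untyped until read time. A minimal stdlib sketch (column names and types are invented examples, not a Hive API):

```python
import csv
import io

# Minimal sketch of "schema on read": raw text is stored as-is, and a
# schema (column names + types) is applied only at query time, much as
# Hive applies HiveQL table definitions to files already sitting in HDFS.
raw_file = io.StringIO("1,beacon,2.5\n2,sensor,7.25\n")

schema = [("id", int), ("name", str), ("value", float)]

def read_with_schema(f, schema):
    """Apply the schema to each raw row as it is read."""
    for row in csv.reader(f):
        yield {col: cast(cell) for (col, cast), cell in zip(schema, row)}

rows = list(read_with_schema(raw_file, schema))
print(rows[0])  # {'id': 1, 'name': 'beacon', 'value': 2.5}
```

A different schema (say, treating `value` as a string) could be applied to the same raw file without rewriting it, which is the point of the model.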

16.2. AWS Glue

16.3. Databricks

17. Compute

17.1. Apache Pig

17.2. Hive

17.3. Presto

17.4. Spark

18. Framework

18.1. Apache Hadoop

18.2. Apache Spark

18.3. PySpark