Spark by Mind Map: Spark

1. Component

1.1. Spark core

1.2. RDD: low-level, unstructured API

1.3. SparkSQL, DataFrame, DataSet

1.3.1. DataFrame: loaded into a schema of untyped Row objects; types are checked at runtime

1.3.2. DataSet: loaded into typed objects; types are checked at compile time

1.3.2.1. Note: prefer DSL column expressions ($"col") over lambdas to reduce serialization/deserialization cost
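The untyped/typed distinction above can be sketched in Scala (a minimal sketch; the `Flight` case class, the JSON path, and the column names are illustrative, not from the source):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class, used only to illustrate the typed API
case class Flight(origin: String, dest: String, delay: Int)

val spark = SparkSession.builder.appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// DataFrame: untyped Rows -- a misspelled column fails only at runtime
val df = spark.read.json("/path/to/flights.json")

// Dataset: typed objects -- a misspelled field fails at compile time
val ds = df.as[Flight]

ds.filter($"delay" > 60)         // DSL column expression: stays in Catalyst's optimized form
// ds.filter(f => f.delay > 60)  // lambda: forces deserialization into Flight objects first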

1.3.3. Spark SQL: SQL-like queries

1.3.3.1. Underlying engine: the Catalyst optimizer and Project Tungsten optimize the query plan

1.3.3.1.1. Analysis

1.3.3.1.2. Logical Optimization (Catalyst)

1.3.3.1.3. Physical Optimization (Tungsten)

1.3.3.1.4. Code generation
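The four phases above can be observed directly with `explain`; in Spark 3.x, `explain("extended")` prints the parsed, analyzed, and optimized logical plans plus the physical plan (the query itself is illustrative):

```scala
// Sketch: inspect how Catalyst/Tungsten rewrite a query
val query = spark.range(1000).filter($"id" % 2 === 0).groupBy().count()
query.explain("extended")
// Output sections: Parsed Logical Plan -> Analyzed Logical Plan
//                  -> Optimized Logical Plan (Catalyst) -> Physical Plan (Tungsten)
```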

1.3.3.2. SQL Tables and Views

1.3.3.2.1. Managed Table

1.3.3.2.2. Unmanaged Tables

1.3.3.2.3. Global View

1.3.3.2.4. Session-scope View
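The four kinds above differ in who owns the data and how long the object lives; a SQL sketch via `spark.sql` (table names and the path are illustrative):

```scala
// Managed table: Spark owns metadata AND data -- DROP TABLE deletes the files too
spark.sql("CREATE TABLE managed_flights (date STRING, delay INT) USING parquet")

// Unmanaged (external) table: Spark owns only the metadata
spark.sql("""CREATE TABLE ext_flights (date STRING, delay INT)
             USING csv OPTIONS (path '/data/flights/csv')""")

// Global temp view: visible to all sessions, lives in the global_temp database
spark.sql("CREATE GLOBAL TEMP VIEW all_delays AS SELECT * FROM managed_flights WHERE delay > 0")

// Session-scoped temp view: dropped when the session ends
spark.sql("CREATE TEMP VIEW my_delays AS SELECT * FROM managed_flights WHERE delay > 0")
```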

1.3.4. Built-in data sources

1.3.4.1. parquet

1.3.4.2. json

1.3.4.3. csv

1.3.4.3.1. Reading into a DataFrame: val df = spark.read.format("csv").schema(schema).option("header", "true").option("mode", "FAILFAST").option("nullValue", "").load(file)

1.3.4.3.2. Reading into Spark SQL table: CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl USING csv OPTIONS ( path "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*", header "true", inferSchema "true", mode "FAILFAST" )

1.3.4.3.3. Writing a DataFrame to file: df.write.format("csv").mode("overwrite").save("/tmp/data/csv/df_csv")

1.3.4.4. avro

1.3.4.5. image

1.3.4.6. binary

1.3.4.7. orc

1.3.5. External data sources

1.3.5.1. Postgres

1.3.5.1.1. Read: val jdbcDF1 = spark.read.format("jdbc").option("url", "jdbc:postgresql:[DBSERVER]").option("dbtable", "[SCHEMA].[TABLENAME]").option("user", "[USERNAME]").option("password", "[PASSWORD]").load()

1.3.5.1.2. Write: jdbcDF1.write.format("jdbc").option("url", "jdbc:postgresql:[DBSERVER]").option("dbtable", "[SCHEMA].[TABLENAME]").option("user", "[USERNAME]").option("password", "[PASSWORD]").save()

1.3.5.2. MySQL

1.3.5.3. MS SQL Server

1.3.5.4. Cassandra

1.3.5.5. MongoDB

1.3.5.6. Snowflake

1.3.6. Common DataFrames and Spark SQL Operations

1.3.6.1. Aggregate functions

1.3.6.2. Collection functions

1.3.6.3. Datetime functions

1.3.6.4. Math functions

1.3.6.5. Miscellaneous functions

1.3.6.6. Non-aggregate functions

1.3.6.7. Sorting functions

1.3.6.8. String functions
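Several of the function families above often appear in a single expression; a sketch (the `df` columns `dep_time`, `origin`, and `delay` are illustrative):

```scala
import org.apache.spark.sql.functions._

// Sketch combining datetime, string, aggregate, and sorting functions
df.withColumn("dep_date", to_date($"dep_time", "yyyy-MM-dd"))    // datetime
  .groupBy(upper($"origin").as("origin"))                        // string
  .agg(avg($"delay").as("avg_delay"), count("*").as("flights"))  // aggregate
  .orderBy(desc("avg_delay"))                                    // sorting
```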

1.4. MLlib

1.5. GrapX

1.6. Structured Streaming

1.6.1. Source

1.6.1.1. File

1.6.1.2. Apache Kafka

1.6.1.3. Custom source

1.6.2. Transform Data

1.6.2.1. Stateless

1.6.2.2. Stateful

1.6.3. Trigger

1.6.4. Sink
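The four pieces above (source, transformation, trigger, sink) line up one-to-one in a Structured Streaming query; a sketch with an assumed Kafka broker, topic, and output paths:

```scala
import org.apache.spark.sql.streaming.Trigger

// Source: Kafka (broker address and topic name are illustrative)
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()

// Stateless transform (per-record); a groupBy().count() here would be stateful
val values = lines.selectExpr("CAST(value AS STRING)")

// Trigger + sink
val query = values.writeStream
  .format("parquet")                             // sink
  .option("path", "/tmp/out")
  .option("checkpointLocation", "/tmp/ckpt")
  .trigger(Trigger.ProcessingTime("10 seconds")) // trigger
  .start()
```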

2. Application Concepts

2.1. Application

2.2. SparkSession (entry point; wraps the SparkContext)

2.3. Job: the Spark driver converts the application into one or more jobs, each of which is transformed into a DAG

2.4. Stage: As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel

2.5. Task: each stage consists of Spark tasks (a unit of execution), which are then federated across the Spark executors; each task maps to a single core and works on a single partition of data
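The application/job/stage/task hierarchy above can be seen from a single action: one action submits one job, the shuffle splits it into stages, and each stage runs one task per partition (a sketch; the numbers depend on your configuration and data):

```scala
// 8 partitions -> 8 tasks in the first stage
val data = spark.range(0, 1000, 1, numPartitions = 8)
val byKey = data.groupBy($"id" % 4).count()  // shuffle -> a stage boundary

byKey.count()  // the action submits the job
// In the Spark UI: one job, with a stage boundary at the shuffle,
// and one task per partition within each stage
```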

3. Operation

3.1. Transformation: lazily evaluated

3.1.1. Narrow: no shuffle required (e.g., filter, contains)

3.1.2. Wide: shuffle required (e.g., groupBy, orderBy, ...)

3.2. Action: calling an action triggers evaluation of all recorded transformations
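The distinction above in code: narrow transformations run partition-local, wide ones shuffle, and nothing executes until an action is called (a minimal sketch):

```scala
val nums = spark.range(1, 100)
val even = nums.filter($"id" % 2 === 0)  // narrow: each partition filtered independently, no shuffle
val grouped = even                       // wide: rows with the same key must be co-located -> shuffle
  .groupBy($"id" % 10)
  .count()
grouped.show()                           // action: only now is the whole plan evaluated
```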

4. Architecture

4.1. Spark driver

4.1.1. Spark driver has multiple roles: it communicates with the cluster manager; it requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); and it transforms all the Spark operations into DAG computations

4.2. Cluster manager: responsible for managing and allocating resources

4.2.1. Built-in standalone

4.2.2. Mesos

4.2.3. YARN

4.2.4. Kubernetes

4.3. Spark executor: executes tasks on the worker nodes

4.4. Deployment mode

4.4.1. Local

4.4.1.1. Driver runs in a single JVM, e.g., on a laptop or a single node

4.4.1.2. Executor runs in the same JVM as the driver

4.4.1.3. Cluster manager runs on the same host

4.4.2. Standalone

4.4.2.1. Driver can run on any node in the cluster

4.4.2.2. Each node in the cluster will launch its own executor JVM

4.4.2.3. Cluster manager can be allocated to any host in the cluster

4.4.3. YARN (client)

4.4.3.1. Driver runs on a client, not as part of the cluster

4.4.3.2. Executors run in YARN NodeManager containers

4.4.3.3. Cluster manager: YARN’s Resource Manager works with YARN’s Application Master to allocate the containers on NodeManagers for executors

4.4.4. YARN (cluster)

4.4.4.1. Driver: Runs with the YARN Application Master

4.4.4.2. Executor: same as YARN (client)

4.4.4.3. Cluster manager: same as YARN (client)

4.4.5. Kubernetes

4.4.5.1. Driver: Runs in a Kubernetes pod

4.4.5.2. Executor: Each worker runs within its own pod

4.4.5.3. Cluster manager: Kubernetes Master