Spark by Mind Map: Spark

1. Component

1.1. Spark core

1.2. RDD: non-structural

1.3. SparkSQL, DataFrame, DataSet

1.3.1. DataFrame: loads data into a schema of untyped Row objects; types are checked at execution time

1.3.2. DataSet: loads data into typed objects; types are checked at compile time. Note: using DSL column expressions (e.g. $"col") instead of lambdas reduces serialization/deserialization cost
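The typed/untyped distinction and the serialization cost of lambdas can be sketched as follows; the Flight case class and file path are hypothetical, for illustration only:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type matching the CSV columns.
case class Flight(origin: String, dest: String, delay: Int)

val spark = SparkSession.builder.appName("TypedVsUntyped").getOrCreate()
import spark.implicits._

// DataFrame: untyped Rows -- a misspelled column name fails only at runtime.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/flights.csv") // hypothetical path

// Dataset: typed objects -- field errors are caught at compile time.
val ds = df.as[Flight]

// DSL column expression: Catalyst can operate on Tungsten's binary format
// without materializing Flight objects.
ds.filter($"delay" > 60)

// Lambda: forces deserialization of every row into a Flight object.
ds.filter(f => f.delay > 60)
```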

1.3.3. Spark SQL: SQL-like queries over DataFrames and tables. Underlying engine: the Catalyst optimizer and Project Tungsten optimize the query plan through four phases: Analysis, Logical Optimization (Catalyst), Physical Planning, and Code Generation (Tungsten). SQL Tables and Views: Managed Table, Unmanaged Table, Global View, Session-scoped View
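A minimal sketch of the view and table kinds above (the DataFrame and table names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ViewsDemo").getOrCreate()

// Hypothetical DataFrame to register.
val df = spark.range(5).toDF("id")

// Session-scoped view: visible only within this SparkSession.
df.createOrReplaceTempView("ids_tbl")
spark.sql("SELECT id FROM ids_tbl WHERE id > 2").show()

// Global view: shared across sessions, qualified by the global_temp database.
df.createOrReplaceGlobalTempView("ids_global")
spark.sql("SELECT * FROM global_temp.ids_global").show()

// Managed table: Spark owns both metadata and data; DROP TABLE removes both.
df.write.saveAsTable("ids_managed")
```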

1.3.4. Built-in data sources: Parquet, JSON, CSV, Avro, ORC, image, binary.
Reading into a DataFrame:
val df = spark.read.format("csv")
  .schema(schema)
  .option("header", "true")
  .option("mode", "FAILFAST")
  .option("nullValue", "")
  .load(file)
Reading into a Spark SQL table:
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
USING csv
OPTIONS (
  path "/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*",
  header "true",
  inferSchema "true",
  mode "FAILFAST"
)
Writing a DataFrame to a file:
df.write.format("csv").mode("overwrite").save("/tmp/data/csv/df_csv")

1.3.5. External data sources: PostgreSQL, MySQL, MS SQL Server, Cassandra, MongoDB, Snowflake.
PostgreSQL read:
val jdbcDF1 = spark
  .read
  .format("jdbc")
  .option("url", "jdbc:postgresql:[DBSERVER]")
  .option("dbtable", "[SCHEMA].[TABLENAME]")
  .option("user", "[USERNAME]")
  .option("password", "[PASSWORD]")
  .load()
PostgreSQL write:
jdbcDF1
  .write
  .format("jdbc")
  .option("url", "jdbc:postgresql:[DBSERVER]")
  .option("dbtable", "[SCHEMA].[TABLENAME]")
  .option("user", "[USERNAME]")
  .option("password", "[PASSWORD]")
  .save()

1.3.6. Common DataFrame and Spark SQL operations: aggregate functions, collection functions, datetime functions, math functions, miscellaneous functions, non-aggregate functions, sorting functions, string functions
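A short sketch combining a few of these function families (the sales data is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("FunctionsDemo").getOrCreate()
import spark.implicits._

// Hypothetical per-store sales records.
val sales = Seq(("a", 10), ("a", 20), ("b", 5)).toDF("store", "amount")

sales.groupBy("store")
  .agg(
    sum("amount").as("total"),       // aggregate function
    avg("amount").as("avg_amount"))  // aggregate function
  .orderBy(desc("total"))            // sorting function
  .withColumn("store_uc", upper($"store")) // string function
  .show()
```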

1.4. MLlib

1.5. GraphX

1.6. Spark Streaming

1.6.1. Sources: File, Apache Kafka, custom

1.6.2. Transformations: stateless, stateful

1.6.3. Trigger

1.6.4. Sink
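The four streaming concepts above (source, transformation, trigger, sink) fit together as in this minimal Structured Streaming sketch; the socket host/port are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("StreamDemo").getOrCreate()
import spark.implicits._

// Source: a stream of text lines from a socket (hypothetical host/port).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Stateful transformation: a running word count keeps state across batches.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Sink + trigger: emit the full counts to the console every 10 seconds.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()
```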

2. Application Concepts

2.1. Application

2.2. SparkSession: the unified entry point to Spark functionality (encapsulates the SparkContext)
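Creating (or reusing) a session is typically the first line of any application; the appName and master values below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("MyApp")     // hypothetical application name
  .master("local[*]")   // run locally using all available cores
  .getOrCreate()        // reuse an existing session if one exists

// The older SparkContext is still reachable through the session.
val sc = spark.sparkContext
```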

2.3. Job: the Spark driver converts the application into one or more jobs, each of which is transformed into a DAG

2.4. Stage: As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel

2.5. Task: each stage is composed of Spark tasks (a unit of execution), which are then federated across the Spark executors; each task maps to a single core and works on a single partition of data
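The job/stage/task hierarchy can be made concrete with a small sketch (partition count and master are chosen for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("JobDemo")
  .master("local[4]") // illustrative local master
  .getOrCreate()

// Two partitions => each stage runs up to two tasks in parallel.
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 2)

// map stays in the same stage; reduceByKey forces a shuffle,
// so Spark splits the DAG into a second stage at this boundary.
val result = rdd.map(n => (n % 3, n)).reduceByKey(_ + _)

// count() is an action: it triggers one job, executed as two stages,
// each running one task per partition.
result.count()
```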

3. Operation

3.1. Transformation: Lazy evaluation

3.1.1. Narrow: no shuffle is needed for the operation (e.g. filter, contains)

3.1.2. Wide: a shuffle is needed for the operation (e.g. groupBy, orderBy, ...)

3.2. Action: calling an action triggers evaluation of all the transformations accumulated so far
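Lazy evaluation plus the narrow/wide split can be seen in one short sketch (the data is invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("LazyDemo").getOrCreate()
import spark.implicits._

val df = (1 to 10).toDF("n")

// Transformations only build up a logical plan -- nothing executes yet.
val narrow = df.filter($"n" % 2 === 0)       // narrow: per-partition, no shuffle
val wide = narrow.groupBy($"n" % 3).count()  // wide: requires a shuffle

// The action triggers evaluation of the whole plan.
wide.show()
```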

4. Architecture

4.1. Spark driver

4.1.1. Spark driver has multiple roles: it communicates with the cluster manager; it requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); and it transforms all the Spark operations into DAG computations

4.2. Cluster manager: responsible for managing and allocating resources

4.2.1. Built-in Standalone

4.2.2. Mesos

4.2.3. YARN

4.2.4. Kubernetes

4.3. Spark executor: executing tasks on the workers

4.4. Deployment mode

4.4.1. Local Driver: runs on a single JVM, e.g. a laptop or single node. Executor: runs on the same JVM as the driver. Cluster manager: runs on the same host

4.4.2. Standalone Driver: can run on any node in the cluster. Executor: each node in the cluster launches its own executor JVM. Cluster manager: can be allocated arbitrarily to any host in the cluster

4.4.3. YARN (client) Driver: runs on a client, not part of the cluster. Executor: runs in YARN NodeManager containers. Cluster manager: YARN's ResourceManager works with YARN's ApplicationMaster to allocate containers on NodeManagers for executors

4.4.4. YARN (cluster) Driver: runs with the YARN ApplicationMaster. Executor: same as YARN (client). Cluster manager: same as YARN (client)

4.4.5. Kubernetes Driver: Runs in a Kubernetes pod Executor: Each worker runs within its own pod Cluster manager: Kubernetes Master