ETL Process

ETL vs ELT & Apache Airflow


1. SOURCES
- What is ETL with a clear example - Data Engineering Concepts: https://www.youtube.com/watch?v=wDTzxdShbd8
- What is ETL for Beginners | ETL Non-Technical Explanation: https://www.youtube.com/watch?v=wyn-PkJB3Lk
- What is the difference between ETL and ELT? https://aws.amazon.com/de/compare/the-difference-between-etl-and-elt/
- Apache Airflow quick start: https://airflow.apache.org/docs/apache-airflow/1.10.10/start.html

2. Process: EXTRACT structured and unstructured data from a source into a staging (buffer) area. TRANSFORM the data by cleaning and organising it to improve data quality. LOAD the data into a database, either in batches or all at once (batch vs. stream).
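The three steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the records are made up, and an in-memory SQLite database stands in for the target DB.

```python
import sqlite3

def extract():
    # Hypothetical raw records pulled from a source system into the buffer.
    return [
        {"name": " Alice ", "amount": "100"},
        {"name": "Bob", "amount": None},  # dirty row: missing amount
    ]

def transform(rows):
    # Clean and organise: drop incomplete rows, strip whitespace, cast types.
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append({"name": row["name"].strip(), "amount": int(row["amount"])})
    return cleaned

def load(rows, conn):
    # Load the cleaned batch into the target database.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT name, amount FROM sales").fetchall())
# prints [('Alice', 100)]
```

Only the cleaned, structured rows reach the target table, which is exactly the trade-off discussed under ETL below: the raw (dirty) row is gone after the transform step.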

2.1. Helps avoid extracting, transforming, and loading data every time you need it, so it is time-efficient and keeps data accessible; also, insights cannot be extracted directly from raw transactional data.

2.1.1. ETL

2.1.1.1. Consists of loading only the structured, aggregated, transformed data, because storage is limited. So only this historical data is available for analysis & reporting. *What if you need different data? You must redo the process, but it is difficult to change the automated ETL rules that run periodically. So ELT is the alternative.

2.1.2. ELT

2.1.2.1. Load all the raw data (structured + unstructured) into a data lake (DL), e.g. via the Hadoop ecosystem, then transform the data depending on the need. ETL vs ELT? It depends on business agility & the type of data (volume, velocity, variety).
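For contrast, a minimal ELT sketch using the same made-up records and an in-memory SQLite database as a stand-in for the lake/warehouse: the raw rows are loaded untouched, and the cleaning happens later, as a query run inside the store only when a question is asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# E + L: land the raw records as-is, with no upfront cleaning.
conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [(" Alice ", "100"), ("Bob", None)],  # dirty row kept, unlike in ETL
)
conn.commit()

# T happens later, shaped by the specific question being asked.
rows = conn.execute(
    "SELECT TRIM(name), CAST(amount AS INTEGER) "
    "FROM raw_sales WHERE amount IS NOT NULL"
).fetchall()
print(rows)  # [('Alice', 100)]
```

Because the raw table still holds every row, a different question later just means a different query; there is no periodic transform job to rewrite, which is the agility argument above.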

3. Apache Airflow

3.1. Open-source platform for developing, scheduling, and monitoring *batch*-oriented workflows (a workflow = a sequence of tasks/processes); runs on a cluster. Workflows are configured as Python code (= dynamic pipelines): Airflow evaluates the DAG script and executes the tasks and their dependencies in the defined order and at the set interval. From the interface, you can inspect logs and manage tasks, for example retrying a task in case of failure. > While DAGs describe *how* to run a workflow, Operators determine *what* actually gets done by a task. > Airflow completes work based on the arguments you pass to your operators (= functions/commands). > You can have as many DAGs as you want, each describing an arbitrary number of tasks; in general, each one should correspond to a single logical workflow.
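A minimal DAG sketch in the style of the Airflow 1.10.x quick start linked in the sources (the import path is the 1.10.x one; the `dag_id`, task names, and daily schedule are arbitrary choices for illustration, not from the source):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x path

# Placeholder callables; a real pipeline would put the E/T/L logic here.
def extract():
    print("extract")

def transform():
    print("transform")

def load():
    print("load")

# The DAG describes *how* the workflow runs: order, start date, interval.
dag = DAG(
    dag_id="etl_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
)

# Operators determine *what* each task does; here, each calls a Python function.
extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

# Dependencies: extract runs before transform, transform before load.
extract_task >> transform_task >> load_task
```

Placing this file in Airflow's DAGs folder makes the scheduler pick it up and run the three tasks daily in the declared order; a failed task can then be retried from the web interface.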