Knowledge about databases

1. Database Warehouses

1.1. Questions

1.1.1. What is the difference between a database, a data warehouse, and a data lake? Databases: Oracle, PostgreSQL, MySQL. What is the difference between them, and who are the other players in this space? Data warehouses: Snowflake, Databricks. What is the difference, and who are the other players? Data lakes: AWS, Redshift. What is the difference, and who are the other players?

1.2. • Relational database • Cloud • Data lake • File

1.3. What is a data warehouse?

1.3.1. • A unified place to keep an organization's data sets

1.3.2. • Data is typically well-structured and curated

1.3.3. • May contain discrete organizational units

1.3.4. A data warehouse is like a relational database designed for analytical needs.


1.3.6. Contains historical information which helps in analyzing business metrics.

1.3.7. Data warehouse is mainly used to read data.

1.3.8. End users are business analysts/ data analysts.

1.3.9. It is a central location where consolidated data from multiple locations (databases) is stored.

1.4. Snowflake says

1.5. Diagram

1.6. ACID - The old rules for how data should be stored

1.6.1. ACID vs. BASE

1.6.1.1. ACID

1.6.1.1.1. A - Atomicity

1.6.1.1.2. C - Consistency

1.6.1.1.3. I - Isolation

1.6.1.1.4. D - Durability

1.6.1.2. BASE

1.6.1.2.1. Basically Available, Soft state, Eventual consistency
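
A minimal sketch of atomicity (the "A" in ACID) using Python's built-in sqlite3 module; the table name and account values are illustrative. Either both updates commit together or neither does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# Neither update persisted, so the balances are unchanged.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())
```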

2. A normal database

2.1. • Designed to capture and record data (OLTP)

2.2. • Live, real-time data

2.3. • Data Stored in tables with rows and columns

2.4. • Data is highly detailed

2.5. • Flexible Schema (how the data is organized)

2.6. Contains current information that is required to run the business.

2.7. A database is mainly used to write data.

2.8. End users are ops team members.

3. Unsorted Knowledge

3.1. In the past, with HDFS, everyone would store data in a data lake because it was cheap storage. You would then run Hadoop jobs, or run stream processing systems like Spark, Flink, or Apache Beam. So you had your low-cost storage and you had your processing system. Spark was the winner in the next-generation processing framework space, and Databricks is the company behind Spark. BigQuery and Snowflake have a tiered storage system, tiered by speed, and can even load data into memory. They take care of all of the lower-level plumbing of your data infrastructure: instead of needing a data lake, you just put everything into Snowflake and dynamically change the speed required. Originally, people were doing this with Amazon Redshift. Snowflake is great for companies that aren't using Amazon services but still want great infrastructure.

4. Hashing

4.1. Why do we need Hashing?

4.1.1. Here are the situations in a DBMS where you need to apply the hashing method:

4.1.1.1. • For a huge database structure, it is impractical to search through every level of an index and then follow pointers to the destination data block to reach the desired data.

4.1.1.2. • Hashing is used to index and retrieve items in a database because it is faster to find a specific item using the shorter hashed key than using its original value.

4.1.1.3. • Hashing is an ideal method to calculate the direct location of a data record on disk without using an index structure.

4.1.1.4. • It is also a helpful technique for implementing dictionaries.

4.1.2. Important Terminologies in Hashing

4.1.2.1. • Data bucket – Data buckets are memory locations where the records are stored. They are also known as units of storage.

4.1.2.2. • Key: A DBMS key is an attribute or set of attributes that identifies a row (tuple) in a relation (table). It also lets you find relationships between two tables.

4.1.2.3. • Hash function – A hash function is a mapping function that maps the set of search keys to the addresses where the actual records are placed.

4.1.2.4. • Linear probing – Linear probing uses a fixed interval between probes. In this method, the next available data block is used to store the new record instead of overwriting the older record (see the sketch after this list).

4.1.2.5. • Quadratic probing – It determines the new bucket address by adding successive outputs of a quadratic polynomial to the starting value given by the original hash computation.

4.1.2.6. • Hash index – It is the address of the data block. A hash function can range from a simple mathematical function to a complex one.

4.1.2.7. • Double hashing – Double hashing is a technique used in hash tables to resolve hash collisions by using a second hash function to determine the probe interval.

4.1.2.8. • Bucket overflow – The condition in which a bucket is already full is called a collision (bucket overflow). This is a fatal condition for any static hash function.
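
A minimal sketch tying these terms together: a fixed set of buckets (static hashing), a hash function mapping keys to bucket addresses, and linear probing to resolve collisions. The bucket count, key names, and record values are illustrative.

```python
NUM_BUCKETS = 8

def hash_key(key):
    """Hash function: map a search key to a bucket address."""
    return hash(key) % NUM_BUCKETS

buckets = [None] * NUM_BUCKETS  # each bucket holds one (key, record) pair

def insert(key, record):
    """Insert a record, resolving collisions with linear probing."""
    index = hash_key(key)
    for step in range(NUM_BUCKETS):
        probe = (index + step) % NUM_BUCKETS  # fixed interval of 1 between probes
        if buckets[probe] is None or buckets[probe][0] == key:
            buckets[probe] = (key, record)
            return probe
    raise OverflowError("all buckets full (bucket overflow)")

def lookup(key):
    """Find a record by probing from the hashed address."""
    index = hash_key(key)
    for step in range(NUM_BUCKETS):
        probe = (index + step) % NUM_BUCKETS
        if buckets[probe] is None:
            return None
        if buckets[probe][0] == key:
            return buckets[probe][1]
    return None

insert("emp_1001", {"name": "Alice"})
print(lookup("emp_1001"))
```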

4.2. There are mainly two types of hashing techniques in a DBMS:

4.2.1. 1. Static Hashing

4.2.2. 2. Dynamic Hashing

5. Types of ETL

5.1. Apps & APIs

5.1.1. FastAPI - Web framework for building APIs in Python

5.1.1.1. FastAPI is a modern, high-performance web framework for building APIs with Python, based on standard type hints. It has the following key features (a minimal example follows the list):

5.1.1.2. Fast to run: It offers very high performance, on par with NodeJS and Go, thanks to Starlette and pydantic.

5.1.1.3. Fast to code: It allows for significant increases in development speed.

5.1.1.4. Reduced number of bugs: It reduces the possibility for human-induced errors.

5.1.1.5. Intuitive: It offers great editor support, with completion everywhere and less time debugging.

5.1.1.6. Straightforward: It’s designed to be uncomplicated to use and learn, so you can spend less time reading documentation.

5.1.1.7. Short: It minimizes code duplication.

5.1.1.8. Robust: It provides production-ready code with automatic interactive documentation.

5.1.1.9. Standards-based: It’s based on the open standards for APIs, OpenAPI and JSON Schema.
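
A minimal FastAPI sketch; the model, route names, and file name are illustrative. The type hints drive validation and the auto-generated OpenAPI docs.

```python
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.get("/items/{item_id}")
def read_item(item_id: int, q: Optional[str] = None):
    # item_id is validated as an int; q is an optional query parameter.
    return {"item_id": item_id, "q": q}

@app.post("/items/")
def create_item(item: Item):
    # The request body is parsed and validated against the Item model.
    return {"name": item.name, "price": item.price}

# Run with: uvicorn main:app --reload  (assuming this file is main.py)
```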

5.1.2. Flask - Web Development

5.1.2.1. Flask is a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications. It began as a simple wrapper around Werkzeug and Jinja and has become one of the most popular Python web application frameworks.

5.1.2.2. Flask offers suggestions, but doesn't enforce any dependencies or project layout. It is up to the developer to choose the tools and libraries they want to use. There are many extensions provided by the community that make adding new functionality easy.
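
A minimal Flask application sketch; the route names are illustrative.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello, Flask!"

@app.route("/api/status")
def status():
    # jsonify builds a JSON response with the right content type.
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)  # development server only, not for production
```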

5.1.3. Dash & Streamlit - DS Web Framework

5.1.3.1. As always, “it depends” – but if you’re looking for a quick answer, you should probably use:

5.1.3.2. Dash if you already use Python for your analytics and you want to build production-ready data dashboards for a larger company.

5.1.3.3. Streamlit if you already use Python for your analytics and you want to get a prototype of your dashboard up and running as quickly as possible (a minimal sketch follows this list).

5.1.3.4. Shiny if you already use R for your analytics and you want to make the results more accessible to non-technical teams.

5.1.3.5. Jupyter if your team is very technical and doesn’t mind installing and running developer tools to view analytics.

5.1.3.6. Voila if you already have Jupyter Notebooks and you want to make them accessible to non-technical teams.

5.1.3.7. Flask if you want to build your own solution from the ground up.

5.1.3.8. Panel if you already have Jupyter Notebooks, and Voila is not flexible enough for your needs.
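
A minimal Streamlit prototype sketch; the file name and sample data are made up for illustration.

```python
# Run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Quick prototype dashboard")

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [100, 140, 90]})

# An interactive widget: rerunning the script on each change is Streamlit's model.
threshold = st.slider("Minimum sales", 0, 200, 50)
st.bar_chart(df[df["sales"] >= threshold].set_index("month"))
```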

5.2. MLOps

5.2.1. Pycaret MLFlow Integration

5.2.1.1. PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that exponentially speeds up the experiment cycle and makes you more productive.

5.2.1.2. Compared with the other open-source machine learning libraries, PyCaret is an alternate low-code library that can be used to replace hundreds of lines of code with a few lines only. This makes experiments exponentially fast and efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.

5.2.1.3. The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen Data Scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.
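
A hedged sketch of PyCaret's low-code workflow with MLflow logging enabled. The dataset is a PyCaret sample and the target column name is assumed from its documentation; the experiment name is illustrative.

```python
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

data = get_data("diabetes")

# log_experiment=True sends runs to MLflow; experiment_name is illustrative.
setup(data=data, target="Class variable",
      log_experiment=True, experiment_name="diabetes_demo")

# Trains and ranks many candidate models with a single call.
best_model = compare_models()
```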

5.2.2. MLFlow - Machine Learning Lifecycle, Tracking, Deployment

5.2.2.1. MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components (a minimal Tracking example follows the list):

5.2.2.1.1. MLflow Tracking

5.2.2.1.2. MLflow Projects

5.2.2.1.3. MLflow Models

5.2.2.1.4. Model Registry
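
A minimal sketch of the Tracking component; the experiment name, parameter, metric, and artifact below are illustrative values, not from any real run.

```python
import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # hyperparameters

    # Create a small local file so there is something to log as an artifact.
    with open("model_summary.txt", "w") as f:
        f.write("example artifact")

    mlflow.log_metric("rmse", 0.72)           # evaluation results
    mlflow.log_artifact("model_summary.txt")  # any local file
```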

5.2.2.2. Advantages

5.2.2.2.1. WORKS WITH ANY ML LIBRARY, LANGUAGE & EXISTING CODE

5.2.2.2.2. RUNS THE SAME WAY IN ANY CLOUD

5.2.2.2.3. DESIGNED TO SCALE FROM 1 USER TO LARGE ORGS

5.2.2.2.4. SCALES TO BIG DATA WITH APACHE SPARK

5.2.3. MetaFlow - Scalable AWS Jobs for Data Scientists

5.2.3.1. At a high level, the Metaflow interface allows us to wrap our favorite bits of Python—or R as of July 2020—in simple functions or decorators, and then execute them within the context of Metaflow. Metaflow thinks about its execution logic in the context of a directed acyclic graph (DAG), which is a sequence of steps with a clear direction and dependency structure.
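
A minimal Metaflow sketch: each @step is a node in the DAG and self.next defines the edges. The flow name and message are illustrative.

```python
from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        self.message = "hello"
        self.next(self.transform)

    @step
    def transform(self):
        self.message = self.message.upper()
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == "__main__":
    HelloFlow()  # run with: python hello_flow.py run
```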

5.3. Cloud

5.3.1. boto3 (AWS) - AWS Python SDK

5.3.1.1. The AWS SDK for Python (Boto3) provides a Python API for AWS infrastructure services. Using the SDK for Python, you can build applications on top of Amazon S3, Amazon EC2, Amazon DynamoDB, and more.
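
A minimal boto3 sketch; it assumes AWS credentials are already configured (for example via environment variables or ~/.aws/credentials), and the bucket name and file paths are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# List existing buckets in the account.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file to a bucket under a given key.
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")
```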

5.3.2. Google Cloud - GCP Python SDK

5.3.2.1. Same as AWS SDK but for Google

5.3.3. Azure - Azure Python SDK

5.3.3.1. The open-source Azure libraries for Python simplify provisioning, managing, and using Azure resources from Python application code.

5.3.3.2. The Azure libraries are how you communicate with Azure services from Python code that you run either locally or in the cloud. (Whether you can run Python code within the scope of a particular service depends on whether that service itself currently supports Python.)

5.3.3.3. The libraries support Python 3.6 or later, and they are also tested with PyPy 5.4+.
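
A minimal sketch using one of the Azure libraries (azure-storage-blob) to upload a file to Blob Storage; the connection string, container name, and file paths are placeholders.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("example-container")

# Upload a local file as a blob, overwriting any existing blob with the same name.
with open("report.csv", "rb") as data:
    container.upload_blob(name="reports/report.csv", data=data, overwrite=True)
```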

5.4. ETL & Automations

5.4.1. Airflow - Workflow Scheduling & Monitoring

5.4.1.1. Airflow workflows are configuration as code in Python. Developers can write code that instantiates pipelines dynamically (a minimal DAG sketch follows this list).

5.4.1.2. You can easily define your own operators, executors, and extend the library to suit your needs.

5.4.1.3. Airflow is modular and can orchestrate any number of workers, making it highly scalable.

5.4.1.4. You can easily convert your existing workflow into a DAG due to its wide range of Operators.

5.4.1.5. Workflows can be triggered on a specific schedule or by external events.

5.4.1.6. Airflow provides an analytical dashboard to help you optimize your workflow.

5.4.1.7. You can run multiple data pipelines at different schedules in one Airflow instance.
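
A minimal Airflow DAG sketch (assuming Airflow 2.x); the dag_id, schedule, and task callables are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

def load():
    print("loading data...")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # load runs only after extract succeeds.
    extract_task >> load_task
```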

5.4.2. Luigi - Batch Job Tool, Scheduling, Monitoring

5.4.2.1. Luigi is a Python package that manages long-running batch processing, which is the automated running of data processing jobs on batches of items. Luigi allows you to define a data processing job as a set of dependent tasks. For example, task B depends on the output of task A. And task D depends on the output of task B and task C. Luigi automatically works out what tasks it needs to run to complete a requested job.

5.4.2.2. Overall Luigi provides a framework to develop and manage data processing pipelines. It was originally developed by Spotify, who use it to manage plumbing together collections of tasks that need to fetch and process data from a variety of sources. Within Luigi, developers at Spotify built functionality to help with their batch processing needs including handling of failures, the ability to automatically resolve dependencies between tasks, and visualization of task processing. Spotify uses Luigi to support batch processing jobs, including providing music recommendations to users, populating internal dashboards, and calculating lists of top songs.
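
A minimal Luigi sketch of dependent tasks: TaskB requires TaskA's output, so Luigi runs TaskA first if its target is missing. Task names and file paths are illustrative.

```python
import luigi

class TaskA(luigi.Task):
    def output(self):
        return luigi.LocalTarget("a.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data")

class TaskB(luigi.Task):
    def requires(self):
        return TaskA()  # dependency: Luigi resolves and runs this first

    def output(self):
        return luigi.LocalTarget("b.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())

if __name__ == "__main__":
    luigi.build([TaskB()], local_scheduler=True)
```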

5.4.3. Ansible - Deployment Automation

6. Data Science, OLAP, OLTP, Fact Tables and Dimension Tables

6.1. What is OLAP?

6.1.1. OLAP (Online Analytical Processing) is a flexible way to perform complex analysis of multidimensional data.

6.1.2. Data present in a data warehouse is accessed by running OLAP queries. Databases, however, are queried by running OLTP (Online Transaction Processing) operations.

6.1.3. OLAP activities are performed by converting the multi-dimensional data in a Warehouse into an OLAP cube.


6.1.5. OLAP is used to perform real-time analysis on multidimensional data in a Warehouse.

6.1.6. OLAP deals with De-normalized data.

6.2. What is OLTP?

6.2.1. OLTP stands for Online Transaction Processing.

6.2.2. OLTP queries are used to perform DML operations (insert, update, delete) on a database.

6.2.3. OLTP systems use data stored in the form of two-dimensional tables, with rows and columns.

6.3. Dimension Table

6.3.1. A dimension table stores the descriptive attributes (for example customer, product, or date) used to filter and group the facts.

6.4. Fact Tables

6.4.1. A fact table stores the measurements or metrics of a business process (for example sales amount), together with foreign keys to the dimension tables.

6.5. Additive, Semi-Additive and Non-Additive Facts

6.5.1. An Additive fact is a measure in a fact table that can be fully summed across any of the dimensions associated with it.

6.5.2. A semi-additive fact is a measure in a fact table that can be summed across some dimensions associated with it, but not all.

6.5.3. A non-additive fact is a measure in a fact table that cannot be summed across any of the dimensions associated with it (see the sketch below).
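
A small pandas sketch of the distinction: "sales_amount" is additive (summing it over any dimension is meaningful), while "account_balance" is semi-additive (summing across dates is not meaningful). The data is made up for illustration.

```python
import pandas as pd

fact_sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "date": ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"],
    "sales_amount": [100, 150, 80, 120],      # additive fact
    "account_balance": [500, 520, 300, 310],  # semi-additive fact
})

print(fact_sales.groupby("store")["sales_amount"].sum())  # valid: sum over store
print(fact_sales.groupby("date")["sales_amount"].sum())   # valid: sum over date
# Summing account_balance over dates would double-count; use an average or
# the latest value per account instead.
```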

7. Summary

7.1. For transactions - Database

7.2. For quick querying - Data warehouse

7.3. For semi-structured or unstructured data - Data lake

8. Big Data

8.1. What makes something big data?

8.1.1. Volume

8.1.2. Velocity

8.1.3. Variety

8.2. Private

8.2.1. What Data Is Private?

8.2.1.1. Demographic data that indicates racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, or data concerning a natural person's sex life or sexual orientation, as well as web behavior and IP addresses

8.2.1.1.1. racial

8.2.1.1.2. ethnic

8.2.1.1.3. political opinions

8.2.1.1.4. religious

8.2.1.1.5. philosophical

8.2.1.1.6. trade union membership

8.2.1.1.7. genetic data

8.2.1.1.8. biometric data

8.2.1.1.9. data concerning health

8.2.1.1.10. data concerning a natural person's sex life

8.2.1.1.11. sexual orientation

8.2.1.1.12. web behavior and IP addresses

8.3. Types of Data

8.3.1. Nature of the Data

8.3.1.1. Structured Data

8.3.1.1.1. • Rows and columns

8.3.1.1.2. • Variables are predefined

8.3.1.1.3. • Spreadsheets

8.3.1.1.4. • Relational databases

8.3.1.2. Semi Structured Data

8.3.1.2.1. • Variables are marked with tags

8.3.1.2.2. • Tags can vary freely

8.3.1.2.3. • HTML

8.3.1.2.4. • XML

8.3.1.2.5. • JSON

8.3.1.3. Unstructured Data

8.3.1.3.1. • Variables and fields are not labelled or identified

8.3.1.3.2. • Free text

8.3.1.3.3. • Photos

8.3.1.3.4. • Video

8.3.1.3.5. • Audio

8.3.2. State of the Data

8.3.2.1. Batch Data

8.3.2.1.1. • Data is "at rest"

8.3.2.1.2. • Can analyze in as much detail as desired

8.3.2.1.3. • Can look for patterns, groups, predictions

8.3.2.2. Stream Data

8.3.2.2.1. • Data is constantly arriving

8.3.2.2.2. • May not keep old data

8.3.2.2.3. • Looking for changes over time and for anomalies

9. Data Lakes

9.1. Before Data Lakes

9.1.1. Before data lakes, data's origin and path were tracked through metadata management, which placed a heavy emphasis on lineage (the source) of the data. Data went through many hops to reach the tools that consumed it.

9.2. What is a Data Lake?

9.2.1. • Designed to do away with "data silos"

9.2.2. • Made the big data revolution possible

9.2.3. • Designed to capture raw data (structured, semi-structured, unstructured)

9.2.4. • Made for large amounts of data

9.2.5. • Used for ML and AI in its current state, or for analytics with processing

9.2.6. • Hold data that is structured, semi-structured, and unstructured

9.2.7. • Can organize and put into Databases or Data Warehouses

9.3. When Data Lakes go bad

9.3.1. But just as with data warehouses, there are some potential risks with data lakes, and one is that you can end up with a data swamp.

9.3.2. If the data lake isn't well maintained, if mistakes in the data aren't fixed, if adequate metadata isn't provided, if multiple, but different, versions of the same data are coexisting, if formats aren't adapted to the software used, you run the risk of getting this swamp-like situation, which can be a dangerous place to get lost, and really becomes a place where data go to die.

9.3.3. And so people came to recognize there were some limitations to this approach as well, which leads us to the current preferred solution, which is the cloud.

10. Cloud

10.1. Types of Cloud

10.1.1. 1 Multicloud

10.1.1.1. Using several cloud storage and computing providers simultaneously

10.1.2. 2 Hybrid Cloud

10.1.2.1. Using both public cloud providers and a secure, private cloud

10.2. What is Cloud?

10.2.1. This is when you use services like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud, among others. Because it's fully online, it's more accessible, and it's more adaptable: you can get more space and more computing resources immediately, as you go. So there are some huge advantages to storing and working with big data in the cloud.

10.2.2. Because we're in the data cloud era, it's worth pointing out a couple of variations that have become particularly popular. One is the multicloud, also called polynimbus.

10.2.3. This is using several different cloud storage and computing providers simultaneously. So you might use both Microsoft Azure and Google Cloud, plus some IBM cloud computing resources. The point is that you can integrate them into a single heterogeneous environment, so it is as though they were one system. That's a popular approach because it lets you use the strengths of each provider.

10.2.4. But, as with every other solution, there are some risks, which I'm going to call data dissipation. The major risks of storing data in the cloud have to do, first, with security: the organization may not be keeping track of who has access to the resources, or it might be susceptible to hacking.

10.2.5. Second, there's cost containment. When things are infinitely expandable at any time, you may not be keeping track of the costs you're racking up.

10.2.6. Third, there's the remote possibility that the services you use might become temporarily inaccessible, or disappear completely. So you do run the risk of your data dissipating, or of not knowing what's happening to it.

11. Edge Computing

11.1. • Central storage and processing

11.2. • Connected devices at edges of network

11.3. • But devices can also store and process data

12. Enterprise Data Warehouses