SQL by Mind Map: SQL

1. Filter

1.1. reading

1.1.1. All (SELECT)

1.1.1.1. SELECT DISTINCT salary/12 AS monthly_wage FROM ...;

1.1.1.1.1. SELECT result FROM table WHERE selection

1.1.2. Parts (WHERE)

1.1.2.1. SELECT ... FROM ... WHERE column = value AND/OR/NOT column2 <> value2;

1.1.2.1.1. Searching in text

1.1.2.2. SELECT ... FROM t WHERE id IN (Subquery)

1.1.2.2.1. SELECT * FROM t WHERE id NOT IN ( SELECT ... FROM ... WHERE ... );
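A runnable sketch of the filtering patterns above, using Python's built-in sqlite3 (the table and column names are made up for illustration):

```python
import sqlite3

# Hypothetical schema/data, just to exercise the patterns above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT, salary REAL)")
conn.executemany("INSERT INTO instructor VALUES (?, ?, ?)",
                 [(1, "Ada", 60000), (2, "Bob", 48000), (3, "Cleo", 60000)])
conn.execute("CREATE TABLE teaches (id INTEGER)")
conn.execute("INSERT INTO teaches VALUES (2)")

# SELECT DISTINCT with an expression and an alias
monthly = {row[0] for row in conn.execute(
    "SELECT DISTINCT salary/12 AS monthly_wage FROM instructor")}

# WHERE with AND plus a comparison operator
rich = [row[0] for row in conn.execute(
    "SELECT name FROM instructor WHERE salary > 50000 AND id <> 3")]

# NOT IN with a subquery: instructors who teach nothing
idle = [row[0] for row in conn.execute(
    "SELECT name FROM instructor WHERE id NOT IN (SELECT id FROM teaches)")]
```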

2. Join Operations

2.1. returns combined relations (FROM) as a new table or view

2.1.1. Inner

2.1.1.1. SELECT ... FROM course AS C , prereq AS P WHERE C.course_id = P.course_id;

2.1.1.1.1. Ignores rows where the IDs don't match!

2.1.2. Outer

2.1.2.1. Left (Null on right side)

2.1.2.1.1. SELECT ... FROM course AS C LEFT OUTER JOIN prereq AS P ON C.course_id = P.course_id;

2.1.2.2. Right (Null on left side)

2.1.2.2.1. SELECT ... FROM course AS C RIGHT OUTER JOIN prereq AS P ON C.course_id = P.course_id;

2.1.2.3. Full (Null)

2.1.2.3.1. SELECT ... FROM course AS C FULL OUTER JOIN prereq AS P ON C.course_id = P.course_id;
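The inner vs. outer join behaviour can be checked in a small sqlite3 session (RIGHT/FULL OUTER JOIN need a recent SQLite, so this sketch sticks to INNER and LEFT; a RIGHT join is just a LEFT join with the tables swapped):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course (course_id TEXT, title TEXT);
CREATE TABLE prereq (course_id TEXT, prereq_id TEXT);
INSERT INTO course VALUES ('CS1', 'Intro'), ('CS2', 'Algorithms');
INSERT INTO prereq VALUES ('CS2', 'CS1');
""")

# Inner join: ignores rows whose IDs don't match (CS1 disappears)
inner = conn.execute("""
    SELECT C.course_id, P.prereq_id
    FROM course AS C JOIN prereq AS P ON C.course_id = P.course_id
""").fetchall()

# Left outer join: keeps CS1 and pads the right side with NULL
left = conn.execute("""
    SELECT C.course_id, P.prereq_id
    FROM course AS C LEFT OUTER JOIN prereq AS P ON C.course_id = P.course_id
""").fetchall()
```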

3. Null Values

3.1. Find

3.1.1. SELECT... FROM ... WHERE x IS NULL;

3.1.1.1. Comparison with null returns unknown

3.1.1.1.1. Three-valued logic: OR takes the "positive" perspective (true OR unknown = true), AND evaluates strictly (false AND unknown = false), NOT unknown = unknown
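The unknown semantics are easy to observe in sqlite3: a comparison with NULL evaluates to unknown and filters the row out, while IS NULL matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])

# '= NULL' evaluates to unknown, so no row qualifies
eq_null = conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]

# three-valued logic: TRUE OR unknown = TRUE, FALSE AND unknown = FALSE,
# NOT unknown = unknown (NULL)
tvl = conn.execute("SELECT 1 OR NULL, 0 AND NULL, NOT NULL").fetchone()
```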

4. Aggregate Functions

4.1. Basics

4.1.1. SELECT AVG(salary), MAX(salary), SUM(salary), COUNT(DISTINCT id) FROM ...;

4.2. Group by

4.2.1. SELECT x, AVG(salary) AS avg_salary FROM ... GROUP BY x; ! Attributes outside the aggregate function must appear in the GROUP BY

4.3. AGG + WHERE SELECTION

4.3.1. WHERE doesn't work because it is executed before the aggregation; use HAVING instead

4.3.1.1. SELECT x, AVG(salary) AS avg_sal FROM ... GROUP BY x HAVING AVG(salary) > 42000;
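GROUP BY plus HAVING in action (hypothetical data): the filter on the aggregate is applied after grouping, which is exactly what WHERE cannot do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dep TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("IT", 70000), ("IT", 50000), ("HR", 40000)])

# HR's average (40000) fails the HAVING condition and is dropped
rows = conn.execute("""
    SELECT dep, AVG(salary) AS avg_sal
    FROM emp
    GROUP BY dep
    HAVING AVG(salary) > 42000
""").fetchall()
```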

5. Nested Subqueries

5.1. From WHERE clause

5.1.1. For set membership, set comparison, and set cardinality

5.1.1.1. Set Membership

5.1.1.1.1. Same as Basic > reading > WHERE > same table twice

5.2. From FROM clause

5.2.1. Temporary relation with WITH

5.2.1.1. WITH max_budget AS ( SELECT MAX(budget) AS value FROM department ) SELECT department.name FROM department, max_budget WHERE department.budget = max_budget.value;

5.3. From SELECT Clause

5.3.1. Returns single Values

5.3.1.1. SELECT dep_name, ( SELECT COUNT(*) FROM instructor WHERE xyz ) AS num_instructor FROM dept;

6. Set Operations

6.1. Combines independent Queries based on UNION (or), INTERSECT (and), EXCEPT (but not)
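UNION, INTERSECT, and EXCEPT over two single-column tables (ORDER BY added so the result order is deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (x INTEGER);
CREATE TABLE b (x INTEGER);
INSERT INTO a VALUES (1), (2), (3);
INSERT INTO b VALUES (2), (3), (4);
""")

def col(sql):
    return [r[0] for r in conn.execute(sql)]

union = col("SELECT x FROM a UNION SELECT x FROM b ORDER BY x")      # "or"
inter = col("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x")  # "and"
excpt = col("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY x")     # "but not"
```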

7. SQL Part 2

7.1. Data Definition Language (defines and modifies the structure of database objects)

7.1.1. Datatypes

7.1.1.1. Dates

7.1.1.1.1. Arithmetics

7.1.1.2. User-defined types (e.g. CREATE TYPE Dollars AS NUMERIC(12,2);) CREATE TABLE ... ( dep_name VARCHAR(20), building VARCHAR(20), budget Dollars);

7.1.1.3. Domains (like a type but with a NOT NULL constraint), so that we don't have to check every entry as it is created

7.1.1.3.1. CREATE DOMAIN name CHAR(20) NOT NULL

7.1.1.4. Normal Datatypes

7.1.1.5. Large Object types

7.1.2. CREATE (Like constructor)

7.1.2.1. Vars

7.1.2.1.1. CREATE TABLE x ( id CHAR(5), salary NUMERIC(8,2));

7.1.2.2. Keys

7.1.2.2.1. Primary

7.1.2.2.2. Foreign

7.1.3. Modify : DROP and ALTER

7.1.3.1. DROP TABLE r (doesn't work if it still has references)

7.1.3.2. ALTER TABLE x ADD y VARCHAR(100); adds a column y with the given domain

7.1.3.3. ALTER TABLE t DROP CONSTRAINT "xyz"

7.1.3.4. ALTER TABLE t MODIFY COLUMN "factor" DECIMAL(4,2) DEFAULT NULL; widens the column so that values like 10.00 fit (DECIMAL(3,2) tops out at 9.99)

7.1.3.4.1. ALTER TABLE CustomerOrder ADD COLUMN order_type ENUM('to go', 'eating in') NOT NULL DEFAULT 'eating in';

7.1.4. TRUNCATE drops and recreates table vs DELETE

7.1.4.1. CREATE VIEW geo AS ( SELECT * FROM t WHERE dep = 'GEO'); GRANT SELECT ON geo TO geo_role

7.1.4.1.1. DELETE removes rows one by one and can be filtered with a WHERE clause. If WHERE is omitted, all rows are deleted but the table structure is kept

7.1.4.1.2. TRUNCATE removes all rows instantly; it takes no filter, is faster, and typically cannot be rolled back
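A sketch of the DELETE behaviour (SQLite has no TRUNCATE, so the no-WHERE DELETE stands in for it here; the key point is that the table structure survives):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, dep TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "GEO"), (2, "GEO"), (3, "MATH")])

conn.execute("DELETE FROM t WHERE dep = 'GEO'")   # filtered delete
after_filter = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

conn.execute("DELETE FROM t")                     # no WHERE: all rows go...
after_all = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# ...but the table itself still exists and accepts new rows
conn.execute("INSERT INTO t VALUES (4, 'BIO')")
final = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```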

7.2. Data Manipulation Language (retrieve, insert, update and delete data in tables)

7.2.1. INSERT

7.2.1.1. IMPLICIT COLUMN ORDER: INSERT INTO t VALUES (101, 'y', 20); the values must match the number of columns in t and are assumed to cover all columns in the correct order

7.2.1.1.1. EXPLICIT COLUMN SELECTION: INSERT INTO t (ID, name, age) VALUES (101, 'y', 20); (or SELECT ... FROM another table) More flexibility and fewer errors

7.2.2. UPDATE

7.2.2.1. UPDATE t SET salary = 8000 WHERE id = xyz;

7.2.2.1.1. Use CASE to avoid problems with order-dependent updates

7.2.3. DELETE

7.2.3.1. DELETE FROM t WHERE dep = 'finance';

7.2.3.1.1. DELETE FROM t WHERE salary < ( SELECT AVG(salary) FROM instructor );
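The CASE trick from 7.2.2.1.1 made concrete (hypothetical table): a conditional raise applied in one pass. Two sequential UPDATEs could hit the same row twice once the first raise pushes it over the boundary; CASE cannot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, 40000), (2, 60000)])

# One pass with CASE: each row is matched by exactly one branch
conn.execute("""
    UPDATE emp SET salary = CASE
        WHEN salary < 50000 THEN salary * 1.10
        ELSE salary * 1.05
    END
""")
salaries = {i: round(s) for i, s in conn.execute("SELECT id, salary FROM emp")}
```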

7.3. Database Logic

7.3.1. Views: act like a table but are not physically stored. Simplify complex queries by hiding underlying logic. Provide security by restricting access to certain columns or rows. Can be used to present aggregated or calculated data.

7.3.1.1. Create

7.3.1.1.1. CREATE VIEW viewname AS (nothing goes in between here!) SELECT id, name, dep FROM instructor WHERE ...;

7.3.1.2. Update

7.3.1.2.1. Possible if: the FROM clause has only one database relation, and the SELECT clause contains only attribute names of that relation and no aggregates

7.3.2. Prevent Data Errors

7.3.2.1. Declare NOT NULL

7.3.2.1.1. name VARCHAR(20) NOT NULL

7.3.2.2. Declare UNIQUE

7.3.2.3. CONSTRAINT ... CHECK

7.3.2.3.1. CREATE TABLE t ( ... CONSTRAINT "semester_fall" CHECK (semester IN ('Fall', 'Spring')));

7.3.3. Integrity

7.3.3.1. Cascading updates/deletions

7.3.3.1.1. If you delete X and Y references it, Y is also deleted: CREATE TABLE t ( ... FOREIGN KEY (dep) REFERENCES x ON DELETE CASCADE ON UPDATE CASCADE);
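ON DELETE CASCADE demonstrated in sqlite3 (SQLite keeps foreign-key checks off by default, so they must be switched on per connection; table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite default is OFF
conn.executescript("""
CREATE TABLE department (name TEXT PRIMARY KEY);
CREATE TABLE instructor (
    id INTEGER,
    dep TEXT REFERENCES department(name)
        ON DELETE CASCADE ON UPDATE CASCADE
);
INSERT INTO department VALUES ('GEO');
INSERT INTO instructor VALUES (1, 'GEO');
""")

# Deleting the referenced department also deletes its instructors
conn.execute("DELETE FROM department WHERE name = 'GEO'")
left = conn.execute("SELECT COUNT(*) FROM instructor").fetchone()[0]
```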

7.3.3.2. primary key constraints

7.3.3.2.1. Unique and non-null

7.3.4. Authorization

7.3.4.1. GRANT SELECT/INSERT/UPDATE/DELETE/ALL ON table or view TO user/PUBLIC

7.3.4.1.1. If you don't want to write out each user in the list, use ROLES

7.3.4.2. REVOKE SELECT/INSERT/UPDATE/DELETE/ALL ON relation or view FROM user list

8. Entity Relationship Models -> From modelling to creating database

8.1. Design Process

8.1.1. Requirements Engineering

8.1.1.1. What data and how much? Which types of entities and what relations? What attributes do they have?

8.1.1.1.1. Final Phase

8.2. Cardinality Constraints

8.2.1. in binary relationset:

8.2.1.1. 1:1 -> "at most 1" or 1..1; 1:n (1+); n:1; n:m. Not all instances of a set need to be mapped! So a 1:n can include an instructor taking no students

8.2.1.1.1. Find the cardinality by asking from the other direction

8.3. ER Diagram

8.3.1. cardinalities

8.3.1.1. - ->

8.3.1.2. complex

8.3.1.2.1. 1..* 1..1 0..*

8.3.2. Roles

8.3.2.1. see relationship sets.

8.3.3. Participation

8.3.3.1. total =

8.3.3.1.1. each entity in the entity set appears in at least one relationship in the relationship set

8.3.3.2. Partial -

8.3.3.2.1. not every instructor needs to have a student: some entities may not participate in the relationship

8.3.4. Displaying attributes

8.3.4.1. composite attribute: indentation

8.3.4.2. multivalued {}

8.3.4.3. derived ()

8.3.5. Higher Arity relationship sets

8.3.5.1. only one single arrow pointing to the 3rd entity (interpretation issues)

8.3.6. Specialisation: inherit attributes from the superset

8.3.6.1. overlapping: both employee and student; disjoint: either one or the other, but not both

8.3.6.1.1. Partial

8.3.6.1.2. Total

8.4. Reduction to Relation Schemas

8.4.1. Goal: transform ER sets uniformly into relation schemas. How? For each ER set there is a unique relation; this relation is assigned the name of the corresponding ER set

8.4.1.1. creates another table named after the relationset

8.4.1.1.1. M:M Relationship sets

8.4.1.1.2. 1:M

8.4.1.1.3. 1:1

8.4.1.2. representing attributes

8.4.1.2.1. composite attributes

8.4.1.2.2. Multivalued

8.4.1.2.3. Derived

8.4.1.2.4. Weak entities

8.4.1.2.5. Higher arity relationships

8.4.1.2.6. specialisation

8.5. Design Decisions

8.5.1. pull multivalued attributes out as their own relation (instru_phone)

8.5.1.1. Placement of primary key in 1:1 relationships

8.6. Comparison UML

9. Normalization asks: What is a good ER design? If we insert data, do we run into inconsistencies or anomalies? Implementing NF leads to: consistency +, redundancy -, tables +, complexity +. Redundancy -> 3 anomalies -> inconsistency. Example: Employee(EmpID, EmpName, DeptName, DeptLocation). Update anomaly: if a value (e.g. the department location) changes, you have to update it in many rows; if you miss one => inconsistent data. Insertion anomaly: you can't insert some information unless you insert unrelated information, e.g. you can't store the name of a department unless there's already an employee in it. Deletion anomaly: if you delete the last employee in a department, you also lose the information about the department itself.
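The update anomaly and its fix can be made concrete by decomposing the example table so the location is stored exactly once (names taken from the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized: DeptLocation repeated per employee -> update anomaly
conn.execute("CREATE TABLE employee_flat (emp TEXT, dept TEXT, loc TEXT)")
conn.executemany("INSERT INTO employee_flat VALUES (?, ?, ?)",
                 [("Ada", "IT", "Bldg A"), ("Bob", "IT", "Bldg A")])
# Updating only one row leaves the data inconsistent:
conn.execute("UPDATE employee_flat SET loc = 'Bldg B' WHERE emp = 'Ada'")
locs = {r[0] for r in conn.execute("SELECT loc FROM employee_flat")}

# Normalized: location stored once in a department relation
conn.executescript("""
CREATE TABLE employee (emp TEXT, dept TEXT);
CREATE TABLE department (dept TEXT PRIMARY KEY, loc TEXT);
INSERT INTO employee VALUES ('Ada', 'IT'), ('Bob', 'IT');
INSERT INTO department VALUES ('IT', 'Bldg A');
""")
conn.execute("UPDATE department SET loc = 'Bldg B' WHERE dept = 'IT'")
norm_locs = {r[0] for r in conn.execute(
    "SELECT d.loc FROM employee e JOIN department d ON e.dept = d.dept")}
```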

9.1. 1 NF (PK) Problem: Math, Physics in one column. Inconsistency if Physics deleted

9.1.1. Solution/Concept: Atomicity (units are not divisible)

9.1.1.1. Def: Relational Schema is in NF if domains of all attributes are atomic no repeating groups or MultiValuedAttributes

9.1.1.1.1. Solution: create a new relation with a reference to all components of the primary key of the old relation

9.2. 2 NF (PK) Problem: partial functional dependency ON A COMPOSITE KEY. Enrollment = (PK1, PK2); StudentName depends only on PK2

9.2.1. Concept: Full Functional Dependencies if we know... we can determine ....

9.2.1.1. Def: all non-key attributes are functionally dependent on the entire primary key. Violation: an attribute depends on a proper subset of a candidate key (AB -> C but B -> C). Means: I really need both columns of the candidate key to identify the attribute. If there is only one single-attribute candidate key, 2NF is already satisfied

9.2.1.1.1. Violation: one column is dependent on a minimal component of the composed PK

9.3. 3NF (PK) Problem: transitivity leads to redundancy and the classic 3 anomalies (update, deletion, insertion)

9.3.1. Concept: further dependencies. Meaning: there is a second functional dependency (transitivity): A -> B, B -> C; PK -> A1, A1 -> A2

9.3.1.1. Def: no attribute is transitively dependent on the PK

9.3.1.1.1. Solution: apply 1NF, apply 2NF, split tables along the dependency: B is a FK of table 1 and the PK of table 2

9.4. Boyce-Codd NF (PK, CK)

9.4.1. Violation: A -> B where A is non-prime and B is prime (at least part of a composite key)

9.4.1.1. stricter than 3NF

9.4.1.1.1. Example: If AB->a, then a -> B is a violation

9.5. 4 NF (PK, CK)

9.5.1. Concept: Multi-valued attributes. For one PK, there are multiple phone numbers (independent MVA)

9.5.1.1. Def: R in NF if R does not contain more than one Multivalued attribute

9.5.1.1.1. Solution: separate relation for each multi-valued attribute. Identify the PK

9.6. 5 NF (PK, CK)

9.6.1. Concept: additional constraints (DM only in 2nd semester). Dependent MVA

9.6.1.1. Problem: we cannot insert a tuple with an unknown instructor (NULL) due to the PK NOT NULL constraint

9.6.1.1.1. Def: in NF if R cannot be decomposed and rejoined based on keys without removing or adding info

9.7. Domain Key NF

9.7.1. Def: is in DKNF if no insertion or deletion anomalies are possible in the DB; all constraints are encoded in the DB

9.7.2. Concept: Domain Constraint like students can only work 30h /week

9.7.2.1. Solution: new relation with the specific constraint. PK is the subgroup

9.8. Other stuff

9.8.1. Intentional Denormalization

9.8.1.1. not normalizing is better for performance

10. Indexing and Hashing

10.1. Complexity Theory

10.1.1. Time and Memory Complexity

10.1.1.1. It's about relative numbers and scaling, not absolute numbers; absolute time is a hardware property, so (N*t) becomes N

10.1.1.1.1. Rules: ignore constants; the highest complexity term dominates (limit argument); O(1) is constant complexity that doesn't change with N

10.2. Indexing or Index Scan

10.2.1. From linear search (O(N)) to binary search (O(log2 N))

10.2.1.1. Index Files (with an index we don't have to go through all N records; the search key speeds up lookup)

10.2.1.1.1. Search in the index is O(log2 N) and following the link is O(1)

10.2.1.2. Index files and joins

10.2.1.3. Ordered indices (good for ranges): entries are stored sorted (allows O(log2 N) search)

10.2.1.3.1. example : B+ Tree

10.2.1.4. Hash indices (good for unique values like strings)

10.2.1.4.1. Hashing: time complexity O(1). Compute the storage location directly from the hash function. Fast average access time for exact key lookups; can degrade in worst-case scenarios (e.g., many hash collisions). Not suitable for range queries. Best for queries like "find the record where ID = 100". Indexing (ordered index): time complexity O(log n). Look up the value and retrieve the storage location. Slower on average than hashing but more stable and predictable. Supports both exact lookups and range queries, e.g. "find all records where age is between 20 and 30". Maintains data in sorted order (e.g., using B+ trees).
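The trade-off can be mimicked in plain Python: a dict behaves like a hash index (O(1) exact lookups, no ranges), while a sorted key list with bisect behaves like an ordered index (O(log n) lookup, range queries work):

```python
import bisect

records = {100: "Ada", 25: "Bob", 42: "Cleo", 7: "Dan"}

# "Hash index": direct exact-match lookup
exact = records[100]

# "Ordered index": sorted keys + binary search support range queries
keys = sorted(records)                      # [7, 25, 42, 100]
lo = bisect.bisect_left(keys, 20)           # first key >= 20
hi = bisect.bisect_right(keys, 50)          # first key > 50
in_range = keys[lo:hi]                      # keys in [20, 50]
```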

10.2.1.5. Other types

10.2.1.5.1. Trees

10.2.1.6. Bitmap Indices

10.2.1.6.1. Good for attributes with few distinct values (unlike hash and ordered indices)

11. Database Architecture How is data stored?

11.1. Physical Storage and Storage Hierarchy

11.1.1. Storage Options: Hard Disk, SSD, Tape

11.1.2. Storage Hierarchy: the closer to the CPU, the faster the read and write (speed of access), but volatile: if the system crashes, the data is lost. The further from the CPU, the lower the cost (which is why we don't put everything next to the CPU) and the safer the data if the system fails

11.1.2.1. Primary

11.1.2.1.1. Cache, ram

11.1.2.2. Secondary

11.1.2.2.1. Flash memory (in USB sticks), magnetic disks

11.1.2.3. Tertiary

11.1.2.3.1. CD-Rom

11.2. File Organization

11.2.1. Heap: Records stored anywhere

11.2.1.1. Sequential File: Ordered by search key

11.2.1.1.1. Hashing: via hash function

11.3. Storage Access and Buffer Manager

11.3.1. Buffer: portion of main memory available to store copies of disk blocks. Buffer manager: subsystem responsible for allocating buffer space in main memory

11.3.1.1. Placement Strategies

11.3.1.1.1. LRU

11.3.1.1.2. MRU

11.3.1.1.3. Toss-immediate

11.4. DBS Architecture

11.4.1. Centralized: single machine (server system)

11.4.1.1. runs on a single computer with no interaction with others

11.4.1.1.1. Has only a few CPUs

11.4.2. Client Server system

11.4.2.1. Split into two parts: Server: runs the DBMS engine, stores data, executes SQL queries. Clients: user applications that send SQL requests to the server.

11.4.2.1.1. are connected to server via network

11.4.3. Parallel Database Systems: speed up systems with parallelism across processors; many CPUs working on one DB (focus = speed).

11.4.3.1. multiple processors

11.4.3.1.1. shared memory

11.4.3.1.2. shared disk

11.4.3.1.3. shared nothing

11.4.3.1.4. hierarchical

11.4.4. Distributed systems many independent DBs across locations, appear as one (focus = distribution & reliability).

11.4.4.1. Homogeneous vs heterogeneous; global vs local transactions; replication

11.4.4.1.1. Trade off: Datasharing, complexity of coordination, Autonomy

11.5. Performance Metrics: how much performance do we gain by enlarging the system? Optimum: doubling the system doubles the performance -> LINEARITY

11.5.1. Speedup: a fixed-size problem of a small system is given to a system N times larger

11.5.1.1. Speedup = small system elapsed time / large system elapsed time

11.5.1.1.1. Linear if the ratio equals N, where N is given in the task (e.g. # of processors). More CPUs, more speed (diagonal line in the coordinate system)

11.5.2. Scaleup: Increase the size of both problem and system

11.5.2.1. Scaleup = small system, small problem elapsed time / big system, big problem elapsed time

11.5.2.1.1. An N times larger system is used to perform an N times larger task (bigger problem size, same small/big time ratio). Horizontal line in the coordinate system; linearity = 1

11.5.3. Sublinear if: high start-up cost; interference (processes access shared resources, growing with the # of CPUs); skew (overall execution time is determined by the slowest parallelly executing task)

11.5.4. Transaction scaleup: N times as many users submit requests to an N times larger database on an N times larger computer

11.5.4.1. Old processing time/ new processing time Linearity = 1
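The two metrics as tiny formulas (the timing numbers are made up; linearity means speedup == N and scaleup == 1):

```python
def speedup(t_small_system: float, t_large_system: float) -> float:
    # same problem, system N times larger
    return t_small_system / t_large_system

def scaleup(t_small_on_small: float, t_large_on_large: float) -> float:
    # problem and system both grown N times; linear if this stays at 1
    return t_small_on_small / t_large_on_large

# 4x the processors cut elapsed time from 100s to 25s -> linear speedup
s = speedup(100.0, 25.0)
# 4x the problem on 4x the system still takes 100s -> linear scaleup
u = scaleup(100.0, 100.0)
```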

12. Query processing: what does the DB do if I aggregate something? Which part of a query is processed first?

12.1. Measuring Query cost (not optimizing yet!)

12.1.1. Query cost: measured by execution time. Many factors contribute (CPU, network communication); the most dominant is disk access

12.1.1.1. Disk access

12.1.1.1.1. DBs are stored on disk because storage close to the CPU, while faster, is volatile and expensive

12.2. How costly is Selection?

12.2.1. Algorithm 1: Linear Search. Assumption: the file is stored in consecutive blocks

12.2.1.1. Non-key: search all records

12.2.1.1.1. SELECT * FROM R WHERE age = 999; (algo scans everything)

12.2.1.2. Key: search until one match is found, then stop

12.2.1.2.1. SELECT * FROM Employees WHERE emp_id = 1050;

12.2.2. Algo 2 primary index, key

12.2.2.1. Primary index: an index built on the primary key or another attribute that is unique and sorted. Equality on key: exact value of the indexed attribute, like emp_id = 1001; retrieves one record

12.2.2.1.1. SELECT * FROM Employees WHERE emp_id = 1001;

12.2.3. Algo 3 primary index, non-key

12.2.3.1. Equality on non-key: searching for records where a non-unique attribute (anything except the ID) equals a given value, potentially returning multiple records

12.2.3.1.1. SELECT * FROM Employees WHERE department = 'Sales'; An index exists on department, but since it's non-key, we expect multiple matches. in special case of clustering index

12.2.4. Algo 4 Secondary Index, non-key

12.2.4.1. A secondary index is created on a non-primary-key attribute (not unique, allows duplicates). The data file itself is not sorted by this attribute -> matching records can be scattered randomly across the disk. Non-key: doesn't have unique values

12.2.4.1.1. SELECT * FROM Employees WHERE department = 'Sales';

12.2.5. Algo 5 primary index comparison

12.2.5.1. CREATE TABLE Employees ( emp_id INT PRIMARY KEY, age INT, salary INT ); Assuming we have a primary index on age, i.e., the data is sorted by age.

12.2.5.1.1. SELECT * FROM Employees WHERE age >= 30;

12.2.6. Algo 6 secondary index, comparison

12.2.6.1. Primary index CREATE TABLE Employees ( emp_id INT PRIMARY KEY, department VARCHAR(50), salary INT ); Secondary index CREATE INDEX idx_salary ON Employees(salary);

12.2.6.1.1. SELECT * FROM Employees WHERE salary >= 50000;

12.3. How costly is join?

12.3.1. Nested loop join

12.3.1.1. Worst case: memory can only hold one block of each relation -> nested loop. Example: (Outer) Instructor: 5000 records / 100 blocks; (Inner) Department: 10000 records / 400 blocks. Cost = (Ir * Db + Ib) transfers + (Ir + Ib) seeks

12.3.1.1.1. Best case: the smaller relation fits entirely in memory. Load the smaller relation into memory and go through the outer relation once

12.3.1.1.2. Transfers = Outer_r * Inner_b + Outer_b; Seeks = Outer_r + Outer_b
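Plugging the example numbers from above into the worst-case and best-case formulas:

```python
# Outer: Instructor, 5000 records in 100 blocks
# Inner: Department, 10000 records in 400 blocks
n_outer, b_outer = 5000, 100
n_inner, b_inner = 10000, 400

# Worst case (one buffer block per relation):
# the inner relation is scanned once per outer tuple
worst_transfers = n_outer * b_inner + b_outer
worst_seeks = n_outer + b_outer

# Best case: the inner relation fits entirely in memory,
# so each relation is read exactly once
best_transfers = b_outer + b_inner
best_seeks = 2
```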

12.3.1.2. Block nested loop join: process the inner/outer loops block by block

12.3.1.2.1. Worst case: only one block of each relation fits in memory. Cost: Ib*Db + Ib transfers + 2*Ib seeks

12.3.1.2.2. Best case: Ib + Db transfers + 2 seeks

12.3.1.3. Indexed Nested loop Join efficient if index exists

12.3.1.3.1. Cost: (Ib transfers) + (Ib seeks) + Ir * (5 block transfers and 5 seeks). Why 5? Department is in a B+ tree and has 10,000 tuples; the tree has a height of 4, so we need 5 block transfers to reach the actual data (via the pointer)

12.3.2. Merge join: efficient if data is sorted. Assumptions: relations are presorted and all matching tuples fit in memory

12.3.2.1. (Ib + Db block transfers) + (⌈Ib/bb⌉ + ⌈Db/bb⌉ seeks), where bb is the number of buffer blocks allocated per relation

12.3.2.1.1. block transfers = Outer_b + Inner_b

12.3.3. Hash Joins for natural joins

12.3.3.1. block transfers = b_r + b_s

12.3.3.1.1. seeks ≈ b_r + b_s

13. Query Optimization

13.1. Equivalence rules used for optimization (ONLY APPLY, DON'T MEMORIZE). Equivalent if they produce the same set of tuples; order irrelevant

13.1.1. WHERE CLAUSE

13.1.1.1. **1. Split filters (Break combined filters into separate steps)** Problem: AND, ∧ σ_{Dept='IT' ∧ Salary>60000}(Employee) -> σ_{Salary>60000}(σ_{Dept='IT'}(Employee)) ------------------------------ SELECT * FROM Employee WHERE Dept = 'IT' AND Salary > 60000; -> SELECT * FROM (SELECT * FROM Employee WHERE Dept = 'IT') AS E WHERE Salary > 60000;

13.1.1.1.1. **2. Swap filters** Problem: order of two selections σ_{Dept='IT'}(σ_{Salary>60000}(Employee)) -> σ_{Salary>60000}(σ_{Dept='IT'}(Employee)) ------------------------------ SELECT * FROM Employee WHERE Dept = 'IT' AND Salary > 60000; -> SELECT * FROM Employee WHERE Salary > 60000 AND Dept = 'IT';

13.1.2. SELECT CLAUSE

13.1.2.1. **3. Skip extra projections** Problem: Multiple π π_{name}(π_{name,salary}(Employee)) -> π_{name}(Employee) ----------------------- SELECT name FROM (SELECT name, salary FROM Employee) AS E; -> SELECT name FROM Employee;

13.1.3. JOIN

13.1.3.1. **7. SELECTION PUSHDOWN** Problem: big intermediate results with the selection applied afterwards. Solution: reduces the number of rows early on. σ_{Salary>60000}(Employee ⨝ Department) -> (σ_{Salary>60000}(Employee)) ⨝ Department a) where the selection only includes attributes of one relation b) where selection 1 includes attributes of R1 and selection 2 includes attributes of R2 ---------------------------------- SELECT * FROM (SELECT * FROM Employee JOIN Department ON dept_id = id) AS T WHERE Salary > 60000; -> SELECT * FROM (SELECT * FROM Employee WHERE Salary > 60000) AS E JOIN Department ON E.dept_id = Department.id; Applies the selection on Employee before the join

13.1.3.1.1. **8. PROJECTION PUSHDOWN** Reduces columns early / eliminates unneeded attributes from the intermediate results. π_{a,b}(R ⨝ S) = π_{a,b}(π_{a,x}(R) ⨝ π_{b,y}(S)) π_{name}((Employee ⨝ Department)) -> π_{name}((π_{name,dept_id}(Employee)) ⨝ (π_{id}(Department))) ----------------------- The products and categories tables both have wide schemas with 15+ columns, but this query only needs two fields from each: SELECT p.name, c.category_name FROM products p JOIN categories c ON ... -> SELECT p.name, c.category_name FROM ( SELECT name, category_id FROM products) p JOIN ( SELECT category_id, category_name FROM categories) c By projecting only the necessary columns before the join, the query processes smaller intermediate results. This can reduce memory and I/O usage, particularly when working with wide tables.

13.1.4. SIZE **Do the most restrictive join/selection (smallest result size) first**

13.1.4.1. **5. Swap join order** (just one join) Theta joins are commutative. Problem: the departments table has 20 rows, while employees has 5000 rows; the two are joined on department_id. Solution: putting the smaller table (departments) first may improve performance, particularly in indexed nested-loop joins or hash joins where the smaller table is used to build the hash table. A ⨝ B -> B ⨝ A ---------------- SELECT * FROM A JOIN B ON A.id = B.id; -> SELECT * FROM B JOIN A ON A.id = B.id;

13.1.4.1.1. **6. Group joins differently** (2+ joins) **Note the connection via the PK.** The task says: the courses table has 200 rows, enrollments has 20,000 rows, and students has 5,000 rows. The tables are joined on course_id and student_id. (A ⨝ B) ⨝ C -> A ⨝ (B ⨝ C) FROM students s JOIN enrollments e ... becomes FROM courses c JOIN enrollments e ...

13.1.5. Heuristics to optimize

13.1.5.1. Join Order (Rule 6a): join the smaller tables E1, E2 first and then join with the bigger E3. Join tables that are related by ID; this avoids meaningless intermediate Cartesian-product results. USED FOR SELECTION PUSHDOWN

13.1.5.1.1. Selection Pushdown (Rule 7): move the selection to earlier steps = fewer rows. It is really just moving parentheses: σ_{dept_name='Music'}(instructor)

13.2. Cost calculation

13.2.1. We know the # of tuples of the input tables but not those of intermediate results; estimate them to know how to optimize

13.2.1.1. **Distinct Values**

13.2.1.1.1. n = # of tuples; f = # of tuples fitting in a block; b = # of blocks = ⌈n/f⌉ (reading the relation takes that many page reads); V(Attribute, Relation) = # of distinct values (V is usually given)

13.2.1.2. **Selection size**

13.2.1.2.1. c = # of matching tuples of selection

13.2.1.3. **Join SIze**

13.2.1.3.1. Cartesian Product (No Matching Attributes) # of tuples= nr ⋅ns

13.2.1.3.2. Inner join, case 1: the join attribute is not a key in either relation. 1. Estimate for R: # of tuples = nr*ns / V(A,s), where V = distinct values in S. 2. Estimate the same for S. 3. Take the minimum

13.2.1.3.3. Outer join. Left outer: # of tuples = size of (r ⋈ s) + size of r; analogously for right outer with s. Full outer join: tuples = size of join + size of r + size of s
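The estimates above as arithmetic. The tuple counts reuse the running example (5000 and 10000 tuples); the V() values are made up here for illustration:

```python
# Tuple counts from the running example; the V() values are assumptions.
n_r, n_s = 5000, 10000
V_A_r, V_A_s = 50, 40   # distinct join-attribute values in r and s

est_r = n_r * n_s / V_A_s   # estimate from r's side
est_s = n_r * n_s / V_A_r   # estimate from s's side
inner_join_est = min(est_r, est_s)

cartesian = n_r * n_s                    # no matching attributes
left_outer_est = inner_join_est + n_r    # pad unmatched r tuples
full_outer_est = inner_join_est + n_r + n_s
```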

13.2.1.4. **Size of projections and aggregation**

13.2.1.4.1. for both its the number of distinct values V(A, r)

13.3. Decorrelation: Nested subqueries are replaced by Query with a join

13.3.1. NO: SELECT name FROM instructor WHERE EXISTS ( SELECT * FROM teaches WHERE instructor.ID = teaches.ID AND teaches.year = 2007 )

13.3.1.1. Better: SELECT name FROM instructor, teaches WHERE instructor.ID = teaches.ID AND teaches.year = 2007;

13.3.1.1.1. Problem with subqueries: the subquery runs once for each row of the outer query, causing high I/O.

14. Transactions and Concurrency

14.1. Big Problem: Inconsistency: the database's integrity constraints (rules like primary keys, foreign keys, or domain restrictions) are violated, or the stored values don't match the real-world facts they're supposed to represent. We cannot trust the DB anymore

14.1.1. Concurrency: Parallel executions of multiple transactions Concurrency **control** - **What:** ensures that multiple transactions can be executed simultaneously without leading to data inconsistency. -**Mechanism** It prevents conflicts by coordinating access to shared data items (locks) -**Goal** guarantees that the result of concurrent executions is equivalent to some serial execution

14.1.1.1. But we also want transactions to be concurrent for more efficient resource use and reduced wait times: one T uses the CPU while another is reading/writing to disk

14.1.1.1.1. **Tradeoff** concurrency (parallelism and performance) vs consistency

14.2. Transaction: Transaction = unit of execution that may read/update multiple data items.

14.2.1. Eg: Transaction to transfer 50 euro from account A to account B

14.2.1.1. Two sources of Problems: hardware failure and system crashes AND concurrency

14.2.1.1.1. **Requirements for Transactions**: ACID

14.2.1.1.2. States: active: initial state, T stays active; partially committed: final statement (read, write) is done; committed: successful completion, notify user; failed: discovers that normal execution can no longer proceed; aborted: T is rolled back and the DB restored
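The 50-euro transfer from 14.2.1 as an atomic unit in sqlite3, with explicit BEGIN/ROLLBACK/COMMIT. A simulated failure between the debit and the credit rolls everything back and leaves the balances untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE acct (name TEXT PRIMARY KEY, bal INTEGER)")
conn.executemany("INSERT INTO acct VALUES (?, ?)", [("A", 100), ("B", 0)])

def balances():
    return dict(conn.execute("SELECT name, bal FROM acct"))

# Failed attempt: debit happens, then a "crash" -> roll everything back
conn.execute("BEGIN")
conn.execute("UPDATE acct SET bal = bal - 50 WHERE name = 'A'")
conn.execute("ROLLBACK")                    # atomicity: all or nothing
after_abort = balances()

# Successful attempt: both updates, then commit
conn.execute("BEGIN")
conn.execute("UPDATE acct SET bal = bal - 50 WHERE name = 'A'")
conn.execute("UPDATE acct SET bal = bal + 50 WHERE name = 'B'")
conn.execute("COMMIT")
after_commit = balances()
```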

14.3. Schedules= order in which instructions of concurrent transactions are executed

14.3.1. Schedule type: serial. Transactions run one after another, first T1 then T2; has a commit at the end

14.3.2. Schedule type: interleaved. T1 is interrupted by T2 but may still be consistent; commit at the end

14.3.2.1. concurrent schedule is serializable if outcome is equivalent to a serial schedule

14.3.2.1.1. If instructions are non-conflicting, we can swap their order, turning schedule S into schedule S'; the two schedules are then **conflict equivalent** and S is called **conflict serializable**. Meaning: we can reorder operations as long as conflicting pairs keep their relative order (T2 after T1) across all dependencies. We can thus obtain a serial schedule, so the interleaved schedule is consistent: its outcome is the same as a serial execution

14.3.3. Recoverable Schedules

14.3.3.1. Problem: if T1 reads data written by another uncommitted transaction, there's a risk that if the first one later aborts, the second one has already used bad data (data that doesn't exist)

14.3.3.1.1. Goal: Make sure no committed transaction has ever used uncommitted data from another transaction.

14.3.3.2. Problem: cascading rollback. One abort forces others to abort: if T10 aborts, all the other transactions reading and writing Q need to be rolled back. We lose a lot of work already done

14.3.3.2.1. **Cascadeless schedules**: reads happen only after writers commit

14.3.4. Concurrency Control in DBMS

14.3.4.1. are mechanisms that ensure that schedules are both conflict serializable (but then less parallelism) and recoverable, preferably cascadeless

14.3.4.1.1. tradeoff between degree of parallelism and overhead

15. Recovery

15.1. Failure classification

15.1.1. 1.Transaction Failure: Logical errors (e.g., invalid input). System errors (e.g., deadlock). 2.System Crash: Power failure or hardware/software crash Fail-stop assumption: non-volatile storage intact. 3.Disk Failure: Physical damage, detected via checksums.

15.2. Storage Structure

15.2.1. Does it survive a system crash?

15.2.1.1. Volatile storage: RAM, cache (lost on crash). Non-volatile: disk, flash (survives crash, may fail and lose data). Stable storage: fictional, approximated by redundant copies.

15.2.1.1.1. Stable Storage Approximation Maintain multiple copies on separate disks and buildings to be protected against floods Problem: Failure during data transfer can still result in inconsistent copies Handle partial writes using: Sequential writes to copies. Detection & correction of inconsistencies (checksums, RAID).

15.3. Recovery and atomicity

15.3.1. How can we ensure atomicity despite failures? Go back to buffering to understand

15.3.1.1. Solution: log-based recovery mechanism. First: output information describing the modification to stable storage. Second: modify the database

15.3.1.1.1. Meaning: we wait until T was successful (commit), write the log records to stable storage, and only then modify the DB. This allows us to recover up to this point

15.4. **Undo/redo**

15.4.1. Undo: restore old values; used for transaction rollbacks during normal operation and during recovery from failure. Redo: reapply new values (no undo logging needed in this case); used for recovery after a system failure

15.4.1.1. **recovery after failure**

15.4.1.1.1. T needs to be **undone** if the log **doesn't contain commit or abort**: <T0 start> <T0, A, 1000, 950> <T0, B, 2000, 250> After the failure: because there is no commit, the old values are restored and the undo operations are written to the log

15.4.1.1.2. A Compensation Log Record (CLR): used in recovery to record the undoing of an operation during the recovery process.

15.5. recovery algo

15.5.1. After Failure: Redo Phase: Start from last checkpoint. Redo all actions from active or post-checkpoint transactions. Undo Phase: Rollback incomplete transactions.

15.5.1.1. Redo phase: Find last <checkpoint L> record, and set undo-list to L. Scan forward from above <checkpoint L> record: Whenever a record <Tᵢ, Xⱼ, V₁, V₂> is found: redo it by writing V₂ to Xⱼ. Whenever a log record <Tᵢ start> is found: add Tᵢ to undo-list. Whenever a log record <Tᵢ commit> or <Tᵢ abort> is found: remove Tᵢ from undo-list. Undo phase: Scan log backwards from end. When a log record <Tᵢ, Xⱼ, V₁, V₂> is found where Tᵢ is in undo-list: perform same actions as for transaction rollback: perform undo by writing V₁ to Xⱼ. write a log record <Tᵢ, Xⱼ, V₁>. When a log record <Tᵢ start> is found where Tᵢ is in undo-list: Write a log record <Tᵢ abort>. Remove Tᵢ from undo-list. Stop when undo-list is empty i.e., <Tᵢ start> has been found for every transaction in undo-list. After undo phase completes, normal transaction processing can commence.

15.5.2. Checkpoints: It flushes all (dirty) pages in the buffer pool to disk at that moment. flushes all logs from buffer to disk pages flushed at checkpoint may belong to uncommitted transactions checkpoint log entry includes a list of all active transactions at the time the checkpoint is taken.: (checkpoint {T5, T6}) Only **transactions active at the** **checkpoint or started** **after it need to be redone or** **undone** during recovery

15.5.2.1. **All log records in memory and dirty pages are flushed to stable storage during a checkpoint.**

15.5.2.1.1. makes recovery faster

15.6. Failure with loss of non-volatile (stable) Storage

15.6.1. Use dumping: Copy DB to stable storage. Recovery = restore dump + redo committed post-dump transactions.

15.7. Remote backup systems

15.7.1. Allow continued transaction processing after primary-site failure. Detection: heartbeat messages. Takeover: the backup becomes the new primary until the failed site recovers. Minimizes risk

15.8. WAL Protocol: WAL guarantees that all log entries for T1, including the commit record and the update ⟨T1, Y, old, new⟩, have been written to stable storage before the lock is released. If the system crashes before Y is written to disk, the log can be used to redo the update. Therefore, other transactions can safely access Y after the commit, even if the updated value has not yet reached disk

15.8.1. Guarantee Atomicity and Durability by making sure changes are logged before the database itself is updated.

15.8.1.1. Log records must be output to stable storage in the order they were created: consistent, sequential history

15.8.1.1.1. transaction may enter the commit state only after its <commit> log record has been flushed to stable storage: durability: once a transaction commits, its effects must survive a crash. Flushing the <commit> log record ensures recovery will redo its operations if necessary

15.8.2. Difference to checkpoints: A checkpoint forces: All in-memory log records to stable storage. All dirty pages (Pages with updates but no commit) to disk. This isn’t required by WAL for correctness, but it’s done to limit recovery time (so you don’t need to scan the entire log).

16. Applications

16.1. Security implications. Goal: dont get hacked

16.1.1. Use JavaScript: it runs client-side in the browser sandbox ("safe mode"), not directly connected to the server

16.1.1.1. SQL Injection: user input includes a partial SQL query. These inputs might evaluate to true, letting an attacker log into an account, e.g. with 1 "equals" 1

16.1.1.1.1. Solution
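The standard fix is parameterized queries. A sketch of the attack and the defence in sqlite3 (the users table and credentials are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pw TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'secret')")

name, pw = "alice", "' OR '1'='1"   # attacker-controlled input

# VULNERABLE: string concatenation; the injected OR '1'='1' makes the
# WHERE clause true for every row -> login bypass
unsafe = conn.execute(
    f"SELECT * FROM users WHERE name = '{name}' AND pw = '{pw}'").fetchall()

# SAFE: placeholders; the input is treated as a literal value, not as SQL
safe = conn.execute(
    "SELECT * FROM users WHERE name = ? AND pw = ?", (name, pw)).fetchall()
```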

16.1.1.2. cross site scripting

16.1.1.2.1. HTML code on one page executes an action on another page

16.1.1.3. Password leakage

16.1.1.3.1. Problem: storing passwords in a database or anywhere accessible to users (e.g. in a script from the webserver)

16.1.1.4. Application authentication

16.1.1.4.1. Single-factor authentication (only a PW) is risky: users tend to reuse one PW for many sites, or spyware captures the PW

16.1.1.5. Application-level authorization

16.1.1.5.1. Fine-grained authorization, e.g. a user can only view his own row; I cannot see the grades of other students

16.1.1.6. Audit trails

16.1.1.6.1. A log to track who and what breached security and what led to the error, in order to repair the damage caused. Needs to exist at both the DB level and the application level