Practical SQL (Mind Map)

1. CHAPTER 9: Inspecting and Modifying Data

1.1. Importing data on meat, poultry, and egg producers

1.2. Interviewing the data set

1.2.1. Purpose

1.2.1.1. Discover details

1.2.1.2. Determine what data contains

1.2.1.3. Understand what questions can be answered

1.2.1.4. Assess suitability for analysis

1.2.2. Techniques

1.2.2.1. Aggregate queries

1.2.2.2. GROUP BY

1.2.2.3. HAVING clause

1.2.2.4. COUNT()

1.2.2.5. Filtering

1.2.3. Example Analysis

1.2.3.1. Dataset: meat_poultry_egg_inspect

1.2.3.2. Question: Are companies listed multiple times at same address?
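The question above can be answered with the chapter's aggregate techniques; a sketch, assuming the book's meat_poultry_egg_inspect columns (company, street, city, st):

```sql
-- Companies that appear more than once at the same address
SELECT company, street, city, st, count(*) AS address_count
FROM meat_poultry_egg_inspect
GROUP BY company, street, city, st
HAVING count(*) > 1
ORDER BY company, street, city, st;
```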

1.2.4. Best Practices

1.2.4.1. Never make assumptions

1.2.4.2. Validate data before drawing conclusions

1.2.4.3. Investigate anomalies

1.2.5. Checking for Missing Values

1.2.5.1. Purpose

1.2.5.1.1. Verify data completeness

1.2.5.1.2. Identify missing values

1.2.5.1.3. Understand data quality issues

1.2.5.2. SQL Techniques

1.2.5.2.1. COUNT() function

1.2.5.2.2. GROUP BY clause

1.2.5.2.3. ORDER BY sorting

1.2.5.2.4. WHERE with IS NULL

1.2.5.2.5. Handling NULLs in sorting (NULLS FIRST/LAST)
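These techniques combine into queries like the following (a sketch against the same table; NULLS FIRST surfaces missing state codes at the top):

```sql
-- Count rows per state code; rows with a missing st sort first
SELECT st, count(*) AS st_count
FROM meat_poultry_egg_inspect
GROUP BY st
ORDER BY st NULLS FIRST;

-- List the establishments whose state code is missing
SELECT establishment_number, company, city
FROM meat_poultry_egg_inspect
WHERE st IS NULL;
```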

1.2.5.3. Data Cleanup

1.2.5.3.1. Identifying missing values

1.2.5.3.2. Tracking data cleansing tasks

1.2.5.3.3. Source verification procedures

1.2.6. Checking for Inconsistent Data Values

1.2.6.1. Problem Description

1.2.6.1.1. Inconsistent data hampers analysis

1.2.6.1.2. Multiple spellings lead to inaccurate aggregation

1.2.6.1.3. Variations in names create data quality issues

1.2.6.2. SQL Techniques

1.2.6.2.1. GROUP BY clause

1.2.6.2.2. COUNT() function

1.2.6.2.3. ORDER BY to sort results for easy scanning

1.2.6.2.4. Visual inspection of unduplicated values

1.2.6.3. Solution Strategy

1.2.6.3.1. Standardize company names

1.2.6.3.2. Create list of items to fix

1.2.6.3.3. Ensure proper grouping for future aggregations
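To spot inconsistent spellings, group on the column and visually scan the unduplicated values; a sketch:

```sql
-- Unduplicated company names with counts, sorted for easy scanning
SELECT company, count(*) AS company_count
FROM meat_poultry_egg_inspect
GROUP BY company
ORDER BY company ASC;
```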

1.3. Modifying tables, columns, and data

1.3.1. Modifying Database Elements

1.3.1.1. SQL Commands

1.3.1.2. Use Cases

1.3.1.3. Modify Process

1.3.2. When to Toss Your Data

1.3.2.1. Warning Signs

1.3.2.2. Steps to Evaluate

1.3.2.3. Making the Call

1.3.3. Modifying Tables with ALTER TABLE

1.3.3.1. Add Column

1.3.3.2. Drop Column

1.3.3.3. Change Data Type

1.3.3.4. Add NOT NULL

1.3.3.5. Remove NOT NULL
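Each of the five operations maps to an ALTER TABLE form; illustrative PostgreSQL statements with assumed column names:

```sql
ALTER TABLE meat_poultry_egg_inspect ADD COLUMN reviewed_date date;              -- add column
ALTER TABLE meat_poultry_egg_inspect DROP COLUMN reviewed_date;                  -- drop column
ALTER TABLE meat_poultry_egg_inspect ALTER COLUMN zip SET DATA TYPE varchar(5);  -- change data type
ALTER TABLE meat_poultry_egg_inspect ALTER COLUMN company SET NOT NULL;          -- add NOT NULL
ALTER TABLE meat_poultry_egg_inspect ALTER COLUMN company DROP NOT NULL;         -- remove NOT NULL
```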

1.4. Modifying Values with UPDATE

1.4.1. Basic Syntax

1.4.2. Multiple Columns

1.4.3. Targeted Updates

1.4.4. Cross-Table Updates
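The four UPDATE variants as generic templates (placeholder table and column names):

```sql
-- Basic syntax: sets every row
UPDATE table_name SET column_a = value;

-- Multiple columns at once
UPDATE table_name SET column_a = value, column_b = value_2;

-- Targeted update with WHERE
UPDATE table_name SET column_a = value WHERE criteria;

-- Cross-table update: pull values from another table
UPDATE table_name
SET column_a = (SELECT column_b FROM table_b
                WHERE table_name.id = table_b.id)
WHERE EXISTS (SELECT column_b FROM table_b
              WHERE table_name.id = table_b.id);
```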

1.5. Creating Backup Tables

1.5.1. Purpose

1.5.1.1. Reference for future use

1.5.1.2. Protection against accidental data loss

1.5.1.3. Safety measure before making table modifications

1.5.2. Syntax

1.5.3. Verification

1.5.3.1. Compare row counts

1.5.3.2. Use single query to check both tables

1.5.4. Limitations

1.5.4.1. Indexes not copied

1.5.4.2. Need to create new indexes on backup tables

1.5.4.3. Consider performance impacts if querying backup table
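Creating and verifying a backup can be sketched like this; the single verification query counts both tables at once:

```sql
-- Copy the table and all its rows (indexes are NOT copied)
CREATE TABLE meat_poultry_egg_inspect_backup AS
SELECT * FROM meat_poultry_egg_inspect;

-- Verify by comparing row counts in one query
SELECT
    (SELECT count(*) FROM meat_poultry_egg_inspect) AS original,
    (SELECT count(*) FROM meat_poultry_egg_inspect_backup) AS backup;
```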

1.6. Restoring Missing Column Values

1.6.1. Creating a Column Copy

1.6.1.1. SQL Operations

1.6.1.1.1. Add Column

1.6.1.1.2. Copy data

1.6.1.2. Verification process

1.6.1.2.1. Query both columns

1.6.1.2.2. Verify results

1.6.1.3. Benefits

1.6.1.3.1. Extra caution beyond table backup

1.6.1.3.2. Maintains original data in same table

1.6.1.3.3. Allow safe updates to original column

1.6.1.3.4. Easy recovery if updates cause problems

1.6.1.3.5. Patient, cautious approach

1.6.1.4. Usage context

1.6.1.4.1. Used with table that already has a full backup

1.6.1.4.2. Specifically column containing state codes

1.6.1.4.3. Preparation for updating missing data

1.6.2. Updating Rows Where Values Are Missing

1.6.2.1. Problem identification

1.6.2.1.1. Missing state codes in 'st' column

1.6.2.1.2. Three rows affected

1.6.2.1.3. Need complete state data for counts

1.6.2.2. Research process

1.6.2.2.1. Online search for missing information

1.6.2.2.2. Find state for each establishment

1.6.2.3. SQL solution

1.6.2.3.1. UPDATE with WHERE clause

1.6.2.4. Verification

1.6.2.4.1. Rerun query to find NULL values
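The fix is one targeted UPDATE per establishment, followed by rerunning the NULL check; a sketch (the establishment number shown is a placeholder):

```sql
-- Fill in a researched state code for one establishment
UPDATE meat_poultry_egg_inspect
SET st = 'MN'
WHERE establishment_number = 'V18677A';  -- placeholder id

-- Verify: this should now return no rows
SELECT establishment_number
FROM meat_poultry_egg_inspect
WHERE st IS NULL;
```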

1.6.3. Restoring Original Values

1.6.3.1. Why restore

1.6.3.1.1. Incorrect values provided

1.6.3.1.2. Wrong rows updated

1.6.3.1.3. Need to undo changes

1.6.3.1.4. Data safety measure

1.6.3.2. Backup options

1.6.3.2.1. In-table backup column

1.6.3.2.2. Full table backup

1.6.3.3. Restoration methods

1.6.3.4. Implementation details

1.6.3.4.1. Both methods use UPDATE statements

1.6.3.4.2. Method 1 is simpler

1.6.3.4.3. Method 2 joins tables using primary key

1.6.4. Updating Values for Consistency

1.6.4.1. Problem identification

1.6.4.1.1. Inconsistent company name spellings

1.6.4.1.2. Multiple variations for same company

1.6.4.2. Data protection strategy

1.6.4.2.1. Create new column for standardized values

1.6.4.2.2. Preserve original data in original column

1.6.4.2.3. Work with copy to avoid data loss

1.6.4.3. Standardization process

1.6.4.3.1. Use UPDATE with WHERE clause

1.6.4.3.2. Use LIKE with wildcard pattern matching

1.6.4.3.3. Set all matches to standardized name format

1.6.4.4. Verification

1.6.4.4.1. SELECT query to compare original and standardized values

1.6.4.4.2. Confirm all variations now standardized

1.6.4.4.3. Keep original column for reference
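The standardization steps above, sketched with the chapter's Armour example (column size assumed):

```sql
-- Work on a copy so the originals survive
ALTER TABLE meat_poultry_egg_inspect ADD COLUMN company_standard varchar(100);
UPDATE meat_poultry_egg_inspect SET company_standard = company;

-- Collapse every spelling variant to one standardized name
UPDATE meat_poultry_egg_inspect
SET company_standard = 'Armour-Eckrich Meats'
WHERE company LIKE 'Armour%';

-- Verify: compare original and standardized values
SELECT company, company_standard
FROM meat_poultry_egg_inspect
WHERE company LIKE 'Armour%';
```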

1.6.5. Repairing ZIP Codes Using Concatenation

1.6.5.1. Problem identification

1.6.5.1.1. Missing leading zeros in ZIP codes

1.6.5.2. Data safety approach

1.6.5.2.1. Create backup column

1.6.5.2.2. Copy all existing ZIP values

1.6.5.2.3. Work with original and backup

1.6.5.3. Concatenation technique

1.6.5.3.1. Double-pipe operator (||)

1.6.5.3.2. Combines strings

1.6.5.4. Implementation steps

1.6.5.4.1. Backup ZIP data

1.6.5.4.2. Fix codes needing repair

1.6.5.4.3. Verification

1.6.5.4.4. Additional applications
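The repair, sketched end to end (the state lists are assumed to be those whose ZIP codes begin with zeros):

```sql
-- 1. Back up the column
ALTER TABLE meat_poultry_egg_inspect ADD COLUMN zip_copy varchar(5);
UPDATE meat_poultry_egg_inspect SET zip_copy = zip;

-- 2. Restore two lost zeros, then one, using || concatenation
UPDATE meat_poultry_egg_inspect
SET zip = '00' || zip
WHERE st IN ('PR', 'VI') AND length(zip) = 3;

UPDATE meat_poultry_egg_inspect
SET zip = '0' || zip
WHERE st IN ('CT', 'MA', 'ME', 'NH', 'NJ', 'RI', 'VT') AND length(zip) = 4;

-- 3. Verify: no short ZIPs should remain
SELECT count(*) FROM meat_poultry_egg_inspect WHERE length(zip) < 5;
```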

1.6.6. Updating Values Across Tables

1.6.6.1. Concept

1.6.6.1.1. Use data from one table to update another

1.6.6.1.2. Leverages table relationships

1.6.6.1.3. Requires matching keys between tables

1.6.6.2. Table structure

1.6.6.2.1. Main table

1.6.6.2.2. Region table

1.6.6.3. Update technique

1.6.6.3.1. UPDATE with table alias

1.6.6.3.2. WHERE EXISTS with subquery

1.6.6.3.3. Join table via matching columns

1.6.6.3.4. Filter by region value
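A sketch of the cross-table technique, assuming a state_regions lookup table joined on the st column (the date value is illustrative):

```sql
-- Set a value in the main table only for rows whose state
-- falls in a given region of the lookup table
UPDATE meat_poultry_egg_inspect inspect
SET inspection_date = '2019-12-01'
WHERE EXISTS (
    SELECT state_regions.region
    FROM state_regions
    WHERE inspect.st = state_regions.st
      AND state_regions.region = 'New England'
);
```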

1.7. Deleting unnecessary data

1.7.1. Deleting Rows from a Table

1.7.1.1. DELETE FROM

1.7.1.1.1. Removes all rows from a table

1.7.1.1.2. Syntax: DELETE FROM table_name

1.7.1.1.3. No WHERE clause needed

1.7.1.2. DELETE WITH WHERE

1.7.1.2.1. Removes specific rows

1.7.1.2.2. Requires WHERE clause with filtering expression

1.7.1.2.3. Syntax: DELETE FROM table_name WHERE expression

1.7.1.2.4. Can use comparison operators (=, <>, IN)

1.7.1.2.5. Can filter with multiple conditions using AND or OR

1.7.1.3. Alternative methods

1.7.1.3.1. DROP TABLE followed by CREATE TABLE

1.7.1.3.2. Faster for tables with large number of rows

1.7.1.3.3. Recreates empty table structure

1.7.2. Deleting a Column from a Table

1.7.2.1. Syntax

1.7.2.1.1. ALTER TABLE table_name DROP COLUMN column_name

1.7.2.1.2. Requires table name and column name

1.7.2.1.3. Table must exist in database

1.7.2.1.4. Column must exist in table

1.7.2.2. Use case example

1.7.2.2.1. Removing backup columns after data fixes

1.7.2.3. Effects

1.7.2.3.1. Removes column completely from table

1.7.2.3.2. Deletes all data in that column

1.7.2.3.3. Cannot be undone without backup

1.7.2.3.4. Table structure is modified

1.7.3. Deleting a Table from a Database

1.7.3.1. Syntax

1.7.3.1.1. DROP TABLE table_name

1.7.3.1.2. Simple ANSI SQL statement

1.7.3.1.3. Only requires table name

1.7.3.2. Use cases

1.7.3.2.1. Remove backup tables

1.7.3.2.2. Delete working tables that are no longer needed

1.7.3.3. Effects

1.7.3.3.1. Completely removes table from database

1.7.3.3.2. Deletes all data in the table

1.7.3.3.3. Cannot be undone without backup

1.7.3.3.4. Removes table structure and schema

1.8. Using transaction blocks to save or revert changes

1.8.1. Core concept

1.8.1.1. Group of SQL statements treated as a unit

1.8.1.2. Change not final until committed

1.8.1.3. Can be canceled before committing

1.8.2. Transaction workflow

1.8.2.1. Start transaction

1.8.2.2. Execute SQL statements

1.8.2.3. Review results

1.8.2.4. Decide to commit or rollback

1.8.3. Keywords

1.8.3.1. START TRANSACTION

1.8.3.2. COMMIT

1.8.3.3. ROLLBACK
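The workflow with PostgreSQL keywords; the misspelled company name (from the chapter's example) is deliberate, to show why a rollback matters:

```sql
START TRANSACTION;

UPDATE meat_poultry_egg_inspect
SET company = 'AGRO Merchantss Oakland LLC'   -- typo slipped in
WHERE company = 'AGRO Merchants Oakland LLC';

-- Review the result, then cancel the uncommitted change...
ROLLBACK;

-- ...or, once the statement looks right, make it permanent:
-- COMMIT;
```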

1.9. Improving performance when updating large tables

1.9.1. Problem with row updates

1.9.1.1. PostgreSQL creates new row versions for each update

1.9.1.2. Old row versions not deleted immediately

1.9.1.3. Results in table inflation

1.9.1.4. Increases disk usage

1.9.1.5. Time consuming for large tables

1.9.2. Copy-and-swap technique

1.9.2.1. Create a copy of the table with the new column already populated

1.9.2.2. Swap the table names so the new copy becomes the main table

1.9.2.3. Original becomes a backup

1.9.3. Benefits

1.9.3.1. Avoids table inflation

1.9.3.2. Reduces disk space usage

1.9.3.3. Fast for tables with hundreds of thousands or millions of rows

1.9.3.4. Eliminates need for immediate VACUUM cleanup

1.9.3.5. Clean solution with less overhead
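A sketch of the copy-and-swap (the new column and its value are illustrative):

```sql
-- Build the copy with the new column already populated
CREATE TABLE meat_poultry_egg_inspect_backup AS
SELECT *, '2019-12-01'::date AS reviewed_date
FROM meat_poultry_egg_inspect;

-- Swap names: the copy becomes the main table, the original the backup
ALTER TABLE meat_poultry_egg_inspect RENAME TO meat_poultry_egg_inspect_temp;
ALTER TABLE meat_poultry_egg_inspect_backup RENAME TO meat_poultry_egg_inspect;
ALTER TABLE meat_poultry_egg_inspect_temp RENAME TO meat_poultry_egg_inspect_backup;
```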

2. CHAPTER 10: Statistical Functions in SQL

2.1. Creating a census stats table

2.1.1. Data structure

2.1.1.1. Table name: acs_2011_2015_stats

2.1.2. Data source

2.1.2.1. US Census Bureau

2.1.2.2. American Community Survey

2.1.2.3. 5-year estimates

2.1.3. SQL creation code

2.1.3.1. CREATE TABLE statement

2.1.3.2. Column definitions

2.1.3.3. Constraints setup

2.1.3.4. CHECK constraint

2.1.4. Data import process

2.1.5. Measuring Correlation with corr(Y, X)

2.1.5.1. Pearson correlation coefficient

2.1.5.1.1. Measures strength of linear relationship

2.1.5.1.2. Values range from -1 to +1

2.1.5.1.3. Denoted as r

2.1.5.1.4. Shows how change in one variable relates to change in another

2.1.5.2. Interpretation of values

2.1.5.2.1. r = 0 No relationship

2.1.5.2.2. r = +/-0.1 to +/-0.29 weak relationship

2.1.5.2.3. r = +/-0.3 to +/-0.59 Moderate relationship

2.1.5.2.4. r = +/-0.6 to +/-0.99 Strong to nearly perfect relationship

2.1.5.2.5. r = +/-1 Perfect relationship

2.1.5.3. SQL function implementation

2.1.5.3.1. Function corr(Y, X)

2.1.5.3.2. Binary aggregate function

2.1.5.3.3. Y = dependent variable

2.1.5.3.4. X = independent variable

2.1.5.3.5. Order of parameters doesn't matter for results
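A minimal example, assuming the census table's column names for income and education:

```sql
-- r for median household income (Y) vs. percent with a
-- bachelor's degree or higher (X)
SELECT corr(median_hh_income, pct_bachelors_higher) AS bachelors_income_r
FROM acs_2011_2015_stats;
```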

2.1.6. Checking Additional Correlations

2.1.6.1. SQL techniques

2.1.6.1.1. Using the corr(Y, X) function

2.1.6.1.2. Rounding results using round()

2.1.6.1.3. Type casting to numeric with ::

2.1.6.1.4. Single query for multiple correlations

2.1.6.2. Correlation analysis

2.1.6.2.1. Multiple correlation tests

2.1.6.2.2. Comparing variable pairs

2.1.6.2.3. Education, income, commute times

2.1.6.3. Variable pairs & results

2.1.6.3.1. 0.68 Strong

2.1.6.3.2. 0.05 Very weak

2.1.6.3.3. -0.14 Weak, inverse

2.1.6.4. Key findings & caveats

2.1.7. Predicting Values with Regression Analysis

2.1.7.1. Linear regression concept

2.1.7.1.1. Finding the best linear equation

2.1.7.1.2. Creating a best fit straight line

2.1.7.1.3. Least squares regression line

2.1.7.1.4. Describes relationship between variables

2.1.7.2. Regression line formula

2.1.7.2.1. Y = bX + a

2.1.7.3. SQL implementation

2.1.7.3.1. regr_slope(Y, X)

2.1.7.3.2. regr_intercept(Y, X)

2.1.7.3.3. round() function for readability

2.1.7.3.4. ::numeric for type conversion
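Putting the pieces together (same assumed column names as the correlation example):

```sql
-- Slope (b) and intercept (a) of the least squares line Y = bX + a
SELECT
    round(regr_slope(median_hh_income, pct_bachelors_higher)::numeric, 2)
        AS slope,
    round(regr_intercept(median_hh_income, pct_bachelors_higher)::numeric, 2)
        AS y_intercept
FROM acs_2011_2015_stats;
```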

2.1.8. Finding the Effect of an Independent Variable with r-squared

2.1.8.1. R-squared concept

2.1.8.1.1. Coefficient of determination

2.1.8.1.2. Measures percentage of variation explained by the independent variable

2.1.8.1.3. Values range from 0 to 1

2.1.8.1.4. Example

2.1.8.2. Calculation methods

2.1.8.2.1. Square the correlation coefficient (r² = r × r)

2.1.8.2.2. Value always between 0 and 1

2.1.8.2.3. Higher values mean stronger explanation power

2.1.8.3. SQL implementation

2.1.8.4. Limitations & caveats

2.1.8.4.1. Correlation doesn't prove causality

2.2. Creating rankings with SQL

2.2.1. Ranking with rank() and dense_rank()

2.2.1.1. Function concepts

2.2.1.1.1. Window functions

2.2.1.1.2. Calculate across sets of rows

2.2.1.1.3. Return results for each row

2.2.1.1.4. Unlike aggregate functions, don't group rows

2.2.1.1.5. Part of standard ANSI SQL

2.2.1.2. Syntax & implementation

2.2.1.2.1. Function syntax: rank() OVER (ORDER BY column [DESC/ASC])

2.2.1.2.2. dense_rank() OVER (ORDER BY column [DESC/ASC])

2.2.1.2.3. Both use an OVER clause to define the window of rows

2.2.1.2.4. ORDER BY specifies sorting direction

2.2.1.3. Differences in handling ties

2.2.1.3.1. rank(): Creates gaps in rankings after ties

2.2.1.3.2. dense_rank(): No gaps in rankings after ties
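Running both functions over the same sorted column makes the tie handling visible; table and columns follow the chapter's widget example, assumed here:

```sql
SELECT company, widget_output,
       rank() OVER (ORDER BY widget_output DESC),
       dense_rank() OVER (ORDER BY widget_output DESC)
FROM widget_companies
ORDER BY widget_output DESC;
```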

2.2.2. Ranking Within Subgroups with PARTITION BY

2.2.2.1. Concept overview

2.2.2.1.1. Window functions with grouping

2.2.2.1.2. Rank rows within specific groups

2.2.2.1.3. Create separate rankings for each group

2.2.2.1.4. Different from overall ranking

2.2.2.2. Syntax & Implementation

2.2.2.2.1. OVER clause extended with PARTITION BY

2.2.2.2.2. Syntax: rank() OVER (PARTITION BY column ORDER BY column [DESC/ASC])

2.2.2.2.3. Group rows by values in the partition column

2.2.2.3. Use cases & applications

2.2.2.3.1. Rank employees by salary within departments

2.2.2.3.2. Rank movies by earnings within each genre

2.2.2.3.3. Find top vehicle for each auto manufacturer

2.2.2.3.4. Identify wettest month for each year
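The pattern behind these use cases, sketched with a hypothetical store_sales table:

```sql
-- A separate sales ranking within each product category
SELECT category, store, unit_sales,
       rank() OVER (PARTITION BY category ORDER BY unit_sales DESC)
FROM store_sales;
```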

2.3. Calculating rates for meaningful comparisons

2.3.1. Why raw numbers can be misleading

2.3.1.1. Missing context

2.3.1.2. Population differences

2.3.1.3. Raw comparison is not meaningful

2.3.2. Rate calculation method

2.3.2.1. Formula: (count / population) * 1,000

2.3.2.2. Result: Rate per 1000 people
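Applied to crime counts, the formula becomes a query like this (table and column names assumed; the ::numeric cast avoids integer division):

```sql
SELECT city, st, population, property_crime,
       round((property_crime::numeric / population) * 1000, 1)
           AS property_crimes_per_1000
FROM fbi_crime_data_2015
WHERE population >= 500000
ORDER BY property_crimes_per_1000 DESC;
```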

2.3.3. Limitations & considerations

2.3.3.1. FBI discourages creating rankings

2.3.3.2. Misleading perceptions can harm communities

2.3.3.3. Variations due to many factors

2.3.3.4. Incomplete reporting by agencies

2.3.3.5. Despite limitations, still useful for comparison

3. CHAPTER 11: Working with Dates and Times

3.1. Working with time zones

3.1.1. Finding Your Time Zone Setting

3.1.1.1. Default time zone setting

3.1.1.2. Listing time zones

3.1.1.3. Filtering time zones

3.1.1.4. Time zone considerations

3.1.2. Setting the Time Zone

3.1.2.1. Configuration

3.1.2.1.1. Server: postgresql.conf (permanent)

3.1.2.1.2. Client: SET timezone TO 'timezone' (session-only)

3.1.2.2. Data Handling

3.1.2.2.1. Stored as UTC internally

3.1.2.2.2. Display varies by session timezone

3.1.2.2.3. Original data remains unchanged

3.1.2.3. Conversion

3.1.2.3.1. Change session: SET timezone TO 'US/Eastern'

3.1.2.3.2. Per query: SELECT time AT TIME ZONE 'Asia/Seoul'

3.1.2.3.3. Example: 4am Pacific = 7am Eastern = 9pm Seoul

3.1.2.4. Benefits

3.1.2.4.1. Accurate calculations across regions

3.1.2.4.2. Consistent timestamp handling

3.2. Data types and functions for dates and times

3.2.1. Manipulating dates and times

3.2.1.1. Extracting the Components of a timestamp Value

3.2.1.1.1. date_part() function

3.2.1.1.2. Calculated components

3.2.1.1.3. Alternative methods

3.2.1.2. Creating Datetime Values from timestamp Components

3.2.1.2.1. Common use cases

3.2.1.2.2. PostgreSQL functions

3.2.1.2.3. Input types

3.2.1.2.4. Time zone consideration

3.2.1.3. Retrieving the Current Date and Time

3.2.1.3.1. Standard SQL functions

3.2.1.3.2. PostgreSQL functions

3.3. Calculations with dates and times

4. CHAPTER 12: Advanced Query Techniques

4.1. Using subqueries

4.1.1. Filtering with Subqueries in a WHERE Clause

4.1.1.1. Purpose

4.1.1.1.1. Generate filter values dynamically

4.1.1.1.2. Avoid hardcoding comparison values

4.1.1.1.3. Combine multiple queries into one

4.1.1.1.4. Filter by calculated values

4.1.1.2. Implementation

4.1.1.2.1. Enclosed in parentheses after comparison operators

4.1.1.2.2. Example: WHERE column >= (SELECT ...)

4.1.1.2.3. Acts as a value generator for comparison

4.1.1.2.4. Can return single value or list of values

4.1.1.3. Practical Uses

4.1.1.3.1. Population analysis: WHERE population >= (SELECT percentile_cont(.9)...)

4.1.1.3.2. Table optimization: Delete rows based on subquery criteria

4.1.1.3.3. Working with subsets of large datasets

4.1.1.3.4. Dynamic filtering based on dataset characteristics
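The population example in full — a sketch, using the census table from earlier chapters:

```sql
-- Counties at or above the 90th percentile of population
SELECT geo_name, state_us_abbreviation, p0010001
FROM us_counties_2010
WHERE p0010001 >= (
    SELECT percentile_cont(.9) WITHIN GROUP (ORDER BY p0010001)
    FROM us_counties_2010
)
ORDER BY p0010001 DESC;
```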

4.1.2. Creating Derived Tables with Subqueries

4.1.2.1. Definition

4.1.2.1.1. Subquery in FROM clause

4.1.2.1.2. Returns rows and columns as a table

4.1.2.1.3. Must be named with alias

4.1.2.2. Purpose

4.1.2.2.1. Perform multi-step operations in one query

4.1.2.2.2. Enable operations impossible in single query

4.1.2.2.3. Create intermediate results for further analysis

4.1.2.2.4. Process data before final selection

4.1.3. Joining Derived Tables

4.1.3.1. Definition

4.1.3.1.1. Multiple subqueries in FROM clause

4.1.3.1.2. Each subquery produces a derived table

4.1.3.1.3. Tables connected via JOIN statements

4.1.3.2. Purpose

4.1.3.2.1. Perform complex multi-step data processing

4.1.3.2.2. Preprocess data from different sources

4.1.3.2.3. Create intermediate aggregations

4.1.3.2.4. Calculate rates from preprocessed data

4.1.3.3. Structure

4.1.3.3.1. Multiple derived tables: (SELECT...) AS alias

4.1.3.3.2. Join condition: ON table1.column = table2.column

4.1.3.3.3. Final calculations in main query SELECT

4.1.4. Generating Columns with Subqueries

4.1.4.1. Definition

4.1.4.1.1. Subquery placed in SELECT clause column list

4.1.4.1.2. Returns single values for each row

4.1.4.1.3. Creates calculated columns dynamically

4.1.4.2. Implementation

4.1.4.2.1. Format: SELECT column1, (SELECT...) AS new_column

4.1.4.2.2. Typically uses aggregate functions in subquery

4.1.4.2.3. Can be used in calculations with other columns

4.1.4.2.4. Can appear multiple times in same query

4.1.4.3. Use Cases

4.1.4.3.1. Add reference values to each row (e.g., median)

4.1.4.3.2. Calculate differences from aggregates

4.1.4.3.3. Create comparison metrics

4.1.4.3.4. Generate derived statistics

4.2. Common table expressions

4.2.1. Definition

4.2.1.1. Temporary named result sets

4.2.1.2. Created with WITH clause

4.2.1.3. Also called "WITH clause"

4.2.1.4. Recent addition to SQL standard

4.2.2. Syntax

4.2.2.1. WITH table_name (columns) AS (subquery)

4.2.2.2. Column definitions are optional

4.2.2.3. No data type definitions needed

4.2.2.4. Multiple CTEs can be defined at once

4.2.3. Use Cases

4.2.3.1. Simplifying complex queries

4.2.3.2. Replacing repetitive subqueries

4.2.3.3. Joining preprocessed data sets

4.2.3.4. Creating multiple temporary tables
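The syntax and a typical use case together (census table and population threshold assumed):

```sql
-- Name a temporary result set, then query it like a table
WITH large_counties (geo_name, st, p0010001) AS (
    SELECT geo_name, state_us_abbreviation, p0010001
    FROM us_counties_2010
    WHERE p0010001 >= 100000
)
SELECT st, count(*)
FROM large_counties
GROUP BY st
ORDER BY count(*) DESC;
```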

4.3. Cross tabulations

4.4. Reclassifying values with CASE

4.5. Using CASE in a common table expression

5. CHAPTER 13: Mining Text to Find Meaningful Data

5.1. Formatting text using string functions

5.1.1. Case Formatting

5.1.1.1. upper

5.1.1.1.1. Capitalizes all alphabetical characters

5.1.1.1.2. Example: upper('Neal7') → 'NEAL7'

5.1.1.2. lower

5.1.1.2.1. Converts all alphabetical characters to lowercase

5.1.1.2.2. Example: lower('Randy') → 'randy'

5.1.1.3. initcap

5.1.1.3.1. Capitalizes first letter of each word

5.1.1.3.2. Example: initcap('at the end of the day') → 'At The End Of The Day'

5.1.2. Character Information

5.1.2.1. char_length(string)

5.1.2.1.1. Counts total characters in string

5.1.2.1.2. Includes spaces in count

5.1.2.1.3. Example: char_length(' Pat ') → 5

5.1.2.2. length(string)

5.1.2.2.1. Alternative to char_length()

5.1.2.2.2. Non-ANSI SQL function

5.1.2.2.3. Has variant for binary strings

5.1.2.3. position(substring in string)

5.1.2.3.1. Finds location of substring in string

5.1.2.3.2. Returns index position (starting at 1)

5.1.2.3.3. Example: position(', ' in 'Tan, Bella') → 4

5.1.3. Removing Characters

5.1.3.1. trim()

5.1.3.1.1. Removes specified characters from strings

5.1.3.1.2. Default: removes spaces if no character specified

5.1.3.1.3. Example: trim('s' from 'socks') → 'ock'

5.1.3.2. Position Options

5.1.3.2.1. Leading (front of string)

5.1.3.2.2. Trailing (end of string)

5.1.3.2.3. Both (default, removes from both ends)

5.1.3.2.4. Example: trim(trailing 's' from 'socks') → 'sock'

5.1.3.3. PostgreSQL Variants

5.1.3.3.1. ltrim(string, characters)

5.1.3.3.2. rtrim(string, characters)

5.1.4. Extracting and Replacing Characters

5.1.4.1. left(string, number)

5.1.4.1.1. Returns specified number of characters from left side

5.1.4.1.2. Example: left('703-555-1212', 3) → '703'

5.1.4.2. right(string, number)

5.1.4.2.1. Returns specified number of characters from right side

5.1.4.2.2. Example: right('703-555-1212', 8) → '555-1212'

5.1.4.3. replace(string, from, to)

5.1.4.3.1. Substitutes characters in a string

5.1.4.3.2. Replaces all instances of 'from' with 'to'

5.1.4.3.3. Example: replace('bat', 'b', 'c') → 'cat'

5.2. Matching text patterns with regular expressions

5.2.1. Regular Expression Notation

5.2.1.1. Basic Elements

5.2.1.1.1. Literals: Match exact characters (e.g., "Al" matches in "Alicia")

5.2.1.1.2. Wildcard (.): Matches any single character except newline

5.2.1.1.3. Character Classes []: Match any character within brackets

5.2.1.2. Special Character Shortcuts

5.2.1.2.1. \w: Any word character [A-Za-z0-9_]

5.2.1.2.2. \d: Any digit

5.2.1.2.3. \s: Any whitespace

5.2.1.2.4. \t: Tab character

5.2.1.2.5. \n: Newline character

5.2.1.2.6. \r: Carriage return character

5.2.1.3. Position Anchors

5.2.1.3.1. ^: Match at start of string

5.2.1.3.2. $: Match at end of string

5.2.1.4. Quantifiers

5.2.1.4.1. ?: Preceding match zero or one time

5.2.1.4.2. *: Preceding match zero or more times

5.2.1.4.3. +: Preceding match one or more times

5.2.1.4.4. {m}: Exactly m times

5.2.1.4.5. {m,n}: Between m and n times

5.2.1.5. Grouping & Alternation

5.2.1.5.1. (pattern): Capture group - reports matched portion

5.2.1.5.2. (?:pattern): Non-capture group - groups without reporting

5.2.1.5.3. a|b: Alternation - match either a or b

5.2.1.6. Examples

5.2.1.6.1. .+: Any character one or more times

5.2.1.6.2. \d{4}: Four digits (e.g., "2019")

5.2.1.6.3. May \d, \d{4}: "May" followed by digit, comma, and four digits

5.2.1.6.4. ^\w+: One or more word characters at string start

5.2.1.6.5. \w+.$: One or more word characters plus any character at string end

5.2.2. Turning Text to Data with Regular Expression Functions

5.2.2.1. Creating a Table for Crime Reports

5.2.2.2. Matching Crime Report Date Patterns

5.2.2.2.1. Purpose

5.2.2.2.2. Using regexp_match() Function

5.2.2.2.3. Date Pattern Structure

5.2.2.2.4. Regular Expression Components

5.2.3. Matching Additional Crime Report Elements

5.2.3.1. First Hour Pattern

5.2.3.1.1. Pattern: /\d{2}\n(\d{4})

5.2.3.1.2. Components

5.2.3.1.3. Example: 4/16/17-4/17/17 2100-0900 hrs

5.2.3.2. Second Hour Pattern

5.2.3.2.1. Pattern: /\d{2}\n\d{4}-(\d{4})

5.2.3.2.2. Components

5.2.3.2.3. Example: 4/16/17-4/17/17 2100-0900 hrs

5.2.3.3. Street Address Pattern

5.2.3.3.1. Pattern: hrs.\n(\d+ .+(?:Sq.|Plz.|Dr.|Ter.|Rd.))

5.2.3.3.2. Components

5.2.3.3.3. Example: 21800 block Newlin Mill Rd

5.2.3.4. City Pattern

5.2.3.4.1. Pattern: (?:Sq.|Plz.|Dr.|Ter.|Rd.)\n(\w+ \w+|\w+)\n

5.2.3.4.2. Components

5.2.3.4.3. Example: Sterling, Middleburg

5.2.3.5. Crime Type Pattern

5.2.3.5.1. Pattern: \n(?:\w+ \w+|\w+)\n(.*):

5.2.3.5.2. Components

5.2.3.5.3. Example: Larceny, Destruction of Property

5.2.3.6. Description Pattern

5.2.3.6.1. Pattern: :\s(.+)(?:C0|SO)

5.2.3.6.2. Components

5.2.3.6.3. Example: A license plate was reported stolen from a vehicle

5.2.3.7. Case Number Pattern

5.2.3.7.1. Pattern: (?:C0|SO)[0-9]+

5.2.3.7.2. Components

5.2.3.7.3. Example: C0170006614, SO170006250

5.2.4. Extracting Text from the regexp_match() Result

5.2.4.1. Understanding Array Results

5.2.4.1.1. regexp_match() returns array of text values

5.2.4.1.2. Array indicator: text[] data type in column header

5.2.4.1.3. Visual indicator: curly brackets {} around results

5.2.4.1.4. Example: {C0170006614} instead of C0170006614

5.2.4.2. Extraction Method

5.2.4.2.1. Use array notation to access elements

5.2.4.2.2. Format: (regexp_match(text, pattern))[array_index]

5.2.4.2.3. Array index starts at 1, not 0

5.2.4.2.4. Syntax requires parentheses around regexp_match()

5.2.4.2.5. Example: (regexp_match(original_text, '(?:C0|SO)[0-9]+'))[1]

5.2.4.3. Extraction Process

5.2.4.3.1. Wrap regexp_match() function in parentheses

5.2.4.3.2. Add [1] to extract first element of array

5.2.4.4. Purpose

5.2.4.4.1. Prepare data for database updates

5.2.4.4.2. Match column data types (text vs. array)

5.2.5. Updating the crime_reports Table with Extracted Data

5.2.5.1. Process Overview

5.2.5.1.1. Extract elements from original text

5.2.5.1.2. Format data for timestamp compatibility

5.2.5.1.3. Use UPDATE statements to populate columns

5.2.5.1.4. Convert text patterns to structured data

5.2.5.2. Date and Time Extraction

5.2.5.2.1. Extract date with regexp_match(): \d{1,2}/\d{1,2}/\d{2}

5.2.5.2.2. Extract time with regexp_match(): /\d{2}\n(\d{4})

5.2.5.2.3. Access first element from array results with [1]

5.2.5.2.4. Example: (regexp_match(original_text, '\d{1,2}/\d{1,2}/\d{2}'))[1]

5.2.5.3. Timestamp Creation

5.2.5.3.1. Concatenate elements with double-pipe (||) operator

5.2.5.3.2. Format: date || ' ' || time || ' US/Eastern'

5.2.6. Using CASE to Handle Special Instances

5.2.6.1. UPDATE Statement Structure

5.2.6.1.1. Single statement updates all columns

5.2.6.1.2. More efficient than multiple statements

5.2.6.1.3. Handles inconsistent data patterns

5.2.6.1.4. Combines regex extraction with conditional logic

5.2.6.2. CASE Statement Logic

5.2.6.2.1. Tests multiple conditions sequentially

5.2.6.2.2. Format: CASE WHEN condition THEN result ELSE default END

5.2.6.2.3. Returns appropriate value based on conditions

5.2.7. Using Regular Expressions with WHERE

5.2.7.1. Operators

5.2.7.1.1. ~ : Case-sensitive regex match

5.2.7.1.2. ~* : Case-insensitive regex match

5.2.7.1.3. !~ : Negated case-sensitive match

5.2.7.1.4. !~* : Negated case-insensitive match

5.2.7.2. Advantages

5.2.7.2.1. More complex pattern matching than LIKE/ILIKE

5.2.7.2.2. Flexible pattern definitions

5.2.7.2.3. Supports alternation (either/or matching)

5.2.7.2.4. Can combine with other SQL conditions

5.2.7.3. Syntax

5.2.7.3.1. WHERE column_name ~* 'pattern'

5.2.7.3.2. WHERE column_name !~ 'pattern'

5.2.7.3.3. Can use AND/OR with other conditions

5.2.7.3.4. Patterns use standard regex notation

5.2.8. Additional Regular Expression Functions

5.2.8.1. regexp_replace(string, pattern, replacement)

5.2.8.2. regexp_split_to_table(string, pattern)

5.2.8.3. regexp_split_to_array(string, pattern)

5.2.8.4. Array Processing Functions

5.3. Full-text search in PostgreSQL

5.3.1. Text Search Data Types

5.3.2. Creating a Table for Full Text Search

6. CHAPTER 14: Analyzing Spatial Data with PostGIS

6.1. Installing PostGIS and creating a spatial database

6.2. The building blocks of spatial data

6.3. Two-dimensional geometries

6.3.1. Simple Feature Standard

6.3.1.1. Created by ISO and OGC

6.3.1.2. Defines 2D/3D shapes ("geometries")

6.3.1.3. Implemented by PostGIS

6.3.2. Basic Geometry Types

6.3.2.1. Point

6.3.2.1.1. Single location (X,Y coordinates)

6.3.2.1.2. Represented as dot on maps

6.3.2.1.3. Example: longitude/latitude

6.3.2.2. LineString

6.3.2.2.1. 2+ connected points

6.3.2.2.2. Straight line segments

6.3.2.2.3. Examples: roads, trails, streams

6.3.2.3. Polygon

6.3.2.3.1. 2D shape with 3+ sides

6.3.2.3.2. Closed area with boundaries

6.3.2.3.3. Can contain "holes" (interior polygons)

6.3.2.3.4. Examples: nations, buildings, lakes

6.3.2.4. MultiPoint

6.3.2.4.1. Collection of Points

6.3.2.4.2. Example: all locations of a retail chain

6.3.2.5. MultiLineString

6.3.2.5.1. Collection of LineStrings

6.3.2.5.2. Example: road with disconnected segments

6.3.2.6. MultiPolygon

6.3.2.6.1. Collection of Polygons

6.3.2.6.2. Example: parcels divided by roads

6.4. PostGIS data types

6.4.1. Creating spatial objects with PostGIS functions

6.4.1.1. ST_GeomFromText()

6.4.1.1.1. Primary geometry constructor

6.4.1.1.2. Syntax: ST_GeomFromText(WKT, SRID)

6.4.1.2. ST_GeogFromText()

6.4.1.2.1. Creates geography data type

6.4.1.3. Point Functions

6.4.1.3.1. ST_PointFromText()

6.4.1.3.2. ST_MakePoint()

6.4.1.3.3. ST_SetSRID()

6.4.1.4. LineString Functions

6.4.1.4.1. ST_LineFromText()

6.4.1.4.2. ST_MakeLine()

6.4.1.5. Polygon Functions

6.4.1.5.1. ST_PolygonFromText()

6.4.1.5.2. ST_MakePolygon()

6.4.1.5.3. ST_MPolyFromText()

6.4.2. Analyzing farmers' markets data

6.4.2.1. Finding nearby markets for weekend visits

6.4.2.2. Distance-based analysis and planning

6.4.2.3. Similar to common map application features

6.5. Working with census shapefiles

6.5.1. Shapefile Components

6.5.1.1. .shp - Main geometry storage file

6.5.1.2. .shx - Geometry index file

6.5.1.3. .dbf - Attribute database (dBASE format)

6.5.1.4. .xml - Optional metadata

6.5.1.5. .prj - Optional projection/coordinate system

6.5.2. Loading Methods

6.5.2.1. GUI Approach

6.5.2.2. Connection Settings

6.5.2.3. Import Parameters

6.6. Performing spatial joins

6.6.1. Key Concept

6.6.1.1. Joins based on spatial relationships

6.6.1.2. Links tables through geometry columns

6.6.1.3. Uses spatial predicates instead of equality

6.6.2. Spatial Join Functions

6.6.2.1. ST_Intersects()

6.6.2.2. ST_Intersection()

7. CHAPTER 15: Saving Time with Views, Functions, and Triggers

7.1. Using views to simplify queries

7.1.1. Definition

7.1.1.1. Stored queries → virtual tables

7.1.1.2. CREATE OR REPLACE VIEW name AS SELECT...

7.1.2. Structure

7.1.2.1. Simple views → filter columns/rows

7.1.2.1.1. Example: nevada_counties_pop_2010

7.1.2.1.2. SELECT 4 of 16 columns + WHERE filter

7.1.2.2. Complex views → calculations + joins

7.1.2.2.1. Example: county_pop_change_2010_2000

7.1.2.2.2. JOIN census tables + percent calculation

7.1.3. Management

7.1.3.1. CREATE → new view

7.1.3.2. REPLACE → update (constraints: same columns/types/order)

7.1.3.3. DROP → delete

7.1.3.4. Location: Schemas → public → Views

7.1.4. Usage

7.1.4.1. Query like tables: SELECT * FROM view_name

7.1.4.2. Filter: WHERE conditions

7.1.4.3. Column selection: specific vs all (*)

7.1.4.4. Performance: saves repeated complex queries

7.1.5. Benefits

7.1.5.1. Simplifies complexity

7.1.5.2. Standardizes calculations

7.1.5.3. Enables consistent reporting

7.2. Inserting, Updating, and Deleting Data Using a View

7.2.1. Creating a View of Employees

7.2.2. Inserting Rows Using the employees_tax_dept View

7.2.2.1. INSERT INTO view_name = INSERT INTO underlying_table

7.2.2.2. Changes in view → reflected in base table

7.2.2.3. View query reruns → shows updated results

7.2.3. Updating Rows Using the employees_tax_dept View

7.2.3.1. UPDATE view_name = UPDATE base_table

7.2.3.2. WHERE conditions apply normally

7.2.3.3. Changes affect underlying data

7.2.4. Deleting Rows Using the employees_tax_dept View

7.2.4.1. DELETE FROM view_name = DELETE FROM base_table

7.2.4.2. Removes data from underlying table

7.2.4.3. Same restrictions as inserts/updates apply

7.3. Programming your own functions

7.3.1. Creating the percent_change() Function

7.3.1.1. Purpose: Reuse code, simplify calculations

7.3.1.2. Syntax: CREATE OR REPLACE FUNCTION name(args)
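The function roughly as the book defines it (argument names and the one-decimal default follow that definition):

```sql
CREATE OR REPLACE FUNCTION
percent_change(new_value numeric,
               old_value numeric,
               decimal_places integer DEFAULT 1)
RETURNS numeric AS
'SELECT round(((new_value - old_value) / old_value) * 100, decimal_places);'
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT;

-- Usage: SELECT percent_change(110, 108, 2);
```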

7.3.2. Using the percent_change() Function

7.3.2.1. Identical results to formula

7.3.2.2. Cleaner code

7.3.2.3. Less error-prone

7.3.2.4. Consistent calculations

7.3.2.5. Simpler maintenance

7.3.3. Updating Data with a Function

7.3.3.1. Automatic personal days calculation

7.3.3.2. Applies consistent business rules

7.3.3.3. Bulk data update

7.3.4. Using the Python Language in a Function

7.3.4.1. Leverage Python's capabilities

7.3.4.2. Access to Python libraries

7.3.4.3. Complex data processing

7.3.4.4. Text manipulation

7.3.4.5. Beyond SQL/PL/pgSQL limitations

7.4. Automating Database Actions with Triggers

7.4.1. Logging Grade Updates to a Table

7.4.2. Creating Tables to Track Grades and Updates

7.4.3. Creating the Function and Trigger

7.4.3.1. Automated audit trail

7.4.3.2. Change history preservation

7.4.3.3. No application code changes needed

7.4.3.4. Consistent enforcement

7.4.3.5. Data integrity protection
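A minimal sketch of the audit-trail pattern above, assuming illustrative table and column names (grades, grades_history):

```sql
-- Trigger function: record old and new values whenever a grade changes
CREATE OR REPLACE FUNCTION record_if_grade_changed()
RETURNS trigger AS
$$
BEGIN
    IF NEW.grade <> OLD.grade THEN
        INSERT INTO grades_history (student_id, course_id, change_time, old_grade, new_grade)
        VALUES (OLD.student_id, OLD.course_id, now(), OLD.grade, NEW.grade);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach the function so it fires on every row updated in grades
CREATE TRIGGER grades_update
AFTER UPDATE ON grades
FOR EACH ROW
EXECUTE PROCEDURE record_if_grade_changed();
```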

7.4.4. Automatically Classifying Temperatures

7.4.4.1. Automatic classification

7.4.4.2. No manual intervention needed

7.4.4.3. Real-time data processing

7.4.4.4. Consistent categorization

7.4.4.5. Ready for reporting/forecasting

8. CHAPTER 1: Creating Your First Database and Table

8.1. Creating a database

8.1.1. CREATE DATABASE analysis;

8.2. Creating a table

8.2.1. The CREATE TABLE Statement
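A representative CREATE TABLE statement in the spirit of the chapter's opening example (column names and types are illustrative):

```sql
CREATE TABLE teachers (
    id bigserial,           -- auto-incrementing row identifier
    first_name varchar(25),
    last_name varchar(50),
    school varchar(50),
    hire_date date,
    salary numeric
);
```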

8.3. Inserting rows into a table
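Rows are added with INSERT INTO ... VALUES, listing columns and matching values in order (table and values are illustrative):

```sql
INSERT INTO teachers (first_name, last_name, school, hire_date, salary)
VALUES ('Janet', 'Smith', 'F.S. Roberts', '2011-10-30', 36200),
       ('Lee', 'Reynolds', 'F.S. Roberts', '1993-05-22', 65000);
```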

8.4. Formatting SQL for Readability

8.4.1. Importance

8.4.1.1. No special formatting required to run

8.4.1.2. Enhances readability

8.4.1.3. Facilitates collaboration

8.4.2. Conventions

8.4.2.1. Keywords

8.4.2.1.1. Use UPPERCASE for SQL keywords (e.g., SELECT)

8.4.2.1.2. Optional: Uppercase data types (e.g., TEXT, INTEGER)

8.4.2.1.3. Book practice: Lowercase data types for distinction

8.4.2.2. Object Names

8.4.2.2.1. Avoid camelCase

8.4.2.2.2. Use lowercase_and_underscores (e.g., table_name, column_name)

8.4.2.2.3. More details in Chapter 7

8.4.2.3. Indentation

8.4.2.3.1. Indent clauses and code blocks

8.4.2.3.2. Use either

8.4.2.3.3. Tabs vs Spaces

9. CHAPTER 2: Beginning Data Exploration with SELECT

9.1. Basic SELECT Syntax

9.1.1. Querying all columns

9.1.1.1. SELECT * FROM my_table;

9.1.2. Querying a Subset of Columns

9.1.2.1. SELECT some_column, another_column, amazing_column FROM table_name;

9.1.3. Using DISTINCT to Find Unique Values
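DISTINCT returns each unique value, or unique combination of values, exactly once; a sketch assuming a teachers table:

```sql
-- One row per unique school name
SELECT DISTINCT school FROM teachers;

-- One row per unique school/salary pairing
SELECT DISTINCT school, salary FROM teachers;
```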

9.2. Sorting data with ORDER BY

9.2.1. Purpose

9.2.1.1. Arranges data for better readability and pattern identification

9.2.1.2. Does not alter the original table, only query results

9.2.2. Syntax

9.2.2.1. ORDER BY column_name [ASC | DESC]

9.2.3. Default Behavior

9.2.3.1. Ascending order (ASC)

9.2.4. Descending Order

9.2.4.1. Use `DESC` for descending order

9.2.5. Example

9.2.6. Sorting Multiple Columns

9.2.6.1. Use comma-separated column names with sort order

9.2.6.2. Example
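A sketch of multi-column sorting, using column names assumed from the chapter's teachers examples:

```sql
SELECT last_name, school, salary
FROM teachers
ORDER BY school ASC,   -- group rows by school, A to Z
         salary DESC;  -- within each school, highest salary first
```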

9.2.7. Best practices

9.2.7.1. Limit the number of columns in `ORDER BY` for clarity

9.2.7.2. Focus results on answering specific questions

9.2.7.3. Run multiple queries for complex sorting needs

9.3. Filtering Rows with WHERE

9.3.1. Purpose

9.3.1.1. Limit rows returned based on criteria

9.3.1.2. Use WHERE clause with operators to filter data

9.3.1.3. Exclude rows if needed

9.3.2. Syntax

9.3.3. Basic Example

9.3.4. Comparison Operators

9.3.4.1. Equal to (`=`): `WHERE school = 'Baker Middle'`

9.3.4.2. Not equal to (`<>` or `!=`): `WHERE school != 'Baker Middle'`

9.3.4.3. Greater than (`>`): `WHERE salary > 20000`

9.3.4.4. Less than (`<`): `WHERE salary < 60500`

9.3.4.5. Greater than or equal to (`>=`): `WHERE salary >= 20000`

9.3.4.6. Less than or equal to (`<=`): `WHERE salary <= 60500`

9.3.4.7. Between range (`BETWEEN`): `WHERE salary BETWEEN 20000 AND 40000`

9.3.4.8. Match set of values (`IN`): `WHERE last_name IN ('Bush', 'Roush')`

9.3.4.9. Pattern matching (case-sensitive `LIKE`, case-insensitive `ILIKE`)

9.3.4.10. Negation (`NOT`): `WHERE first_name NOT ILIKE 'sam%'`

9.3.5. Wildcard Characters

9.3.5.1. `%`: Matches zero or more characters

9.3.5.2. `_`: Matches exactly one character

9.3.6. Combining Operators

9.3.6.1. Example
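Comparison operators combine with AND/OR, and parentheses control evaluation order (values are illustrative):

```sql
SELECT first_name, last_name, school, salary
FROM teachers
WHERE school = 'Baker Middle'
  AND (salary < 38000 OR salary > 60500);
```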

9.3.7. Best practices

9.3.7.1. Use `ILIKE` for case-insensitive searches

9.3.7.2. Combine filters with `AND`/`OR` for precision

9.3.7.3. Use `ORDER BY` to rank results

9.3.7.4. Optimize performance with indexes for large datasets

10. CHAPTER 3: Understanding Data Types

10.1. Characters

10.1.1. CHAR(n)

10.1.1.1. Fixed-length column

10.1.1.2. Specified by n (e.g., char(20))

10.1.1.3. Pads with spaces if fewer characters inserted

10.1.1.4. Always stores exactly n characters

10.1.1.5. Alternative syntax: character(n)

10.1.1.6. Legacy type, used infrequently

10.1.2. VARCHAR(n)

10.1.2.1. Variable-length column

10.1.2.2. Maximum length specified by n

10.1.2.3. Stores only actual characters inserted

10.1.2.4. No padding with spaces

10.1.2.5. Space-efficient storage

10.1.2.6. Alternative syntax: character varying(n)

10.1.3. TEXT

10.1.3.1. Variable-length column

10.1.3.2. Unlimited length (up to ~1GB)

10.1.3.3. Not part of SQL standard

10.1.4. Performance Considerations

10.1.4.1. No substantial performance difference among types

10.1.4.2. VARCHAR and TEXT offer space savings

10.1.4.3. CHAR wastes space with padding

10.1.5. Usage Recommendations

10.1.5.1. Use VARCHAR for most cases

10.1.5.2. Choose n value to handle outliers

10.1.5.3. Use CHAR for truly fixed-length data (e.g., state codes)

10.1.5.4. Use TEXT when length is unpredictable

10.2. Number

10.2.1. Integer Types

10.2.1.1. Whole numbers (positive/negative)

10.2.1.2. Used for: counting, IDs, quantities

10.2.1.3. Types by Size

10.2.1.3.1. smallint

10.2.1.3.2. integer

10.2.1.3.3. bigint

10.2.1.4. Auto-Incrementing Types

10.2.1.4.1. Used for: primary keys, IDs

10.2.1.4.2. Auto-increment with each new row

10.2.2. Decimal Numbers

10.2.2.1. Fixed-Point Types

10.2.2.1.1. numeric(precision, scale) or decimal(precision, scale)

10.2.2.2. Floating-Point Types

10.2.2.2.1. real

10.2.2.2.2. double precision

10.2.3. Considerations

10.2.3.1. Floating-Point Issues

10.2.3.1.1. Mathematical inaccuracies

10.2.3.1.2. Example: 0.7 × 10,000,000 ≠ 7,000,000

10.2.3.1.3. Uses less storage than fixed-point

10.2.3.2. Storage Efficiency

10.2.3.2.1. Smaller types use less space

10.2.3.2.2. Important for large databases

10.2.4. Selection Guidelines

10.2.4.1. Use integers when possible

10.2.4.2. Use numeric/decimal for exact calculations (money)

10.3. Dates and Times

10.3.1. Overview

10.3.1.1. Essential for tracking when events occur

10.3.1.2. Server-based time awareness

10.3.1.3. Calendar-aware (leap years, time zones)

10.3.1.4. Key for data storytelling

10.3.2. Core Data Types

10.3.2.1. TIMESTAMP

10.3.2.1.1. Date and time together

10.3.2.1.2. Variant: timestamp with time zone (timestamptz)

10.3.2.1.3. Most useful with time zone included

10.3.2.2. DATE

10.3.2.2.1. Date only (no time)

10.3.2.3. TIME

10.3.2.3.1. Time only (no date)

10.3.2.3.2. Can include time zone

10.3.2.4. INTERVAL

10.3.2.4.1. Represents duration of time

10.3.2.4.2. Format: quantity unit (12 days, 8 hours)

10.3.2.4.3. Units from microsecond to millennium

10.3.3. Time Zone Handling

10.3.3.1. Specification Methods

10.3.3.1.1. Named time zones (Australia/Melbourne)

10.3.3.1.2. Abbreviations (EST)

10.3.3.1.3. UTC offset (-8)

10.3.3.1.4. Time zone database (tz database)

10.3.3.2. Display Considerations

10.3.3.2.1. Values displayed relative to user's time zone

10.3.3.2.2. UTC offset shown (e.g., -05)

10.3.4. Input Formats

10.3.4.1. ISO format (YYYY-MM-DD HH:MM:SS)

10.3.4.2. Other supported formats (MM/DD/YYYY)

10.3.5. Calculations

10.3.5.1. Math operations with interval type

10.3.5.2. Example: timestamp - interval

10.3.5.3. Creates new timestamp values

10.3.5.4. Simplifies date/time calculations
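Date and interval arithmetic in practice (literal values are illustrative):

```sql
-- Subtracting one date from another yields an integer number of days
SELECT '1929-09-30'::date - '1929-09-27'::date;

-- Subtracting an interval from a timestamp yields a new timestamp
SELECT '2022-12-31 01:00 EST'::timestamptz - '1 month'::interval;
```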

10.4. Miscellaneous Types

10.4.1. Boolean Type

10.4.1.1. Stores true or false values

10.4.1.2. Used for flags and conditions

10.4.2. Geometric Types

10.4.2.1. Two-dimensional objects

10.4.2.2. Points

10.4.2.3. Lines

10.4.2.4. Circles

10.4.2.5. Other geometric shapes

10.4.3. Network Address Types

10.4.3.1. IP addresses

10.4.3.2. MAC addresses

10.4.3.3. Network identifiers

10.4.4. UUID Type

10.4.4.1. Universally Unique Identifier

10.4.4.2. Often used as unique key values in tables

10.4.4.3. Alternative to serial/sequence IDs

10.4.5. Structured Data Types

10.4.5.1. XML data type

10.4.5.2. JSON data type

10.5. Transforming Values from One Type to Another with CAST

10.5.1. Purpose

10.5.1.1. Transform values from one data type to another

10.5.1.2. Needed for

10.5.1.2.1. Combining numbers with text

10.5.1.2.2. Treating text dates as date types

10.5.1.2.3. Sorting based on different type rules

10.5.1.2.4. Performing calculations on converted values

10.5.2. Syntax

10.5.2.1. CAST(expression AS target_type)

10.5.2.2. Example
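Hedged CAST examples, assuming column names from illustrative type-demo tables:

```sql
-- Extract just the date portion of a timestamp as text
SELECT CAST(timestamp_column AS varchar(10)) FROM date_time_types;

-- Convert numeric to integer (PostgreSQL rounds) and numeric to text
SELECT CAST(numeric_column AS integer),
       CAST(numeric_column AS varchar(6))
FROM number_data_types;

-- This fails: text containing letters cannot become an integer
-- SELECT CAST(char_column AS integer) FROM char_data_types;
```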

10.5.3. Valid Conversions

10.5.3.1. Numbers to text

10.5.3.1.1. Always works

10.5.3.1.2. Integer → varchar

10.5.3.1.3. Numeric → varchar

10.5.3.2. Timestamps to text

10.5.3.2.1. Can extract just date portion

10.5.3.2.2. CAST(timestamp AS varchar(10))

10.5.3.3. Numbers to other number types

10.5.3.3.1. Numeric → integer (with rounding)

10.5.3.3.2. Integer → numeric

10.5.4. Limitations

10.5.4.1. Target type must accommodate original value

10.5.4.2. Cannot convert

10.5.4.2.1. Text with letters to numbers

10.5.4.2.2. Invalid date strings to dates

10.6. CAST Shortcut Notation

10.6.1. Double Colon (::) Shortcut

10.6.1.1. Example
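The PostgreSQL-specific double-colon notation is equivalent to CAST(... AS ...); column and table names here are illustrative:

```sql
-- These two queries return identical results
SELECT CAST(timestamp_column AS varchar(10))
FROM date_time_types;

SELECT timestamp_column::varchar(10)
FROM date_time_types;
```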

10.6.2. Best practices

10.6.2.1. Choose based on readability

10.6.2.2. Consider future code maintainers

10.6.2.3. Be consistent within a project

10.7. Try it yourself

10.7.1. 1. Your company delivers fruit and vegetables to local grocery stores, and you need to track the mileage driven by each driver each day to a tenth of a mile. Assuming no driver would ever travel more than 999 miles in a day, what would be an appropriate data type for the mileage column in your table? Why?

10.7.2. 2. In the table listing each driver in your company, what are appropriate data types for the drivers’ first and last names? Why is it a good idea to separate first and last names into two columns rather than having one larger name column?

10.7.3. 3. Assume you have a text column that includes strings formatted as dates. One of the strings is written as '4//2017'. What will happen when you try to convert that string to the timestamp data type?

11. CHAPTER 4: Importing and Exporting Data

11.1. Working with delimited text Files

11.1.1. Basics

11.1.1.1. Universal data exchange format

11.1.1.2. Bridge between different software applications

11.1.1.3. Plain text format with structured data

11.1.1.4. Alternative to proprietary formats

11.1.2. Structure

11.1.2.1. Rows

11.1.2.1.1. represent table records

11.1.2.2. Columns

11.1.2.2.1. Separated by delimiter characters

11.1.2.3. Common delimiters

11.1.2.3.1. Comma (most common)

11.1.2.3.2. Pipe (|)

11.1.2.3.3. Tab

11.1.2.3.4. Semicolon (;)

11.1.2.3.5. Ampersand (&)

11.1.3. CSV Files

11.1.3.1. CSV = Comma-Separated Values

11.1.3.2. Most common delimited file format

11.1.3.3. "CSV" and "comma-delimited" are interchangeable terms

11.1.3.4. Widely supported by

11.1.3.4.1. Databases

11.1.3.4.2. Spreadsheets

11.1.3.4.3. Programming languages

11.1.4. Delimiter Challenges

11.1.4.1. Problem

11.1.4.1.1. Values containing delimiter characters

11.1.4.1.2. Example: address with comma

11.1.4.2. Solution

11.1.4.2.1. Text Qualifiers

11.1.4.2.2. Most common: double quotes (")

11.2. Using COPY to import data

11.2.1. Basic Syntax

11.2.1.1. COPY table_name FROM 'file_path' WITH (options);

11.2.1.2. Table must already exist in database

11.2.1.3. Command means "Copy data TO my table FROM this file"

11.2.2. File Path Specification

11.2.2.1. Windows Format

11.2.2.1.1. Drive letter + colon + backslash + directories

11.2.2.1.2. Example: `'C:\Users\Anthony\Desktop\my_file.csv'`

11.2.2.2. macOS/Linux Format

11.2.2.2.1. Forward slash + directories from root

11.2.2.2.2. Example: `'/Users/anthony/Desktop/my_file.csv'`

11.2.2.3. Always enclose path in single quotes

11.2.3. WITH Options

11.2.3.1. Enclosed in parentheses

11.2.3.2. Multiple options separated by commas

11.2.3.3. Example: `WITH (FORMAT CSV, HEADER);`

11.2.4. Common COPY Options

11.2.4.1. FORMAT

11.2.4.1.1. Specifies file type

11.2.4.1.2. Options

11.2.4.2. HEADER

11.2.4.2.1. `HEADER` or `HEADER ON`

11.2.4.2.2. Indicates source file has column names in first row

11.2.4.2.3. On import: Skips first row during import

11.2.4.2.4. On export: Includes column names as first row

11.2.4.3. DELIMITER

11.2.4.3.1. `DELIMITER 'character'`

11.2.4.3.2. Specifies the column separator character

11.2.4.3.3. Default for CSV is comma

11.2.4.3.4. Example for pipe-delimited: `DELIMITER '|'`

11.2.4.3.5. Must be a single character (not carriage return)

11.2.4.4. QUOTE

11.2.4.4.1. `QUOTE 'character'`

11.2.4.4.2. Specifies text qualifier character

11.2.4.4.3. Default is double quote (")

11.2.4.4.4. Used to wrap values containing delimiter characters
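Putting the options together, a typical import might look like this (path and table name are illustrative):

```sql
COPY us_counties_2010
FROM 'C:\YourDirectory\us_counties_2010.csv'
WITH (FORMAT CSV, HEADER);
```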

11.3. Importing census data describing counties

11.3.1. SELECT queries to check data

11.3.2. ORDER BY with LIMIT to view specific data subsets

11.3.3. Checking large values (Alaska land areas)

11.3.4. Cross-checking against the book's source SQL files

11.4. Importing a subset of columns with COPY
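When the source file has fewer columns than the table, name the target columns after the table name; unnamed columns receive defaults or NULL (table and column names are illustrative):

```sql
COPY supervisor_salaries (town, supervisor, salary)
FROM 'C:\YourDirectory\supervisor_salaries.csv'
WITH (FORMAT CSV, HEADER);
```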

12. CHAPTER 5: Basic Math and Stats with SQL

12.1. Math operators

12.1.1. Basic Operators (ANSI SQL Standard)

12.1.1.1. Addition (+)

12.1.1.1.1. Used with: numbers, dates, intervals

12.1.1.1.2. Example: SELECT 2 + 2; → 4

12.1.1.2. Subtraction (-)

12.1.1.2.1. Example: SELECT 9 - 1; → 8

12.1.1.2.2. Used with: numbers, dates, intervals

12.1.1.3. Multiplication (*)

12.1.1.3.1. Example: SELECT 3 * 4; → 12

12.1.1.3.2. Used with: numbers

12.1.1.4. Division (/)

12.1.1.4.1. Example: SELECT 11 / 6; → 1 (integer division)

12.1.1.4.2. Returns quotient only (no remainder)

12.1.1.4.3. Integer division drops decimal portion

12.1.2. PostgreSQL-Specific Operators

12.1.2.1. Modulo (%)

12.1.2.1.1. Returns only the remainder

12.1.2.1.2. Example: SELECT 11 % 6; → 5

12.1.2.1.3. Useful for

12.1.2.2. Exponentiation (^)

12.1.2.2.1. Raises base to power of exponent

12.1.2.2.2. Example: SELECT 3 ^ 4; → 81 (3 to the 4th power)

12.1.2.3. Square Root (|/)

12.1.2.3.1. Prefix operator

12.1.2.3.2. Example: SELECT |/ 10; → 3.16227766...

12.1.2.4. Cube Root (||/)

12.1.2.4.1. Prefix operator

12.1.2.4.2. Example: SELECT ||/ 10; → 2.15443469...

12.1.2.5. Factorial (!)

12.1.2.5.1. Suffix operator

12.1.2.5.2. Example: SELECT 4 !; → 24 (4×3×2×1)

12.1.2.5.3. Used for permutation calculations

12.1.3. Practical Usage

12.1.3.1. Simple calculations with SELECT

12.1.3.2. Column computations in tables

12.1.3.3. Conditional logic with mathematical tests

12.1.3.4. Math in WHERE clauses for filtering

12.2. Doing math across census table columns

12.2.1. Working with Census Data

12.2.1.1. US Census 2010 data (us_counties_2010)

12.2.1.2. 3,143 counties

12.2.1.3. Population breakdowns by race

12.2.1.4. Column naming conventions

12.2.1.4.1. Census codes (e.g., p0010006)

12.2.1.4.2. Using AS for readable column aliases

12.2.2. Basic Column Math

12.2.2.1. Adding Columns

12.2.2.1.1. Example: `p0010003 + p0010004 AS "Total White and Black"`

12.2.2.1.2. Calculations performed row-by-row

12.2.2.1.3. Always use aliases for calculated columns

12.2.2.2. Verifying Data Integrity

12.2.2.2.1. Checking that race categories sum to total population

12.2.2.2.2. Formula: `p0010003 + p0010004 + p0010005 + p0010006 + p0010007 + p0010008 + p0010009`

12.2.2.2.3. Creating a "Difference" column to highlight discrepancies

12.2.2.2.4. Ordering by difference to spot problems

12.2.3. Calculating Percentages

12.2.3.1. Formula

12.2.3.1.1. (part / whole) * 100

12.2.3.2. Example

12.2.3.2.1. `(CAST(p0010006 AS numeric(8,1)) / p0010001) * 100 AS "pct_asian"`

12.2.3.3. Data Type Considerations

12.2.3.3.1. Need to CAST integers to numeric for decimal division

12.2.3.3.2. Integer division truncates decimals (returns 0 for small percentages)

12.3. Aggregate Functions for Averages and Sums
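The two workhorse aggregates, applied to the census table used throughout the chapter:

```sql
SELECT sum(p0010001) AS county_sum,
       round(avg(p0010001), 0) AS county_average
FROM us_counties_2010;
```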

12.4. Finding the median

12.4.1. Finding the Median with Percentile Functions

12.4.2. Median and Percentiles with Census Data

12.4.3. Finding Other Quantiles with Percentile Functions

12.4.4. Creating a median() Function
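Standard SQL has no median() aggregate, so the chapter's approach uses percentile functions; a sketch against the census table:

```sql
-- Continuous (interpolated) vs. discrete median of county population
SELECT percentile_cont(.5) WITHIN GROUP (ORDER BY p0010001) AS median_cont,
       percentile_disc(.5) WITHIN GROUP (ORDER BY p0010001) AS median_disc
FROM us_counties_2010;
```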

12.5. Finding the mode

12.5.1. Example

12.5.1.1. SELECT mode() WITHIN GROUP (ORDER BY p0010001) FROM us_counties_2010;

13. CHAPTER 6: Joining Tables in a Relational Database

13.1. Linking tables using Join

13.1.1. Purpose

13.1.1.1. Connect multiple tables in a single query

13.1.1.2. Combine related data from different tables

13.1.1.3. Create relationships between datasets

13.1.1.4. Enable complex data analysis across tables

13.1.2. JOIN Syntax
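The general JOIN shape referenced above:

```sql
SELECT *
FROM table_a
JOIN table_b
  ON table_a.key_column = table_b.foreign_key_column;
```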

13.2. Relating tables with key columns

13.2.1. Database Scenario

13.2.1.1. Public agency payroll analysis

13.2.1.2. Data received as separate tables instead of one spreadsheet

13.2.1.3. Two key tables: employees and departments

13.2.1.4. Need to connect employee data to their departments

13.2.2. Table Structure

13.2.2.1. Departments Table

13.2.2.1.1. dept_id (Primary Key)

13.2.2.1.2. dept (department name)

13.2.2.1.3. city (location)

13.2.2.2. Employees Table

13.2.2.2.1. emp_id (Primary Key)

13.2.2.2.2. first_name, last_name

13.2.2.2.3. salary

13.2.2.2.4. dept_id (Foreign Key → departments.dept_id)

13.2.3. Key Columns

13.2.3.1. Primary Key

13.2.3.1.1. Uniquely identifies each row in a table

13.2.3.1.2. Requirements

13.2.3.2. Foreign Key

13.2.3.2.1. References primary key in another table

13.2.3.2.2. Creates relationships between tables

13.2.3.2.3. Requirements

13.2.4. Constraints

13.2.4.1. PRIMARY KEY constraint

13.2.4.2. FOREIGN KEY constraint

13.2.4.3. UNIQUE constraint

13.2.5. Advantages of Relational Structure

13.2.5.1. Eliminates data redundancy

13.2.5.2. Simplifies data management

13.2.5.3. Maintains data integrity

13.3. Querying multiple tables using JOIN

13.3.1. JOIN Concept

13.3.1.1. Connects rows across tables based on matching column values

13.3.1.2. Combines data from multiple tables into a single result set

13.3.1.3. Returns columns from both tables where values match

13.3.1.4. Can filter joined results using WHERE clause

13.3.2. JOIN Syntax

13.4. JOIN types

13.4.1. JOIN (INNER JOIN)

13.4.1.1. Definition

13.4.1.1.1. Returns only rows where values match in both tables

13.4.1.2. Syntax

13.4.1.3. Alternative syntax

13.4.1.3.1. INNER JOIN

13.4.1.4. Visual concept

13.4.1.4.1. Intersection of two tables

13.4.1.5. Returns

13.4.1.5.1. Only rows with matching values in the joined columns

13.4.1.6. Key characteristic

13.4.1.6.1. Excludes non-matching rows from both tables

13.4.2. LEFT JOIN

13.4.2.1. Definition

13.4.2.1.1. Returns all rows from left table plus matching rows from right

13.4.2.2. Syntax

13.4.2.3. Visual concept

13.4.2.3.1. Complete left table with matching right table data

13.4.2.4. Returns

13.4.2.4.1. All rows from left table

13.4.2.4.2. Matching rows from right table

13.4.2.4.3. NULL values for right table when no match exists

13.4.2.5. Key characteristic

13.4.2.5.1. Preserves all records from the table before JOIN keyword

13.4.3. RIGHT JOIN

13.4.3.1. Definition

13.4.3.1.1. Returns all rows from right table plus matching rows from left

13.4.3.2. Syntax

13.4.3.3. Visual concept

13.4.3.3.1. Complete right table with matching left table data

13.4.3.4. Returns

13.4.3.4.1. All rows from right table

13.4.3.4.2. Matching rows from left table

13.4.3.4.3. NULL values for left table when no match exists

13.4.3.5. Key characteristic

13.4.3.5.1. Preserves all records from the table after JOIN keyword

13.4.4. FULL OUTER JOIN

13.4.4.1. Definition

13.4.4.1.1. Returns all rows from both tables

13.4.4.2. Syntax

13.4.4.3. Visual concept

13.4.4.3.1. Union of two tables with matches connected

13.4.4.4. Returns

13.4.4.4.1. All rows from left table

13.4.4.4.2. All rows from right table

13.4.4.4.3. NULL values where no matches exist

13.4.4.5. Key characteristic

13.4.4.5.1. Preserves all records from both tables

13.4.5. CROSS JOIN

13.4.5.1. Definition

13.4.5.1.1. Returns every possible combination of rows

13.4.5.2. Syntax

13.4.5.3. Visual concept

13.4.5.3.1. Cartesian product of two tables

13.4.5.4. Returns

13.4.5.4.1. Each row from left table combined with every row from right table

13.4.5.5. Key characteristic

13.4.5.5.1. Result size equals (rows in table1) × (rows in table2)

13.5. Using NULL to Find Rows with Missing Values

13.5.1. NULL Concept

13.5.1.1. Special database value ≠ 0 or ""

13.5.1.2. Represents missing/unknown data

13.5.1.3. Cross-type compatible

13.5.2. NULL in JOINs

13.5.2.1. LEFT JOIN → NULLs in right table columns

13.5.2.2. RIGHT JOIN → NULLs in left table columns

13.5.2.3. FULL JOIN → NULLs in both tables' columns

13.5.3. Finding Missing Data

13.5.3.1. WHERE column IS NULL

13.5.3.2. WHERE column IS NOT NULL

13.5.3.3. Cannot use: =, !=, <, >

13.5.4. Query Pattern
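The standard pattern for surfacing missing rows after an outer join (table names follow the chapter's schools_left example):

```sql
-- Rows in schools_left that have no match in schools_right
SELECT lt.id, lt.left_school, rt.right_school
FROM schools_left AS lt
LEFT JOIN schools_right AS rt ON lt.id = rt.id
WHERE rt.id IS NULL;
```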

13.6. Three types of table relationships

13.6.1. One-to-One (1:1)

13.6.1.1. Definition

13.6.1.1.1. One record in Table A matches exactly one record in Table B

13.6.1.2. Key Structure

13.6.1.2.1. Primary keys match uniquely across tables

13.6.1.2.2. No duplicate keys in either table

13.6.1.3. Examples

13.6.1.3.1. Person ↔ Passport

13.6.1.3.2. State ↔ Capital

13.6.1.3.3. Country ↔ President

13.6.2. One-to-Many (1:N)

13.6.2.1. Definition

13.6.2.1.1. One record in Table A matches multiple records in Table B

13.6.2.2. Key Structure

13.6.2.2.1. Primary key in parent table

13.6.2.2.2. Foreign key in child table (can repeat)

13.6.2.3. Examples

13.6.2.3.1. Manufacturer ↔ Car Models

13.6.2.3.2. Author ↔ Books

13.6.2.3.3. Department ↔ Employees

13.6.3. Many-to-Many (M:N)

13.6.3.1. Definition

13.6.3.1.1. Multiple records in Table A match multiple records in Table B

13.6.3.2. Key Structure

13.6.3.2.1. Requires junction/bridge table

13.6.3.2.2. Two one-to-many relationships

13.6.3.3. Examples

13.6.3.3.1. Students ↔ Courses

13.6.3.3.2. Players ↔ Positions

13.6.3.3.3. Books ↔ Categories

13.7. Selecting Specific Columns in a Join

13.7.1. Why Specify Columns

13.7.1.1. Targeted data retrieval

13.7.1.2. Protected from schema changes

13.7.1.3. Better performance

13.7.1.4. Clear output structure

13.7.2. The Ambiguity Issue

13.7.2.1. Same column name → error

13.7.3. Solution Path

13.7.3.1. Table Qualification

13.7.3.2. Column Aliasing

13.7.4. Query Pattern
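Qualifying and aliasing columns resolves ambiguity when both tables share a column name, here id (table names follow the chapter's schools examples):

```sql
SELECT lt.id,                        -- qualified: which table's id we mean
       lt.left_school,
       rt.right_school AS rt_school  -- alias renames the output column
FROM schools_left AS lt
LEFT JOIN schools_right AS rt ON lt.id = rt.id;
```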

13.8. Simplifying Join syntax with table aliases

13.8.1. Purpose

13.8.1.1. Reduce code verbosity

13.8.1.2. Improve readability

13.8.1.3. Simplify column references

13.8.1.4. Make long queries manageable

13.8.2. Syntax

13.8.2.1. Declared in FROM clause

13.8.2.2. Pattern: `table_name AS alias`

13.8.2.3. AS keyword is optional

13.8.2.4. Examples

13.8.2.4.1. `schools_left AS lt`

13.8.2.4.2. `employees AS e`

13.8.2.4.3. `customer_accounts AS ca`

13.9. Joining multiple tables

13.9.1. Core Concept

13.9.1.1. Join unlimited tables in single query

13.9.1.2. Each join connects via matching column values

13.9.1.3. Chain joins sequentially in query

13.9.1.4. Some database systems may have limits

13.9.2. Join Chain Pattern

13.9.2.1. TableA → TableB → TableC → TableD...

13.9.2.2. Each connection requires JOIN clause

13.9.2.3. Example

13.9.3. Syntax Structure
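Chained joins repeat the same pattern, one JOIN ... ON clause per connection (the enrollment and tech table names are assumptions):

```sql
SELECT lt.id, lt.left_school, en.enrollment, tt.tech_contact
FROM schools_left AS lt
JOIN schools_enrollment AS en ON lt.id = en.id
JOIN schools_tech AS tt ON lt.id = tt.id;
```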

13.10. Performing math on Joined table columns

13.10.1. Core Concept

13.10.1.1. Apply math operations across joined tables

13.10.1.2. Reference columns using table aliases

13.10.1.3. Calculate changes between related datasets

13.10.1.4. Formula pattern: `table1.column OP table2.column`

13.10.2. Implementation

13.10.2.1. Table Setup

13.10.2.1.1. CREATE matched tables (e.g., census_2010, census_2000)

13.10.2.1.2. JOIN on unique identifiers (primary/foreign keys)

13.10.2.1.3. Reference columns with table aliases

13.10.2.2. Calculation Types

13.10.2.2.1. Simple: `t1.value - t2.value` (raw difference)

13.10.2.2.2. Complex: `(t1.value - t2.value)/t2.value * 100` (percent change)

13.10.2.2.3. Apply functions: round(), cast(), avg(), etc

13.10.3. Example Query Pattern
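A sketch of the percent-change pattern across two census tables (joining on FIPS codes is an assumption about the key columns):

```sql
SELECT c2010.geo_name,
       c2010.p0010001 AS pop_2010,
       c2000.p0010001 AS pop_2000,
       round( (CAST(c2010.p0010001 AS numeric(8,1)) - c2000.p0010001)
              / c2000.p0010001 * 100, 1 ) AS pct_change
FROM us_counties_2010 AS c2010
JOIN us_counties_2000 AS c2000
  ON c2010.state_fips = c2000.state_fips
 AND c2010.county_fips = c2000.county_fips
ORDER BY pct_change DESC;
```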

14. CHAPTER 7: Table Design That Works for You

14.1. Naming tables, columns, and other identifiers

14.1.1. Naming Styles

14.1.1.1. Snake Case

14.1.1.1.1. `berry_smoothie`

14.1.1.1.2. Words separated by underscores

14.1.1.1.3. All lowercase

14.1.1.1.4. PostgreSQL documentation standard

14.1.1.2. Camel Case

14.1.1.2.1. `berrySmoothie`

14.1.1.2.2. First word lowercase

14.1.1.2.3. Subsequent words capitalized

14.1.1.3. Pascal Case

14.1.1.3.1. `BerrySmoothie`

14.1.1.3.2. All words capitalized

14.1.1.3.3. Microsoft SQL Server recommendation

14.1.2. Quoting Identifiers

14.1.2.1. Purpose

14.1.2.1.1. Enable mixed case: `"Customers"`

14.1.2.1.2. Allow spaces: `"trees planted"`

14.1.2.1.3. Permit reserved keywords: `"SELECT"`

14.1.2.2. Consequences

14.1.2.2.1. Must use quotes in all references

14.1.2.2.2. Case sensitivity enforced

14.1.2.2.3. Increased error potential

14.1.2.2.4. Example: `customers` ≠ `"Customers"`

14.1.3. Best Practices

14.1.3.1. Use Snake Case

14.1.3.1.1. `video_on_demand` > `videoondemand`

14.1.3.1.2. Most readable option

14.1.3.1.3. Avoids quoting requirements

14.1.3.2. Naming Guidelines

14.1.3.2.1. Descriptive over cryptic: `arrival_time` > `arv_tm`

14.1.3.2.2. Plural for tables: `teachers`, `vehicles`

14.1.3.2.3. Watch length limitations (PostgreSQL: 63 chars)

14.1.3.2.4. Date-stamp copies: `table_name_YYYY_MM_DD`

14.1.3.3. Consistency

14.1.3.3.1. Follow organizational style guide

14.1.3.3.2. Apply consistently across database

14.1.3.3.3. Avoid mixed conventions

14.2. Controlling column values with constraints

14.2.1. Primary Keys: Natural vs. Surrogate

14.2.1.1. Natural Keys

14.2.1.1.1. Core Concept

14.2.1.1.2. Requirements

14.2.1.1.3. Implementation Methods

14.2.1.1.4. Examples

14.2.1.1.5. Limitations

14.2.1.2. Surrogate Keys

14.2.1.2.1. Core Concept

14.2.1.2.2. Characteristics

14.2.1.2.3. Implementation Types

14.2.1.2.4. Advantages

14.2.1.3. Primary Key Syntax

14.2.1.3.1. Inline constraint

14.2.1.3.2. Table-level constraint

14.2.1.4. Creating a Composite Primary Key

14.2.1.4.1. Example

14.2.1.5. Creating an Auto-Incrementing Surrogate Key

14.2.1.5.1. serial

14.2.1.5.2. bigserial

14.2.1.5.3. smallserial

14.2.1.5.4. Example
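An auto-incrementing surrogate key using a serial type (table and column names are illustrative):

```sql
CREATE TABLE surrogate_key_example (
    order_number bigserial,   -- increments automatically with each INSERT
    product_name varchar(50),
    CONSTRAINT order_key PRIMARY KEY (order_number)
);
```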

14.2.2. Foreign Keys

14.2.2.1. Core Concept

14.2.2.1.1. Column(s) in a table referencing primary key of another table

14.2.2.1.2. Ensures data consistency and referential integrity

14.2.2.1.3. Prevents orphaned rows in related tables

14.2.2.2. Key Features

14.2.2.2.1. Values must exist in the referenced table

14.2.2.2.2. Enforces relationships across tables

14.2.2.2.3. Rejects invalid data inserts or updates

14.2.2.3. Example
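A foreign key declared with REFERENCES; inserts into the child table fail unless the license_id already exists in the parent (table names are illustrative):

```sql
CREATE TABLE licenses (
    license_id varchar(10),
    CONSTRAINT licenses_key PRIMARY KEY (license_id)
);

CREATE TABLE registrations (
    registration_id varchar(10),
    registration_date date,
    license_id varchar(10) REFERENCES licenses (license_id),
    CONSTRAINT registration_key PRIMARY KEY (registration_id, license_id)
);
```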

14.2.3. Automatically Deleting Related Records with CASCADE

14.2.3.1. Core Concept

14.2.3.1.1. Automatically deletes related records in child tables when a parent record is deleted.

14.2.3.1.2. Maintains referential integrity

14.2.3.1.3. Prevents orphaned rows in related tables

14.2.3.2. Syntax

14.2.3.2.1. Add `ON DELETE CASCADE` to foreign key definition

14.2.3.2.2. Example

14.2.4. The CHECK Constraint

14.2.4.1. Core Concept

14.2.4.1.1. Ensures column values meet logical criteria.

14.2.4.1.2. Prevents invalid or nonsensical data

14.2.4.1.3. Rejects data that fails the logical test

14.2.4.2. Syntax

14.2.4.2.1. Column Constraint

14.2.4.2.2. Table Constraint

14.2.4.3. Use Cases

14.2.4.3.1. Restrict range of values

14.2.4.3.2. Limit to valid options

14.2.4.3.3. Ensure relationships between columns
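CHECK constraints expressed at the table level, covering both a value list and a range test (names and values are illustrative):

```sql
CREATE TABLE check_constraint_example (
    user_id bigserial,
    user_role varchar(50),
    salary integer,
    CONSTRAINT user_id_key PRIMARY KEY (user_id),
    CONSTRAINT check_role_in_list CHECK (user_role IN ('Admin', 'Staff')),
    CONSTRAINT check_salary_not_below_zero CHECK (salary >= 0)
);
```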

14.2.5. The UNIQUE Constraint

14.2.5.1. Core Concept

14.2.5.1.1. Ensures column values are unique across all rows.

14.2.5.1.2. Allows multiple `NULL` values (unlike primary keys)

14.2.5.1.3. Prevents duplicate entries in specified columns

14.2.5.2. Syntax

14.2.5.2.1. Define in `CREATE TABLE` statement

14.2.6. The NOT NULL Constraint

14.2.6.1. Core Concept

14.2.6.1.1. Ensures a column cannot contain `NULL` values

14.2.6.1.2. Guarantees a value is present in every row for the specified column

14.2.6.1.3. Prevents empty or unknown data

14.2.6.2. Syntax

14.2.6.2.1. Declared after the column definition

14.2.7. Removing Constraints or Adding Them Later

14.2.7.1. Core Concept

14.2.7.1.1. Constraints can be added or removed after table creation

14.2.7.1.2. Use `ALTER TABLE` for modifying constraints

14.2.7.1.3. Ensures flexibility in database schema management

14.2.7.2. Syntax
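The ALTER TABLE patterns for dropping and adding constraints (all names are placeholders):

```sql
ALTER TABLE table_name DROP CONSTRAINT constraint_name;
ALTER TABLE table_name ADD CONSTRAINT constraint_name PRIMARY KEY (column_name);

-- NOT NULL attaches to the column itself, so it uses ALTER COLUMN
ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;
ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
```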

14.3. Speeding up queries with indexes

14.3.1. B-Tree: PostgreSQL’s Default Index

14.3.1.1. Core Concept

14.3.1.1.1. Default index type in PostgreSQL.

14.3.1.1.2. Automatically created for:

14.3.1.1.3. Supports equality and range operations.

14.3.1.2. Characteristics

14.3.1.2.1. Data Storage

14.3.1.2.2. Structure

14.3.1.3. Supported Operations

14.3.1.3.1. Equality: `=`

14.3.1.3.2. Range: `<`, `<=`, `>=`, `>`, `BETWEEN`

14.3.1.4. Other PostgreSQL Index Types

14.3.1.4.1. Generalized Inverted Index (GIN)

14.3.1.4.2. Generalized Search Tree (GiST)

14.3.1.5. Advantages

14.3.1.5.1. Speeds up search and retrieval operations

14.3.1.5.2. Handles large datasets efficiently

14.3.1.5.3. Supports a wide range of operations

14.3.2. Benchmarking Query Performance with EXPLAIN

14.3.2.1. Core Concept

14.3.2.1.1. EXPLAIN Command

14.3.2.1.2. EXPLAIN with ANALYZE

14.3.2.2. Purpose

14.3.2.2.1. Measure performance improvements before and after adding an index

14.3.2.2.2. Understand database behavior for a query

14.3.2.3. Output

14.3.2.3.1. Query Plan Details

14.3.2.4. Use Cases

14.3.2.4.1. Benchmarking queries

14.3.2.4.2. Debugging slow queries

14.3.2.4.3. Optimizing database performance

14.3.2.5. Recording Some Control Execution Times

14.3.2.5.1. Useful for analyzing and optimizing query performance

14.3.2.6. Adding an Index

14.3.2.6.1. Adding an Index

14.3.2.6.2. Removing an Index
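Index creation and removal in the style of the chapter's address-data benchmarking (table and column names are assumptions):

```sql
-- Build a B-tree index on the street column, then rerun EXPLAIN ANALYZE to compare
CREATE INDEX street_idx ON new_york_addresses (street);

-- Remove it if the benchmark shows no benefit
DROP INDEX street_idx;
```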

14.3.3. Considerations When Using Indexes

14.3.3.1. Core Concept

14.3.3.1.1. Indexes improve query performance but come with storage and maintenance costs.

14.3.3.1.2. Not every column needs an index.

14.3.3.2. Key Considerations

14.3.3.2.1. Database Documentation

14.3.3.2.2. Table Joins

14.3.3.2.3. Query Performance

14.3.3.2.4. Testing and Optimization

15. CHAPTER 8: Extracting Information by Grouping and Summarizing

15.1. Creating the library survey tables

15.1.1. Creating the 2014 Library Data Table

15.1.2. Creating the 2009 Library Data Table

15.2. Exploring the Library Data Using Aggregate Functions

15.2.1. Counting Rows and Values Using count()

15.2.1.1. Core Concept

15.2.1.1.1. `count()` is an aggregate function part of the ANSI SQL standard

15.2.1.1.2. Provides an easy way to count rows and values in a table

15.2.1.1.3. Using `count(*)`
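Counting all rows versus counting non-NULL values in one column; the library table and column names follow the 2014 survey naming used in the chapter, so treat them as assumptions:

```sql
-- All rows, whether or not any column is NULL
SELECT count(*) FROM pls_fy2014_pupld14a;

-- Only rows where salaries is non-NULL; a smaller count flags missing data
SELECT count(salaries) FROM pls_fy2014_pupld14a;
```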

15.2.2. Counting Values Present in a Column

15.2.2.1. Core Concept

15.2.2.1.1. `count(column_name)` counts the number of rows in a specific column that contain **non-NULL values**

15.2.2.1.2. Different from `count(*)`, which counts all rows regardless of `NULL` values

15.2.2.2. Recommendations

15.2.2.2.1. Validate Data

15.2.2.2.2. Expert Consultation

15.2.2.2.3. Broader Analysis

15.2.3. Counting Distinct Values in a Column

15.2.3.1. Core Concept

15.2.3.1.1. Used after `SELECT` to return unique values in a column or combinations of columns

15.2.3.1.2. Can be combined with `count()` to count distinct values

15.2.4. Finding Maximum and Minimum Values Using max() and min()

15.2.4.1. Purpose

15.2.4.1.1. Identify the largest and smallest values in a column

15.2.4.1.2. Gain an understanding of the scope of values

15.2.4.1.3. Detect potential data issues

15.2.4.2. Functions

15.2.4.2.1. `max(column_name)`: Returns the maximum value

15.2.4.2.2. `min(column_name)`: Returns the minimum value

15.2.5. Aggregating Data Using GROUP BY

15.2.5.1. Purpose

15.2.5.1.1. Group results based on one or more column values

15.2.5.1.2. Perform aggregate operations (e.g., `SUM()`, `COUNT()`) for each group

15.2.5.2. Behavior

15.2.5.2.1. Eliminates duplicate values, similar to `DISTINCT`

15.2.5.3. Using GROUP BY on Multiple Columns with count()

15.2.5.3.1. Purpose

15.2.5.3.2. Behavior

15.2.5.4. Revisiting sum() to Examine Library Visits

15.2.5.4.1. Purpose

15.2.5.4.2. Key Techniques

15.2.5.5. Grouping Visit Sums by State

15.2.5.6. Filtering an Aggregate Query Using HAVING
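Pulling the chapter's techniques together: sum visits by state, then keep only the largest aggregates with HAVING (table/column names follow the survey data and the threshold is illustrative; the WHERE filter reflects the survey's use of negative codes for missing values):

```sql
SELECT stabr, sum(visits) AS visits
FROM pls_fy2014_pupld14a
WHERE visits >= 0           -- exclude negative missing-data codes
GROUP BY stabr
HAVING sum(visits) > 50000000
ORDER BY visits DESC;
```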