1. Getting ready
1.1. Data
1.1.1. Peru's "districts"
1.2. Preprocessing
1.2.1. for clustering
1.2.1.1. rescaling
1.2.1.1.1. control outliers
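A minimal sketch of outlier-aware rescaling before clustering, using scikit-learn; the data here are hypothetical stand-ins for the actual district variables:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

rng = np.random.default_rng(123)
X = rng.normal(50, 10, size=(100, 2))   # toy "district" attributes
X[0] = [500, 500]                        # one extreme outlier

# RobustScaler centers on the median and scales by the IQR,
# so a single extreme case does not dominate the rescaling.
robust = RobustScaler().fit_transform(X)

# Plain min-max scaling, by contrast, squeezes all typical cases
# toward 0 once an outlier stretches the observed range.
minmax = MinMaxScaler().fit_transform(X)
```

The choice of `RobustScaler` is one common way to control outliers; trimming or winsorizing before a standard scaler is another.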
1.2.2. for both clustering and regression
1.2.2.1. same direction (monotonicity) of measurement across variables
1.2.2.1.1. reversing may be needed
1.2.2.2. meaningful variable names
1.2.2.2.1. recoding may be needed
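Both steps above can be sketched in pandas; the column names and values below are hypothetical, not taken from the Peru data:

```python
import pandas as pd

# Toy frame: 'p02' is a cryptic census-style name for an indicator where
# higher means *worse*, while 'access' already reads higher = better.
df = pd.DataFrame({"p02": [10.0, 40.0, 25.0],
                   "access": [0.9, 0.2, 0.5]})

# Reverse so that higher always means "better" (same monotonic direction).
df["p02"] = df["p02"].max() - df["p02"]

# Recode the cryptic column name into a meaningful one.
df = df.rename(columns={"p02": "non_deprivation"})
```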
1.3. The Seed
1.3.1. keep the seed set, so you will see the same results I show
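KMeans-style algorithms start from random initial centroids, so results are only reproducible when the seed is fixed. A minimal illustration (the seed value 42 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=5)

rng = np.random.default_rng(42)   # re-seeding replays the same draws
b = rng.normal(size=5)
```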
2. Clustering
2.1. Conventional
2.1.1. Grouping
2.1.1.1. WHAT
2.1.1.1.1. Multiple cases (rows)
2.1.1.2. LOOKING FOR
2.1.1.2.1. Homogeneity
2.1.1.2.2. Heterogeneity
2.1.2. several techniques
2.1.2.1. clusters all cases in the data
2.1.2.1.1. for example: **KMEANS**
2.1.2.2. clusters what "makes sense"
2.1.2.2.1. may leave cases isolated
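The contrast above can be sketched with scikit-learn; DBSCAN is used here as one illustrative example of a method that "clusters what makes sense" (the outline does not name a specific one), on toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
# Two tight blobs plus one isolated point (toy stand-ins for districts).
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2)),
               [[20.0, 20.0]]])

# KMeans assigns *every* case to some cluster, outlier included.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A density-based method only clusters what "makes sense":
# the isolated point is left out with the noise label -1.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
```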
2.1.3. statistically coherent
2.1.4. supportive of policy
2.1.4.1. profiling the clusters
2.1.4.1.1. generally straightforward
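Profiling typically reduces to a group-wise summary of each variable per cluster; a sketch on hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two toy groups of "districts" with distinct income/access levels.
df = pd.DataFrame({
    "income": np.r_[rng.normal(1, 0.1, 20), rng.normal(5, 0.1, 20)],
    "access": np.r_[rng.normal(0.2, 0.02, 20), rng.normal(0.8, 0.02, 20)],
})
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df)

# Profiling: the mean of every variable within each cluster.
profile = df.groupby("cluster").mean()
```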
2.2. Spatial Clustering / **REGIONALIZATION**
2.2.1. Grouping
2.2.1.1. WHAT
2.2.1.2. LOOKING FOR
2.2.1.3. BUT....
2.2.1.3.1. forces contiguity/proximity
2.2.2. several techniques
2.2.2.1. Spatial KMeans (**RegionKMeansHeuristic**)
2.2.2.1.1. input
2.2.2.1.2. process
2.2.2.1.3. output
2.2.2.2. Automatic Zoning Procedure (**AZP**)
2.2.2.2.1. input
2.2.2.2.2. process
2.2.2.2.3. output
2.2.2.3. Spatial 'K'luster Analysis by Tree Edge Removal (**SKATER**)
2.2.2.3.1. input
2.2.2.3.2. process
2.2.2.3.3. output
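SKATER's core idea can be sketched with plain scipy, without spopt: build a minimum spanning tree of the contiguity graph weighted by attribute dissimilarity, cut the most dissimilar edges, and read regions off the connected components. A toy 1-D chain of six "districts" (all values hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

# Attributes on a chain 0-1-2-3-4-5 of neighboring districts.
values = np.array([1.0, 1.1, 1.2, 9.0, 9.1, 9.2])
edges = [(i, i + 1) for i in range(5)]           # contiguity graph

# input: edge weights = attribute dissimilarity between neighbors.
n = len(values)
rows, cols, costs = zip(*[(i, j, abs(values[i] - values[j]))
                          for i, j in edges])
graph = csr_matrix((costs, (rows, cols)), shape=(n, n))

# process: minimum spanning tree, then remove the heaviest edge(s).
mst = minimum_spanning_tree(graph).toarray()
mst[np.unravel_index(np.argmax(mst), mst.shape)] = 0

# output: connected components of the pruned tree are the regions.
n_regions, labels = connected_components(csr_matrix(mst), directed=False)
```

With one cut, the chain splits between districts 2 and 3, where the attribute jump is largest.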
2.2.3. statistically coherent, but...
2.2.3.1. geographical coherence is more important
2.2.4. supportive of policy
2.2.4.1. relevant for regional policy making
2.2.4.1.1. planning
2.2.4.1.2. allocation
2.2.4.1.3. intervention
2.2.4.2. generally, profiling is messy
2.2.4.2.1. more connected to reality
2.3. Non-Spatial vs Spatial trade-off
2.3.1. Compactness
2.3.1.1. The isoperimetric quotient (IPQ)
2.3.1.1.1. from 0 to 1 (1 is best)
2.3.1.1.2. It penalizes long, wiggly, or highly elongated shapes
2.3.1.2. Convex hull ratio (CHR)
2.3.1.2.1. from 0 to 1 (1 is best)
2.3.1.2.2. It penalizes "pitted" or "re-entrant" shapes where the boundary folds inward
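Both compactness measures are one-liners with shapely; the polygons below are toy shapes, and the function names are mine:

```python
import math
from shapely.geometry import Polygon

def isoperimetric_quotient(poly):
    # IPQ = 4*pi*A / P^2: equals 1 only for a circle, and shrinks
    # for long, wiggly, or highly elongated shapes.
    return 4 * math.pi * poly.area / poly.length ** 2

def convex_hull_ratio(poly):
    # CHR = A / area(convex hull): equals 1 for convex shapes, and shrinks
    # for "pitted" shapes whose boundary folds inward.
    return poly.area / poly.convex_hull.area

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
# An elongated strip: still convex (CHR = 1), but a much worse IPQ.
strip = Polygon([(0, 0), (10, 0), (10, 0.1), (0, 0.1)])
```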
2.3.2. Homogeneity
2.3.2.1. Silhouette (SIL) Score
2.3.2.1.1. It measures how similar a data point is to its own cluster compared to other clusters.
2.3.3. Heterogeneity
2.3.3.1. Calinski-Harabasz (CH) Score
2.3.3.1.1. measures the quality of clusters by comparing the between-cluster dispersion (how separated the clusters are) to the within-cluster dispersion (how tight the clusters are)
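Both scores are available in scikit-learn; a sketch on toy, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # > 0, higher is better
```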
2.3.4. further steps
2.3.4.1. profiling
2.3.4.1.1. within techniques
2.3.4.1.2. between techniques
3. Regression
3.1. Conventional OLS
3.1.1. predict/explain
3.1.1.1. average behavior of a variable (the dependent variable)
3.1.1.2. from the behavior of other variables (independent variables)
3.1.1.2.1. predictor (covariates)
3.1.1.2.2. control variables
3.1.2. returns
3.1.2.1. an equation
3.1.2.2. residuals
3.1.2.3. results for interpretation
3.1.2.3.1. Model Fit (comparability)
3.1.2.3.2. Regression Table
3.1.2.4. diagnostics
3.1.2.4.1. No multicollinearity
3.1.2.4.2. Normality of residuals
3.1.2.4.3. Homoscedasticity
3.1.3. key assumptions
3.1.3.1. independence of errors
3.1.3.2. independence of observations
3.2. Spatial regression
3.2.1. if location matters, conventional regression violates two assumptions
3.2.1.1. first step: **Durbin Joint Test**
3.2.1.1.1. is OLS (conventional regression) statistically flawed?
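The Durbin joint test itself needs a spatial econometrics library (e.g. spreg's spatial diagnostics). As a self-contained stand-in showing *why* such tests matter, here is Moran's I of OLS residuals in plain numpy: a model that omits a spatial signal leaves spatially autocorrelated residuals, flagging the failed independence assumption. All data and the chain geometry are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
# A smooth spatial trend deliberately left out of the model.
trend = np.sin(np.linspace(0, 3 * np.pi, n))
y = 2 * x + trend + rng.normal(scale=0.1, size=n)

# OLS of y on x only (the spatial trend is omitted).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

# Row-standardized contiguity weights on a 1-D chain of "districts".
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)

# Moran's I of the residuals: clearly positive values signal spatial
# autocorrelation, i.e. OLS's independence-of-errors assumption fails.
I = (n / W.sum()) * (e @ W @ e) / (e @ e)
```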
3.2.2. location may affect
3.2.2.1. dependent variable **Y**
3.2.2.1.1. SAR REGRESSION
3.2.2.2. residuals **ε**
3.2.2.2.1. SEM REGRESSION
3.2.2.3. independent variable **X**
3.2.2.3.1. SLX Regression
3.2.2.4. or **COMBINATION!**
3.2.2.4.1. independent + dependent
3.2.2.4.2. error + dependent
3.2.2.4.3. error + independent
3.2.2.4.4. error + independent + dependent
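Of the models above, SLX is the one that can be sketched without a spatial econometrics library, since it is just OLS with the spatially lagged covariates WX added. A minimal numpy sketch on simulated data (the chain geometry and true coefficients 1.0, 2.0, 1.5 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Row-standardized contiguity weights on a 1-D chain of "districts".
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)

# Data generated with a true neighbor effect: y depends on X *and* WX.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * (W @ x) + rng.normal(scale=0.1, size=n)

# SLX: ordinary least squares on [1, X, WX]; the WX coefficient
# captures spillover from neighboring districts.
X_slx = np.column_stack([np.ones(n), x, W @ x])
beta, *_ = np.linalg.lstsq(X_slx, y, rcond=None)
```

The fit should recover approximately the intercept, the direct effect on `x`, and the spillover effect on `W @ x`.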