1. Data Visualization
1.1. Line plots
1.2. Boxplots
1.2.1. Symmetric or Skewed
1.2.2. Tightly or loosely grouped
1.2.3. Tukey Boxplots
1.3. Histograms
1.3.1. Bin widths, frequency density
1.3.2. Left/Right skewed, unimodal, bimodal, multimodal
1.4. Bar charts
1.5. Scatter plots
1.5.1. Bubble plots
1.5.2. Enhanced scatter plots
1.5.3. Scatterplot matrix
1.5.4. Caution: 'Overplotting'
1.6. Heatmap
1.7. Parallel Coordinate plots
1.7.1. Axes scaling
2. Pre-processing
2.1. Inconsistent data
2.2. Missing data
2.2.1. MCAR
2.2.2. MAR
2.2.3. MNAR
2.3. Data Cleaning
2.3.1. Data scrubbing
2.3.2. Data discrepancy detection
2.3.3. Data auditing
2.3.4. ETL
2.4. Data Formats
2.4.1. Structured data
2.4.1.1. Relational Database
2.4.1.2. SQL
2.4.1.3. Pandas
2.4.1.4. Spreadsheets: Excel, Google Sheets, OpenOffice
2.4.2. Semi-structured data
2.4.2.1. CSV
2.4.2.2. HTML
2.4.2.3. XML
2.4.2.4. JSON
2.4.2.5. PDF
2.4.3. Unstructured data
2.4.3.1. Text
3. Machine learning
3.1. Supervised
3.1.1. Classification
3.1.2. Regression
3.1.3. Performance
3.1.3.1. Confusion matrix
3.1.3.2. MSE
3.1.3.3. MAE
3.1.3.4. RMSE
3.1.4. Evaluation
3.1.4.1. Train-validation split
3.1.4.2. K-fold cross-validation
3.1.4.3. Bootstrap
3.2. Unsupervised
3.2.1. Clustering
3.2.2. Association
3.3. Feature selection
3.3.1. Wrapper
3.3.2. Filtering
3.3.3. Embedded
4. Text/String Analysis
4.1. Searching
4.1.1. Exact string match
4.1.2. Approximate string match
4.2. Comparison
4.2.1. Minimum edit distance
4.2.2. Levenshtein distance
4.2.3. N-grams
4.2.4. Jaccard similarity
4.2.5. Sorensen-Dice similarity
4.2.6. Cosine similarity
4.2.7. Jaro-Winkler similarity
4.3. Pattern matching
4.3.1. Regular expression (re)
4.4. Pre-processing
4.4.1. Case folding
4.4.2. Tokenisation
4.4.3. Stemming
4.4.4. Lemmatization
4.4.5. Stop word removal
4.4.6. Normalization
4.4.7. Noise removal
4.5. Representation
4.5.1. BoW
4.5.2. Term Frequency
4.5.3. TF-IDF
5. Classification
5.1. Decision Trees
5.1.1. Tree Induction
5.1.1.1. Multi-way split
5.1.1.2. Binary split
5.1.1.3. Discretisation
5.1.1.4. Binary Decision
5.1.1.5. Node Impurity (entropy)
5.2. K-NN
5.2.1. Optimal K-value
5.2.2. Distance measures
5.2.2.1. Euclidean
5.2.2.2. Pearson coefficient
6. Data Linkage
6.1. Record linkage
6.1.1. Pairwise comparison
6.1.2. Blocking
6.1.2.1. Token
6.1.2.2. Phrases
6.1.2.3. N-grams
6.1.2.4. Prefix, suffix
6.1.2.5. Soundex
6.1.3. Evaluation
6.1.3.1. Precision
6.1.3.2. Recall
7. Social and ethical implications
7.1. BDA
7.1.1. Data extraction
7.1.2. Data commodification
7.1.3. Decision Making
7.1.4. Control and monitoring
7.2. 10 Rules
8. Privacy
8.1. Individual Anonymity
8.1.1. K-anonymity
8.1.2. l-diversity
8.1.3. Location and Trajectory
8.1.3.1. Inference attacks
8.1.3.2. Obfuscation
8.2. Local and Global
8.2.1. Differential privacy
8.2.1.1. Privacy loss budget
8.2.1.2. Global sensitivity
9. Recommender system
9.1. Item-based collaborative filtering
9.2. User-based collaborative filtering
10. Data analytics
10.1. Clustering
10.1.1. K-means
10.1.2. VAT
10.1.3. Hierarchical
10.1.4. Agglomerative
10.2. Dimensionality reduction
10.2.1. PCA
11. Linear Regression
11.1. Interpolation vs Extrapolation
11.2. Residual Analysis
11.2.1. SST
11.2.2. SSR
11.2.3. SSE
11.2.4. r^2
11.3. Multiple Regression
12. Correlation
12.1. Discover relationships
12.2. 'Discovering causality' (careful)
12.3. Feature ranking
12.4. Pearson: linear correlation
12.5. Mutual Information: non-linear correlation
12.6. Entropy
13. Privacy
13.1. Hashing
13.1.1. Non-invertible hash function
13.1.2. Hash encoding for exact matching
13.1.2.1. Dictionary attack