EoDP by Mind Map: EoDP

1. Data Visualization

1.1. Line plots

1.2. Boxplots

1.2.1. Symmetric or Skewed

1.2.2. Tightly or loosely grouped

1.2.3. Tukey Boxplots

1.3. Histograms

1.3.1. Bin widths, frequency density

1.3.2. Left/right-skewed, unimodal, bimodal, multimodal

1.4. Bar charts

1.5. Scatter plots

1.5.1. Bubble plots

1.5.2. Enhanced scatter plots

1.5.3. Scatterplot matrix

1.5.4. Caution: 'Overplotting'

1.6. Heatmap

1.7. Parallel Coordinate plots

1.7.1. Axes scaling

2. Pre-processing

2.1. Inconsistent data

2.2. Missing data

2.2.1. MCAR

2.2.2. MAR

2.2.3. MNAR

2.3. Data Cleaning

2.3.1. Data scrubbing

2.3.2. Data discrepancy detection

2.3.3. Data auditing

2.3.4. ETL

2.4. Data Formats

2.4.1. Structured data

2.4.1.1. Relational Database

2.4.1.2. SQL

2.4.1.3. Pandas

2.4.1.4. Spreadsheets: Excel, Google Sheets, OpenOffice

2.4.2. Semi-structured data

2.4.2.1. CSV

2.4.2.2. HTML

2.4.2.3. XML

2.4.2.4. JSON

2.4.2.5. PDF

2.4.3. Unstructured data

2.4.3.1. Text

3. Machine learning

3.1. Supervised

3.1.1. Classification

3.1.2. Regression

3.1.3. Performance

3.1.3.1. Confusion matrix

3.1.3.2. MSE

3.1.3.3. MAE

3.1.3.4. RMSE

3.1.4. Evaluation

3.1.4.1. Train-validation split

3.1.4.2. K-fold cross-validation

3.1.4.3. Bootstrap

3.2. Unsupervised

3.2.1. Clustering

3.2.2. Association

3.3. Feature selection

3.3.1. Wrapper

3.3.2. Filtering

3.3.3. Embedded

4. Text/String Analysis

4.1. Searching

4.1.1. Exact string match

4.1.2. Approximate string match

4.2. Comparison

4.2.1. Minimum edit distance

4.2.2. Levenshtein distance

4.2.3. N-grams

4.2.4. Jaccard similarity

4.2.5. Sørensen-Dice similarity

4.2.6. Cosine-similarity

4.2.7. Jaro-Winkler similarity

4.3. Pattern matching

4.3.1. Regular expression (re)

4.4. Pre-processing

4.4.1. Case folding

4.4.2. Tokenisation

4.4.3. Stemming

4.4.4. Lemmatization

4.4.5. Stop word removal

4.4.6. Normalization

4.4.7. Noise removal

4.5. Representation

4.5.1. BoW

4.5.2. Term Frequency

4.5.3. TF-IDF

5. Classification

5.1. Decision Trees

5.1.1. Tree Induction

5.1.1.1. Multi-way split

5.1.1.2. Binary split

5.1.1.3. Discretisation

5.1.1.4. Binary Decision

5.1.1.5. Node Impurity (entropy)

5.2. K-NN

5.2.1. Optimal K-value

5.2.2. Distance measures

5.2.2.1. Euclidean

5.2.2.2. Pearson coefficient

6. Data Linkage

6.1. Record linkage

6.1.1. Pairwise comparison

6.1.2. Blocking

6.1.2.1. Token

6.1.2.2. Phrases

6.1.2.3. N-grams

6.1.2.4. Prefix, suffix

6.1.2.5. Soundex

6.1.3. Evaluation

6.1.3.1. Precision

6.1.3.2. Recall

7. Social and ethical implications

7.1. BDA

7.1.1. Data extraction

7.1.2. Data commodification

7.1.3. Decision Making

7.1.4. Control and monitoring

7.2. 10 Rules

8. Privacy

8.1. Individual Anonymity

8.1.1. K-anonymity

8.1.2. l-diversity

8.1.3. Location and Trajectory

8.1.3.1. Inference attacks

8.1.3.2. Obfuscation

8.2. Local and Global

8.2.1. Differential privacy

8.2.1.1. Privacy loss budget

8.2.1.2. Global sensitivity

9. Recommender system

9.1. Item-based collaborative filtering

9.2. User-based collaborative filtering

10. Data analytics

10.1. Clustering

10.1.1. K-means

10.1.2. VAT

10.1.3. Hierarchical

10.1.4. Agglomerative

10.2. Dimensionality reduction

10.2.1. PCA

11. Linear Regression

11.1. Interpolation vs Extrapolation

11.2. Residual Analysis

11.2.1. SST

11.2.2. SSR

11.2.3. SSE

11.2.4. r^2

11.3. Multiple Regression

12. Correlation

12.1. Discover relationships

12.2. 'Discovering causality' (careful)

12.3. Feature ranking

12.4. Pearson: linear correlation

12.5. Mutual Information: non-linear correlation

12.6. Entropy

13. Privacy

13.1. Hashing

13.1.1. Non-invertible hash function

13.1.2. Hash encoding for exact matching

13.1.2.1. Dictionary attack

13.2. Salting

13.3. Frequency attack

13.4. Bloom Filters