Machine Learning

1. Models

1.1. Linear Models

1.1.1. Baseline Model

1.1.1.1. Classification

1.1.1.1.1. Most Frequent Target

1.1.1.2. Regression

1.1.1.2.1. Average Target

1.1.1.3. Time Series

1.1.1.3.1. Today's prediction = yesterday's target value (naive forecast)
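
The three baselines above can be sketched with scikit-learn's Dummy estimators, plus a pandas `shift` for the naive time-series forecast; the toy arrays here are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.zeros((5, 1))                       # features are ignored by dummy models
y_cls = np.array([0, 1, 1, 1, 0])          # most frequent class is 1
y_reg = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Classification baseline: always predict the most frequent target
clf = DummyClassifier(strategy="most_frequent").fit(X, y_cls)

# Regression baseline: always predict the average target
reg = DummyRegressor(strategy="mean").fit(X, y_reg)

# Time-series baseline: today's prediction = yesterday's target value
series = pd.Series([1.0, 2.0, 3.0])
naive_forecast = series.shift(1)
```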

1.1.2. Predictive Model

1.1.2.1. Simple Regression Model

1.1.2.1.1. 1 Feature vs 1 Target

1.1.2.2. Multiple Regression Model

1.1.2.2.1. Multiple Features vs 1 Target

1.1.2.3. Ridge Regression Model

1.1.2.3.1. Multiple regression with the slopes (coefficients) shrunk by a penalty so the model generalizes better

1.1.2.4. Logistic Regression Model

1.1.2.4.1. A model that predicts a 0 / 1 outcome as a probability
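
A minimal sketch of the two models above on synthetic data: Ridge's `alpha` controls how strongly the slopes are shrunk, and LogisticRegression's `predict_proba` returns the 0/1 outcome as class probabilities.

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y_reg = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
y_cls = (y_reg > 0).astype(int)

# Larger alpha -> stronger shrinkage -> smaller coefficients overall
small_penalty = Ridge(alpha=0.01).fit(X, y_reg)
large_penalty = Ridge(alpha=100.0).fit(X, y_reg)

# Logistic regression: each row gets P(class 0) and P(class 1)
proba = LogisticRegression().fit(X, y_cls).predict_proba(X)
```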

1.2. Tree-Based Models

1.2.1. Decision Tree Models

1.2.1.1. Decision Tree Classifier

1.2.1.1.1. from sklearn.tree import DecisionTreeClassifier

1.2.1.2. Decision Tree Regressor

1.2.1.2.1. from sklearn.tree import DecisionTreeRegressor
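
Both tree estimators above share the same fit/predict API; a sketch on synthetic data, using `max_depth` as the main overfitting control (the datasets are generated for illustration only).

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = make_classification(n_samples=200, random_state=0)
Xr, yr = make_regression(n_samples=200, random_state=0)

# max_depth limits how deep the tree can grow
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xc, yc)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xr, yr)
```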

1.2.2. Bagging Models

1.2.2.1. Random Forest Models

1.2.2.1.1. Random Forest Classifier

1.2.2.1.2. Random Forest Regressor

1.2.3. Boosting Models

1.2.3.1. Gradient Boosting

1.2.3.1.1. Gradient Boosting Classifier

1.2.3.1.2. Gradient Boosting Regressor

1.2.3.2. XGBoost

1.2.3.2.1. XGB Classifier

1.2.3.2.2. XGB Regressor

1.2.3.3. LightGBM

1.2.3.3.1. LGBM Classifier

1.2.3.3.2. LGBM Regressor
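
All the boosting models above share one idea: each new tree is fit to the residual errors of the ensemble so far. A sketch with scikit-learn's GradientBoostingClassifier; `XGBClassifier` (xgboost) and `LGBMClassifier` (lightgbm) expose a nearly identical fit/predict API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# n_estimators = number of sequential trees;
# learning_rate scales each new tree's contribution
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                   random_state=0)
model.fit(X, y)
train_acc = model.score(X, y)
```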

1.3. Pipeline Modeling

1.3.1. from sklearn.pipeline import make_pipeline

1.3.1.1. pipe = make_pipeline(OneHotEncoder(), SimpleImputer(), LogisticRegression(n_jobs=-1))
pipe.fit(X_train, y_train)
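
A self-contained version of the pipeline above, assuming a tiny made-up categorical dataset. Note two deviations from the node: scikit-learn's own OneHotEncoder stands in for the category_encoders one, and the imputer is placed before the encoder so missing categories are filled first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: one categorical feature with a missing value
X_train = pd.DataFrame({"color": ["red", "blue", None, "red", "blue", "red"]})
y_train = np.array([1, 0, 1, 1, 0, 1])

pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),   # fill missing categories
    OneHotEncoder(handle_unknown="ignore"),    # one column per category
    LogisticRegression(n_jobs=-1),
)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
```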

2. Process

2.1. 1. Reading Data

2.1.1. EDA

2.1.1.1. pip install pandas_profiling (the package is now published as ydata-profiling)

2.1.1.1.1. import pandas_profiling

2.1.2. Check Target Distribution

2.1.2.1. Regression Model

2.1.2.1.1. Check Skewness

2.1.2.1.2. Log Transform
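
The two steps above in code: measure skewness, train on a log-transformed target, and invert predictions with `expm1`; the right-skewed target here is synthetic.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed target (e.g. prices)
y = pd.Series(np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = y.skew()       # strongly positive for a right-skewed target
log_y = np.log1p(y)          # train the model on this transformed target
skew_after = log_y.skew()    # much closer to zero

restored = np.expm1(log_y)   # invert predictions back to the original scale
```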

2.1.2.2. Classification Model

2.1.2.2.1. Imbalanced Target

2.1.3. Data Wrangling

2.1.3.1. Cleaning Data

2.1.3.2. Gathering Data

2.1.4. Data Preprocessing

2.1.4.1. Data Manipulation

2.1.4.2. Feature Engineering

2.1.5. Preprocessing Methods

2.1.5.1. Remove Outliers

2.1.5.1.1. Keep 0.5th percentile <= values <= 99.5th percentile
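
Filtering to a percentile band like the one above is a one-liner with pandas quantiles; the exact cutoffs (0.5% / 99.5% here) are a tuning choice, and the data is synthetic with two planted outliers.

```python
import numpy as np
import pandas as pd

values = np.concatenate([np.random.default_rng(0).normal(size=1000),
                         [100.0, -100.0]])     # planted outliers
s = pd.Series(values)

lo, hi = s.quantile(0.005), s.quantile(0.995)  # percentile cutoffs
filtered = s[(s >= lo) & (s <= hi)]            # drop rows outside the band
```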

2.1.5.2. Encoders

2.1.5.2.1. pip install category_encoders

2.1.5.3. Imputers

2.1.5.3.1. Simple Imputer

2.1.5.3.2. Iterative Imputer

2.1.5.4. Feature Selection

2.1.5.4.1. SelectKBest
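
SelectKBest keeps the k features that score highest against the target under a chosen statistic; a sketch with the ANOVA F-test on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Keep the 3 features with the highest F-score vs. the target
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)
```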

2.2. 2. Splitting Data

2.2.1. Hold-Out Validation

2.2.1.1. Train / Validation / Test Data Split

2.2.1.1.1. from sklearn.model_selection import train_test_split
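
A hold-out split as above is usually done with two calls to `train_test_split`: first carve off the test set, then split validation out of the remainder. The 60/20/20 ratio is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 20% test, then 25% of the remaining 80% as validation -> 60/20/20
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```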

2.2.2. K-Fold Cross Validation

2.2.2.1. Train / Test Data Split

2.2.2.1.1. from sklearn.model_selection import train_test_split

2.2.2.1.2. from sklearn.model_selection import cross_val_score

2.3. 3. Hyperparameter Tuning

2.3.1. Problem

2.3.1.1. https://i.stack.imgur.com/rpqa6.jpg

2.3.2. Solution

2.3.2.1. Randomized Search CV

2.3.2.1.1. from sklearn.model_selection import RandomizedSearchCV

2.3.2.2. Grid Search CV

2.3.2.2.1. from sklearn.model_selection import GridSearchCV
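
The two searches above differ only in how they explore the space: grid search tries every combination, randomized search samples `n_iter` of them. A small sketch with made-up parameter ranges on a random forest.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Grid search: every combination of the listed values (2 x 2 = 4 fits per fold)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4],
                                "n_estimators": [10, 50]},
                    cv=3)
grid.fit(X, y)

# Randomized search: n_iter samples drawn from a distribution
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          param_distributions={"max_depth": randint(2, 8)},
                          n_iter=4, cv=3, random_state=0)
rand.fit(X, y)
```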

2.4. 4. Evaluation

2.4.1. Regression Model

2.4.1.1. MSE (Mean Squared Error)

2.4.1.1.1. from sklearn.metrics import mean_squared_error

2.4.1.2. MAE (Mean Absolute Error)

2.4.1.2.1. from sklearn.metrics import mean_absolute_error

2.4.1.3. RMSE (Root Mean Squared Error)

2.4.1.3.1. Square root of MSE

2.4.1.4. R2 (Coefficient of Determination)

2.4.1.4.1. from sklearn.metrics import r2_score
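
All four regression metrics above on one tiny hand-made example; RMSE has no dedicated import here, it is just the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)              # 1.0 = perfect fit
```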

2.4.2. Classification Model

2.4.2.1. Concepts

2.4.2.1.1. True Positives

2.4.2.1.2. False Positives

2.4.2.1.3. True Negatives

2.4.2.1.4. False Negatives

2.4.2.2. Confusion Matrix

2.4.2.2.1. from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
(plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay.from_estimator instead)

2.4.2.3. Evaluation

2.4.2.3.1. Accuracy

2.4.2.3.2. Precision

2.4.2.3.3. Recall

2.4.2.3.4. F1

2.4.2.4. Classification Report

2.4.2.4.1. from sklearn.metrics import classification_report
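
A confusion matrix and classification report on a small hand-made example; rows of the matrix are actual classes, columns are predicted classes, and the report summarizes precision, recall, and F1 per class.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# cm[i][j] = count of samples with actual class i predicted as class j
cm = confusion_matrix(y_true, y_pred)

# Text summary of precision / recall / F1 per class
report = classification_report(y_true, y_pred)
```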

2.4.2.5. Threshold

2.4.2.5.1. ROC & AUC

2.4.2.5.2. Threshold Optimization
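
One common way to do the two steps above: compute the ROC curve from predicted scores, report AUC, and pick the threshold maximizing TPR - FPR (Youden's J); the scores here are made up, and other threshold criteria are equally valid.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)           # area under the ROC curve

# Youden's J: the threshold where TPR - FPR is largest
best_threshold = thresholds[np.argmax(tpr - fpr)]
```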

2.5. 5. Feature & Target Analysis

2.5.1. Bias / Variance Tradeoff

2.5.1.1. Low Bias Train Data (Overfitting)

2.5.1.1.1. High Variance Test Data

2.5.1.2. High Bias Train Data (Underfitting)

2.5.1.2.1. Low Variance Test Data

2.5.2. Check Leakage

2.5.2.1. Train / Test Contamination

2.5.2.1.1. Is information about the test set mixed into the training data?

2.5.2.2. Target Leakage

2.5.2.2.1. Does the feature have a (near) one-to-one relationship with the target, i.e. leak the answer?

2.5.3. Global Importances

2.5.3.1. Feature Importance

2.5.3.1.1. Mean Impurity Decrease

2.5.3.2. Drop Column Importance

2.5.3.2.1. Performance after dropping each feature one at a time vs. performance with all features
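
The comparison above can be sketched directly: score the model with all features, then refit with each feature removed and record the drop. This is expensive (one refit per feature); the model and data here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=3, random_state=0)

# CV score with all features
base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()

# Drop-column importance: score drop after removing each feature
importances = []
for col in range(X.shape[1]):
    X_dropped = np.delete(X, col, axis=1)
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X_dropped, y, cv=3).mean()
    importances.append(base - score)
```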

2.5.3.3. Permutation Importance

2.5.3.3.1. pip install eli5
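
Besides eli5, scikit-learn ships its own `permutation_importance`: shuffle one feature's values at a time and measure how much the score drops, without any refitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature n_repeats times; the mean score drop is its importance
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
```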

2.5.4. Local (Per-Observation) Importances

2.5.4.1. PDP (Partial Dependence Plots)

2.5.4.1.1. 1 Feature vs Target

2.5.4.1.2. 2 Features vs Target

2.5.4.2. Shapley Values

2.5.4.2.1. pip install shap