Machine Learning


1. Models

1.1. Linear Models

1.1.1. Baseline Model
    Classification: Most Frequent Target
    Regression: Average Target
    Time Series: today's prediction = previous day's Target value
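The three baselines above can be sketched in plain Python. The toy lists are made-up values for illustration only, and the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy targets, for illustration only.
y_class = ["no", "yes", "no", "no", "yes"]
y_reg = [10.0, 12.0, 14.0, 16.0]
y_series = [100, 102, 101, 105]  # one value per day, in order

# Classification baseline: always predict the most frequent class.
majority_class = Counter(y_class).most_common(1)[0][0]

# Regression baseline: always predict the mean of the target.
mean_target = mean(y_reg)

# Time-series baseline: predict each day with the previous day's target.
naive_forecast = y_series[:-1]  # predictions for day 2 onward
actual = y_series[1:]
naive_mae = mean(abs(a - p) for a, p in zip(actual, naive_forecast))
```

Any predictive model should beat these numbers on the same metric before it is worth keeping.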

1.1.2. Predictive Model
    Simple Regression Model: 1 Feature vs 1 Target
    Multiple Regression Model: Multiple Features vs 1 Target
    Ridge Regression Model: Multiple Regression with the slopes (coefficients) adjusted so that the model generalizes better
    Logistic Regression Model: predicts a 0 / 1 outcome as a probability
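A minimal sketch of the Ridge and Logistic Regression entries, assuming scikit-learn and NumPy are installed. The synthetic data and the value alpha=10.0 are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge

# Synthetic data: 30 rows, 5 features, only 3 of which affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

# Multiple regression vs. ridge: ridge penalizes large slopes.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)  # smaller than ols_norm

# Logistic regression: predict the probability of class 1.
y_bin = (y > 0).astype(int)
clf = LogisticRegression().fit(X, y_bin)
proba = clf.predict_proba(X)[:, 1]
```

Comparing `ridge_norm` with `ols_norm` shows the shrinkage effect; how much shrinkage helps generalization depends on the data and on `alpha`.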

1.2. Tree-Based Models

1.2.1. Decision Tree Models
    Decision Tree Classifier: from sklearn.tree import DecisionTreeClassifier
    Decision Tree Regressor: from sklearn.tree import DecisionTreeRegressor

1.2.2. Bagging Models
    Random Forest Models: Random Forest Classifier, Random Forest Regressor

1.2.3. Boosting Models
    Gradient Boosting: Gradient Boosting Classifier, Gradient Boosting Regressor
    XGBoost: XGB Classifier, XGB Regressor
    Light GBM: LGBM Classifier, LGBM Regressor
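The tree-based families above share the same fit / score interface, so they can be compared side by side. This is a rough sketch using only scikit-learn estimators on synthetic data. XGBoost and Light GBM are left out to avoid extra dependencies, though their scikit-learn wrappers can be dropped into the same dictionary. All hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training split and record validation accuracy.
val_accuracy = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in models.items()
}
```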

1.3. Pipeline Modeling

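1.3.1. make_pipeline
    # OneHotEncoder is assumed to come from category_encoders (see Encoders under Preprocessing Methods)
    from category_encoders import OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(
        OneHotEncoder(),
        SimpleImputer(),
        LogisticRegression(n_jobs=-1)
    )
    pipe.fit(X_train, y_train)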

2. Process

2.1. 1. Reading Data

2.1.1. EDA
    pip install pandas_profiling (newer releases are published as ydata-profiling)
    import pandas_profiling

2.1.2. Check Target Distribution
    Regression Model: Check Skewness, Log Transform
    Classification Model: Imbalanced Target
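A small sketch of both target checks, assuming NumPy only. The `skewness` helper is a hypothetical convenience function (mean of cubed z-scores), and the lognormal target and 90/10 labels are synthetic.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

# Regression: a right-skewed synthetic target.
rng = np.random.default_rng(0)
target = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

skew_before = skewness(target)
log_target = np.log1p(target)  # log(1 + y); invert predictions with np.expm1
skew_after = skewness(log_target)

# Classification: class shares reveal an imbalanced target.
labels = np.array([0] * 90 + [1] * 10)
classes, counts = np.unique(labels, return_counts=True)
class_share = dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))
```

If the model is trained on `log_target`, its predictions need to be mapped back with `np.expm1` before computing metrics in the original units.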

2.1.3. Data Wrangling
    Cleaning Data
    Gathering Data

2.1.4. Data Preprocessing
    Data Manipulation
    Feature Engineering

2.1.5. Preprocessing Methods
    Remove Outliers: 0.05% <= values <= 99.5%
    Encoders: pip install category_encoders
    Imputers: Simple Imputer, Iterative Imputer
    Feature Selection: SelectKBest
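Two of the methods above, sketched with NumPy and scikit-learn on synthetic data. The percentile bounds are taken from the outline. The `prices` array, the choice of `f_regression` as the scoring function, and k=3 are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Remove outliers: keep values between the 0.05th and 99.5th percentiles.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=10, sigma=1, size=2000)  # hypothetical column
low, high = np.percentile(prices, [0.05, 99.5])
mask = (prices >= low) & (prices <= high)
trimmed = prices[mask]

# Feature selection: keep the k features with the highest univariate score.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)
selected_columns = selector.get_support(indices=True)
```

To avoid leakage, the percentile bounds and the selector should be fitted on the training split only and then applied to validation and test data.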

2.2. 2. Splitting Data

2.2.1. Hold-Out Validation
    Train / Validation / Test Data Split
    from sklearn.model_selection import train_test_split
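`train_test_split` produces two parts per call, so one common way to get a three-way hold-out split is to call it twice. The sketch below assumes a 60 / 20 / 20 split on placeholder arrays. The ratios and the random seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 20% of all rows as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80 rows becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)
```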

2.2.2. K-Fold Cross Validation
    Train / Test Data Split
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
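With K-fold cross validation, only a test set is held out and the training portion is rotated through k train / validation folds. A minimal sketch, assuming 5 folds, a logistic regression model, and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out the test set once; cross validation only sees the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One accuracy score per fold.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring="accuracy"
)
cv_mean, cv_std = scores.mean(), scores.std()
```

The spread in `scores` gives a rough sense of how sensitive the model is to the particular rows it was trained on.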

2.3. 3. Hyperparameter Tuning

2.3.1. Problem

2.3.2. Solution
    Randomized Search CV: from sklearn.model_selection import RandomizedSearchCV
    Grid Search CV: from sklearn.model_selection import GridSearchCV
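A sketch of both search strategies on a small random forest. The parameter ranges, `n_iter`, and `cv` values are arbitrary assumptions chosen to keep the run short. Randomized search samples a fixed number of combinations, while grid search tries every combination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Randomized search: evaluate 5 sampled combinations out of 36 possible ones.
param_space = {
    "n_estimators": [10, 20, 40],
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 3, 5],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

# Grid search: evaluate every value in a narrower grid.
grid_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=3,
)
grid_search.fit(X, y)

best_random = random_search.best_params_
best_grid = grid_search.best_params_
```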

2.4. 4. Evaluation

2.4.1. Regression Model
    MSE (Mean Squared Error): from sklearn.metrics import mean_squared_error
    MAE (Mean Absolute Error): from sklearn.metrics import mean_absolute_error
    RMSE (Root Mean Squared Error): square root of MSE
    R2 (Coefficient of Determination): from sklearn.metrics import r2_score
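The four regression metrics on a tiny made-up example. RMSE is computed here as the square root of MSE with NumPy, which works regardless of the scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # 0.375
mae = mean_absolute_error(y_true, y_pred)  # 0.5
rmse = float(np.sqrt(mse))                 # same units as the target
r2 = r2_score(y_true, y_pred)
```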

2.4.2. Classification Model
    Concepts: True Positives, False Positives, True Negatives, False Negatives
    Confusion Matrix
        from sklearn.metrics import plot_confusion_matrix (removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay)
        import matplotlib.pyplot as plt
    Evaluation: Accuracy, Precision, Recall, F1
    Classification Report: from sklearn.metrics import classification_report
    Threshold: ROC & AUC, Threshold Optimization
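The classification concepts above, computed on eight made-up predictions. The 0.5 threshold is an assumption. Moving it trades precision against recall, while AUC is computed from the probabilities and does not depend on the threshold.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])

# Turn probabilities into class predictions with a threshold.
threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_proba)         # uses probabilities, not y_pred
```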

2.5. 5. Feature & Target Analysis

2.5.1. Bias / Variance Tradeoff
    Overfitting: Low Bias on Train Data, High Variance on Test Data
    Underfitting: High Bias on Train Data, Low Variance on Test Data
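One way to see the tradeoff is to vary model complexity and compare train and test scores. This sketch assumes a decision tree regressor on a noisy synthetic sine curve. The depth values are arbitrary stand-ins for "too simple", "moderate", and "unconstrained".

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data: y = sin(x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Record (train R2, test R2) for each complexity level.
scores = {}
for name, depth in [("underfit", 1), ("moderate", 3), ("overfit", None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
```

A large gap between the train and test score points to overfitting. Low scores on both point to underfitting.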

2.5.2. Check Leakage
    Train / Test Contamination: is information about the test set mixed into the train data?
    Target Leakage: does the feature have a one-to-one relationship with the target it is used to predict?
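Two quick heuristics for the leakage checks above, sketched with NumPy. The feature names, the 0.99 correlation cutoff, and the row identifiers are all hypothetical. A near-perfect correlation is only a warning sign, not proof of leakage.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Hypothetical features; "leaky_copy" is a deterministic function of the target.
features = {
    "noise": rng.normal(size=n),
    "weak_signal": target + rng.normal(scale=2.0, size=n),
    "leaky_copy": target * 10.0,
}

# Target leakage: flag features that track the target almost one-to-one.
suspicious = [
    name
    for name, values in features.items()
    if abs(np.corrcoef(values, target)[0, 1]) > 0.99
]

# Train / test contamination: the same row identifier in both splits.
train_ids = {1, 2, 3, 4, 5}
test_ids = {5, 6, 7}
contaminated_ids = train_ids & test_ids
```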

2.5.3. Overall Importances
    Feature Importance: Mean Impurity Decrease
    Drop Column Importance: performance with one feature removed at a time vs. performance with all features
    Permutation Importance: pip install eli5
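The three importance measures above, sketched on synthetic data. Note one substitution: the outline points to eli5 for permutation importance, while this sketch uses `sklearn.inspection.permutation_importance` to avoid an extra dependency. The `fit_score` helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def fit_score(columns):
    """Fit on the given columns and return (model, validation R2)."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train[:, columns], y_train)
    return model, model.score(X_val[:, columns], y_val)

all_columns = list(range(X.shape[1]))
full_model, full_score = fit_score(all_columns)

# 1) Mean impurity decrease, read from the fitted forest.
mdi = full_model.feature_importances_

# 2) Drop-column importance: score with all features minus score without one.
drop_column = np.array([
    full_score - fit_score([c for c in all_columns if c != dropped])[1]
    for dropped in all_columns
])

# 3) Permutation importance: shuffle one column at a time on validation data.
perm = permutation_importance(full_model, X_val, y_val, n_repeats=5, random_state=0)
perm_mean = perm.importances_mean
```

Drop-column importance retrains the model once per feature, so it is the most expensive of the three. Permutation importance reuses the already fitted model.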

2.5.4. Individual Importances
    PDP (Partial Dependence Plots): 1 Feature vs Target, 2 Features vs Target
    Shapley Values: pip install shap
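A sketch of the partial dependence computation behind a PDP, assuming scikit-learn's `partial_dependence` function with `kind="average"` on synthetic data. Only the averaged predictions are read from the result. Plotting and Shapley values (via shap) are not shown here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# 1 Feature vs Target: average prediction over a 20-point grid of feature 0.
pdp_one = partial_dependence(
    model, X, features=[0], kind="average", grid_resolution=20
)
curve = pdp_one["average"][0]    # shape (20,)

# 2 Features vs Target: average prediction over a 10 x 10 grid of features 0 and 1.
pdp_two = partial_dependence(
    model, X, features=[0, 1], kind="average", grid_resolution=10
)
surface = pdp_two["average"][0]  # shape (10, 10)
```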