1. 5. Evaluate the model
1.1. Evaluate the models using a validation approach and a validation data set
1.2. Determine confusion matrix values for classification problems
1.3. Identify methods for k-fold cross-validation if that approach is used
1.4. Further tune hyperparameters for optimal performance
1.5. Compare the machine learning model to the baseline model or heuristic
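The evaluation steps above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the function names (`confusion_counts`, `k_fold_indices`, `majority_baseline`), the toy labels and the choice of a majority-class predictor as the heuristic baseline are all assumptions made for the example. The k-fold helper does not shuffle, so rows are assumed to be in random order already.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def accuracy(counts):
    """Share of predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


def majority_baseline(y_train):
    """Hypothetical heuristic: always predict the most common training label."""
    return Counter(y_train).most_common(1)[0][0]


# Toy data: training labels, validation labels, hypothetical model predictions.
y_train = [0, 0, 1, 0, 1, 0]
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
model_pred = [1, 0, 1, 0, 0, 1, 1, 0]

model_counts = confusion_counts(y_val, model_pred)
baseline_label = majority_baseline(y_train)
baseline_counts = confusion_counts(y_val, [baseline_label] * len(y_val))

print(model_counts)
print("model accuracy:", accuracy(model_counts))
print("baseline accuracy:", accuracy(baseline_counts))
print(list(k_fold_indices(5, 2)))
```

Comparing the two accuracy values shows how much better than the heuristic the model is on the validation set. In practice the same comparison would be repeated on each fold produced by `k_fold_indices`.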
2. 4. Train the model
2.1. Argument
2.2. Example
2.3. Sources
3. 6. Tune parameters
4. 7. Get predictions
5. 1. Gather the data
5.1. Understand and identify the business problem. Setting specific, quantifiable goals helps realize measurable ROI from the machine learning project, rather than leaving it as a proof of concept that is tossed aside later. The goals should relate to the business objectives, not just to machine learning. Machine learning-specific measures such as precision, accuracy, recall and mean squared error can be included in the metrics, but business-relevant key performance indicators (KPIs) are better.
5.1.1. What's the business objective that requires a cognitive solution?
5.1.2. What parts of the solution are cognitive, and what aren't?
5.1.3. Have all the necessary technical, business and deployment issues been addressed?
5.1.4. What are the defined "success" criteria for the project?
5.1.5. How can the project be staged in iterative sprints?
5.1.6. Are there any special requirements for transparency, explainability or bias reduction?
5.1.7. What are the ethical considerations?
5.1.8. What are the acceptable parameters for accuracy, precision and confusion matrix values?
5.1.9. What are the expected inputs to the model and the expected outputs?
5.1.10. What are the characteristics of the problem being solved? Is this a classification, regression or clustering problem?
5.1.11. What is the "heuristic" -- the quick-and-dirty approach to solving the problem that doesn't require machine learning? How much better than the heuristic does the model need to be?
5.1.12. How will the benefits of the model be measured?
5.2. Identify your data needs and determine whether the data is in proper shape for the machine learning project. The focus should be on data identification, initial collection, requirements, quality identification, insights and potentially interesting aspects that are worth further investigation.
5.2.1. Manage sources
5.2.1.1. Where are the sources of the data that's needed for training the model?
5.2.1.2. What quantity of data is needed for the machine learning project?
5.2.1.3. What is the current quantity and quality of training data?
5.2.1.4. How are the test set data and training set data being split?
5.2.1.5. For supervised learning tasks, is there a way to label that data?
5.2.1.6. Can pre-trained models be used?
5.2.2. Ingest data
5.2.2.1. Where is the operational and training data located?
5.2.2.2. Are there special needs for accessing real-time data on edge devices or in more difficult-to-reach places?
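One common answer to the question above about splitting test and training data is a shuffled hold-out split. The sketch below assumes the rows fit in memory and that a simple random split is appropriate (it is not for time-ordered data). The function name `train_test_split`, the 20% default and the seed are illustrative choices, not something the outline prescribes.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for real records
train_rows, test_rows = train_test_split(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, so later model comparisons use the same test rows.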
6. 2. Prepare the data: Data preparation is often referred to informally as data prep. It's also known as data wrangling, although some practitioners use that term in a narrower sense to refer to cleansing, structuring and transforming data as part of the overall data preparation process, distinguishing it from the data pre-processing stage.
6.1. Cleansing, aggregation, augmentation, labeling, normalization and transformation, as well as any other activities for structured, unstructured and semi-structured data: handling missing values; correcting or deleting errors or noise; standardizing formats across different data sources
6.2. Categorical data
6.3. Feature engineering: the process of choosing the important features (or columns) to look at and making the appropriate transformations to prepare the data for the model. After testing the model on the available data, we might go back and re-engineer features to see if we get a better result
6.3.1. Normalizing or standardizing the data
6.3.2. Augmenting the data by adding new columns
6.3.3. Removing unnecessary columns
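The preparation and feature engineering steps above (handling missing values, standardizing, adding columns, removing columns) can be sketched on a small table held as a list of dictionaries. The column names, the mean-fill strategy and the helper names are assumptions made for illustration; a real project would pick a fill strategy that suits the data.

```python
from statistics import mean, pstdev


def fill_missing(rows, column):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def standardize(rows, column):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [{**r, column: 0.0} for r in rows]
    return [{**r, column: (r[column] - mu) / sigma} for r in rows]


def add_column(rows, name, fn):
    """Augment each row with a new column computed from the existing ones."""
    return [{**r, name: fn(r)} for r in rows]


def drop_columns(rows, names):
    """Remove columns that are not useful to the model."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]


# Hypothetical toy table with one missing value.
rows = [
    {"id": 1, "height_m": 2.0, "width_m": 1.0},
    {"id": 2, "height_m": None, "width_m": 2.0},
    {"id": 3, "height_m": 4.0, "width_m": 3.0},
]

rows = fill_missing(rows, "height_m")
rows = add_column(rows, "area_m2", lambda r: r["height_m"] * r["width_m"])
rows = standardize(rows, "width_m")
rows = drop_columns(rows, {"id"})
print(rows)
```

The derived column is computed before standardizing so that it uses the original units. Reordering these calls is one way to re-engineer features after a first round of testing.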
7. 3. Select a model
7.1. Regression
7.1.1. Minimum rows: 200x the number of features
7.2. Classification (binary or multi-class)
7.2.1. Minimum rows: 50x the number of features
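The row-count rules of thumb above reduce to a simple multiplication. The sketch below encodes the two multipliers from the outline; the function names and the example feature counts are hypothetical, and the thresholds should be read as rough guidance rather than hard limits.

```python
# Multipliers taken from the outline: rows needed per feature.
ROWS_PER_FEATURE = {"regression": 200, "classification": 50}


def minimum_rows(problem_type, n_features):
    """Return the rule-of-thumb minimum number of training rows."""
    if problem_type not in ROWS_PER_FEATURE:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    return ROWS_PER_FEATURE[problem_type] * n_features


def has_enough_rows(problem_type, n_features, n_rows):
    """Check a data set against the rule-of-thumb minimum."""
    return n_rows >= minimum_rows(problem_type, n_features)


print(minimum_rows("regression", 10))
print(minimum_rows("classification", 10))
print(has_enough_rows("classification", 10, 300))
```

A check like this fits naturally into the data-gathering step, where the quantity of available training data is assessed.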