Machine Learning

Training process

Hints

Implement quickest approach first

More data usually wins

Can human classify?

Normalisation

Training examples

60% training data

20% validation data

20% test data
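
A minimal sketch of this 60/20/20 split using scikit-learn; the toy X and y stand in for the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)              # placeholder features
y = np.random.randint(0, 2, size=100)   # placeholder labels

# Carve off 40% first, then split it half/half into validation
# and test, giving the 60/20/20 proportions above.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)
```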

Dimension reduction (PCA)

Use when algorithm is slow on full dataset

Use to decrease size required to store data

Do NOT use it to fight overfitting; regularise instead

Pick the smallest number of dimensions that retains 99% of the variance

Calculate on training set ONLY

Hint: reduce the data to 2D (or 3D) when it needs to be plotted
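
One way these rules translate to scikit-learn; the PCA is fitted on the training set only, and the placeholder arrays are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(200, 50)   # placeholder training data
X_test = np.random.rand(50, 50)     # placeholder test data

# Normalise and fit PCA on the TRAINING set only; n_components=0.99
# keeps the smallest number of dimensions retaining 99% of variance.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.99).fit(scaler.transform(X_train))

# Apply the same, already-fitted transforms to the other sets.
X_train_red = pca.transform(scaler.transform(X_train))
X_test_red = pca.transform(scaler.transform(X_test))
print(pca.n_components_, "dimensions retained")
```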

Plot learning curves

Plot learning curves to decide if more data, more features, etc. are likely to help.

Plot error against the number of training examples

High error on both training and validation sets, even with a large training set, means high bias; increase the degree of the polynomial

If the error on the validation set starts increasing while the error on the training set stays very low and decreasing, this is high variance (overfitting); tune the lambda coefficient in regularisation
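
A possible way to produce such a plot with scikit-learn's learning_curve; the Ridge estimator and synthetic data are stand-ins:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X = np.random.rand(300, 10)                              # placeholder data
y = X @ np.random.rand(10) + 0.1 * np.random.randn(300)

# Mean squared error on training and validation folds for
# increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
    scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation error")
plt.xlabel("training examples")
plt.ylabel("MSE")
plt.legend()
plt.show()
```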

Regularisation

Increases the value of the cost function: as the theta parameters grow, regularisation increases the cost. Theta 0 (the coefficient of the constant = 1 element in the hypothesis function) is not included in the penalty

Prevents overfitting

Lambda parameter
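
A minimal sketch of an L2-regularised linear-regression cost, excluding theta 0 from the penalty as noted above; all names are illustrative:

```python
import numpy as np

def regularised_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty; theta[0], the
    coefficient of the constant feature, is not penalised."""
    m = len(y)
    errors = X @ theta - y
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)  # skip theta 0
    return np.sum(errors ** 2) / (2 * m) + penalty
```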

Error analysis

Manually examine the examples in cross validation set that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.

Classification error analysis

Metrics of classification effectiveness: accuracy, precision, recall, F score

Tune the decision threshold
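
An illustrative threshold sweep with scikit-learn's metrics; the model and data are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

X = np.random.rand(200, 4)              # placeholder features
y = np.random.randint(0, 2, size=200)   # placeholder labels
model = LogisticRegression().fit(X, y)

# Instead of the default 0.5 cut-off, sweep the threshold and
# watch the precision/recall trade-off; pick by F score.
probs = model.predict_proba(X)[:, 1]
for threshold in (0.3, 0.5, 0.7):
    pred = (probs >= threshold).astype(int)
    print(threshold,
          precision_score(y, pred, zero_division=0),
          recall_score(y, pred, zero_division=0),
          f1_score(y, pred, zero_division=0))
```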

Supervised

Normal equations

Does not need normalisation of features

Requires matrix inversion, which is slow
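
The normal equation in NumPy, as a sketch; pinv also copes with a singular X^T X, but the inversion is still roughly O(n^3) in the number of features:

```python
import numpy as np

X = np.random.rand(100, 3)              # placeholder features
y = np.random.rand(100)                 # placeholder targets
Xb = np.hstack([np.ones((100, 1)), X])  # add the constant column

# theta = (X^T X)^(-1) X^T y -- no feature normalisation needed,
# but the matrix inversion dominates the cost.
theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
```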

Gradient descent

Linear regression

Number of features larger than number of examples

Small number of features, large number of examples

Initialisation of the theta matrix may be zeros

Logistic regression

Initialisation of the theta matrix may be zeros

Neural network

Works for all cases but is slow to train

Initialisation of the theta matrix must be random
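
An illustrative batch gradient-descent loop, shown for linear regression; zero initialisation is fine there, while a neural network would need random initial weights:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iters=5000):
    """Batch gradient descent on the squared-error cost."""
    m, n = X.shape
    theta = np.zeros(n)  # zeros are a valid start for regression
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

X = np.hstack([np.ones((100, 1)), np.random.rand(100, 2)])
y = X @ np.array([1.0, 2.0, -1.0])      # noiseless toy targets
print(gradient_descent(X, y))           # approaches [1, 2, -1]
```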

SVM (support vector machine)

Feature scaling is a must (normalisation)

C parameter

Linear kernel = no kernel

Number of features larger than number of examples

Small number of features, large number of examples

Gaussian kernel

Sigma

Number of examples many times larger than number of features
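
A sketch of both kernels with scikit-learn's SVC; note its gamma parameter plays the role of 1/(2*sigma^2), and the data is synthetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(200, 5)              # placeholder features
y = np.random.randint(0, 2, size=200)   # placeholder labels

# Feature scaling is a must before an SVM.
X_scaled = StandardScaler().fit_transform(X)

# Linear kernel ("no kernel") vs Gaussian (RBF) kernel;
# C controls regularisation, gamma the kernel width.
linear_svm = SVC(kernel="linear", C=1.0).fit(X_scaled, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_scaled, y)
```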

Unsupervised

Clustering

K-Means

Algorithm

Select K, the desired number of clusters

Required number of clusters may already be known

Elbow method

Select K centroids

Starting centroids may be randomly selected examples

Split examples into clusters based on distance to centroids

If a cluster is empty, throw it away

Recalculate each cluster centroid as the average of the examples in it

Repeat while the cost function is decreasing

Hints

If K ~ 2-10, execute the algorithm 50-1000 times to avoid local optima and find the lowest possible cost function value

If K ~ 100 there is no sense in running the algorithm several times, because the first result will already be very close to the global optimum on most training set sizes
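
A sketch with scikit-learn's KMeans, whose n_init parameter performs exactly these repeated random restarts; the data is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)  # placeholder data

# n_init=50 reruns the algorithm from 50 random centroid seeds and
# keeps the run with the lowest cost (inertia), as advised above for
# small K; for K ~ 100 a single run (n_init=1) is usually enough.
best = KMeans(n_clusters=5, n_init=50, random_state=0).fit(X)
print(best.inertia_)  # final cost function value

# Elbow method: plot inertia over a range of K and look for the bend.
costs = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
         for k in range(1, 11)]
```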

Anomaly detection

Requires features with well-behaved (roughly Gaussian) statistics

Training, CV and test sets are used

Training set should not have anomalies

CV and test sets should have anomalies

Introduce features to catch correlations between features

CV set is used to choose epsilon parameter

F score should be used to evaluate efficiency
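
A rough sketch of such a detector with a per-feature Gaussian model and epsilon chosen on a labelled CV set by F score; all data here is synthetic:

```python
import numpy as np
from sklearn.metrics import f1_score

# Fit per-feature Gaussians on an anomaly-free training set.
X_train = np.random.randn(1000, 3)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

def density(X):
    """Product of independent Gaussian densities per feature."""
    z = (X - mu) / sigma
    return np.prod(np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi)),
                   axis=1)

# CV set contains a few labelled anomalies (1 = anomaly).
X_cv = np.vstack([np.random.randn(200, 3), 5 + np.random.randn(10, 3)])
y_cv = np.array([0] * 200 + [1] * 10)

# Flag x as anomalous when p(x) < epsilon; pick epsilon by F score.
p = density(X_cv)
candidates = np.linspace(p.min(), p.max(), 1000)
best_eps = max(candidates,
               key=lambda e: f1_score(y_cv, (p < e).astype(int)))
```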

Use when

Many (thousands or more) normal examples

Few (0-20) anomalies

Different types of anomalies: future anomalies may differ from the registered examples

Use gradient descent instead when

Sufficient number of positive and negative examples

Anomalies share similarities, and future ones will likely be similar to those already registered (spam, for example)

Basic model

Additional features to catch correlations between features should be introduced manually

Computationally cheap; may work on 10k-100k features

May work even if training set is small

Multivariate model

Automatically catch correlations between features

Computationally more expensive

Must have more examples in the training set than features
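
For contrast, a sketch of the multivariate model with SciPy; the full covariance matrix is what captures feature correlations automatically:

```python
import numpy as np
from scipy.stats import multivariate_normal

X_train = np.random.randn(1000, 3)  # placeholder, anomaly-free

# Estimating (and inverting) the full n x n covariance is what makes
# this model more expensive and demands more examples than features.
mu = X_train.mean(axis=0)
Sigma = np.cov(X_train, rowvar=False)
model = multivariate_normal(mean=mu, cov=Sigma)

p = model.pdf(X_train)  # densities; flag points with p < epsilon
```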