Machine Learning
by Roman Pavlov
# Machine Learning

## Training process

### Hints

### Normalisation

### Training examples

### Dimension reduction (PCA)

### Plot learning curves

### Regularisation

### Error analysis

### Classification error analysis

## Supervised

### Normal equations

### Gradient descent

## Unsupervised

### Clustering

## Anomaly detection

### Requires proper statistics

### Use when

### Use gradient descent instead when

### Basic model

### Multivariate model


Implement quickest approach first

More data rules

Can a human classify the example?

60% training data

20% validation data

20% test data
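
A minimal sketch of the 60/20/20 split above, assuming the examples fit in memory and can be shuffled freely. The function name `split_dataset` and the fixed `seed` are illustrative choices, not part of the original notes.

```python
import random

def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle the examples and split them into training, validation and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # whatever remains is the test set

train_set, val_set, test_set = split_dataset(range(100))
```

Shuffling before splitting guards against any ordering in the original data leaking into one of the three sets.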

Use when algorithm is slow on full dataset

Use to decrease size required to store data

Do NOT use to fight overfitting; regularize instead

Pick the smallest number of dimensions for which 99% of the variance is retained

Calculate on training set ONLY

Hint: Reduce the data to 2D (or 3D) if we need to plot it
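
The 99% rule can be sketched as follows. This assumes the variance captured by each principal component (for example, the eigenvalues of the training-set covariance matrix) has already been computed. The helper name `choose_num_components` is hypothetical.

```python
def choose_num_components(variances, retain=0.99):
    """Return the smallest k whose top-k components keep `retain` of the total variance."""
    ordered = sorted(variances, reverse=True)  # largest components first
    total = sum(ordered)
    running = 0.0
    for k, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= retain:
            return k
    return len(ordered)

# Illustrative per-component variances, assumed to come from the training set only
k = choose_num_components([6.0, 3.0, 0.95, 0.05])
```

The same projection learned on the training set would then be applied unchanged to the validation and test sets.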

Plot learning curves to decide if more data, more features, etc. are likely to help.

- Plot the error as a function of the number of training examples
  - High error on both the training and validation sets, even with a large training set, means high bias
    - Increase the degree of the polynomial
  - If the error on the validation set starts increasing while the error on the training set is very low and still decreasing, this is high variance (overfitting)
    - Tune the lambda coefficient in regularisation
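
One way to collect the numbers for such a plot is sketched below. The `fit` and `error` callables are placeholders for whatever model is being trained. The toy model here (predict the mean of `y`) is an assumption chosen only to show a high-bias case.

```python
def learning_curve(train, val, fit, error, sizes):
    """Return (m, training error, validation error) for each training-set size m."""
    points = []
    for m in sizes:
        subset = train[:m]
        model = fit(subset)  # train on the first m examples only
        points.append((m, error(model, subset), error(model, val)))
    return points

def fit_mean(data):
    """Toy high-bias model: always predict the mean target value."""
    return sum(y for _, y in data) / len(data)

def half_mse(model, data):
    """Half mean squared error of the constant prediction `model`."""
    return sum((model - y) ** 2 for _, y in data) / (2 * len(data))

train = [(x, 2.0 * x) for x in range(1, 21)]
val = [(x + 0.5, 2.0 * x + 1.0) for x in range(1, 11)]
curve = learning_curve(train, val, fit_mean, half_mse, sizes=[1, 5, 10, 20])
```

Plotting the second and third elements of each tuple against the first gives the two curves to compare.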

Increases the value of the cost function. If the values of the theta parameters grow, regularisation increases the value of the cost function. Theta 0 (the coefficient of the constant term, which equals 1, in the hypothesis function) is not included in the calculation

Prevents overfitting

Lambda parameter
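
A hedged sketch of a regularised cost for linear regression, assuming a squared-error cost and an L2 penalty scaled by lambda. The function name and the convention that each row of `X` starts with the constant 1 are assumptions for illustration.

```python
def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty that skips theta[0]."""
    m = len(y)
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = sum(t * x for t, x in zip(theta, row))
        squared_error += (prediction - target) ** 2
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is not penalised
    return (squared_error + penalty) / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0]]  # first column is the constant term
y = [1.0, 2.0]
cost_plain = regularized_cost([0.0, 1.0], X, y, lam=0.0)
cost_penalised = regularized_cost([0.0, 1.0], X, y, lam=4.0)
```

Larger theta values raise the penalty term, so a larger lambda pushes the fit toward smaller coefficients.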

Manually examine the examples in cross validation set that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.

Classification effectiveness metrics: accuracy, precision, recall, F score

Play with threshold
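
To see how the threshold trades precision against recall, a small sketch follows. It assumes the classifier outputs a probability per example and that labels are 0 or 1. The function name and the sample numbers are made up for illustration.

```python
def precision_recall_f1(probabilities, labels, threshold=0.5):
    """Compute precision, recall and F1 for a given decision threshold."""
    predicted = [p >= threshold for p in probabilities]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

probs = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
at_half = precision_recall_f1(probs, labels, threshold=0.5)
at_high = precision_recall_f1(probs, labels, threshold=0.7)
```

Raising the threshold from 0.5 to 0.7 removes the one false positive in this sample, so precision rises while recall stays the same.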

Does not need normalisation of features

Requires matrix inversion, which is slow
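
A sketch of the normal equation for linear regression, theta = (X^T X)^-1 X^T y, in plain Python. Note the substitution: rather than inverting X^T X explicitly, this version solves the equivalent linear system by Gaussian elimination. The function name and the sample data are illustrative assumptions.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with partial pivoting."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * target for row, target in zip(X, y)) for i in range(n)]
    M = [A[i] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features may be redundant")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    theta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(M[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (M[i][n] - known) / M[i][i]
    return theta

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # first column is the constant term
y = [1.0, 3.0, 5.0]                       # generated from y = 1 + 2x
theta = normal_equation(X, y)
```

No feature normalisation is applied here, in line with the note above. The cost grows quickly with the number of features because the system has one row per feature.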

- Linear regression
  - Number of features larger than number of examples
  - Small number of features, large number of examples
  - Initialization of theta matrix may be zeros
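
As an illustration, here is a minimal batch gradient descent for linear regression with theta initialised to zeros. The learning rate `alpha` and the iteration count are assumed values that would need tuning on real data.

```python
def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent on the squared-error cost, starting from theta = 0."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n  # zero initialisation is fine for linear regression
    for _ in range(iterations):
        errors = [sum(t * x for t, x in zip(theta, row)) - target
                  for row, target in zip(X, y)]
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m
                    for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]  # simultaneous update
    return theta

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # first column is the constant term
y = [1.0, 3.0, 5.0]                       # generated from y = 1 + 2x
theta = gradient_descent(X, y)
```

All components of theta are updated together from the same error vector, which is what "simultaneous update" refers to.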

- Logistic regression
  - Initialization of theta matrix may be zeros

- Neural network
  - Works for all cases but is slow to train
  - Initialization of theta matrix must be random

- SVM (support vector machine)
  - Feature scaling (normalization) is a must
  - C parameter
  - Linear kernel = no kernel
    - Number of features larger than number of examples
    - Small number of features, large number of examples
  - Gaussian kernel
    - Sigma
    - Number of examples is ~times larger than number of features
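
Feature scaling could look like the sketch below, assuming mean normalisation with the population standard deviation. Fitting the statistics on the training set and reusing them for other sets is an assumption here. The helper names `fit_scaler` and `scale` are hypothetical.

```python
import math

def fit_scaler(train_rows):
    """Compute the per-feature mean and standard deviation from the training rows."""
    m = len(train_rows)
    means = [sum(col) / m for col in zip(*train_rows)]
    stds = []
    for col, mu in zip(zip(*train_rows), means):
        std = math.sqrt(sum((v - mu) ** 2 for v in col) / m)
        stds.append(std if std > 0 else 1.0)  # avoid dividing by zero for constant features
    return means, stds

def scale(rows, means, stds):
    """Apply (value - mean) / std to every feature of every row."""
    return [[(v - mu) / s for v, mu, s in zip(row, means, stds)] for row in rows]

train_rows = [[1.0, 100.0], [3.0, 300.0]]
means, stds = fit_scaler(train_rows)
scaled_train = scale(train_rows, means, stds)
scaled_new = scale([[2.0, 400.0]], means, stds)
```

After scaling, both features span comparable ranges, so neither dominates the distance computations inside a kernel.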

- K-Means
  - Algorithm
    - Select K, the desired number of clusters
      - The required number of clusters may already be known
      - Elbow method
    - Select K centroids
      - Starting centroids may be randomly selected examples
    - Split examples into clusters based on distance to the centroids
      - If a cluster is empty, throw it away
    - Recalculate each cluster centroid as the average of the examples in it
    - Repeat for as long as the cost function keeps decreasing
  - Hints
    - If K ~ 2-10, the algorithm should be executed 50-1000 times to avoid local optima and find the lowest possible cost function value
    - If K ~ 100, there is no point in running the algorithm several times, because the first result will be very close to the global optimum for most training set sizes
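
The steps and the restart hint above can be sketched in plain Python as follows. This is an illustrative version, not a tuned implementation. It assumes the points are tuples of floats, stops when the centroids no longer change, and uses made-up function names and a default of 50 runs.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_once(points, k, rng, max_iterations=100):
    """One K-Means run from randomly chosen examples; returns (centroids, cost)."""
    centroids = rng.sample(points, k)  # starting centroids are random examples
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        clusters = [c for c in clusters if c]  # throw away empty clusters
        new_centroids = [tuple(sum(col) / len(c) for col in zip(*c)) for c in clusters]
        if new_centroids == centroids:  # nothing moved, so the cost cannot drop further
            break
        centroids = new_centroids
    cost = sum(min(squared_distance(p, c) for c in centroids) for p in points) / len(points)
    return centroids, cost

def k_means(points, k, runs=50, seed=0):
    """Repeat K-Means from different random starts and keep the lowest-cost result."""
    rng = random.Random(seed)
    return min((k_means_once(points, k, rng) for _ in range(runs)),
               key=lambda result: result[1])

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, cost = k_means(points, k=2)
```

Keeping the result with the lowest cost across runs is what protects against a poor random start when K is small.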

- Training, CV and test sets are used
  - Training set should not have anomalies
  - CV and test sets should have anomalies

Introduce features to catch correlations between features

CV set is used to choose epsilon parameter

F score should be used to evaluate efficiency
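
A hedged sketch of this workflow, assuming the basic model treats each feature as an independent Gaussian. The density is fitted on anomaly-free training data, and epsilon is picked on the CV set by maximising the F score. All function names, the `steps` grid size and the sample data are illustrative.

```python
import math

def fit_gaussian(rows):
    """Per-feature mean and variance estimated from anomaly-free training rows."""
    m = len(rows)
    means = [sum(col) / m for col in zip(*rows)]
    variances = [sum((v - mu) ** 2 for v in col) / m
                 for col, mu in zip(zip(*rows), means)]
    return means, variances

def density(x, means, variances):
    """p(x) as a product of independent one-dimensional Gaussian densities."""
    p = 1.0
    for v, mu, var in zip(x, means, variances):
        p *= math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p

def f1_score(predicted, actual):
    tp = sum(1 for p, y in zip(predicted, actual) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, actual) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, actual) if not p and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def choose_epsilon(cv_rows, cv_labels, means, variances, steps=1000):
    """Scan thresholds between the lowest and highest CV density; keep the best F1."""
    densities = [density(x, means, variances) for x in cv_rows]
    low, high = min(densities), max(densities)
    best_eps, best_f1 = low, -1.0
    for i in range(1, steps + 1):
        eps = low + (high - low) * i / steps
        score = f1_score([p < eps for p in densities], cv_labels)  # flag if p(x) < eps
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

train_rows = [(9.0, 9.5), (10.0, 10.0), (11.0, 10.5), (10.0, 10.0)]  # normal only
cv_rows = [(10.0, 10.0), (9.5, 9.8), (20.0, 2.0)]
cv_labels = [0, 0, 1]  # 1 marks an anomaly
means, variances = fit_gaussian(train_rows)
epsilon, best_f1 = choose_epsilon(cv_rows, cv_labels, means, variances)
```

An example is flagged as an anomaly when its density falls below `epsilon`. The F score is used instead of accuracy because anomalies are rare.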

A very large number (thousands or more) of normal examples

Few (0-20) anomalies

Different types of anomalies - future anomalies may be different from examples

Sufficient number of positive and negative examples

Anomalies share similarities, and future anomalies will likely be similar to those already recorded (spam, for example)

Additional features to catch correlations between features should be introduced manually

Computationally cheap; may work with 10k-100k features

May work even if training set is small

Automatically catches correlations between features

Computationally more expensive

Must have more examples in the training set than features