COMP30027 Machine Learning

1. Semi-supervised

1.1. Semi-supervised classification

1.1.1. Unlabelled data vastly outnumbers labelled data: |U| >> |L|

1.2. Constrained clustering

1.3. Self training / Bootstrapping

1.3.1. Iteratively train on L, then bring the most confident predictions from U into L

1.3.2. Repeat until L is unchanged (sketched below)
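
A minimal sketch of the self-training loop in Python, assuming a scikit-learn-style classifier with fit/predict_proba; the 0.95 confidence threshold is an arbitrary illustrative choice:

```python
import numpy as np

def self_train(clf, X_l, y_l, X_u, threshold=0.95, max_iter=10):
    """Iteratively grow the labelled set L with confident predictions on U."""
    for _ in range(max_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        probs = clf.predict_proba(X_u)
        confident = probs.max(axis=1) >= threshold    # per-instance confidence
        if not confident.any():                       # L unchanged: stop
            break
        # move the confident instances from U into L with predicted labels
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.classes_[probs[confident].argmax(axis=1)]])
        X_u = X_u[~confident]
    return clf
```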

1.4. Co-training

1.4.1. Train a classifier on each of two feature views, then exchange confidently labelled examples between them

1.5. Active learning

1.5.1. The learner poses queries to be answered by an oracle (e.g. a human annotator)

1.5.2. Membership query synthesis

1.5.3. Stream-based selective sampling (decide, per arriving instance, whether to query its label)

1.5.4. Pool-based selective sampling (choose which instances from a pool to label)

1.5.5. Query strategies (sketched below)

1.5.5.1. Least confident classification

1.5.5.2. Smallest margin, i.e. the first and second most probable classes are very close

1.5.5.3. Highest entropy among the predicted class probabilities

1.5.5.4. Query by committee: highest disagreement among classifiers
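
Sketches of the three uncertainty-based query strategies, assuming probs is an (n_instances, n_classes) array from a classifier's predict_proba:

```python
import numpy as np

def least_confident(probs):
    return 1.0 - probs.max(axis=1)          # low top probability = uncertain

def margin(probs):
    top2 = np.sort(probs, axis=1)[:, -2:]   # second-best and best probabilities
    return top2[:, 1] - top2[:, 0]          # query the SMALLEST margins

def entropy(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)   # query the HIGHEST

# e.g. pool-based sampling: ask the oracle about the most uncertain instance
# query_idx = entropy(probs).argmax()
```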

2. Structured learning

2.1. Instances no longer independent

2.2. Markov chains: mu = (A, Pi)

2.2.1. A=transition matrix

2.2.2. Pi=initial state distribution

2.3. Hidden Markov Models

2.3.1. Definition: an HMM is <A, B, Pi>

2.3.1.1. A: transition matrix (each row sums to 1)

2.3.1.2. B: output/observation probabilities (each row sums to 1)

2.3.1.3. Pi: initial state distribution (sums to 1)

2.3.1.4. Markov assumption: each state depends only on the previous state

2.3.1.5. Output independence: each observation's probability depends only on the current state

2.3.2. Problems to solve

2.3.2.1. Evaluation: given an HMM, what is the probability of a particular observation sequence O? <A, B, Pi> + O -> P(O | A, B, Pi). Easy if we know S1, S2, ... (just crunch the numbers); hard if we don't (see the forward algorithm sketched below).

2.3.2.2. Decoding: given an HMM and an observation sequence O, what is the most likely state sequence? <A, B, Pi> + O -> S1, S2, ...

2.3.2.3. Learning: given the states S and observations O, what is the most reasonable HMM <A, B, Pi>? [S] + O -> <A, B, Pi>
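
For the evaluation problem, a sketch of the forward algorithm in Python (NumPy), using the <A, B, Pi> notation above with integer-coded states and observations:

```python
import numpy as np

def forward(A, B, Pi, obs):
    """P(O | A, B, Pi): A[i, j] = P(s_j | s_i), B[i, k] = P(o_k | s_i)."""
    alpha = Pi * B[:, obs[0]]            # alpha_1(i) = Pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # marginalise over the previous state
    return alpha.sum()                   # marginalise over the final state
```

Decoding uses the same recurrence with a max in place of the sum (the Viterbi algorithm).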

2.4. Maximum entropy Markov Models

2.5. Conditional random fields

3. Supervised learning

3.1. Base classifiers

3.1.1. Classification

3.1.1.1. Algorithms: nearest-neighbour, nearest prototypes, Naive Bayes, decision trees, Support Vector Machines

3.1.1.2. Evaluation: 2-class, N-class

3.1.1.3. Issues: baseline/benchmark

3.1.2. Regression

3.1.2.1. Error/loss functions: Residual Sum of Squares (RSS) = Sum of Squares due to Error (SSE); root mean squared error; root relative squared error; correlation coefficient. These behave relatively similarly, so the exact choice doesn't necessarily matter.

3.1.2.2. Regularisation: L1, L2

3.1.2.3. Linear regression, trees, gradient descent

3.1.2.4. Log-linear models: logistic regression; soft-max converts multi-class scores into probabilities (sketched below)
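
A sketch of soft-max, which turns a vector of multi-class scores into probabilities:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift scores for numerical stability
    e = np.exp(z)
    return e / e.sum()

# softmax(np.array([2.0, 1.0, 0.1])) -> approx [0.66, 0.24, 0.10]
```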

3.1.3. Neural Networks

3.1.3.1. Perceptron: training (sketched below)

3.1.3.2. Multi-layer perceptron: activation functions, output layer, regularisation; training continues until a stopping condition is met

3.1.3.3. Deep learning: many hidden layers, representation learning, convolutional neural networks
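
A minimal sketch of perceptron training for labels in {-1, +1}; the learning rate and epoch cap are illustrative choices:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified instance
                w += lr * yi * xi        # nudge the boundary towards xi
                b += lr * yi
                errors += 1
        if errors == 0:                  # stopping condition met
            break
    return w, b
```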

3.2. Ensemble

3.2.1. Aggregation methods: voting (discrete classes), averaging (continuous values)

3.2.2. Generation methods

3.2.2.1. Instance manipulation: produce multiple base classifiers from different samples of the data

3.2.2.2. Feature manipulation: produce multiple base classifiers from different feature subsets

3.2.2.3. Class label manipulation (?)

3.2.2.4. Algorithm/hyperparameter manipulation: randomly tweak parameters, e.g. randomly subsample the attributes at each node during training (decorrelates the trees)

3.2.3. Bagging ("Bootstrap Aggregation")

3.2.3.1. Generate new datasets by randomly sampling with replacement from the original dataset (sketched below)

3.2.3.2. Motivation: reduce overfitting; each classifier can't "memorise" the training data

3.2.3.3. Recombine through voting or averaging
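
A sketch of the bootstrap step: each base classifier is trained on an n-instance sample drawn with replacement from the n-instance original:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    idx = rng.integers(0, len(X), size=len(X))   # sample WITH replacement
    return X[idx], y[idx]

# rng = np.random.default_rng(0)
# train each base classifier on its own bootstrap_sample(X, y, rng),
# then recombine predictions by voting (classes) or averaging (regression)
```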

3.2.4. Random Forests: bagging with a forest of decision trees; recombine with voting; features can be engineered through linear combinations

3.2.5. Boosting: exaggerate the importance of "hard to classify" instances by iteratively increasing the weights of incorrectly classified instances

3.2.6. Stacking: vote among many different classifiers, or train a meta-classifier over the outputs of many base classifiers

3.3. General

3.3.1. Partitioning

3.3.1.1. Test/training split: holdout

3.3.1.2. Cross-validation (M-fold); "leave one out" cross-validation

3.3.1.3. Validation partition: used to validate hyperparameter choices, kept separate from the final test data

3.3.2. Error analysis: confusion matrix

4. Distance/similarity metrics

4.1. Metric definition

4.1.1. NON-NEGATIVE: dist(x,y) >= 0

4.1.2. EQUALITY MEANS IDENTITY: if dist(x,y) = 0 then x = y

4.1.3. SYMMETRIC: dist(x,y) = dist(y,x)

4.1.4. TRIANGLE INEQUALITY: dist(x,z) <= dist(x,y) + dist(y,z)

4.2. Continuous vector

4.2.1. Minkowski distances: Manhattan (c = 1), Euclidean (c = 2), Chebyshev (c -> inf)
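
A sketch of the Minkowski family; Manhattan and Euclidean are the c = 1 and c = 2 cases, and Chebyshev is the limit as c -> inf:

```python
import numpy as np

def minkowski(x, y, c):
    """(sum_i |x_i - y_i|^c)^(1/c)"""
    return (np.abs(x - y) ** c).sum() ** (1.0 / c)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
minkowski(x, y, 1)       # Manhattan: 7.0
minkowski(x, y, 2)       # Euclidean: 5.0
np.abs(x - y).max()      # Chebyshev (limit c -> inf): 4.0
```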

4.3. Cosine distance

4.4. Discrete vector

4.4.1. Boolean vectors: Jaccard similarity, Dice similarity (sketched below, together with Hamming distance)

4.4.2. Hamming distance
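
Sketches of the boolean-vector measures above (Jaccard and Dice are similarities; Hamming is a distance):

```python
import numpy as np

def jaccard(a, b):
    return (a & b).sum() / (a | b).sum()      # |A ∩ B| / |A ∪ B|

def dice(a, b):
    return 2 * (a & b).sum() / (a.sum() + b.sum())

def hamming(a, b):
    return (a != b).sum()                     # positions where a, b differ

a = np.array([1, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1], dtype=bool)
# jaccard(a, b) = 2/3, dice(a, b) = 4/5, hamming(a, b) = 1
```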

4.5. Linguistic

4.5.1. "Edit" distance? (levenstein?")

4.5.2. Hamming distance?

4.5.3. Bag of words?

4.5.4. N-grams
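
A sketch of Levenshtein edit distance by dynamic programming (minimum insertions, deletions and substitutions to turn s into t):

```python
def levenshtein(s, t):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# levenshtein("kitten", "sitting") == 3
```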

5. General

5.1. Concepts

5.1.1. Structure of an ML problem

5.1.1.1. Instances/exemplars: essentially just feature vectors

5.1.1.2. Attributes/features

5.1.1.3. Concepts/labels/classes

5.1.2. Learner: the algorithm

5.1.3. Classifier: the model

5.1.4. Entropy: measures the "unevenness" of a distribution; high entropy = more unpredictable, low entropy = more predictable (sketched below)
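
A sketch of entropy in bits; the examples show the unpredictable/predictable contrast:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i)"""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

entropy([0.5, 0.5])              # 1.0 bit: maximally unpredictable (2 outcomes)
entropy([0.99, 0.01])            # ~0.08 bits: highly predictable
```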

5.2. Issues

5.2.1. Overfitting

5.2.2. Underfitting

5.2.3. Generalisation beyond training data

5.2.4. Consistency with training data

5.3. Relationships with other fields

5.3.1. Data mining: business applications, practicality, runtime concerns, etc.

5.3.2. Data science: straddles ML and DM; applications, interpretation/meaning/insights

5.3.3. Machine learning: tends to be concerned with theory rather than runtime/practicality

5.3.4. Philosophy

5.4. Probability

5.4.1. Maximum likelihood estimation; Laplace smoothing factor (sketched below)
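
A sketch of a smoothed estimate from counts; alpha = 0 gives the plain maximum likelihood estimate and alpha = 1 gives Laplace ("add-one") smoothing:

```python
import numpy as np

def smoothed_estimate(counts, alpha=1.0):
    """(count_i + alpha) / (N + alpha * k) over k outcomes."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

smoothed_estimate([3, 1, 0], alpha=0)   # MLE: [0.75, 0.25, 0.0] -- zeros hurt
smoothed_estimate([3, 1, 0], alpha=1)   # [4/7, 2/7, 1/7] -- no zero estimates
```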

5.4.2. Entropy of probability distribution

5.4.3. Pointwise mutual information?

5.5. Parametric vs. non-parametric

5.5.1. Parametric: fixed number of parameters; complexity does not tend to grow as the data grows

5.5.2. Non-parametric: doesn't mean no parameters! It means no fixed assumption about the form of the function; complexity can grow as the data grows

6. Data management

6.1. Value types

6.1.1. Categorical/nominal

6.1.1.1. Defined operations: =

6.1.1.2. Boolean is a special case

6.1.1.3. Sometimes called "discrete", but there is no relation or ordering among values

6.1.2. Ordinal: discrete values; defined operations: =, <, >

6.1.3. Continuous/numeric: defined operations: =, <, >, +, -, *, /; subtypes: interval, ratio

6.2. Preprocessing

6.2.1. Transformations

6.2.1.1. Discrete: "one hot" transforms a categorical/nominal attribute into a set of N boolean features (sketched below); ordinal converts to a set of (n-1) booleans with the ordering factored in

6.2.1.2. Continuous: discretisation/quantisation
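
A sketch of the "one hot" transformation for a nominal feature with N categories:

```python
import numpy as np

def one_hot(values, categories):
    """Map each nominal value to N boolean indicator features."""
    return np.array([[v == c for c in categories] for v in values], dtype=int)

one_hot(["red", "blue", "red"], categories=["red", "green", "blue"])
# [[1, 0, 0],
#  [0, 0, 1],
#  [1, 0, 0]]
```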

6.2.2. Features: normalisation, feature vector normalisation, feature weighting, feature selection (wrapper methods), feature engineering

6.2.3. Dimensionality reduction: Principal Component Analysis reduces dimension while maximally preserving variation; solved using an eigenvalue solver (sketched below)
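
A sketch of PCA via the eigenvalue solver, projecting X (n instances, d features) onto its k highest-variance directions:

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                  # centre the data
    cov = np.cov(Xc, rowvar=False)           # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalue solver (symmetric)
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return Xc @ eigvecs[:, order[:k]]        # reduced (n, k) representation
```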

7. Unsupervised learning

7.1. General

7.1.1. Strongly unsupervised: discover the classes in the process of learning

7.1.2. Weakly unsupervised: categorise without the aid of pre-labelled instances

7.2. Clustering

7.2.1. General distinctions: exclusive vs. overlapping; deterministic vs. probabilistic; hard vs. soft (?); hierarchical vs. partitioning; incremental vs. batch clustering (?)

7.2.2. K-means

7.2.2.1. Algorithm: select k random seeds; allocate each point to its nearest seed; reassign seeds as cluster centroids; repeat until no point is reassigned (sketched below)

7.2.2.2. Soft k-means: select k random seeds; assign each point a proximity to each seed (with a stiffness parameter); update each cluster centroid to the sum of all points weighted by proximity; repeat until stable

7.2.2.3. Properties: overlapping, probabilistic, partitioning, batch clustering

7.2.2.4. Expectation Maximisation
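
A sketch of (hard) k-means following the steps above; empty clusters are not handled in this sketch:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # k random seeds
    for _ in range(max_iter):
        # allocate each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # reassign each centroid to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):      # no reassignment: converged
            return labels, centroids
        centroids = new
    return labels, centroids
```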

7.2.3. Finite mixture

7.2.4. Cluster evaluation

7.2.4.1. The goal: high cohesion/compactness within clusters, high separation between clusters, or combined measures

7.2.4.2. Unsupervised: minimise Sum of Squared Errors (distances from centroids); graph-based cohesion (average distance between points within a cluster); prototype-based separation (?); graph-based separation & cohesion (?); silhouette coefficient

7.2.4.3. Supervised: purity, entropy

7.3. Association

7.3.1. Learning associations between features, e.g. A & B -> C