1. Semi-supervised
1.1. Semi-supervised classification
1.1.1. |U| >> |L|: far more unlabelled than labelled data
1.2. Constrained clustering
1.3. Self training / Bootstrapping
1.3.1. Iteratively train on L, then bring in confidently-classified examples from U (sketch after this list)
1.3.2. Repeat until L is unchanged
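A minimal Python sketch of this loop; the `LogisticRegression` base learner and the 0.95 confidence threshold are illustrative choices, not fixed by the notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any probabilistic classifier works

def self_train(X_l, y_l, X_u, threshold=0.95, max_rounds=20):
    """Grow the labelled set L with confident predictions on U, then retrain."""
    for _ in range(max_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
        if len(X_u) == 0:
            break
        probs = clf.predict_proba(X_u)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():                      # L unchanged -> stop
            break
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.classes_[probs[confident].argmax(axis=1)]])
        X_u = X_u[~confident]
    return clf
```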
1.4. Co-training
1.4.1. Train two classifiers on two feature views; add instances both are confident about (the intersection), etc.
1.5. Active learning
1.5.1. The learner poses queries, to be answered by oracle/human
1.5.2. Membership query synthesis
1.5.3. Stream-based selective sampling (choose whether to label each incoming instance)
1.5.4. Pool-based selective sampling (choose which instances from a pool to label)
1.5.5. Query strategies
1.5.5.1. Least confident classification
1.5.5.2. Smallest margin, i.e. first and second most probable classes are very close
1.5.5.3. Highest entropy among classifications
1.5.5.4. Query by committee: highest disagreement among classifiers
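A sketch of the first three query strategies, scoring a pool of instances from a `predict_proba`-style matrix; the example probabilities are made up.

```python
import numpy as np

def least_confident(probs):
    """Uncertainty = 1 - probability of the most likely class."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Difference between the top two class probabilities; small = uncertain."""
    srt = np.sort(probs, axis=1)
    return srt[:, -1] - srt[:, -2]

def entropy(probs):
    """High entropy over the predicted distribution = uncertain."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# probs: (n_instances, n_classes), e.g. from clf.predict_proba(pool)
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.5, 0.5]])
query_idx = np.argmin(margin(probs))   # or argmax of least_confident / entropy
```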
2. Structured learning
2.1. Instances no longer independent
2.2. Markov chains mu=(A,Pi)
2.2.1. A=transition matrix
2.2.2. Pi=initial state distribution
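A worked example of scoring a state sequence under mu = (A, Pi); the 2-state matrices are invented for illustration.

```python
import numpy as np

A  = np.array([[0.7, 0.3],    # transition matrix: each row sums to 1
               [0.4, 0.6]])
Pi = np.array([0.6, 0.4])     # initial state distribution: sums to 1

def sequence_prob(states, A, Pi):
    """P(s1..sT) = Pi[s1] * prod_t A[s_{t-1}, s_t]."""
    p = Pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(sequence_prob([0, 0, 1], A, Pi))   # 0.6 * 0.7 * 0.3 = 0.126
```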
2.3. Hidden Markov Models
2.3.1. Hidden Markov Model
2.3.1.1. A: transition matrix
2.3.1.1.1. sum of any row = 1
2.3.1.2. B: output / observation probabilities
2.3.1.2.1. sum of any row = 1
2.3.1.3. Pi: initial state distribution
2.3.1.3.1. sum = 1
2.3.1.4. Markov assumption: each state depends only on the previous state
2.3.1.5. Also: observation probability depends only on the state
2.3.2. Problems to solve
2.3.2.1. Evaluation: Given HMM, what is the probability of a particular O sequence?
2.3.2.1.1. <A,B,Pi> + O -> P(O|A,B,Pi)
2.3.2.1.2. Easy if we know S1,S2... just crunch numbers
2.3.2.1.3. Hard if we don't know S1,S2...
2.3.2.2. Decoding: Given HMM and O sequence, what is the most likely state sequence?
2.3.2.2.1. <A,B,Pi>+O -> S1,S2,...
2.3.2.3. Learning: Given the states S and obs O, what is the most reasonable HMM model <A,B,Pi>?
2.3.2.3.1. [S]+O -> <A,B,Pi>
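A compact sketch of the decoding problem (Viterbi) in the <A,B,Pi> notation above; the evaluation problem (forward algorithm) has the same structure with a sum in place of the max. Assumes observations are given as integer symbol indices and all probabilities are non-zero.

```python
import numpy as np

def viterbi(obs, A, B, Pi):
    """Most likely state sequence for observation indices `obs` given <A, B, Pi>."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))           # best log-prob of a path ending in each state
    back  = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(Pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from state i into j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]          # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```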
2.4. Maximum entropy Markov Models
2.5. Conditional random fields
3. Supervised learning
3.1. Base classifiers
3.1.1. Classification
3.1.1.1. Algorithms
3.1.1.1.1. Nearest-neighbour
3.1.1.1.2. Neighbour prototypes
3.1.1.1.3. Naive Bayes
3.1.1.1.4. Decision trees
3.1.1.1.5. Support Vector Machines
3.1.1.2. Evaluation
3.1.1.2.1. 2-class
3.1.1.2.2. N-class
3.1.1.2.3. Issues
3.1.1.2.4. Baseline/benchmark
3.1.2. Regression
3.1.2.1. Error/loss functions
3.1.2.1.1. Residual Sum of Squares (RSS) = Sum of Squares due to Error (SSE)
3.1.2.1.2. Root mean squared
3.1.2.1.3. Root relative squared error
3.1.2.1.4. Correlation coefficient
3.1.2.1.5. These are relatively stable, so the exact choice doesn't necessarily matter (sketch below)
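The loss functions above as a small NumPy sketch (assumes `y_true` and `y_pred` are NumPy arrays).

```python
import numpy as np

def regression_errors(y_true, y_pred):
    residuals = y_true - y_pred
    rss  = np.sum(residuals ** 2)                                 # RSS / SSE
    rmse = np.sqrt(np.mean(residuals ** 2))                       # root mean squared error
    rrse = np.sqrt(rss / np.sum((y_true - y_true.mean()) ** 2))   # relative to predicting the mean
    corr = np.corrcoef(y_true, y_pred)[0, 1]                      # correlation coefficient
    return {"RSS": rss, "RMSE": rmse, "RRSE": rrse, "r": corr}
```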
3.1.2.2. Regularisation
3.1.2.2.1. L1
3.1.2.2.2. L2
3.1.2.3. Linear Regression
3.1.2.3.1. Trees
3.1.2.3.2. Gradient descent
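A sketch tying the last two items together: linear regression fit by gradient descent with an optional L2 (ridge) penalty; the learning rate and penalty strength are arbitrary illustrative defaults.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.01, l2=0.1, epochs=1000):
    """Linear regression by batch gradient descent, with an L2 penalty on the weights."""
    X = np.hstack([np.ones((len(X), 1)), X])      # add a bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(X)                      # squared-error gradient
        grad += 2 * l2 * np.concatenate(([0.0], w[1:]))            # don't penalise the bias
        w -= lr * grad
    return w    # w[0] is the intercept
```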
3.1.2.4. Log-linear model
3.1.2.5. Logistic regression
3.1.2.5.1. Softmax to convert multi-class scores into probabilities (sketch below)
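The softmax conversion as a sketch (the scores are made up).

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    z = scores - scores.max()     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.66, 0.24, 0.10]
```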
3.1.3. Neural Networks
3.1.3.1. Perceptron
3.1.3.1.1. Training
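A minimal sketch of the classic perceptron update rule (assumes labels in {-1, +1}); the learning rate and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=10):
    """Update weights only on misclassified instances."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:    # wrong side of (or on) the boundary
                w += lr * yi * xi
                b += lr * yi
    return w, b
```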
3.1.3.2. Multi-layered Perceptron
3.1.3.2.1. Activation functions
3.1.3.2.2. Continue until stopping condition is met
3.1.3.2.3. Output layer
3.1.3.2.4. Training
3.1.3.2.5. Regularisation
3.1.3.3. Deep learning
3.1.3.3.1. Many hidden layers
3.1.3.3.2. Representation learning
3.1.3.3.3. Convolutional neural networks
3.2. Ensemble
3.2.1. Aggregation methods
3.2.1.1. Voting (discrete classes)
3.2.1.2. Average (continuous)
3.2.2. Generation methods
3.2.2.1. Instance manipulation: produce multiple base classifiers from different samples of the instances
3.2.2.2. Feature manipulation: produce multiple base classifiers from different feature subsets
3.2.2.3. Class label manipulation (e.g. recode the classes into different groupings)
3.2.2.4. Algorithm/hyperparameter manipulation: randomly tweak parameters
3.2.2.4.1. e.g. randomly subsample the attributes at each node when growing trees (decorrelates the trees)
3.2.3. Bagging
3.2.3.1. Generate new datasets by randomly sampling with replacement from original dataset
3.2.3.2. "Bootstrap Aggregation"
3.2.3.3. Motivation: reduce overfitting
3.2.3.3.1. Each classifier can't "memorize" the training data
3.2.3.4. Recombine through voting or averaging (sketch below)
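A bagging sketch: bootstrap sampling with replacement plus majority-vote recombination. Decision trees as the base classifier and integer class labels are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, random_state=0):
    """Train one tree per bootstrap sample (sampling with replacement)."""
    rng = np.random.default_rng(random_state)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))      # bootstrap indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Recombine by majority vote over the base classifiers (integer labels assumed)."""
    votes = np.array([m.predict(X) for m in models])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```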
3.2.4. Random Forests
3.2.4.1. Bagging - forest of decision trees
3.2.4.2. Recombine with voting
3.2.4.3. Can engineer features through linear combinations
3.2.5. Boosting
3.2.5.1. Exaggerate the importance of "hard to classify" instances
3.2.5.2. Iteratively increase the weights of misclassified instances and retrain
3.2.6. Stacking
3.2.6.1. Vote among many different classifiers
3.2.6.2. Train a meta-classifier over many base classifiers
3.3. General
3.3.1. Partitioning
3.3.1.1. Test/training split
3.3.1.1.1. Holdout
3.3.1.1.2. Cross-validation (M-fold)
3.3.1.1.3. "Leave one out" cross validation
3.3.1.2. Validation partition
3.3.1.2.1. Used to validate hyperparameters - separate from final test data
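A sketch of holdout and M-fold cross-validation using scikit-learn; the iris data and k-NN classifier are just placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single test/training split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te))

# M-fold cross-validation (here M = 5); cv=len(X) would give leave-one-out
print(cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean())
```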
3.3.2. Error analysis
3.3.2.1. Confusion matrix
4. Distance/similarity metrics
4.1. Metric definition
4.1.1. NON-NEGATIVE: dist(x,y) >= 0
4.1.2. EQUALITY MEANS IDENTITY: dist(x,y) = 0 iff x = y
4.1.3. SYMMETRIC: dist(x,y) = dist(y,x)
4.1.4. TRIANGLE INEQUALITY: dist (x,z) <= dist(x,y) + dist(y,z)
4.2. Continuous vector
4.2.1. Minkowski distances
4.2.1.1. Manhattan (c=1)
4.2.1.2. Euclidean (c=2)
4.2.1.3. Chebyshev (c -> inf)
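The Minkowski family as a one-liner, with the three special cases checked on a small example.

```python
import numpy as np

def minkowski(x, y, c):
    """c=1: Manhattan, c=2: Euclidean, c -> inf: Chebyshev (max difference)."""
    return np.sum(np.abs(x - y) ** c) ** (1.0 / c)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, 1))        # 7.0  (Manhattan)
print(minkowski(x, y, 2))        # 5.0  (Euclidean)
print(np.max(np.abs(x - y)))     # 4.0  (Chebyshev limit)
```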
4.3. Cosine distance
4.4. Discrete vector
4.4.1. Boolean vector
4.4.1.1. Jaccard similarity
4.4.1.2. Dice similarity
4.4.2. Hamming distance
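Sketches of the three boolean-vector measures, treating the "on" positions as sets.

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| over the 'on' positions."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return inter / union if union else 1.0

def dice(a, b):
    """2 |A intersect B| / (|A| + |B|)."""
    inter = sum(x and y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b)) if (sum(a) + sum(b)) else 1.0

def hamming(a, b):
    """Number of positions where the vectors differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = [1, 1, 0, 1], [1, 0, 0, 1]
print(jaccard(a, b), dice(a, b), hamming(a, b))   # 0.666..., 0.8, 1
```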
4.5. Linguistic
4.5.1. Edit distance (Levenshtein)
4.5.2. Hamming distance?
4.5.3. Bag of words?
4.5.4. N-grams
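A standard dynamic-programming sketch of Levenshtein edit distance, computed row by row in O(|s|*|t|).

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (cs != ct)))      # substitution (free if chars match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3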
5. General
5.1. Concepts
5.1.1. Structure of ML problem
5.1.1.1. Instances/Exemplars
5.1.1.1.1. essentially just feature vectors
5.1.1.2. Attributes/features
5.1.1.3. Concepts/labels/classes
5.1.2. Learner: the algorithm
5.1.3. Classifier: the model
5.1.4. Entropy
5.1.4.1. Measures how uncertain/unpredictable a distribution is
5.1.4.2. High entropy = more unpredictable
5.1.4.3. Low entropy = more predictable
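Entropy of a discrete distribution as a sketch (base-2 logs, so the result is in bits).

```python
import numpy as np

def entropy(probs):
    """H(p) = -sum p log2 p; higher = more unpredictable."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]                 # 0 * log 0 is taken as 0
    return -np.sum(probs * np.log2(probs))

print(entropy([0.5, 0.5]))       # 1.0   (maximally unpredictable for 2 outcomes)
print(entropy([0.99, 0.01]))     # ~0.08 (low entropy, very predictable)
```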
5.2. Issues
5.2.1. Overfitting
5.2.2. Underfitting
5.2.3. Generalisation beyond training data
5.2.4. Consistency with training data
5.3. Relationships with other fields
5.3.1. Data mining
5.3.1.1. Business applications, practicality, runtime concerns etc
5.3.2. Data science
5.3.2.1. Straddles ML and DM, applications, interpretation/meaning/insights
5.3.3. Machine learning
5.3.3.1. Tends to be concerned with theory, instead of runtime/practicality etc
5.3.4. Philosophy
5.4. Probability
5.4.1. Maximum likelihood estimation
5.4.1.1. Laplace smoothing factor (add-alpha; sketch below)
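A sketch of maximum likelihood estimation from counts with an add-alpha (Laplace) smoothing factor; alpha = 1 is classic add-one smoothing.

```python
import numpy as np

def mle(counts, alpha=0.0):
    """P(class i) = (count_i + alpha) / (N + alpha * k)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

print(mle([3, 0, 1]))            # [0.75, 0.0, 0.25]      raw MLE: zero for the unseen class
print(mle([3, 0, 1], alpha=1))   # [0.571, 0.143, 0.286]  smoothed: no zero probabilities
```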
5.4.2. Entropy of probability distribution
5.4.3. Pointwise mutual information?
5.5. Parametric vs. non-parametric
5.5.1. Parametric
5.5.1.1. Fixed number of parameters
5.5.1.2. Does not tend to grow in complexity as data grows
5.5.2. Non-parametric
5.5.2.1. Doesn't mean no parameters!
5.5.2.2. Means it makes no fixed assumption about the form of the underlying function
5.5.2.3. Can grow in complexity as the data grows
6. Data management
6.1. Value types
6.1.1. Categorical/Nominal
6.1.1.1. Defined: =
6.1.1.2. Boolean is a special case
6.1.1.3. Sometimes called "discrete", but there is no ordering
6.1.1.4. No relation or ordering among values
6.1.2. Ordinal
6.1.2.1. Discrete values
6.1.2.1.1. Defined: =, <, >
6.1.2.2. Continuous/numeric values
6.1.2.2.1. Defined: =, <, >, +, -, *, /
6.1.2.2.2. Interval
6.1.2.2.3. Ratio
6.2. Preprocessing
6.2.1. Transformations
6.2.1.1. Discrete
6.2.1.1.1. "One hot": transform categorical/nominal into set of N boolean features
6.2.1.1.2. Ordinal: convert to set of (n-1) booleans (with ordering factored in)
6.2.1.2. Continuous
6.2.1.2.1. Discretization / quantization
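A sketch of the one-hot transformation above; category order here is simply sorted, which is an arbitrary choice.

```python
import numpy as np

def one_hot(values):
    """Map a categorical feature onto N boolean columns, one per distinct value."""
    categories = sorted(set(values))
    encoded = np.array([[v == c for c in categories] for v in values], dtype=int)
    return encoded, categories

encoded, cats = one_hot(["red", "green", "red", "blue"])
print(cats)      # ['blue', 'green', 'red']
print(encoded)   # [[0 0 1], [0 1 0], [0 0 1], [1 0 0]]
```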
6.2.2. Features
6.2.2.1. Normalisation
6.2.2.1.1. Feature vector normalisation
6.2.2.1.2. Feature weighting
6.2.2.2. Feature selection
6.2.2.2.1. Wrapper methods
6.2.2.3. Feature engineering
6.2.3. Dimensionality reduction
6.2.3.1. Principal Component Analysis
6.2.3.1.1. Reduce dimension while maximally preserving variation
6.2.3.1.2. Solved via eigendecomposition of the covariance matrix (or SVD); sketch below
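A PCA sketch via eigendecomposition of the covariance matrix, keeping the directions of largest variance.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto the directions of maximum variance."""
    Xc = X - X.mean(axis=0)                     # centre the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue = most variance preserved
    components = eigvecs[:, order[:n_components]]
    return Xc @ components
```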
7. Unsupervised learning
7.1. General
7.1.1. Strongly unsupervised
7.1.1.1. Discover classes in the process of learning
7.1.2. Weakly unsupervised
7.1.2.1. Categorize without the aid of pre-labelled instances
7.2. Clustering
7.2.1. General distinctions
7.2.1.1. Exclusive vs overlap
7.2.1.2. Deterministic vs probabilistic
7.2.1.3. Hard vs soft (?)
7.2.1.4. Hierarchical vs partitioning
7.2.1.5. Incremental vs batch clustering
7.2.2. K-means
7.2.2.1. Algorithm
7.2.2.1.1. Select k random seeds
7.2.2.1.2. Allocate each point to its nearest seed
7.2.2.1.3. Recompute each seed as the centroid of its cluster
7.2.2.1.4. Repeat until no points are reassigned (sketch after this list)
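The k-means loop above as a sketch; it assumes no cluster becomes empty, and the seeds are chosen uniformly at random.

```python
import numpy as np

def kmeans(X, k, max_iter=100, random_state=0):
    rng = np.random.default_rng(random_state)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # k random seeds
    for _ in range(max_iter):
        # allocate each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster (assumes no empty cluster)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # no more reassignment
            break
        centroids = new_centroids
    return labels, centroids
```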
7.2.2.2. Soft k-means
7.2.2.2.1. Select k random seeds
7.2.2.2.2. Assign proximity (with stiffness param) to each point
7.2.2.2.3. Update each cluster centroid to weighted sum of all points weighted by proximity
7.2.2.2.4. Repeat until stable
7.2.2.2.5. Properties: overlapping, probabilistic, partitioning, batch clustering
7.2.2.2.6. Expectation Maximisation
7.2.3. Finite mixture
7.2.4. Cluster evaluation
7.2.4.1. The goal
7.2.4.1.1. High cohesion/compactness of clusters
7.2.4.1.2. High separation between clusters
7.2.4.1.3. Combined measures
7.2.4.2. Unsupervised
7.2.4.2.1. Minimize Sum of Squared Errors (distances from centroids)
7.2.4.2.2. Graph based (average distances between points within cluster)
7.2.4.2.3. Prototype based separation (?)
7.2.4.2.4. Graph based separation & cohesion (?)
7.2.4.2.5. Silhouette coefficient
7.2.4.3. Supervised
7.2.4.3.1. Purity
7.2.4.3.2. Entropy
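A purity sketch for supervised cluster evaluation (assumes integer gold labels).

```python
import numpy as np

def purity(cluster_labels, true_labels):
    """Fraction of points that belong to the majority true class of their cluster."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()     # size of the majority class in this cluster
    return total / len(true_labels)

print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))   # 5/6 ~ 0.83
```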
7.3. Association
7.3.1. Learning associations between features, e.g. A & B -> C