DM I am dying. Green: understand the algorithm; Purple: discussion/context of the algorithm; Blue: calculation

Data Mining MMSDS

1. Cluster analysis (hierarchical) Every object belongs to a cluster and its parent clusters

1.1. Strengths

1.2. Bottom-up (agglomerative)

1.2.1. Single Link

1.2.2. Complete Link

1.2.3. balanced: Group average

1.2.4. faster: Distance between Centroids

1.3. Top-down (divisive)

1.3.1. Problem and Limitation

1.4. Cutting a hierarchical clustering yields a partitional clustering
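A minimal sketch of the ideas above, assuming Python with numpy/scipy (not part of the notes): bottom-up merging with a chosen linkage strategy, then cutting the hierarchy to get a partitional clustering. The data points are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# assumed toy data: five 2D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# bottom-up (agglomerative) merging; the linkage method defines the cluster
# distance: 'single', 'complete', 'average' (group average) or 'centroid'
Z = linkage(X, method="average")

# cutting the hierarchy at 2 clusters yields a partitional clustering
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```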

2. Exam

2.1. -How do the algorithms work? -What happens when parameters are changed? -How does an algorithm react to unusual data structures? -Compare and discuss algorithms -Know solutions to problems -Calculation exercises -Apply algorithms visually

3. Preprocessing (but don't preprocess the test set)

3.1. Errors

3.1.1. Typical sources: sensors, wild codes, bugs in processing code

3.1.1.1. Solution: remove using domain knowledge or anomaly detection

3.2. Missing Values

3.2.1. Typical reasons: sensor failures, information not provided by respondents, data loss

3.2.1.1. Solution

3.2.1.1.1. Replace (simple imputation)

3.2.1.1.2. Remove (when the amount of missing data is high and only a few rows are affected)

3.2.1.1.3. Predict (advanced imputation): treat missing values as a learning problem. Target: the missing value; training data: instances where the feature is present

3.2.2. Random vs not random

3.2.2.1. Not missing at random: the missing value itself holds information (e.g., "are you drinking?" left blank, or "how long pregnant?" answered only by a subpopulation)

3.2.2.1.1. Imputing, predicting can lead to information loss!
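A sketch of the two imputation routes above, assuming pandas/scikit-learn and a made-up toy DataFrame (column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 30, np.nan, 40, 22],
                   "income": [30000, 42000, 38000, np.nan, 25000]})

# Replace (simple imputation): fill with the column mean
df_simple = df.fillna(df.mean())

# Predict (advanced imputation): target = the missing feature,
# training data = rows where the feature is present
known = df.dropna()
model = LinearRegression().fit(known[["age"]], known["income"])
to_fill = df["income"].isna() & df["age"].notna()
df.loc[to_fill, "income"] = model.predict(df.loc[to_fill, ["age"]])
```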

3.3. Unbalanced distribution (NEVER balance the test set!) Problem: the decision tree does not find any split that improves quality

3.3.1. Solution Goal: As much and as diverse data as possible

3.3.1.1. Resampling

3.3.1.1.1. Downsampling (problem: we want to use as much data as possible) vs. upsampling (problem: we want diverse training data to prevent overfitting)

3.3.1.2. Weighting
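A sketch of both solutions, assuming pandas/scikit-learn and a hypothetical imbalanced training frame; balancing is applied to the training data only, never the test set:

```python
import pandas as pd
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})
majority = train[train["y"] == 0]
minority = train[train["y"] == 1]

# upsampling: draw minority rows with replacement until the classes match
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up])

# weighting: many learners accept class weights instead of resampling
clf = DecisionTreeClassifier(class_weight="balanced")
```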

3.4. Different Scales

3.4.1. Problem: kNN is sensitive to scale. Why? Euclidean distance

3.4.1.1. Solution

3.4.1.1.1. Normalization
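A sketch of min-max normalization (toy values assumed), so that kNN's Euclidean distance is not dominated by the attribute with the largest scale:

```python
import numpy as np

# assumed toy attributes on very different scales (e.g. height in cm, salary)
X = np.array([[180.0, 70000.0],
              [160.0, 30000.0],
              [175.0, 55000.0]])

X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # every column now lies in [0, 1]
print(X_norm)
```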

3.5. False Predictors

3.5.1. Def: the target variable is included in an attribute. A feature might **appear predictive** but is actually **deterministic** based on domain knowledge (grape: Barbera -> wine: Barbera)

3.5.1.1. Solution

3.5.1.1.1. False Predictor is seen in...

3.5.1.1.2. learn a model and drop suspected features until accuracy drops to a realistic level

3.5.1.1.3. correlation = 1 (heatmap)
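A sketch of that correlation check, assuming pandas and a hypothetical DataFrame `df` with a numeric target column `y`:

```python
import pandas as pd

def suspect_false_predictors(df: pd.DataFrame, target: str, threshold: float = 0.99):
    # features whose correlation with the target is (almost) perfect are suspects
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() >= threshold]

# usage: suspects = suspect_false_predictors(df, "y")
# then retrain without the suspects and check whether accuracy drops to a realistic level
```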

3.6. Unsupported Data Types

3.6.1. Some algorithms cannot handle certain data types. Example: SVM, kNN and NN do not work with categorical data

3.6.1.1. Cat -> Num

3.6.1.1.1. Ord ->num

3.6.1.1.2. Nom -> num
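A sketch of both conversions, assuming pandas and made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium"],   # ordinal
                   "color": ["red", "green", "blue"]})     # nominal

# Ord -> Num: map onto the meaningful order
df["size_num"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Nom -> Num: one-hot encoding, one binary column per category
df = pd.get_dummies(df, columns=["color"])
```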

3.6.1.2. Num -> Ordinal (association rule learning)

3.6.1.2.1. Bins and Buckets
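A sketch of binning a numeric attribute into ordered buckets (pandas assumed, toy values):

```python
import pandas as pd

age = pd.Series([15, 22, 37, 48, 63, 71])
# three equal-width bins turn the numeric attribute into an ordinal one
age_binned = pd.cut(age, bins=3, labels=["young", "middle", "old"])
print(age_binned.tolist())
```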

3.6.1.3. Dates

3.6.1.3.1. formalize, parse

3.6.1.4. Text2Vec

3.6.1.4.1. Preprocessing: text cleanup, tokenization, stopword removal, stemming

3.7. High Dimensionality

3.7.1. A large number of attributes (x) causes scalability problems. The decision tree overfits because it grows too deep (too many levels of nodes). Naive Bayes needs independent features

3.7.1.1. Principal Component Analysis -> creates a smaller set of new attributes (the most expressive linear combinations of the existing attributes)

3.7.1.1.1. Combines dimensions into new components
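A sketch of PCA, assuming scikit-learn and random toy data: the new attributes are linear combinations of the old ones, ordered by how much variance they explain:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # assumed toy data: 100 rows, 10 attributes

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)    # 100 rows, 3 new attributes
print(pca.explained_variance_ratio_)
```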

3.7.1.2. Feature selection -> selects a subset of attributes (reduces columns)

3.7.1.2.1. Correlation Analysis

3.7.1.2.2. Filter methods (use a weighting criterion and select the attributes with the highest weights)

3.7.1.2.3. Wrapper methods: use an internal classifier and select the best feature set

3.7.1.3. After Dimensionality reduction: Sampling (Reduces Rows)

3.7.1.3.1. Stratified Sampling

3.7.1.3.2. Kennard Stone Sampling

4. Classification Goal: classify unseen instances and understand the application domain as a human

4.1. Lazy Learning Instance-based. Does not build a model; learning is only performed when an unseen instance has to be classified. Goal: classify unseen records as precisely as possible

4.1.1. Centroid

4.1.1.1. KNN vs Centroids

4.1.1.1.1. Different performance in

4.1.1.1.2. Global (C) vs. local (kNN) model. kNN stores all data, C only the class means. kNN needs a distance calculation to every stored instance, C only one per class
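A sketch contrasting the memory footprints, with made-up data: the centroid model keeps only one mean vector per class, while kNN would keep every training instance:

```python
import numpy as np

def fit_centroids(X, y):
    # one stored vector per class
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroid(centroids, x):
    # one distance calculation per class
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y = np.array([0, 0, 1, 1])
print(predict_centroid(fit_centroids(X, y), np.array([1.5, 1.2])))   # -> 0
```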

4.1.2. KNN Algo

4.1.2.1. classifies new data points based on the majority class of their nearest neighbors

4.1.2.1.1. 1. Select a number k (typically an odd number to avoid ties). 2. Calculate the distance (e.g., Euclidean) from the new data point to all existing labeled data points. 3. Identify the k closest neighbors. 4. Assign the most common label among the neighbors to the new data point.

4.1.2.1.2. Most common distance measure: Euclidean
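A sketch of the four kNN steps listed above, assuming numpy (toy arrays, Euclidean distance, majority vote):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 2. distance from the new point to all labeled points
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 3. the k closest neighbors
    neighbors = y_train[np.argsort(dists)[:k]]
    # 4. assign the most common label among the neighbors
    return Counter(neighbors).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.0, 1.5]), k=3))   # -> "A"
```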

4.2. Eager Learning Goal: Classify unseen records and generate models that might be interpretable by humans

4.2.1. Decision tree

4.2.1.1. How to learn a decision tree?

4.2.1.1.1. Hunt's algorithm (see Tutorial 3, "Let's build a decision tree")

4.2.1.2. When to split the tree? Depends on: attribute types, # of ways to split (2-way vs. multiway), purity

4.2.1.2.1. Nominal X

4.2.1.2.2. Ordinal

4.2.1.2.3. Cont

4.2.1.3. What is the best split? Impurity vs overfitting

4.2.1.3.1. Gini: min 0.0 (pure), max 1 - 1/(number of classes) (impure because classes are equally distributed, least interesting information)

4.2.1.3.2. Entropy and many others
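A sketch of the two impurity measures on a class distribution (numpy assumed); Gini is 0 for a pure node and 1 - 1/k for k equally frequent classes:

```python
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]))      # 0.0  -> pure node
print(gini([0.5, 0.5]))      # 0.5  -> 1 - 1/2, maximally impure for 2 classes
print(entropy([0.5, 0.5]))   # 1.0
```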

4.2.1.4. Advantages vs Disadvantages

4.2.1.4.1. + inexpensive + fast + easy to interpret + accuracy comparable to other techniques for small datasets

4.2.1.4.2. - only one single attribute at a time (axis-parallel decision boundaries)

4.2.1.4.3. Good Practice

4.2.1.5. Decision tree vs kNN

4.2.1.5.1. Boundaries

4.2.1.5.2. Scale sensitivity

4.2.1.5.3. Runtime and Memory

4.3. Overfitting in decision trees (e.g., fitting on a name attribute)

4.3.1. Does not generalize well on unseen data. Goal: classify unseen examples

4.3.1.1. Occams Razor

4.3.1.1.1. Symptoms

4.3.1.1.2. Causes

4.3.1.1.3. Solution

4.4. Evaluation Metrics

4.4.1. How good is a model in classifying unseen examples

4.4.1.1. Confusion Matrix

4.4.1.1.1. Accuracy (Tut 3)

4.4.1.1.2. Error Rate

4.4.1.1.3. Precision and Recall

4.4.1.1.4. Cost sensitive model

4.4.1.1.5. ROC curves (for kNN and Naive Bayes) (see Tut 3 for interpretation and calculation)
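A sketch of the metrics derived from a 2x2 confusion matrix (the counts below are assumed, not from the lecture):

```python
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    error_rate = 1 - accuracy
    precision = tp / (tp + fp)   # how many predicted positives are real
    recall = tp / (tp + fn)      # how many real positives were found
    return accuracy, error_rate, precision, recall

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```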

4.5. Naive Bayes -> Probabilistic Classification

4.5.1. Pay attention to the application and normalize at the end

4.5.1.1. Bayes Classifier

4.5.1.1.1. Given attributes, how likely is class label c? (Binary class)

4.5.1.2. Handling numerical attributes

4.5.1.2.1. discretize

4.5.1.2.2. normalize and calculate the base P normally

4.5.1.2.3. Use different distribution as density function

4.5.1.3. Missing values

4.5.1.3.1. Since Naive Bayes multiplies probabilities, a zero count makes the numerator 0. Fix: add another term (Laplace smoothing)

4.5.1.3.2. Unseen record

4.5.1.4. Decision boundary

4.5.1.4.1. soft margins, uncertain areas, random shapes, larger

4.5.1.5. Pro and con

4.5.1.5.1. works well even if the independence assumption is violated

4.5.1.5.2. Problem: too many redundant attributes (select subset!)
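A minimal Naive Bayes sketch for categorical attributes (pure Python, toy data assumed): multiply the class prior with the per-attribute likelihoods, use a Laplace term so an unseen value does not zero out the product, and normalize at the end:

```python
from collections import defaultdict

def nb_train(rows, labels):
    classes = sorted(set(labels))
    n_attrs = len(rows[0])
    domains = [set(r[i] for r in rows) for i in range(n_attrs)]   # values per attribute
    priors = {c: labels.count(c) for c in classes}
    counts = {c: [defaultdict(int) for _ in range(n_attrs)] for c in classes}
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[c][i][v] += 1
    return counts, priors, domains, len(labels)

def nb_predict(counts, priors, domains, n, row, k=1):
    scores = {}
    for c in priors:
        score = priors[c] / n                          # class prior
        for i, v in enumerate(row):
            # Laplace correction: add k to each count
            score *= (counts[c][i][v] + k) / (priors[c] + k * len(domains[i]))
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}       # normalize at the end

rows = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
print(nb_predict(*nb_train(rows, labels), ("sunny", "cool")))
```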

4.6. Support Vector Machines

4.6.1. for continuous attributes, but the class is still binary

4.6.1.1. good for high dimensional data

4.6.1.1.1. Goal: fit a (linear) hyperplane (decision boundary)

4.6.1.1.2. Non-linearity

4.6.1.1.3. Optimize hyperplane:

4.6.1.1.4. Strengths and Limitations
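A short sketch with scikit-learn (assumed, not from the lecture): a linear SVM vs. an RBF-kernel SVM on data that is not linearly separable; C trades margin width against margin violations:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)   # struggles on circular data
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)         # kernel trick handles non-linearity
print(linear_svm.score(X, y), rbf_svm.score(X, y))
```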

4.7. Artificial Neural Networks

4.7.1. Layout: it's complicated, watch a video. Function, input/output, bias term, activation value

4.7.1.1. Algo for training ANNs
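A sketch of a single neuron's forward pass (numpy assumed, made-up weights): weighted inputs plus the bias term, pushed through an activation function to get the activation value:

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b                 # weighted input + bias term
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid -> activation value

print(neuron(x=np.array([0.5, 1.0]), w=np.array([0.8, -0.4]), b=0.1))
```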

4.7.2. Types of deep learning models

4.7.2.1. CNN: image recognition

4.7.2.2. BERT

4.7.2.2.1. Pretrain/finetune language models

4.7.2.3. Instruct language models

4.7.2.4. Generative models

4.8. Evaluation Methods How to obtain reliable results

4.8.1. Is not "how to measure performance"

4.8.1.1. Split

4.8.1.1.1. Test vs training

4.8.1.1.2. Holdout Method

4.8.1.1.3. **performance estimation** k-fold cross-validation estimates performance. Overfitting risk with hyperparameter tuning because information spills from the test folds into training
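A sketch of k-fold cross-validation with scikit-learn (assumed; iris is used only as an example dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 5-fold CV: each fold is used once as the test partition
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```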

4.8.1.2. Learning curve

4.8.1.2.1. how does accuracy change?

4.8.1.3. Hyperparameter and Model Selection

4.8.1.3.1. value set before learning; influences the learning process (e.g., # of hidden layers in an ANN)

4.9. Validating and comparing models. Is NOT "Evaluation Methods": these are not specific techniques but an overall strategy to check whether the model generalizes well

4.9.1. Overfitting revisited

4.9.1.1. types

4.9.1.1.1. use test partition for training

4.9.1.1.2. tune parameters against the test partition (k-fold cross-validation)

4.9.1.1.3. use test data in feature construction

4.9.2. Overtuning problem

4.9.3. Validating a Better(?) model. Is performance better by chance or by design

4.9.3.1. size of test set

4.9.3.1.1. What is an error?

4.9.3.2. Occams Razor

4.9.3.2.1. if one model does not perform significantly better, choose the simpler one

4.9.3.3. Variance

4.9.3.3.1. sign test

4.9.3.3.2. Descriptives (standard deviation, pairwise comparison)

4.9.3.3.3. Wilcoxon signed-rank test

4.9.3.4. Ablation studies

4.9.3.4.1. what happens if we leave out a step of the pipeline? Does the model get simpler and stay just as reliable?

4.10. Ensembles: paradigm shift -> many simple learners. They need a certain accuracy and diversity in their answers, so that the ensemble gets the answer right and their predictions on a new example differ. Also: have diverse base classifiers (in practice these are not independent)

4.10.1. voting between the predictions of the base classifiers: majority vote (classification), averaging (regression)

4.10.1.1. Probability of wrong prediction. (Binom. Likelihood)

4.10.1.1.1. Causes of errors: Biases in data samples -> overfitting

4.10.1.2. The ensemble makes a wrong prediction if the majority of classifiers make a wrong prediction

4.10.1.2.1. In theory, we can lower the error arbitrarily by just adding more base learners
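A sketch of the binomial calculation: the majority vote of n independent base classifiers with individual error rate eps is wrong only if more than half of them are wrong (the values below are illustrative):

```python
from math import comb

def ensemble_error(n, eps):
    # P(more than half of the n base classifiers are wrong)
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(ensemble_error(25, 0.35))   # far below the individual error rate of 0.35
```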

4.10.2. Boosting: train a set of classifiers one after another, where later classifiers focus on examples misclassified by earlier learners

4.10.2.1. Idea of boosting: recalculation of weights

4.10.2.1.1. AdaBoost Algo

4.10.2.2. Error rate
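A sketch of the AdaBoost-style weight recalculation (numpy assumed): misclassified examples get heavier, correctly classified ones lighter, then the weights are renormalized:

```python
import numpy as np

def adaboost_update(weights, correct, eps):
    alpha = 0.5 * np.log((1 - eps) / eps)               # importance of this classifier
    new_w = weights * np.exp(np.where(correct, -alpha, alpha))
    return new_w / new_w.sum()                          # renormalize to sum to 1

w = np.full(5, 0.2)                                     # uniform start weights
correct = np.array([True, True, False, True, False])    # 2 of 5 misclassified
print(adaboost_update(w, correct, eps=0.4))
```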

4.10.3. Stacking

4.10.3.1. Idea: Metaclassifier

4.10.3.1.1. Problem: overfitting due to a dumb learner (it is perfect on the training data, so the meta-classifier puts a lot of confidence on it)

4.10.3.2. Variants

4.10.3.2.1. keep the original attributes

4.10.3.2.2. Use confidence intervals

4.10.4. Learning with costs: give weight to some errors

4.10.4.1. MetaCost Algo

4.10.4.1.1. Metacost vs Balancing

4.10.4.2. Cost function on ordinal data

5. Regression

5.1. What is a regression?

5.1.1. Regression vs. classification: discrete -> continuous target variable

5.1.1.1. Classification methods that can also be used for regression

5.1.1.1.1. KNN Regression

5.1.1.1.2. Regression tree

5.1.1.1.3. ANNs for Regression

5.1.1.1.4. Model Tree

5.1.1.2. Typical Regression stuff

5.1.1.2.1. Lin Regression

5.1.1.2.2. Non Linear Regression

5.1.1.3. Evaluation metrics (* = same scale as Y)

5.1.1.3.1. MAE*

5.1.1.3.2. MSE

5.1.1.3.3. RMSE*

5.1.1.3.4. Pearsons

5.1.1.3.5. R^2
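A sketch computing the metrics above on made-up predictions (numpy assumed); MAE and RMSE are on the same scale as Y:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mae = np.mean(np.abs(y_true - y_pred))               # MAE*, same scale as Y
mse = np.mean((y_true - y_pred) ** 2)                # MSE
rmse = np.sqrt(mse)                                  # RMSE*, same scale as Y
r = np.corrcoef(y_true, y_pred)[0, 1]                # Pearson's correlation
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, rmse, r, r2)
```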

5.1.1.4. Bias/Variance Tradeoff

5.1.1.4.1. Goal: Learn model that generalizes well to unseen data

6. Cluster Analysis (partitional)

6.1. Def: objects in a cluster are similar to one another and different from objects in other clusters

6.1.1. Used for classification (supervised learning) or for preprocessing and exploration (unsupervised learning)

6.1.1.1. Needs: Algo, proximity measure, measure of quality

6.1.1.1.1. K-Means clustering algorithm (see the sketch at the end of this section)

6.1.1.1.2. K-Medoids

6.1.1.1.3. Density based clustering

6.1.1.1.4. Proximity measures

6.1.1.1.5. Outlier detection (outliers are not the same as extreme values)
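A sketch of the k-means loop referenced above (numpy assumed, toy data): assign each point to the nearest centroid, recompute the centroids as cluster means, and repeat until the centroids stop moving:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])
print(kmeans(X, k=2))
```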