DM I am dying. Green: understand the algorithm, Purple: discussion/context of the algorithm, Blue: calculations

Data Mining MMSDS


1. Exam

1.1. - How do the algorithms work? - What happens when parameters are changed? - How does an algorithm react to odd data structures? - Compare and discuss algorithms - Know solutions to problems - Calculation exercises - Apply visual algorithms

2. Multimodal Data. Goal: encode all types of data with numbers. 1) Focus on the most important features (reducing dimensionality) 2) Encoding 3) Put different modes of data into one model

2.1. Types of data

2.1.1. Images, Audio, Video, Time Series

2.1.1.1. are all encoded as numbers

2.1.1.1.1. TF-IDF vectors for text
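
A minimal sketch of how text becomes numbers, using scikit-learn's TfidfVectorizer; the two-document toy corpus is made up:

```python
# Encode a toy corpus as TF-IDF vectors: each document becomes a numeric row.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]  # made-up corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))                 # each document as a numeric vector
```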

2.2. Neural Networks

2.2.1. Output is predicted from the values at the hidden layer

2.2.1.1. So we can reduce the input to the values of the last hidden layer as a compact description (like PCA)

2.2.1.1.1. 1. We use Pre-trained Networks

2.3. Auto Encoders

2.3.1. Instead of training with the same input and output: we add random noise to the input, remove the noise, and get a clean example as output (denoising autoencoder)

2.3.1.1. Example: Word2vec: input: words plus random noise; output: relations between words (Man -> Woman shows the same behaviour as King -> Queen)

2.3.1.1.1. Example: BERT: we can mask a word and reconstruct it from the whole sentence (contextual representations, unlike Word2vec's one fixed representation per word)

2.4. Problem with time series

2.4.1. We want to feed the ECG of each patient into the model

2.4.1.1. High dimensional!

2.4.1.1.1. TSFRESH
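
A hedged sketch of how TSFRESH collapses one high-dimensional raw series per patient into a fixed-length feature vector; the column names ("id", "time", "value") and toy ECG values are assumptions:

```python
# tsfresh expects long-format data: one row per (series id, time step).
import pandas as pd
from tsfresh import extract_features

ecg = pd.DataFrame({
    "id":    [1] * 4 + [2] * 4,             # one series per patient (made up)
    "time":  [0, 1, 2, 3] * 2,
    "value": [0.1, 0.5, 0.4, 0.2, 0.0, 0.9, 0.8, 0.3],
})
features = extract_features(ecg, column_id="id", column_sort="time")
print(features.shape)  # rows: patients, columns: many extracted features
```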

2.5. Network Data

2.5.1. classify people in a network (who is a gatekeeper?)

2.5.1.1. Methods that encode this as numbers: node2vec
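
The real node2vec uses biased random walks (p/q parameters); as a hedged approximation of the idea, the sketch below feeds plain uniform random walks into gensim's Word2Vec so that each node gets a numeric embedding:

```python
# node2vec idea in miniature: random walks become "sentences", nodes become "words".
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()  # toy social network

def random_walk(g, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(n) for n in walk]

walks = [random_walk(G, n) for n in G.nodes() for _ in range(10)]
model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1)
print(model.wv[str(0)][:5])  # numeric embedding of node 0
```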

3. Time Series. Task: given a TS, what is the trend? (revise again with the notebook)

3.1. Problem: TS = RE + SE + CE (random + seasonal + cyclical effects)

3.1.1. Random effects

3.1.1.1. solution: smoothing

3.1.2. Seasonal effects

3.1.2.1. Exponential Smoothing

3.1.2.2. How do we isolate these?

3.1.2.2.1. By first decomposing (identifying) these

3.1.3. Cyclical effects

3.1.3.1. What if the periodicity is unknown?

3.1.3.1.1. Assume: the time series is a sum of sine waves (Fourier transformation)
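
A small sketch of that idea: build a synthetic noisy sine with period 12 (an assumption for the demo) and recover the period from the dominant FFT frequency:

```python
# Estimate an unknown periodicity from the peak of the Fourier spectrum.
import numpy as np

t = np.arange(200)
series = np.sin(2 * np.pi * t / 12) + 0.3 * np.random.randn(200)  # period 12 + noise

spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(len(series))      # in cycles per time step
peak = 1 + np.argmax(spectrum[1:])        # skip the zero-frequency bin
print(f"estimated period: {1 / freqs[peak]:.1f} time steps")  # ~12
```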

3.1.4. (Missing data)

3.1.4.1. replacing with the average is a bad idea

3.1.4.1.1. Solution: linear interpolation, replace with previous, replace with next, kNN imputation
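
A quick sketch of those options on a toy pandas Series with one gap:

```python
# Three of the listed imputation options, side by side.
import pandas as pd

s = pd.Series([1.0, 2.0, None, 4.0])
print(s.interpolate())   # linear interpolation -> 3.0
print(s.ffill())         # replace with previous value -> 2.0
print(s.bfill())         # replace with next value -> 4.0
# kNN imputation (on tabular data) would use sklearn.impute.KNNImputer.
```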

3.2. How to estimate trend curves?

3.2.1. Freehand method (looking) doesn't scale (describing)

3.2.2. OLS method (predicting)

3.2.2.1. a linear function minimizing squared errors

3.2.3. moving averages

3.2.3.1. the upcoming value is the average of the last n values

3.2.3.1.1. By first decomposing

3.2.3.2. Exponential smoothing

3.2.3.2.1. smoothing factor α = 1: just replace each value with itself; α = 0: replace it with the smoothed past
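
A minimal hand-rolled sketch (made-up series) showing the two extremes of the smoothing factor α:

```python
# Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
def exp_smooth(values, alpha):
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [3, 5, 4, 6, 8, 7]
print(exp_smooth(series, alpha=1.0))  # identical to the input
print(exp_smooth(series, alpha=0.3))  # heavily smoothed toward the past
```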

3.3. Prediction: Autoregressive models

3.3.1. Moving average for prediction: predict the average of the last n values; learn weights

3.3.1.1. a linear prediction that learns from the last periods (lagged variables)

3.3.1.1.1. The predicted curve itself is not linear; weight learning works well

3.3.1.2. Predicting with exponential smoothing: we predict the weighted average of the last value and the last prediction (recursive). Use exponential smoothing for prediction

3.3.1.3. Double exponential smoothing (adds a trend term)

3.3.1.4. Triple exponential smoothing (Holt-Winters: adds trend and seasonality)

3.4. Evaluation

3.4.1. 10-fold cross validation

3.4.1.1. Data leakage!

3.4.1.1.1. Solution: train on the past periods and test on the most recent period (never let the model see the future)
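
A sketch with scikit-learn's TimeSeriesSplit, which always trains on the past and tests on the future (the ten time-ordered observations are made up):

```python
# Leakage-free CV for time series: each fold's test set lies after its training set.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> test:", test_idx)
```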

4. Association Analysis

4.1. Are purchased items related? Needs transformation into binary variables ("Browser Type == Chrome") or discretization. Problem: many values; small intervals -> low support, large intervals -> low confidence

4.1.1. Pearsons correlation coefficient

4.1.2. Correlation Analysis

4.1.3. Association Analysis Given a set of transactions, find rules

4.1.3.1. Rules: predict the occurrence of an item given the occurrences of other items in a transaction

4.1.3.1.1. Example: {diaper} -> {beer} Given diaper, likely that person will buy beer (Not deterministic!)

4.1.3.1.2. Frequent itemset: - a set of one or more items - not ordered - the frequency comes from the # of times the products are bought together

4.1.3.1.3. Support count (σ): frequency of occurrence of an itemset

4.1.3.1.4. Support (s): how frequently the itemset occurs in the data. High -> the itemset is more common in the transactions. s(X) = (# of transactions containing X) / (total # of transactions)

4.1.3.1.5. Confidence (c): the probability that the consequent items are purchased given that the condition items were purchased. Higher = stronger association between the items. How often do items in Y appear in transactions that contain X? c(X -> Y) = support(X ∪ Y) / support(X)

4.1.3.1.6. Association rule: an implication of the form X -> Y, where both are itemsets; the confidence estimates P(Y|X). X = condition (antecedent), Y = consequent
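
A worked sketch of support and confidence for {diaper} -> {beer} on a made-up transaction list:

```python
# Support and confidence computed directly from the definitions above.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"diaper"}, {"beer"}
print("support:   ", support(X | Y))               # 3/5 = 0.6
print("confidence:", support(X | Y) / support(X))  # 3/4 = 0.75
```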

4.1.4. Subgroup Discovery: Find all patterns that can explain target var

4.1.4.1. SGD vs. classification. Goal: accuracy AND learn as much as possible about elephants

4.1.4.1.1. Algos: EXPLORA

4.1.4.1.2. Metric: Weighted Relative Accuracy (accuracy and coverage): WRAcc = P(ST) − P(S)·P(T). If P(S) and P(T) are independent, WRAcc = 0. The optimum is P(T) − P²(T): compare with your result. High P(ST): high coverage. Low P(S) − P(ST): accurate subgroup. High WRAcc: high accuracy (duh)
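
A worked WRAcc calculation with made-up counts, checking the formula and the optimum above:

```python
# WRAcc = P(ST) - P(S) * P(T), on invented counts for a subgroup S and target T.
n = 100     # total instances
n_T = 40    # instances with the target property
n_S = 25    # instances covered by the subgroup
n_ST = 20   # covered instances that also have the target

P_T, P_S, P_ST = n_T / n, n_S / n, n_ST / n
print(P_ST - P_S * P_T)   # 0.20 - 0.25 * 0.40 = 0.10
print(P_T - P_T ** 2)     # optimum for this target: 0.24
```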

5. Intro

5.1. Why DM?

5.1.1. Large quantities of different types of data are collected

5.1.1.1. DM methods: discover patterns

5.1.1.1.1. and help with decision making

5.1.1.2. Traditional approaches unsuitable bc of:

5.1.1.2.1. large amount of data

5.2. What is DM?

5.2.1. the non-trivial process of identifying valid, novel, useful, and understandable patterns in data

5.3. DM Process

5.3.1. Selection and Profiling

5.3.1.1. What is available and useful?

5.3.1.1.1. Visualize data and identify problems

5.4. Preprocessing

5.4.1. transform the data into a representation that is suitable for DM methods

5.4.1.1. Scales of attributes (num,)

5.4.1.2. number of dimensions (# attributes)

5.5. Transformation

5.5.1. Discretization

5.5.2. Feature subset selection

5.5.3. Embeddings

5.6. Data Mining

5.6.1. Input: preprocessed data output: model

5.6.1.1. Apply DM method

5.6.1.1.1. Evaluate the resulting model/patterns

5.7. Interpretation

5.7.1. Output: patterns

5.7.1.1. gain knowledge

5.7.1.1.1. use in a business context

5.8. DM Methods

5.8.1. descriptive (find patterns)

5.8.1.1. unsupervised

5.8.1.1.1. classification (binary)

5.8.2. Predictive (predict unknown values of variable)

5.8.2.1. supervised

5.8.2.1.1. Regression (continuous)

6. Preprocessing. But never fit the preprocessing on the test set

6.1. Errors

6.1.1. What types of sources? Sensors, wild codes, bugs in processing code

6.1.1.1. Solution: remove values outside an interval set with domain knowledge (e.g., −30 to 20 degrees)

6.1.1.1.1. anomaly detection

6.2. Missing Values

6.2.1. What kinds of reasons? Sensors, info not provided by respondents, data loss

6.2.1.1. Solution

6.2.1.1.1. Replace (simple imputation)

6.2.1.1.2. Remove (when missingness is high and only a few rows are affected)

6.2.1.1.3. Predict (advanced imputation): treat missing values as a learning problem. Target: the missing value. Training data: instances where the feature is present

6.2.2. Random vs not random

6.2.2.1. not random: the missingness itself holds information! ("Are you drinking?"); some questions only apply to a subpopulation ("How long have you been pregnant?")

6.2.2.1.1. Imputing, predicting can lead to information loss!

6.3. Unbalanced distribution. NEVER balance the test set! Problem: the decision tree doesn't find any split that improves quality

6.3.1. Solution Goal: As much and as diverse data as possible

6.3.1.1. Resampling

6.3.1.1.1. Downsampling (problem: we want to use as much data as possible) vs. upsampling (we want to use diverse training data to prevent overfitting)

6.3.1.2. Weighting

6.4. Different Scales

6.4.1. Problem: kNN is sensitive to scale. Why? Euclidean distance

6.4.1.1. Solution

6.4.1.1.1. Normalization

6.5. False Predictors

6.5.1. Def: the target variable is included in the attributes. A feature might **appear predictive** but is actually **deterministic** based on domain knowledge. Grape: Barbera -> Wine: Barbera

6.5.1.1. Solution

6.5.1.1.1. False Predictor is seen in...

6.5.1.1.2. learn a model and drop the suspect attribute until accuracy drops

6.5.1.1.3. correlation = 1 (heatmap)

6.6. Unsupported Data Types

6.6.1. Some algorithms can't handle some types of data. Example: SVM, kNN, and NNs don't work with categorical data

6.6.1.1. Cat -> Num

6.6.1.1.1. Ord ->num

6.6.1.1.2. Nom -> num

6.6.1.2. Num -> Ordinal (association rule learning)

6.6.1.2.1. Bins and Buckets

6.6.1.3. Dates

6.6.1.3.1. formalize, parse

6.6.1.4. Text2Vec

6.6.1.4.1. Preprocessing: text cleanup, tokenization, stopword removal, stemming

6.7. High Dimensionality

6.7.1. large number of attributes (x), scalability problem. Decision trees overfit because the tree grows too deep (too many levels of nodes). Naive Bayes needs independent features

6.7.1.1. Principal Component Analysis -> a smaller set of new attributes is created (taking the most expressive linear combinations of the existing attributes)

6.7.1.1.1. Combines dimensions
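
A hedged sketch of PCA with scikit-learn on toy data with one deliberately redundant attribute:

```python
# PCA: 5 original attributes -> 2 new attributes (linear combinations).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)  # redundant 5th attribute

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 5 columns -> 2 columns
print(pca.explained_variance_ratio_)    # variance kept per component
```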

6.7.1.2. Feature selection -> selection of a subset of attributes. Reduces columns

6.7.1.2.1. Correlation Analysis

6.7.1.2.2. Filter methods (use a weighting criterion and select the attributes with the highest weights)

6.7.1.2.3. Wrapper methods: use an internal classifier and select the best feature set

6.7.1.3. After Dimensionality reduction: Sampling (Reduces Rows)

6.7.1.3.1. Stratified Sampling

6.7.1.3.2. Kennard Stone Sampling

6.8. Anomaly detection (see clustering)

7. Classification I Goal: Classify unseen instances Understand the application domain as human

7.1. Lazy Learning. Instance-based: doesn't build a model; learning is only performed when an unseen instance arrives. Goal: classify unseen records as precisely as possible

7.1.1. KNN Algo

7.1.1.1. classifies new data points based on the majority class of their nearest neighbors

7.1.1.1.1. 1. Select a number k (typically an odd number to avoid ties). 2. Calculate the distance (e.g., Euclidean) from the new data point to all existing labeled data points. 3. Identify the k closest neighbors. 4. Assign the most common label among the neighbors to the new data point.

7.1.1.1.2. Most common distance measure: Euclidean
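
A sketch of the four steps above with scikit-learn (toy data, k = 3; Euclidean distance is the default):

```python
# kNN: store the labeled points, then vote among the 3 nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3)   # odd k avoids ties
knn.fit(X_train, y_train)                   # "lazy": just stores the data
print(knn.predict([[2, 2], [7, 9]]))        # -> ['a' 'b']
```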

7.1.2. Centroid

7.1.2.1. KNN vs Centroids

7.1.2.1.1. Different performance in

7.1.2.1.2. Global (centroid) vs. local (kNN) model. kNN stores all the data, the centroid method only the means. kNN needs one distance calculation per training instance, the centroid method only one per centroid

7.2. Eager Learning Goal: Classify unseen records and generate models that might be interpretable by humans

7.2.1. Decision tree

7.2.1.1. How to learn a decision tree?

7.2.1.1.1. Hunt's algorithm (see Tutorial 3 in "Let's build a decision tree")

7.2.1.1.2. Gini Splits (The lower the better)

7.2.1.2. When to split the tree? Depends on: attribute types, # of ways to split (2-way vs. multiway), purity

7.2.1.2.1. Nominal X

7.2.1.2.2. Ordinal

7.2.1.2.3. Continuous

7.2.1.3. What is the best split? Impurity vs overfitting

7.2.1.3.1. Gini: min 0.0 (pure); max 1 − 1/(number of classes) (impure: classes equally distributed, the least interesting information)
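
A worked check of those Gini bounds (gini = 1 − Σ pᵢ²) on made-up class counts:

```python
# Gini impurity from class counts at a node.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))    # 0.0  -> pure node (minimum)
print(gini([5, 5]))     # 0.5  -> 1 - 1/2, two equally likely classes (maximum)
print(gini([4, 3, 3]))  # ~0.66; the maximum for 3 classes would be 1 - 1/3
```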

7.2.1.3.2. Entropy and many others

7.2.1.4. Advantages vs Disadvantages

7.2.1.4.1. + inexpensive + fast + easy to interpret + accuracy comparable to other techniques for small datasets

7.2.1.4.2. - only one single attribute at a time (axis-parallel decision boundaries)

7.2.1.4.3. Good Practice

7.2.1.5. Decision tree vs kNN

7.2.1.5.1. Boundaries

7.2.1.5.2. Scale sensitivity

7.2.1.5.3. Runtime and Memory

7.3. Overfitting in decision trees (e.g., fitting on a name attribute)

7.3.1. Does not generalize well on unseen data. Goal: classify unseen examples

7.3.1.1. Occams Razor

7.3.1.1.1. Symptoms

7.3.1.1.2. Causes

7.3.1.1.3. Solution

7.4. Evaluation Metrics

7.4.1. How good is a model in classifying unseen examples

7.4.1.1. Confusion Matrix

7.4.1.1.1. Accuracy (tut 3)

7.4.1.1.2. Error Rate

7.4.1.1.3. Precision and Recall

7.4.1.1.4. Cost sensitive model

7.4.1.1.5. ROC curves (for kNN and Naive Bayes) (see Tut 3 for interpretation and calculation)

8. Classification II

8.1. Naive Bayes -> Probabilistic Classification

8.1.1. Mind the application and normalize at the end

8.1.1.1. Bayes Classifier

8.1.1.1.1. Given attributes, how likely is class label c? (Binary class)

8.1.1.2. Handling numerical attributes

8.1.1.2.1. discretize

8.1.1.2.2. normalize and calculate the base P normally

8.1.1.2.3. Use different distribution as density function

8.1.1.3. Missing values

8.1.1.3.1. since Naive Bayes multiplies probabilities, a single zero makes the whole numerator 0. That's a problem; the fix is to add another term (smoothing, see the sketch below)

8.1.1.3.2. Unseen record
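
A hedged sketch of the usual fix, Laplace (add-one) smoothing, with made-up counts:

```python
# Laplace smoothing keeps one unseen attribute value from zeroing the product.
def p_laplace(count_value_and_class, count_class, n_values):
    return (count_value_and_class + 1) / (count_class + n_values)

# Attribute with 3 possible values; class "yes" seen 9 times in training,
# but this particular value never occurred together with class "yes":
print(p_laplace(0, 9, 3))  # 1/12 instead of 0 -> the product stays non-zero
```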

8.1.1.4. Decision boundary

8.1.1.4.1. soft margins, uncertain areas, random shapes, larger

8.1.1.5. Pro and con

8.1.1.5.1. works well even when the independence assumption is violated

8.1.1.5.2. Problem: too many redundant attributes (select subset!)

8.2. Support Vector Machines

8.2.1. for continuous attributes. But still: the class is binary

8.2.1.1. good for high dimensional data

8.2.1.1.1. Goal: fit a (linear) hyperplane (decision boundary)

8.2.1.1.2. Non-linearity

8.2.1.1.3. Optimize hyperplane:

8.2.1.1.4. Strengths and Limitations

8.3. Artificial Neural Networks

8.3.1. Layout: it's complicated, watch a video. Function, input/output, bias term, activation value

8.3.1.1. Algo for training ANNs

8.3.2. Types of deep learning models

8.3.2.1. CNN: image recognition

8.3.2.2. BERT

8.3.2.2.1. Pretrain/finetune language models

8.3.2.3. Instruct language models

8.3.2.4. Generative models

8.4. Evaluation Methods. How to obtain reliable results

8.4.1. Is not "how to measure performance"

8.4.1.1. Split

8.4.1.1.1. Test vs training

8.4.1.1.2. Holdout Method

8.4.1.1.3. **Performance estimation**: k-fold cross validation estimates performance. Overfitting risk with hyperparameter tuning because information spills from the test folds into training
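
A sketch of plain k-fold CV for performance estimation; for hyperparameter tuning you would nest the search (e.g., GridSearchCV) inside it so that no test-fold information leaks:

```python
# 10-fold cross validation: average score estimates performance on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean(), scores.std())  # estimated accuracy and its variance
```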

8.4.1.2. Learning curve

8.4.1.2.1. how does accuracy change?

8.4.1.3. Hyperparameter and Model Selection

8.4.1.3.1. a value set before learning that influences the learning process (e.g., # of hidden layers in an ANN)

8.5. Validating and comparing models. Is NOT "evaluation methods": these are not specific techniques but an overall strategy to check whether a model generalizes well

8.5.1. Overfitting revisited

8.5.1.1. types

8.5.1.1.1. use test partition for training

8.5.1.1.2. tune parameters against the test partition (k-fold cross validation)

8.5.1.1.3. use the test set in feature construction

8.5.2. Overtuning problem

8.5.3. Validating a Better(?) model. Is performance better by chance or by design

8.5.3.1. size of test set

8.5.3.1.1. What is an error?

8.5.3.2. Occams Razor

8.5.3.2.1. if one model does not perform significantly better, choose the simpler one

8.5.3.3. Variance

8.5.3.3.1. sign test

8.5.3.3.2. Descriptives (standard deviation, pairwise comparison)

8.5.3.3.3. wilcoxon signed rank test

8.5.3.4. Ablation studies

8.5.3.4.1. What happens if we leave out a step of the pipeline? Does the model get simpler and stay just as reliable?

9. Ensembles: paradigm shift -> many simple learners. They need a certain accuracy and diversity in their answers, so that the ensemble gets the answer right and predictions on new examples differ. Have diverse base classifiers (in practice these are not independent: DT, Naive Bayes, kNN); we then combine their results into a single prediction. But how?

9.1. inverse distance weighting: 1/distance

9.1.1. The probability of a wrong prediction (binomial likelihood) gets smaller the more base classifiers we have (in theory). Hard in practice because the base learners are not independent of each other, and we need diversity, which drives the error rate

9.1.1.1. Causes of errors: Biases in data samples -> overfitting

9.1.1.1.1. Solution: resample the data so that only one model overfits each noise point and gets outvoted

9.1.2. The ensemble makes a wrong prediction only if the majority of classifiers make a wrong prediction

9.1.2.1. In theory: we can lower the error infinitely by just adding more base learners

9.1.2.1.1. Binomial likelihood, but the assumption is independence of the base learners. Violated in practice. Why? Because we also need diversity, and that drives the error rate
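
A worked check of the binomial argument with made-up error rates: with n independent base learners of error rate ε, the ensemble errs only when a majority errs:

```python
# Ensemble error under the (unrealistic) independence assumption.
from math import comb

def ensemble_error(n, eps):
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(ensemble_error(5, 0.3))   # ~0.16: already better than one learner
print(ensemble_error(25, 0.3))  # ~0.02: more independent learners -> less error
```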

9.2. --Boosting-- Train a set of classifiers one after another, where later classifiers focus on examples misclassified by earlier learners

9.2.1. Realization: multiple iterations with different weights - successively increase the weight of incorrectly classified examples - so they are more important in the next iterations - combine the results of all iterations, weighted by their respective error measures

9.2.1.1. AdaBoost Algo

9.2.1.1.1. Hypothesis space (decision boundary)
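
A hedged sketch of AdaBoost over decision stumps with scikit-learn (toy data; the estimator parameter name assumes scikit-learn ≥ 1.2):

```python
# AdaBoost: each round reweights misclassified examples, as described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak "stump" learner
    n_estimators=50,
)
print(ada.fit(X, y).score(X, y))  # training accuracy of the boosted ensemble
```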

9.2.2. Error rate

9.3. --Stacking-- Until now: majority voting (e.g., votes x o x x -> x; 3× x vs. 2× o -> x wins). Now: a classifier on the individual votes

9.3.1. Idea: Metaclassifier

9.3.1.1. Problem: overfitting due to dumb learners (one that is perfect in training makes the meta-classifier put a lot of confidence on the dumb learner)

9.3.1.1.1. Solution:

9.3.2. Variants

9.3.2.1. keep the original attributes

9.3.2.1.1. the predictions of the base learners become additional attributes

9.3.2.2. Use confidences instead of hard votes

9.4. Learning with costs: give weight to some errors

9.4.1. MetaCost algorithm. Goal: relabel the training data with the optimal class (minimized cost)

9.4.1.1. Metacost vs Balancing

9.4.1.1.1. unbalanced set: bias towards the larger class; balancing gives more meaningful models. MetaCost: unbalances the dataset on purpose, labelling more instances with the cheap class -> the learner is biased towards the cheap class (avoids expensive misclassifications, more false alarms)

9.4.2. Cost function on ordinal data

10. Regression

10.1. What is a regression?

10.1.1. Regression vs. classification: - discrete -> continuous variables - both supervised (prediction) - predicting known labels -> predicting values that might not be in the training data - also other evaluation methods

10.1.1.1. Classification stuff that I can do with a regression (Interpolation)

10.1.1.1.1. ANNs for Regression non-lin relationships

10.1.1.1.2. Decision tree -> regression tree

10.1.1.1.4. Transformation

10.1.1.2. Typical Regression stuff

10.1.1.2.1. Lin Regression (Extrapolation)

10.1.1.2.2. Non Linear Regression

10.1.1.3. Evaluation Metrics

10.1.1.3.1. MAE

10.1.1.3.2. MSE

10.1.1.3.3. RMSE*

10.1.1.3.4. Pearsons

10.1.1.3.5. R^2
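
A worked sketch of these metrics on made-up predictions:

```python
# MAE, MSE, RMSE, Pearson's r, and R^2 on four invented predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

print("MAE :", mean_absolute_error(y_true, y_pred))          # 0.625
print("MSE :", mean_squared_error(y_true, y_pred))           # 0.5625
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # 0.75
print("r   :", np.corrcoef(y_true, y_pred)[0, 1])            # Pearson's r
print("R^2 :", r2_score(y_true, y_pred))                     # ~0.85
```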

10.1.1.4. Bias/Variance Tradeoff

10.1.1.4.1. Goal: Learn model that generalizes well to unseen data

11. Cluster Analysis. Unsupervised learning, preprocessing (descriptive, the data has no target attribute) vs. supervised learning, where classes are known beforehand by humans and we use patterns for prediction of the target attribute of new data

11.1. Def: finding groups of objects such that objects in a group are similar to one another and different from objects in other groups -> goal: get a better understanding of (patterns in) the data

11.1.1. Cluster analysis (partitional and density-based clustering): division of the data into non-overlapping subsets such that each object is in exactly one subset

11.1.1.1. Needs: an algorithm (partitional, density-based, ...), a proximity measure (Euclidean distance, cosine similarity, ...), a measure of quality (minimal SSE)

11.1.1.1.1. K-means clustering algorithm (partitional): assumes that the clusters are blob- or ball-shaped. Why? It minimizes the Euclidean distance between points and their cluster centroids.

11.1.1.1.2. DBSCAN: density-based clustering. Density: # of points within a specified radius
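
A sketch contrasting the two on the same toy data; the eps/min_samples values are assumptions chosen for this data, and label −1 marks noise:

```python
# k-means needs k and prefers ball shapes; DBSCAN follows arbitrary dense shapes.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
print(KMeans(n_clusters=2, n_init=10).fit_predict(X)[:10])  # cuts the moons apart
print(DBSCAN(eps=0.3, min_samples=5).fit_predict(X)[:10])   # follows the moon shapes
```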

11.1.1.1.3. Proximity measures

11.1.1.1.4. Anomaly/outlier detection (which is not the same as extreme values)

11.1.2. Cluster analysis (hierarchical): every object belongs to a cluster and its parent clusters, so: overlapping. Output: classification and hierarchy. Produces a set of nested clusters organized in a hierarchical tree (x-axis: bounds of the clusters, y-axis: the distance at which clusters were merged)

11.1.2.1. strengths: no assumption about the # of clusters + the dendrogram can be cut at any point. Merging clusters means finding parents (kind of a pattern). Used for taxonomies

11.1.2.2. Bottom-up: start with each instance in its own cluster, merge clusters recursively to find parents

11.1.2.2.1. 1. Compute the proximity matrix 2. Put the 2 closest data points into a cluster and repeat 3. After the merging steps we have some clusters. How do we determine which clusters are closest so that we can merge them? (see the sketch below)
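
A hedged sketch with SciPy on made-up 2D points: the linkage method answers exactly that "which clusters are closest?" question, and cutting the result yields a partitional clustering:

```python
# Agglomerative (bottom-up) clustering; each row of Z records one merge step.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
Z = linkage(X, method="complete")   # also: "single", "average", "ward"
print(Z)                            # merge steps with their distances
print(fcluster(Z, t=3, criterion="maxclust"))  # cut the dendrogram into 3 clusters
```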

11.1.2.3. Top-down: all instances start in one cluster; split clusters recursively to find children. End: all clusters contain only one example

11.1.2.3.1. the distance metric is similar to complete linkage (use the distance to the farthest instance when splitting)

11.1.2.4. Cutting a hierarchical clustering yields a partitional clustering

11.1.2.4.1. we can choose an arbitrary number of clusters

11.1.2.5. Problem and Limitation

11.1.2.5.1. greedy algorithm: decisions taken cannot be undone (see the notes above on the bottom-up approach). High space and time complexity (mostly O(N^3))