Data Mining (Green Box: Algorithm, Blue Box: Calculation, Violet Box: Discussion)

Data Mining MMSDS


1. Multimodal Data Goal: encode all types of data as numbers. Goals: 1) focus on the most important features (reducing dimensionality) 2) encoding 3) put different modes of data into one model

1.1. Types of data

1.1.1. Images, Audio, Video, Time Series

1.1.1.1. are all encoded as numbers

1.1.1.1.1. TF-IDF vectors for text
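
A minimal sketch (scikit-learn is an assumption, not prescribed by the lecture) of turning text into TF-IDF vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]   # toy corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())    # learned vocabulary
print(X.toarray())                           # TF-IDF weight of each term per document
```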

1.2. Neural Networks

1.2.1. Output is predicted from the values at the hidden layer

1.2.1.1. So we can use the last hidden layer as a reduced representation of the input that describes the output (like PCA)

1.2.1.1.1. 1. We use Pre-trained Networks

1.3. Auto Encoders

1.3.1. Instead of training with identical input and output: we add random noise to the input, remove the noise, and use the clean example as the output (denoising)

1.3.1.1. Example: Word2vec: input: context words plus noise; output: relations between words, e.g. Man -> Woman shows the same behaviour as King -> Queen (see the sketch below)

1.3.1.1.1. Example: BERT: unlike Word2vec's one fixed representation per word, the representations are contextual; we can mask a word and reconstruct the whole sentence
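
A hedged sketch of the Word2vec analogy above, using gensim with a small pretrained GloVe model (the specific model is an assumption for illustration):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")    # small pretrained word embeddings
# vector arithmetic: king - man + woman ~ queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```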

1.4. Problem with time series

1.4.1. We want to feed the ECG time series of each patient into a model

1.4.1.1. High dimensional!

1.4.1.1.1. TSFRESH
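
A sketch of how TSFRESH turns a raw, high-dimensional time series per patient into a fixed-length feature vector; the column names are illustrative assumptions:

```python
import pandas as pd
from tsfresh import extract_features

df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],        # one ECG-like series per patient id
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [0.1, 0.5, 0.3, 0.9, 0.8, 1.2],
})
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)   # one row per patient, many automatically generated features
```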

1.5. Network Data

1.5.1. classify people in a network (who is a gatekeeper?)

1.5.1.1. Methods that encode this as numbers: node2vec
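
A small sketch of encoding network structure as numbers. It uses networkx betweenness centrality as a classic "gatekeeper" score; node2vec would instead learn a dense embedding per node (e.g. via the third-party node2vec package), which is not shown here:

```python
import networkx as nx

G = nx.karate_club_graph()                    # small example social network
centrality = nx.betweenness_centrality(G)     # high value = node bridges communities
gatekeeper = max(centrality, key=centrality.get)
print(gatekeeper, centrality[gatekeeper])
```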

2. Time Series Task: given a time series, what is the trend? (Revisit with the notebook)

2.1. Problem: TS = RE + SE + CE

2.1.1. Random effects

2.1.1.1. solution: smoothing

2.1.2. Seasonal effects

2.1.2.1. Exponential Smoothing

2.1.2.2. How do we isolate these?

2.1.2.2.1. By first decomposing (identifying) these

2.1.3. Cyclical effects

2.1.3.1. What if the periodicity is unknown?

2.1.3.1.1. Assume: the time series is a sum of sine waves (Fourier transformation)
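
A minimal sketch of finding an unknown periodicity by looking at the strongest frequency in the Fourier spectrum (numpy, synthetic data):

```python
import numpy as np

t = np.arange(200)
series = np.sin(2 * np.pi * t / 25) + 0.3 * np.random.randn(200)   # hidden period of 25

spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(len(series), d=1.0)
dominant = freqs[np.argmax(spectrum[1:]) + 1]    # skip the zero-frequency bin
print("estimated period:", 1 / dominant)         # close to 25
```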

2.1.4. (Missing data)

2.1.4.1. replacing with the average is a poor solution

2.1.4.1.1. Solution: linear interpolation, replace with the previous value, replace with the next value, kNN imputation
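
A sketch of the simple gap-filling options with pandas (toy series):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0])
print(s.interpolate())   # linear interpolation
print(s.ffill())         # replace with previous value
print(s.bfill())         # replace with next value
# kNN imputation works on feature vectors rather than a single series,
# e.g. via sklearn.impute.KNNImputer
```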

2.2. How to estimate trend curves?

2.2.1. Freehand method (looking): does not scale. Purpose: describing

2.2.2. OLS method. Purpose: predicting

2.2.2.1. fits a linear function minimizing the squared errors

2.2.3. moving averages

2.2.3.1. the next value is estimated as the average of the last n values

2.2.3.1.1. By first decomposing

2.2.3.2. Exponential smoothing

2.2.3.2.1. smoothing factor α = 1: just keep the value itself; α = 0: replace with the smoothed past (s_t = α·x_t + (1-α)·s_{t-1})
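
A minimal sketch of simple exponential smoothing, s_t = α·x_t + (1-α)·s_{t-1}, showing the effect of the smoothing factor:

```python
def exponential_smoothing(values, alpha):
    smoothed = [values[0]]                   # initialise with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

data = [3, 10, 12, 13, 12, 10, 12]
print(exponential_smoothing(data, alpha=1.0))   # alpha = 1: output equals the raw series
print(exponential_smoothing(data, alpha=0.2))   # alpha = 0.2: heavily smoothed
```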

2.3. Prediction: Autoregressive models

2.3.1. Moving average for prediction: predict the average of the last n values; learn weights

2.3.1.1. a linear prediction that learns from the last periods (lagged variables)

2.3.1.1.1. The prediction itself is not linear; weight learning works well

2.3.1.2. Predicting with exponential smoothing: we predict the weighted average of the last value and the last prediction (recursive); use exponential smoothing for prediction

2.3.1.3. Double exponential smoothing

2.3.1.4. Triple exponential smoothing
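
A hedged sketch of double/triple exponential smoothing via statsmodels' Holt-Winters implementation (additive trend and seasonality are assumptions for the example):

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

t = np.arange(36)
series = 10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12)   # trend + seasonality, period 12
model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(6))   # predict the next six periods
```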

2.4. Evaluation

2.4.1. 10-fold cross validation

2.4.1.1. Data leakage!

2.4.1.1.1. Respect the temporal order: train on past periods and test on the most recent period instead of shuffled CV (see the sketch below)
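
A sketch of time-ordered splits with scikit-learn's TimeSeriesSplit, which always trains on earlier observations and tests on later ones (avoids leakage):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
```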

3. Association Analysis

3.1. Are purchased items related? Needs transformation into binary variables ("Browser Type == Chrome") or discretization. Problem: many values; small intervals -> low support, large intervals -> low confidence

3.1.1. Pearsons correlation coefficient

3.1.2. Correlation Analysis

3.1.3. Association Analysis Given a set of transactions, find rules

3.1.3.1. Rules: predict the occurrence of an item given the occurrences of other items in the transaction

3.1.3.1.1. Example: {diaper} -> {beer} Given diaper, likely that person will buy beer (Not deterministic!)

3.1.3.1.2. Frequent itemset: a set of one or more items; not ordered; its ranking comes from the # of times the products are bought together

3.1.3.1.3. Support count (σ): frequency of occurrence

3.1.3.1.4. Support (s): how frequently the itemset occurs in the data. High -> the itemset is more common. s(X) = (# transactions containing X) / (total # transactions)

3.1.3.1.5. Confidence (c): the probability that the consequent items are purchased given that the condition items are purchased. High = stronger association. How often items in Y appear in transactions that contain X: c(X -> Y) = support(X ∪ Y) / support(X)

3.1.3.1.6. Association rule: implication of the form X -> Y where both are itemsets; its confidence is the same as P(Y|X). X = condition, Y = consequent
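
A hedged sketch of mining such rules with the mlxtend library (the library and thresholds are assumptions, not prescribed by the lecture):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["diaper", "beer", "bread"],
                ["diaper", "beer"],
                ["bread", "milk"],
                ["diaper", "bread", "beer", "milk"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```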

3.1.4. Subgroup Discovery: Find all patterns that can explain target var

3.1.4.1. Subgroup discovery vs classification: goal: accuracy AND learn as much as possible about the target (the elephants example)

3.1.4.1.1. Algos: EXPLORA

3.1.4.1.2. Metrics: Weighted Relative Accuracy (balances accuracy and coverage): WRAcc = P(ST) - P(S)·P(T). If S and T are independent, WRAcc = 0. The optimum is P(T) - P(T)^2 (compare your result against this). High P(ST): high coverage. Low P(S) - P(ST): accurate subgroup. High WRAcc: high accuracy (and coverage)
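
A worked sketch of WRAcc = P(ST) - P(S)·P(T) computed from simple counts (toy numbers, not from the lecture):

```python
def wracc(n_total, n_subgroup, n_target, n_subgroup_and_target):
    p_st = n_subgroup_and_target / n_total
    p_s = n_subgroup / n_total
    p_t = n_target / n_total
    return p_st - p_s * p_t

# 1000 instances, 200 in the subgroup, 300 positives overall, 120 positives inside the subgroup
print(wracc(1000, 200, 300, 120))   # 0.12 - 0.2 * 0.3 = 0.06 > 0 -> better than independence
```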

4. Intro

4.1. Why DM?

4.1.1. Large quantities of different types of data are collected

4.1.1.1. DM methods: discover patterns

4.1.1.1.1. and help with decision making

4.1.1.2. Traditional approaches unsuitable bc of:

4.1.1.2.1. large amount of data

4.2. What is DM?

4.2.1. the non-trivial process of identifying valid, novel, useful and understandable patterns in data

4.3. DM Process

4.3.1. Selection and Profiling

4.3.1.1. What is available and useful?

4.3.1.1.1. Visualize data and identify problems

4.4. Preprocessing

4.4.1. transform data into a representation that is suitable for DM methods

4.4.1.1. Scales of attributes (num,)

4.4.1.2. number of dimensions (# attributes)

4.5. Transformation

4.5.1. Discretization

4.5.2. Feature subset selection

4.5.3. Embeddings

4.6. Data Mining

4.6.1. Input: preprocessed data output: model

4.6.1.1. Apply DM method

4.6.1.1.1. Evaluate the resulting model/patterns

4.7. Interpretation

4.7.1. Output: patterns

4.7.1.1. gain knowledge

4.7.1.1.1. use in a business context

4.8. DM Methods

4.8.1. descriptive (find patterns)

4.8.1.1. unsupervised

4.8.1.1.1. classification (binary)

4.8.2. Predictive (predict unknown values of variable)

4.8.2.1. supervised

4.8.2.1.1. Regression (continuous)

5. Data Quality Problems (if ignored, test performance is inflated)

5.1. Errors

5.1.1. What are the sources? Sensors, wild/invalid codes, bugs in processing code

5.1.1.1. Solution: remove values outside a plausible interval using domain knowledge (e.g. -30 to 20 degrees)

5.1.1.1.1. anomaly detection

5.2. Missing Values

5.2.1. What are the reasons? Sensor failures, info not provided by respondents, data loss

5.2.1.1. Solution

5.2.1.1.1. Replace (simple imputation)

5.2.1.1.2. Remove (when the missing rate is high and only few rows are affected)

5.2.1.1.3. Predict (advanced imputation): treat missing values as a learning problem. Target: the missing feature. Training data: instances where the feature is present
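
A sketch of simple vs. advanced imputation with scikit-learn (toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
print(SimpleImputer(strategy="mean").fit_transform(X))   # replace with the column mean
print(KNNImputer(n_neighbors=2).fit_transform(X))        # predict from the nearest neighbours
```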

5.2.2. Random vs not random

5.2.2.1. not random: the missingness itself carries information ("are you drinking?" left blank), or it applies only to a subpopulation ("how long pregnant?")

5.2.2.1.1. Imputing, predicting can lead to information loss!

5.3. Unbalanced distribution. NEVER balance the test set! Problem: the decision tree does not find any split that improves quality

5.3.1. Solution Goal: As much and as diverse data as possible

5.3.1.1. Resampling

5.3.1.1.1. Downsampling (problem: we want to use as much data as possible) vs upsampling (problem: we want to use diverse training data to prevent overfitting)

5.3.1.2. Weighting

5.4. Different Scales

5.4.1. Problem: kNN is sensitive to scale. Why? Euclidean distance

5.4.1.1. Solution

5.4.1.1.1. Normalization
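
A sketch of putting attributes on comparable scales before a distance-based learner like kNN (scikit-learn scalers, toy data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[170.0, 70000.0], [160.0, 30000.0], [180.0, 50000.0]])  # e.g. height vs. salary
print(MinMaxScaler().fit_transform(X))     # normalization to [0, 1]
print(StandardScaler().fit_transform(X))   # standardization: zero mean, unit variance
```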

5.5. False Predictors

5.5.1. Def: the target variable is (partly) contained in an attribute. A feature might **appear predictive** but is actually **deterministic** given domain knowledge. Example: grape "Barbera" -> wine "Barbera"

5.5.1.1. Solution

5.5.1.1.1. False Predictor is seen in...

5.5.1.1.2. learn the model and drop the suspect attribute until accuracy drops

5.5.1.1.3. correlation = 1 (heatmap)

5.6. Unsupported Data Types

5.6.1. Some algorithms cannot handle some types of data. Example: SVM, kNN, NN do not work with categorical data

5.6.1.1. Cat -> Num

5.6.1.1.1. Ord ->num

5.6.1.1.2. Nom -> num (e.g. one-hot encoding; see the sketch below)

5.6.1.2. Num -> Ordinal (association rule learning)

5.6.1.2.1. Bins and Buckets
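
A sketch of both conversions with pandas (toy data):

```python
import pandas as pd

df = pd.DataFrame({"browser": ["Chrome", "Firefox", "Chrome"],
                   "age": [23, 45, 31]})
# nominal -> numeric: one-hot encoding
print(pd.get_dummies(df["browser"]))
# numeric -> ordinal: binning into buckets (e.g. for association rule learning)
print(pd.cut(df["age"], bins=[0, 30, 60, 100], labels=["young", "middle", "old"]))
```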

5.6.1.3. Dates

5.6.1.3.1. formalize, parse

5.6.1.4. Text2Vec

5.6.1.4.1. Preprocessing: text cleanup, tokenization, stopword removal, stemming

5.7. High Dimensionality

5.7.1. Large number of attributes (x): scalability problem. Decision trees overfit because they grow too deep (too many levels of nodes); Naive Bayes needs independent features

5.7.1.1. Principal Component Analysis -> a smaller set of new attributes is created (taking the most expressive linear combinations of the existing attributes)

5.7.1.1.1. Combines dimensions (see the PCA sketch below)
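
A sketch of PCA with scikit-learn, reducing many attributes to a few expressive linear combinations (toy data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 instances, 10 attributes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)    # 100 x 2
print(X_reduced.shape, pca.explained_variance_ratio_)
```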

5.7.1.2. Feature selection -> select a subset of the attributes (reduces columns)

5.7.1.2.1. Correlation Analysis

5.7.1.2.2. Filter methods (use a weighting criterion and select the attributes with the highest weights)

5.7.1.2.3. Wrapper methods: use an internal classifier and select the best feature set

5.7.1.3. After Dimensionality reduction: Sampling (Reduces Rows)

5.7.1.3.1. Stratified Sampling

5.7.1.3.2. Kennard Stone Sampling

5.8. Anomaly detection (see clustering)

6. Classification I Goal: classify unseen instances; understand the application domain as a human would

6.1. Lazy Learning: instance-based; does not build a model; the learning effort is only spent when classifying unseen instances. Goal: classify unseen records as precisely as possible

6.1.1. KNN Algo

6.1.1.1. how it works

6.1.1.1.1. 1. Select a number k (typically an odd number to avoid ties). 2. Calculate the distance (e.g., Euclidean) from the new data point to all existing labeled data points. 3. Identify the k closest neighbors. 4. Assign the most common label among the neighbors to the new data point.

6.1.1.1.2. Most common distance measure: Euclidean
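
A sketch of the steps above with scikit-learn's kNN classifier (toy data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
knn = KNeighborsClassifier(n_neighbors=3)    # k = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[2, 2], [5, 6]]))         # -> ['A' 'B']
```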

6.1.2. Centroid

6.1.2.1. How Does Nearest Centroid Work?

6.1.2.1.1. 1. Compute the centroid for each class 2. Calculate the distance (e.g., Euclidean distance) from a new data point to all centroids. 3. Assign the class label of the nearest centroid to the new data point.

6.2. Eager Learning Goal: Classify unseen records and generate models that might be interpretable by humans

6.2.1. Decision tree

6.2.1.1. How to learn a decision tree?

6.2.1.1.1. Hunts Algo (see tutorial 3 in "Lets build a decision tree")

6.2.1.1.2. Gini Splits (The lower the better)

6.2.1.2. When to split the tree? Depends on: attribute types, # of ways to split (2-way vs multiway), purity

6.2.1.2.1. Nominal X

6.2.1.2.2. Ordinal

6.2.1.2.3. Cont

6.2.1.3. What is the best split? Impurity vs overfitting

6.2.1.3.1. Gini: min 0.0 (pure), max 1 - 1/(number of classes) (impure: classes equally distributed, least interesting information; see the sketch below)

6.2.1.3.2. Entropy and many others
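
A sketch of computing both impurity measures for a node (pure Python, toy labels):

```python
from math import log2

def gini(labels):
    counts = [labels.count(c) for c in set(labels)]
    return 1 - sum((n / len(labels)) ** 2 for n in counts)

def entropy(labels):
    counts = [labels.count(c) for c in set(labels)]
    return -sum((n / len(labels)) * log2(n / len(labels)) for n in counts)

print(gini(["+", "+", "+", "+"]))      # 0.0  -> pure node
print(gini(["+", "+", "-", "-"]))      # 0.5  -> maximum for 2 classes (1 - 1/2)
print(entropy(["+", "+", "-", "-"]))   # 1.0 bit -> maximally impure
```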

6.2.1.4. Advantages vs Disadvantages

6.2.1.4.1. + inexpensive + fast + easy to interpret + accuracy comparable to other techniques for small datasets

6.2.1.4.2. - only one single attribute at a time (axis-parallel decision boundaries)

6.2.1.4.3. Good Practice

6.2.1.5. Decision tree vs kNN

6.2.1.5.1. Boundaries

6.2.1.5.2. Scale sensitivity

6.2.1.5.3. Runtime and Memory

6.3. Overfitting in decision trees (e.g. fitting on a name/ID attribute)

6.3.1. Does not generalize well on unseen data. Goal: classify unseen examples

6.3.1.1. Occams Razor

6.3.1.1.1. Symptoms

6.3.1.1.2. Causes

6.3.1.1.3. Solution

6.4. Evaluation Metrics

6.4.1. How good is a model at classifying unseen examples? Train/Test Split! Why? - Imagine we create a **1-NN classifier** – what would be the training error? → **Zero**, because each point is its own nearest neighbor! But this does not mean the model generalizes well.

6.4.1.1. Confusion Matrix

6.4.1.1.1. Accuracy = correct predictions / all predictions

6.4.1.1.2. Error rate = 1 - accuracy

6.4.1.1.3. Precision and Recall (see the metrics sketch after this block)

6.4.1.1.4. Cost sensitive model

6.4.1.1.5. ROC curves (for kNN and Naive Bayes)
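
A sketch of the confusion-matrix based metrics with scikit-learn (toy predictions):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
```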

7. Classification II

7.1. Naive Bayes -> Probabilistic Classification

7.1.1. Mind the application and normalize at the end

7.1.1.1. Bayes Classifier

7.1.1.1.1. Given attributes, how likely is class label c for new record? (Binary class)

7.1.1.2. Handling numerical attributes

7.1.1.2.1. discretize

7.1.1.2.2. normalize and calculate the base P normally

7.1.1.2.3. Use different distribution as density function

7.1.1.3. Missing values

7.1.1.3.1. Since Naive Bayes multiplies probabilities, the numerator becomes 0 if one factor is 0. That is a problem; we fix it by adding another term (smoothing)

7.1.1.3.2. Unseen record

7.1.1.4. Decision boundary

7.1.1.4.1. soft margins, uncertain areas, random shapes, larger

7.1.1.5. Pro and con

7.1.1.5.1. works well even if the independence assumption is violated

7.1.1.5.2. Problem: too many redundant attributes (select subset!)

7.2. Support Vector Machines

7.2.1. for continuous attributes, but the class is still binary

7.2.1.1. good for high dimensional data

7.2.1.1.1. Goal: fit a (linear) hyperplane (decision boundary)

7.2.1.1.2. handles Non-linearity

7.2.1.1.3. Strengths and Limitations

7.3. Artificial Neural Networks

7.3.1. Layout: it's complicated, watch a video. Function, input/output, bias term, activation value

7.3.1.1. Algo for training ANNs

7.3.1.1.1. 0. Initialize weights 1. Forward pass: compute output 2. Compute loss 3. Backpropagation: compute gradients 4. Update weights. Iterate until the error is minimized
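
A minimal sketch of that training loop with PyTorch; the tiny architecture, data and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

X = torch.randn(32, 4)                        # 32 toy examples, 4 features
y = torch.randint(0, 2, (32,)).float()
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))  # 0. weights initialised
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):                      # iterate until the error is small enough
    optimizer.zero_grad()
    out = model(X).squeeze(1)                 # 1. forward pass: compute output
    loss = loss_fn(out, y)                    # 2. compute loss
    loss.backward()                           # 3. backpropagation: compute gradients
    optimizer.step()                          # 4. update weights
```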

7.3.2. Types of deep learning models

7.3.2.1. CNN: image recognition

7.3.2.2. BERT

7.3.2.2.1. Pre-train / fine-tune language models

7.3.2.3. Instruct language models

7.3.2.4. Generative models

7.4. Evaluation Methods (this is not "how to measure performance")

7.4.1. How to obtain reliable results?

7.4.1.1. Evaluating model performance. How? Train/test split. But what is the optimal split for a reliable estimate?

7.4.1.1.1. Holdout Method

7.4.1.1.2. leave one out

7.4.1.1.3. (k-fold) Cross validation (outer loop): splits the data into **k equally sized subsets** (usually 10) - each subset in turn is used for testing, and the remainder for training - the error estimates are averaged over all subsets to yield the overall error estimate (see the sketch below)

7.4.1.2. Optimize Hyperparameters and Model Selection 🛠 The complete learning procedure is thus: - Hyperparameter Tuning ➡️ pick the best hyperparameters - Training ➡️ find the best parameters - Testing model performance on *unseen* test data via the CV split (model evaluation, see above)

7.4.1.2.1. Hyperparameter: a value set before learning that influences the learning process (e.g. # of hidden layers in an ANN)
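
A sketch of the complete procedure with scikit-learn: hyperparameter tuning via cross-validation on the training data, then evaluation on an unseen test set (kNN and the iris data are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# inner loop: pick the best hyperparameter k via 10-fold CV on the training data only
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=10)
search.fit(X_train, y_train)
print("best k:", search.best_params_)

# outer evaluation: estimate performance on unseen test data
print("test accuracy:", search.score(X_test, y_test))
```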

7.5. Validating and comparing models (this is NOT "Evaluation Methods"): not specific techniques, but an overall strategy to check whether the model generalizes well

7.5.1. Overfitting revisited

7.5.1.1. types

7.5.1.1.1. use testset for training

7.5.1.1.2. tune parameters against the test set and select the best parameters based on the test set

7.5.1.1.3. use the test set in feature construction (e.g. average orders by customer)

7.5.2. Overtuning problem: searching too hard for the best hyperparameters uses information from the validation set (e.g. intensive tuning for publishing); the model overfits the validation set and generalizes poorly

7.5.2.1. Models overfit to past data

7.5.2.2. performance on unseen data is overestimated -> disappointing customers

7.5.2.3. cold start problem: predicting something never seen before

7.5.3. Validating a better(?) model: is the performance better by chance or by design? How to compare models beyond the error rate (the complement of accuracy)?

7.5.3.1. 1. size of test set

7.5.3.1.1. A model is more credibly better if the error difference is observed on a larger test set (2000 vs 40 examples)

7.5.3.2. 2. put a confidence interval on the binomial distribution of errors

7.5.3.2.1. Significance tests: Z-test (n > 30); the binomial approximates the Gaussian for n > 30 due to the CLT

7.5.3.3. Occams Razor

7.5.3.3.1. if neither model performs significantly better, choose the simpler one

7.5.3.4. 3. Variance affects the width of the confidence intervals: what happens if I repeat this on different test/training sets?

7.5.3.4.1. Descriptives (standard deviation, pairwise comparison)

7.5.3.4.2. Sign test: 1. ignore ties 2. count the wins of model A and model B 3. compare against the critical value (# of tests - ties) 4. decide on H0

7.5.3.4.3. Wilcoxon signed rank test: ignore ties, sum up R- and R+, use a one-sided test
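
A sketch of comparing two models' per-fold error rates with scipy's Wilcoxon signed-rank test (toy numbers, not from the lecture):

```python
from scipy.stats import wilcoxon

errors_a = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.08, 0.12, 0.11, 0.10]
errors_b = [0.12, 0.14, 0.10, 0.13, 0.15, 0.11, 0.10, 0.13, 0.12, 0.12]
stat, p = wilcoxon(errors_a, errors_b)
print(p)   # a small p-value suggests the difference is unlikely to be due to chance
```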

7.5.3.5. Ablation studies measuring model simplicity

7.5.3.5.1. What happens if we leave out a step of the pipeline? Is the model simpler and does it stay just as reliable?

8. Ensembles: paradigm shift -> many simple learners. They need a certain accuracy and diversity in their answers, so that the ensemble gets the answer right and the predictions on a new example differ. Have diverse base classifiers (in practice these are not independent: decision tree, Naive Bayes, kNN); we then combine their results into a single prediction. But how?

8.1. inverse distance weighting: 1/distance

8.1.1. Probability of a wrong prediction (binomial likelihood) gets smaller the more base classifiers we have (in theory). Hard in practice because base learners are not independent of each other, and we need diversity, which drives up the error rate

8.1.1.1. Causes of errors: Biases in data samples -> overfitting

8.1.1.1.1. Solution: resample the data so that only one model overfits each noise point and gets outvoted

8.1.2. The ensemble makes a wrong prediction only if the majority of classifiers make a wrong prediction

8.1.2.1. In theory: we can lower the error infinitely if we just add more base learners

8.1.2.1.1. Binomial likelihood, but the assumption is independence of the base learners. Violated in practice. Why? Because we also need diversity, and that drives the error rate

8.2. --Boosting-- Train a set of classifiers one after another, where later classifiers focus on examples misclassified by earlier learners

8.2.1. Realization: multiple iterations with different weights - successively increase the weight of incorrectly classified examples - so they are more important in the next iterations - combine the results of all iterations, weighted by their respective error measures

8.2.1.1. AdaBoost Algo

8.2.1.1.1. Hypothesis space (decision boundary)

8.2.2. Error rate
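
A sketch of boosting with scikit-learn's AdaBoostClassifier (default decision-stump base learners; toy data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)
boost = AdaBoostClassifier(n_estimators=50, random_state=0)   # weak learners trained sequentially
boost.fit(X, y)                                               # misclassified examples get higher weight
print(boost.score(X, y))
```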

8.3. --Stacking-- Until now: simple majority voting over the individual predictions (e.g. votes x, o, x, x -> predict x). Now: learn a classifier on the individual votes

8.3.1. Idea: Metaclassifier

8.3.1.1. Problem: overfitting due to dumb learners (a learner that is perfect in training makes the meta-classifier put a lot of confidence on it)

8.3.1.1.1. Solution:

8.3.2. Variants

8.3.2.1. keep the original attributes

8.3.2.1.1. prediction of base learners are additional attributes

8.3.2.2. Use confidence intervals

8.4. Learning with costs: give weight to some errors

8.4.1. MetaCost algorithm. Goal: relabel the training data with the optimal class (minimizing cost)

8.4.1.1. Metacost vs Balancing

8.4.1.1.1. Unbalanced set: bias towards the larger class; balancing gives more meaningful models. MetaCost: unbalances the dataset on purpose, labelling more instances with the cheap class -> the learner is biased towards the cheap class (avoids expensive misclassifications, more false alarms)

8.4.2. Cost function on ordinal data

9. Regression

9.1. What is a regression?

9.1.1. Regression vs Classification - discrete -> continuous target variables - predicting known labels -> predicting values that might not appear in the training data - also different evaluation methods (both are supervised prediction tasks)

9.1.1.1. Classification stuff that I can do with a regression (Interpolation)

9.1.1.1.1. ANNs for regression: non-linear relationships

9.1.1.1.2. Decision tree -> regression tree

9.1.1.1.3. ANNs for Regression

9.1.1.1.4. Transformation

9.1.1.2. Typical Regression stuff

9.1.1.2.1. Linear regression (extrapolation)

9.1.1.2.2. Non Linear Regression

9.1.1.3. Evaluation Metrics

9.1.1.3.1. MAE

9.1.1.3.2. MSE

9.1.1.3.3. RMSE*

9.1.1.3.4. Pearsons

9.1.1.3.5. R^2
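
A sketch of the regression metrics above with scikit-learn/numpy (toy predictions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print("MAE:      ", mean_absolute_error(y_true, y_pred))
print("MSE:      ", mean_squared_error(y_true, y_pred))
print("RMSE:     ", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R^2:      ", r2_score(y_true, y_pred))
print("Pearson r:", np.corrcoef(y_true, y_pred)[0, 1])
```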

9.1.1.4. Bias/Variance Tradeoff

9.1.1.4.1. Goal: Learn model that generalizes well to unseen data

10. Cluster Analysis: unsupervised learning, preprocessing (descriptive, the data has no target attribute) vs supervised learning, where classes are known beforehand by humans and we use the patterns for prediction of the target attribute of new data

10.1. Def: finding groups of objects such that objects within a group are similar to one another and different from objects in other groups -> goal: get a better understanding of (patterns in) the data

10.1.1. Cluster analysis (partitional and density-based clustering): division of the data into non-overlapping subsets such that each object is in exactly one subset

10.1.1.1. Needs: an algorithm (partitional, density-based, ...), a proximity measure (Euclidean distance, cosine similarity, ...), a measure of quality (minimal SSE)

10.1.1.1.1. K-Means clustering algorithm (partitional): assumes that the clusters are blob- or ball-shaped. Why? It minimizes the Euclidean distance between points and their cluster centroids (see the clustering sketch below)

10.1.1.1.2. DBSCAN: density-based clustering. Density: # of points within a specified radius
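
A sketch of both clustering algorithms with scikit-learn (toy blobs; eps and k are illustrative):

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # assumes ball-shaped clusters
print("k-means SSE (inertia):", kmeans.inertia_)

dbscan = DBSCAN(eps=1.0, min_samples=5).fit(X)   # density-based; label -1 marks noise/outliers
print("DBSCAN labels found:", set(dbscan.labels_))
```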

10.1.1.1.3. Proximity measures

10.1.1.1.4. Anomaly/outlier detection (which is not just extreme values)

10.1.2. Cluster analysis (hierarchical): every object belongs to a cluster and its parent clusters, so clusters overlap. Output: classification and hierarchy. Produces a set of nested clusters organized as a hierarchical tree / dendrogram (x axis: bounds of the clusters, y axis: distance at which clusters were merged)

10.1.2.1. Strengths: no assumption about the # of clusters + the dendrogram can be cut at any point. Merging clusters finds parents (a kind of pattern). Used for taxonomies

10.1.2.2. Bottom-up (agglomerative): start with each instance in its own cluster, merge clusters recursively to find parents

10.1.2.2.1. 1. Compute the proximity matrix 2. Put the 2 closest data points into a cluster and repeat 3. After some merging steps we have several clusters. How do we determine which clusters are closest so that we can merge them? (see the sketch below)
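
A sketch of bottom-up (agglomerative) clustering with scipy, including the linkage criterion that decides which clusters are "closest" (toy data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5, 6], [9, 9]])
Z = linkage(X, method="complete")   # "complete" = farthest-pair distance; also "single", "average"
print(Z)                            # each row: merged clusters, merge distance, new cluster size
print(fcluster(Z, t=3, criterion="maxclust"))   # cut the dendrogram into 3 partitional clusters
```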

10.1.2.3. Top-down (divisive): all instances start in one cluster, split clusters recursively to find children. End: all clusters contain only one example

10.1.2.3.1. the distance metric is similar to complete linkage (use the distance to the farthest instance when splitting)

10.1.2.4. Cutting a hierarchical clustering yields a partitional clustering

10.1.2.4.1. we can choose an arbitrary number of clusters

10.1.2.5. Problem and Limitation

10.1.2.5.1. Greedy algorithm: decisions taken cannot be undone (see the bottom-up approach above). High space and time complexity (mostly O(N^3))