Data Mining. Green Box: Algo to understand. Blue Box: practice calculations here. Violet Box: discussion exercises

Datamining MMSDS


1. Multimodal Data: data from different sources. Goal: encode all types of data as numbers, because predictive models operate on numeric input. Steps: 1) focus on the most important features (reducing dimensionality) 2) encoding 3) put different modes of data into one model

1.1. Types of data

1.1.1. Images, Audio, Video, Time Series

1.1.1.1. are all encoded in Numbers

1.1.1.1.1. Text -> TF-IDF vectors

1.1.1.2. tabular data already suitable

1.2. Convolutional Neural Networks (image processing)

1.2.1. Output is predicted from the values at the hidden layer

1.2.1.1. So the last hidden layer can stand in for the input layer as a reduced description of the input (like PCA)

1.2.1.1.1. Use pre-trained networks -> take representations from an inner layer (second to last) and train a new model on top

1.3. Autoencoders: a NN trained to reproduce its input. Uses a compressed hidden representation like PCA, but with a non-linear transformation

1.3.1. Instead of training with the same input and output, we add random noise to the input and use the clean example as the output, so the network learns to remove the noise

1.3.1.1. Example: Word2vec: learns relations (similarities) between words, so Man -> Woman behaves like King -> Queen. Two variants: 1. predict a word from its context 2. predict the context from a word

1.3.1.1.1. Example: BERT: instead of one fixed representation per word, we mask a word and reconstruct it from the whole sentence, so the model can represent not just a word but longer text -> transformer-based model for contextual word representations

1.4. Problem with time series

1.4.1. We want to feed ECG for each patient

1.4.1.1. High dimensional!

1.4.1.1.1. TSFRESH: vectorizes via feature extraction; extracts statistical and frequency features

1.4.1.1.2. Neural Autoencoders for Time Series

1.5. Network Data

1.5.1. classify people in a network (who is a gatekeeper?)

1.5.1.1. Methods that encode this as numbers: node2vec

1.6. Putting it all together

1.6.1. apply modality specific encoders

1.6.1.1. concatenate resulting vectors

1.6.1.1.1. use as input for classifier/regressor
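
The "putting it all together" step can be sketched as plain vector concatenation. This is a minimal illustration, assuming each modality-specific encoder has already produced a numeric vector; the function name `concatenate_features` and all values are made up for the example.

```python
def concatenate_features(*vectors):
    """Join per-modality feature vectors into one flat input vector."""
    combined = []
    for vec in vectors:
        combined.extend(float(x) for x in vec)
    return combined


# Hypothetical encoder outputs (values are invented for illustration).
text_vec = [0.0, 0.7, 0.1]   # e.g. TF-IDF weights
image_vec = [0.25, 0.5]      # e.g. activations of an inner CNN layer
tabular_vec = [42.0]         # tabular data is already numeric

x = concatenate_features(text_vec, image_vec, tabular_vec)
```

The resulting `x` is what would be passed to a classifier or regressor.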

2. Time series: data that comes with timestamps (e.g. stock market development)

2.1. **Trends**: long-term movement or direction in the data over time. Example: climate change and the rise in temperature

2.1.1. Goal: Detect general trend

2.1.1.1. How to model **trend curves?**

2.1.1.1.1. Freehand method (looking): doesn't scale. Describing

2.1.1.1.2. OLS method. Predicting

2.1.1.1.3. Moving averages: smooth short-term fluctuations and highlight long-term trends
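
A moving average can be sketched in a few lines. This assumes a simple (unweighted, trailing) window; `moving_average` and `window` are names chosen for this example.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```

A larger window smooths more strongly but also reacts more slowly to real changes in the trend.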

2.1.1.2. Problem: the trend is obscured by the following obstacles:

2.1.1.2.1. **Random effects**: unpredictable, irregular variations in the data that cannot be explained by trends, seasonality, or cyclicality. Example: daily fluctuations in stock prices

2.1.1.2.2. **Seasonal effects**: periodic fluctuations in the data that repeat at regular intervals (e.g., daily, monthly, yearly).

2.1.1.2.3. **Cyclical effects**: longer-term fluctuations that do not follow a fixed frequency but are driven by broader factors, often external events. Example: effect of the war in Ukraine on global markets (recovery and contraction)

2.1.1.2.4. **Missing data**

2.2. How to forecast Time series?

2.2.1. Purpose: Forecasting

2.2.1.1. Autoregressive models: they learn periodicities in the data. The model learns weights (coefficients) for lagged values using training data.

2.2.1.2. predicting with **exponential smoothing**

2.2.1.2.1. Forecasting for data with no trend or seasonality Predicts next value by smoothing past values.

2.2.1.3. Double exponential smoothing

2.2.1.3.1. accounts for the trend component

2.2.1.4. Triple exponential smoothing

2.2.1.4.1. For data with trend + seasonality
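
Simple (single) exponential smoothing can be sketched as below. This is only the no-trend, no-seasonality case; the double and triple variants add trend and seasonal terms that are not shown here. The smoothing factor name `alpha` and the choice to start from the first observation are assumptions of this sketch.

```python
def exponential_smoothing(values, alpha):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    level_t = alpha * value_t + (1 - alpha) * level_{t-1}
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]  # assumption: initialize with the first observation
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


forecast = exponential_smoothing([10, 20, 30], alpha=0.5)
```

With `alpha` close to 1 the forecast follows the latest value; with `alpha` close to 0 it changes slowly.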

2.3. Evaluation

2.3.1. 10-fold cross-validation?

2.3.1.1. Data leakage! Because it breaks the temporal order

2.3.1.1.1. Instead: train (and run CV) on past periods and test on the most recent period
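
Because ordinary k-fold CV breaks temporal order, one common alternative is an expanding-window split in which every test block lies strictly after its training data. The sketch below is one possible way to generate such index splits; the function name and parameters are chosen for this example.

```python
def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    splits = []
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for this many folds")
        train_idx = list(range(test_start))            # everything before the test block
        test_idx = list(range(test_start, test_end))   # the next block in time
        splits.append((train_idx, test_idx))
    return splits


splits = expanding_window_splits(n_samples=10, n_folds=2, test_size=2)
```

Each later fold trains on more history, and no fold ever sees data from after its test period.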

3. Association Analysis

3.1. Are purchased items related? Needs transformation into binary variables ("Browser Type == Chrome") or discretization. Problem: many values; small intervals -> low support, large intervals -> low confidence

3.1.1. Pearson's correlation coefficient

3.1.2. Correlation Analysis (Heatmap): this helps us understand basic pairwise relationships, with values between -1 and +1. But it doesn't scale well

3.1.3. Association Analysis Given a set of transactions, find rules

3.1.3.1. Rules: predict the occurrence of an item given the occurrences of other items in the transaction

3.1.3.1.1. Example: {diaper} -> {beer} Given diaper, it is likely that the person will buy beer (not deterministic!)

3.1.3.1.2. **Frequent Itemset Mining**: - an itemset contains one or more items - itemsets are not ordered - ranking comes from the # of times the products are bought together

3.1.3.1.3. **Association Rule Generation**: identify the probability of purchasing certain items given that others were bought. Strength: evaluate with confidence and lift

3.1.3.1.4. **Frequent Pairs**
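
Support, confidence and lift for a rule such as {diaper} -> {beer} can be computed directly from the transactions. The sketch below uses the standard definitions; the toy transactions are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Invented toy data.
transactions = [
    {"diaper", "beer"},
    {"diaper", "beer", "milk"},
    {"diaper", "milk"},
    {"milk"},
]
conf_rule = confidence(transactions, {"diaper"}, {"beer"})
lift_rule = lift(transactions, {"diaper"}, {"beer"})
```

A lift above 1 indicates that the items occur together more often than expected under independence.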

3.1.4. Subgroup Discovery: Find all patterns that can explain target var

3.1.4.1. Subgroup Discovery vs Classification: Goal: accuracy AND learn as much as possible about elephants

3.1.4.1.1. Algos: EXPLORA

3.1.4.1.2. Metrics: Weighted Relative Accuracy (accuracy and coverage): WRAcc = P(ST) - P(S)*P(T). If S and T are independent, WRAcc = 0. Optimum is P(T) - P(T)^2: compare with the result. High P(ST): high coverage. Low P(S) - P(ST): accurate subgroup. High WRAcc: high accuracy
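
The WRAcc formula P(ST) - P(S)*P(T) can be checked on small examples. In this sketch the subgroup membership and the target are given as 0/1 lists of equal length; the function name `wracc` is chosen for the example.

```python
def wracc(in_subgroup, is_target):
    """Weighted relative accuracy: P(S and T) - P(S) * P(T)."""
    n = len(in_subgroup)
    p_s = sum(in_subgroup) / n
    p_t = sum(is_target) / n
    p_st = sum(1 for s, t in zip(in_subgroup, is_target) if s and t) / n
    return p_st - p_s * p_t


# Subgroup identical to the target: reaches the optimum P(T) - P(T)^2.
perfect = wracc([1, 1, 0, 0], [1, 1, 0, 0])
# Subgroup independent of the target: WRAcc is 0.
independent = wracc([1, 1, 0, 0], [1, 0, 1, 0])
```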

4. Intro

4.1. Why DM?

4.1.1. Large quantities of different types of data are collected

4.1.1.1. DM methods: discover patterns

4.1.1.1.1. and help with decision making

4.1.1.2. Traditional approaches unsuitable bc of:

4.1.1.2.1. large amount of data

4.2. What is DM?

4.2.1. non-trivial process of identifying valid, novel, useful and understandable patterns in data

4.3. DM Process

4.3.1. Selection and Profiling

4.3.1.1. What is available, useful?

4.3.1.1.1. Visualize data and identify problems

4.4. Preprocessing

4.4.1. transform data into a representation that is suitable for DM methods

4.4.1.1. Scales of attributes (num,)

4.4.1.2. number of dimensions (# attributes)

4.5. Transformation

4.5.1. Discretization

4.5.2. Feature subset selection

4.5.3. Embeddings

4.6. Data Mining

4.6.1. Input: preprocessed data output: model

4.6.1.1. Apply DM method

4.6.1.1.1. Evaluate resulting model/patterns

4.7. Interpretation

4.7.1. Output: patterns

4.7.1.1. gain knowledge

4.7.1.1.1. use in business context

4.8. DM Methods

4.8.1. descriptive (find patterns)

4.8.1.1. unsupervised

4.8.1.1.1. classification (binary)

4.8.2. Predictive (predict unknown values of variable)

4.8.2.1. supervised

4.8.2.1.1. Regression (continuous)

5. test performance is inflated

5.1. Errors

5.1.1. What types of sources? Sensors, wild codes, bugs in processing code

5.1.1.1. Solution: remove values outside a plausible interval using domain knowledge (e.g. -30 to 20 degrees)

5.1.1.1.1. anomaly detection

5.2. Missing Values

5.2.1. What kind of reasons? Sensors, info not provided by respondents, data loss

5.2.1.1. Solution

5.2.1.1.1. Replace (simple imputation)

5.2.1.1.2. Remove (when the share of missing values is high and only few rows are affected)

5.2.1.1.3. Predict (advanced imputation): treat filling missing values as a learning problem. Target: the missing value. Training data: instances where the feature is present

5.2.2. Random vs not random

5.2.2.1. not random: the missingness holds information as well! Examples: "are you drinking?", or "how long pregnant?" (only answered by a subpopulation)

5.2.2.1.1. Imputing or predicting can lead to information loss!

5.3. Unbalanced distribution. NEVER balance the test set! Problem: a decision tree doesn't find any split that improves quality

5.3.1. Solution Goal: As much and as diverse data as possible

5.3.1.1. Resampling

5.3.1.1.1. Downsampling (problem: we want to use as much data as possible) vs upsampling (problem: we want to use diverse training data to prevent overfitting)

5.3.1.2. Weighting

5.4. Different Scales

5.4.1. Problem: kNN is sensitive to scale. Why? Euclidean distance

5.4.1.1. Solution

5.4.1.1.1. Normalization
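
One common form of normalization is min-max scaling to the range [0, 1]. The notes do not say which variant is meant, so treat this as one possible choice; mapping a constant column to 0.0 is an assumption of this sketch.

```python
def min_max_normalize(values):
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 30])
```

After scaling every attribute this way, no single attribute dominates the Euclidean distance just because of its unit.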

5.5. False Predictors

5.5.1. Def: the target variable is included in an attribute. A feature might **appear predictive** but is actually **deterministic** based on domain knowledge. Example: grape: Barbera -> wine: Barbera

5.5.1.1. Solution

5.5.1.1.1. False Predictor is seen in...

5.5.1.1.2. learn model and drop suspect until accuracy drops

5.5.1.1.3. correlation = 1 (heatmap)

5.6. Unsupported Data Types

5.6.1. Some algos can't handle some types of data. Example: SVM, kNN and NN don't work with categorical data

5.6.1.1. Cat -> Num

5.6.1.1.1. Ord ->num

5.6.1.1.2. Nom -> num

5.6.1.2. Num -> Ordinal (association rule learning)

5.6.1.2.1. Bins and Buckets

5.6.1.3. Dates

5.6.1.3.1. formalize, parse

5.6.1.4. Text2Vec

5.6.1.4.1. Preprocessing: text cleanup, tokenization, stopword removal, stemming

5.7. High Dimensionality

5.7.1. large number of attributes (x), scalability problem. A decision tree overfits because it grows too deep (too many levels of nodes). Naive Bayes needs independent features

5.7.1.1. Principal Component Analysis -> a smaller set of new attributes is created (by taking the most expressive linear combinations of existing attributes)

5.7.1.1.1. Combines dimensions

5.7.1.2. Feature selection -> selects a subset of attributes. Reduces columns

5.7.1.2.1. Correlation Analysis

5.7.1.2.2. Filter methods (use a weighting criterion and select the attributes with the highest weights)

5.7.1.2.3. Wrapper methods: use an internal classifier and select the best feature set

5.7.1.3. After Dimensionality reduction: Sampling (Reduces Rows)

5.7.1.3.1. Stratified Sampling

5.7.1.3.2. Kennard Stone Sampling

5.8. Anomaly detection (see clustering)

6. Classification I. Goal: classify unseen instances. Understand the application domain as a human

6.1. Lazy Learning: instance-based. Doesn't build a model. Learning is only performed when unseen instances arrive. Goal: classify unseen records as precisely as possible

6.1.1. KNN Algo

6.1.1.1. how it works

6.1.1.1.1. 1. Select a number k (typically an odd number to avoid ties). 2. Calculate the distance (e.g., Euclidean) from the new data point to all existing labeled data points. 3. Identify the k closest neighbors. 4. Assign the most common label among the neighbors to the new data point.

6.1.1.1.2. Most common distance measure: Euclidean
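
The four kNN steps can be sketched with the standard library only. This is a minimal illustration with Euclidean distance and majority vote; the toy points and labels are invented, and `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the query to every labeled point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 3: keep the k closest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: most common label wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented toy data.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["a", "a", "a", "b", "b"]
prediction = knn_predict(points, labels, query=(0.2, 0.2), k=3)
```

Note that all the work happens at prediction time, which is exactly what makes kNN a lazy learner.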

6.1.2. Centroid

6.1.2.1. How Does Nearest Centroid Work?

6.1.2.1.1. 1. Compute the centroid for each class 2. Calculate the distance (e.g., Euclidean distance) from a new data point to all centroids. 3. Assign the class label of the nearest centroid to the new data point.
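
The nearest-centroid steps above can be sketched as follows. The centroid is taken as the per-dimension mean of each class, and the helper names are chosen for this example.

```python
import math


def fit_centroids(points, labels):
    """Step 1: compute the mean point (centroid) of each class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, x in enumerate(point):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(s / counts[label] for s in acc)
        for label, acc in sums.items()
    }


def predict_centroid(centroids, query):
    """Steps 2 and 3: return the label of the closest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], query))


# Invented toy data.
centroids = fit_centroids([(0, 0), (2, 0), (10, 10), (12, 10)], ["a", "a", "b", "b"])
label = predict_centroid(centroids, (3, 1))
```

Unlike kNN, only one point per class has to be stored and compared at prediction time.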

6.2. Eager Learning Goal: Classify unseen records and generate models that might be interpretable by humans

6.2.1. Decision tree

6.2.1.1. How to learn a decision tree?

6.2.1.1.1. Hunt's algorithm (see tutorial 3 in "Lets build a decision tree")

6.2.1.1.2. Gini Splits (The lower the better)

6.2.1.2. When to split the tree? Depends on: attribute types, # of ways to split (2-way vs multiway), purity

6.2.1.2.1. Nominal X

6.2.1.2.2. Ordinal

6.2.1.2.3. Cont

6.2.1.3. What is the best split? Impurity vs overfitting

6.2.1.3.1. Gini: min 0.0 (pure), max 1 - 1/(number of classes) (impure, because classes are equally distributed: least interesting information)

6.2.1.3.2. Entropy and many others
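
The Gini bounds mentioned above can be verified with a short sketch. `gini` computes the impurity of one node and `gini_split` the size-weighted impurity of a candidate split; both names are chosen for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(groups):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini(group) for group in groups)


pure = gini(["a", "a"])                      # a pure node
two_class_max = gini(["a", "b"])             # 1 - 1/2 for two balanced classes
perfect_split = gini_split([["a", "a"], ["b", "b"]])
```

When comparing candidate splits, the one with the lowest `gini_split` value is preferred.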

6.2.1.4. Advantages vs Disadvantages

6.2.1.4.1. + inexpensive + fast + easy to interpret + accuracy comparable to other techniques for small datasets

6.2.1.4.2. - only one single attribute at a time (axis-parallel decision boundaries)

6.2.1.4.3. Good Practice

6.2.1.5. Decision tree vs kNN

6.2.1.5.1. Boundaries

6.2.1.5.2. Scale sensitivity

6.2.1.5.3. Runtime and Memory

6.3. Overfitting in Decision trees (fitting on a name)

6.3.1. Does not generalize well on unseen data. Goal: classify unseen examples

6.3.1.1. Occam's Razor

6.3.1.1.1. Symptoms

6.3.1.1.2. Causes

6.3.1.1.3. Solution

6.4. Evaluation Metrics

6.4.1. How good is a model at classifying unseen examples? Train/test split! Why? Imagine we create a **1-NN classifier**: what would be the training error? → **Zero**, because each point is its own nearest neighbor! But this does not mean the model generalizes well.

6.4.1.1. Confusion Matrix

6.4.1.1.1. Accuracy: correct predictions / all predictions

6.4.1.1.2. Error rate: 1 - accuracy

6.4.1.1.3. Precision and Recall

6.4.1.1.4. Cost sensitive model

6.4.1.1.5. ROC curves (for kNN and Naive Bayes)
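
Accuracy, precision and recall all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; returning 0.0 when a denominator is zero is an assumption of this example, and the labels are invented.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0  # assumption: 0.0 if nothing predicted positive


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0  # assumption: 0.0 if no actual positives


# Invented toy labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```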

7. Classification II

7.1. Naive Bayes -> Probabilistic Classification

7.1.1. Pay attention to how it is applied, and normalize at the end

7.1.1.1. Bayes Classifier

7.1.1.1.1. Given attributes, how likely is class label c for new record? (Binary class)

7.1.1.2. P(C) (Prior)

7.1.1.2.1. count

7.1.1.3. Handling numerical attributes

7.1.1.3.1. discretize

7.1.1.3.2. normalize and calculate the base P normally

7.1.1.3.3. Use different distribution as density function

7.1.1.4. Missing values

7.1.1.4.1. since Naive Bayes multiplies probabilities, a single zero count makes the whole numerator 0. That's a problem no matter which other terms we add

7.1.1.4.2. Unseen record

7.1.1.5. Decision boundary

7.1.1.5.1. soft margins, uncertain areas, random shapes, larger

7.1.1.6. Pro and con

7.1.1.6.1. works well even if the independence assumption is violated

7.1.1.6.2. Problem: too many redundant attributes (select subset!)
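
The counting-and-normalizing procedure for Naive Bayes with categorical attributes can be sketched as follows. This is a minimal version without any smoothing, so it also reproduces the zero-numerator problem noted above; the weather-style toy data and all function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies (priors) and per-class attribute value frequencies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            cond[(i, label)][value] += 1
    return priors, cond


def posterior(priors, cond, row):
    """Multiply prior and conditional probabilities, then normalize at the end."""
    n = sum(priors.values())
    scores = {}
    for label, count in priors.items():
        score = count / n                                  # P(C)
        for i, value in enumerate(row):
            score *= cond[(i, label)][value] / count       # P(attribute value | C)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()} if total else scores


# Invented toy data.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes"]
priors, cond = train_naive_bayes(rows, labels)
probs = posterior(priors, cond, ("sunny", "mild"))
```

Here "sunny" never occurs with class "yes", so that class gets probability 0 regardless of the other attribute.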

7.2. Support Vector Machines

7.2.1. for continuous attributes. But still: binary class

7.2.1.1. good for high dimensional data

7.2.1.1.1. Goal: fit a (linear) hyperplane (decision boundary)

7.2.1.1.2. handles Non-linearity

7.2.1.1.3. Strengths and Limitations

7.3. Artificial Neural Networks

7.3.1. Layout: it's complicated, watch a video. Function, input/output, bias term, activation value

7.3.1.1. Algo for training ANNs

7.3.1.1.1. 0. Initialize weights 1. Forward pass: compute output 2. Compute loss 3. Backpropagation: compute gradients 4. Update weights. Iterate until the error is minimized
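
The training loop above can be illustrated on the smallest possible "network": a single linear neuron trained with gradient descent on the mean squared error. With no hidden layers, backpropagation reduces to computing the gradient of the loss directly. The learning rate, epoch count and toy data are assumptions of this sketch.

```python
def train_linear_neuron(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                                   # 0. initialize weights
    n = len(xs)
    for _ in range(epochs):
        preds = [w * x + b for x in xs]               # 1. forward pass
        errors = [p - y for p, y in zip(preds, ys)]   # 2. loss is mean(error^2)
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n  # 3. gradients
        grad_b = sum(2 * e for e in errors) / n
        w -= lr * grad_w                              # 4. update weights
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
w, b = train_linear_neuron([0, 1, 2, 3], [1, 3, 5, 7])
```

Real ANNs apply the same four steps, with the chain rule propagating gradients back through each hidden layer.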

7.3.2. Types of deep learning models

7.3.2.1. CNN: image recognition

7.3.2.2. BERT

7.3.2.2.1. Pretrain finetune language models

7.3.2.3. Instruct language models

7.3.2.4. Generative models

7.4. Evaluation Methods: this is not "how to measure performance"

7.4.1. How to obtain reliable results?

7.4.1.1. Evaluating model performance. How? Train/test split. But what is the optimal split for a reliable estimate?

7.4.1.1.1. Holdout Method

7.4.1.1.2. leave one out

7.4.1.1.3. (k-fold)Cross validation (outer loop) Splits the data into **k equally sized subsets** (usually 10) - Each subset in turn is used for testing, and the remainder for training - The error estimates are averaged over all subsets to yield the overall error estimate
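
The k-fold splitting described above can be sketched as index generation. This version does not shuffle (in practice the data is usually shuffled or stratified first) and spreads any remainder over the first folds; both choices are assumptions of this example.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=5)
```

The overall error estimate is then the average of the k per-fold error estimates.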

7.4.1.2. Optimize hyperparameters and model selection 🛠 The complete learning procedure is thus: - Hyperparameter tuning ➡️ pick best hyperparameters - Training ➡️ find best parameters - Testing model performance on *unseen* test data via CV split (model evaluation, see above)

7.4.1.2.1. Hyperparameter: value set before learning; influences the learning process (e.g. # of hidden layers in an ANN)

7.5. Validating and comparing models. This is NOT the same as evaluation methods: these are not specific techniques but an overall strategy to check whether the model generalizes well

7.5.1. Overfitting revisited

7.5.1.1. types

7.5.1.1.1. use test set for training

7.5.1.1.2. tune parameters against the test set and select the best parameters based on the test set

7.5.1.1.3. use test set in feature construction (e.g. avg. orders by customer)

7.5.2. Overtuning problem: searching too hard for the best hyperparameters using info from the validation set (e.g. intensive tuning for publishing). The model overfits the validation set and generalizes poorly

7.5.2.1. Models overfit to past data

7.5.2.2. performance on unseen data is overestimated -> disappointing customers

7.5.2.3. cold start problem: predicting smth never seen before

7.5.3. Validating a better(?) model. Is performance better by chance or by design? How to compare models aside from error rate (the opposite of accuracy)?

7.5.3.1. 1. size of test set

7.5.3.1.1. Model is better if the error difference is observed on a larger test set (2000 vs 40)

7.5.3.2. 2. Put confidence intervals on the binomial distribution

7.5.3.2.1. Significance tests: Z test (n > 30) (the binomial approximates the Gaussian if n > 30, due to the CLT)
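
The effect of test set size on the confidence interval can be sketched with the normal approximation to the binomial. The z-value 1.96 (roughly a 95% interval) and the error counts are assumptions chosen for illustration; only the test set sizes 2000 and 40 come from the notes.

```python
import math


def error_confidence_interval(errors, n, z=1.96):
    """Normal-approximation interval for an error rate observed on n test examples."""
    p = errors / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Same observed error rate (10%), very different test set sizes.
lo_big, hi_big = error_confidence_interval(errors=200, n=2000)
lo_small, hi_small = error_confidence_interval(errors=4, n=40)
```

The interval on 40 examples is far wider, which is why the same error difference is more convincing on the larger test set.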

7.5.3.3. Occams Razor

7.5.3.3.1. if two models don't differ significantly in performance, choose the simpler one

7.5.3.4. 3. Variance affects the width of confidence intervals: what happens if I repeat this on a different test/training set?

7.5.3.4.1. Descriptives (standard deviation, pairwise comparison)

7.5.3.4.2. sign test: 1. ignore ties 2. count wins of model A and model B 3. compare with the critical value (n = # of tests - ties) 4. decide on H0

7.5.3.4.3. Wilcoxon signed-rank test: ignore ties, rank the absolute differences, sum up R- and R+, compare the smaller sum with the critical value (one-sided test)
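
The sign test steps can be sketched with an exact binomial tail instead of a critical-value table. This is a two-sided version under the null hypothesis that each model wins with probability 0.5; the per-dataset scores are invented for illustration.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Return (wins_a, wins_b, two-sided p-value), ignoring ties."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b                      # number of comparisons minus ties
    k = max(wins_a, wins_b)
    # Probability of at least k wins out of n if both models were equally good.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


# Invented accuracies of two models on six datasets (one tie).
scores_a = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6]
scores_b = [0.8, 0.7, 0.80, 0.7, 0.90, 0.5]
wins_a, wins_b, p_value = sign_test(scores_a, scores_b)
```

H0 (no difference between the models) is rejected only if `p_value` falls below the chosen significance level.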

7.5.3.5. Ablation studies: measuring model simplicity

7.5.3.5.1. what happens if we leave out a step of the pipeline? Is the model simpler and does it stay just as reliable?

8. Ensembles: paradigm shift -> use many simple learners. They need a certain accuracy and diversity in their answers, so that the ensemble gets the answer right while the individual predictions on a new example differ. Have diverse base classifiers, e.g. DT, Naive Bayes, kNN (in practice these are not independent). We then combine their results into a single prediction. But how?

8.1. By --weighting--: assign weights based on importance. Errors on high-weight instances are penalized more

8.1.1. Probability of a wrong prediction (binomial likelihood) gets smaller the more base classifiers we have (in theory). Hard in practice because base learners are not independent of each other. And we need diversity, and this drives the error rate

8.1.1.1. Causes of errors: Biases in data samples -> overfitting

8.1.1.1.1. Idea: Metaclassifier. Train n base classifiers, get their predictions on training data, form a new dataset (meta-data) whose attributes are the predictions from each base classifier, train the meta classifier on the new dataset

8.1.2. Ensemble makes a wrong prediction if the majority of classifiers make a wrong prediction

8.1.2.1. In theory: we can lower the error indefinitely if we just add more base learners

8.1.2.1.1. Binomial likelihood, but the assumption is independence of base learners. Violated in practice. Why? Because we also need diversity, and that drives the error rate
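
The theoretical argument can be made concrete: if base learners were independent and each had the same error rate, the ensemble would err only when a majority errs, which is a binomial tail. The sketch below assumes exactly that idealized setting, which the notes say is violated in practice.

```python
from math import comb


def ensemble_error(n_learners, base_error):
    """P(majority of n independent learners is wrong), each with error `base_error`."""
    k_min = n_learners // 2 + 1   # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_learners, k) * base_error ** k * (1 - base_error) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )


err_3 = ensemble_error(3, 0.3)
err_25 = ensemble_error(25, 0.35)
```

Adding more independent learners lowers the ensemble error only while the base error stays below 0.5.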

8.2. --Voting-- Most basic Knn

8.3. By --Boosting--: train a set of classifiers one after another, where later classifiers focus on examples misclassified by earlier learners (these get increased weight)

8.3.1. Realization: multiple iterations with different weights - successively increase the weight of incorrectly classified examples - so they are more important in the next iterations - combine the results of all iterations, weighted by their respective error measures

8.3.1.1. AdaBoost Algo

8.3.1.1.1. Hypothesis space (decision boundary)

8.3.2. Error rate

8.4. By --Stacking--: until now: xoxx -> x, 3(x)2(o) -> 2/3; now: a classifier trained on the individual votes

8.4.1. Idea: Metaclassifier

8.4.1.1. Problem: overfitting due to dumb learners (a learner is perfect on the training data and the meta classifier puts lots of confidence in the dumb learner)

8.4.1.1.1. Solution:

8.4.2. Variants

8.4.2.1. keep original attributes

8.4.2.1.1. prediction of base learners are additional attributes

8.4.2.2. Use confidence intervals

8.5. Learning with costs: give weights to some errors

8.5.1. MetaCost Algo. Goal: relabel training data with the optimal class (minimized cost)

8.5.1.1. MetaCost vs Balancing

8.5.1.1.1. Unbalanced set: bias towards the larger class; balancing gives more meaningful models. MetaCost: unbalances the dataset on purpose, labelling more instances with the cheap class -> learner is biased towards the cheap class (avoids expensive misclassifications, more false alarms)

8.5.2. Cost function on ordinal data

9. Regression

9.1. What is a regression?

9.1.1. Regression vs Classification: - discrete -> continuous target variable - both are supervised (prediction) - predicting known labels -> predicting values that might not be in the training data - also other evaluation methods

9.1.1.1. Classification stuff that I can do with a regression (Interpolation)

9.1.1.1.1. kNN for regression: non-linear, non-parametric relationships. Good for: small datasets with non-linear relationships

9.1.1.1.2. Decision tree -> Regression tree: - partitions the input space into regions - builds trees by selecting splits that minimize the MSE in each region - risk of overfitting

9.1.1.1.3. ANNs for Regression

9.1.1.1.4. Transformation

9.1.1.2. Typical Regression stuff

9.1.1.2.1. Lin Regression (Extrapolation)

9.1.1.2.2. Non Linear Regression

9.1.1.3. Evaluation Metrics

9.1.1.3.1. MAE (mean absolute errors)

9.1.1.3.2. MSE

9.1.1.3.3. RMSE*

9.1.1.3.4. Pearson's

9.1.1.3.5. R^2
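
MAE, MSE, RMSE and R^2 can be computed side by side. The sketch uses the standard definitions and assumes the true values are not all identical (otherwise R^2 is undefined); the example values are invented.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # assumed to be non-zero
    ss_res = sum(r * r for r in residuals)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
```

One large error dominates MSE and RMSE much more than MAE, which is visible when comparing the values in `metrics`.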

9.1.1.4. Bias/Variance Tradeoff

9.1.1.4.1. Goal: Learn model that generalizes well to unseen data

10. Cluster Analysis: unsupervised learning, preprocessing (descriptive, data has no target attribute) vs supervised learning, where classes are known beforehand by humans and we use patterns for prediction of the target attribute of new data

10.1. Def: finding groups of objects such that objects in a group are similar to one another and different from objects in other groups -> goal: get a better understanding of (patterns in) data

10.1.1. Cluster analysis (partitional and density-based clustering): division of data into non-overlapping subsets such that each object is in exactly one subset

10.1.1.1. Needs: algo (partition-based, density-based...), proximity measure (Euclidean distance, cosine similarity...), measure of quality (minimal SSE)

10.1.1.1.1. K-Means clustering algo (partitional): assumes that the clusters are blob- or ball-shaped. Why? It minimizes the Euclidean distance between points and their cluster centroids. 1. Choose K 2. Initialize K centroids 3. Assign each point to the nearest centroid 4. Update centroids to the mean of assigned points 5. Repeat until convergence 6. Output clusters
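
The K-Means steps can be sketched in plain Python. To keep the example deterministic, the initial centroids are passed in explicitly rather than chosen at random, and an empty cluster keeps its old centroid; both are assumptions of this sketch.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # convergence
            break
        centroids = new_centroids
    return centroids, clusters


# Invented toy data with two obvious blobs.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

Different initial centroids can lead to different final clusters, which is why K-Means is often run several times.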

10.1.1.1.2. DBSCAN Density based clustering Density: # of points within a specified radius

10.1.1.1.3. Proximity measures

10.1.2. Anomaly / Outlier Detection (outliers are not necessarily extreme values)

10.1.2.1. statistics based: assume a parametric model and apply statistical tests that depend on the data or parameter distribution

10.1.2.1.1. MAD

10.1.2.1.2. IQR
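
An IQR-based outlier check can be sketched with the standard library. The factor 1.5 is the usual convention and an assumption here, as is the default quantile method of `statistics.quantiles`; the data values are invented.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return values lying more than `factor` * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Because quartiles are barely affected by a few extreme values, the bounds stay stable even when outliers are present.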

10.1.2.2. distance based: compute the distance D between the data points. Define outliers in one of these ways:

10.1.2.2.1. top n data points whose distance to the kth nearest neighbor is greatest

10.1.2.2.2. data points for which there are fewer than p neighbors within distance D

10.1.2.2.3. top n data points whose average distance to the k nearest neighbors is greatest

10.1.2.3. density based: for each point, compute the density of its local neighborhood. If the density is lower than the average density, the point is an outlier

10.1.2.3.1. LOF

10.1.2.3.2. DBSCAN

10.1.2.4. Clustering based: Cluster data into groups with different density. choose point in small cluster as candidate outliers, compute distance between candidate points and non - candidate clusters If candidate points are far from all other non-candidate points, they are outliers

10.1.2.4.1. LOF

10.1.2.4.2. Isolation Forest

10.1.3. Cluster analysis (hierarchical): every object belongs to a cluster and to its parent clusters, so clusters overlap. Output: classification and hierarchy. Produces a set of nested clusters organized in a hierarchical tree (x-axis: bounds of clusters, y-axis: former distance between merged clusters)

10.1.3.1. Strengths: no assumption about # of clusters + dendrogram can be cut at any point. Merging clusters means finding parents (kind of a pattern). Used for taxonomies

10.1.3.2. Bottom-up: start with each instance in its own cluster, merge clusters recursively to find parents

10.1.3.2.1. 1. Have a proximity matrix 2. Put the 2 data points that are closest into a cluster and repeat 3. After some merging steps we have some clusters. How do we determine which clusters are closest so that we can merge them?

10.1.3.3. Top-down: all instances in one cluster, split clusters recursively to find children. End: all clusters contain only one example

10.1.3.3.1. distance metric is similar to complete linkage (use the distance to the farthest instance when splitting)

10.1.3.4. Cutting a hierarchical clustering yields a partitional clustering

10.1.3.4.1. we can choose an arbitrary number of clusters

10.1.3.5. Problem and Limitation

10.1.3.5.1. greedy algo: decisions taken cannot be undone (see the notes above on the bottom-up approach). High space and time complexity (mostly O(N^3))