Knowledge Discovery as studied in Marburg, SS18


1. Single Variable

1.1. Transformations

1.1.1. Ladder of Powers

1.1.1.1. Capable of leading to presentable knowledge

1.1.1.2. Power Laws

1.1.1.2.1. Pareto 80/20 rule

1.1.1.2.2. Gini Coefficient

1.1.1.2.3. 20% of Students do 80% of the work
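The Gini coefficient can be sketched in a few lines; the mean-difference formula on sorted data and the toy inputs below are illustrative assumptions, not material from the lecture:

```python
import numpy as np

def gini(x):
    # Gini coefficient via the sorted-data form:
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    return ((2 * np.arange(1, n + 1) - n - 1) * x).sum() / (n * x.sum())
```

Perfect equality gives G = 0; all mass on one of n observations gives G = (n - 1)/n.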

1.1.1.3. Z-Transform

1.1.1.3.1. Normalise with standard deviation
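A minimal sketch of the z-transform (standardisation); the sample data is made up:

```python
import numpy as np

def z_transform(x):
    # Shift to zero mean, scale by the sample standard deviation
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = z_transform(data)
```

The result has mean 0 and standard deviation 1, which makes variables on different scales comparable.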

1.1.2. Box Cox

1.1.2.1. Optimal in the maximum-likelihood sense

1.1.2.2. Estimates the exponent of the Ladder of Powers
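A sketch of Box-Cox with `scipy`, which picks the exponent lambda by maximum likelihood; the log-normal sample is an assumed example (for truly log-normal data the estimated lambda should land near 0, i.e. a log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # right-skewed sample

# boxcox returns the transformed data and the ML estimate of lambda
transformed, lam = stats.boxcox(skewed)
```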

1.1.3. CDF Trans

1.1.3.1. Leads to uniform distribution

1.1.3.2. Largely useless in practice
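The CDF transform (probability integral transform) can be sketched as follows; the normal sample and its parameters are assumed for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

# Applying the data's own CDF maps the sample to (approximately)
# the uniform distribution on [0, 1]
u = stats.norm.cdf(x, loc=5.0, scale=2.0)
```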

1.2. Distributions

1.2.1. Measures of dispersion

1.2.1.1. Variance

1.2.1.2. Standard deviation

1.2.1.3. Mean

1.2.1.3.1. MAD (mean absolute deviation)
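The dispersion measures above on a toy sample (assumed data; MAD is read here as the mean absolute deviation around the mean):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 14.0])

variance = x.var(ddof=1)                 # sample variance
std_dev = np.sqrt(variance)              # standard deviation
mad = np.mean(np.abs(x - x.mean()))      # mean absolute deviation
```

Note how the single extreme value 14 inflates the variance far more than the MAD, since deviations enter squared.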

1.2.2. PDF

1.2.2.1. Kerndichteschätzer

1.2.2.1.1. Histogram

1.2.2.1.2. Pareto Density Estimation

1.2.3. ECDF

1.2.3.1. Determined exactly by cumulating the observations

1.2.3.2. (Observations ≤ x) / (total observations)
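The ECDF definition above as a one-line function; the sample is made up:

```python
import numpy as np

def ecdf(sample, x):
    # Empirical CDF: fraction of observations less than or equal to x
    sample = np.asarray(sample)
    return np.count_nonzero(sample <= x) / sample.size

data = [3, 1, 4, 1, 5, 9, 2, 6]
```

For this sample, ecdf(data, 4) counts 5 of 8 observations, i.e. 0.625.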

1.2.4. QQ-Plot

1.2.4.1. Derivation

1.2.4.2. QQnorm
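A sketch of the computation behind a QQ-norm plot: sorted sample values against theoretical normal quantiles at the plotting positions (i - 0.5)/n. The data and seed are assumed; for normal data the points lie close to a straight line, so their correlation is near 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = np.sort(rng.normal(size=200))

# Theoretical quantiles at plotting positions (i - 0.5) / n
probs = (np.arange(1, sample.size + 1) - 0.5) / sample.size
theo = stats.norm.ppf(probs)

# In a QQ-norm plot one scatters theo against sample
r = np.corrcoef(theo, sample)[0, 1]
```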

1.2.5. Gauss distribution

1.2.6. Multi-modal distributions

1.2.6.1. Gauss Mixture Model

1.2.6.1.1. Manual but good

1.2.6.2. Expectation Maximisation (EM)

1.2.6.2.1. Great at fitting locally

1.2.6.2.2. Issue: Saddle points

1.2.6.2.3. Derivation
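The EM iteration for a two-component Gauss mixture in one dimension can be sketched directly; the bimodal sample, the initial guesses, and the fixed 50 iterations are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic bimodal sample: two Gaussian modes at 0 and 6
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])

# Rough initial guesses for means, deviations and mixture weights
mu = np.array([1.0, 5.0])
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

def normal_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

for _ in range(50):
    # E-step: responsibility of each component for each observation
    dens = w * normal_pdf(x[:, None], mu, sigma)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the weighted observations
    n_k = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    w = n_k / x.size
```

Starting from deliberately wrong means (1 and 5), the fitted means converge to the true modes, illustrating the "great at fitting locally" point: a good local optimum is found near a reasonable initialisation.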

1.2.6.3. Bayes Boundaries

1.2.6.3.1. Boundaries of groups according to Bayes Theorem P(A|B) = P(B|A) * P(A) / P(B)

1.2.6.3.2. Bayes Classification Function

1.2.6.3.3. Cost Function

1.2.6.3.4. Assumptions:
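A Bayes classification function for two classes with Gaussian likelihoods; the means, deviations and priors are assumed illustration values. With equal priors and equal deviations the Bayes boundary lies halfway between the means (here at x = 2):

```python
import numpy as np
from scipy import stats

mu = (0.0, 4.0)       # class-conditional means (assumed)
sigma = (1.0, 1.0)    # class-conditional deviations (assumed)
prior = (0.5, 0.5)    # prior probabilities P(class)

def posterior(x):
    # Bayes' theorem: P(class | x) proportional to P(x | class) * P(class)
    joint = np.array([p * stats.norm.pdf(x, m, s)
                      for m, s, p in zip(mu, sigma, prior)])
    return joint / joint.sum()

def classify(x):
    # Pick the class with the highest posterior probability
    return int(np.argmax(posterior(x)))
```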

1.2.6.4. Finding correct number of modes

1.2.6.4.1. Autoclass (Bayes)

1.3. Statistical Tests

1.3.1. Likelihood

1.3.1.1. Log Likelihood

1.3.1.1.1. Expectation Maximisation

1.3.1.2. Maximum Likelihood

1.3.1.3. Gauss Mixture Model

1.3.2. Chi Squared

1.3.2.1. Test statistic

1.3.2.1.1. Should follow the chi-squared distribution

1.3.2.2. Degrees of Freedom

1.3.2.2.1. Number of independent categories minus 1

1.3.2.3. Chi-squared distribution

1.3.2.4. Test: equality of distributions

1.3.2.4.1. The p-value is the probability of a test statistic at least this extreme, given the assumed distribution

1.3.2.5. Test: independence of variables
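The independence test can be sketched with `scipy.stats.chi2_contingency`; the contingency table counts are made up. For a 2x2 table the degrees of freedom are (rows - 1) * (cols - 1) = 1:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = group, columns = observed category
table = np.array([[30, 10],
                  [20, 40]])

# Returns the test statistic, p-value, degrees of freedom and the
# expected counts under independence
chi2, p, dof, expected = chi2_contingency(table)
```

A small p-value leads to rejecting the independence hypothesis for these counts.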

1.3.3. Kolmogorov-Smirnov

1.3.4. T-test

2. Multiple Variables

2.1. Correlations

2.1.1. Scatterplot

2.1.1.1. Between all variables

2.1.2. Coefficients

2.1.2.1. Pearson

2.1.2.1.1. cov(a,b) / (stdev(a) * stdev(b))

2.1.2.2. Spearman

2.1.2.3. Kendall's Tau
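The three coefficients side by side on an assumed monotone but non-linear relationship; rank-based Spearman and Kendall report perfect association while Pearson, which measures linear association, falls below 1:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2  # monotone but non-linear

pearson = stats.pearsonr(x, y)[0]    # linear association: < 1 here
spearman = stats.spearmanr(x, y)[0]  # rank correlation: exactly 1
kendall = stats.kendalltau(x, y)[0]  # all pairs concordant: exactly 1
```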

2.2. Distances

2.2.1. Metrics

2.2.1.1. Requirements

2.2.1.1.1. Identity

2.2.1.1.2. Triangle inequality

2.2.1.1.3. Symmetry

2.2.1.2. Correlation distance

2.2.1.3. Cosine distance

2.2.1.4. Minkowski metrics

2.2.1.4.1. r = 1 city block

2.2.1.4.2. r = 2 Euclidean

2.2.1.4.3. r → ∞ Chebyshev

2.2.1.4.4. Only metrics for r ≥ 1 (below that the triangle inequality fails)

2.2.1.5. Minkowski distances
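The whole Minkowski family in one function; the example points are assumed. On the classic 3-4-5 pair the three special cases give 7 (city block), 5 (Euclidean) and 4 (Chebyshev):

```python
import numpy as np

def minkowski(a, b, r):
    # Minkowski distance of order r (a metric for r >= 1)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if np.isinf(r):
        return np.abs(a - b).max()          # Chebyshev limit r -> inf
    return (np.abs(a - b) ** r).sum() ** (1.0 / r)

a, b = [0.0, 0.0], [3.0, 4.0]
```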

2.2.2. Similarity measures

2.2.2.1. Positivity

2.2.2.2. Maximum similarity = 1

2.2.2.3. Symmetry

2.2.2.4. s=1-d(x,y)/max(d(x,y))

2.2.2.5. Covariance

2.3. Projections

2.3.1. ESOM = Emergent Self Organising Maps

2.3.1.1. U-Matrix

2.3.1.1.1. Neuron height = distance to its neighbourhood

2.3.1.1.2. Can unfold the Chainlink dataset

2.3.1.2. Neighbourhood functions

2.3.1.3. Learning rate

2.3.2. PCA

2.3.3. ICA

2.4. Clustering

2.4.1. Partitioning

2.4.1.1. K-Means

2.4.2. Hierarchical

2.4.2.1. Ward

2.4.2.2. Single Linkage

2.4.2.3. Divisive / agglomerative
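Both clustering families on the same assumed data, using SciPy as a stand-in (the two well-separated blobs and all parameters are illustration choices). Single linkage merges nearest neighbours agglomeratively; cutting its dendrogram at 2 clusters recovers the blobs:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Two well-separated blobs of 20 points each
pts = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                 rng.normal(5.0, 0.3, (20, 2))])

# Partitioning: k-means with k = 2 (kmeans++ initialisation)
centroids, km_labels = kmeans2(pts, 2, minit="++")

# Hierarchical agglomerative: single linkage, cut into 2 clusters
sl_labels = fcluster(linkage(pts, method="single"), t=2, criterion="maxclust")
```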

3. Pre-Processing

3.1. Raw Data

3.1.1. NaN values

3.1.1.1. % of NaN

3.1.1.1.1. Shows unusable attributes

3.1.1.2. PixelMatrix

3.1.1.2.1. Scaled per Attribute

3.1.1.2.2. Shows clusters of NaN

3.1.1.3. Disregard NaNs

3.1.1.4. Imputation

3.1.1.4.1. Regression

3.1.1.4.2. K-Nearest Neighbour
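A hand-rolled sketch of k-nearest-neighbour imputation (the function name, the choice k = 2, and the toy data are all assumptions): each NaN is replaced by the mean of that column over the k complete rows closest on the columns that are present.

```python
import numpy as np

def knn_impute(X, k=2):
    # Fill each NaN with the mean of that column over the k nearest
    # complete rows, measured on the columns the incomplete row does have
    X = np.array(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]
    for row in X:                      # rows are views, edits write back
        miss = np.isnan(row)
        if not miss.any():
            continue
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        row[miss] = nearest[:, miss].mean(axis=0)
    return X

data = [[1.0, 2.0], [1.1, 2.1], [10.0, 11.0], [1.05, np.nan]]
filled = knn_impute(data)
```

The missing value in the last row is filled from its two close neighbours, not from the distant outlier row.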

3.1.2. Smallest absolute non-zero difference

3.1.2.1. Shows precision

3.1.2.2. Decide on rounding

3.1.3. Outliers

3.1.3.1. Boxplot

3.1.3.2. Remove values beyond a threshold of standard deviations from the mean
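A sketch of threshold-based outlier removal; the data and the threshold of 2 standard deviations are assumed. Note the caveat: an extreme value inflates the standard deviation itself, so a strict 3-sigma rule can mask the very outlier it is meant to catch.

```python
import numpy as np

def remove_outliers(x, n_sd=2.0):
    # Keep values within n_sd sample standard deviations of the mean
    x = np.asarray(x, dtype=float)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    return x[z < n_sd]

raw = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 100.0])
clean = remove_outliers(raw)
```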

3.2. Relative changes

3.2.1. Return (Rendite)

3.2.1.1. (E - A) / A, with A = start value, E = end value

3.2.2. Log Ratio

3.2.2.1. log(E/A)

3.2.3. Relative Difference

3.2.3.1. (E-A)/mean(A,E)

3.2.3.2. No progressive compounding

3.2.3.3. Unequal value range
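The three measures of relative change side by side on assumed values A = 100 and E = 110; note that the log ratio is antisymmetric, log(E/A) = -log(A/E), which the simple return is not:

```python
import math

A, E = 100.0, 110.0  # start and end value (assumed example)

rendite = (E - A) / A               # simple return
log_ratio = math.log(E / A)         # log ratio, antisymmetric in A and E
rel_diff = (E - A) / ((A + E) / 2)  # relative difference
```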