Data Science Fundamentals

1. Hard Method (Confidence Interval)

1.1. t_stat = t.ppf(.975, dof)

1.1.1. lower bound: mean - (t_stat * std_err)

1.1.2. upper bound: mean + (t_stat * std_err)
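
A minimal sketch of the hard method end to end, assuming a hypothetical 1-D sample array (numpy and scipy only):

import numpy as np
from scipy.stats import t

data = np.array([2.1, 2.5, 1.9, 2.3, 2.8])      # hypothetical sample
dof = len(data) - 1                              # degrees of freedom
mean = data.mean()
std_err = data.std(ddof=1) / np.sqrt(len(data))  # standard error of the mean
t_stat = t.ppf(.975, dof)                        # critical t for a 95% interval
lower = mean - (t_stat * std_err)
upper = mean + (t_stat * std_err)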

2. Data Preprocess / EDA

2.1. EDA

2.1.1. Import Data

2.1.1.1. pd.read_csv()

2.1.2. Check Data

2.1.2.1. check first 5 rows

2.1.2.1.1. df.head()

2.1.2.2. check last 5 rows

2.1.2.2.1. df.tail()

2.1.2.3. check dimension

2.1.2.3.1. df.shape

2.1.2.4. check descriptive statistics

2.1.2.4.1. df.describe()

2.1.2.5. check NaN

2.1.2.5.1. df.isnull().sum()

2.1.2.6. check columns

2.1.2.6.1. df.columns

2.1.2.7. check indexes

2.1.2.7.1. df.index
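
A quick EDA pass chaining the checks above, assuming a hypothetical file 'data.csv':

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical path
print(df.head())              # first 5 rows
print(df.tail())              # last 5 rows
print(df.shape)               # (rows, columns)
print(df.describe())          # descriptive statistics for numeric columns
print(df.isnull().sum())      # NaN count per column
print(df.columns)             # column labels
print(df.index)               # row index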

2.1.3. Export Data

2.1.3.1. df.to_csv()

2.2. Feature Engineering

2.2.1. Dataframe

2.2.1.1. Create

2.2.1.1.1. pd.DataFrame(data=,index=,columns=)

2.2.1.2. Indexing

2.2.1.2.1. by labels

2.2.1.2.2. by integers

2.2.1.3. Inserting

2.2.1.3.1. df1['a'] = df2['b']
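
A small sketch of creating, indexing, and inserting, using made-up values:

import pandas as pd

df1 = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])
df1.loc['r1', 'a']    # indexing by labels
df1.iloc[0, 0]        # indexing by integer positions
df2 = pd.DataFrame({'b': [9, 8]}, index=['r1', 'r2'])
df1['c'] = df2['b']   # insert df2's column as a new column of df1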

2.2.2. Applications

2.2.2.1. replace NaN to 'a'

2.2.2.1.1. df.fillna('a')

2.2.2.2. transpose dataframe

2.2.2.2.1. df.transpose()

2.2.2.3. replace commas

2.2.2.3.1. random_string.replace(',', '')

2.2.2.4. to numerical values

2.2.2.4.1. pd.to_numeric()

2.2.2.5. into group

2.2.2.5.1. df.groupby('a').mean()

2.2.2.5.2. df.groupby('a').b.mean()

2.2.2.6. drop

2.2.2.6.1. df.drop('a', axis = 1)

2.2.2.6.2. df.drop(1, axis = 0)

2.2.2.7. set index

2.2.2.7.1. df.set_index()

2.2.2.8. reset index

2.2.2.8.1. df.reset_index()

2.2.2.9. apply function to columns

2.2.2.9.1. df.apply(random_function)

2.2.2.10. count values

2.2.2.10.1. df['a'].value_counts()

2.2.2.11. sort values

2.2.2.11.1. df.sort_values(by = , axis = , ascending = )
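
A short sketch of these applications over a toy frame; the methods return new objects rather than modifying df in place:

import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['1,000', '2,000', None]})
df['b'] = df['b'].fillna('0')                          # replace NaN with '0'
df['b'] = pd.to_numeric(df['b'].str.replace(',', ''))  # strip commas, cast to numbers
df.groupby('a').b.mean()                               # mean of b within each group of a
df['a'].value_counts()                                 # count occurrences of each value
df.sort_values(by='b', ascending=False)                # sort rows by column b
df.drop('a', axis=1)                                   # drop a column (axis=0 drops a row)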

2.3. Data Manipulation

2.3.1. Concatenation

2.3.1.1. pd.concat([df1, df2], axis = )

2.3.1.1.1. by row (default)

2.3.1.1.2. by column

2.3.2. Merge

2.3.2.1. df1.merge(df2, how = , on = )

2.3.2.1.1. how
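
A sketch contrasting concat and merge, assuming df1 and df2 share a hypothetical key column 'id':

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'a': ['x', 'y']})
df2 = pd.DataFrame({'id': [2, 3], 'b': ['p', 'q']})
pd.concat([df1, df2], axis=0)         # stack by row (default)
pd.concat([df1, df2], axis=1)         # stack by column
df1.merge(df2, how='inner', on='id')  # how: 'inner', 'outer', 'left', or 'right'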

2.3.3. Conditioning

2.3.3.1. condition = ((df['a'] == 4) & (df['b'] != 2))

2.3.3.1.1. df[condition]

2.3.3.2. df.query('a == 4 & b != 2')
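
Both forms select the same rows; a sketch assuming integer columns a and b:

import pandas as pd

df = pd.DataFrame({'a': [4, 4, 1], 'b': [2, 5, 5]})
condition = ((df['a'] == 4) & (df['b'] != 2))  # boolean mask
df[condition]                                  # boolean indexing
df.query('a == 4 & b != 2')                    # equivalent query-string form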

2.3.4. Tidy

2.3.4.1. pd.melt(df, id_vars = , value_vars = , var_name = , value_name = )

2.3.4.1.1. id_vars

2.3.4.1.2. value_vars

2.3.4.1.3. var_name

2.3.4.1.4. value_name
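
A melt sketch turning hypothetical wide columns into tidy (variable, value) rows:

import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'], 'y2020': [1, 2], 'y2021': [3, 4]})
tidy = pd.melt(wide,
               id_vars='name',                 # columns kept as identifiers
               value_vars=['y2020', 'y2021'],  # columns unpivoted into rows
               var_name='year',                # name of the new variable column
               value_name='score')             # name of the new value column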

2.4. Data Visualization

2.4.1. Setting

2.4.1.1. set graph size

2.4.1.1.1. plt.figure(figsize=(a,b))

2.4.1.2. show multiple graphs

2.4.1.2.1. plt.subplot()

2.4.1.3. set names

2.4.1.3.1. x-label

2.4.1.3.2. y-label

2.4.1.3.3. title

2.4.1.4. set grid

2.4.1.4.1. plt.grid(True)

2.4.1.5. limit x, y coordinates

2.4.1.5.1. plt.xlim([a,b])

2.4.1.5.2. plt.ylim([c,d])

2.4.1.6. insert text

2.4.1.6.1. g.text(x=, y=, s=)
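
A minimal matplotlib sketch wiring the settings above together (g.text is the axes-level form; the module-level plt.text is used here):

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))    # graph size in inches
plt.subplot(1, 2, 1)          # first of two side-by-side axes
plt.plot([1, 2, 3], [2, 4, 8])
plt.xlabel('x')               # x-label
plt.ylabel('y')               # y-label
plt.title('demo')             # title
plt.grid(True)                # grid on
plt.xlim([0, 4])              # limit x coordinates
plt.ylim([0, 10])             # limit y coordinates
plt.text(x=1, y=8, s='note')  # insert text at data coordinates
plt.show()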

2.4.2. Plotting

2.4.2.1. Matplotlib

2.4.2.1.1. errorbar

2.4.2.1.2. barplot

2.4.2.1.3. scatterplot

2.4.2.1.4. pie-chart

2.4.2.1.5. histogram

2.4.2.1.6. boxplot

2.4.2.2. Seaborn

2.4.2.2.1. pointplot

2.4.2.2.2. barplot

2.4.2.2.3. countplot

2.4.2.2.4. facetgrid

3. Statistics

3.1. Fundamentals

3.1.1. Types

3.1.1.1. Descriptive Statistics

3.1.1.2. Inferential Statistics

3.1.2. Sampling Types

3.1.2.1. Simple Random Sampling

3.1.2.2. Systematic Sampling

3.1.2.2.1. sampling by a fixed rule (e.g., every k-th item)

3.1.2.3. Stratified Random Sampling

3.1.2.3.1. into groups ==> sampling within groups

3.1.2.4. Cluster Sampling

3.1.2.4.1. into groups ==> sample whole groups

3.1.3. Numpy Probability Density Functions

3.1.3.1. binomial

3.1.3.1.1. np.random.binomial(n = , p = , size = )

3.1.3.2. poisson

3.1.3.2.1. np.random.poisson(lam = , size = )

3.1.3.3. normal distribution

3.1.3.3.1. np.random.normal(loc = mean, scale = standard deviation, size = )
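
Sampling sketches for the three distributions; sizes and parameters are made up:

import numpy as np

binom = np.random.binomial(n=10, p=0.5, size=1000)    # successes out of 10 trials
pois = np.random.poisson(lam=3, size=1000)            # event counts at rate 3
normal = np.random.normal(loc=0, scale=1, size=1000)  # mean 0, std dev 1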

3.1.4. Parameters

3.1.4.1. mean

3.1.4.1.1. df.mean(axis=)

3.1.4.2. variance

3.1.4.3. standard deviation

3.1.4.4. standard error of the mean (SEM)
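
The outline gives no call for SEM; a sketch of one way, assuming a hypothetical 1-D sample (ddof=1 is scipy's default):

import numpy as np
from scipy import stats

data = np.array([2.1, 2.5, 1.9, 2.3])
sem = stats.sem(data)                               # std(ddof=1) / sqrt(n)
sem_manual = data.std(ddof=1) / np.sqrt(len(data))  # same value by hand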

3.1.5. Concept

3.1.5.1. Law of Large Numbers

3.1.5.1.1. as the sample size grows, the sample statistic approaches the population parameter

3.1.5.2. Central Limit Theorem

3.1.5.2.1. as the number of samples grows, the distribution of sample means approaches a normal distribution

3.1.6. Others

3.1.6.1. randomly select

3.1.6.1.1. np.random.choice(data, size = )

3.2. Tests

3.2.1. Student T-test

3.2.1.1. Conditions

3.2.1.1.1. independent samples (or matched pairs, for the paired test)

3.2.1.1.2. similar variance

3.2.1.1.3. normally distributed

3.2.1.2. Test Types

3.2.1.2.1. One-Sided Test

3.2.1.2.2. Two-Sided Test

3.2.1.2.3. ANOVA (3+ groups)

3.2.1.3. Conclusion

3.2.1.3.1. p-value
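
A two-sample sketch with scipy on made-up arrays; reject the null hypothesis when the p-value falls below the significance level (commonly 0.05):

import numpy as np
from scipy import stats

a = np.random.normal(0.0, 1, 100)
b = np.random.normal(0.5, 1, 100)
t_stat, p_two_sided = stats.ttest_ind(a, b)    # two-sided, assumes similar variances
f_stat, p_anova = stats.f_oneway(a, b, b + 1)  # ANOVA across 3+ groups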

3.2.2. Chisquare Test

3.2.2.1. Conditions

3.2.2.1.1. Categorical Data

3.2.2.2. Test Types

3.2.2.2.1. One Sample Test

3.2.2.2.2. Two Sample Test

3.2.2.3. Conclusion

3.2.2.3.1. p-value
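
A sketch of both chi-square forms with scipy; the counts are made up:

from scipy import stats

observed = [18, 22, 20, 40]          # one-sample: observed category counts
stat, p = stats.chisquare(observed)  # tested against a uniform expectation
table = [[10, 20], [30, 40]]         # two-sample: contingency table
chi2, p2, dof, expected = stats.chi2_contingency(table)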

3.2.3. Normal Test

3.2.3.1. stats.normaltest(sample_data)

3.2.3.2. Conclusion

3.2.3.2.1. p-value

3.3. Confidence Interval

3.3.1. Finding Interval

3.3.1.1. Simple Method

3.3.1.1.1. t.interval(alpha = , df = , loc = mean, scale = std_err)

3.4. Bayesian Inference

3.4.1. Concept

3.4.1.1. Original

3.4.1.1.1. True Positive

3.4.1.1.2. True Negative

3.4.1.1.3. False Positive

3.4.1.1.4. False Negative

3.4.1.2. Update

3.4.1.2.1. True

3.4.1.2.2. False

3.4.2. Bayesian Confidence Interval

3.4.2.1. mean_CI, _, _ = stats.bayes_mvs(data, alpha = )
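
A sketch of the call; bayes_mvs returns (mean, variance, std) estimates, each carrying a point estimate and an interval:

import numpy as np
from scipy import stats

data = np.random.normal(0, 1, 100)
mean_CI, _, _ = stats.bayes_mvs(data, alpha=0.95)
print(mean_CI.statistic)  # point estimate of the mean
print(mean_CI.minmax)     # (lower, upper) bounds of the interval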

4. Linear Algebra

4.1. Vector / Matrix

4.1.1. Fundamentals

4.1.1.1. Concept

4.1.1.1.1. Scalar

4.1.1.1.2. Vector

4.1.1.2. Shape

4.1.1.2.1. array.shape

4.1.1.3. Norm

4.1.1.3.1. np.linalg.norm(array)

4.1.1.4. Determinant

4.1.1.4.1. np.linalg.det(array)

4.1.1.5. Inverse

4.1.1.5.1. np.linalg.inv(array)

4.1.1.6. Transpose

4.1.1.6.1. array.T

4.1.1.7. Unit Vector

4.1.1.7.1. vectors with magnitude = 1

4.1.1.8. Orthogonal

4.1.1.8.1. vectors at 90 degrees (dot product = 0)

4.1.1.9. Orthonormal

4.1.1.9.1. orthogonal vectors with magnitude = 1

4.1.2. Dot Product

4.1.2.1. np.dot(array1, array2)
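
A numpy sketch of the fundamentals above on a small invertible matrix:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
A.shape                   # (2, 2)
np.linalg.norm(A)         # Frobenius norm (ordinary magnitude for 1-D arrays)
np.linalg.det(A)          # determinant: 5.0
np.linalg.inv(A)          # inverse (exists because det != 0)
A.T                       # transpose
np.dot(A[0], A[1])        # dot product of the two row vectors
v = np.array([3.0, 4.0])
v / np.linalg.norm(v)     # unit vector: same direction, magnitude 1
np.linalg.matrix_rank(A)  # rank: number of linearly independent columns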

4.1.3. Matrix Types

4.1.3.1. Diagonal

4.1.3.2. Upper

4.1.3.3. Lower

4.1.3.4. Identity

4.1.3.4.1. np.identity(n)

4.1.3.5. Symmetric

4.1.3.5.1. array == array.T

4.1.4. Span

4.1.4.1. Linearly Independent

4.1.4.1.1. column vectors become basis vectors

4.1.4.2. Linearly Dependent

4.1.4.2.1. only linearly independent column vectors become basis vectors

4.1.5. Rank

4.1.5.1. number of basis vectors

4.1.6. Linear Projection

4.1.6.1. Project vector A onto vector B

4.1.6.1.1. some data loss occurs, but it is negligible
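
A sketch of the projection formula proj_b(a) = (a.b / b.b) * b on made-up vectors:

import numpy as np

a = np.array([2.0, 3.0])
b = np.array([4.0, 0.0])
proj = (np.dot(a, b) / np.dot(b, b)) * b  # component of a along b: [2., 0.]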

4.2. Statistics Related

4.2.1. Mean

4.2.1.1. array.mean()

4.2.2. Variance

4.2.2.1. Population

4.2.2.1.1. array.var(ddof = 0)

4.2.2.2. Sample

4.2.2.2.1. array.var(ddof = 1)

4.2.3. Standard Deviation

4.2.3.1. Population

4.2.3.1.1. array.std(ddof = 0)

4.2.3.2. Sample

4.2.3.2.1. array.std(ddof = 1)

4.2.4. Covariance

4.2.4.1. array-like

4.2.4.1.1. np.cov(array1, array2, ddof = )

4.2.4.2. dataframe-like

4.2.4.2.1. df.cov()

4.2.5. Correlation Coefficient

4.2.5.1. array-like

4.2.5.1.1. np.corrcoef(array1, array2)

4.2.5.2. dataframe-like

4.2.5.2.1. df.corr()
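
A sketch of the array-form calls (ddof=0 for a population, ddof=1 for a sample):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])
x.mean()                      # mean
x.var(ddof=1), x.std(ddof=1)  # sample variance and standard deviation
np.cov(x, y, ddof=1)          # 2x2 covariance matrix
np.corrcoef(x, y)             # 2x2 correlation matrix, entries in [-1, 1]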

4.3. Dimension Reduction

4.3.1. Concepts

4.3.1.1. Linear Transformation

4.3.1.1.1. e.g., Linear Projection

4.3.1.2. Eigenvector

4.3.1.2.1. a vector whose direction is unchanged by the linear transformation

4.3.1.3. Eigenvalue

4.3.1.3.1. the factor by which the eigenvector is scaled

4.3.2. Types

4.3.2.1. Feature Selection

4.3.2.2. Feature Extraction

4.3.3. PCA (Principal Components Analysis)

4.3.3.1. Steps

4.3.3.1.1. Word

4.3.3.1.2. Code

4.3.3.2. Analysis

4.3.3.2.1. Eigenvectors

4.3.3.2.2. Eigenvalues

4.3.3.2.3. Eigenvalues in ratio

4.3.3.2.4. Projected Data
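
A PCA sketch with sklearn on made-up data, mapping each analysis item to an attribute (standardizing first is assumed):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)             # hypothetical data: 100 samples, 5 features
Z = StandardScaler().fit_transform(X)  # standardize before PCA
pca = PCA(n_components=2)
projected = pca.fit_transform(Z)       # projected data
pca.components_                        # eigenvectors (principal axes)
pca.explained_variance_                # eigenvalues
pca.explained_variance_ratio_          # eigenvalues in ratio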

4.4. Create Random Samples

4.4.1. make_blobs(n_features = , n_samples = , centers = , random_state = , cluster_std = )

4.5. Clustering

4.5.1. Types

4.5.1.1. Hierarchical

4.5.1.1.1. Agglomerative

4.5.1.1.2. Divisive

4.5.1.2. Point Assignment

4.5.1.2.1. fix the number of clusters at the start, then assign points to them

4.5.1.3. Hard vs Soft Clustering

4.5.1.3.1. Hard Clustering

4.5.1.3.2. Soft Clustering

4.5.2. K-Means Clustering

4.5.2.1. Steps

4.5.2.1.1. Word

4.5.2.1.2. Code

4.5.2.2. Analysis

4.5.2.2.1. coordinates of cluster centers

4.5.2.2.2. labels of each point

4.5.2.2.3. kmeans.inertia_

4.5.2.2.4. number of iterations

4.5.2.3. Optimization

4.5.2.3.1. Elbow Method
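
A K-Means sketch on make_blobs data, with the elbow method reading inertia per k; cluster counts and parameters are made up:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_features=2, n_samples=300, centers=3,
                  random_state=42, cluster_std=1.0)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
kmeans.cluster_centers_  # coordinates of cluster centers
kmeans.labels_           # labels of each point
kmeans.inertia_          # sum of squared distances to nearest center
kmeans.n_iter_           # number of iterations run
# elbow method: plot inertia against k and pick the bend
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_
            for k in range(1, 9)]
plt.plot(range(1, 9), inertias, marker='o')
plt.show()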

4.5.3. Hierarchical Clustering

4.5.3.1. Visualization

4.5.3.1.1. Code

4.5.3.1.2. Method
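
A dendrogram sketch with scipy; method sets how inter-cluster distance is measured ('ward', 'single', 'complete', 'average'):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)
Z = linkage(X, method='ward')  # agglomerative merge tree
dendrogram(Z)
plt.show()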

5. Machine Learning

5.1. Supervised Learning (Labels)

5.1.1. Classification

5.1.1.1. classify into categories

5.1.2. Prediction

5.1.2.1. predict using continuous data

5.2. Unsupervised Learning (No Labels)

5.2.1. Clustering

5.2.2. Dimension Reduction

5.2.2.1. Feature Extraction

5.2.2.2. Feature Selection

5.2.3. Association Rule Learning

5.2.3.1. find relationship between features

5.3. Reinforcement Learning

5.3.1. good action

5.3.1.1. reward

5.3.2. bad action

5.3.2.1. penalty