Data Science (Mind Map)

1. Data

1.1. Cleaning Data

1.2. Transform Data

2. Machine Learning

2.1. Types of Machine Learning

2.1.1. Human Supervision

2.1.1.1. Supervised; the training data you feed the algorithm includes the desired solutions, called labels

2.1.1.1.1. Classification; train with example emails labeled spam or ham; the model must learn how to classify new emails

2.1.1.1.2. Regression; predict a target numeric value (price of a car), given features (mileage, age, brand, etc.) called predictors. (Some regression algorithms can also be used for classification)

2.1.1.1.3. k-Nearest Neighbors

2.1.1.1.4. Support Vector Machines (SVMs)

2.1.1.1.5. Decision Trees and Random Forests

2.1.1.1.6. Neural Networks (*can also be unsupervised and semisupervised)

2.1.1.2. Unsupervised; the training data is not labeled; learn without a teacher

2.1.1.2.1. Clustering

2.1.1.2.2. Visualization and dimensionality reduction; Visualization - output a 2D/3D representation of the data that preserves as much structure as possible (e.g., keeping separate clusters in the input space from overlapping); Dimensionality reduction - the goal is to simplify the data without losing too much information (e.g., merging several correlated features into one, called feature extraction) => it is often a good idea to reduce the dimension of the training data with a dimensionality reduction algorithm before feeding it to another ML algorithm: 1. It will run faster 2. It will take up less memory 3. It may perform better in some cases (see the PCA sketch below)
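
A minimal sketch of this idea using scikit-learn's PCA; the digits dataset and the 0.95 variance target are illustrative assumptions, not part of the original notes:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# load a small image dataset with 64 features per sample (illustrative choice)
X, y = load_digits(return_X_y=True)

# keep enough components to explain 95% of the variance (assumed threshold)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, '->', X_reduced.shape)  # fewer columns, most of the structure preserved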

2.1.1.2.3. Association rule learning; the goal is to dig into large amounts of data and discover interesting relations between attributes

2.1.1.2.4. Anomaly detection; automatically remove outliers from a dataset before feeding it to another algorithm; the system is trained with normal instances, and when it sees a new one it can tell whether it is normal or an anomaly

2.1.1.3. Semisupervised; partially labeled training data => usually a lot of unlabeled data and a little bit of labeled data (e.g., cluster similar-looking pictures, then you only need to label the person once per cluster). Usually combinations of unsupervised and supervised algorithms

2.1.1.3.1. Deep belief networks (DBNs); Based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another, then fine-tuned using supervised learning techniques

2.1.1.4. Reinforcement Learning; Agent => the learning system; it can observe the environment, select and perform actions, and get rewards or penalties in return. Policy => the strategy the algorithm learns by itself to get the most reward over time; a policy defines what action the agent should choose when it is in a given situation

2.1.1.4.1. For example; Robots learn how to walk, DeepMind's AlphaGo program

2.1.2. Learning Style

2.1.2.1. Online/Batch Learning

2.1.2.1.1. Batch learning; the system is trained on all available data at once (offline) and cannot learn incrementally; to incorporate new data it must be retrained from scratch

2.1.2.1.2. Online learning; learn incrementally, on the fly, from a stream of incoming data (see the sketch below)
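
A minimal sketch of online learning, assuming scikit-learn's SGDClassifier and a synthetic stream of mini-batches (the data, batch size, and labels are illustrative assumptions):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()        # a linear classifier that supports incremental updates
classes = np.array([0, 1])   # all classes must be declared on the first partial_fit call
rng = np.random.default_rng(0)

for _ in range(100):
    # pretend each iteration receives a fresh mini-batch from a data stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)  # learn on the fly, no full retrain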

2.1.2.2. Instance-Based/Model-Based; (about how well ML systems generalize: ML is about making predictions on new data never seen before, so the system needs to be able to generalize; a sketch contrasting the two follows below)

2.1.2.2.1. Instance-Based

2.1.2.2.2. Model-Based
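
A minimal sketch contrasting the two styles, assuming scikit-learn and toy data (both the data and the model choices are illustrative): an instance-based learner (k-NN) generalizes by comparing new points to stored examples, while a model-based learner (linear regression) fits parameters and then predicts from those parameters alone.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.9])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)  # instance-based: keeps the training examples
lin = LinearRegression().fit(X, y)                  # model-based: learns a slope and an intercept

print(knn.predict([[2.5]]), lin.predict([[2.5]]))   # both generalize to an unseen point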

2.2. Main challenges of machine learning; (the main task is to select a learning algorithm and train it on some data => the two things that can go wrong are 'bad data' and 'bad algo')

2.2.1. Bad Data

2.2.1.1. Insufficient Quantity of Training Data

2.2.1.1.1. Simple problems need thousands of examples

2.2.1.1.2. Complex problems like image and speech recognition need millions of examples (unless reusing parts of an existing model)

2.2.1.1.3. In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data. As the authors put it: “these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development.” The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled “The Unreasonable Effectiveness of Data,” published in 2009. It should be noted, however, that small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don’t abandon algorithms just yet.

2.2.1.2. Nonrepresentative Training Data

2.2.1.2.1. To generalize well, the training data needs to be representative of the new cases you want to generalize to. (True for both instance-based and model-based learning)

2.2.1.2.2. If the sample is too small you will have sampling noise (nonrepresentative data as a result of chance). Even large samples can be nonrepresentative if the sampling method is flawed; this is called sampling bias.

2.2.2. Bad Algo

2.3. Terms

2.3.1. Attribute; data type (mileage)

2.3.2. Feature; generally means an attribute plus its value (Mileage = 15,000) (feature and attribute are often used interchangeably)

2.3.3. Feature extraction; a type of dimensionality reduction; the goal is to simplify the data without losing too much information by merging several correlated features into one

2.3.4. Structured Data; data organized in a predefined schema, e.g., tables with rows and columns

2.3.5. Unstructured Data; data without a predefined format, e.g., free text, images, audio

2.4. Steps for the process

2.4.1. Data Preparation

2.4.1.1. feature selection

2.4.1.2. train/test splitting (see the sketch after this list)

2.4.1.3. sampling
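
A minimal sketch of the train/test splitting step, assuming scikit-learn and an illustrative DataFrame with a 'target' column (all names and values are assumptions, not from the notes):

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'feature_1': [1, 2, 3, 4, 5, 6],
                     'feature_2': [10, 20, 30, 40, 50, 60],
                     'target':    [0, 1, 0, 1, 0, 1]})

X = data.drop('target', axis=1)
y = data['target']

# hold out 20% of the rows for testing; fix the random seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)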

2.5. Feature Engineering; https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114#3149

2.5.1. 1. Imputation (what to do with missing values) (some algorithms drop rows with missing values; others don't accept datasets that contain them)

2.5.1.1. Drop rows or columns based on a threshold (e.g., if 70% or more of the values are missing, drop the row or column); see the sketch below
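
A minimal sketch of threshold-based dropping with pandas (the 70% threshold and the toy DataFrame are illustrative assumptions):

import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                     'b': [1.0, 2.0, np.nan, 4.0],
                     'c': [1.0, 2.0, 3.0, 4.0]})

threshold = 0.7
# drop columns whose missing-value ratio is 70% or higher
data = data[data.columns[data.isnull().mean() < threshold]]
# drop rows whose missing-value ratio is 70% or higher
data = data.loc[data.isnull().mean(axis=1) < threshold]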

2.5.1.2. Numerical Imputation

2.5.1.2.1. Medians of the columns; best solution, since the averages of the columns are sensitive to outliers (see the sketch below)

2.5.1.2.2. Use a default value for missing numbers (can the missing values be assumed to be zero/NA?)

2.5.1.2.3. Interpolation methods
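
A minimal sketch of numerical imputation with pandas (the column and its values are illustrative assumptions):

import numpy as np
import pandas as pd

data = pd.DataFrame({'value': [2.0, 45.0, np.nan, 85.0, 28.0, np.nan]})

# fill missing numbers with the column median (more robust to outliers than the mean)
data['value'] = data['value'].fillna(data['value'].median())

# alternative: fill with a default of 0, when zero is a sensible assumption for "missing"
# data['value'] = data['value'].fillna(0)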

2.5.1.3. Categorical Imputation

2.5.1.3.1. Replace missing values with the most frequently occurring value in the column (check the distribution first => if it is roughly uniform with no dominant value, imputing a category like 'Other' is more sensible, since filling with the top value would amount to a near-random choice). A sketch follows below.
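
A minimal sketch of categorical imputation with pandas (the column values are illustrative assumptions):

import numpy as np
import pandas as pd

data = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red', np.nan, 'green']})

# fill missing categories with the most frequent value in the column
data['color'] = data['color'].fillna(data['color'].value_counts().idxmax())

# alternative when no value clearly dominates: use a catch-all category
# data['color'] = data['color'].fillna('Other')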

2.5.2. 2. Handling Outliers; the best way to detect outliers is to visualize the data

2.5.2.1. Statistical methods; less precise, but fast

2.5.2.1.1. Standard deviation; there is no trivial choice for the multiplier, but 2 to 4 standard deviations is practical

2.5.2.1.2. Percentiles; set the percentage cutoff depending on the distribution of the data (see the detection sketch below)
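
A minimal sketch of both detection rules with pandas (the column, the factor of 2 standard deviations, and the 99th percentile are illustrative assumptions):

import pandas as pd

data = pd.DataFrame({'value': [2, 45, 23, 85, 28, 2, 35, 12, 900]})

# standard-deviation rule: keep values within 2 standard deviations of the mean
factor = 2
upper = data['value'].mean() + factor * data['value'].std()
lower = data['value'].mean() - factor * data['value'].std()
without_outliers = data[(data['value'] >= lower) & (data['value'] <= upper)]

# percentile rule: keep values at or below the 99th percentile
upper_p = data['value'].quantile(0.99)
without_outliers_p = data[data['value'] <= upper_p]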

2.5.2.2. Outlier Dilemma: Drop or Cap

2.5.2.2.1. Capping them lets you keep the data size, which might be better for the final model's performance (see the capping sketch below)

2.5.2.2.2. Capping can affect the distribution of the data, so it is best not to overdo it.
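
A minimal sketch of capping with percentiles using pandas (the 1st/99th percentile limits and the data are illustrative assumptions):

import pandas as pd

data = pd.DataFrame({'value': [2, 45, 23, 85, 28, 2, 35, 12, 900]})

# cap instead of drop: clip extreme values to the 1st and 99th percentiles, keeping the row count
lower = data['value'].quantile(0.01)
upper = data['value'].quantile(0.99)
data['value'] = data['value'].clip(lower=lower, upper=upper)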

2.5.3. 3. Binning

2.5.3.1. Can be applied to categorical and numerical data

2.5.3.2. Main motivation is to make the model more robust and prevent overfitting, however, it has a cost to the performance.

2.5.3.2.1. Every time you bin something => you sacrifice information and make the data more regularized.

2.5.3.3. For numerical columns, except for obvious overfitting cases, binning might be redundant for some kinds of algorithms, due to its effect on model performance.

2.5.3.4. For categorical data, labels with low frequencies probably affect the robustness of statistical models negatively, so assigning a general category to these less frequent values helps keep the model robust.

2.5.3.4.1. #Numerical Binning Example
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Low", "Mid", "High"])

   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low

#Categorical Binning Example
     Country
0      Spain
1      Chile
2  Australia
3      Italy
4     Brazil

conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]
choices = ['Europe', 'Europe', 'South America', 'South America']
data['Continent'] = np.select(conditions, choices, default='Other')

     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America

2.5.4. 4. Log Transformation (one of the most commonly used mathematical transformations)

2.5.4.1. Helps handle skewed data => the distribution becomes closer to normal

2.5.4.2. The order of magnitude of the data changes across its range => we want to capture the magnitude (relative size) of changes, not the absolute change

2.5.4.3. Decreases the effect of outliers, due to the normalization of magnitude differences, and the model becomes more robust

2.5.4.4. A critical note: the data you apply a log transform to must contain only positive values, otherwise you will get an error. You can also add 1 to your data before transforming it, which ensures the output of the transformation is non-negative (log(x+1)).

2.5.4.4.1. #Log Transform Example
data = pd.DataFrame({'value':[2, 45, -23, 85, 28, 2, 35, -12]})
data['log+1'] = (data['value'] + 1).transform(np.log)

#Negative Values Handling
#Note that the values are different
data['log'] = (data['value'] - data['value'].min() + 1).transform(np.log)

   value  log(x+1)  log(x-min(x)+1)
0      2   1.09861          3.25810
1     45   3.82864          4.23411
2    -23       nan          0.00000
3     85   4.45435          4.69135
4     28   3.36730          3.95124
5      2   1.09861          3.25810
6     35   3.58352          4.07754
7    -12       nan          2.48491

2.5.5. 5. One-Hot Encoding (one of the most popular encoding methods in ML)

2.5.5.1. This method spreads the values in a column to multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and encoded columns.

2.5.5.2. Converts categorical data to numerical data => enables grouping of categorical data without losing information

2.5.5.3. If you have N distinct values in the column, it is enough to map them to N-1 binary columns, because the missing one can be deduced from the other columns. (This helps avoid multicollinearity; see the drop_first sketch below.)

2.5.5.4. encoded_columns = pd.get_dummies(data['column'])
data = data.join(encoded_columns).drop('column', axis=1)
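
A note on the N-1 mapping above: pandas' get_dummies has a drop_first parameter that implements it. A minimal sketch (the DataFrame and its values are illustrative assumptions):

import pandas as pd

data = pd.DataFrame({'column': ['red', 'blue', 'green', 'red']})

# drop_first=True keeps N-1 flag columns; the dropped category is implied by all zeros
encoded_columns = pd.get_dummies(data['column'], drop_first=True)
data = data.join(encoded_columns).drop('column', axis=1)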

2.5.6. 6. Grouping Operations; in most machine learning algorithms, every instance is represented by a row in the training dataset, and every column shows a different feature of the instance. This kind of data is called "tidy". Datasets such as transactions rarely fit this definition of tidy data, because an instance spans multiple rows.

2.5.6.1. The key point of group-by operations is to decide the aggregation functions for the features (a sketch follows the list below).

2.5.6.2. Numerical features

2.5.6.2.1. average and sum functions are usually used

2.5.6.2.2. Can obtain ratio columns by using the average of binary columns

2.5.6.3. Categorical features

2.5.6.3.1. Select label with the highest frequency

2.5.6.3.2. Make a pivot table

2.5.6.3.3. Applying a group by function after applying one-hot encoding.
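
A minimal sketch of grouping a transaction-style table into one row per instance, assuming pandas (the column names 'user_id', 'amount', and 'is_return' are illustrative assumptions):

import pandas as pd

transactions = pd.DataFrame({
    'user_id':   [1, 1, 2, 2, 2, 3],
    'amount':    [10.0, 25.0, 5.0, 7.5, 12.5, 40.0],
    'is_return': [0, 1, 0, 0, 1, 0]})

# numerical features: sum and average; averaging a binary column yields a ratio feature
features = transactions.groupby('user_id').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    return_ratio=('is_return', 'mean'))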

2.5.7. 7. Feature Split (good for making features useful for ML)

2.5.7.1. Extract usable parts of string columns

2.5.7.1.1. Enable ML algos to comprehend them

2.5.7.1.2. Make it possible to bin and group them

2.5.7.1.3. Improve model performance by uncovering potential info

2.5.7.2. The split function is a good option; however, there is no single way of splitting features. How to split depends on the characteristics of the column. Two examples follow.

2.5.7.2.1. This example is a simple split function for an ordinary name column (sketch below). It handles names longer than two words by taking only the first and last elements, which makes the function robust to corner cases; this should be kept in mind when manipulating strings like this.
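
A minimal sketch of the name-split case, assuming pandas and an illustrative 'name' column (the original code was not preserved in these notes):

import pandas as pd

data = pd.DataFrame({'name': ['Luther N. Gonzalez', 'Charles M. Young', 'Terry Lawson']})

# take the first element as the first name and the last element as the last name,
# so names longer than two words are still handled
data['first_name'] = data['name'].str.split(" ").map(lambda x: x[0])
data['last_name'] = data['name'].str.split(" ").map(lambda x: x[-1])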

2.5.7.2.2. Another case for the split function is extracting a string part between two characters. The sketch below shows an implementation of this case using two split functions in a row.
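
A minimal sketch of extracting the text between two characters, assuming a 'title' column where a year sits between parentheses (the data is an illustrative assumption):

import pandas as pd

data = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)']})

# first split on '(' and take the tail, then split on ')' and take the head => the year string
data['year'] = data['title'].str.split("(", n=1, expand=True)[1] \
                            .str.split(")", n=1, expand=True)[0]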

2.5.8. 8. Scaling (most features do not have the same range => scaling solves this and helps the machine learning algorithm learn better) (not mandatory, but usually best to apply) (algorithms based on distance calculations, such as k-NN or k-Means, need scaled continuous features as model input)

2.5.8.1. Normalization (or min-max normalization)

2.5.8.1.1. Scales all values in a fixed range between 0 and 1.

2.5.8.1.2. Does not change the shape of the distribution, but because the standard deviation decreases, the effect of outliers increases. So it is recommended to handle outliers before normalization (a sketch follows the formula below).

2.5.8.1.3. x_norm = (x - x_min)/(x_max - x_min)
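
A minimal sketch of min-max normalization with pandas, applying the formula above (the column values are illustrative assumptions):

import pandas as pd

data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

# x_norm = (x - x_min) / (x_max - x_min), which rescales the column into [0, 1]
data['normalized'] = (data['value'] - data['value'].min()) / \
                     (data['value'].max() - data['value'].min())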

2.5.8.2. Standardization (or z-score normalization)

2.5.8.2.1. Scales values while taking the standard deviation into account: if the standard deviations of features differ, their scaled ranges will also differ. This reduces the effect of outliers in the features (a sketch follows the formula below).

2.5.8.2.2. z = (x - mu) / std.dev
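
A minimal sketch of z-score standardization with pandas, applying the formula above (same illustrative column as the normalization sketch):

import pandas as pd

data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})

# z = (x - mean) / standard deviation
data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()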

2.5.9. 9. Extracting Date (dates can be hard for algorithms to understand and can be represented in numerous formats)

2.5.9.1. Can extract the parts of the date into different columns: year, month, day, etc.

2.5.9.2. Can extract the time period between the current date and the column in terms of years, months, days, etc.

2.5.9.3. Can extract some specific features from the date: name of the weekday, weekend or not, holiday or not, etc.

2.5.9.4. from datetime import date

data = pd.DataFrame({'date': ['01-01-2017', '04-12-2008', '23-06-1988', '25-08-1999', '20-02-1993']})

#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

        date  year  month  passed_years  passed_months   day_name
0 2017-01-01  2017      1             2             26     Sunday
1 2008-12-04  2008     12            11            123   Thursday
2 1988-06-23  1988      6            31            369   Thursday
3 1999-08-25  1999      8            20            235  Wednesday
4 1993-02-20  1993      2            26            313   Saturday

3. Statistical Modeling