Python and linear regression

Linear regression using Python


1. Split dataset into training and test datasets

1.1. Once we have our X and y arrays defined, we need to split the data into two, one set for training the model and another set for testing it

1.1.1. from sklearn.model_selection import train_test_split

1.1.2. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

1.1.2.1. In this case we've set the test set size to be 40% of the original set, but another value often used is 0.3 (30%)

1.1.2.2. Using random_state is entirely optional and the number can be anything; we use 101 here simply to ensure the random split of training and test data is done in an identical way to the demo by the Udemy course instructor
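As a rough sketch of what train_test_split does internally (shuffle the rows, then slice), here is a minimal NumPy-only version on made-up data; the seed 101 mirrors the random_state above but the data and split logic here are illustrative, not scikit-learn's actual implementation:

```python
import numpy as np

# Toy data: 10 samples, 2 features, and a target
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle indices reproducibly, mirroring random_state
rng = np.random.default_rng(101)
idx = rng.permutation(len(X))

# 40% test split, as in the example above
n_test = int(len(X) * 0.4)
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (6, 2) (4, 2)
```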

2. Create and train model

2.1. Once we have our data split into training and test sets, we are ready to create and train our model

2.1.1. from sklearn.linear_model import LinearRegression

2.1.2. lm = LinearRegression()

2.1.3. lm.fit(X_train,y_train)

3. Model evaluation (pre testing)

3.1. After training a model we can review the intercept value and the coefficient values for each feature

3.1.1. Note that the meaning of these values depends on the target; in our example, the "price" target represents values in USD, so the intercept and coefficients can be interpreted as USD values

3.1.2. To understand the meaning of the intercept and coefficients, we must understand the equation that represents the model's "best fit" line

3.1.2.1. If we imagine a model with 3 features, the equation would be:

3.1.2.1.1. y = a*x1 + b*x2 + c*x3 + d, where a, b and c are the coefficients and d is the intercept
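To make the equation concrete, a model's prediction is just this weighted sum; here is a minimal NumPy sketch using made-up coefficients and a made-up observation (these numbers are illustrative, not from the course dataset):

```python
import numpy as np

# Hypothetical fitted values for a 3-feature model
coef = np.array([2.0, -1.0, 0.5])   # a, b, c
intercept = 10.0                    # the intercept term

x = np.array([1.0, 2.0, 4.0])       # one observation: x1, x2, x3

# y = a*x1 + b*x2 + c*x3 + intercept
y = coef @ x + intercept
print(y)  # 2*1 - 1*2 + 0.5*4 + 10 = 12.0
```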

3.2. Print the intercept

3.2.1. print(lm.intercept_)

3.3. Review the coefficients for each of the training set columns

3.3.1. coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df

4. Testing model with predictions

4.1. Capture target predictions based on the test set

4.1.1. predictions = lm.predict(X_test)

4.2. The correct target predictions are already captured in y_test, so let's make a scatterplot of this data vs the predicted target values

4.2.1. plt.scatter(y_test,predictions)

4.2.1.1. A good result shows up as a clear clustering around a straight line

4.3. Visualise the residuals as a histogram

4.3.1. sns.histplot((y_test-predictions),bins=50,kde=True)

4.3.1.1. A normal distribution is the desired outcome

4.3.1.1.1. If you don't get a normal distribution, you should go back and reassess whether the data suits a linear regression model, with a view to repeating the process with a different choice of model

5. Evaluate model with metrics

5.1. Import metrics

5.1.1. from sklearn import metrics

5.2. Get mean absolute error (MAE)

5.2.1. metrics.mean_absolute_error(y_test,predictions)

5.3. Get mean squared error (MSE)

5.3.1. metrics.mean_squared_error(y_test,predictions)

5.4. Get root mean squared error

5.4.1. np.sqrt(metrics.mean_squared_error(y_test,predictions))

5.5. A slightly neater way to display all 3 metrics is as follows:

5.5.1. print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

5.5.1.1. Remember that RMSE is the most popular metric because it both punishes large errors and is interpretable in terms of your model's y units (house prices in USD in our example)
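All three metrics are simple enough to compute by hand, which makes the "RMSE punishes large errors" point easy to see; a NumPy sketch on made-up true and predicted values:

```python
import numpy as np

# Made-up true targets and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # squaring punishes large errors
rmse = np.sqrt(mse)             # back in the target's own units

print(mae, mse, rmse)
```

Note how the single large error (30) dominates MSE and RMSE far more than MAE.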

6. Theory

6.1. First developed by Francis Galton in the 19th century

6.1.1. Based on study of father and son heights

6.1.1.1. The theory states that while tall fathers tend to have tall sons, the heights of sons tend to shift from their father's height towards the mean height of the male population

6.1.1.1.1. It's called regression due to this shifting towards the mean

6.1.1.2. A good analogy to understand this phenomenon is to consider the basketball player, Shaquille O'Neal, who is 7 foot 1 inch tall

6.1.1.2.1. His height is so exceptional and so far above the mean height, the theory says that his own son will not be so tall

6.1.1.3. Bringing the idea back to data visualisation, we can imagine a scatter plot that shows the correlation of father and son heights

6.1.1.3.1. The goal of linear regression is to fit a straight line to this scatter plot that minimises the distance between the data points and the line
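The line-fitting idea can be sketched with NumPy's least-squares polynomial fit on made-up father/son heights (the numbers below are invented purely to illustrate the regression-to-the-mean effect):

```python
import numpy as np

# Made-up father and son heights in cm
father = np.array([165.0, 170.0, 175.0, 180.0, 185.0])
son = np.array([170.0, 172.0, 176.0, 178.0, 181.0])

# Fit son = slope * father + intercept by least squares
slope, intercept = np.polyfit(father, son, deg=1)

# Regression to the mean: the slope is below 1, so a father far
# above average predicts a son closer to the average height
print(slope < 1.0)  # True
```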

7. Data profiling

7.1. After data acquisition, we commence with data profiling to get a better understanding of our data

7.1.1. Import libraries

7.1.1.1. import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

7.1.2. Read data from file into Pandas DataFrame

7.1.2.1. USAhousing = pd.read_csv('USA_Housing.csv')

7.1.2.1.1. Note: this particular file is fake data provided as part of my Udemy course

7.1.3. Peek at the head of the DataFrame

7.1.3.1. USAhousing.head()

7.1.4. How many rows and columns are there, what are the inferred data types of the columns, and are there any missing values?

7.1.4.1. USAhousing.info()

7.1.5. For the DataFrame's numeric values, what do these look like per column in terms of mean, min, max, standard deviation, etc.?

7.1.5.1. USAhousing.describe()

7.1.6. What are the column names?

7.1.6.1. USAhousing.columns

7.1.6.1.1. Note that you see this in other profiling method outputs but it's handy nonetheless, and can highlight little gotchas around spacing in column names
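One such gotcha, stray whitespace in column names, can be fixed by stripping every name in one go; a small hypothetical example (not from the course data):

```python
import pandas as pd

# Hypothetical DataFrame with a trailing space in a column name
df = pd.DataFrame({'Price ': [1.0], 'Rooms': [3]})

# df['Price'] would raise a KeyError; strip whitespace from all names
df.columns = df.columns.str.strip()

print(list(df.columns))  # ['Price', 'Rooms']
```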

8. Data cleaning

8.1. In the Udemy course lecture for linear regression, we used fake data that did not require any cleaning

8.1.1. See other mind maps for data cleaning approaches

9. Exploratory Data Analysis (EDA)

9.1. Look at the numerical column correlations with pairplot

9.1.1. sns.pairplot(USAhousing)

9.2. Look at the distribution of the numeric column we want our model to predict values for

9.2.1. sns.displot(USAhousing['Price'])

9.3. Look at numeric column correlations via a heatmap

9.3.1. sns.heatmap(USAhousing.corr(numeric_only=True),annot=True,cmap='coolwarm')

9.3.1.1. Passing numeric_only=True excludes non-numeric columns such as the address; newer pandas versions require this when the DataFrame contains text columns

10. Define X and y arrays

10.1. By convention, X (note the uppercase) represents the "features" of the dataset

10.1.1. These features must be numeric columns in your dataset when using a linear regression model

10.1.2. Other terms that you may encounter being used interchangeably with "feature":

10.1.2.1. Independent variable

10.1.2.2. X-variable

10.1.2.3. Attribute

10.1.3. The features are the variables considered by the model as the way to predict the target

10.2. By convention, y (note the lowercase) represents the "target" of the dataset

10.2.1. The target must also be one or more numeric columns in your dataset when using linear regression

10.2.1.1. Normally, it will be a single column, although multi-output regression models are possible

10.3. Start by listing the column names of your DataFrame, then set X and y

10.3.1. USAhousing.columns

10.3.1.1. Copy and paste column names for your X array

10.3.1.1.1. X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]

10.3.1.2. Copy and paste column name(s) for your y array

10.3.1.2.1. y = USAhousing['Price']

10.3.1.3. Disregard any textual columns