Python for data analysis with Seaborn

Data analysis using Python with Seaborn


1. Scatterplot with jointplot() function

1.1. The jointplot() function combines the distributions of two variables (as drawn by displot()) and, by default, visualises the relationship between the two as a scatterplot

1.1.1. We typically pass 3 arguments: x, y and data

1.1.1.1. data must reference a DataFrame variable

1.1.1.2. x and y must reference columns from the DataFrame referenced by the data argument

1.1.1.3. With the tips variable already set to the built-in tips DataFrame, we can call this code

1.1.1.3.1. sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')

2. Multiplots for all DataFrame numeric columns with pairplot() function

2.1. The pairplot() function will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns)

2.1.1. sns.pairplot(tips)

2.1.1.1. returns

2.1.1.1.1. see attached

2.1.2. sns.pairplot(tips,hue='sex',palette='coolwarm')

2.1.2.1. returns

2.1.2.1.1. see attached

2.2. Essentially a pairplot is just a bunch of jointplots for all paired combinations of numeric columns in your DataFrame

3. Rugplots

3.1. The rugplot is a simple one and its function works in a similar way to displot(), taking a single variable (or column from a DataFrame)

3.2. The rugplot plots a small straight line for each data point across the x-axis, which you can imagine as a fibre in a rug

3.2.1. The rug "fibres" are more densely packed in certain areas and more thinly distributed in others, thereby communicating visual information about distribution

3.3. Here is an example:

3.3.1. sns.rugplot(tips['total_bill'])

3.3.1.1. returns

3.3.1.1.1. see attached

3.4. Rugplots are a gateway to understanding KDE plots

4. KDE plots

4.1. KDE stands for Kernel Density Estimation

4.2. Conceptually, these plots are built from a rugplot: a normal distribution (a.k.a. Gaussian distribution) is overlaid on every rug "fibre", and then the various normal distributions are summed to give a single smooth curve, which is the kdeplot

4.2.1. see attached for a visualisation of a rugplot with normal distributions layered over them

4.2.2. Here's an example of a rugplot with its equivalent kdeplot

4.2.2.1. sns.kdeplot(tips['total_bill']); sns.rugplot(tips['total_bill'])

4.2.2.1.1. returns
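The "one Gaussian per fibre, then add them up" idea can be sketched in plain Python. This is an illustration of the concept only, not Seaborn's implementation: the data points, the bandwidth value and the helper names `gaussian` and `kde` are assumptions made for the example. The sum is divided by the number of points so that the area under the curve stays close to 1.

```python
import math

def gaussian(x, centre, bandwidth):
    """Normal density with mean `centre` and standard deviation `bandwidth`."""
    z = (x - centre) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))

def kde(x, points, bandwidth):
    """Average of one Gaussian per data point, evaluated at x."""
    return sum(gaussian(x, p, bandwidth) for p in points) / len(points)

# Hypothetical data points (the rug "fibres") and a hand-picked bandwidth.
points = [10.0, 12.0, 13.0, 20.0]
bandwidth = 1.5

# Evaluate the curve on a grid and approximate the area under it.
step = 0.05
grid = [i * step for i in range(0, 601)]  # 0.0 .. 30.0
curve = [kde(x, points, bandwidth) for x in grid]
area = sum(curve) * step
```

The curve is higher around 12, where three fibres sit close together, than around 17, where there are none nearby.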

5. Plotting categorical data with barplot() function

5.1. Bar plots are a standard way to plot categorical data

5.1.1. Think of them as a group by category applying some aggregate function

5.1.1.1. By default, the barplot() function applies the mean function so if we want a bar plot with the average, we can do something like this:

5.1.1.1.1. sns.barplot(x='sex',y='total_bill',data=tips)

5.2. The argument to change the aggregate function is named estimator

5.2.1. We can pass in functions from the numpy library if we want

5.2.1.1. import numpy as np

5.2.1.1.1. ax = sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std); ax.set_title("Total bill standard deviation by sex")
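The "group by category, then apply an aggregate function" idea can be sketched without Seaborn. This is only an illustration of what the bar heights represent, not how barplot() is implemented; the sample rows and the `bar_heights` helper are made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sample rows standing in for the 'sex' and 'total_bill' columns.
rows = [
    ("Female", 10.0), ("Female", 20.0), ("Female", 30.0),
    ("Male", 15.0), ("Male", 25.0),
]

def bar_heights(rows, estimator=mean):
    """Group values by category and reduce each group with the estimator."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: estimator(values) for category, values in groups.items()}

mean_heights = bar_heights(rows)                   # default: mean per category
std_heights = bar_heights(rows, estimator=pstdev)  # population std, like np.std
```

Swapping the estimator changes what each bar means while the grouping stays the same, which is the role the estimator argument plays in barplot().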

6. Plotting categorical data with the boxplot() function

6.1. A box plot is also known as a box and whisker plot

6.1.1. It takes some categorical variable and plots 5 metrics (based on a numerical variable in the same data set) for each category: minimum, 1st quartile (Q1), median, 3rd quartile (Q3) and maximum

6.1.1.1. This is a way of visualising the distribution of a numeric variable within each category and comparing those distributions across categories

6.1.1.2. Q1 represents the 25th percentile, where the 1st percentile includes the lowest numbers

6.1.1.3. Q3 represents the 75th percentile, where the 100th percentile includes the highest numbers

6.1.1.4. The "box" spans the Inter Quartile Range (IQR), from Q1 to Q3, with the median drawn as a line through the box

6.1.1.5. The "whiskers" are lines extending from Q1 down to the minimum and from Q3 up to the maximum, not counting any values treated as outliers

6.1.1.6. Some numbers are calculated to be outliers, and these are represented by dots beyond the end of a whisker
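The five metrics and the outlier calculation can be sketched with the standard library. This assumes the common convention, which is also Seaborn's default, that a whisker reaches the furthest data point within 1.5 x IQR of the box; the sample values are hypothetical and the quartile method shown is one of several in use.

```python
from statistics import quantiles

# Hypothetical sample; 100 is deliberately far from the rest.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# method="inclusive" interpolates linearly between neighbouring data points.
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: whiskers reach the furthest points within 1.5 * IQR.
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
inside = [v for v in values if low_limit <= v <= high_limit]
outliers = [v for v in values if v < low_limit or v > high_limit]
whisker_low, whisker_high = min(inside), max(inside)
```

Here the upper whisker stops at 9 and the value 100 would be drawn as a separate dot.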

6.2. In this example, we call the boxplot() function by passing in the built-in tips DataFrame (which I pre-assigned to tips variable). The categorical day column is passed as the x value and the numerical total_bill column is passed as the y value. We also pass the optional palette argument.

6.2.1. sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow')

6.2.1.1. returns

6.2.1.1.1. see attached

6.2.2. We can also add the hue argument to get even more information about the distributions. In this case, we'll add the categorical smoker column for the hue argument.

6.2.2.1. sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm")

6.2.2.1.1. returns

7. Numeric vs categorical scatterplots with stripplot() function

7.1. Stripplots are scatterplots that show the distribution of a numeric variable for each value of a categorical variable. They can be used on their own, but are often used as supplementary plots with boxplots or violinplots.

7.1.1. They take similar arguments to boxplot() and violinplot()

7.1.1.1. sns.stripplot(x="day", y="total_bill", data=tips)

7.1.1.1.1. returns

7.1.1.2. sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True)

7.1.1.2.1. returns

7.1.1.2.2. Note: the "split" parameter has been renamed to "dodge"; passing split=True instead produces a warning in older Seaborn versions and may be rejected by newer ones

8. Using factorplot() function for creating multiple types of plot

8.1. The factorplot() function is available as a more general function that allows you to specify the type of plot you want by passing the kind argument

8.1.1. For example, we can call factorplot() rather than barplot() as follows:

8.1.1.1. sns.factorplot(x='day',y='total_bill',data=tips,kind='bar')

8.1.1.1.1. returns

8.1.2. Or use factorplot() to produce a violinplot

8.1.2.1. sns.factorplot(x='day',y='total_bill',data=tips,kind='violin')

8.1.2.1.1. returns

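8.2. Using factorplot() will generate a warning that this function has been renamed to catplot(); newer Seaborn versions may no longer provide factorplot() at all, in which case call sns.catplot() with the same arguments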

9. Grid plots with PairGrid() function

9.1. Grid plots are a general plot type that allows you to combine a mix of different plot types on a single canvas

9.2. pairplot() is like a specific implementation of PairGrid()

9.3. Setup

9.3.1. import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

9.3.1.1. iris = sns.load_dataset('iris')

9.3.1.1.1. iris.head()

9.4. If we call PairGrid() with the iris DataFrame, we get a 4x4 grid of empty plots due to this DataFrame having 4 numeric columns

9.4.1. sns.PairGrid(iris)

9.4.1.1. returns

9.4.1.1.1. see attached

9.5. We can apply a particular type of plot to a PairGrid by first capturing it in a variable and then invoking the map() method, which takes a plot function as an argument.

9.5.1. g = sns.PairGrid(iris); g.map(plt.scatter)

9.5.1.1. returns

9.5.1.1.1. see attached

9.6. We can go further and make the diagonal plots (top left to bottom right) one type, and then use different plot types for the upper plots (above the diagonal) and lower plots (below it). This customisation involves 3 methods of a PairGrid: map_diag(), map_upper() and map_lower().

9.6.1. g = sns.PairGrid(iris); g.map_diag(plt.hist); g.map_upper(plt.scatter); g.map_lower(sns.kdeplot)

9.6.1.1. returns

9.6.1.1.1. see attached

10. Grid plots with FacetGrid() function

10.1. The FacetGrid() function is an alternative type of general grid function to PairGrid(). In addition to the DataFrame, it takes a col and row argument, which must reference specific columns in the DataFrame.

10.1.1. If we use the tips DataFrame with FacetGrid() and pass time and smoker for col and row, we'll get a 2x2 grid because this data set features two distinct values for each of those columns: smoker (Yes/No) and time (Lunch/Dinner).

10.1.1.1. tips = sns.load_dataset('tips')

10.1.1.2. Here we use FacetGrid to show histograms for total_bill by smoker and time.

10.1.1.2.1. g = sns.FacetGrid(tips, col="time", row="smoker"); g = g.map(plt.hist, "total_bill")

11. Controlling style and colour of plots

11.1. We've already seen how to change the appearance of plots but this groups the concept together more formally

11.2. set_style() function

11.2.1. There are a small number of style arguments we can pass in as strings: darkgrid, whitegrid, dark, white, ticks

11.2.1.1. sns.set_style('darkgrid'); sns.countplot(x='sex',data=tips)

11.2.1.1.1. returns

11.3. despine() function

11.3.1. The "spine" refers to the box drawn around the plot

11.3.1.1. Some of the default param values for despine() are: top=True, right=True, left=False, bottom=False

11.3.1.1.1. So, despine() without any args passed will remove the top and right spine

11.4. Using Matplotlib's figure() function to control height and width of Seaborn plots

11.4.1. The majority of Seaborn plots can be resized by calling Matplotlib's figure() function (with Matplotlib's pyplot aliased as plt) before plotting

11.4.1.1. plt.figure(figsize=(12,3)); sns.countplot(x='sex',data=tips)

11.4.1.1.1. returns

11.5. Using height and aspect params to control height and width of Seaborn grid plots

11.5.1. sns.lmplot(x='total_bill',y='tip',height=2,aspect=4,data=tips)

11.5.1.1. returns

11.5.1.1.1. see attached

11.6. set_context() function

11.6.1. There are a small number of context arguments we can pass in as strings: paper, notebook, talk, poster

11.6.1.1. We can pair this with adjusting the font_scale value from its default value of 1 to gain more control over the font size in the plot

11.6.1.1.1. sns.set_context('poster',font_scale=0.75); sns.countplot(x='sex',data=tips,palette='coolwarm')

12. Why Seaborn?

12.1. It's a statistical plotting library built on top of Matplotlib

12.2. It works really well with Pandas

13. Installing Seaborn

13.1. conda install seaborn

13.2. pip install seaborn

14. Seaborn documentation

14.1. Seaborn is open source and hosted on Github

14.1.1. A Google search of "github seaborn python" should get you there

14.1.1.1. On the Github page is a link to the official documentation site

14.1.1.1.1. The API reference will be useful

15. Importing Seaborn

15.1. I'm not sure why, but the generally accepted convention for aliasing the seaborn library is apparently sns

15.1.1. import seaborn as sns
%matplotlib inline

16. Plotting categorical data with the countplot() function

16.1. The countplot() function produces a bar chart like the barplot() function but it is simpler because it applies a count and only requires the x argument (as the y is auto-set to a count)

16.1.1. sns.countplot(x='sex',data=tips)

16.1.1.1. returns

16.1.1.1.1. see attached
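What countplot() puts on the y axis can be sketched with the standard library. This is an illustration of the counting only, and the column values are hypothetical.

```python
from collections import Counter

# Hypothetical values standing in for the 'sex' column.
sex_column = ["Male", "Female", "Male", "Male", "Female"]

# One bar per distinct value; the bar height is the number of occurrences.
bar_heights = Counter(sex_column)
```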

17. Plotting categorical data with the violinplot() function

17.1. The violinplot() function is similar to the boxplot.

17.1.1. Each "violin" in the violinplot includes a boxplot at the centre and then shaded KDEs either side that show how the data is distributed

17.2. Violinplots are considered more informative by data scientists but lay people find them harder to read and prefer boxplots. Therefore, you need to consider your audience when deciding between boxplots and violinplots.

17.3. Example:

17.3.1. sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')

17.3.1.1. returns

17.3.1.1.1. see attached

17.4. As with boxplot, we can add hue, but we can also go a step further and add the split argument, which draws one half of each violin per hue value so that the KDEs of the two hue categories can be compared directly

17.4.1. sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1')

17.4.1.1. returns

17.4.1.1.1. see attached

18. Combining concepts of stripplot and violinplot into a swarmplot

18.1. A swarmplot attempts to combine the idea of stripplot and violinplot into one so as to avoid the need for both

18.1.1. sns.swarmplot(x="day", y="total_bill", data=tips)

18.1.1.1. returns

18.1.1.1.1. see attached

18.1.2. However, it can also be interesting to combine a swarmplot with a violinplot, as the swarmplot helps explain the shape of the violinplot

18.1.2.1. sns.violinplot(x="day", y="total_bill", data=tips); sns.swarmplot(x="day", y="total_bill", data=tips,color="black")

18.1.2.1.1. returns

18.2. Swarmplots don't work so well for larger datasets: they struggle to fit all of the data points on the plot, and laying the points out requires a lot of compute resource

18.3. Swarmplots are also more likely to be unfamiliar to a target audience, so this is also an important consideration when choosing plots

19. Matrix plot with the heatmap() function

19.1. In order to call the heatmap() function, you must first prepare a DataFrame in a matrix format

19.1.1. A matrix format means that we need variable labels on both columns and rows (i.e. not just on the columns)

19.1.2. There are two ways to produce a DataFrame in a matrix format: the corr() method and the pivot_table() method

19.1.3. Here's how we can prepare the built-in tips DataFrame in a matrix format using the corr() method

19.1.3.1. tips = sns.load_dataset('tips')

19.1.3.1.1. tips.head()

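19.1.3.1.2. tc = tips.corr(numeric_only=True)  # numeric_only=True skips the categorical columns, which newer pandas versions otherwise reject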

19.1.4. Here's how we can prepare the built-in flights DataFrame in a matrix using the pivot_table() method

19.1.4.1. flights = sns.load_dataset('flights')

19.1.4.1.1. flights.head()

19.1.4.1.2. fp = flights.pivot_table(values='passengers',index='month',columns='year')
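Both routes to a matrix format can be tried on a tiny hand-made DataFrame. The data below is hypothetical and only mimics the column names of the flights data set; `numeric_only=True` assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, month) observation.
flights_long = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# pivot_table() moves 'month' to the row labels and 'year' to the column labels.
matrix = flights_long.pivot_table(
    values="passengers", index="month", columns="year"
)

# corr() labels both the rows and the columns with the numeric column names.
corr_matrix = flights_long.corr(numeric_only=True)
```

In both results every cell is addressed by a row label and a column label, which is the shape heatmap() expects.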

19.2. Once we have a variable referencing a DataFrame in matrix format, we can plot a heatmap by passing that variable as the first argument

19.2.1. sns.heatmap(tc)

19.2.1.1. returns

19.2.1.1.1. see attached

19.3. We can change the default colour scheme with the cmap argument and add the numeric annotation for each heat map cell using the annot argument

19.3.1. sns.heatmap(tc,cmap='coolwarm',annot=True)

19.3.1.1. returns

19.3.1.1.1. see attached

19.4. We can also customise heatmap appearance further with the linecolor and linewidths arguments, both of which relate to the lines that separate the cells of the heatmap

19.4.1. sns.heatmap(fp,cmap='magma',linecolor='white',linewidths=1)

19.4.1.1. returns

19.4.1.1.1. see attached

20. Histogram with displot() function

20.1. Seaborn comes with a number of built-in DataFrames, which are handy for experimentation

20.1.1. One of these built-in DataFrames is called tips and represents fictitious restaurant data of anonymous customers and their tipping activity

20.1.1.1. For a full list of the built in DataFrames check out the attached link on GitHub

20.1.1.2. We can take advantage of these built-in DataFrames by invoking Seaborn's load_dataset() function

20.1.1.2.1. tips = sns.load_dataset('tips')

20.2. Note that the Udemy course I took demonstrated a function named distplot(), but by the time I took the course that function generated deprecation warnings directing you to use either displot() for figure-level histogram plots or histplot() for axes-level plots

20.3. Based on the built-in tips DataFrame, we can generate a histogram as follows:

20.3.1. sns.displot(tips['total_bill'])

20.3.1.1. returns

20.3.1.1.1. see attached

20.4. We can change the number of bins in the histogram by passing the bins argument

20.4.1. sns.displot(tips['total_bill'],bins=30)

20.4.1.1. returns

20.4.1.1.1. see attached

20.5. The "dis" in displot stands for distribution

20.5.1. It's basically a function for taking a single variable (i.e. a single column of values) and visualising its distribution via a histogram
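How the bins argument changes a histogram can be sketched with equal-width binning in plain Python. This is a simplified illustration rather than Seaborn's own binning code; the `histogram_counts` helper and the bill amounts are hypothetical, and the helper assumes the values are not all identical.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical bill amounts.
bills = [3.0, 8.5, 10.0, 12.5, 13.0, 18.0, 21.0, 25.0, 33.0, 43.0]
coarse = histogram_counts(bills, bins=4)
fine = histogram_counts(bills, bins=8)
```

More bins means narrower intervals, so the same ten values are spread over more, shorter bars.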

21. Matrix plot with the clustermap() function

21.1. The clustermap() function works like heatmap() but applies a hierarchical clustering algorithm to reorder the rows and columns, which helps identify interesting clusters in the matrix

21.1.1. Here's an example with the flights DataFrame pivoted and referenced by the variable "fp"

21.1.1.1. sns.clustermap(fp)

21.1.1.1.1. returns

21.1.2. We can make a clustermap even clearer when we apply a standard_scale of 1, which normalises the data in each column to fit into the 0 to 1 range. In this case we also apply the coolwarm colour scheme (cmap option), which helps emphasise the "high" months vs the "low" months.

21.1.2.1. sns.clustermap(fp,cmap='coolwarm',standard_scale=1)

21.1.2.1.1. returns
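The rescaling behind standard_scale can be sketched as subtracting the minimum and dividing by the range. The helper below is a hypothetical stand-in that reuses the argument's name, applied to one column at a time; the passenger counts are made up.

```python
def standard_scale(column):
    """Rescale values to the 0..1 range: subtract the minimum, divide by the range."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) for v in column]

# Hypothetical passenger counts for two years with very different magnitudes.
year_1949 = [112, 118, 148]
year_1960 = [417, 391, 622]

scaled_1949 = standard_scale(year_1949)
scaled_1960 = standard_scale(year_1960)
```

After scaling, both years run from 0 to 1, so the colours show which months are high or low within each year instead of being dominated by the later, larger totals.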

22. Linear regression plots with the lmplot() function

22.1. Using lmplot(), we can pass in the tips DataFrame together with a value for x and y, which produces a scatterplot with a linear regression line

22.1.1. sns.lmplot(x='total_bill',y='tip',data=tips)

22.1.1.1. returns

22.1.1.1.1. see attached

22.1.2. We can use hue to see a visual contrast on another attribute like sex

22.1.2.1. sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex')

22.1.2.1.1. returns

22.1.3. As an alternative to hue, we could use col to get two linear plots side by side

22.1.3.1. sns.lmplot(x='total_bill',y='tip',data=tips,col='sex')

22.1.3.1.1. returns
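The straight line that lmplot() draws by default is a least squares fit, which can be sketched in plain Python. The `fit_line` helper and the bill and tip values are hypothetical, and the sketch leaves out the confidence band that lmplot() also shades.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical bills and tips.
total_bill = [10.0, 20.0, 30.0, 40.0]
tip = [2.0, 3.0, 5.0, 6.0]
slope, intercept = fit_line(total_bill, tip)
```

With hue or col, a separate line of this kind is fitted for each subset of the data.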