The question you pose is difficult to answer if taken literally. The difficulty stems from the fact that Python is not directly involved in the generation of the image, so there is no straightforward Python-based solution. Nevertheless, the issue of how to convert HTML to PNG was raised on the pandas developers' GitHub page, and the suggested answer was to use [...]

We could avoid much of this difficulty, however, if we loosen the interpretation of the question. Instead of trying to produce the exact image generated by [...]
If you don't want to add the seaborn dependency, you could use matplotlib directly, though it takes a few more lines of code: [...]

In this tutorial, you’ll learn how to calculate a correlation matrix in Python and how to plot it as a heat map. You’ll learn what a correlation matrix is and how to interpret it, along with a short review of what the coefficient of correlation is. You’ll then learn how to calculate a correlation matrix with the pandas library and how to plot it as a heat map using Seaborn. Finally, you’ll learn how to customize these heat maps to include certain values.

The Quick Answer: Use Pandas’ .corr() method
What a Correlation Matrix is and How to Interpret it

A correlation matrix is a common tool used to compare the coefficients of correlation between different features (or attributes) in a dataset. It allows us to visualize how much (or how little) correlation exists between different variables. This is an important step in pre-processing for machine learning pipelines. Since the correlation matrix allows us to identify variables that are highly correlated with one another, it allows us to reduce the number of features in a dataset. This is a form of dimensionality reduction and can be used to improve the runtime and effectiveness of our models.

That’s the theory of our correlation matrix. But what does it actually look like? A correlation matrix has the same number of rows and columns as our dataset has columns. This means that if we have a dataset with 10 columns, then our matrix will have ten rows and ten columns. Each row and column represents a variable (or column) in our dataset, and the value in the matrix is the coefficient of correlation between the corresponding row and column.

What is a Correlation Coefficient?

A coefficient of correlation is a value between -1 and +1. A negative coefficient tells us that the relationship is negative, meaning that as one value increases, the other decreases. Similarly, a positive coefficient indicates that as one value increases, so does the other.

Let’s see what a correlation matrix looks like when we map it as a heat map. Here, we have a simple 4×4 matrix, meaning that we have 4 columns and 4 rows.

[Figure: A sample correlation matrix visualized as a heat map]

The values in our matrix are the correlation coefficients between the pairs of features. We can see that we have a diagonal line of 1s. This is because these values represent the correlation between a column and itself. Because a column is always perfectly correlated with itself, these values will always be 1.

If you have a keen eye, you’ll notice that the values in the top right are the mirror image of the bottom left of the matrix. This is because the relationship between the two variables in a row-column pair is the same regardless of their order. It’s common practice to remove these duplicates from a heat map in order to better visualize the data. This is something you’ll learn in later sections of the tutorial.

Pandas makes it incredibly easy to create a correlation matrix using the DataFrame method .corr(). By default, the method calculates the Pearson coefficient of correlation.

Loading a Sample Pandas Dataframe

Now that you have an understanding of how the method works, let’s load a sample Pandas DataFrame. For this, we’ll use one of the sample datasets available through the Seaborn library.
Let’s break down what we’ve done here:
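A minimal sketch of the loading step, with each part explained in the comments. To stay runnable offline, it builds a small DataFrame by hand instead of downloading one. The column names and values are illustrative assumptions, chosen only to match the shape described in this tutorial (seven columns, four of them numeric).

```python
import pandas as pd

# Illustrative stand-in data: the column names and values are assumptions.
# With network access, seaborn.load_dataset("penguins") returns a real
# dataset with a similar layout.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Dream", "Dream", "Dream", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
    "sex": ["Male", "Female", "Female", "Female", "Male", "Female"],
})

# Inspect the first rows and the type of each column.
print(df.head())
print(df.dtypes)
```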
We can see that our DataFrame has 7 columns. Some of these columns are numeric and others are strings.

Calculating a Correlation Matrix with Pandas

Now that we have our Pandas DataFrame loaded, let’s use the .corr() method to calculate our correlation matrix.
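As a hedged sketch of this calculation, the block below correlates the numeric columns of a small hand-made DataFrame. The data is an illustrative assumption, not the tutorial’s dataset. It selects the numeric columns explicitly with select_dtypes, since recent pandas versions raise an error on string columns rather than skipping them.

```python
import pandas as pd

# Illustrative data (values are assumptions, not the tutorial's dataset).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})

# Keep only the numeric columns before correlating.
numeric = df.select_dtypes(include="number")

# .corr() returns a square DataFrame; Pearson is the default method.
matrix = numeric.corr()
print(matrix)
```

Selecting the numeric columns first works across pandas versions; on pandas 1.5 and later, passing numeric_only=True to .corr() is an alternative.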
We can see that while our original DataFrame had seven columns, Pandas only calculated the matrix using the numerical columns. (In pandas 2.0 and later, non-numeric columns are no longer dropped automatically, so you may need to pass numeric_only=True or select the numeric columns first.) We can see that four of our columns were turned into row-column pairs, each denoting the relationship between two columns. Each cell holds the coefficient of correlation between the column named by its row label and the column named by its column label.

Rounding our Correlation Matrix Values with Pandas

We can round the values in our matrix to two digits to make them easier to read. The matrix that’s returned is itself a Pandas DataFrame, which means that we can apply DataFrame methods to it. We can use the Pandas .round() method to round our values.
While we lose a bit of precision doing this, it does make the relationships easier to read. In the next section, you’ll learn how to use the Seaborn library to plot a heat map based on the matrix.

How to Plot a Heat Map Correlation Matrix with Seaborn

In many cases, you’ll want to visualize a correlation matrix. This is easily done in a heat map format, where we can display values that we can better understand visually. The Seaborn library makes creating a heat map very easy, using the heatmap() function.

Let’s now import pyplot from matplotlib in order to visualize our data. While we’ll actually be using Seaborn to visualize the data, Seaborn relies heavily on matplotlib for its visualizations.
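A minimal sketch of this plotting step, assuming a small hand-made DataFrame (the data is illustrative, not the tutorial’s dataset). It rounds the matrix to two digits and passes it to Seaborn’s heatmap function with annot=True so that each cell shows its value. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative numeric data (values are assumptions).
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})

# Correlation matrix, rounded to two digits for readability.
matrix = df.corr().round(2)

# annot=True writes each coefficient inside its cell.
ax = sns.heatmap(matrix, annot=True)

# In an interactive session, plt.show() would display the figure here.
```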
Here, we have imported the pyplot module from matplotlib and plotted our matrix as a heat map.

We can see that a number of odd things have happened here. Firstly, we know that a correlation coefficient can take values from -1 through +1. Our graph currently only shows values from roughly -0.5 through +1. Because of this, unless we’re careful, we may infer that negative relationships are stronger than they actually are. Further, the data isn’t shown in a divergent manner. We want our colors to become stronger as relationships become stronger. Instead, the colors weaken as the values get close to +1. We can modify a few additional parameters here:
Let’s try this again, passing in these three new arguments:
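One way this could look is sketched below. The choice of the three arguments is an assumption of this sketch: vmin and vmax pin the color scale to the full -1 to +1 range, and cmap selects a diverging colormap (matplotlib’s built-in coolwarm is used here as a stand-in). The data is again illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative numeric data (values are assumptions).
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})
matrix = df.corr().round(2)

ax = sns.heatmap(
    matrix,
    annot=True,
    vmin=-1,          # anchor the color scale at the lowest possible coefficient
    vmax=1,           # anchor the color scale at the highest possible coefficient
    cmap="coolwarm",  # diverging colormap: strong colors at both extremes
)
```

Because the scale is symmetric around zero, the midpoint of the colormap lines up with a coefficient of 0.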
This returns the following matrix. It diverges from -1 to +1 and the colors conveniently darken at either pole.

[Figure: A properly formatted heat map with divergent colours]

In this section, you learned how to format a heat map generated using Seaborn to better visualize relationships between columns.

Plot Only the Lower Half of a Correlation Matrix with Seaborn

One thing that you’ll notice is how redundant it is to show both the upper and lower halves of a correlation matrix. Our minds can only interpret so much, so it may be helpful to show only the bottom half of our visualization. Similarly, it can make sense to remove the diagonal line of 1s, since it carries no real information. In order to accomplish this, we can use a NumPy mask.
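A hedged sketch of the masking approach: it builds a boolean array with NumPy’s triu function that is True for the upper triangle and the diagonal, then passes it to the mask parameter of Seaborn’s heatmap, which hides every cell where the mask is True. The use of triu specifically is this sketch’s choice, and the data is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Illustrative numeric data (values are assumptions).
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})
matrix = df.corr().round(2)

# True marks cells to hide: the upper triangle plus the diagonal.
mask = np.triu(np.ones(matrix.shape, dtype=bool))

ax = sns.heatmap(matrix, mask=mask, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

To keep the diagonal visible, np.triu accepts k=1, which starts the mask one step above the diagonal.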
This returns the following image:

[Figure: Displaying only the bottom half of a matrix using a numpy mask]

We can see how much easier it is to understand the strength of our dataset’s relationships here. Because we’ve removed a significant amount of visual clutter (over half!), we can much better interpret the meaning behind the visualization.

How to Save a Correlation Matrix to a File in Python

There may be times when you want to save the correlation matrix programmatically. So far, we have only displayed the heat map on screen. Matplotlib’s savefig() function allows us to pass in a file path to indicate where we want to save the file. Say we wanted to save it in the current working directory; we can pass in a relative path like below:
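A minimal sketch of the saving step, assuming matplotlib’s savefig function and an illustrative hand-made DataFrame. The relative path heatmap.png resolves against the current working directory.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative numeric data (values are assumptions).
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})
matrix = df.corr().round(2)
sns.heatmap(matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")

# The file extension determines the output format (PNG here).
out_path = Path("heatmap.png")
plt.savefig(out_path)
```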
In the code shown above, we save the file as a PNG file with the name heatmap. Because the path is relative, the file is saved in the current working directory, which is typically the directory the script is run from.

Selecting Only Strong Correlations in a Correlation Matrix

In some cases, you may only want to select strong correlations in a matrix. As a common rule of thumb, a correlation is considered strong when its absolute value is greater than or equal to 0.7. Since the matrix that gets returned is a Pandas DataFrame, we can use Pandas filtering methods to filter it. Since we want to select strong relationships, we need to select values greater than or equal to 0.7 or less than or equal to -0.7. Since this would make our selection statement more complicated, we can simply filter on the absolute value of our correlation coefficient. Let’s take a look at how we can do this:
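One way to sketch this filter, assuming illustrative data: unstack the matrix into a Series indexed by pairs of column names, then keep the entries whose absolute value is at least 0.7.

```python
import pandas as pd

# Illustrative numeric data (values are assumptions).
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})
matrix = df.corr()

# Flatten the square matrix into a Series keyed by (column, column) pairs.
pairs = matrix.unstack()

# Keep pairs whose coefficient is at least 0.7 in absolute value.
strong = pairs[pairs.abs() >= 0.7]
print(strong)
```

Note that the result still includes each column paired with itself (coefficient 1) and both orderings of every pair.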
Here, we first take our matrix and apply the .unstack() method, which converts it into a Series indexed by pairs of column names. We then filter that Series on its absolute value.

Selecting Only Positive / Negative Correlations in a Correlation Matrix

In some cases, you may want to select only positive correlations in a dataset or only negative correlations. We can, again, do this by first unstacking the DataFrame and then selecting either only positive or only negative relationships. Let’s first see how we can select only positive relationships:
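A short sketch of the sign-based selection on illustrative data. It also shows the negative case and a combined filter for strong positive relationships, since each differs from the positive case only in its comparison.

```python
import pandas as pd

# Illustrative numeric data (values are assumptions).
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 48.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, 17.9, 19.5, 14.1],
    "flipper_length_mm": [181, 186, 195, 192, 196, 210],
    "body_mass_g": [3750, 3800, 3250, 3500, 3900, 4450],
})

# Flatten the correlation matrix into a Series of column pairs.
pairs = df.corr().unstack()

positive = pairs[pairs > 0]           # only positive relationships
negative = pairs[pairs < 0]           # only negative relationships
strong_positive = pairs[pairs >= 0.7] # strong and positive combined

print(positive)
print(negative)
```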
We can see here that this process is nearly the same as selecting only strong relationships. We simply change our filter on the Series to include only relationships where the coefficient is greater than zero. Similarly, if we wanted to select only negative relationships, we would need to change just one character: the greater-than sign becomes a less-than sign.
This is a helpful tool, allowing us to see which relationships run in either direction. We can even combine these filters and select only strong positive relationships or strong negative relationships.

Conclusion

In this tutorial, you learned how to use Python and Pandas to calculate a correlation matrix. You learned, briefly, what a correlation matrix is and how to interpret it. You then learned how to use the Pandas .corr() method to calculate the matrix, how to plot it as a heat map with Seaborn, and how to filter it for strong, positive, or negative correlations.