Covariance between two time series python

Contents

Introduction
Covariance and Correlation - In Simple Terms
Setup for Python Code - Retrieving Sample Data
Calculating Covariance in Python
Calculating Correlation in Python
How do you find the covariance between two variables in Python?
How do you calculate covariance in pandas?
How do you plot covariance in Python?

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas Series.cov() is used to find the covariance of two series. In the following example, the covariance is computed both with the Pandas method and manually, and the two results are then compared.

Syntax: Series.cov(other, min_periods=None)
Parameters:
other: Other series to be used in finding covariance
min_periods: Minimum number of observations to be taken to have a valid result
Return type: Float value, Returns covariance of caller series and passed series
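
The effect of min_periods is easiest to see with missing values. The following is a minimal sketch, assuming two short series whose values are made up for illustration. Pandas ignores pairs in which either value is missing, and returns NaN when fewer than min_periods complete pairs remain.

```python
import math
import pandas as pd

# Illustrative data (not from the article): two pairs contain a missing value.
s1 = pd.Series([1.0, 2.0, None, 4.0, 5.0])
s2 = pd.Series([2.0, None, 6.0, 8.0, 10.0])

# Only the pairs at positions 0, 3 and 4 are complete.
cov_default = s1.cov(s2)                # computed from the 3 complete pairs
cov_strict = s1.cov(s2, min_periods=4)  # 4 complete pairs required, so NaN

print(cov_default)
print(math.isnan(cov_strict))
```

Setting min_periods is a way to refuse an estimate that would rest on too few observations.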

Example:
In this example, two lists are created and converted to series using the Pandas Series() constructor. The average of each list is computed, and a function is defined to calculate the covariance manually. Pandas .cov() is also applied, and the results of both approaches are stored in variables and printed for comparison.

import pandas as pd

# Two sample lists of equal length
a = [2, 3, 2.7, 3.2, 4.1]
b = [10, 14, 12, 15, 20]

# Averages of both lists
av_a = sum(a)/len(a)
av_b = sum(b)/len(b)

# Convert the lists to Pandas series
a = pd.Series(a)
b = pd.Series(b)

# Covariance using the Pandas method
covar = a.cov(b)

# Covariance computed manually (sample covariance, divides by n - 1)
def covarfn(a, b, av_a, av_b):
    cov = 0
    for i in range(0, len(a)):
        cov += (a[i] - av_a) * (b[i] - av_b)
    return (cov / (len(a)-1))

cov = covarfn(a, b, av_a, av_b)

print("Results from Pandas method: ", covar)
print("Results from manual function method: ", cov)

Output:

Results from Pandas method:  2.8499999999999996
Results from manual function method:  2.8499999999999996

As the output shows, both approaches give the same result. The built-in method is therefore a convenient way to find the covariance of large series.
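
For large series, the explicit loop can also be written without iteration. The sketch below reuses the same sample values and assumes nothing beyond standard Pandas arithmetic on series. The variable name manual_cov is introduced here for illustration.

```python
import pandas as pd

a = pd.Series([2, 3, 2.7, 3.2, 4.1])
b = pd.Series([10, 14, 12, 15, 20])

# Vectorized form of the manual loop: deviations from the mean,
# multiplied element-wise, summed, and divided by n - 1.
manual_cov = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)
pandas_cov = a.cov(b)

print(manual_cov)
print(pandas_cov)
```

Both values should agree up to floating-point rounding, since Series.cov() also uses the n - 1 denominator by default.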

Introduction

Working with variables in data analysis always raises the question: how are the variables dependent on, linked to, and varying against each other? Covariance and correlation measures help answer it.

Covariance describes how two variables vary together. We use covariance to measure how much two variables change with each other. Correlation reveals the strength of the relation between the variables. We use correlation to determine how strongly linked two variables are to each other.

In this article, we'll learn how to calculate the covariance and correlation in Python.

Covariance and Correlation - In Simple Terms

Both covariance and correlation are about the relationship between the variables. Covariance defines the directional association between the variables. Covariance values range from -inf to +inf where a positive value denotes that both the variables move in the same direction and a negative value denotes that both the variables move in opposite directions.

Correlation is a standardized statistical measure that expresses the extent to which two variables are linearly related (meaning how much they change together at a constant rate). The strength and directional association of the relationship between two variables are defined by correlation and it ranges from -1 to +1. Similar to covariance, a positive value denotes that both variables move in the same direction whereas a negative value tells us that they move in opposite directions.

Both covariance and correlation are vital tools used in data exploration for feature selection and multivariate analyses. For example, an investor looking to spread the risk of a portfolio might look for stocks with a low or negative covariance, as it suggests that their prices do not rise and fall at the same time. However, the size of a covariance depends on the units of the data, so it is hard to judge on its own. The investor would then use the correlation metric to determine how strongly linked those stock prices are to each other.
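
The difference between an unbounded covariance and a standardized correlation can be illustrated with a change of units. This is a minimal sketch with made-up height and weight values. The helper functions are written here only for the illustration and use the sample (N - 1) form.

```python
def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) ** 0.5 * covariance(y, y) ** 0.5)

# Illustrative values (made up): the same heights in metres and in centimetres.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [50.0, 58.0, 63.0, 72.0, 80.0]
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
cor_m = correlation(height_m, weight_kg)
cor_cm = correlation(height_cm, weight_kg)

print(cov_m, cov_cm)  # the covariance grows by a factor of 100
print(cor_m, cor_cm)  # the correlation is unchanged
```

Rescaling one variable rescales the covariance by the same factor, while the correlation stays the same. That is why correlation is the measure to compare across pairs of variables.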

Setup for Python Code - Retrieving Sample Data

With the basics from the previous section covered, let's move ahead and calculate covariance in Python. For this example, we will be working with the well-known Iris dataset. To be specific, we're only working with the setosa species, so this is just a sample of the dataset about some lovely purple flowers!

Let's have a look at the dataset, on which we will be performing the analysis:

We will pick two columns for our analysis: sepal_length and sepal_width.

In a new Python file (you can name it covariance_correlation.py), let's begin by creating two lists with values for the sepal_length and sepal_width properties of the flower:

with open('iris_setosa.csv', 'r') as f:
    g = f.readlines()
    # Each line is split on commas, and lists of floats are formed.
    # g[1:] skips the first line, which is assumed to be the header row.
    sep_length = [float(x.split(',')[0]) for x in g[1:]]
    sep_width  = [float(x.split(',')[1]) for x in g[1:]]
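
An alternative that selects columns by name instead of by position can be sketched with the standard csv module. The function name load_columns is hypothetical. It also assumes that the header row of the file uses the column names sepal_length and sepal_width, which this article does not confirm.

```python
import csv

def load_columns(path, first="sepal_length", second="sepal_width"):
    """Read two numeric columns from a CSV file that has a header row.

    The default column names are an assumption about the file's header.
    """
    xs, ys = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            xs.append(float(row[first]))
            ys.append(float(row[second]))
    return xs, ys
```

With such a helper, the two lists could be obtained with sep_length, sep_width = load_columns('iris_setosa.csv'). Reading by name keeps the code working if the column order in the file changes.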

In data science, it always helps to visualize the data you're working on. Here's a Seaborn regression plot (Scatter Plot + linear regression fit) of these setosa properties on different axes:

Visually, the data points cluster close to the regression line, which suggests a high correlation. Let's see if our observations match up to their covariance and correlation values.

Calculating Covariance in Python

The following formula computes the covariance:

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

In the above formula,

xi, yi - are individual elements of the x and y series
x̄, ȳ - are the mathematical means of the x and y series
N - is the number of elements in the series

The denominator is N for a whole dataset and N - 1 in the case of a sample. As our dataset is a small sample of the entire Iris dataset, we use N - 1.
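
The two denominators can be compared side by side. This is a minimal sketch in which the sample flag is a parameter introduced here for illustration, and the data are the small example lists used earlier rather than the Iris values.

```python
def covariance(x, y, sample=True):
    """Covariance of x and y; divides by N - 1 for a sample, by N otherwise."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (n - 1 if sample else n)

x = [2, 3, 2.7, 3.2, 4.1]
y = [10, 14, 12, 15, 20]

sample_cov = covariance(x, y)                    # divides by N - 1
population_cov = covariance(x, y, sample=False)  # divides by N

print(sample_cov, population_cov)
```

The sample value is always larger in magnitude by the factor N / (N - 1), and the difference shrinks as the series gets longer.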

With the math formula mentioned above as our reference, let's create this function in pure Python:

def covariance(x, y):
    # Finding the mean of the series x and y
    mean_x = sum(x)/float(len(x))
    mean_y = sum(y)/float(len(y))
    # Subtracting mean from the individual elements
    sub_x = [i - mean_x for i in x]
    sub_y = [i - mean_y for i in y]
    numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
    denominator = len(x)-1
    cov = numerator/denominator
    return cov

with open('iris_setosa.csv', 'r') as f:
    ...
    cov_func = covariance(sep_length, sep_width)
    print("Covariance from the custom function:", cov_func)

We first find the mean values of our datasets. We then use a list comprehension to iterate over every element in our two series of data and subtract the mean from each value. A for loop could have been used as well if that's your preference.

We then take those intermediate values of the two series and multiply them with each other in another list comprehension. We sum the result of that list and store it as the numerator. The denominator is a lot easier to calculate; be sure to decrease it by 1 when you're finding the covariance for sample data!

We then return the value when the numerator is divided by its denominator, which results in the covariance.

Running our script would give us this output:

Covariance from the custom function: 0.09921632653061219

The positive value denotes that both the variables move in the same direction.

Calculating Correlation in Python

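The most widely used formula to compute the correlation coefficient is Pearson's 'r':

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

In the above formula,

xi, yi - are individual elements of the x and y series
The numerator corresponds to the covariance (up to the normalizing factor, which cancels out)
The denominators correspond to the individual standard deviations of x and y (up to the same factor)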

Seems like we've discussed everything we need to get the correlation in this series of articles!

Let's calculate the correlation now:

def correlation(x, y):
    # Finding the mean of the series x and y
    mean_x = sum(x)/float(len(x))
    mean_y = sum(y)/float(len(y))
    # Subtracting mean from the individual elements
    sub_x = [i-mean_x for i in x]
    sub_y = [i-mean_y for i in y]
    # Sum of products of deviations (the covariance without its 1/(N-1) factor)
    numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
    # Sums of squared deviations (the variances without their 1/(N-1) factor)
    std_deviation_x = sum([sub_x[i]**2.0 for i in range(len(sub_x))])
    std_deviation_y = sum([sub_y[i]**2.0 for i in range(len(sub_y))])
    # Raising to the power 0.5 takes the square root
    denominator = (std_deviation_x*std_deviation_y)**0.5 # short but equivalent to (std_deviation_x**0.5) * (std_deviation_y**0.5)
    cor = numerator/denominator
    return cor

with open('iris_setosa.csv', 'r') as f:
    ...
    cor_func = correlation(sep_length, sep_width)
    print("Correlation from the custom function:", cor_func)

As this value needs the covariance of the two variables, our function pretty much works out that value once again. Once the covariance is computed, we then calculate the standard deviation for each variable. From there, the correlation is simply the covariance divided by the product of the two standard deviations. Because the numerator and the denominator share the same 1/(N - 1) factor, it cancels out, which is why the function can work with plain sums.

Running this code, we get the following output, confirming that these properties have a positive (the sign of the value) and fairly strong (the value is close to 1) relationship:

Correlation from the custom function: 0.7425466856651597
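
The relationship between the two measures can also be checked directly: dividing the sample covariance by the product of the two sample standard deviations should give the same coefficient as the sum-based formula. The sketch below assumes the small example lists used earlier (not the Iris data) and uses statistics.stdev from the standard library.

```python
import statistics

def covariance(x, y):
    """Sample covariance (divides by N - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation_from_sums(x, y):
    """Pearson's r from plain sums, without any 1/(N - 1) factors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sq_y = sum((yi - mean_y) ** 2 for yi in y)
    return num / (sq_x * sq_y) ** 0.5

x = [2, 3, 2.7, 3.2, 4.1]
y = [10, 14, 12, 15, 20]

# Covariance divided by the product of the sample standard deviations
r_from_cov = covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))
r_from_sums = correlation_from_sums(x, y)

print(r_from_cov, r_from_sums)
```

Both routes agree up to floating-point rounding, because the 1/(N - 1) factors in the numerator and denominator cancel.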

Conclusion

In this article, we learned about two statistical instruments in detail: covariance and correlation. We've learned what their values mean for our data, how they are represented in mathematics, and how to implement them in Python. Both of these measures can be very helpful in determining relationships between two variables.

How do you find the covariance between two variables in Python?

Use NumPy's cov() function. Covariance indicates how two or more variables vary together. In the covariance matrix it returns, element C_ij is the covariance of x_i and x_j.
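
A short sketch of reading a single covariance out of the matrix, assuming NumPy is installed and using the small example lists from earlier in this article:

```python
import numpy as np

x = [2, 3, 2.7, 3.2, 4.1]
y = [10, 14, 12, 15, 20]

# np.cov treats each argument as one variable and returns a 2x2 matrix:
# [[var(x), cov(x, y)],
#  [cov(y, x), var(y)]]
# By default it divides by N - 1 (sample covariance).
cov_matrix = np.cov(x, y)
cov_xy = cov_matrix[0, 1]

# np.corrcoef returns the matching matrix of Pearson correlation coefficients.
cor_xy = np.corrcoef(x, y)[0, 1]

print(cov_matrix)
print(cov_xy, cor_xy)
```

The diagonal holds the variances of x and y, and the matrix is symmetric, so C_01 and C_10 are the same value.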

How do you calculate covariance in pandas?

Use the Pandas DataFrame.cov() function. It computes the pairwise covariance of the columns of a DataFrame, excluding NA/null values. The returned DataFrame is the covariance matrix of the columns.
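
As a minimal sketch, assuming a DataFrame built from the small example lists used earlier (the column names a and b are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 3, 2.7, 3.2, 4.1],
    "b": [10, 14, 12, 15, 20],
})

# Pairwise sample covariance of every column with every other column
cov_table = df.cov()

# Individual entries are looked up by row and column label
cov_ab = cov_table.loc["a", "b"]

print(cov_table)
print(cov_ab)
```

The diagonal of the result holds each column's variance, and the off-diagonal entry for a and b matches what a.cov(b) returns for the same data.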