Covariance between two time series python

    Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

    Pandas Series.cov() is used to find the covariance of two series. In the following example, the covariance is computed both with the Pandas method and manually, and the two answers are then compared.

    Syntax: Series.cov(other, min_periods=None)
    Parameters: 
    other: Other series to be used in finding covariance 
    min_periods: Minimum number of observations to be taken to have a valid result
    Return type: Float value, Returns covariance of caller series and passed series 
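
    The effect of min_periods is easiest to see with missing values. The sketch below uses made-up series (not data from this article) and assumes the usual pandas behaviour of dropping any pair that contains NaN before computing the covariance.

```python
import math
import pandas as pd

# Illustrative series with one missing value each (hypothetical data)
s1 = pd.Series([2.0, 3.0, None, 3.2, 4.1])
s2 = pd.Series([10.0, 14.0, 12.0, None, 20.0])

# Pairs containing NaN are dropped, leaving (2, 10), (3, 14), (4.1, 20)
cov_pairs = s1.cov(s2)

# Asking for at least 4 complete pairs cannot be satisfied, so NaN is returned
cov_strict = s1.cov(s2, min_periods=4)

print(cov_pairs)
print(math.isnan(cov_strict))
```

    Setting min_periods is a simple guard against reporting a covariance that rests on too few overlapping observations.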

    Example :
    In this example, two lists are made and converted to series using the Pandas Series() constructor. The average of each series is found and a function is created to find the covariance manually. The Pandas .cov() method is also applied, and the results from both approaches are stored in variables and printed to compare the outputs.

    Python3

    import pandas as pd

    a = [2, 3, 2.7, 3.2, 4.1]
    b = [10, 14, 12, 15, 20]

    # Means of the two lists, used by the manual calculation
    av_a = sum(a)/len(a)
    av_b = sum(b)/len(b)

    a = pd.Series(a)
    b = pd.Series(b)

    # Covariance from the built-in Pandas method
    covar = a.cov(b)

    def covarfn(a, b, av_a, av_b):
        # Sum the products of the deviations from each mean
        cov = 0
        for i in range(0, len(a)):
            cov += (a[i] - av_a) * (b[i] - av_b)
        # Divide by N - 1 to get the sample covariance
        return (cov / (len(a)-1))

    cov = covarfn(a, b, av_a, av_b)

    print("Results from Pandas method: ", covar)
    print("Results from manual function method: ", cov)

    Output: 

    Results from Pandas method:  2.8499999999999996
    Results from manual function method:  2.8499999999999996

    As the output shows, both approaches give the same result. The built-in method is therefore a convenient way to find the covariance of large series without writing the loop by hand.

    Introduction

    Working with variables in data analysis always raises the question: how are the variables dependent on, linked to, and varying against each other? Covariance and correlation measures help answer this.

    Covariance describes how two variables vary together. We use covariance to measure how much two variables change with each other. Correlation reveals the relation between the variables. We use correlation to determine how strongly linked two variables are to each other.

    In this article, we'll learn how to calculate the covariance and correlation in Python.

    Covariance and Correlation - In Simple Terms

    Both covariance and correlation are about the relationship between the variables. Covariance defines the directional association between the variables. Covariance values range from -inf to +inf where a positive value denotes that both the variables move in the same direction and a negative value denotes that both the variables move in opposite directions.

    Correlation is a standardized statistical measure that expresses the extent to which two variables are linearly related (meaning how much they change together at a constant rate). The strength and directional association of the relationship between two variables are defined by correlation and it ranges from -1 to +1. Similar to covariance, a positive value denotes that both variables move in the same direction whereas a negative value tells us that they move in opposite directions.
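
    The difference between the unbounded covariance and the standardized correlation can be illustrated by changing units. The following is a minimal sketch with hypothetical height and weight values (not taken from any dataset in this article), using NumPy's cov() and corrcoef() functions.

```python
import numpy as np

# Hypothetical measurements: heights in metres and weights in kilograms
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 72.0, 85.0])

# The same heights expressed in centimetres
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_m, cov_cm)    # covariance grows by a factor of 100 with the unit change
print(corr_m, corr_cm)  # correlation is the same in both units
```

    Because covariance depends on the units of the inputs, its magnitude is hard to interpret on its own, while correlation stays between -1 and +1 regardless of scale.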

    Both covariance and correlation are vital tools used in data exploration for feature selection and multivariate analyses. For example, an investor looking to spread the risk of a portfolio might look for stocks with a low or negative covariance, as it suggests that their prices do not rise and fall at the same time. However, the covariance is not enough on its own, because its size depends on the scale of the prices. The investor would then use the correlation metric to determine how strongly linked those stock prices are to each other.

    Setup for Python Code - Retrieving Sample Data

    With the basics learned from the previous section, let's move ahead to calculate covariance in Python. For this example, we will be working on the well-known Iris dataset. We're only working with the setosa species to be specific, hence this will be just a sample of the dataset about some lovely purple flowers!

    We will pick two columns of the dataset for our analysis: sepal_length and sepal_width.

    In a new Python file (you can name it covariance_correlation.py), let's begin by creating two lists with values for the sepal_length and sepal_width properties of the flower:

    with open('iris_setosa.csv','r') as f:
        g=f.readlines()
        # Each line is split based on commas, and the list of floats are formed 
        sep_length = [float(x.split(',')[0]) for x in g[1:]]
        sep_width  = [float(x.split(',')[1]) for x in g[1:]]
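
    As an alternative to splitting each line by hand, the standard csv module can read the columns by name. This is only a sketch: it assumes the file has a header row with columns named sepal_length and sepal_width, and it reads from an in-memory stand-in with three illustrative rows instead of the real iris_setosa.csv.

```python
import csv
import io

# Stand-in for the file: the header names and layout are assumptions,
# and the three rows are only a small excerpt used for illustration
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

sep_length = []
sep_width = []
# DictReader uses the header row to name each field
for row in csv.DictReader(sample):
    sep_length.append(float(row["sepal_length"]))
    sep_width.append(float(row["sepal_width"]))

print(sep_length)
print(sep_width)
```

    Reading by column name avoids relying on column positions, which helps if the column order in the file ever changes.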
    

    In data science, it always helps to visualize the data you're working on. A Seaborn regression plot (scatter plot plus a linear regression fit) of these two setosa properties is a good way to do that.

    Visually, the data points lie close to the regression line, which suggests a high correlation. Let's see if this observation matches the covariance and correlation values.

    Calculating Covariance in Python

    The following formula computes the sample covariance:

    $$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

    In the above formula,

    • xi, yi - are individual elements of the x and y series
    • x̄, ȳ - are the mathematical means of the x and y series
    • N - is the number of elements in the series

    The denominator is N for a whole dataset and N - 1 in the case of a sample. As our dataset is a small sample of the entire Iris dataset, we use N - 1.
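
    To make the effect of the denominator concrete, the sketch below reuses the two short lists from the Pandas example earlier in this article and computes both versions. The ddof argument of numpy.cov selects the same choice: ddof=1 divides by N - 1 and ddof=0 divides by N.

```python
import numpy as np

x = np.array([2, 3, 2.7, 3.2, 4.1])
y = np.array([10, 14, 12, 15, 20])

# Sum of the products of deviations from the means (the shared numerator)
numerator = np.sum((x - x.mean()) * (y - y.mean()))

sample_cov = numerator / (len(x) - 1)   # divide by N - 1
population_cov = numerator / len(x)     # divide by N

# numpy.cov exposes the same choice through its ddof argument
print(sample_cov, np.cov(x, y, ddof=1)[0, 1])
print(population_cov, np.cov(x, y, ddof=0)[0, 1])
```

    With only five observations the two results differ noticeably; the gap shrinks as N grows.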

    With the math formula mentioned above as our reference, let's create this function in pure Python:

    def covariance(x, y):
        # Finding the mean of the series x and y
        mean_x = sum(x)/float(len(x))
        mean_y = sum(y)/float(len(y))
        # Subtracting mean from the individual elements
        sub_x = [i - mean_x for i in x]
        sub_y = [i - mean_y for i in y]
        numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
        denominator = len(x)-1
        cov = numerator/denominator
        return cov
    
    with open('iris_setosa.csv', 'r') as f:
        ...
        cov_func = covariance(sep_length, sep_width)
        print("Covariance from the custom function:", cov_func)
    

    We first find the mean values of our datasets. We then use a list comprehension to iterate over every element in our two series of data and subtract the mean from each value. A for loop could have been used as well if that's your preference.

    We then take those intermediate values of the two series and multiply them with each other in another list comprehension. We sum the result of that list and store it as the numerator. The denominator is a lot easier to calculate: just be sure to decrease it by 1 when you're finding the covariance for sample data!

    We then return the value when the numerator is divided by its denominator, which results in the covariance.

    Running our script would give us this output:

    Covariance from the custom function: 0.09921632653061219
    

    The positive value denotes that both the variables move in the same direction.

    Calculating Correlation in Python

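    The most widely used formula to compute the correlation coefficient is Pearson's 'r':

    $$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$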

    In the above formula,

    • xi, yi - are individual elements of the x and y series
    • The numerator corresponds to the covariance
    • The denominators correspond to the individual standard deviations of x and y

    Seems like we've discussed everything we need to get the correlation in this series of articles!

    Let's calculate the correlation now:

    def correlation(x, y):
        # Finding the mean of the series x and y
        mean_x = sum(x)/float(len(x))
        mean_y = sum(y)/float(len(y))
        # Subtracting mean from the individual elements
        sub_x = [i-mean_x for i in x]
        sub_y = [i-mean_y for i in y]
        # Sum of products of deviations: the covariance before dividing by N - 1
        numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
        # Sums of squared deviations: the variances of x and y before dividing by N - 1
        sum_sq_x = sum([sub_x[i]**2.0 for i in range(len(sub_x))])
        sum_sq_y = sum([sub_y[i]**2.0 for i in range(len(sub_y))])
        # Raising to the power 0.5 takes the square root; the N - 1 divisors cancel out
        denominator = (sum_sq_x*sum_sq_y)**0.5 # short but equivalent to (sum_sq_x**0.5) * (sum_sq_y**0.5)
        cor = numerator/denominator
        return cor
    
    with open('iris_setosa.csv', 'r') as f:
        ...
        cor_func = correlation(sep_length, sep_width)
        print("Correlation from the custom function:", cor_func)
    

    As this value needs the covariance of the two variables, our function works out the covariance numerator once again. It then computes the sum of squared deviations for each variable, which is the variance before dividing by N - 1. The correlation is the covariance divided by the product of the two standard deviations. Because the N - 1 divisors cancel, the function can simply divide the numerator by the square root of the product of the two sums of squares.
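
    The relationship between the two routes can be checked numerically. The sketch below reuses the two short lists from the Pandas example earlier in this article and compares the covariance divided by the product of the sample standard deviations with the shortcut taken in correlation().

```python
from statistics import mean, stdev

x = [2, 3, 2.7, 3.2, 4.1]
y = [10, 14, 12, 15, 20]
n = len(x)
mean_x, mean_y = mean(x), mean(y)

# Sample covariance (denominator N - 1)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Route 1: covariance divided by the product of the sample standard deviations
r_from_cov = cov_xy / (stdev(x) * stdev(y))

# Route 2: the shortcut, where the N - 1 factors cancel
sum_sq_x = sum((a - mean_x) ** 2 for a in x)
sum_sq_y = sum((b - mean_y) ** 2 for b in y)
r_shortcut = (cov_xy * (n - 1)) / (sum_sq_x * sum_sq_y) ** 0.5

print(r_from_cov, r_shortcut)
```

    Both routes should agree up to floating-point rounding, which is why the function never needs to divide by N - 1.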

    Running this code we get the following output, confirming that these properties have a positive relationship (the value has a + sign) that is fairly strong (the value is closer to 1 than to 0):

    Correlation from the custom function: 0.7425466856651597
    

    Conclusion

    In this article, we looked at two statistical instruments in detail: covariance and correlation. We've learned what their values mean for our data, how they are represented mathematically, and how to implement them in Python. Both of these measures can be very helpful in determining relationships between two variables.

    How do you find the covariance between two variables in Python?

    Use a cov() function, such as numpy.cov(). Covariance measures how two or more variables vary together. In the covariance matrix that is returned, element Cij is the covariance of xi and xj.
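
    As a brief illustration of the matrix layout, the sketch below passes the two short lists from the Pandas example earlier in this article to numpy.cov and reads the individual entries.

```python
import numpy as np

x = np.array([2, 3, 2.7, 3.2, 4.1])
y = np.array([10, 14, 12, 15, 20])

# Each input is treated as one variable, so the result is a 2 x 2 matrix
c = np.cov(x, y)

print(c[0, 0])  # variance of x
print(c[1, 1])  # variance of y
print(c[0, 1])  # covariance of x and y (equal to c[1, 0])
```

    The diagonal holds the variances, and the matrix is symmetric, so the covariance of the pair appears twice.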

    How do you calculate covariance in pandas?

    Use the Pandas DataFrame.cov() function, which computes the pairwise covariance of the columns of a DataFrame, excluding NA/null values. The returned DataFrame is the covariance matrix of the columns.
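
    A minimal sketch of this, using the two short lists from the Pandas example earlier in this article as columns (the column names "a" and "b" are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 3, 2.7, 3.2, 4.1],
    "b": [10, 14, 12, 15, 20],
})

# Pairwise covariance of every column with every other column
cov_matrix = df.cov()
print(cov_matrix)

# A single entry can be read by its column labels
print(cov_matrix.loc["a", "b"])
```

    The off-diagonal entry should match the value returned by Series.cov() for the same two series.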

    How do you plot covariance in Python?

    A covariance matrix is a square matrix that shows the covariance between many different variables.
    Step 1: Create the dataset.
    Step 2: Create the covariance matrix.
    Step 3: Interpret the covariance matrix.
    Step 4: Visualize the covariance matrix (optional).