How to calculate the p-value in linear regression (Python)

This is kind of overkill, but let's give it a go. First, let's use statsmodels to find out what the p-values should be:

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

and we get

                         OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================

OK, let's reproduce this. It is kind of overkill, as we are almost reproducing a full linear regression analysis using matrix algebra. But what the heck.

lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note: if you don't want to use a DataFrame, replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))
# and use len(newX[0]) in place of len(newX.columns) below.

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/sd_b

# Two-sided p-values from the t distribution with n - k degrees of freedom.
# With the DataFrame version of newX, the column count is len(newX.columns);
# newX[0] would select the column labelled 0 (442 rows) and give 0 degrees of freedom.
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-len(newX.columns)))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)

And this gives us:

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306

So we can reproduce the values from statsmodels.
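A quick programmatic check confirms the match, reusing est2 from the statsmodels fit above:

print(np.allclose(p_values, est2.pvalues.round(3)))  # True: both approaches agree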

Before going on to learn how to find the p-value (significance) in scikit-learn, let's understand what a p-value actually means. Linear regression is probably the simplest approach in statistical learning, which makes it a good setting for the discussion.

What is a p-value?

P-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

– Wikipedia

In any modeling task, we hypothesize some correlation between the features and the target. The null hypothesis is the opposite: there is no correlation between the features and the target. In a hypothesis test, the p-value is used to decide whether to reject the null hypothesis: the smaller the p-value, the stronger the evidence that the null hypothesis should be rejected.

P-values are expressed as decimals, but converting them to percentages can make them easier to read. For instance, a p-value of 0.0294 is 2.94%: if the null hypothesis were true, there would be a 2.94% chance of seeing results at least this extreme purely by chance. That is rather small.

On the other hand, a high p-value such as 0.91 means there would be a 91% chance of seeing such results by chance alone, so they provide essentially no evidence against the null hypothesis. The smaller the p-value, therefore, the more "significant" your results are.
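As a tiny illustration, converting a p-value to a percentage in Python:

p = 0.0294
print(f"{p:.2%}")  # prints 2.94%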

  • A p-value below 0.05 is conventionally considered statistically significant. It counts as strong evidence against the null hypothesis, since such results would occur less than 5% of the time under it. On this basis, we reject the null hypothesis in favor of the alternative hypothesis.

However, this does not mean that the alternative hypothesis is necessarily true.

  • A p-value greater than 0.05 is not statistically significant; it indicates insufficient evidence against the null hypothesis. In that case we fail to reject the null hypothesis and do not adopt the alternative.

You should be aware that the null hypothesis is never accepted; it can only be rejected or not rejected.
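As a minimal sketch of this decision rule (the t statistic, degrees of freedom, and threshold below are made up for illustration):

from scipy import stats

t_stat, df, alpha = 2.25, 40, 0.05         # illustrative values, not from the model above
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value; sf(x, df) = 1 - cdf(x, df)

if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} > {alpha}: fail to reject the null hypothesis")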

How to find the p-value (significance) in scikit-learn?

Let's import the built-in "diabetes" dataset and run a linear regression model using the scikit-learn library.

We'll calculate p-values using the statsmodels library as shown below:

  1. First, let’s load the important libraries:

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sma

  2. Let's load the diabetes dataset and define X & y:

diabetes_df = datasets.load_diabetes()
X = diabetes_df.data
y = diabetes_df.target
X2  = sma.add_constant(X)
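(add_constant prepends a column of ones so the model estimates an intercept; a quick check of the shapes:)

print(X.shape, X2.shape)  # (442, 10) (442, 11)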

  3. Let's fit the model:

est = sma.OLS(y, X2)
est2 = est.fit()

  4. And the final step: let's check the summary of our simple model (focus on the p-values):

print(est2.summary())
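Note that statsmodels also exposes the p-values programmatically, so there is no need to parse the summary table:

print(est2.pvalues)  # p-values for the constant and each of the ten coefficients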

As you may have noticed, we calculated the p-values using the statsmodels library, not scikit-learn. Let's now reproduce the calculation with scikit-learn (plus numpy and scipy), as shown below:


from scipy import stats

lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)  # intercept first, then the coefficients
predictions = lm.predict(X)

# Design matrix with a leading column of ones for the intercept
new_X = np.append(np.ones((len(X), 1)), X, axis=1)

# Residual mean squared error with n - k degrees of freedom
M_S_E = (sum((y - predictions)**2)) / (len(new_X) - len(new_X[0]))

# Variance and standard error of each coefficient, then t statistics
v_b = M_S_E * (np.linalg.inv(np.dot(new_X.T, new_X)).diagonal())
s_b = np.sqrt(v_b)
t_b = params / s_b

# Two-sided p-values from the t distribution
p_val = [2 * (1 - stats.t.cdf(np.abs(i), (len(new_X) - len(new_X[0])))) for i in t_b]
p_val = np.round(p_val, 3)
p_val
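If you prefer this as a reusable function, here is a small sketch; the name coef_pvalues is my own, not from scikit-learn:

def coef_pvalues(X, y):
    # Two-sided p-values for an OLS fit of y on X (intercept included).
    lm = LinearRegression().fit(X, y)
    params = np.append(lm.intercept_, lm.coef_)
    Xc = np.append(np.ones((len(X), 1)), X, axis=1)  # add intercept column
    dof = len(Xc) - Xc.shape[1]                      # n - k degrees of freedom
    mse = np.sum((y - lm.predict(X))**2) / dof
    se = np.sqrt(mse * np.linalg.inv(Xc.T @ Xc).diagonal())
    return 2 * stats.t.sf(np.abs(params / se), dof)  # sf = 1 - cdf

print(np.round(coef_pvalues(X, y), 3))  # matches the statsmodels p-value column above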

Summary

In this lesson on how to find the p-value (significance) in scikit-learn, we compared the p-value to a pre-defined significance level (the threshold) to decide whether we can reject the null hypothesis.

If p-value ≤ significance level, we reject the null hypothesis (H0)
If p-value > significance level, we fail to reject the null hypothesis (H0)

We also learned how to calculate p-values using the statsmodels and scikit-learn libraries.


References

  • Information on P-values
  • Official documentation of statsmodels
  • Official documentation of scikit-learn
  • Official documentation of linear regression

How do you find the p-value in simple linear regression?

For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), written t(n−2), and is calculated as 2 × the area past |t| under a t(n−2) curve. In this example, df = 30 − 2 = 28.
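A minimal sketch of this for simple regression, using scipy.stats.linregress on made-up data (the x and y below are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)  # made-up data with a real slope

res = stats.linregress(x, y)       # two-sided p-value based on t with n - 2 df
print(res.slope, res.pvalue)       # here n = 30, so df = 28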

How do you find the p-value using a t-test?

One way to get a p-value is by using a t-test. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean, popmean.
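For example, with scipy.stats.ttest_1samp (the sample values and population mean below are illustrative):

import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])         # illustrative sample 'a'
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # H0: mean equals 5.0
print(t_stat, p_value)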

What is a p-value?

The p-value is the probability, computed in a statistical hypothesis test, of obtaining results at least as extreme as those actually observed.

Does linear regression have p-values?

Yes. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis.
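Applying this to the diabetes model above, a small sketch of flagging the significant terms, reusing est2 from the statsmodels fit:

import numpy as np

names = np.array(['const'] + [f'x{i}' for i in range(1, 11)])
print(names[est2.pvalues < 0.05])  # the terms with p < 0.05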