This is kind of overkill, but let's give it a go. First, let's use statsmodels to find out what the p-values should be:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
And we get:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================
OK, let's reproduce this. It is kind of overkill, as we are almost reproducing a linear regression analysis using matrix algebra. But what the heck.
lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)
newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))
# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))
var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b
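# Degrees of freedom = number of rows minus number of columns in newX.
# (On a DataFrame, newX[0] selects the column labelled 0, not the first row,
# so len(newX[0]) would give the row count; shape works for DataFrame and ndarray.)
p_values = [2*(1-stats.t.cdf(np.abs(i),(newX.shape[0]-newX.shape[1]))) for i in ts_b]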
sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)
myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)
And this gives us:
    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061          0.000
1       -10.0122           59.749    -0.168          0.867
2      -239.8191           61.222    -3.917          0.000
3       519.8398           66.534     7.813          0.000
4       324.3904           65.422     4.958          0.000
5      -792.1842          416.684    -1.901          0.058
6       476.7458          339.035     1.406          0.160
7       101.0446          212.533     0.475          0.635
8       177.0642          161.476     1.097          0.273
9       751.2793          171.902     4.370          0.000
10       67.6254           65.984     1.025          0.306
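For reference, the quantities computed in the code above appear to correspond to the standard OLS formulas, sketched here under the usual assumptions (independent, normally distributed errors with constant variance). The symbols are labels introduced for this sketch, not names from the code: X is the design matrix including the constant column (newX), n is its number of rows, p is its number of columns, and F is the CDF of Student's t distribution with n − p degrees of freedom. The names in parentheses are the matching variables in the code.

```latex
\begin{aligned}
\hat{\sigma}^2 &= \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p} && \text{(MSE)} \\
\widehat{\operatorname{Var}}(\hat{\beta}_j) &= \hat{\sigma}^2 \left[ (X^\top X)^{-1} \right]_{jj} && \text{(var\_b)} \\
\operatorname{se}(\hat{\beta}_j) &= \sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_j)} && \text{(sd\_b)} \\
t_j &= \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)} && \text{(ts\_b)} \\
p_j &= 2 \left( 1 - F_{t,\,n-p}\!\left( |t_j| \right) \right) && \text{(p\_values)}
\end{aligned}
```

For the diabetes data, n = 442 and p = 11 (ten features plus the constant), so n − p = 431, which matches the "Df Residuals" entry in the statsmodels summary.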
So we can reproduce the values from statsmodels.
Before going on to learn how to find the p-value (significance) in scikit-learn, let's understand its meaning. In terms of statistical learning, linear regression is probably the simplest approach.
What is P-value?
"The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct." – Wikipedia
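To make the definition concrete, here is a minimal simulation sketch. It assumes a simple one-feature regression on synthetic data; the helper name slope_t_stat, the seed, the sample size, and the true slope of 0.4 are all illustrative choices, not taken from the text. It compares the p-value from the t distribution with the fraction of "null" datasets whose t statistic is at least as extreme as the observed one.

```python
import numpy as np
from scipy import stats

def slope_t_stat(x, y):
    """t statistic for the slope of a simple least-squares line y = a + b*x."""
    x_c = x - x.mean()
    sxx = x_c @ x_c
    b = (x_c @ (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = (resid @ resid) / (len(x) - 2)  # residual variance with n - 2 df
    return b / np.sqrt(s2 / sxx)

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatability
n = 30
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)  # synthetic data with a true slope of 0.4
t_obs = slope_t_stat(x, y)

# Analytical two-sided p-value from the t distribution with n - 2 df
p_analytic = 2 * stats.t.sf(abs(t_obs), df=n - 2)

# Simulated p-value: redraw y with no dependence on x (the null hypothesis
# holds) and count how often |t| is at least as extreme as the observed |t|
t_null = np.array([slope_t_stat(x, rng.normal(size=n)) for _ in range(20000)])
p_sim = np.mean(np.abs(t_null) >= abs(t_obs))

print(f"t = {t_obs:.3f}, analytic p = {p_analytic:.4f}, simulated p = {p_sim:.4f}")
```

The two numbers should agree up to simulation noise, which is the sense in which the p-value is a probability "under the assumption that the null hypothesis is correct".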
In any modeling task, we hypothesize some relationship between the features and the target. The null hypothesis is the opposite: there is no relationship between the features and the target. In hypothesis tests, the p-value is used to decide whether to reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.
P-values are expressed as decimals, but converting them to percentages may make them easier to understand. For instance, a p-value of 0.0294 is 2.94%. This means that, if the null hypothesis were true, results at least as extreme as yours would occur by chance only 2.94% of the time. This is rather small. On the other hand, a high p-value of 91% means that results like yours would be very common under the null hypothesis, so they give no real evidence of an effect. The smaller the p-value, therefore, the more statistically significant your results are.
- A p-value below 0.05 is conventionally treated as statistically significant. It is evidence against the null hypothesis, because results this extreme would occur less than 5% of the time if the null hypothesis were true. Based on this, we reject the null hypothesis in favor of the alternative hypothesis. However, this does not mean that the alternative hypothesis is necessarily true.
- A p-value greater than 0.05 is not statistically significant. The data do not provide enough evidence against the null hypothesis, so we fail to reject it. This does not show that the null hypothesis is true.
You should be aware that the null hypothesis cannot be accepted; it can only be rejected or not rejected.
How To Find P-value (significance) In Scikit-learn?
Let's import the built-in "diabetes" dataset and run a linear regression model using the sklearn library. We'll calculate p-values using the statsmodels library as shown below:
- First, let's load the important libraries:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sma
- Let's load the diabetes dataset, define X & y, and fit an OLS model:
diabetes_df = datasets.load_diabetes()
X = diabetes_df.data
y = diabetes_df.target
X2 = sma.add_constant(X)  # add an intercept column
_1 = sma.OLS(y, X2)
_2 = _1.fit()
- And the final step, let's check the summary of our simple model (focus on p-values):
print(_2.summary())
If you noticed, we calculated the p-values using the statsmodels library and not scikit-learn. Let's calculate the p-values using scikit-learn as shown below:
from scipy import stats
lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)
predictions = lm.predict(X)
new_X = np.append(np.ones((len(X), 1)), X, axis=1)  # design matrix with intercept column
M_S_E = (sum((y - predictions)**2)) / (len(new_X) - len(new_X[0]))  # residual variance
v_b = M_S_E * (np.linalg.inv(np.dot(new_X.T, new_X)).diagonal())  # coefficient variances
s_b = np.sqrt(v_b)  # standard errors
t_b = params / s_b  # t statistics
p_val = [2*(1 - stats.t.cdf(np.abs(i), (len(new_X) - len(new_X[0])))) for i in t_b]
p_val = np.round(p_val, 3)
p_val
Summary
In this lesson on how to find the p-value (significance) in scikit-learn, we compared the p-value to a pre-defined significance level (threshold) to see if we can reject the null hypothesis.
- If p-value ≤ significance level, we reject the null hypothesis (H0).
- If p-value > significance level, we fail to reject the null hypothesis (H0).
We also learned how to calculate p-values using the statsmodels and scikit-learn libraries.
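As a small illustration of this decision rule, the sketch below applies it to the rounded p-values from the regression output earlier in this document, using the row labels from the statsmodels summary. The function name decide and the default alpha of 0.05 are illustrative choices for this sketch.

```python
# Rounded p-values copied from the regression output earlier in this document
p_values = {
    "const": 0.000, "x1": 0.867, "x2": 0.000, "x3": 0.000, "x4": 0.000,
    "x5": 0.058, "x6": 0.160, "x7": 0.635, "x8": 0.273, "x9": 0.000,
    "x10": 0.306,
}

def decide(p_value, alpha=0.05):
    """Apply the decision rule for one test at significance level alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

decisions = {name: decide(p) for name, p in p_values.items()}
significant = [name for name, d in decisions.items() if d == "reject H0"]

for name, d in decisions.items():
    print(f"{name:>5}: p = {p_values[name]:.3f} -> {d}")
print("significant at 0.05:", significant)
```

Note that x5, with p = 0.058, falls just above the threshold, so at the 0.05 level we fail to reject the null hypothesis for it.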
How do you find p
For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), written t(n − 2), and is calculated as 2 × the area beyond |t| under the t(n − 2) curve. In this example, df = 30 − 2 = 28.
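A minimal sketch of this calculation, assuming synthetic data with n = 30 points (the data, seed, and variable names are illustrative): it computes the slope's t statistic, takes twice the upper-tail area of the t distribution with 28 degrees of freedom, and compares the result with the p-value reported by scipy.stats.linregress.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary fixed seed
n = 30
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)  # synthetic data, illustrative only

res = stats.linregress(x, y)

df = n - 2                                   # 30 - 2 = 28
t_stat = res.slope / res.stderr              # t statistic for the slope
p_manual = 2 * stats.t.sf(abs(t_stat), df)   # 2 x area beyond |t|

print(f"df = {df}, t = {t_stat:.3f}")
print(f"manual p = {p_manual:.6f}, linregress p = {res.pvalue:.6f}")
```

Using the survival function stats.t.sf instead of 1 - stats.t.cdf gives the same area but tends to be more accurate for very small p-values.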
How do you find the p
One way to get a p-value is with a one-sample t-test. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations 'a' is equal to the given population mean, popmean.
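The wording above matches scipy.stats.ttest_1samp, whose parameters are named a and popmean. Here is a minimal sketch on synthetic data (the sample, seed, and hypothesized mean are illustrative), together with the same calculation done by hand for comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary fixed seed
a = rng.normal(loc=0.3, scale=1.0, size=25)  # synthetic sample
popmean = 0.0                                # hypothesized population mean

# Library call: two-sided one-sample t-test
t_stat, p_value = stats.ttest_1samp(a, popmean)

# Same test by hand: t = (sample mean - popmean) / standard error, n - 1 df
n = len(a)
t_manual = (a.mean() - popmean) / (a.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(f"scipy : t = {t_stat:.4f}, p = {p_value:.4f}")
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
```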
What is p
A p-value is the probability, assuming the null hypothesis is true, of obtaining a test result at least as extreme as the one actually observed. It is computed from a statistical hypothesis test.
Does linear regression have p
How Do I Interpret the P-Values in Linear Regression Analysis? The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis.
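The per-coefficient test described above can be sketched as a small reusable helper. This is an illustrative version under stated assumptions, not a scikit-learn API: the function name ols_p_values is hypothetical, and it fits the model with np.linalg.lstsq rather than LinearRegression so that it depends only on NumPy and SciPy. It assumes the usual OLS conditions and a design matrix of full column rank.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Return (coefficients, std errors, t values, p-values) for OLS with an intercept.

    Element 0 of each returned array refers to the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    design = np.column_stack([np.ones(len(X)), X])  # prepend the constant column
    n, p = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    mse = (resid @ resid) / (n - p)                 # residual variance
    cov = mse * np.linalg.inv(design.T @ design)    # coefficient covariance
    se = np.sqrt(np.diag(cov))
    t_vals = beta / se
    # H0 for each term: the coefficient equals zero (two-sided test)
    p_vals = 2 * stats.t.sf(np.abs(t_vals), df=n - p)
    return beta, se, t_vals, p_vals

# Usage on synthetic data: the second feature has no true effect on y
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=100)
beta, se, t_vals, p_vals = ols_p_values(X, y)
for name, b, p in zip(["const", "x1", "x2"], beta, p_vals):
    print(f"{name:>5}: coef = {b:8.4f}, p = {p:.4f}")
```

A low p-value for x1 is expected here because the data were generated with a real slope on that feature; the p-value for x2 depends on the random draw.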