Ví dụ mô hình hồi quy đa biến

Question

Cũng giống như chương Bảng mô tả, chúng ta cần xác định packahe nào trong R mà chúng ta muốn sử dụng. Chúng tôi trình bày hai chọn lựa để thực hiện các phân tích đơn biến:

base R

Hồi quy tuyến tính

Hàm

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

8 trong base cho phép thực hiện hồi quy tuyến tính để đánh giá mối quan hệ giữa biến đầu ra dạng số (numeric) và các biến giải thích mà được giả định là có mối quan hệ tuyến tính.

Cung cấp phương trình dưới dạng công thức với tên của biến đầu ra và các biến giải thích được phân tách bằng dấu ngã

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

9. Bên cạnh đó, chỉ rõ bộ số liệu nào được sử dụng với

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

0. Kết quả của mô hình được định nghĩa dưới dạng đối tượng của R để sử dụng về sau.

lm_results <- lm(ht_cm ~ age, data = linelist)

Sau đó tóm tắt kết quả của mô hình bằng hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

1 để xem các hệ số (ước tính), P-value, phần dư và các đo lường khác.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

Ngoài ra, có thể dùng hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

2 trong package broom để xuất kết quả vào trong một bảng. Kết quả bên dưới cho chúng ta biết khi tăng thêm một tuổi thì chiều cao tăng 3,5 cm và mối quan hệ này có ý nghĩa thống kê.

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

Sau đó, có thể sử dụng kết quả hồi quy này để đưa vào ggplot. Để thực hiện điều này, trước tiên chúng ta đưa các giá trị quan sát và đường thẳng hồi quy (fitted line) vào một data frame bằng cách dùng hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

3 trong package broom.

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

Bạn cũng có thể vẽ đường hồi quy tuyến tính đơn bằng package ggplot thông qua hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

4.

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Xem thêm các hướng dẫn chi tiết trong mục Nguồn ở cuối chương này.

Hồi quy Logistic

Hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 trong package stats (một phần của base R) được sử dụng để fit (chọn mô hình dự đoán tối ưu dựa trên số liệu quan sát) đối với Mô hình Tuyến tính Tổng quát (GLM).

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 có thể được sử dụng cho cả hồi quy logistic đơn biến và đa biến (ví dụ như để tính tỷ số chênh OR). Sau đây là những thành phần chính của hàm:

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

7 Mô hình được cung cấp cho

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 dưới dạng một phương trình với biến kết cục ở bên trái và biến giải thích ở bên phải dấu ngã

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

9.

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

0 Xác định loại mô hình sẽ thực hiện. Đối với hồi quy logistic, sử dụng

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

1, đối với hồi quy poisson sử dụng

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

2. Các ví dụ khác được trình bày trong bảng bên dưới.

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

0 Cụ thể bộ số liệu

Nếu cần, có thể cụ thể hàm liên kết bằng cú pháp

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

4. Bạn có thể tìm đọc thêm về các họ hồi quy khác và các tùy chọn đối số như là

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

5 và

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

6 bằng cách gõ (

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

7).

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

8

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

9

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

0

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

1

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

2

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

3

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

4

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

5

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

6

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

7

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

8

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

9

## `geom_smooth()` using formula 'y ~ x'

0

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

9

## `geom_smooth()` using formula 'y ~ x'

2

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)

7

Khi thực hiện

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 , phổ biến nhất là lưu kết quả dưới dạng một đối tượng của R được đặt tên. Sau đó, có thể xuất kết quả ra console bằng cách sử dụng hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

1 như được trình bày bên dưới, hoặc thực hiện các thao tác khác từ kết quả (ví dụ như lấy lũy thừa).

Nếu cần thực hiện một hồi quy nhị thức âm, có thể sử dụng package MASS. Hàn

## `geom_smooth()` using formula 'y ~ x'

6 uses cũng sử dụng cùng cú pháp như

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5. Để xem qua các hồi quy khác, xem trên trang thống kê của UCLA.

Phân tích đơn biến sử dụng

## `geom_smooth()` using formula 'y ~ x'

8

Trong ví dụ này, chúng tôi đánh giá mối liên quan giữa nhóm tuổi và biến kết cục tử vong (được mã hóa là 1 trong phần chuẩn bị). Bên dưới là một mô hình đơn biến của biến kết cục

## `geom_smooth()` using formula 'y ~ x'

9 theo

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

0. Chúng tôi lưu kết quả đầu ra được đặt tên là

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

1 và sau đó in kết quả đến console bằng hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

1. Lưu ý, các ước tính được tạo ra là các giá trị lôgarít của tỷ số chênh (log odds) và giá trị tham chiếu là giá trị đầu tiên của biến

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

0 (“0-4”).

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

Để thay đổi giá trị tham chiếu của một biến Factor và chuyển giá trị mong muốn lên vị trí đầu tiên, dùng hàm

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

4 (xem chương Factors). Ở ví dụ bên dưới, chúng tôi lấy biến

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

0 và đặt nhóm tuổi “20-29” làm giá trị tham chiếu trước khi chuyển số liệu đã sửa đổi vào hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5.

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

0

In kết quả

Đối với hầu hết các mục đích sử dụng, kết quả đầu ra cần phải có một số sửa đổi. Hàm làm gọn

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

2 trong package broom có những tiện lợi để hiển thị kết quả của mô hình.

Ở đây, chúng tôi trình bày cách để kết hợp các kết quả đầu ra của mô hình vào trong một bảng.

Lấy lũy thừa logarit của ước lượng tỉ số chênh OR và khoảng tin cậy bằng cách đưa mô hình vào hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

2 và thiết lập lũy thừa

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

9 và

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

0.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

1

Bên dưới là bảng kết quả đầu ra của

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

1:

Kết hợp các kết quả của mô hình vào trong một bảng đếm. Dưới đây, chúng tôi tạo một bảng đếm bằng hàm
```
model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)
```
2 từ package janitor, như được đề cập trong chương Bảng mô tả.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

2

Đây là cách mà bảng

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

3 được hiển thị:

Bây giờ chúng ta có thể nối bảng

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

3 và kết quả của mô hình

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

1 lại với nhau theo chiều ngang bằng hàm nối cột

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

6 (dplyr). Hãy nhớ rằng đối với hàm

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

6 các hàng trong hai cấu trúc dữ liệu trên phải được căn chỉnh hoàn hảo. Trong đoạn code này, bởi vì chúng ta đang thực hiện một chuỗi các thuật toán pipe, chúng ta sử dụng dấu

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

8 để đại diện cho đối tượng được nối trong bảng đếm

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

3 khi chúng tôi nối nó với kết quả mô hình

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

1. Để kết thúc quy trình này, chúng ta sử dụng hàm

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

1 để chọn các cột mong muốn và thứ tự của nó, và cuối cùng áp dụng hàm

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

2 trong base R để làm tròn với hai chữ số thập phân cho tất cả các cột.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

3

Đây là hiển thị của cấu trúc đã được kết hợp, nó được xuất gọn gẽ dưới dạng một hình bằng thông qua một hàm trong package flextable. Chương Trình bày bảng giải thích cách tùy chỉnh các bảng như vậy bằng flextable, hoặc có thể sử dụng các gói lệnh khác như knitr hoặc GT.

Vòng lặp cho nhiều mô hình đơn biến

Sau đây chúng tôi trình bày một phương pháp sử dụng

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 và

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

2 để có một cách tiếp cận đơn giản hơn, xem thêm ở phần gtsummary.

Để thực hiện các mô hình cho một số biến giải thích và cho ra các tỷ số chênh trong phân tích đơn biến (nghĩa là không có kiểm soát lẫn nhau), chúng ta có thể sử dụng các cách tiếp cận dưới đây. Sử dụng hàm

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

5 từ package stringr để tạo ra các công thức cho phân tích đơn biến (xem chương Ký tự và chuỗi), thực hiện hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 cho mỗi công thức, chuyển mỗi kết quả đầu ra của

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5 đến hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

2 và cuối cùng thu gọn lại tất các kết quả đầu ra của mô hình bằng hàm nối dòng

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

9 từ tidyr. Phương pháp này sử dụng hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0 từ package purrr để lặp - xem chương [Lặp, vòng lặp và danh sách] để biết thêm thông tin về công cụ này.

Tạo một véctơ tên các cột của biến giải thích. Chúng ta đã tạo biến này

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

1 trong phần chuẩn bị của chương này.

Sử dụng hàm

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

5 để tạo các công thức chuỗi với biến kết cục

## `geom_smooth()` using formula 'y ~ x'

9 ở bên trái và tên một cột của véctơ

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

1 ở bên phải. Dấu chấm

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)

8 trong hàm này thay thế cho tên cột trong véctơ

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

1.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

4

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

5

Đưa các công thức chuỗi này vào hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0 và đặt

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

8 làm hàm áp dụng cho mỗi đầu vào. Bên trong hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

5, thiết lập công thức hồi quy

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

00 trong đó

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

01 sẽ được thay thế bằng các công thức chuỗi đã được tạo bên trên. Hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0 sẽ lặp từng công thức chuỗi và thực hiện hồi quy cho từng công thức.

Kết quả đầu ra của hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0 đầu tiên sẽ được chuyển đến hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0 thứ hai mà sử dụng hàm

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##                          
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0

2 để làm gọn các kết quả đầu ra.

Cuối cùng, kết quả đầu ra của hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0 thứ hai (một danh sách các data frames đã được làm gọn) được tóm tắt bằng hàm nối dòng

## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

9, kết quả cho ra một data frame với tất cả các kết quả đơn biến.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

6

Lúc này, kết quả xuất ra của

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

08 dài hơn bởi vì kết quả bây giờ bao gồm các kết quả đầu ra của một số hồi quy đơn biến. Nhấp nút tiếp theo để xem tất cả các hàng của

# arguments for glm()
glm(formula, family, data, weights, subset, ...)

1.

Như lúc trước, chúng ta có thể tạo một bảng đếm từ bộ số liệu

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

10 cho mỗi biến giải thích, gắn chúng với

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

08, và tạo ra một bảng đẹp. Chúng ta bắt đầu với các biến giải thích này, và lặp lại các biến này thông qua hàm

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()

0. Chúng ta lặp lại qua một hàm do người dùng tạo ra mà liên quan đến việc tạo ra một bảng đếm bằng cách dùng các hàm trong package dplyr Sau đó, kết quả được kết nối trình tự với kết quả của mô hình

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

08.

## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

7

Bên dưới là cấu trúc số liệu kết nối được tạo ra. Xem chương Trình bày bảng để có thêm ý tưởng về cách chuyển đổi bảng số liệu này thành một bảng đẹp trên HTML (ví dụ như với package flextable).

Hỏi Đáp Ví dụ