
R-model-predict

Modeling and prediction with R

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0

## Warning: package 'ggplot2' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

data("cars")
head(cars)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10

Linear Regression

Model

We start by fitting a simple linear regression model.

model <- lm(dist~speed, cars)
summary(model)

##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

The fitted model is: dist = -17.579 + 3.932*speed.
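As a quick check of the equation above, the coefficients can be pulled out with coef() and applied by hand. This is only a minimal sketch; the variable names b, manual and auto are illustrative and not part of the original post.

```r
# Refit the model from the text so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Named vector of estimates: "(Intercept)" and "speed"
b <- coef(model)

# Apply dist = intercept + slope * speed by hand for speed = 10
manual <- b[["(Intercept)"]] + b[["speed"]] * 10

# predict() evaluates the same equation on new data
auto <- predict(model, newdata = data.frame(speed = 10))

print(c(manual = manual, predict = unname(auto)))
```

Both numbers should agree, which confirms that predict simply evaluates the fitted equation at the new speed.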

Prediction

Confidence interval

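Use the predict function to predict on new data. With interval = "confidence" it returns each fitted value together with the 95% confidence interval for the mean response at that speed.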

speeds <- data.frame(speed=c(10, 20, 53))
predict(model, newdata = speeds, interval = "confidence")

## fit lwr upr
## 1 21.74499 15.46192 28.02807
## 2 61.06908 55.24729 66.89088
## 3 190.83857 159.12292 222.55422
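To make clear where these bounds come from, the confidence interval can be rebuilt from the standard errors that predict returns with se.fit = TRUE. The sketch below assumes the default 95% level; the names pr, t_crit, manual_ci and builtin_ci are illustrative.

```r
# Refit the model and recreate the new data from the text
model  <- lm(dist ~ speed, data = cars)
speeds <- data.frame(speed = c(10, 20, 53))

# se.fit = TRUE adds the standard error of the mean response
# and the residual degrees of freedom ($df)
pr <- predict(model, newdata = speeds, se.fit = TRUE)

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = pr$df)

# Confidence interval: fit +/- t * se(fit)
manual_ci <- cbind(fit = pr$fit,
                   lwr = pr$fit - t_crit * pr$se.fit,
                   upr = pr$fit + t_crit * pr$se.fit)

# Compare with the interval computed by predict() itself
builtin_ci <- predict(model, newdata = speeds, interval = "confidence")
print(all.equal(manual_ci, builtin_ci, check.attributes = FALSE))
```

The interval for speed = 53 is much wider than the other two because that speed lies far outside the observed range of cars$speed (4 to 25), where the standard error of the fitted mean grows quickly.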

Prediction interval

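With interval = "prediction", predict returns the 95% prediction interval for a single new observation at each input speed. It is wider than the confidence interval because it also accounts for the residual variation around the mean.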

predict(model, newdata = speeds, interval = "prediction")

## fit lwr upr
## 1 21.74499 -9.809601 53.29959
## 2 61.06908 29.603089 92.53507
## 3 190.83857 146.542994 235.13415
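The prediction interval can be reconstructed in the same way, by adding the residual variance to the variance of the fitted mean. This is a sketch under the default 95% level; pr, t_crit, pred_se, manual_pi and builtin_pi are illustrative names.

```r
# Refit the model and recreate the new data from the text
model  <- lm(dist ~ speed, data = cars)
speeds <- data.frame(speed = c(10, 20, 53))

pr     <- predict(model, newdata = speeds, se.fit = TRUE)
t_crit <- qt(0.975, df = pr$df)

# Standard error for a new observation:
# uncertainty of the mean plus the residual standard deviation
pred_se <- sqrt(pr$se.fit^2 + pr$residual.scale^2)

manual_pi <- cbind(fit = pr$fit,
                   lwr = pr$fit - t_crit * pred_se,
                   upr = pr$fit + t_crit * pred_se)

# Compare with the interval computed by predict() itself
builtin_pi <- predict(model, newdata = speeds, interval = "prediction")
print(all.equal(manual_pi, builtin_pi, check.attributes = FALSE))
```

The extra residual.scale term is the only difference from the confidence interval, which is why the prediction interval can extend below zero for speed = 10 even though a negative stopping distance is not physically meaningful.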

Visualizing the predictions

# 1. Add predictions
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
p <- ggplot(mydata, aes(speed, dist)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ x)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  theme_bw()

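In the plot:

  • The blue line is the fitted linear regression line

  • The gray band is the confidence interval

  • The red dashed lines mark the prediction interval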

GLM

In R, generalized linear models are fitted with the glm function.

Model

The family argument selects the error distribution and link function of the model to fit.

glm.model <- glm(dist~speed, data = cars, family = gaussian)
summary(glm.model)

##
## Call:
## glm(formula = dist ~ speed, family = gaussian, data = cars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 236.5317)
##
## Null deviance: 32539 on 49 degrees of freedom
## Residual deviance: 11354 on 48 degrees of freedom
## AIC: 419.16
##
## Number of Fisher Scoring iterations: 2

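The fitted model is: dist = -17.579 + 3.932*speed.

The coefficients are identical to those of the simple linear regression, because a Gaussian family with its default identity link is the same model that lm fits.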
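The agreement between the two fits can be checked directly rather than by reading the summaries. The sketch below refits both models so that it runs on its own; same_coef and same_var are illustrative names.

```r
# Fit the same formula with lm() and with a Gaussian glm()
model     <- lm(dist ~ speed, data = cars)
glm.model <- glm(dist ~ speed, data = cars, family = gaussian)

# Coefficients should match up to numerical tolerance
same_coef <- all.equal(coef(model), coef(glm.model))

# The glm dispersion parameter is the squared residual standard error of lm
same_var <- all.equal(summary(glm.model)$dispersion, sigma(model)^2)

print(c(coefficients = isTRUE(same_coef), variance = isTRUE(same_var)))
```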

Prediction

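When predicting from a glm object, set se.fit = TRUE to also return the standard error of each prediction and the residual scale (the residual standard deviation) used to compute those standard errors.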

predict(glm.model, newdata = speeds, se.fit = TRUE)

## $fit
## 1 2 3
## 21.74499 61.06908 190.83857
##
## $se.fit
## 1 2 3
## 3.124921 2.895501 15.773951
##
## $residual.scale
## [1] 15.37959
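Unlike predict for lm objects, predict for glm objects does not compute intervals itself, so they have to be built from se.fit. The sketch below does this for the Gaussian model used here, where a t-based interval on the response scale is appropriate; for other families the interval is usually built on the link scale and then transformed back, which this example does not cover. The names pr, t_crit and glm_ci are illustrative.

```r
# Refit the Gaussian glm and recreate the new data from the text
glm.model <- glm(dist ~ speed, data = cars, family = gaussian)
speeds    <- data.frame(speed = c(10, 20, 53))

pr <- predict(glm.model, newdata = speeds, se.fit = TRUE)

# Two-sided 95% critical value using the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(glm.model))

# Confidence interval for the mean response: fit +/- t * se(fit)
glm_ci <- data.frame(fit = pr$fit,
                     lwr = pr$fit - t_crit * pr$se.fit,
                     upr = pr$fit + t_crit * pr$se.fit)
print(glm_ci)
```

For this model the result should match the output of predict(model, newdata = speeds, interval = "confidence") from the linear regression section.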

LOESS regression

LOESS (Local Polynomial Regression Fitting) can also be used to fit and predict.

cars %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = y ~ x, span = 1) + # span: smoothing parameter, larger = smoother
  theme_classic()

Fitting only

With default settings, a loess model can only predict values inside the range of the original data; inputs outside that range return NA.

cars.lo <- loess(dist~speed, cars)
predict(cars.lo, speeds)

## 1 2 3
## 21.86532 56.46132 NA
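The NA for the third value follows from the observed range of the predictor. A small check like the one below flags which new inputs fall outside that range before predicting; rng, in_range and pred are illustrative names.

```r
# Refit the default loess model and recreate the new data from the text
cars.lo <- loess(dist ~ speed, data = cars)
speeds  <- data.frame(speed = c(10, 20, 53))

# Observed range of the predictor in the training data
rng <- range(cars$speed)

# TRUE where the new speed lies inside the training range
in_range <- speeds$speed >= rng[1] & speeds$speed <= rng[2]

pred <- predict(cars.lo, speeds)

# Inputs outside the range are the ones predicted as NA
print(data.frame(speed = speeds$speed,
                 in_range = in_range,
                 pred = as.vector(pred)))
```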

Extrapolation

To predict beyond that range with loess, set control = loess.control(surface = "direct").

cars.lo2 <- loess(dist ~ speed, cars,
                  control = loess.control(surface = "direct"))
predict(cars.lo2, speeds, se = TRUE)

## $fit
## 1 2 3
## 21.86532 56.44526 963.89286
##
## $se.fit
## 1 2 3
## 4.119331 4.061865 467.666621
##
## $residual.scale
## [1] 15.31087
##
## $df
## [1] 44.55085

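Treat such extrapolated values with caution: here the prediction at speed = 53 is about 963.9 with a standard error of about 467.7. The span of the loess smoother also matters: if it is too small, the fit follows the original data too closely (overfitting) and prediction accuracy suffers.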
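One way to see the effect of span is to fit the same data with several values and compare the in-sample residual sum of squares. The span values below are arbitrary choices for illustration, and spans and rss are illustrative names.

```r
# Candidate span values (illustrative, not a recommendation)
spans <- c(0.5, 0.75, 1)

# In-sample residual sum of squares for each span
rss <- sapply(spans, function(s) {
  fit <- loess(dist ~ speed, data = cars, span = s)
  sum(residuals(fit)^2)
})

print(data.frame(span = spans, rss = rss))
```

A smaller span generally tracks the training data more closely, so a low in-sample error alone does not mean better predictions; comparing errors on held-out data would be a more reliable guide.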

That concludes this brief overview of fitting and predicting with several regression models in R.

The end.

References

https://www.journaldev.com/45290/predict-function-in-r

http://www.sthda.com/english/articles/40-regression-analysis/166-predict-in-r-model-predictions-and-confidence-intervals/