
R-sampling

This post briefly introduces several resampling methods in R.

First, we generate some data, where x holds the observed values and y their group labels.

# generate random data
set.seed(1111)
x <- c(rnorm(10), rnorm(10, mean=5, sd=5))
y <- c(rep("A", 10), rep("B", 10))
df1 <- data.frame(x,y)
str(df1)
## 'data.frame':    20 obs. of  2 variables:
## $ x: num -0.0866 1.3225 0.6397 1.1748 0.1163 ...
## $ y: chr "A" "A" "A" "A" ...

Permutation

Permutation is essentially resampling without replacement and is typically used for hypothesis testing.

# single permutation 
set.seed(2222)
sample(df1$x, replace = FALSE)
##  [1]  9.31952342 -0.08658011  5.56155482  3.71174074  1.17478657  0.63970204
##  [7] -2.93084636  0.18759838  1.38404752  1.28394086  0.11629031  6.77816542
## [13]  1.11777194  0.11760093 -4.08500801  1.32252443  4.59743697 -2.67140783
## [19]  0.67750806  9.95418174

We can use a permutation test to check whether the values in groups A and B differ.

# permutation test (n=1000)
set.seed(3333)
permt.ls <- list()
for (i in 1:1000) {
  permt.i <- sample(df1$x, replace = FALSE)
  # absolute difference in group means for the permuted sample
  diff.i <- abs(mean(permt.i[1:10]) - mean(permt.i[11:20]))
  permt.ls[[i]] <- c(permt.i, diff.i)
}
permt.df <- Reduce(rbind, permt.ls)
# observed absolute difference in group means
diff.raw <- abs(mean(df1$x[1:10]) - mean(df1$x[11:20]))
# p-value: proportion of permuted differences at least as large as the observed one
mean(permt.df[,21] >= diff.raw)
## [1] 0.079

After each permutation we recompute the difference between the two group means. If the permuted between-group difference is at least as large as the original between-group difference, we count it as an extreme event; the proportion of such events among all permutations gives the permutation-test p-value.

Here, 79 of the 1000 permutations produced an extreme event, so the p-value is 0.079.
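
The same test can be written more compactly with replicate(); the snippet below is just an equivalent sketch of the loop above, not part of the original code.

# equivalent sketch of the permutation test above using replicate()
set.seed(3333)
perm.diffs <- replicate(1000, {
  permt <- sample(df1$x, replace = FALSE)
  abs(mean(permt[1:10]) - mean(permt[11:20]))
})
# p-value: proportion of permuted differences at least as large as the observed one
mean(perm.diffs >= diff.raw)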

Bootstrap

The bootstrap is a resampling method with replacement, typically used for parameter estimation.

# single bootstrap
set.seed(4444)
sample(df1$x, replace = TRUE)
##  [1]  1.1177719  9.9541817 -2.9308464  0.1875984 -2.9308464 -4.0850080
##  [7] -4.0850080  4.5974370  0.1162903  0.6397020  6.7781654  4.5974370
## [13]  1.3225244  9.3195234  1.2839409 -4.0850080  0.6397020  1.1177719
## [19]  0.6775081  3.7117407

For example, we can use the bootstrap to estimate the population mean.

set.seed(5555)
mean.raw <- mean(df1$x)
mean.i <- c()
# bootstrap (n=1000)
for (i in 1:1000) {
  boot.i <- sample(df1$x, replace = TRUE)
  mean.i[i] <- mean(boot.i)
}
# bootstrap estimate of the mean: average of the 1000 bootstrap means
mean.boot <- mean(mean.i)
mean.raw; mean.boot
## [1] 1.908527

## [1] 1.956184

We can also compute the standard error (SE) of the bootstrap estimate of the mean.

# standard error of the bootstrap means
mean.se.boot <- sqrt(sum((mean.i - mean.boot)^2)/(1000-1))
mean.se.boot
## [1] 0.788522
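
As a cross-check (not in the original post, and assuming the boot package is installed), a comparable bootstrap SE can be obtained with boot::boot(). The value will not match exactly because boot() consumes the random number stream differently.

# optional cross-check with the boot package (assumed installed)
library(boot)
set.seed(5555)
boot.out <- boot(df1$x, statistic = function(d, idx) mean(d[idx]), R = 1000)
sd(boot.out$t[, 1])   # should be close to mean.se.boot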

Jackknife

The jackknife can be viewed as a leave-one-out resampling method: for a dataset of size k, it produces k samples, each of size k-1.

# jackknife: leave one observation out at a time
jack.ls <- list()

for (i in 1:nrow(df1)) {
  jack.ls[[i]] <- df1[-i,]
}

length(jack.ls); dim(jack.ls[[1]])
## [1] 20

## [1] 19 2
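
A minimal sketch (not in the original post) of how these leave-one-out samples are typically used: the jackknife estimate of the standard error of the mean.

# jackknife SE of the mean from the leave-one-out samples (illustrative sketch)
jack.means <- sapply(jack.ls, function(d) mean(d$x))
n <- length(jack.means)
jack.se <- sqrt((n - 1) / n * sum((jack.means - mean(jack.means))^2))
jack.se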

Cross validation

Cross-validation splits the data into a training set and a validation set and is commonly used in model fitting. For example, k-fold cross-validation divides the data into k non-overlapping subsets.

Figure: Illustration of 5-fold CV (https://yey.world/2020/08/31/MAST90083-05/).

# single train/validation split (70/30 holdout)
set.seed(6666)
training_size <- round(nrow(df1)*0.7)
training_idx <- sample(nrow(df1), size = training_size, replace = FALSE)
training_set <- df1[training_idx,]
val_set <- df1[-training_idx,]

nrow(training_set); nrow(val_set)
## [1] 14

## [1] 6
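
For reference, when each fold can simply contain nrow(df1)/k rows, a standard non-overlapping 5-fold assignment looks like the sketch below (a hypothetical seed, not from the original post); every row is used exactly once.

# standard non-overlapping 5-fold assignment (each row used exactly once)
set.seed(8888)  # hypothetical seed
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(df1)))
folds <- split(seq_len(nrow(df1)), fold_id)
sapply(folds, length)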

Sometimes, however, the sample size is too small to form k non-overlapping folds of the desired size (below, 5 folds of 6 rows each would require 30 rows, but we only have 20), so some samples are unavoidably reused across folds.

# 5-fold cv (fold size = val_size, so some rows are reused)
set.seed(7777)
fold_idx <- list()
val_size <- nrow(df1) - training_size
all_idx <- 1:nrow(df1)
for (i in 1:5) {
  # indices already used by earlier folds
  fold_idx_i <- unique(unlist(fold_idx))
  if (all(all_idx %in% fold_idx_i) | is.null(fold_idx_i)) {
    # first fold, or every index already used: sample freely from all rows
    fold_idx[[i]] <- sample(all_idx, val_size, replace = FALSE)
  } else if (val_size <= length(all_idx[-fold_idx_i])) {
    # enough unused indices left: sample only from those
    fold_idx[[i]] <- sample(all_idx[-fold_idx_i], val_size, replace = FALSE)
  } else {
    # not enough unused indices: take them all, then top up from the full set
    fold_idx[[i]] <- c(all_idx[-fold_idx_i], sample(all_idx, val_size-length(all_idx[-fold_idx_i]), replace = FALSE))
  }
}
fold_data <- lapply(fold_idx, function(i) {
  list(training = df1[-i,], validation = df1[i,])
})
fold_idx
## [[1]]
## [1] 16  3 19  5 15 14
##
## [[2]]
## [1] 13  1 18  4  9 17
##
## [[3]]
## [1]  7 20  6  8  2 12
##
## [[4]]
## [1] 10 11 14 13  7 15
##
## [[5]]
## [1]  3 10  2  5 15 12
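
A hypothetical usage sketch (not in the original post): loop over fold_data, "fit" on the training part and score on the validation part. Here the training mean of x is used as a stand-in for a real model, evaluated by squared error.

# illustrative use of fold_data: training mean of x as a stand-in model,
# scored by mean squared error on the validation rows
cv.err <- sapply(fold_data, function(fold) {
  fit <- mean(fold$training$x)
  mean((fold$validation$x - fit)^2)
})
cv.err
mean(cv.err)  # cross-validated error estimate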

That concludes this brief introduction to resampling methods.

Ref:

https://stats.stackexchange.com/questions/104040/resampling-simulation-methods-monte-carlo-bootstrapping-jackknifing-cross

https://yey.world/2020/08/31/MAST90083-05/