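R: Counting the frequency of a continuous variable in each interval

We usually inspect the structure of data with a histogram, but we rarely deal with how to actually count the observations falling into each interval in R. This post therefore summarizes three functions for discretizing a continuous variable: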

ggplot2::cut_width()
ggplot2::cut_interval()
ggplot2::cut_number()

Simulating data

First, we simulate 100 values from a normal distribution for the analysis below.

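library(tidyverse)  # provides tibble(), the dplyr verbs, %>% and ggplot2

set.seed(123)  # fix the seed so the simulation is reproducible
dat1 <- tibble(Num = rnorm(100))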
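
As a point of comparison, base R's hist() already computes per-interval counts when it draws a histogram. The following is only a minimal sketch and not part of the workflow in this post: it re-simulates the same values into a plain vector (the names x and h are purely illustrative) and asks hist() to return the counts instead of plotting them.

```r
# Re-create the simulated values as a plain numeric vector
set.seed(123)
x <- rnorm(100)

# plot = FALSE returns the histogram object without drawing anything
h <- hist(x, plot = FALSE)

h$breaks  # interval boundaries chosen by hist()
h$counts  # number of observations in each interval
```

The drawback is that hist() picks the boundaries for you unless you pass breaks yourself, which is where the three cut_*() helpers come in.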

cut_width

cut_width() cuts the data into intervals of length width.

dat1 %>%
  mutate(Interval = cut_width(Num, width = 0.5)) %>%
  count(Interval)
# A tibble: 10 x 2
   Interval          n
   <fct>         <int>
 1 [-2.75,-2.25]     1
 2 (-2.25,-1.75]     1
 3 (-1.75,-1.25]     4
 4 (-1.25,-0.75]     8
 5 (-0.75,-0.25]    23
 6 (-0.25,0.25]     21
 7 (0.25,0.75]      18
 8 (0.75,1.25]      13
 9 (1.25,1.75]       7
10 (1.75,2.25]       4

The counts can then be drawn as a bar chart:

dat1 %>%
  mutate(Interval = cut_width(Num, width = 0.5)) %>%
  count(Interval) %>%
  ggplot(aes(x = Interval, y = n)) +
  geom_col()  # equivalent to geom_bar(stat = 'identity')
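
For readers without ggplot2 at hand, roughly the same binning can be sketched in base R with cut() and table(). The break points here are written out by hand to match the intervals in the cut_width() output above, which is an assumption tied to this particular simulated sample; x, brk, bins and counts are illustrative names.

```r
set.seed(123)
x <- rnorm(100)

# Boundaries every 0.5 units from -2.75 to 2.25 (taken from the cut_width() output)
brk <- seq(-2.75, 2.25, by = 0.5)

# right = TRUE gives (a, b] intervals; include.lowest = TRUE closes the first one
bins <- cut(x, breaks = brk, right = TRUE, include.lowest = TRUE)

counts <- table(bins)
counts
```

One difference worth noting: table() also lists intervals that contain no observations, whereas count() drops them by default.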

cut_interval

cut_interval() cuts the data into n intervals of equal length.

dat1 %>%
  mutate(Interval = cut_interval(Num, n = 10)) %>%
  count(Interval)
# A tibble: 10 x 2
   Interval             n
   <fct>            <int>
 1 [-2.31,-1.86]        2
 2 (-1.86,-1.41]        2
 3 (-1.41,-0.96]       10
 4 (-0.96,-0.511]      10
 5 (-0.511,-0.0609]    22
 6 (-0.0609,0.389]     18
 7 (0.389,0.838]       15
 8 (0.838,1.29]        11
 9 (1.29,1.74]          6
10 (1.74,2.19]          4
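
The idea behind cut_interval() can be sketched in base R by placing n + 1 equally spaced break points between the minimum and the maximum of the data. This illustrates the concept only and is not ggplot2's actual source code; n_bins and the other names are illustrative, and the interval labels may be rounded differently.

```r
set.seed(123)
x <- rnorm(100)
n_bins <- 10

# n_bins + 1 equally spaced boundaries spanning the observed range
brk <- seq(min(x), max(x), length.out = n_bins + 1)

# include.lowest = TRUE keeps the minimum inside the first interval
bins <- cut(x, breaks = brk, include.lowest = TRUE)
counts <- table(bins)
counts

# Every interval has the same width (up to floating-point error)
diff(brk)
```

Because the boundaries depend only on the range, the counts per interval can vary a lot, as the output above shows.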

cut_number

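cut_number() splits the data into n groups that each contain (approximately) the same number of observations, which is similar to cutting at the n-quantiles. Note that n is the number of groups, not the number of observations per group; with 100 values and n = 10, each group happens to contain 10 observations.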

dat1 %>%
  mutate(Interval = cut_number(Num, n = 10)) %>%
  count(Interval)
# A tibble: 10 x 2
   Interval            n
   <fct>           <int>
 1 [-2.31,-1.07]      10
 2 (-1.07,-0.626]     10
 3 (-0.626,-0.387]    10
 4 (-0.387,-0.223]    10
 5 (-0.223,0.0618]    10
 6 (0.0618,0.315]     10
 7 (0.315,0.513]      10
 8 (0.513,0.882]      10
 9 (0.882,1.26]       10
10 (1.26,2.19]        10
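
Since cut_number() amounts to cutting at quantiles, a base R sketch can use quantile() to produce the break points. As before, this is an illustration under the assumption that the data contain no tied values, with illustrative variable names; it is not meant as a drop-in replacement.

```r
set.seed(123)
x <- rnorm(100)
n_bins <- 10

# Break points at the 0%, 10%, ..., 100% quantiles of the data
brk <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))

# include.lowest = TRUE keeps the minimum inside the first group
bins <- cut(x, breaks = brk, include.lowest = TRUE)
counts <- table(bins)
counts
```

When the sample size is not a multiple of the number of groups, the group sizes differ slightly, and with heavily tied data the quantile break points may coincide, in which case cut() stops with an error about non-unique breaks.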

Ref:

https://ggplot2.tidyverse.org/reference/cut_interval.html

End.