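We usually inspect the structure of data with a histogram, but we less often count, in R, how many observations actually fall into each interval. This post therefore summarizes three functions for discretizing a continuous variable: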
    ggplot2::cut_width()
    ggplot2::cut_interval()
    ggplot2::cut_number()
Simulating the data
First, we simulate 100 values from a normal distribution to use in the rest of the analysis.
    library(tidyverse)
    set.seed(123)
    dat1 <- tibble(Num = rnorm(100))
cut_width
cut_width() cuts the data into intervals of a fixed width, set by the width argument.
    dat1 %>%
      mutate(Interval = cut_width(Num, width = 0.5)) %>%
      count(Interval)
       Interval          n
       <fct>         <int>
     1 [-2.75,-2.25]     1
     2 (-2.25,-1.75]     1
     3 (-1.75,-1.25]     4
     4 (-1.25,-0.75]     8
     5 (-0.75,-0.25]    23
     6 (-0.25,0.25]     21
     7 (0.25,0.75]      18
     8 (0.75,1.25]      13
     9 (1.25,1.75]       7
    10 (1.75,2.25]       4
    dat1 %>%
      mutate(Interval = cut_width(Num, width = 0.5)) %>%
      count(Interval) %>%
      ggplot(aes(x = Interval, y = n)) +
      geom_col()
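As a small sketch of how the bin edges can be moved: cut_width() also accepts boundary, center and closed arguments (see the ggplot2 reference page linked at the end). The object names below are only illustrative, and x is assumed to be the same simulated vector as dat1$Num.

```r
library(ggplot2)

set.seed(123)
x <- rnorm(100)  # assumed to match dat1$Num from the simulation step

# Default arguments: the edges are offset by width / 2, so one bin is centered on 0
default_bins <- cut_width(x, width = 0.5)

# boundary = 0 moves the edges onto multiples of the width (..., -0.5, 0, 0.5, ...)
aligned_bins <- cut_width(x, width = 0.5, boundary = 0)

# closed = "left" makes the intervals [a, b) instead of (a, b]
left_closed <- cut_width(x, width = 0.5, boundary = 0, closed = "left")

# Frequency of each interval after aligning the edges
table(aligned_bins)
```

Aligning the edges with boundary tends to give labels that are easier to read, at the cost of no longer having a bin centered on zero.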
cut_interval
cut_interval() cuts the data into n intervals of equal width.
    dat1 %>%
      mutate(Interval = cut_interval(Num, n = 10)) %>%
      count(Interval)
       Interval             n
       <fct>            <int>
     1 [-2.31,-1.86]        2
     2 (-1.86,-1.41]        2
     3 (-1.41,-0.96]       10
     4 (-0.96,-0.511]      10
     5 (-0.511,-0.0609]    22
     6 (-0.0609,0.389]     18
     7 (0.389,0.838]       15
     8 (0.838,1.29]        11
     9 (1.29,1.74]          6
    10 (1.74,2.19]          4
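One way to see what cut_interval() is doing is to build the equal-width breaks by hand with base R. This is a sketch of the idea, not a claim about how ggplot2 is implemented internally; manual_breaks and manual_bins are names made up for illustration, and x is assumed to be the same simulated vector as dat1$Num.

```r
library(ggplot2)

set.seed(123)
x <- rnorm(100)  # assumed to match dat1$Num from the simulation step

# n = 10 intervals need 11 equally spaced break points from min(x) to max(x)
manual_breaks <- seq(min(x), max(x), length.out = 11)

# include.lowest = TRUE keeps the minimum inside the first interval
manual_bins <- cut(x, breaks = manual_breaks, include.lowest = TRUE)

# The hand-built bins should hold the same counts as cut_interval()
ggplot_bins <- cut_interval(x, n = 10)
rbind(manual = as.integer(table(manual_bins)),
      ggplot2 = as.integer(table(ggplot_bins)))
```

Because the first and last breaks are the minimum and maximum of the data, the outer intervals start and end exactly at the observed range, unlike the cut_width() bins.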
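cut_number
cut_number() splits the data into n groups that each hold (approximately) the same number of observations, much like cutting at the n-quantiles. Note that n is the number of groups, not the number of observations per group.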
    dat1 %>%
      mutate(Interval = cut_number(Num, n = 10)) %>%
      count(Interval)
       Interval            n
       <fct>           <int>
     1 [-2.31,-1.07]      10
     2 (-1.07,-0.626]     10
     3 (-0.626,-0.387]    10
     4 (-0.387,-0.223]    10
     5 (-0.223,0.0618]    10
     6 (0.0618,0.315]     10
     7 (0.315,0.513]      10
     8 (0.513,0.882]      10
     9 (0.882,1.26]       10
    10 (1.26,2.19]        10
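The quantile analogy can be checked with a short base R sketch: cut the data at its deciles and compare the counts with cut_number(). The names q_breaks, manual_bins and thirds are illustrative only, x is assumed to be the same simulated vector as dat1$Num, and this is meant as an approximation of the idea, not a description of the ggplot2 internals.

```r
library(ggplot2)

set.seed(123)
x <- rnorm(100)  # assumed to match dat1$Num from the simulation step

# Deciles: 11 quantile break points for 10 groups
q_breaks <- quantile(x, probs = seq(0, 1, length.out = 11))

# include.lowest = TRUE keeps the minimum inside the first group
manual_bins <- cut(x, breaks = q_breaks, include.lowest = TRUE)
table(manual_bins)

# When n does not divide the sample size, the groups are only roughly equal
thirds <- cut_number(x, n = 3)
table(thirds)
```

The second call shows why "approximately" matters: 100 observations cannot be split into three groups of identical size, so the counts differ by at most one.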
Ref:
https://ggplot2.tidyverse.org/reference/cut_interval.html
The end.