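We usually look at the shape of a dataset through a histogram, but we rarely count how many observations actually fall into each bin in R. This post therefore summarizes three functions for discretizing a continuous variable: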
    ggplot2::cut_width()
    ggplot2::cut_interval()
    ggplot2::cut_number()
 
Simulating the data
First, we draw 100 values from a standard normal distribution to use in the rest of the post.
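    library(tidyverse)  # provides tibble(), %>%, mutate(), count() and ggplot2
    set.seed(123)       # fix the seed so the counts below are reproducible
    dat1 <- tibble(Num = rnorm(100))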
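
Since the starting point is the histogram, it may help to see that base R can already return the per-bin counts that a histogram would draw. The following is a minimal sketch that does not use ggplot2 at all; the names `x`, `h`, and `bin_counts` are chosen here for illustration and do not come from the original post.

```r
# Base R only: compute histogram bins without drawing the plot.
set.seed(123)
x <- rnorm(100)

# plot = FALSE returns the bin edges and counts instead of a figure.
# The default breaks come from Sturges' rule, rounded to "pretty" values.
h <- hist(x, plot = FALSE)

# One row per bin: lower edge, upper edge, number of observations.
bin_counts <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  n     = h$counts
)
print(bin_counts)
```

The three `cut_*()` functions below do something similar, but return a factor that labels each observation with its bin, which fits more naturally into a `mutate()` and `count()` pipeline.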
 
cut_width
cut_width() cuts the data into bins that are each `width` wide.
    dat1 %>%
      mutate(Interval = cut_width(Num, width = 0.5)) %>%  # refer to the column directly inside mutate()
      count(Interval)
 
       Interval          n
       <fct>         <int>
     1 [-2.75,-2.25]     1
     2 (-2.25,-1.75]     1
     3 (-1.75,-1.25]     4
     4 (-1.25,-0.75]     8
     5 (-0.75,-0.25]    23
     6 (-0.25,0.25]     21
     7 (0.25,0.75]      18
     8 (0.75,1.25]      13
     9 (1.25,1.75]       7
    10 (1.75,2.25]       4
 
The counts can then be drawn as a bar chart:

    dat1 %>%
      mutate(Interval = cut_width(Num, width = 0.5)) %>%
      count(Interval) %>%
      ggplot(aes(x = Interval, y = n)) +
      geom_bar(stat = "identity")  # bar height is taken from n as-is
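
The bin edges above land on -2.75, -2.25, and so on because, when neither `center` nor `boundary` is given, cut_width() places a boundary at half the bin width. The sketch below shows how the `boundary` and `closed` arguments can shift the edges and change which side of each interval is closed; the variable names are illustrative, and the data are re-simulated so the block runs on its own.

```r
library(ggplot2)

set.seed(123)
x <- rnorm(100)

# Default behaviour: a bin boundary sits at width / 2, i.e. at 0.25 here.
default_bins <- cut_width(x, width = 0.5)

# Ask for a bin edge exactly at 0, and make the intervals left-closed: [a, b).
aligned_bins <- cut_width(x, width = 0.5, boundary = 0, closed = "left")

# Compare how the observations are distributed under each binning.
table(default_bins)
table(aligned_bins)
```

Aligning an edge at 0 is often easier to read for data centred on zero, since negative and positive values then never share a bin.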
 

cut_interval
cut_interval() cuts the data into n bins of equal width that together span the range of the data.
    dat1 %>%
      mutate(Interval = cut_interval(Num, n = 10)) %>%  # 10 equal-width bins
      count(Interval)
 
       Interval             n
       <fct>            <int>
     1 [-2.31,-1.86]        2
     2 (-1.86,-1.41]        2
     3 (-1.41,-0.96]       10
     4 (-0.96,-0.511]      10
     5 (-0.511,-0.0609]    22
     6 (-0.0609,0.389]     18
     7 (0.389,0.838]       15
     8 (0.838,1.29]        11
     9 (1.29,1.74]          6
    10 (1.74,2.19]          4
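
To make "n bins of equal width" concrete, the sketch below rebuilds the same binning by hand with base R's `cut()`, using n + 1 evenly spaced break points between the minimum and maximum. This is an illustration of the idea rather than a statement about how ggplot2 implements it internally; `brks`, `by_hand`, and `by_ggplot2` are names made up for this example.

```r
library(ggplot2)

set.seed(123)
x <- rnorm(100)
n_bins <- 10

# n_bins + 1 evenly spaced edges from the smallest to the largest value.
brks <- seq(min(x), max(x), length.out = n_bins + 1)

# include.lowest = TRUE keeps the minimum inside the first bin.
by_hand    <- cut(x, breaks = brks, include.lowest = TRUE)
by_ggplot2 <- cut_interval(x, n = n_bins)

# Put the per-bin counts of both approaches side by side.
cbind(by_hand = table(by_hand), by_ggplot2 = table(by_ggplot2))
```

Because the edges depend on the observed minimum and maximum, a single extreme value stretches every bin, which is one reason to prefer cut_width() when fixed, readable edges matter.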
 

cut_number
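cut_number() cuts the data into n bins that each contain (approximately) the same number of observations, much like splitting the data at its quantiles. Note that n is the number of bins, not the number of observations per bin; in the example below each bin holds 10 observations only because 100 values are split into 10 bins.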
    dat1 %>%
      mutate(Interval = cut_number(Num, n = 10)) %>%  # 10 bins with equal counts
      count(Interval)
 
       Interval            n
       <fct>           <int>
     1 [-2.31,-1.07]      10
     2 (-1.07,-0.626]     10
     3 (-0.626,-0.387]    10
     4 (-0.387,-0.223]    10
     5 (-0.223,0.0618]    10
     6 (0.0618,0.315]     10
     7 (0.315,0.513]      10
     8 (0.513,0.882]      10
     9 (0.882,1.26]       10
    10 (1.26,2.19]        10
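
The counts above are all exactly 10 because 100 divides evenly by 10. As a hedged illustration of the "approximately equal" case, the sketch below uses 103 simulated values (a sample size picked here for the example, not taken from the original post) and also rebuilds the bins by cutting at the deciles with base R.

```r
library(ggplot2)

set.seed(123)
y <- rnorm(103)  # deliberately not a multiple of 10

# Ten bins with roughly equal numbers of observations.
groups <- cut_number(y, n = 10)
counts <- table(groups)
counts

# Base R analogue: use the deciles of y as break points.
deciles <- quantile(y, probs = seq(0, 1, length.out = 11))
by_hand <- cut(y, breaks = deciles, include.lowest = TRUE)
table(by_hand)
```

With continuous data and no ties, the bin sizes should differ by at most one observation. With heavily tied data the quantiles can coincide, in which case cut_number() cannot form n distinct bins and stops with an error.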
 

Reference:
https://ggplot2.tidyverse.org/reference/cut_interval.html
The end.