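This post introduces a few basic functions from the dplyr package. Once you know them, you can handle most routine data-processing work. They are:

filter(): keeps the rows that meet given conditions, much like the filter feature in Excel
arrange(): reorders the rows of a data frame
select(): picks variables (columns) by name
mutate(): creates new variables from existing ones
%>%: the pipe operator, which passes the variable or function result on its left to the function on its right

The code below uses the data in the nycflights13 package for demonstration.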

install.packages('nycflights13')
library(dplyr)
library(nycflights13)

nycflights13 contains the following datasets:

airlines: airline names and their carrier codes

airports: airport information

flights: flight information

planes: aircraft information

weather: weather data

Filtering rows | filter

First, let's look at the filter() function.

filter(.data, ..., .preserve = FALSE)


#...	Logical predicates defined in terms of the variables in .data. 
#       Multiple conditions are combined with &. 
#       Only rows where the condition evaluates to TRUE are kept.

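As the signature shows, the function has two main parts: the input data, and the logical conditions. Only rows for which the conditions evaluate to TRUE are kept.

Next, let's use filter() to extract the flights of January 1st. Wrapping the assignment in parentheses saves the result and prints it at the same time:

> (jan1 <- filter(flights, month == 1, day == 1))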

# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      542            540         2      923            850        33
 4  2013     1     1      544            545        -1     1004           1022       -18
 5  2013     1     1      554            600        -6      812            837       -25
 6  2013     1     1      554            558        -4      740            728        12
 7  2013     1     1      555            600        -5      913            854        19
 8  2013     1     1      557            600        -3      709            723       -14
 9  2013     1     1      557            600        -3      838            846        -8
10  2013     1     1      558            600        -2      753            745         8
# ... with 832 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
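To make the rules for combining conditions concrete, here is a minimal sketch. It uses a small made-up data frame (`toy`, with invented values) in place of flights so that it runs without nycflights13; only the column names are borrowed from the real dataset.

```r
library(dplyr)

# Toy data standing in for flights; the values are invented for illustration
toy <- tibble(
  month     = c(1, 1, 11, 12, 12),
  day       = c(1, 2, 15, 24, 25),
  dep_delay = c(2, NA, 30, -5, 120)
)

# Comma-separated conditions are combined with AND
jan1 <- filter(toy, month == 1, day == 1)

# OR has to be written explicitly with |, or more compactly with %in%
nov_dec  <- filter(toy, month == 11 | month == 12)
nov_dec2 <- filter(toy, month %in% c(11, 12))

# Rows where the condition evaluates to NA are dropped as well,
# not only the rows where it is FALSE
delayed <- filter(toy, dep_delay > 0)
```

Because filter() keeps only rows where the condition is TRUE, the row with a missing dep_delay disappears from `delayed`; add `is.na(dep_delay) |` to the condition if such rows should be kept.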

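Sorting | arrange

arrange() is just as concise: it takes the data and reorders the rows by the given columns, in ascending or descending order.

# With several sort columns, each later column breaks ties left by the earlier ones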

> arrange(flights, year, month, day)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      542            540         2      923            850        33
 4  2013     1     1      544            545        -1     1004           1022       -18
 5  2013     1     1      554            600        -6      812            837       -25
 6  2013     1     1      554            558        -4      740            728        12
 7  2013     1     1      555            600        -5      913            854        19
 8  2013     1     1      557            600        -3      709            723       -14
 9  2013     1     1      557            600        -3      838            846        -8
10  2013     1     1      558            600        -2      753            745         8
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Combine it with desc() to sort in descending order:

> arrange(flights, desc(day))

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1    31        1           2100       181      124           2225       179
 2  2013     1    31        4           2359         5      455            444        11
 3  2013     1    31        7           2359         8      453            437        16
 4  2013     1    31       12           2250        82      132              7        85
 5  2013     1    31       26           2154       152      328             50       158
 6  2013     1    31       34           2159       155      135           2315       140
 7  2013     1    31       37           2249       108      132           2357        95
 8  2013     1    31       54           2250       124      152           2359       113
 9  2013     1    31      453            500        -7      651            648         3
10  2013     1    31      522            525        -3      820            820         0
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
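Ascending and descending keys can be mixed in one call. The sketch below uses a small made-up data frame (`toy`, invented values) and also shows where missing values end up.

```r
library(dplyr)

# Toy data; the values are invented for illustration
toy <- tibble(
  month     = c(2, 1, 1, 2),
  day       = c(5, 9, 3, 5),
  dep_delay = c(10, NA, -4, 45)
)

# Ascending by month; ties within a month are broken by day, descending
by_month_day <- arrange(toy, month, desc(day))

# Missing values are placed at the end, even when sorting with desc()
by_delay <- arrange(toy, desc(dep_delay))
```

Note that desc() wraps a single column, so each key chooses its own direction independently.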

Selecting columns | select

select() picks columns by name. The following returns the year, month and day columns of the flights dataset:

> select(flights, year, month, day)

# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ... with 336,766 more rows

It can also be combined with several helper functions.

starts_with("a"): picks columns whose names start with "a"

> flights %>% select(starts_with('a'))
# A tibble: 336,776 x 3
   arr_time arr_delay air_time
      <int>     <dbl>    <dbl>
 1      830        11      227
 2      850        20      227
 3      923        33      160
 4     1004       -18      183
 5      812       -25      116
 6      740        12      150
 7      913        19      158
 8      709       -14       53
 9      838        -8      140
10      753         8      138
# ... with 336,766 more rows

ends_with("y"): picks columns whose names end with "y"

> flights %>% select(ends_with('y'))
# A tibble: 336,776 x 3
     day dep_delay arr_delay
   <int>     <dbl>     <dbl>
 1     1         2        11
 2     1         4        20
 3     1         2        33
 4     1        -1       -18
 5     1        -6       -25
 6     1        -4        12
 7     1        -5        19
 8     1        -3       -14
 9     1        -3        -8
10     1        -2         8
# ... with 336,766 more rows

contains("m"): picks columns whose names contain "m"

> flights %>% select(contains('m')) %>% print(n=3)
# A tibble: 336,776 x 9
  month dep_time sched_dep_time arr_time sched_arr_time tailnum air_time minute time_hour          
  <int>    <int>          <int>    <int>          <int> <chr>      <dbl>  <dbl> <dttm>             
1     1      517            515      830            819 N14228       227     15 2013-01-01 05:00:00
2     1      533            529      850            830 N24211       227     29 2013-01-01 05:00:00
3     1      542            540      923            850 N619AA       160     40 2013-01-01 05:00:00
# ... with 336,773 more rows

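matches(): matches column names against a regular expression and returns the matching columns. The example below returns the columns whose names start with "a" or end with "r".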

> flights %>% select(matches('^a|r$')) %>% print(n=3)
# A tibble: 336,776 x 7
   year arr_time arr_delay carrier air_time  hour time_hour          
  <int>    <int>     <dbl> <chr>      <dbl> <dbl> <dttm>             
1  2013      830        11 UA           227     5 2013-01-01 05:00:00
2  2013      850        20 UA           227     5 2013-01-01 05:00:00
3  2013      923        33 AA           160     5 2013-01-01 05:00:00
# ... with 336,773 more rows

everything(): all columns, without repeating those already named. This is handy for moving certain columns to the front.

> flights %>% select(hour, minute, everything()) %>% print(n=3)
# A tibble: 336,776 x 19
   hour minute  year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <dbl>  <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1     5     15  2013     1     1      517            515         2      830            819
2     5     29  2013     1     1      533            529         4      850            830
3     5     40  2013     1     1      542            540         2      923            850
# ... with 336,773 more rows, and 9 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, time_hour <dttm>

num_range("x", 1:3): picks the columns named "x1", "x2" and "x3"
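flights has no numbered columns, so here is a minimal sketch of num_range() on a made-up data frame (`wide`, with hypothetical columns x1, x2, x3 and x10).

```r
library(dplyr)

# Toy data with numbered columns; names and values are invented
wide <- tibble(
  id  = 1:2,
  x1  = c(0.1, 0.2),
  x2  = c(1, 2),
  x3  = c(10, 20),
  x10 = c(5, 6)
)

# Keeps x1, x2 and x3; id and x10 fall outside the requested range
picked <- select(wide, num_range("x", 1:3))
names(picked)
```

Unlike starts_with("x"), this does not pick up x10, because the match is on the numeric suffix rather than on the prefix alone.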

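In addition, select() supports several R operators:

: selects a range, for example select(1:3) picks the first three columns
! returns the columns that do not match, for example select(!ends_with("y")) returns the columns whose names do not end with "y"
& and | return the intersection or union of matches, for example select(starts_with('a') & ends_with('y')) returns the columns whose names start with "a" and end with "y"
c() combines selections, for example select(c(starts_with('a'),ends_with('y'))) returns the columns whose names start with "a" or end with "y"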
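The operators above can be tried on a one-row made-up data frame (`toy`, invented values) that reuses a few flights column names. This is a sketch and assumes a version of dplyr recent enough to support `!` and `&` inside select().

```r
library(dplyr)

# One-row toy data; only the column names matter here
toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11, air_time = 227
)

first_three <- names(select(toy, 1:3))                  # by position
not_y       <- names(select(toy, !ends_with("y")))      # complement
a_and_y     <- names(select(toy, starts_with("a") & ends_with("y")))   # intersection
a_or_y      <- names(select(toy, c(starts_with("a"), ends_with("y")))) # union
```

Here `a_and_y` contains only arr_delay, the single column that both starts with "a" and ends with "y", while `a_or_y` contains every column that satisfies either condition, each listed once.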

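Adding new variables | mutate

mutate() builds new variables from existing ones directly inside the data frame. Besides the data, you pass it one or more name = expression pairs that define the new variables.

# Make a small copy of the data to work with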
> cop <- select(flights, ends_with("delay"))
> head(cop)

# A tibble: 6 x 2
  dep_delay arr_delay
      <dbl>     <dbl>
1         2        11
2         4        20
3         2        33
4        -1       -18
5        -6       -25
6        -4        12

# Create a new variable with mutate(); the outer parentheses print the result
> (cop <- mutate(cop, x3 = dep_delay + arr_delay))

# A tibble: 336,776 x 3
   dep_delay arr_delay    x3
       <dbl>     <dbl> <dbl>
 1         2        11    13
 2         4        20    24
 3         2        33    35
 4        -1       -18   -19
 5        -6       -25   -31
 6        -4        12     8
 7        -5        19    14
 8        -3       -14   -17
 9        -3        -8   -11
10        -2         8     6
# ... with 336,766 more rows
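One mutate() call can define several variables, and a later one may use a variable created earlier in the same call. A minimal sketch on made-up data (`toy`; the names gain and gain_per_hour are illustrative choices, not part of the flights dataset):

```r
library(dplyr)

# Toy data; the values are invented for illustration
toy <- tibble(
  dep_delay = c(2, 4, -1),
  arr_delay = c(11, 20, -18),
  air_time  = c(227, 227, 183)
)

res <- mutate(
  toy,
  gain          = dep_delay - arr_delay,  # minutes made up in the air
  gain_per_hour = gain / air_time * 60    # uses gain, defined on the line above
)
```

The new columns are appended to the right of the existing ones, so `res` has five columns.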

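Chaining the analysis | %>%

The sections above gave a brief introduction to several commonly used dplyr functions, which by a rough estimate are enough for about 60% of data-analysis scenarios. In practice, though, an analysis tends to produce many intermediate variables that are unlikely to be used again. They take up memory, and thinking up names for them is wasted effort. The pipe operator %>% avoids this: it chains several functions together so that no intermediate variables are needed.

The plainest way to describe %>% is that it passes whatever is on its left to whatever is on its right. For example, here the variable flights is passed to print() for printing: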

> flights %>% print(n=2)
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
1  2013     1     1      517            515         2      830            819        11 UA     
2  2013     1     1      533            529         4      850            830        20 UA     
# ... with 336,774 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

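One more note on %>%. By default it passes the left-hand side as the first argument of the function on the right, but the placeholder . lets you choose where it goes, for example x %>% write.table(file='test.csv', x = .).

For more usage, see my notes on the pipe operator (R-管道符) or the official magrittr documentation.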
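The difference between the default behaviour and the placeholder can be seen in a small sketch. The vector `x` and the label text are made up for illustration.

```r
library(dplyr)  # attaching dplyr also makes %>% available

x <- c(1, 4, 9)

# Default: the left-hand side becomes the first argument, i.e. sqrt(x)
roots <- x %>% sqrt()

# With the placeholder, the left-hand side goes where the dot is,
# i.e. paste("value:", x)
labels <- x %>% paste("value:", .)
```

Without the dot, the second call would be paste(x, "value:"), which puts the label after the number instead of before it.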

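Let's use one more data-analysis example to show the advantage of the pipe operator %>%.

Suppose we want to know, among flights with a distance under 1,500 miles, on which days the 10 worst-delayed flights took place, taking both the departure delay dep_delay and the arrival delay arr_delay into account. To do so, we need to:

Select the year, month, day, distance, and departure and arrival delay columns from the flights dataset
Keep the flights with a distance under 1500
Sort by departure delay and arrival delay, both in descending order
Create a new variable delay that holds the delay rank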

flights %>% 
  select(year, month, day, distance, dep_delay, arr_delay) %>% 
  filter(distance < 1500) %>% 
  arrange(desc(dep_delay), desc(arr_delay)) %>% 
  mutate(delay=1:nrow(.))
# A tibble: 264,063 x 7
    year month   day distance dep_delay arr_delay delay
   <int> <int> <int>    <dbl>     <dbl>     <dbl> <int>
 1  2013     6    15      483      1137      1127     1
 2  2013     1    10      719      1126      1109     2
 3  2013     7    22      589      1005       989     3
 4  2013     4    10     1005       960       931     4
 5  2013     3    17     1020       911       915     5
 6  2013     7    22      762       898       895     6
 7  2013    12     5     1085       896       878     7
 8  2013     5     3      719       878       875     8
 9  2013     1     1      184       853       851     9
10  2013    12    17     1085       845       846    10
# ... with 264,053 more rows

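The result shows that the worst-delayed flight was on 15 June 2013: it departed 1,137 minutes late and arrived 1,127 minutes late, even though it covered only 483 miles (about 777 km). More importantly, we completed the analysis without creating a single intermediate variable, focusing only on the operations being performed.

This post only scratches the surface of the data-processing functions available.

The end.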
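The pipeline above returns every remaining row, with the top 10 simply appearing first. If only the top rows are wanted, one possible variation is sketched below on made-up data (`toy`, invented values). It assumes dplyr 1.0.0 or later for slice_head(), and it swaps 1:nrow(.) for row_number(), which gives the same rank here.

```r
library(dplyr)

# Toy data; the values are invented for illustration
toy <- tibble(
  day       = 1:5,
  distance  = c(483, 2475, 719, 1005, 184),
  dep_delay = c(1137, 1301, 1126, 960, 853),
  arr_delay = c(1127, 1272, 1109, 931, 851)
)

top3 <- toy %>%
  filter(distance < 1500) %>%
  arrange(desc(dep_delay), desc(arr_delay)) %>%
  mutate(delay = row_number()) %>%  # rank follows the sorted row order
  slice_head(n = 3)                 # keep only the first 3 rows
```

With the real data, the same idea would use slice_head(n = 10) at the end of the original pipeline.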

Dean's blog

R data-processing basics: dplyr