Reading and Writing Data in R

This post briefly records the tools I commonly use to read and write data in R.

Reading data

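read.table is R's general-purpose function for reading tabular text files, and it can handle many table layouts.
Its full signature with default values is shown below (as printed by R versions before 4.0.0; since 4.0.0 the default for stringsAsFactors is FALSE). For now, this post only covers the parameters I use most often.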

read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

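file: the file to read. It can be an absolute path, or a file name relative to the working directory, which can be changed with setwd().
header: a logical value. When TRUE, the first line is treated as the column names.
sep: the character that separates columns. What looks like a space in a file may actually be a tab (\t), so check which separator the file really uses. The default sep = "" splits on any run of whitespace.
comment.char: by default, read.table treats # as the comment character. When it is encountered (except inside a quoted string), the rest of that line is ignored, and lines containing only whitespace and a comment are treated as blank lines. If you are sure the data file contains no comments, comment.char = "" is safer and can also make reading faster.
row.names: a vector of row names can be supplied. To use the first column as the row names, pass its column number, row.names = 1, or its column name. (An expression such as x[,1] cannot work here, because x does not exist until the file has been read.)
colClasses: a vector that sets the type of each column as it is read. To change the type of only one column, name that column in the vector, for example colClasses = c('x' = 'character').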
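
The parameters above can be tried out together on a small file. This is only a minimal sketch: the temporary file, its columns (id, gene, score), and its values are made up for illustration and are not from the data used in this post.

```r
# Hypothetical sample data written to a temporary tab-separated file
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tgene\tscore",
  "r1\tTP53\t1.5",
  "r2\tBRCA1\t2.0",
  "r3\tEGFR\tNA"
), tmp)

dat <- read.table(
  tmp,
  header = TRUE,                      # first line holds the column names
  sep = "\t",                         # columns are separated by tabs
  comment.char = "",                  # the file has no comments, so turn comment handling off
  row.names = 1,                      # use the first column (id) as row names
  colClasses = c(gene = "character")  # force only the gene column to character
)

str(dat)
rownames(dat)
```

Because row.names = 1 consumes the id column, the resulting data frame has two columns (gene and score), and the literal NA in the score column is read as a missing value because it matches the default na.strings = "NA".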

> a <- read.table("c27-ha_go_bp.txt", sep = "\t", header = TRUE)
> str(a)
'data.frame': 233 obs. of 8 variables:
$ GO.biological.process.complete: Factor w/ 233 levels "adaptive immune response (GO:0002250)",..: 13 19 55 33 52 164 136 88 120 85 ...
$ Homo.sapiens...REFLIST..20996.: int 1315 959 538 7634 482 7492 7841 678 6985 8389 ...
$ upload_1..1992. : int 244 196 1 934 0 917 950 152 861 997 ...
$ upload_1..expected. : num 124.8 91 51 724.3 45.7 ...
$ upload_1..over.under. : Factor w/ 2 levels "-","+": 2 2 1 2 1 2 2 2 2 2 ...
$ upload_1..fold.Enrichment. : Factor w/ 135 levels " < 0.01","0.02",..: 71 80 2 31 1 31 30 95 32 27 ...
$ upload_1..raw.P.value. : num 1.37e-20 1.64e-20 3.55e-20 4.85e-20 1.15e-19 ...
$ upload_1..FDR. : num 2.17e-16 1.29e-16 1.87e-16 1.92e-16 3.64e-16 ...

# Read one specific column as character instead of factor
> a <- read.table("c27-ha_go_bp.txt", sep = "\t", header = TRUE,
+ colClasses = c("upload_1..over.under." = "character")
+ )
> str(a)
'data.frame': 233 obs. of 8 variables:
$ GO.biological.process.complete: Factor w/ 233 levels "adaptive immune response (GO:0002250)",..: 13 19 55 33 52 164 136 88 120 85 ...
$ Homo.sapiens...REFLIST..20996.: int 1315 959 538 7634 482 7492 7841 678 6985 8389 ...
$ upload_1..1992. : int 244 196 1 934 0 917 950 152 861 997 ...
$ upload_1..expected. : num 124.8 91 51 724.3 45.7 ...
$ upload_1..over.under. : chr "+" "+" "-" "+" ...
$ upload_1..fold.Enrichment. : Factor w/ 135 levels " < 0.01","0.02",..: 71 80 2 31 1 31 30 95 32 27 ...
$ upload_1..raw.P.value. : num 1.37e-20 1.64e-20 3.55e-20 4.85e-20 1.15e-19 ...
$ upload_1..FDR. : num 2.17e-16 1.29e-16 1.87e-16 1.92e-16 3.64e-16 ...
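
If every text column should stay as character rather than only one, a possible alternative to naming columns in colClasses is to pass stringsAsFactors = FALSE explicitly. The sketch below uses a small made-up file (the term and direction columns are hypothetical, loosely modeled on the output above), and setting the argument explicitly should give the same result on R versions before and after 4.0.0.

```r
# Hypothetical two-column file; values are invented for illustration
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "term\tdirection",
  "immune response\t+",
  "cell cycle\t-"
), tmp)

# Keep all string columns as character, regardless of the R version's default
dat <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

str(dat)
```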

Reading .xls/.xlsx files

To read data stored in .xls or .xlsx format into a data.frame, you can use the readxl package.

# Installation
install.packages("readxl")
library(readxl)
dat <- read_excel(path = "dataset.xlsx", sheet = "sheet1")

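The inputs needed here are the path to the file and, through the sheet argument, which worksheet to read. sheet accepts either the sheet's name or its position.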
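
When the sheet names are not known in advance, they can be listed before reading. The following is a sketch that assumes readxl is installed and uses the example workbook bundled with the package (located with readxl_example()) in place of a real dataset. The whole block is skipped if the package is missing.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook shipped with readxl, standing in for your own file
  path <- readxl::readxl_example("datasets.xlsx")

  # List the worksheet names, then read the first one by name
  sheets <- readxl::excel_sheets(path)
  print(sheets)

  dat <- readxl::read_excel(path, sheet = sheets[1])

  # read_excel() returns a tibble; convert if a plain data.frame is wanted
  dat <- as.data.frame(dat)
  str(dat)
}
```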

Exporting data

After processing data in R, you can export it with write.table():

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

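x: the data to export

file: the path of the output file

sep: the field separator. A comma "," produces a CSV file; a tab "\t" produces a TSV file.

Make sure the file extension of the output matches the separator.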
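
As a small illustration of matching the extension to the separator, the sketch below writes the same made-up data frame (hypothetical gene and score columns) as both TSV and CSV in temporary files, then reads the TSV back. The row.names = FALSE and quote = FALSE settings are my own choices for a cleaner file, not requirements.

```r
# Hypothetical data frame used only for this example
df <- data.frame(gene = c("TP53", "BRCA1"), score = c(1.5, 2.0))

# Tab-separated output with a .tsv extension, no quotes, no row names
tsv <- tempfile(fileext = ".tsv")
write.table(df, file = tsv, sep = "\t", quote = FALSE, row.names = FALSE)

# Comma-separated output with a .csv extension (strings stay quoted by default)
csv <- tempfile(fileext = ".csv")
write.table(df, file = csv, sep = ",", row.names = FALSE)

# Inspect the raw files, then read the TSV back in
readLines(tsv)
readLines(csv)
back <- read.table(tsv, header = TRUE, sep = "\t")
str(back)
```

Leaving row.names at its default of TRUE writes an extra leading column of row names without a header entry, which is easy to trip over when the file is opened in other tools.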

Exporting to .xls/.xlsx

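To export data in .xls or .xlsx format, you can use the xlsx package. Note that it depends on rJava, so a working Java installation is required.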

install.packages('xlsx')
library(xlsx)
write.xlsx(x, file, sheetName = "Sheet1",
col.names = TRUE, row.names = TRUE, append = FALSE)
write.xlsx2(x, file, sheetName = "Sheet1",
col.names = TRUE, row.names = TRUE, append = FALSE)

Of the two, write.xlsx2() is faster when writing large files.
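
A round trip with the xlsx package might look like the sketch below. It assumes xlsx and Java are both available and skips itself otherwise. The data frame and the temporary file are hypothetical.

```r
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Hypothetical data frame used only for this example
  df <- data.frame(gene = c("TP53", "BRCA1"), score = c(1.5, 2.0))

  out <- tempfile(fileext = ".xlsx")

  # Write without row names so the sheet contains only the two data columns
  xlsx::write.xlsx(df, file = out, sheetName = "Sheet1", row.names = FALSE)

  # Read the first sheet back to check the result
  back <- xlsx::read.xlsx(out, sheetIndex = 1)
  str(back)
}
```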

Ref:

https://readxl.tidyverse.org/

http://www.sthda.com/english/wiki/reading-data-from-excel-files-xls-xlsx-into-r

http://www.sthda.com/english/wiki/writing-data-from-r-to-excel-files-xls-xlsx

End.