This post is a brief record of the tools I commonly use for reading and writing data in R.
Reading data
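read.table
read.table is R's general-purpose function for reading tabular data from delimited text files.
Its default arguments are shown below; in this post I only note the ones I use most often. The signature is the one from the R 3.x documentation: since R 4.0.0, stringsAsFactors defaults to FALSE.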
```r
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
```
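file
: The file to read. This can be an absolute path, or a path relative to the working directory, which can be changed with setwd().
header
: A logical value. When TRUE, the first line is treated as the column names.
sep
: The character that separates columns. The default, sep = "", splits on any whitespace. What looks like spaces in a file may actually be tabs (\t), so check which character really separates the columns.
comment.char
: By default, read.table uses # as the comment character. When it is encountered (except inside a quoted string), the rest of that line is ignored. Lines containing only whitespace and comments are treated as blank lines. If you are sure the data file contains no comments, comment.char = "" is safer and may also make reading faster.
row.names
: Accepts a vector of row names, or a single number or name identifying the column that holds them. To use the first column as row names, pass row.names = 1.
colClasses
: A character vector giving the classes of the columns. To set the class of one specific column, name it, for example colClasses = c('x' = 'character'). The name must be the column name as R reads it, as in the example below.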
```r
> a <- read.table("c27-ha_go_bp.txt", sep = "\t", header = T)
> str(a)
'data.frame': 233 obs. of 8 variables:
 $ GO.biological.process.complete: Factor w/ 233 levels "adaptive immune response (GO:0002250)",..: 13 19 55 33 52 164 136 88 120 85 ...
 $ Homo.sapiens...REFLIST..20996.: int 1315 959 538 7634 482 7492 7841 678 6985 8389 ...
 $ upload_1..1992.               : int 244 196 1 934 0 917 950 152 861 997 ...
 $ upload_1..expected.           : num 124.8 91 51 724.3 45.7 ...
 $ upload_1..over.under.         : Factor w/ 2 levels "-","+": 2 2 1 2 1 2 2 2 2 2 ...
 $ upload_1..fold.Enrichment.    : Factor w/ 135 levels " < 0.01","0.02",..: 71 80 2 31 1 31 30 95 32 27 ...
 $ upload_1..raw.P.value.        : num 1.37e-20 1.64e-20 3.55e-20 4.85e-20 1.15e-19 ...
 $ upload_1..FDR.                : num 2.17e-16 1.29e-16 1.87e-16 1.92e-16 3.64e-16 ...

> a <- read.table("c27-ha_go_bp.txt", sep = "\t", header = T,
+                 colClasses = c("upload_1..over.under." = "character")
+ )
> str(a)
'data.frame': 233 obs. of 8 variables:
 $ GO.biological.process.complete: Factor w/ 233 levels "adaptive immune response (GO:0002250)",..: 13 19 55 33 52 164 136 88 120 85 ...
 $ Homo.sapiens...REFLIST..20996.: int 1315 959 538 7634 482 7492 7841 678 6985 8389 ...
 $ upload_1..1992.               : int 244 196 1 934 0 917 950 152 861 997 ...
 $ upload_1..expected.           : num 124.8 91 51 724.3 45.7 ...
 $ upload_1..over.under.         : chr "+" "+" "-" "+" ...
 $ upload_1..fold.Enrichment.    : Factor w/ 135 levels " < 0.01","0.02",..: 71 80 2 31 1 31 30 95 32 27 ...
 $ upload_1..raw.P.value.        : num 1.37e-20 1.64e-20 3.55e-20 4.85e-20 1.15e-19 ...
 $ upload_1..FDR.                : num 2.17e-16 1.29e-16 1.87e-16 1.92e-16 3.64e-16 ...
```
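To try the arguments discussed above without the original data file, here is a minimal self-contained sketch. The file contents, the column names (id, count, flag) and the temporary file are made up for illustration; only the read.table() arguments come from the notes above.

```r
# Write a small tab-separated file to a temporary location (made-up data).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# a comment line, ignored because comment.char defaults to \"#\"",
  "id\tcount\tflag",
  "g1\t10\t+",
  "g2\t3\t-",
  "g3\t7\t+"
), path)

dat <- read.table(
  path,
  sep = "\t",                         # columns are separated by tabs
  header = TRUE,                      # first non-comment line holds the column names
  row.names = 1,                      # use the first column ("id") as row names
  colClasses = c(flag = "character")  # read "flag" as character (a factor by default before R 4.0)
)

str(dat)
rownames(dat)
```

Passing row.names = 1 removes the id column from the data and uses its values as row labels, so dat ends up with two columns.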
Reading .xls/.xlsx data
To read data stored in .xls or .xlsx format into a data.frame, you can use the readxl package.
```r
install.packages("readxl")
library(readxl)

dat <- read_excel(path = "dataset.xlsx", sheet = "sheet1")
```
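Here you supply the path to the file and use the sheet = argument to say which sheet of the workbook to read, either by name or by position. The result is a tibble, which is a subclass of data.frame.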
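If you do not know the sheet names in advance, you can list them first. The sketch below assumes readxl is installed and uses the demo workbook that ships with the package (located with readxl_example()), not a file from this post; the variable names are only illustrative.

```r
# Assumes the readxl package is installed; the block is skipped otherwise.
if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl_example() returns the path to a demo workbook bundled with readxl.
  path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names before choosing one.
  sheets <- readxl::excel_sheets(path)
  print(sheets)

  # A sheet can be selected by name or by position.
  by_name <- readxl::read_excel(path, sheet = sheets[1])
  by_pos  <- readxl::read_excel(path, sheet = 1)

  # read_excel() returns a tibble; convert it if a plain data.frame is needed.
  dat <- as.data.frame(by_name)
  str(dat)
}
```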
Exporting data
After processing data in R, you can export it with write.table().
```r
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
```
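x
: The data to export.
file
: The path of the output file.
sep
: The field separator. A comma "," gives a CSV file; a tab "\t" gives a TSV file.
Make sure the file extension of the exported file matches the separator.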
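As a quick illustration of matching separator and extension, the following sketch writes the same data frame as CSV and as TSV and reads one of them back. The data frame and the temporary file paths are made up for the example.

```r
# A small made-up data frame to export.
df <- data.frame(id = c("g1", "g2"), count = c(10L, 3L), flag = c("+", "-"))

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")

# Comma-separated output; row.names = FALSE drops the 1, 2, ... row labels.
write.table(df, file = csv_path, sep = ",", row.names = FALSE)

# Tab-separated output; quote = FALSE writes strings without surrounding quotes.
write.table(df, file = tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Inspect the raw text that was written.
readLines(csv_path)
readLines(tsv_path)

# Read the TSV back to check the round trip.
back <- read.table(tsv_path, sep = "\t", header = TRUE)
str(back)
```

With the default row.names = TRUE, write.table() writes a header line with one fewer field than the data lines, which some other programs read incorrectly; setting row.names = FALSE avoids this when the row labels carry no information.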
Exporting to .xls/.xlsx
To export data in .xls or .xlsx format, use the xlsx package.
```r
install.packages('xlsx')
library(xlsx)

write.xlsx(x, file, sheetName = "Sheet1",
           col.names = TRUE, row.names = TRUE, append = FALSE)
write.xlsx2(x, file, sheetName = "Sheet1",
            col.names = TRUE, row.names = TRUE, append = FALSE)
```
Of the two, write.xlsx2() is faster when writing large files.
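A minimal usage sketch follows, assuming the xlsx package and the Java runtime it depends on (through rJava) are available. The data frame and the output path are made up; the point is to show row.names = FALSE and how append = TRUE adds a second sheet to an existing workbook.

```r
# Assumes the xlsx package (and Java) can be loaded; the block is skipped otherwise.
if (requireNamespace("xlsx", quietly = TRUE)) {
  df <- data.frame(id = c("g1", "g2"), count = c(10L, 3L))
  out <- tempfile(fileext = ".xlsx")

  # First sheet; row.names = FALSE leaves out the row-label column.
  xlsx::write.xlsx(df, file = out, sheetName = "Sheet1", row.names = FALSE)

  # append = TRUE adds a second sheet to the same workbook instead of overwriting it.
  xlsx::write.xlsx2(df, file = out, sheetName = "Sheet2",
                    row.names = FALSE, append = TRUE)

  file.exists(out)
}
```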
Ref:
https://readxl.tidyverse.org/
http://www.sthda.com/english/wiki/reading-data-from-excel-files-xls-xlsx-into-r
http://www.sthda.com/english/wiki/writing-data-from-r-to-excel-files-xls-xlsx
The end.