我正在尝试在R中打开和清理一个海量海洋数据集,在R中,观测站信息作为标题散布在观察值之间:
$
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999.0 -9 -9 -9 -9 4868.8 2017 0 7114
2.0 6.0297 35.0199 34.4101 2.0 11111
3.0 6.0279 35.0201 34.4091 3.0 11111
4.0 6.0272 35.0203 34.4091 4.0 11111
5.0 6.0273 35.0204 34.4097 4.9 11111
6.0 6.0274 35.0205 34.4104 5.9 11111
$
2008 1 777 8 17 12 7 25 78.4738 8.3510 27 6 4.1 -999.0 3 7 2 0 4903.8 1570 0 7114
3.0 6.4129 34.5637 34.3541 3.0 11111
4.0 6.4349 34.5748 34.3844 4.0 11111
5.0 6.4803 34.5932 34.4426 4.9 11111
6.0 6.4139 34.5624 34.3552 5.9 11111
7.0 6.5079 34.6097 34.4834 6.9 11111每个$后面跟着一行包含站点数据(例如年份、...、后来、以后、日期、时间),然后是包含在该站点采样的观测值(例如深度、温度、盐度等)的几行。
我想将观测站数据添加到观察值中,这样每个变量就是一列,每个观测值就是一行,如下所示:
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 2 6.0297 35.0199 34.4101 2 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 3 6.0279 35.0201 34.4091 3 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 4 6.0272 35.0203 34.4091 4 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 5 6.0273 35.0204 34.4097 4.9 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 6 6.0274 35.0205 34.4104 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 3 6.4129 34.5637 34.3541 3 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 4 6.4349 34.5748 34.3844 4 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 5 6.4803 34.5932 34.4426 4.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 6 6.4139 34.5624 34.3552 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 7 6.5079 34.6097 34.4834 6.9 11111发布于 2019-11-15 06:36:54
这比较简单,而且只依赖于基数R。我假设您已经先用x <- readLines(....)读取了文本文件:
start <- which(x == "$") + 1 # Find header indices
rows <- diff(c(start, length(x)+2)) - 2 # Find number of lines per group
# Function to read header and rows and cbind
getdata <- function(begin, end) {
cbind(read.table(text=x[begin]), read.table(text=x[(begin+1):(begin+end)]))
}
dta.list <- lapply(1:(length(start)), function(i) getdata(start[i], rows[i]))
dta.df <- do.call(rbind, dta.list)这适用于您在帖子中包含的两个组。您需要修改列名,因为V1 - V6在开头和结尾重复。
发布于 2019-11-15 06:24:55
这个解决方案非常复杂,并且依赖于几个Tidyverse库和特性的知识。我不确定它是否能满足您的需求,但它与您发布的样例配合得很好。但是折叠块的方法,创建函数来解析较小的块,然后展开结果,我认为会很好地服务于您。
第一部分涉及找到“$”标记,将后面的行分组在一起,然后将数据块“嵌套”在一起。然后我们有一个数据框,它只有几行-每个部分一行。
library(tidyverse)
txt_lns <- readLines("ocean-sample.txt")
txt <- tibble(txt = txt_lns)
# Start by finding new sections, and nesting the data
nested_txt <- txt %>%
mutate(row_number = row_number()) %>%
mutate(new_section = str_detect(txt, "\\$")) %>% # Mark new sections
mutate(starting = ifelse(new_section, row_number, NA)) %>% # Index with row num
tidyr::fill(starting) %>% # Fill index down
# where missing
select(-new_section) %>% # Clean up
filter(!str_detect(txt, "\\$")) %>%
nest(data = c(txt, row_number)) # "Nest" the data
# Take a quick look
nested_txt然后,我们需要能够处理这些嵌套的块。这里的例程通过识别标题行,然后将字段分离为它们自己的数据帧来解析这些块。在这里,我们对标题行和较短的较小的行有不同的逻辑。
# Deal with the records within a section
parse_inner_block <- function(x, header_ind) {
if (header_ind) {
df <- x %>%
mutate(txt = str_trim(txt)) %>%
# Separate the header row into 22 variables
separate(txt, into = LETTERS[1:22], sep = "\\s+")
} else {
df <- x %>%
mutate(txt = str_trim(txt)) %>%
# Separate the lesser rows into 6 variables
separate(txt, into = letters[1:6], sep = "\\s+")
}
return(df)
}
parse_outer_block <- function(x) {
df <- x %>%
# Determine if it's a header row with 22 variables or lesser row with 6
mutate(leading_row = (row_number == min(row_number))) %>%
# Fold by header row vs. not
nest(data = c(txt, row_number)) %>%
# Create data frames for both header and lesser rows
mutate(processed = purrr::map2(data, leading_row, parse_inner_block)) %>%
unnest(processed) %>%
# Copy header row values to lesser rows
tidyr::fill(A:V) %>%
# Drop header row
filter(!leading_row)
return(df)
}然后,我们可以将它们放在一起--从嵌套数据开始,处理每个块,取消返回的字段的嵌套,并准备完整的输出。
# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
mutate(proc_out = purrr::map(data, parse_outer_block)) %>%
select(-data) %>%
unnest(proc_out) %>%
select(-starting, -leading_row, -data, -row_number)
output 希望能有所帮助。对于一些类似的问题,我也推荐参考一些purrr教程。
https://stackoverflow.com/questions/58864268
复制相似问题