I've run into a very strange data frame structure. Here is an example:
# A tibble: 3 x 4
ColName...1 Sample1 ColName...2 Sample2
<chr> <dbl> <chr> <dbl>
1 A 1 A 4
2 B 2 B 5
3 NA NA C 6
The code to reproduce it:
ColName...1 <- c("A","B",NA)
Sample1 <- c(1,2,NA)
ColName...2 <- c("A","B","C")
Sample2 <- c(4,5,6)
I would like to convert my data to a more conventional format:
A B C Sample
1 1 2 NA 1
2 4 5 6 2
which can be built with:
# Desired output
df <- data.frame(c(1,4),c(2,5),c(NA,6),c(1,2))
colnames(df) <- c("A","B","C","Sample")
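df
In other words, I need to tell R that ColName...1, ColName...2, etc. are variables holding the column names of the data frame, and that the columns Sample1, Sample2, etc. should become its rows. How can I code this?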
Edit: The data frame I'm actually working with is messier. Here is what it looks like:
# A tibble: 10 x 6
Element...1 GeoPT8 Element...3 GeoPT9 Element...5 GeoPT10
<chr> <dbl> <chr> <dbl> <chr> <dbl>
1 SiO2 66 SiO2 59 SiO2 64
2 TiO2 67 TiO2 63 TiO2 69
3 Al2O3 69 Al2O3 63 Al2O3 71
4 Fe2O3 71 Fe2O3 68 Fe2O3 74
5 Fe(II)O 16 Fe(II)O 17 MnO 73
6 MnO 70 MnO 68 MgO 70
7 MgO 69 MgO 64 CaO 73
8 CaO 70 CaO 65 Na2O 73
9 Na2O 71 Na2O 66 P2O5 60
10 K2O 69 K2O 64 LOI 54
Code to create this data frame:
df <- structure(list(Element...1 = c("SiO2", "TiO2", "Al2O3", "Fe2O3", "Fe(II)O", "MnO", "MgO", "CaO", "Na2O", "K2O"),
GeoPT8 = c(66,67, 69, 71, 16, 70, 69, 70, 71, 69),
Element...3 = c("SiO2", "TiO2", "Al2O3", "Fe2O3", "Fe(II)O", "MnO", "MgO", "CaO", "Na2O", "K2O"),
GeoPT9 = c(59, 63, 63, 68, 17, 68, 64, 65, 66, 64),
Element...5 = c("SiO2", "TiO2", "Al2O3", "Fe2O3", "MnO", "MgO", "CaO", "Na2O", "P2O5", "LOI"),
GeoPT10 = c(64, 69, 71, 74, 73, 70, 73, 73, 60, 54)), row.names = c(NA, -10L),
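class = c("tbl_df", "tbl", "data.frame"))
As you can see, the columns Element...1 and Element...5 do not match (Element...5 contains P2O5 and LOI, which Element...1 lacks, while Element...1 contains Fe(II)O and K2O, which Element...5 lacks). How can I tell R that the column holding the keys for GeoPT8 is Element...1, that the column holding the keys for GeoPT10 is Element...5, and so on?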
Answered 2021-04-14 03:58:48
One option is to reshape to 'long' format with pivot_longer and then reshape back to 'wide'.
library(dplyr)
library(tidyr)
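df1 %>%
# split each name where a non-digit is followed by a digit:
# the text part names the output column (.value), the digit becomes grp
pivot_longer(cols = everything(), names_to = c(".value", "grp"),
names_sep = "(?<=\\D)(?=\\d)", values_drop_na = TRUE) %>%
# one column per key found in ColName..., filled with the Sample values
pivot_wider(names_from = ColName..., values_from = Sample) %>%
# move grp to the end and rename it Sample
select(-grp, everything(), Sample = grp)
Output: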
# A tibble: 2 x 4
# A B C Sample
# <int> <int> <int> <chr>
#1 1 2 NA 1
#2 4 5 6 2
Alternatively, this can be done in base R by taking the value columns, transposing them with t(), and setting the names from the key columns.
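The same base R idea can also be written with match(), so that each value column is aligned to its own key column instead of relying on the key columns agreeing row by row. This is only a minimal sketch and not the answer's code: the data frame `toy` and the helper names `keys`, `vals`, `all_keys`, and `wide` are made up for illustration, and it assumes that key and value columns strictly alternate and that the value columns are numeric.

```r
# Hypothetical toy data mirroring the first example in the question
toy <- data.frame(
  `ColName...1` = c("A", "B", NA),  Sample1 = c(1, 2, NA),
  `ColName...2` = c("A", "B", "C"), Sample2 = c(4, 5, 6),
  check.names = FALSE, stringsAsFactors = FALSE
)

keys <- toy[c(TRUE, FALSE)]   # odd-position columns hold the keys
vals <- toy[c(FALSE, TRUE)]   # even-position columns hold the values

# Every key that occurs in any key column, in order of first appearance
all_keys <- unlist(keys, use.names = FALSE)
all_keys <- unique(all_keys[!is.na(all_keys)])

# For each pair, look up the value that belongs to each key (NA if absent)
wide <- vapply(
  seq_along(keys),
  function(i) vals[[i]][match(all_keys, keys[[i]])],
  numeric(length(all_keys))
)
wide <- t(wide)               # one row per sample, one column per key
colnames(wide) <- all_keys

out <- data.frame(wide, Sample = names(vals),
                  check.names = FALSE, stringsAsFactors = FALSE)
out
```

Because the lookup is done per pair, a key that is missing from one sample or sits in a different row simply ends up as NA for that sample. The compact version that follows instead combines the key columns with pmax(), which takes the row-wise maximum and therefore fits data where the key columns agree wherever both are present.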
nm1 <- do.call(pmax, c(df1[c(TRUE, FALSE)], na.rm = TRUE))
setNames(as.data.frame(t(unname(df1[c(FALSE, TRUE)]))), nm1)
Update
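If the key columns do not match across samples, extract the numeric suffix from the '...' columns and substitute it for the trailing digits of the Geo column names, assuming the columns are in the same order.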
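To see what this renaming step produces before any reshaping happens, it can be tried on the column names alone. The following is a small base R sketch under the same ordering assumption (each key column is immediately followed by its value column). The variable names `nms`, `key_suffix`, and `new_nms` are illustrative only.

```r
# Column names of the messy example from the question
nms <- c("Element...1", "GeoPT8", "Element...3",
         "GeoPT9", "Element...5", "GeoPT10")

# Suffix of each key column: drop everything up to and including "..."
key_suffix <- sub("^.*\\.{3}", "", nms[c(TRUE, FALSE)])

# Replace the trailing digits of each value column with "..." + key suffix
new_nms <- nms
new_nms[c(FALSE, TRUE)] <- paste0(
  sub("[0-9]+$", "", nms[c(FALSE, TRUE)]), "...", key_suffix
)
new_nms
```

After this step every key/value pair shares the same suffix, which is what lets pivot_longer split the names into a column part and a group part. Note that the original sample numbers 8, 9, and 10 are overwritten by the key-column suffixes, which is why the Sample column in the output below reads 1, 3, 5.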
library(stringr)
v1 <- str_extract(names(df)[c(TRUE, FALSE)], "\\d+$")
df %>%
rename_at(vars(starts_with('Geo')),
~ str_replace(., '\\d+$', str_c("...", v1)) ) %>%
pivot_longer(cols = everything(), names_to = c(".value", "grp"),
names_sep = "(?<=\\.{3})(?=\\d$)", values_drop_na = TRUE) %>%
pivot_wider(names_from = 'Element...', values_from = 'GeoPT...') %>%
select(-grp, everything(), Sample = grp)
# A tibble: 3 x 13
# SiO2 TiO2 Al2O3 Fe2O3 `Fe(II)O` MnO MgO CaO Na2O P2O5 K2O LOI Sample
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 66 67 69 71 16 70 69 70 71 NA 69 NA 1
#2 59 63 63 68 17 68 64 65 66 NA 64 NA 3
#3 64 69 71 74 NA 73 70 73 73 60 NA 54 5
Data
df1 <- structure(list(ColName...1 = c("A", "B", NA), Sample1 = c(1L,
2L, NA), ColName...2 = c("A", "B", "C"), Sample2 = 4:6),
class = "data.frame", row.names = c("1",
"2", "3"))
Answered 2021-04-14 04:05:05
This is longer than akrun's solution.
library(tidyverse)
df %>%
as_tibble() %>%
pivot_longer(
cols = starts_with("Sample"),
names_to = "names",
values_to = "values"
) %>%
select(-ColName...1, -names) %>%
type.convert(as.is = TRUE) %>%
group_by(ColName...2) %>%
mutate(row = row_number()) %>%
pivot_wider(
names_from = "ColName...2",
values_from = "values"
) %>%
select(-row)
Output:
A B C
<int> <int> <int>
1 1 2 NA
2 4 5 6
https://stackoverflow.com/questions/67081470