Suppose I have this data:
df <- structure(list(gender_age = c("males_rating_all_ages", "males_rating_<18",
"males_rating_18-29", "males_rating_30-44", "males_rating_45+",
"males_count_all_ages", "males_count_<18", "males_count_18-29",
"males_count_30-44", "males_count_45+", "females_rating_all_ages",
"females_rating_<18", "females_rating_18-29", "females_rating_30-44",
"females_rating_45+"), count = c("7.4", "8.0", "7.5", "7.2",
"7.5", "4,197", "15", "1,276", "1,631", "921", "7.8", "8.7",
"7.7", "7.8", "8.1")), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
I want to extract the gender, age, and type (i.e. count or rating) from the gender_age column and put each in its own column.
So far I have this code:
df %>% mutate(gender = str_sub(.$gender_age, 1, str_locate(.$gender_age, "_")[1,]-1)) %>%
mutate(age = str_sub(.$gender_age, str_locate_all(.$gender_age, "_")[[1]][2,], str_length(.$gender_age)))
# A tibble: 15 x 4
gender_age count gender age
<chr> <chr> <chr> <chr>
1 males_rating_all_ages 7.4 males _all_ages
2 males_rating_<18 8.0 males _<18
3 males_rating_18-29 7.5 males _18-29
4 males_rating_30-44 7.2 males _30-44
5 males_rating_45+ 7.5 males _45+
6 males_count_all_ages 4,197 males all_ages
7 males_count_<18 15 males <18
8 males_count_18-29 1,276 males 18-29
9 males_count_30-44 1,631 males 30-44
10 males_count_45+ 921 males 45+
11 females_rating_all_ages 7.8 femal ng_all_ages
12 females_rating_<18 8.7 femal ng_<18
13 females_rating_18-29 7.7 femal ng_18-29
14 females_rating_30-44 7.8 femal ng_30-44
15 females_rating_45+ 8.1 femal ng_45+
Warning messages:
1: Problem with `mutate()` column `gender`.
ℹ `gender = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length
2: Problem with `mutate()` column `age`.
ℹ `age = str_sub(...)`.
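ℹ longer object length is not a multiple of shorter object length
But as you can see, str_locate_all() is being indexed at the same fixed position for every row of the data. Obviously this is not ideal, because the number of characters before the second underscore differs from row to row.
For example: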
> str_locate_all("males_rating_all_ages", "_")
[[1]]
start end
[1,] 6 6
[2,] 13 13
[3,] 17 17
So I first have to index with [[1]], and then index a specific row of the matrix (in my case [2,]) to get a value that can be fed into the str_sub() expression.
But if I run
> str_locate_all("females_rating_all_ages", "_")
[[1]]
start end
[1,] 8 8
[2,] 15 15
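[3,] 19 19
we can see that the matrix reflects the extra characters before the underscores. However, for the new columns I create inside mutate(), it seems to reuse the first row's indices for every subsequent row.
Can anyone see what I am doing wrong? Or suggest an alternative way to extract the three columns I want from gender_age (preferably using str_ functions)?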
Posted on 2021-07-23 18:04:23
Instead of using str_locate, it may be easier to use extract, which pulls out the pieces matched by capture groups in a regex pattern
library(dplyr)
library(tidyr)   # extract() comes from tidyr
library(stringr)
df %>%
  extract(gender_age, into = c("gender", "age"),
          "^([^_]+)_[^_]+_(.*)", remove = FALSE)
-output
# A tibble: 15 x 4
gender_age gender age count
<chr> <chr> <chr> <chr>
1 males_rating_all_ages males all_ages 7.4
2 males_rating_<18 males <18 8.0
3 males_rating_18-29 males 18-29 7.5
4 males_rating_30-44 males 30-44 7.2
5 males_rating_45+ males 45+ 7.5
6 males_count_all_ages males all_ages 4,197
7 males_count_<18 males <18 15
8 males_count_18-29 males 18-29 1,276
9 males_count_30-44 males 30-44 1,631
10 males_count_45+ males 45+ 921
11 females_rating_all_ages females all_ages 7.8
12 females_rating_<18 females <18 8.7
13 females_rating_18-29 females 18-29 7.7
14 females_rating_30-44 females 30-44 7.8
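15 females_rating_45+ females 45+ 8.1
The problem in the OP's code is that [[1]] selects the first element of the list returned by str_locate_all. That works if the list has length 1, but here the list has as many elements as the data has rows, so [[1]] always picks the positions found in the first row. This can be corrected by adding a rowwise step before the mutate: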
df %>%
rowwise %>%
mutate(gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[1,1]-1)) %>%
mutate(age = str_sub(gender_age, str_locate_all(gender_age,
"_")[[1]][2,1]+1, str_length(gender_age)))
# A tibble: 15 x 4
# Rowwise:
gender_age count gender age
<chr> <chr> <chr> <chr>
1 males_rating_all_ages 7.4 males all_ages
2 males_rating_<18 8.0 males <18
3 males_rating_18-29 7.5 males 18-29
4 males_rating_30-44 7.2 males 30-44
5 males_rating_45+ 7.5 males 45+
6 males_count_all_ages 4,197 males all_ages
7 males_count_<18 15 males <18
8 males_count_18-29 1,276 males 18-29
9 males_count_30-44 1,631 males 30-44
10 males_count_45+ 921 males 45+
11 females_rating_all_ages 7.8 females all_ages
12 females_rating_<18 8.7 females <18
13 females_rating_18-29 7.7 females 18-29
14 females_rating_30-44 7.8 females 30-44
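15 females_rating_45+ 8.1 females 45+
Note that the .$ prefix is also removed, because it selects the whole column rather than the current row's value. Another option is to loop over the list with map and take the position of interest from each matrix.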
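The map option mentioned above could look roughly like the following. This is only a sketch, not part of the original answer: it assumes the purrr package is available, uses a three-row subset of the question's data, and the names df_small, result, pos, first_us and second_us are made up for illustration. It keeps the str_ functions the question asked for and also recovers the type (rating or count).

```r
library(dplyr)
library(stringr)
library(purrr)   # assumption: purrr is installed; it provides map_int()

# Illustrative subset of the question's data (df_small is a hypothetical name)
df_small <- tibble(
  gender_age = c("males_rating_all_ages", "males_count_<18", "females_rating_45+"),
  count = c("7.4", "15", "8.1")
)

result <- df_small %>%
  mutate(
    # str_locate_all() returns a list holding one position matrix per row
    pos = str_locate_all(gender_age, "_"),
    # [[i, 1]] is the start position of the i-th underscore in that row
    first_us = map_int(pos, ~ .x[[1, 1]]),
    second_us = map_int(pos, ~ .x[[2, 1]]),
    # str_sub() is vectorised over start and end, so no rowwise() is needed
    gender = str_sub(gender_age, 1, first_us - 1),
    type = str_sub(gender_age, first_us + 1, second_us - 1),
    age = str_sub(gender_age, second_us + 1)
  ) %>%
  select(-pos, -first_us, -second_us)

result
```

Because the positions are computed per row and then passed to the vectorised str_sub(), this avoids both the fixed [[1]] index and the rowwise step. With the extract approach, the same third column should be obtainable by adding a capture group, e.g. the pattern "^([^_]+)_([^_]+)_(.*)" with into = c("gender", "type", "age").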
https://stackoverflow.com/questions/68503358