df <- data.frame(url = c("https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s60.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c34e8a20217"),
page_id = c("0","50","60","70","0","30"))

I have a set of URLs with subpages. The subpages follow a pattern: s0 is the first subpage, s10 is the second, then s20, s30, and so on. There are many URLs with subpages, and I only have the first and last subpage numbers. For example, I might have s0 and s70 but not s10 through s60. What I want is every URL, each with its subpage identifier.
For example, from the data frame above, I would like to get:
df <- data.frame(url = c("https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s10.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s20.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s30.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s40.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s60.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t143332-s10.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t143332-s20.html?sid=674002291c431ba23dd69c34e8a20217",
"https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c34e8a20217"),
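page_id = c("0","10","20","30","40","50","60","70","0","10","20","30"))

I have already isolated the end of the URL and created a column with the maximum subpage identifier, but I am not sure where to go from there.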
Posted on 2022-11-20 14:38:26
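Use a regex with tidyr::extract() to split the page id and the other URL components into separate columns; use tidyr::full_seq() inside a grouped summary to fill in the intermediate page ids; then unite the columns back together.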
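To see the filling step in isolation: tidyr::full_seq() takes a numeric vector and a period, and returns every value from the minimum to the maximum in steps of that period. A small sketch using the page ids of the first thread in the question (the names `known` and `filled` are only for illustration):

```r
library(tidyr)

# Page ids that are known for the first thread in the question
known <- c(0, 50, 60, 70)

# Every multiple of 10 from min(known) to max(known)
filled <- full_seq(known, period = 10)
filled
```

Here `filled` holds 0, 10, 20, and so on up to 70. full_seq() also raises an error when the input values do not all lie on the same period-10 grid, which doubles as a sanity check on the extracted ids.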
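library(tidyr)
library(dplyr)

df %>%
  # Split each URL into the part before the page id, the id itself, and the rest
  extract(
    url,
    into = c("url1", "page_id", "url2"),
    regex = "(.+-s)(\\d+)(\\.html.*)"
  ) %>%
  group_by(url1, url2) %>%
  # Fill in every multiple of 10 between the smallest and largest known id
  # (dplyr >= 1.1.0 recommends reframe() for multi-row results like this)
  summarize(page_id = full_seq(as.numeric(page_id), 10), .groups = "drop") %>%
  # Paste the pieces back together into a full URL
  unite("url", url1, page_id, url2, sep = "")

# A tibble: 12 × 1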
url
<chr>
1 https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c3…
2 https://tapatalk.com/groups/thee-t119532-s10.html?sid=674002291c431ba23dd69c…
3 https://tapatalk.com/groups/thee-t119532-s20.html?sid=674002291c431ba23dd69c…
4 https://tapatalk.com/groups/thee-t119532-s30.html?sid=674002291c431ba23dd69c…
5 https://tapatalk.com/groups/thee-t119532-s40.html?sid=674002291c431ba23dd69c…
6 https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c…
7 https://tapatalk.com/groups/thee-t119532-s60.html?sid=674002291c431ba23dd69c…
8 https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c…
9 https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c3…
10 https://tapatalk.com/groups/thee-t143332-s10.html?sid=674002291c431ba23dd69c…
11 https://tapatalk.com/groups/thee-t143332-s20.html?sid=674002291c431ba23dd69c…
12 https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c…

https://stackoverflow.com/questions/74508805
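The result above contains only the url column, whereas the desired data frame in the question also has a page_id column. One possible way to keep it, sketched under the same URL layout, is to pass remove = FALSE to unite() and then select the columns of interest. The shortened example.com URLs and the names `pages` and `result` are made up for illustration. The as.integer() call is an added guard, since R may render a large double such as 100000 in scientific notation when it is pasted into a string.

```r
library(tidyr)
library(dplyr)

# Hypothetical input: only some subpages are known for each thread
pages <- data.frame(
  url = c(
    "https://example.com/groups/thee-t1-s0.html?sid=abc",
    "https://example.com/groups/thee-t1-s30.html?sid=abc",
    "https://example.com/groups/thee-t2-s0.html?sid=abc",
    "https://example.com/groups/thee-t2-s10.html?sid=abc"
  )
)

result <- pages %>%
  # Split each URL into prefix, numeric page id, and suffix
  extract(url, into = c("url1", "page_id", "url2"),
          regex = "(.+-s)(\\d+)(\\.html.*)") %>%
  group_by(url1, url2) %>%
  # Fill in every multiple of 10 between the smallest and largest known id
  summarize(page_id = as.integer(full_seq(as.numeric(page_id), 10)),
            .groups = "drop") %>%
  # Rebuild the URL but keep the component columns
  unite("url", url1, page_id, url2, sep = "", remove = FALSE) %>%
  # Keep only the rebuilt URL and its page id
  select(url, page_id)

result
```

In this sketch page_id comes back as an integer column; wrap it in as.character() if the string form shown in the question is needed.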