How to create new URLs based on a string pattern

Stack Overflow user
Asked 2022-11-20 14:08:03
df <- data.frame(
  url = c(
    "https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s60.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c34e8a20217"
  ),
  page_id = c("0", "50", "60", "70", "0", "30")
)

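I have a set of URLs with subpages. The subpages follow a pattern: s0 is the first subpage, s10 is the second, then s20, s30, and so on. Many of the URLs have several subpages, but I only have the first and last subpage numbers. For example, I might have s0 and s70, but nothing from s10 to s60. What I want is a URL for every subpage identifier.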

For example, from the data frame above, I would like to obtain

df <- data.frame(
  url = c(
    "https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s10.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s20.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s30.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s40.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s60.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s10.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s20.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c34e8a20217"
  ),
  page_id = c("0", "10", "20", "30", "40", "50", "60", "70", "0", "10", "20", "30")
)

I have isolated the end of each URL and created a column holding the maximum subpage identifier, but I'm not sure where to go from there.


1 Answer

Stack Overflow user

Answered 2022-11-20 14:38:26

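Use a regex with tidyr::extract() to split the page id and the other URL components into separate columns; use tidyr::full_seq() inside a grouped summary to fill in the intermediate page ids; then unite the columns back together.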
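To make the two building blocks easier to follow, here is a minimal sketch that applies them to a single URL from the question, outside of any data frame. It uses base R's regexec() in place of tidyr::extract() purely for illustration; the variable names (u, pattern, prefix, suffix, ids) are made up for this example and do not appear in the answer.

```r
library(tidyr)

# One example URL taken from the question's data
u <- "https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c34e8a20217"

# The same regex as in the answer: three capture groups for
# everything up to "-s", the digits, and everything from ".html" on
pattern <- "(.+-s)(\\d+)(\\.html.*)"
parts <- regmatches(u, regexec(pattern, u))[[1]]

prefix  <- parts[2]  # URL up to and including "-s"
page_id <- parts[3]  # the digits, still a character string
suffix  <- parts[4]  # ".html" plus the query string

# full_seq() fills in every multiple of the period between min and max
ids <- full_seq(c(0, as.numeric(page_id)), period = 10)

# Pasting the pieces back together yields one URL per page id
urls <- paste0(prefix, ids, suffix)
```

Note that full_seq() raises an error if the observed values are not spaced at multiples of the period, so this approach assumes every known page id is a multiple of 10. The full pipeline below applies the same idea per thread.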

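library(tidyr)
library(dplyr)

df %>%
  # Split each URL into: prefix up to "-s", the page id digits, and the rest
  extract(
    url,
    into = c("url1", "page_id", "url2"),
    regex = "(.+-s)(\\d+)(\\.html.*)"
  ) %>%
  # One group per thread (same prefix and same suffix)
  group_by(url1, url2) %>%
  # Fill in every multiple of 10 between the smallest and largest page id.
  # dplyr >= 1.1.0 warns about multi-row summaries and suggests reframe().
  summarize(page_id = full_seq(as.numeric(page_id), 10), .groups = "drop") %>%
  # Glue the three pieces back into a single url column
  unite("url", url1, page_id, url2, sep = "")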
# A tibble: 12 × 1
   url                                                                          
   <chr>                                                                        
 1 https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c3…
 2 https://tapatalk.com/groups/thee-t119532-s10.html?sid=674002291c431ba23dd69c…
 3 https://tapatalk.com/groups/thee-t119532-s20.html?sid=674002291c431ba23dd69c…
 4 https://tapatalk.com/groups/thee-t119532-s30.html?sid=674002291c431ba23dd69c…
 5 https://tapatalk.com/groups/thee-t119532-s40.html?sid=674002291c431ba23dd69c…
 6 https://tapatalk.com/groups/thee-t119532-s50.html?sid=674002291c431ba23dd69c…
 7 https://tapatalk.com/groups/thee-t119532-s60.html?sid=674002291c431ba23dd69c…
 8 https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c…
 9 https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c3…
10 https://tapatalk.com/groups/thee-t143332-s10.html?sid=674002291c431ba23dd69c…
11 https://tapatalk.com/groups/thee-t143332-s20.html?sid=674002291c431ba23dd69c…
12 https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c…
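The output above contains only the url column, whereas the desired result in the question also carries page_id. A possible variant, offered as a sketch rather than as part of the original answer, keeps the id by passing remove = FALSE to unite(). It also builds the sequence as a list column and expands it with unnest(), and converts the ids to integers so that large values are not pasted in scientific notation. The input here is a reduced, assumed version of the question's data with only the first and last known subpage per thread.

```r
library(tidyr)
library(dplyr)

# Reduced input (assumption): first and last known subpage of each thread
df <- data.frame(
  url = c(
    "https://tapatalk.com/groups/thee-t119532-s0.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t119532-s70.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s0.html?sid=674002291c431ba23dd69c34e8a20217",
    "https://tapatalk.com/groups/thee-t143332-s30.html?sid=674002291c431ba23dd69c34e8a20217"
  )
)

result <- df %>%
  # Same split as in the answer: prefix, page id digits, suffix
  extract(
    url,
    into = c("url1", "page_id", "url2"),
    regex = "(.+-s)(\\d+)(\\.html.*)"
  ) %>%
  group_by(url1, url2) %>%
  # One list element per thread holding the complete id sequence;
  # as.integer() keeps ids such as 100000 from printing as 1e+05
  summarize(
    page_id = list(as.integer(full_seq(as.numeric(page_id), 10))),
    .groups = "drop"
  ) %>%
  # Expand the list column to one row per page id
  unnest(page_id) %>%
  # remove = FALSE keeps page_id alongside the rebuilt url
  unite("url", url1, page_id, url2, sep = "", remove = FALSE) %>%
  select(url, page_id)
```

With this input, result has 12 rows: page ids 0 through 70 for the first thread and 0 through 30 for the second, matching the shape of the data frame the question asks for (with page_id as integer rather than character).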
Page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/74508805
