Scraping nested links

Stack Overflow user
Asked 2022-06-15 14:47:04
1 answer · 97 views · 0 followers · score 1

I want to scrape http://csla.history.ox.ac.uk/search.php after applying filters as follows:

  • clicking on 'Saint'

  • selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'

  • clicking on 'Apply Search'

I am struggling because the URL does not update accordingly.

The page source containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks like this:

<div class="section colm colm6" id="fl-page4-12">
<label for="item_12"class="field-label">Region of Birth/Burial</label>
<label class="field select">
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>

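On the results page, I want to click the IDs marked in blue (e.g., the first one is E06478).

From the page that opens (e.g., http://csla.history.ox.ac.uk/record.php?recid=E06478), I want to click the IDs listed in the "Related" table (e.g., one of them here is S01319).

From the page that opens next (e.g., http://csla.history.ox.ac.uk/record.php?recid=S01319), I want to scrape the saint ID (e.g., 'S01319'), the name (e.g., 'Orientius, bishop of Auch, 5th c.'), Reported Death Not Before, Reported Death Not After, Gender, and Type of Saint, and present them in a data frame.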


1 Answer

Stack Overflow user

Accepted answer

Answered 2022-06-15 17:51:25

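I see that you have asked a similar question before, so I will build on the solution given there.

(The initial code comes from that earlier solution. In this extension, we create new columns for the additional data and scrape the record pages again with rvest.)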
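For context, the search form sends its filters in the body of a POST request rather than in the URL, which is why the address does not change after 'Apply Search'. The sketch below shows how such a multipart body can be assembled from field ids and values. `make_part` and the boundary string `XBOUNDARY` are hypothetical names chosen for illustration; they are not part of the original answer, and the real request uses a much longer boundary.

```r
# Hypothetical helper (an illustration, not the original answer's code):
# builds one multipart/form-data part for a field named form[item_<id>].
make_part <- function(boundary, id, value) {
  paste0("--", boundary, "\n",
         "Content-Disposition: form-data; name=\"form[item_", id, "]\"\n",
         "\n", value, "\n")
}

boundary <- "XBOUNDARY"       # assumed placeholder boundary
ids      <- c(998, 89)        # two of the field ids used in the answer
values   <- c("E", "Gaul")    # their values, in the same order

# paste0 is vectorised, so this yields one part per field
parts <- make_part(boundary, ids, values)

# Join the parts and append the closing delimiter
body <- paste0(paste0(parts, collapse = ""), "--", boundary, "--")
cat(body)
```

Each part delimiter is the boundary prefixed with two extra dashes, and the final delimiter also ends with two dashes; the answer's code follows the same pattern with its own boundary.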

library(httr)
library(rvest)

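# Ids of the search form fields (form[item_<id>]); item_89 is "Region of Birth/Burial"
items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
# One value per field, in the same order as `items` (empty unless a filter is set)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')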
s <- paste0("-----------------------------39565121210000504382566389445\n",
            "Content-Disposition: form-data; name=\"form[item_", items,
            ']\"\n', contents,
            collapse = '')
s <- paste0(s, '-----------------------------39565121210000504382566389445--')

type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')

res <- POST('http://csla.history.ox.ac.uk/results.php',
            body = charToRaw(s),
            content_type(type))

df <- res %>%
  read_html() %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c('ID', 'Title')) %>%
  dplyr::as_tibble()


urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)

all_results <- list()

for(i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>%
    html_elements("td") %>%
    html_text() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame() %>%
    setNames(c("ID", "Name", "Name_in_source", "Identity"))
}

final_result <- dplyr::bind_rows(all_results)

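# continued solution ----------------------

# Labels to look up on each saint's record page.
# 'Name' already exists in final_result, so that column is overwritten here.
additional_columns <- c("Name", "Number in BH", "Reported Death Not Before",
                        "Reported Death Not After", "Gender", "Type of Saint")
final_result[, additional_columns] <- NA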

for (i in seq_along(final_result$ID)) {
  web_page <- read_html(paste0("http://csla.history.ox.ac.uk/record.php?recid=", final_result$ID[[i]]))
  temp_res <- sapply(additional_columns, function(col) web_page %>%
             html_element(xpath = paste0("//div[contains(text(),'", col, "')]")) %>%
             html_children() %>% html_text())
  final_result[i, additional_columns] <- lapply(temp_res, function(x) ifelse(!length(x), NA, x))
}
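The lookup inside the last loop can be tried without network access. The following is a minimal sketch on a made-up HTML fragment: the markup is an assumption and the real record pages may be structured differently, and `get_field` is a hypothetical helper name. It uses the same XPath pattern as the loop above (find the `div` whose text contains the label, then read its child elements) and returns `NA` when a label is absent.

```r
library(rvest)

# Assumed, simplified markup; the real record page may differ.
page <- read_html(
  "<html><body>
     <div>Gender <span>Male</span></div>
     <div>Type of Saint <span>Bishops</span></div>
   </body></html>"
)

# Hypothetical helper: return the text of the first child element of the
# div whose text contains `label`, or NA if there is no such div.
get_field <- function(page, label) {
  node <- html_element(page, xpath = paste0("//div[contains(text(),'", label, "')]"))
  if (inherits(node, "xml_missing")) return(NA_character_)
  value <- html_text(html_children(node))
  if (length(value) == 0) NA_character_ else value[[1]]
}

labels <- c("Gender", "Type of Saint", "Number in BH")
result <- vapply(labels, function(l) get_field(page, l), character(1))
print(result)
```

Returning `NA_character_` for missing labels keeps every row the same shape, so the values can be assigned into the data frame one row at a time.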
Score: 2
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/72633483
