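I want to scrape http://csla.history.ox.ac.uk/search.php after applying a filter, as follows: under 'Saint', click the filter option.
I am struggling because the URL is not updated accordingly.
The source code containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks like this: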
<div class="section colm colm6" id="fl-page4-12">
<label for="item_12"class="field-label">Region of Birth/Burial</label>
<label class="field select">
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>
On the resulting page, I want to click the IDs marked in blue (for example, the first one is E06478).
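From the page that opens (e.g. http://csla.history.ox.ac.uk/record.php?recid=E06478), I want to click the ID listed in the "Related" table (e.g. S01319 here).
From the page that opens next (e.g. http://csla.history.ox.ac.uk/record.php?recid=S01319), I want to scrape the saint ID (e.g. 'S01319'), Name (e.g. 'Orientius, bishop of Auch, 5th c.'), Reported Death Not Before, Reported Death Not After, Gender, and Type of Saint, and present them in a data frame.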
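Posted on 2022-06-15 17:51:25
I know you have asked a similar question before, so I will continue from the solution given there.
(The initial code comes from that solution; in this extension, we create new columns for the additional data and scrape them with rvest again.)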
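The label lookup that the extension relies on can be tried offline before running the full script. The sketch below is only an illustration: the HTML fragment is made up and assumes that each label is the text of a <div> whose child element holds the value (inferred from the XPath used further down, not checked against the live site), and extract_field() is a hypothetical helper, not an rvest function.

```r
library(rvest)

# Made-up fragment imitating the ASSUMED record-page layout:
# the label is the text of a <div>, the value sits in a child element.
page <- minimal_html('
  <div>Name<div>Example saint</div></div>
  <div>Gender<div>Male</div></div>
')

# Hypothetical helper: return the text of the first child of the <div>
# whose text contains `label`, or NA if no such <div> (or child) exists.
extract_field <- function(page, label) {
  node <- html_element(page, xpath = paste0("//div[contains(text(),'", label, "')]"))
  if (inherits(node, "xml_missing")) return(NA_character_)
  value <- html_text(html_children(node))
  if (length(value) == 0) NA_character_ else value[[1]]
}

# "Type of Saint" is deliberately absent from the fragment to show the NA case.
fields <- c("Name", "Gender", "Type of Saint")
record <- vapply(fields, function(f) extract_field(page, f), character(1))
```

Returning NA for a missing label keeps every record the same length, so the values can be assigned into a fixed set of data frame columns.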
library(httr)
library(rvest)
# Form field numbers and the value submitted for each; item_89 carries the 'Gaul' filter
items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')
# Build the multipart/form-data body by hand, one part per form field
s <- paste0("-----------------------------39565121210000504382566389445\n",
            "Content-Disposition: form-data; name=\"form[item_", items,
            ']\"\n', contents,
            collapse = '')
s <- paste0(s, '-----------------------------39565121210000504382566389445--')
type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')
# Submit the search form; the filtered results come back in the response
res <- POST('http://csla.history.ox.ac.uk/results.php',
            body = charToRaw(s),
            content_type(type))
df <- res %>%
  read_html() %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c('ID', 'Title')) %>%
  dplyr::as_tibble()
# Visit each evidence record and collect its <td> cells (the "Related" table), four per row
urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)
all_results <- list()
for (i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>%
    html_elements("td") %>%
    html_text() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame() %>%
    setNames(c("ID", "Name", "Name_in_source", "Identity"))
}
final_result <- dplyr::bind_rows(all_results)
# continued solution ----------------------
# Columns to fill from each saint's record page.
# Note: "Name" already exists from the previous step and is overwritten here.
additional_columns <- c("Name", "Number in BH", "Reported Death Not Before",
                        "Reported Death Not After", "Gender", "Type of Saint")
final_result[, additional_columns] <- NA
for (i in seq_along(final_result$ID)) {
  web_page <- read_html(paste0("http://csla.history.ox.ac.uk/record.php?recid=",
                               final_result$ID[[i]]))
  # For each label, find the <div> whose text contains it and read its children
  temp_res <- sapply(additional_columns, function(col) web_page %>%
                       html_element(xpath = paste0("//div[contains(text(),'", col, "')]")) %>%
                       html_children() %>% html_text())
  # Store NA where a label was not found on the page
  final_result[i, additional_columns] <- lapply(temp_res, function(x) ifelse(!length(x), NA, x))
}
https://stackoverflow.com/questions/72633483