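I want to scrape http://csla.history.ox.ac.uk/search.php after applying a filter, as follows:
Click the option under 'Saint'.
I am struggling because the URL is not updated accordingly.
The source code containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks like this: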
<div class="section colm colm6" id="fl-page4-12">
<label for="item_12"class="field-label">Region of Birth/Burial</label>
<label class="field select">
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>
From the resulting page, I want to extract the IDs shown in blue; for example, the first ID is E06478.
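Posted on 2022-06-12 15:25:56
This is a tricky one. You need to POST the query to the server, and the query has to be in a very specific format. You can get the HTML from the page like this: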
library(httr)
library(rvest)
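# Form field numbers sent to results.php, paired below with the value for each
items <- c(998, 1, 18, 89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
# item_998 = "E", item_89 = "Gaul" (the region filter), item_999 = "Or"; the rest are blank
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')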
s <- paste0("-----------------------------39565121210000504382566389445\n",
"Content-Disposition: form-data; name=\"form[item_", items,
']\"\n', contents,
collapse = '')
s <- paste0(s, '-----------------------------39565121210000504382566389445--')
type <- paste0('multipart/form-data; boundary=---------------------------',
'39565121210000504382566389445')
res <- POST('http://csla.history.ox.ac.uk/results.php',
            body = charToRaw(s),
            content_type(type))
To get all the results in a tidy data frame, you can do this:
df <- res %>%
  read_html() %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text() %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c('ID', 'Title')) %>%
  dplyr::as_tibble()
This puts all the reference IDs in a data frame. To get the actual pages, we use the IDs in these query strings:
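urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)
Now we need to iterate through all 900+ pages to extract the tabular data. It is safest to do this in a loop and then bind the list together at the end: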
all_results <- list()
for (i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>%
    html_elements("td") %>%
    html_text() %>%
    matrix(ncol = 4, byrow = TRUE) %>%
    as.data.frame() %>%
    setNames(c("ID", "Name", "Name_in_source", "Identity"))
}
final_result <- dplyr::bind_rows(all_results)
The final result is a data frame with over 3000 rows. Here are the first 3:
head(final_result, 3)
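#>       ID                                       Name Name_in_source Identity
#> 1 S01319          Orientius, bishop of Auch, 5th c.                 Certain
#> 2 S02351 Mamertus, bishop of Vienne (Gaul), ob. 475                 Certain
#> 3 S00316                            Martyrs of Lyon                 Certain
Some IDs are duplicated because they appear on multiple pages. You can use unique to remove these. Also note that when you print the data frame to the console, Greek letters are displayed as Unicode escape sequences. However, the text is still present in the underlying vector. For example: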
head(final_result[3])
#> Name_in_source
#> 1
#> 2
#> 3
#> 4
#> 5 <U+03A0><U+03BF><U+03BB><U+03CD><U+03BA>a<U+03C1>p<U+03BF><U+03C2>
#> 6 <U+03A0><U+03B9><U+03CC><U+03BD><U+03B9><U+03BF><U+03C2>
But:
final_result[1:6, 3]
#> [1] "" "" "" "" "Πολύκαρπος" "Πιόνιος"
Posted on 2022-06-15 18:51:15
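As a minimal sketch of the de-duplication step mentioned in the first answer, the following uses a small made-up data frame. The name `toy_result` and all of its rows are invented for illustration and are not output from the site. Base R's `unique()` drops rows that are identical in every column, while `duplicated()` on the ID column keeps only the first row seen for each ID:

```r
# Toy stand-in for final_result; the rows are invented for illustration only
toy_result <- data.frame(
  ID       = c("S00001", "S00002", "S00001", "S00003"),
  Name     = c("Saint A", "Saint B", "Saint A", "Saint C"),
  Identity = c("Certain", "Certain", "Certain", "Uncertain")
)

# Drop rows that are exact duplicates across every column
deduped_rows <- unique(toy_result)

# Alternatively, keep only the first row seen for each ID
deduped_ids <- toy_result[!duplicated(toy_result$ID), ]

nrow(deduped_rows)
nrow(deduped_ids)
```

Which variant fits depends on whether rows sharing an ID are expected to be identical in the other columns as well.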
Just for reference, httr (or at least httr2) knows how to POST multipart/form-data, so dealing with these forms need not be so daunting:
library(rvest)
library(httr2)
# multipart/form-data POST with httr2
request("http://csla.history.ox.ac.uk/results.php") %>%
  req_body_multipart(
    `form[item_998]` = "E",
    `form[item_89]` = "Gaul",
    `form[item_999]` = "Or"
  ) %>%
  req_perform() %>%
  resp_body_string() %>%
  # table html is broken, fix rows:
  gsub("</tr></tr>", "</tr><tr>", .) %>%
  minimal_html() %>%
  html_element("table") %>%
  html_table()
#> # A tibble: 1,210 × 2
#>    ID     Title
#>    <chr>  <chr>
#>  1 E02204 Calendar of the Church of Carthage (central North Africa) lists saint…
#>  2 E06072 The Life of *Hilary of Arles (Hilary/Hilarius, bishop of Arles, ob. 4…
#>  3 E06267 The Lives of the Abbots of Agaune *Hymnemodus, Ambrosius, Achivus, Tr…
#>  4 E06268 The Life of *Aglius (abbot of Rebais, ob. c. 650, $S02631) is written…
#>  5 E06269 The Life of *Amandus (missionary, monastic founder and bishop of Maas…
#>  6 E06270 The Martyrdom of *Andeolus the Subdeacon (martyr of Viviers, $S02362)…
#>  7 E06271 The Life of *Aper (hermit of Grenoble, $S02362) is written in Latin i…
#>  8 E06272 The Life of *Aper (bishop of Toul, ob. 6th c., $S02195) is written in…
#>  9 E06276 The Life of *Avitus (abbot of La Perche, ob. c. 525, $S01307) is writ…
#> 10 E06277 The Life of *Avitus (bishop of Vienne, ob. 519, $S01894) is written i…
#> # … with 1,200 more rows
https://stackoverflow.com/questions/72593129
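The row repair in the httr2 answer (the `gsub()` call that turns `</tr></tr>` into `</tr><tr>`) can be tried offline. The sketch below runs it on a tiny hand-written snippet. The string `broken_html` and its two records are invented to imitate the malformed markup and are not copied from the site, so treat this as an illustration of the idea rather than a reproduction of the real response:

```r
library(rvest)

# Invented snippet imitating the malformed markup: the first data row is
# closed with "</tr></tr>" and the next row is missing its opening "<tr>"
broken_html <- paste0(
  "<table>",
  "<tr><th>ID</th><th>Title</th></tr>",
  "<tr><td>E00001</td><td>First record</td></tr></tr>",
  "<td>E00002</td><td>Second record</td></tr>",
  "</table>"
)

# Replace the doubled closing tag with a close/open pair; fixed = TRUE
# treats the pattern as a literal string rather than a regular expression
fixed_html <- gsub("</tr></tr>", "</tr><tr>", broken_html, fixed = TRUE)

# Parse the repaired markup and pull the table out as a tibble
records <- fixed_html %>%
  minimal_html() %>%
  html_element("table") %>%
  html_table()

records
```

Repairing the string before parsing keeps the fix independent of how the HTML parser happens to recover from unbalanced tags.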