Q: Scraping a web page when the filters do not change the URL

Stack Overflow user

Asked 2022-06-12 14:33:22

Answers: 2 · Views: 176 · Followers: 0 · Votes: 0

I want to scrape http://csla.history.ox.ac.uk/search.php after applying filters as follows:

clicking on 'Saint'

selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'

clicking on 'Apply Search'

I am struggling because the URL does not get updated accordingly.

The source code containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks like this:

<div class="section colm colm6" id="fl-page4-12">
<label for="item_12"class="field-label">Region of Birth/Burial</label>
<label class="field select">
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>

From the resulting page, I want to extract the IDs written in blue; for example, the first ID is E06478.

selenium

web-scraping

rvest

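2 Answers

Stack Overflow user

Accepted answer

Posted 2022-06-12 15:25:56

This is a tricky one. You need to POST the query to the server, and the query has to be in a very specific format. You can get the HTML from the page like this: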

library(httr)
library(rvest)

items <- c(998, 1, 18,89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')
s <- paste0("-----------------------------39565121210000504382566389445\n",
       "Content-Disposition: form-data; name=\"form[item_", items,
       ']\"\n', contents,
       collapse = '')
s <- paste0(s, '-----------------------------39565121210000504382566389445--')

type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')

res <- POST('http://csla.history.ox.ac.uk/results.php',
           body = charToRaw(s),
           content_type(type))
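
The hand-built string above follows the multipart/form-data layout: each field starts with `--` plus the boundary, then a Content-Disposition line, a blank line, and the value, and the body ends with the boundary followed by `--`. A minimal sketch of that assembly as a reusable function may make the structure easier to see. `build_multipart_body` and the short boundary are hypothetical names for illustration and are not part of httr. The sketch keeps the bare `\n` line endings used above; the multipart specification calls for CRLF, so a stricter server might need `eol = "\r\n"` (an assumption, untested against this site).

```r
# Hypothetical helper (not part of httr): assemble a multipart/form-data body
# from a named character vector of form fields.
build_multipart_body <- function(fields, boundary, eol = "\n") {
  parts <- paste0(
    "--", boundary, eol,
    "Content-Disposition: form-data; name=\"", names(fields), "\"", eol,
    eol,                       # blank line separates the part header from the value
    unname(fields), eol,
    collapse = ""
  )
  paste0(parts, "--", boundary, "--")  # closing delimiter
}

# Three fields as an illustration; the boundary string is made up.
fields <- c("form[item_998]" = "E", "form[item_89]" = "Gaul", "form[item_1]" = "")
body <- build_multipart_body(fields, boundary = "demo-boundary")
cat(body)
```

The boundary passed to the helper is the value that goes into the Content-Type header; the two extra leading dashes appear only inside the body.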

To get all the results into a tidy data frame, you can do this:

df <- res %>% 
  read_html() %>% 
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>% 
  html_text() %>% 
  matrix(ncol = 2, byrow = TRUE) %>% 
  as.data.frame() %>% 
  setNames(c('ID', 'Title')) %>% 
  dplyr::as_tibble()
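
To see what the XPath filter does without hitting the server, here is a small offline sketch. The HTML fragment and the IDs in it are made up for illustration and only assume that the cells to be skipped carry a LightGray style, as the XPath above suggests. It also checks that the number of remaining cells is even before reshaping them into two columns.

```r
library(rvest)

# Made-up fragment: one styled row to be skipped, two data rows.
page <- minimal_html('
  <table>
    <tr><td style="background-color:LightGray">ID</td>
        <td style="background-color:LightGray">Title</td></tr>
    <tr><td>E00001</td><td>First record</td></tr>
    <tr><td>E00002</td><td>Second record</td></tr>
  </table>')

# Keep only the cells whose style attribute does not mention LightGray.
cells <- page %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text()

# An odd count would silently misalign IDs and titles, so fail early.
stopifnot(length(cells) %% 2 == 0)

demo_df <- as.data.frame(matrix(cells, ncol = 2, byrow = TRUE))
names(demo_df) <- c("ID", "Title")
demo_df
```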

This gets all the reference IDs into a data frame. To get the actual pages, we use the IDs as query strings:

urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)

Now we need to iterate through all 900+ pages to extract the table data. It is safest to do this in a loop and then bind the list together at the end:

all_results <- list()

for(i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>% 
                       html_elements("td") %>% 
                       html_text() %>%
                       matrix(ncol = 4, byrow = TRUE) %>%
                       as.data.frame() %>%
                       setNames(c("ID", "Name", "Name_in_source", "Identity"))
}

final_result <- dplyr::bind_rows(all_results)
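
With this many requests, a single failed page would stop the loop. A possible variant, sketched here under the assumption that skipping a failed page is acceptable, wraps each fetch in `tryCatch()` and pauses between requests. `scrape_all`, `fetch_one`, and `fake_fetch` are hypothetical names; the demonstration uses a stand-in fetcher so that it runs offline.

```r
# Hypothetical wrapper: apply `fetch_one` to each URL, keep going on errors,
# and pause `delay` seconds between requests.
scrape_all <- function(urls, fetch_one, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    value <- tryCatch(
      fetch_one(urls[i]),
      error = function(e) {
        message("Skipping ", urls[i], ": ", conditionMessage(e))
        NULL
      }
    )
    # `results[i] <- list(value)` keeps the slot even when value is NULL;
    # `results[[i]] <- NULL` would delete the element instead.
    results[i] <- list(value)
    if (i < length(urls)) Sys.sleep(delay)
  }
  names(results) <- urls
  results
}

# Offline demonstration with a stand-in fetcher that fails for one URL.
fake_fetch <- function(url) {
  if (grepl("bad", url)) stop("simulated failure")
  data.frame(ID = basename(url), stringsAsFactors = FALSE)
}

pages    <- scrape_all(c("x/S1", "x/bad", "x/S2"), fake_fetch, delay = 0)
failed   <- names(pages)[vapply(pages, is.null, logical(1))]
combined <- do.call(rbind, Filter(Negate(is.null), pages))
```

Against the real site, `fetch_one` would be a function containing the `read_html()` pipeline from the loop above, and `failed` lists the URLs worth retrying.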

The final result is a data frame with over 3000 rows. Here are the first 3:

head(final_result, 3)
#>       ID                                       Name Name_in_source Identity
#> 1 S01319          Orientius, bishop of Auch, 5th c.                 Certain
#> 2 S02351 Mamertus, bishop of Vienne (Gaul), ob. 475                 Certain
#> 3 S00316                            Martyrs of Lyon                 Certain

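Some IDs are duplicated because they appear on more than one page. You can remove these with unique. Also note that when you print the data frame to the console, Greek letters may be displayed as Unicode escape sequences. The text is still intact in the underlying vector, however. For example: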

head(final_result[3])
#>                                                       Name_in_source
#> 1                                                                   
#> 2                                                                   
#> 3                                                                   
#> 4                                                                   
#> 5 <U+03A0><U+03BF><U+03BB><U+03CD><U+03BA>a<U+03C1>p<U+03BF><U+03C2>
#> 6           <U+03A0><U+03B9><U+03CC><U+03BD><U+03B9><U+03BF><U+03C2>

But

final_result[1:6, 3]
#> [1] ""          ""          ""          ""          "Πολύκαρπος" "Πιόνιος"
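
For the duplicates mentioned above, a short sketch of two ways to drop them in base R. The rows are taken from the output shown earlier, with the first one repeated on purpose to mimic an ID that appears on more than one page; `demo` is just an illustrative name.

```r
# Rows from the output above, with the first one repeated deliberately.
demo <- data.frame(
  ID = c("S01319", "S02351", "S01319", "S00316"),
  Name = c("Orientius, bishop of Auch, 5th c.",
           "Mamertus, bishop of Vienne (Gaul), ob. 475",
           "Orientius, bishop of Auch, 5th c.",
           "Martyrs of Lyon"),
  stringsAsFactors = FALSE
)

distinct_rows <- unique(demo)                  # drops rows identical in every column
distinct_ids  <- demo[!duplicated(demo$ID), ]  # keeps the first row for each ID
```

`unique()` only removes rows that match in every column, so `duplicated()` on the ID column is the stricter choice if the same ID can come back with slightly different text.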

Votes: 2

Stack Overflow user

Posted 2022-06-15 18:51:15

Just for reference, httr (or at least httr2) knows how to POST multipart/form-data, so dealing with forms like this is less daunting:

library(rvest)
library(httr2)
# multipart/form-data POST with httr2
request("http://csla.history.ox.ac.uk/results.php") %>%
  req_body_multipart(
    `form[item_998]` = "E",
    `form[item_89]`  = "Gaul",
    `form[item_999]` = "Or"
  ) %>%
  req_perform() %>%
  resp_body_string() %>% 
  # table html is broken, fix rows:
  gsub("</tr></tr>", "</tr><tr>", .) %>% 
  minimal_html() %>% 
  html_element("table") %>%
  html_table()

#> # A tibble: 1,210 × 2
#>    ID     Title                                                                 
#>    <chr>  <chr>                                                                 
#>  1 E02204 Calendar of the Church of Carthage (central North Africa) lists saint…
#>  2 E06072 The Life of *Hilary of Arles (Hilary/Hilarius, bishop of Arles, ob. 4…
#>  3 E06267 The Lives of the Abbots of Agaune *Hymnemodus, Ambrosius, Achivus, Tr…
#>  4 E06268 The Life of *Aglius (abbot of Rebais, ob. c. 650, $S02631) is written…
#>  5 E06269 The Life of *Amandus (missionary, monastic founder and bishop of Maas…
#>  6 E06270 The Martyrdom of *Andeolus the Subdeacon (martyr of Viviers, $S02362)…
#>  7 E06271 The Life of *Aper (hermit of Grenoble, $S02362) is written in Latin i…
#>  8 E06272 The Life of *Aper (bishop of Toul, ob. 6th c., $S02195) is written in…
#>  9 E06276 The Life of *Avitus (abbot of La Perche, ob. c. 525, $S01307) is writ…
#> 10 E06277 The Life of *Avitus (bishop of Vienne, ob. 519, $S01894) is written i…
#> # … with 1,200 more rows
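
The `gsub()` step in the pipeline above repairs rows that open with a stray closing tag. A tiny offline illustration of that substitution follows; the fragment is made up and only assumes the `</tr></tr>` pattern that the code comment describes. `fixed = TRUE` is added here so the pattern is treated as a literal string rather than a regular expression.

```r
# Made-up fragment: the second row starts with a stray </tr> instead of <tr>.
broken <- "<table><tr><td>A</td><td>B</td></tr></tr><td>C</td><td>D</td></tr></table>"

# Replace the doubled closing tag with a close/open pair.
repaired <- gsub("</tr></tr>", "</tr><tr>", broken, fixed = TRUE)

# Count literal occurrences of a tag in a string.
count_tag <- function(tag, x) sum(gregexpr(tag, x, fixed = TRUE)[[1]] > 0)

count_tag("<tr>", broken)    # opening row tags before the fix
count_tag("<tr>", repaired)  # opening row tags after the fix
```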

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/72593129
