Scraping a webpage when filters do not change the URL

Stack Overflow user
Asked 2022-06-12 14:33:22
2 answers · 176 views · 0 followers · 0 votes

I want to scrape http://csla.history.ox.ac.uk/search.php after applying filters as follows:

  • clicking on 'Saint'

  • selecting 'Gaul and Frankish kingdoms' under 'Region of Birth/Burial'

  • clicking on 'Apply Search'

I am struggling because the URL does not update accordingly.

The page source containing <option value="Gaul">Gaul and Frankish kingdoms</option> looks like this:

<div class="section colm colm6" id="fl-page4-12">
<label for="item_12"class="field-label">Region of Birth/Burial</label>
<label class="field select">
<select id="text-nine" name="form[item_89]">
<option value=""></option>
<option value="East">'The East' (unspecified)</option>
<option value="West">'The West' (unspecified)</option>
<option value="Britain">Britain and Ireland</option>
<option value="Gaul">Gaul and Frankish kingdoms</option>

From the resulting page, I want to extract the IDs shown in blue; for example, the first ID is E06478.

2 Answers

Stack Overflow user

Accepted answer

Posted 2022-06-12 15:25:56

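This is a tricky one. You need to POST a query to the server, and the query needs to be in a very particular format. You can get the html from the page like this: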

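library(httr)
library(rvest)

# Form field numbers and their values, in the order the search form sends them.
# item_998 = 'E', item_89 = 'Gaul' (region), item_999 = 'Or'; the rest are empty.
items <- c(998, 1, 18,89, 90, 2, 88, 20, 3, 4, 5, 6, 12, 13, 11, 999, 213, 214)
contents <- c('\nE\n', '\n\n', '\n\n', '\nGaul\n', rep('\n\n', 11), '\nOr\n',
              '\n\n', '\n\n')
# One multipart section per field: delimiter line, header line, blank line, value.
s <- paste0("-----------------------------39565121210000504382566389445\n",
       "Content-Disposition: form-data; name=\"form[item_", items,
       ']\"\n', contents,
       collapse = '')
# Closing delimiter ends with two extra dashes.
s <- paste0(s, '-----------------------------39565121210000504382566389445--')

# The Content-Type header must declare the same boundary used in the body.
type <- paste0('multipart/form-data; boundary=---------------------------',
               '39565121210000504382566389445')

res <- POST('http://csla.history.ox.ac.uk/results.php',
           body = charToRaw(s),
           content_type(type))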
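
To make the layout of that hand-built body easier to see, here is a minimal sketch that assembles the same kind of multipart/form-data string from a named vector. The helper name `build_multipart_body`, the `eol` argument and the short boundary string are illustrative assumptions, not part of httr. The code above joins lines with a bare "\n", so the sketch does the same by default; RFC 2046 specifies "\r\n".

```r
# Hypothetical helper: build a multipart/form-data body from named fields.
# Each part is: "--" + boundary, a Content-Disposition header, a blank line,
# then the field value.
build_multipart_body <- function(fields, boundary, eol = "\n") {
  parts <- paste0(
    "--", boundary, eol,
    "Content-Disposition: form-data; name=\"", names(fields), "\"", eol,
    eol,
    unname(fields), eol,
    collapse = ""
  )
  # The closing delimiter is the boundary wrapped in "--" on both sides.
  paste0(parts, "--", boundary, "--")
}

# Arbitrary boundary chosen for the example; any string that does not
# occur in the field values would do.
boundary <- "example-boundary-12345"
fields <- c("form[item_998]" = "E",
            "form[item_89]"  = "Gaul",
            "form[item_999]" = "Or")

body <- build_multipart_body(fields, boundary)
cat(body)
```

The matching header would then be `multipart/form-data; boundary=example-boundary-12345`, i.e. the boundary without the leading two dashes.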

To get all the results in a tidy data frame, you can do:

df <- res %>% 
  read_html() %>% 
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>% 
  html_text() %>% 
  matrix(ncol = 2, byrow = TRUE) %>% 
  as.data.frame() %>% 
  setNames(c('ID', 'Title')) %>% 
  dplyr::as_tibble()
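
The XPath filter and the reshape step can be tried offline on a small table. The HTML below is a made-up stand-in for the results page (the IDs and titles are placeholders), assuming only that header cells carry a LightGray style and data cells do not:

```r
library(rvest)

# Hypothetical two-column results table; only the header cells are styled.
sample_html <- minimal_html('
  <table>
    <tr><td style="background-color:LightGray">ID</td>
        <td style="background-color:LightGray">Title</td></tr>
    <tr><td>E00001</td><td>First example title</td></tr>
    <tr><td>E00002</td><td>Second example title</td></tr>
  </table>')

# Keep every <td> whose style attribute does not mention LightGray.
cells <- sample_html %>%
  html_elements(xpath = "//td[not(contains(@style, 'LightGray'))]") %>%
  html_text()

# The cells arrive as one flat vector (ID, Title, ID, Title, ...),
# so fill a two-column matrix row by row.
demo_df <- cells %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(c("ID", "Title"))

print(demo_df)
```

Cells with no style attribute pass the filter because `contains()` is evaluated against an empty string for them.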

This gets all the reference IDs in a data frame. To get the actual pages, we use these IDs as query strings:

urls <- paste0("http://csla.history.ox.ac.uk/record.php?recid=", df$ID)

Now we need to iterate through all 900+ pages to extract the tabular data. It is safest to do this in a loop, then bind the list together at the end:

all_results <- list()

for(i in seq_along(urls)) {
  all_results[[i]] <- read_html(urls[i]) %>% 
                       html_elements("td") %>% 
                       html_text() %>%
                       matrix(ncol = 4, byrow = TRUE) %>%
                       as.data.frame() %>%
                       setNames(c("ID", "Name", "Name_in_source", "Identity"))
}

final_result <- dplyr::bind_rows(all_results)
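
With this many requests, a single failed page would abort the whole loop. One possible variation, sketched here with hypothetical helper names (`fetch_all`, `fetch_one`, `fake_fetch`) and a stub in place of the real request, wraps each fetch in `tryCatch` and pauses between requests:

```r
# Hypothetical helper: call `fetch_one` on each URL, skipping failures.
fetch_all <- function(urls, fetch_one, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    value <- tryCatch(
      fetch_one(urls[i]),
      error = function(e) {
        message("Skipping ", urls[i], ": ", conditionMessage(e))
        NULL
      }
    )
    # Assigning NULL with [[<- would delete the slot, so only store real values.
    if (!is.null(value)) results[[i]] <- value
    Sys.sleep(delay)  # pause between requests
  }
  results
}

# Stub standing in for the real page fetch, so the sketch runs offline.
fake_fetch <- function(url) {
  if (grepl("bad", url)) stop("simulated failure")
  data.frame(ID = basename(url), stringsAsFactors = FALSE)
}

pages <- fetch_all(
  c("http://example.org/S1", "http://example.org/bad", "http://example.org/S3"),
  fake_fetch,
  delay = 0
)

# Drop the failed slots before binding the rows together.
combined <- do.call(rbind, Filter(Negate(is.null), pages))
```

In real use, `fetch_one` would be a function containing the `read_html()` pipeline from the loop above.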

The end result is a data frame with over 3000 rows. Here are the first 3:

head(final_result, 3)
#>       ID                                       Name Name_in_source Identity
#> 1 S01319          Orientius, bishop of Auch, 5th c.                 Certain
#> 2 S02351 Mamertus, bishop of Vienne (Gaul), ob. 475                 Certain
#> 3 S00316                            Martyrs of Lyon                 Certain

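Some of the IDs are duplicated because they appear on multiple pages. You can remove these with unique. Note also that Greek letters show up as Unicode escape sequences when you print the data frame to the console. However, the text is still intact in the underlying vector. For example: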

head(final_result[3])
#>                                                       Name_in_source
#> 1                                                                   
#> 2                                                                   
#> 3                                                                   
#> 4                                                                   
#> 5 <U+03A0><U+03BF><U+03BB><U+03CD><U+03BA>a<U+03C1>p<U+03BF><U+03C2>
#> 6           <U+03A0><U+03B9><U+03CC><U+03BD><U+03B9><U+03BF><U+03C2>

final_result[1:6, 3]
#> [1] ""          ""          ""          ""          "Πολύκαρπος" "Πιόνιος"  
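
As a small illustration of the de-duplication step mentioned above, here is a toy data frame (the repeated row is constructed for the example) showing two base R options:

```r
# Toy data: the first and third rows are the same saint.
demo <- data.frame(
  ID = c("S01319", "S02351", "S01319"),
  Name = c("Orientius, bishop of Auch, 5th c.",
           "Mamertus, bishop of Vienne (Gaul), ob. 475",
           "Orientius, bishop of Auch, 5th c."),
  stringsAsFactors = FALSE
)

# unique() drops rows that are identical in every column.
deduped <- unique(demo)

# duplicated() on one column keeps the first row for each ID,
# even if other columns differ between the repeats.
by_id <- demo[!duplicated(demo$ID), ]
```

Which of the two is appropriate depends on whether repeated IDs always carry identical values in the other columns.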

Votes: 2

Stack Overflow user

Posted 2022-06-15 18:51:15

Just for reference, httr (or at least httr2) knows how to post multipart/form-data, so dealing with such forms need not be so scary:

library(rvest)
library(httr2)
# multipart/form-data POST with httr2
request("http://csla.history.ox.ac.uk/results.php") %>%
  req_body_multipart(
    `form[item_998]` = "E",
    `form[item_89]`  = "Gaul",
    `form[item_999]` = "Or"
  ) %>%
  req_perform() %>%
  resp_body_string() %>% 
  # table html is broken, fix rows:
  gsub("</tr></tr>", "</tr><tr>", .) %>% 
  minimal_html() %>% 
  html_element("table") %>%
  html_table()

#> # A tibble: 1,210 × 2
#>    ID     Title                                                                 
#>    <chr>  <chr>                                                                 
#>  1 E02204 Calendar of the Church of Carthage (central North Africa) lists saint…
#>  2 E06072 The Life of *Hilary of Arles (Hilary/Hilarius, bishop of Arles, ob. 4…
#>  3 E06267 The Lives of the Abbots of Agaune *Hymnemodus, Ambrosius, Achivus, Tr…
#>  4 E06268 The Life of *Aglius (abbot of Rebais, ob. c. 650, $S02631) is written…
#>  5 E06269 The Life of *Amandus (missionary, monastic founder and bishop of Maas…
#>  6 E06270 The Martyrdom of *Andeolus the Subdeacon (martyr of Viviers, $S02362)…
#>  7 E06271 The Life of *Aper (hermit of Grenoble, $S02362) is written in Latin i…
#>  8 E06272 The Life of *Aper (bishop of Toul, ob. 6th c., $S02195) is written in…
#>  9 E06276 The Life of *Avitus (abbot of La Perche, ob. c. 525, $S01307) is writ…
#> 10 E06277 The Life of *Avitus (bishop of Vienne, ob. 519, $S01894) is written i…
#> # … with 1,200 more rows
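
For the httr side of that remark, a rough equivalent might look like the sketch below. `search_csla` is a hypothetical wrapper name, the three field names are copied from the httr2 call above, and this version has not been checked against the live server:

```r
library(httr)

# Hypothetical wrapper: POST the search form as multipart/form-data.
# encode = "multipart" asks httr to build the boundary and the parts itself.
search_csla <- function(region = "Gaul") {
  POST(
    "http://csla.history.ox.ac.uk/results.php",
    body = list(
      `form[item_998]` = "E",
      `form[item_89]`  = region,
      `form[item_999]` = "Or"
    ),
    encode = "multipart"
  )
}

# Calling the function performs a live request, so it is left commented out:
# res <- search_csla("Gaul")
```

The response could then be parsed with the same rvest steps shown in either answer.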

Votes: 1

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/72593129