Extracting dynamic dropdown menu items with rvest and RSelenium

Stack Overflow user
Asked on 2022-05-31 20:46:58
1 answer, 48 views, 0 followers, 0 votes

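I am trying to collect some menu information from a website, but I am stuck on extracting the dropdown menu items correctly.

I would like the following items:

and so on, for each dropdown menu on the distritos page.

However, when we get to Centre Badalona there is no dropdown menu, so there is nothing to collect.

For example, the code below produces the following output: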

> collectZonaPageSnapshot %>% 
+   html_nodes('.re-GeographicSearchNext-checkboxItem.is-checked.re-GeographicSearchNext-checkboxItem--has-separator')
{xml_nodeset (9)}
[1] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Artigues - Llefià" href="/es/comprar/viviendas/badalona/artigues-llefia/l"><div class="sui-MoleculeCh ...
[2] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Bonavista - Bufalà - Morera" href="/es/comprar/viviendas/badalona/bonavista-bufala-morera/l"><div cla ...
[3] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Canyet - Pomar" href="/es/comprar/viviendas/badalona/canyet-pomar/l"><div class="sui-MoleculeCheckbox ...
[4] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Casagemes - Canyadó" href="/es/comprar/viviendas/badalona/casagemes-canyado/l"><div class="sui-Molecu ...
[5] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Centre Badalona" href="/es/comprar/viviendas/badalona/centre-badalona/l"><div class="sui-MoleculeChec ...
[6] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Gorg - Progrés" href="/es/comprar/viviendas/badalona/gorg-progres/l"><div class="sui-MoleculeCheckbox ...
[7] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Montigalà - Sant Crist" href="/es/comprar/viviendas/badalona/montigala-sant-crist/l"><div class="sui- ...
[8] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Port" href="/es/comprar/viviendas/badalona/port/l"><div class="sui-MoleculeCheckboxField" style=""><d ...
[9] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Salut - Lloreda" href="/es/comprar/viviendas/badalona/salut-lloreda/l"><div class="sui-MoleculeCheckb ...
> collectZonaPageSnapshot %>% 
+   html_nodes('.re-GeographicSearchNext-checkboxItem.is-checked') 
{xml_nodeset (31)}
 [1] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Artigues - Llefià" href="/es/comprar/viviendas/badalona/artigues-llefia/l"><div class="sui-MoleculeC ...
 [2] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Artigues" href="/es/comprar/viviendas/badalona/artigues/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeFie ...
 [3] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Llefià" href="/es/comprar/viviendas/badalona/llefia/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField-- ...
 [4] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Sant Roc" href="/es/comprar/viviendas/badalona/sant-roc/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeFie ...
 [5] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Bonavista - Bufalà - Morera" href="/es/comprar/viviendas/badalona/bonavista-bufala-morera/l"><div cl ...
 [6] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Bonavista" href="/es/comprar/viviendas/badalona/bonavista/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeF ...
 [7] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Bufalà" href="/es/comprar/viviendas/badalona/bufala/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField-- ...
 [8] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Morera" href="/es/comprar/viviendas/badalona/morera/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField-- ...
 [9] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Canyet - Pomar" href="/es/comprar/viviendas/badalona/canyet-pomar/l"><div class="sui-MoleculeCheckbo ...
[10] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Canyet" href="/es/comprar/viviendas/badalona/canyet/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField-- ...
[11] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Mas Ram" href="/es/comprar/viviendas/badalona/mas-ram/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField ...
[12] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Pomar" href="/es/comprar/viviendas/badalona/pomar/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField--in ...
[13] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Casagemes - Canyadó" href="/es/comprar/viviendas/badalona/casagemes-canyado/l"><div class="sui-Molec ...
[14] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Canyadó" href="/es/comprar/viviendas/badalona/canyado/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField ...
[15] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Casagemes" href="/es/comprar/viviendas/badalona/casagemes/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeF ...
[16] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Manresà" href="/es/comprar/viviendas/badalona/manresa/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField ...
[17] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Centre Badalona" href="/es/comprar/viviendas/badalona/centre-badalona/l"><div class="sui-MoleculeChe ...
[18] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Gorg - Progrés" href="/es/comprar/viviendas/badalona/gorg-progres/l"><div class="sui-MoleculeCheckbo ...
[19] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Congrés" href="/es/comprar/viviendas/badalona/congres/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeField ...
[20] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="El Remei" href="/es/comprar/viviendas/badalona/el-remei/l"><div class="sui-MoleculeCheckboxField" style=""><div class="sui-MoleculeField sui-MoleculeFie ...

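The first part gives the "parent" menus. The second part gives both the parent menus and the child menus, but I cannot tell the parents from the children.

Expected output:

To be able to extract the URLs, names, etc. in a structure similar to the menu on the page.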

- Artigues - Llefià
-- Artigues
-- Llefià
-- Sant Roc

- Bonavista - Bufalà - Morera
-- Bonavista
-- Bufalà
-- Morera

- Canyet - Pomar
-- Canyet
-- Mas Ram
-- Pomar

and so on (currently I can only get them in a "non-tree" format; that is, I cannot tell which is a parent menu and which is a child menu)

> collectZonaPageSnapshot %>% 
+   html_nodes('.re-GeographicSearchNext-checkboxItem.is-checked') %>% 
+   html_text()
 [1] "Artigues - Llefià851"           "Artigues38"                     "Llefià714"                      "Sant Roc99"                     "Bonavista - Bufalà - Morera233" "Bonavista34"                   
 [7] "Bufalà156"                      "Morera43"                       "Canyet - Pomar53"               "Canyet6"                        "Mas Ram29"                      "Pomar18"                       
[13] "Casagemes - Canyadó40"          "Canyadó9"                       "Casagemes29"                    "Manresà2"                       "Centre Badalona141"             "Gorg - Progrés267"             
[19] "Congrés32"                      "El Remei25"                     "Gorg69"                         "Progrés - Pep Ventura132"       "Montigalà - Sant Crist209"      "Montigalà21"                   
[25] "Puigfred86"                     "Sant Crist97"                   "Port79"                         "Salut - Lloreda592"             "La Salut399"                    "Lloreda133"                    
[31] "Sistrells60"
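
The `html_text()` output above fuses each zone name with a trailing number (presumably a listing count; that reading is an assumption). A minimal sketch of splitting the two, using three labels copied from the output above. The names `labels`, `parts`, and `zones` are illustrative only, and the pattern assumes no zone name itself ends in a digit.

```r
library(stringr)

# Sample labels copied from the html_text() output above
labels <- c("Artigues - Llefià851", "Artigues38", "Centre Badalona141")

# Group 1: everything before the trailing run of digits; group 2: the digits
parts <- str_match(labels, "^(.*?)(\\d+)$")

zones <- data.frame(
  name  = str_trim(parts[, 2]),
  count = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
```

If the number turns out to matter, reading it from its own child node in the HTML would be more robust than splitting the concatenated text.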

Code:

library(RSelenium)
library(rvest)
library(tidyverse)
distrito_url_to_get = "https://www.fotocasa.es/es/comprar/viviendas/badalona/todas-las-zonas/l"


rD <- rsDriver(browser="firefox", port=4536L)
remDr <- rD[["client"]]
remDr$navigate(distrito_url_to_get)
remDr$maxWindowSize()
# click "Accept"
remDr$findElement(using = "xpath", '/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
# click on "Distrito"
remDr$findElement(using = "xpath", '/html/body/div[1]/div[2]/div[1]/div[3]/div/div[1]/div')$clickElement()

# click each of the toggle icons to "activate" the dropdowns in the HTML page
#distritoDropDownElements = remDr$findElements(using = 'css selector', '.sui-MoleculeCheckboxField')
distritoDropDownToggleIconElements = remDr$findElements(using = 'css selector', '.sui-MoleculeCheckboxField-toggleIcon')
# loop over the list that is actually defined above; seq_along() is safe when it is empty
for (i in seq_along(distritoDropDownToggleIconElements)) {
  distritoDropDownToggleIconElements[[i]]$clickElement()
}

# read in the HTML page
collectZonaPageSnapshot = remDr$getPageSource()[[1]] %>% 
  read_html()

# part 1) -collect the parent menus
collectZonaPageSnapshot %>% 
  html_nodes('.re-GeographicSearchNext-checkboxItem.is-checked.re-GeographicSearchNext-checkboxItem--has-separator') 

# part 2) -collect the parent and child menus together
collectZonaPageSnapshot %>% 
  html_nodes('.re-GeographicSearchNext-checkboxItem.is-checked') 
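
One possible way to tell parents from children in a snapshot like this is to use the `--has-separator` class, which in the output above appears only on the parent items. The sketch below runs on a small inline HTML string (a hypothetical, trimmed-down stand-in for `collectZonaPageSnapshot`, with titles and hrefs taken from the output above). It assumes the items appear in document order and that the first item is a parent. The names `snapshot`, `menu`, and `children` are illustrative.

```r
library(rvest)

# Hypothetical stand-in for the page snapshot, reduced to the relevant attributes
snapshot <- read_html('
<div>
  <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Canyet - Pomar" href="/es/comprar/viviendas/badalona/canyet-pomar/l"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Canyet" href="/es/comprar/viviendas/badalona/canyet/l"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Mas Ram" href="/es/comprar/viviendas/badalona/mas-ram/l"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Pomar" href="/es/comprar/viviendas/badalona/pomar/l"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Centre Badalona" href="/es/comprar/viviendas/badalona/centre-badalona/l"></a>
</div>')

items <- html_nodes(snapshot, ".re-GeographicSearchNext-checkboxItem.is-checked")

# Parents carry the extra "--has-separator" class; children do not
is_parent <- grepl("checkboxItem--has-separator", html_attr(items, "class"), fixed = TRUE)

menu <- data.frame(
  title     = html_attr(items, "title"),
  href      = html_attr(items, "href"),
  is_parent = is_parent,
  stringsAsFactors = FALSE
)

# Each row inherits the most recent parent seen so far in document order
menu$parent <- menu$title[is_parent][cumsum(is_parent)]

# Child rows only; parents without a dropdown (Centre Badalona) have none
children <- menu[!menu$is_parent, c("parent", "title", "href")]
```

Here `children` pairs each child with its parent, while a parent with no dropdown simply contributes no child rows.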

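1 Answer

Stack Overflow user

Answered on 2022-06-01 00:09:07

You can extract them from a script tag and return them as a dataframe. With this approach, the parent is repeated for each of its children.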

library(tidyverse)
library(rvest)
library(jsonlite)

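# Build one row per sub-location, repeating the parent's name on each row
extract_data <- function(x) {
  tibble(
    location = x$literal,
    sub_location = map(x$subLocations, "literal", pluck) %>% unlist()
  )
}

# Read the page text, which includes the contents of the script tags
p <- read_html("https://www.fotocasa.es/es/comprar/viviendas/badalona/todas-las-zonas/l") %>% html_text()
# Pull out the JSON string passed to JSON.parse() for window.__INITIAL_PROPS__
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)"')[, 2]
# Unescape the quotes so the string can be parsed as JSON
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
# The location items sit at position 4 of geographicSearch on this page
location_data <- data$initialSearch$result$geographicSearch[[4]]$items
df <- map_dfr(location_data, extract_data) 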
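
To illustrate the shape of the result without hitting the site, here is a minimal sketch that runs a variant of `extract_data` on a hand-built list. The list is a hypothetical stand-in for `location_data` that mimics only the `literal` and `subLocations` fields used above; the real JSON may contain more fields. The variant differs from the answer's function in that it explicitly writes `NA` for a parent with no sub-locations, such as Centre Badalona.

```r
library(tidyverse)

# Hypothetical stand-in for `location_data` (only the fields used above)
location_data <- list(
  list(
    literal = "Canyet - Pomar",
    subLocations = list(
      list(literal = "Canyet"),
      list(literal = "Mas Ram"),
      list(literal = "Pomar")
    )
  ),
  list(literal = "Centre Badalona", subLocations = list())
)

# Variant of extract_data: one row per child, parent name repeated
extract_data <- function(x) {
  subs <- map_chr(x$subLocations, "literal")
  # Keep a parent without children as a single row with a missing child
  if (length(subs) == 0) subs <- NA_character_
  tibble(location = x$literal, sub_location = subs)
}

df <- map_dfr(location_data, extract_data)
```

With this input, `df` has three rows for Canyet - Pomar (one per child) and one row for Centre Badalona with a missing `sub_location`.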
Votes: 2
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/72453946
