
Question: 'NA' does not exist in current working directory (web scraping with a for loop)

Stack Overflow user

Asked 2021-01-27 16:30:16

2 answers, 200 views, 0 followers, 0 votes

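I am trying to scrape data from the tables for all German cities listed on this page (https://de.wikipedia.org/wiki/Liste_der_Orte_mit_Stolpersteinen#Deutschland). In the first 5 steps I obtain the URLs of all the cities, which works fine.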

library(tidyverse)
library(rvest)
library(rlist)
library(stringi)
library(htmltab)
library(foreign)

#1 url Germany 
url = "https://de.wikipedia.org/wiki/Liste_der_Orte_mit_Stolpersteinen#Deutschland"

#2 get url endings of all Cities
city_urls = url %>%
  read_html() %>%
  html_nodes(xpath = '//td[7]/a') %>% 
  html_attr("title")

#3 subset German url endings 
city_urls = as.data.frame(city_urls[19:1013])

#4 concatenate url start and endings
URLs_germany = c()
for (cities in city_urls) {      
  URLs_germany <- paste0('https://de.wikipedia.org/wiki/', cities, '') 
}

#5 correction of urls -> add missing "_" between the words 
Stolpersteine_cities = as.factor(str_replace_all(URLs_germany, " ", "_"))

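The problem arises in step 6. With this for loop I want to get all the data from the individual pages as well as the geodata. When I run it, I get the error "'NA' does not exist in current working directory". I have already looked at the related Stack Overflow question (Error: 'NA' does not exist in current working directory (Webscraping)), but I could not apply the solutions mentioned there to my case.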

#6 loop through all 
for (i in Stolpersteine_cities) {
  
  city <- read_html(i)
  
  sample <- city %>%
    html_node(xpath = '//*[@id="mw-content-text"]/div/table') %>% 
    html_table()
  
  #find geolocation
  geo_link <- city %>%
    html_node(xpath = '//*[@text()="Standort"]')%>% 
    html_attr("href")
  
  geo_links <- city %>%
    html_nodes("table") %>% 
    # html_nodes("thead") %>% 
    html_nodes("tbody") %>% 
    html_nodes("tr") %>% html_nodes("td") %>% 
    html_nodes("small") %>%
    html_nodes("a") %>%
    html_attr("href")
    
  long_lat_list <- vector("list", nrow(sample)) 
  #find geo location
  for(k in 1:length(geo_links)){
    
    geo_info <- read_html(geo_links[k])
    
    lat <- geo_info%>%
      html_node(xpath = '//span[@class="latitude"]')%>%
      html_text()
    
    long <- geo_info%>%
      html_node(xpath = '//*[@class="longitude"]')%>%
      html_text()
    
    long_lat_list[[k]] <- list(latitude=lat, longitude=long)
    
  }
  
  sample$latitude <- lapply(long_lat_list, "[[", 1)
  sample$longitude <- lapply(long_lat_list, "[[", 2)
  
  #Save City X 
  saveRDS(sample, "filename.Rds")
  
}

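I then tried running the for loop with only the first 4 cities/URLs. The first two URLs work, but the third one produces the error above. However, I cannot find any difference in the Wikipedia tables, and I do not know where the problem lies.

I would appreciate any help.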

web-scraping

2 Answers

Stack Overflow user

Answered 2021-01-27 23:43:15

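The error occurs on the line geo_info <- read_html(geo_links[k]). The problem is that geo_links is empty. When you evaluate 1:length(geo_links), it therefore returns the vector c(1, 0), and the for loop is entered anyway.

Then, in geo_info <- read_html(geo_links[k]), it tries to access the first element of the vector geo_links. Because the vector is empty, this returns NA. When read_html tries to read it, it raises this error message (I think it is trying to read a file named NA in the working directory).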
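
The mechanism described here can be reproduced without any scraping. The following is a minimal base R sketch, assuming geo_links came back as an empty character vector, as it apparently does for the failing page; the variable names bad_index, good_index, first_link, and visited are made up for illustration. It also shows seq_along(), which yields an empty index for an empty vector.

```r
# Stand-in for the result of the selector chain on the failing page.
geo_links <- character(0)

# 1:length(x) counts DOWN from 1 to 0 when x is empty,
# so a for loop over it runs twice instead of zero times.
bad_index <- 1:length(geo_links)

# seq_along(x) is empty when x is empty, so the loop body is skipped.
good_index <- seq_along(geo_links)

# Indexing past the end of a character vector gives NA, not an error.
first_link <- geo_links[1]

# A loop over seq_along() never touches the empty vector.
visited <- 0
for (k in seq_along(geo_links)) {
  visited <- visited + 1
}

print(bad_index)
print(good_index)
print(first_link)
print(visited)
```

Writing the loop header as for (k in seq_along(geo_links)) is a common alternative to an explicit length check, though it does not by itself handle the columns assigned after the loop.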

So you should test the length of geo_links and only enter the for loop if length(geo_links) > 0.

  if (length(geo_links) > 0) {
    for(k in 1:length(geo_links)){
      
      geo_info <- read_html(geo_links[k])
      
      lat <- geo_info%>%
        html_node(xpath = '//span[@class="latitude"]')%>%
        html_text()
      
      long <- geo_info%>%
        html_node(xpath = '//*[@class="longitude"]')%>%
        html_text()
      
      long_lat_list[[k]] <- list(latitude=lat, longitude=long)
      
    }
    
    sample$latitude <- lapply(long_lat_list, "[[", 1)
    sample$longitude <- lapply(long_lat_list, "[[", 2)
  }

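The reason some of these links produce an empty vector is that the tables are not exactly the same across the different pages.

You look for the geolocation data in nodes tagged "small". That works for the first two pages but not for the third. On the third page there are no "small" nodes, and the geolocation data is marked up differently.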
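
To illustrate the difference, here is a small sketch using rvest on two inline HTML snippets. The markup, the example.org URLs, and the helper name get_geo_links are invented for this example and are not taken from the actual Wikipedia pages; the single CSS selector is only a compact stand-in for the chain of html_nodes() calls in the question.

```r
library(rvest)

# Toy page where the map link is wrapped in <small> (invented markup).
page_with_small <- read_html(
  '<table><tr><td><small><a href="https://example.org/geo1">Standort</a></small></td></tr></table>'
)

# Toy page where the same link is NOT wrapped in <small>.
page_without_small <- read_html(
  '<table><tr><td><a href="https://example.org/geo2">Standort</a></td></tr></table>'
)

# Hypothetical helper: collect the href of every link inside table td small.
get_geo_links <- function(page) {
  page %>%
    html_nodes("table td small a") %>%
    html_attr("href")
}

links_a <- get_geo_links(page_with_small)
links_b <- get_geo_links(page_without_small)

# The second page yields an empty character vector rather than an error.
print(length(links_a))
print(length(links_b))
```

Checking length() of the result for each city before looping over it is one way to spot which pages use a different layout.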

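Votes: 0

Stack Overflow user

Answered 2021-01-28 00:34:53

This is just a quick proof of concept showing that you can also get the same data from the OpenStreetMap database using R. Since this is my first attempt at it, it is not a complete solution, but it should get you started, and in my opinion it is the better approach because the data is better structured.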

library(tidyverse)
library(osmdata)
library(data.table) 


q <- getbb("Baden-Baden") %>%
  opq() %>%
  add_osm_feature(key = "historic",
                  value = "memorial") 
str(q) 

result <- osmdata_sf(q)

dt <- result$osm_points %>% data.table() 
  
dt[memorial.type == "stolperstein",]
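
As a possible next step, and only as a sketch: the point layers returned by osmdata_sf() are sf objects, so longitude and latitude can be extracted with sf::st_coordinates(). The example below uses a small hand-made sf object instead of real OSM data so that it runs offline; the column name memorial_type and all coordinate values are invented for illustration.

```r
library(sf)

# Hand-made stand-in for result$osm_points (values and column name are invented).
points <- st_sf(
  memorial_type = c("stolperstein", "plaque", "stolperstein"),
  geometry = st_sfc(
    st_point(c(8.2410, 48.7610)),  # c(longitude, latitude)
    st_point(c(8.2420, 48.7620)),
    st_point(c(8.2430, 48.7630)),
    crs = 4326
  )
)

# Keep only the rows of interest.
stolpersteine <- points[points$memorial_type == "stolperstein", ]

# st_coordinates() returns a matrix with columns X (longitude) and Y (latitude).
coords <- st_coordinates(stolpersteine)
stolpersteine$longitude <- coords[, "X"]
stolpersteine$latitude  <- coords[, "Y"]

print(stolpersteine)
```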

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65915412
