Error when scraping Amazon reviews in RStudio: arguments imply differing number of rows: 3, 10

Stack Overflow user

Asked 2021-02-25 20:34:33

Answers: 1 · Views: 120 · Followers: 0 · Votes: 0

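Hi, I'm a beginner at coding with only a marginal understanding of how things work.

I'm currently trying to scrape Amazon reviews in RStudio using the rvest package.

My goal is to collect 10 pages of reviews for each of 400 product IDs (ASINs).

The function I'm using is as follows: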

scrape_amazon <- function(url, throttle = 0){
  
  # Install / Load relevant packages
  if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
  pacman::p_load(RCurl, XML, dplyr, stringr, rvest, purrr)
  
  # Set throttle between URL calls
  sec = 0
  if(throttle < 0) warning("throttle was less than 0: set to 0")
  if(throttle > 0) sec = max(0, throttle + runif(1, -1, 1))
  
  # obtain HTML of URL
  doc <- read_html(url)
  
  # Parse relevant elements from HTML
  title <- doc %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()
  
  author <- doc %>%
    html_nodes("#cm_cr-review_list .a-profile-name") %>%
    html_text()
  
  date <- doc %>%
    html_nodes("#cm_cr-review_list .review-date") %>%
    html_text() %>% 
    gsub(".*on ", "", .)
  
  
  stars <- doc %>%
    html_nodes("#cm_cr-review_list  .review-rating") %>%
    html_text() %>%
    str_extract("\\d") %>%
    as.numeric() 
  
  comments <- doc %>%
    html_nodes("#cm_cr-review_list .review-text") %>%
    html_text() 
  
  n_helpful <- doc %>%
    html_nodes(".a-expander-inline-container") %>%
    html_text() 
  
  
  # Combine attributes into a single data frame
  n_helpful <- data.frame(n_helpful, stringsAsFactors = FALSE)
  n_helpful <- n_helpful[-1,]
  df2 <- data.frame(title, author, date, stars, comments, stringsAsFactors = FALSE)
  dff <- cbind(n_helpful, df2)
  return(dff)
}

Then I scraped a single page to make sure it works:

url <- "http://www.amazon.com/product-reviews/B00836Y6B2/?pageNumber=1"
reviews <- scrape_amazon(url)

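...and confirmed that it works.

Then I set the number of pages to scrape to 10, read the csv file containing the list of ASINs, and created an empty data frame to which the reviews for each ASIN are appended.

Running the following code successfully returned all reviews for ASINs 1 to 10.

However, when I tried ASINs 11 to 20, the following error message appeared.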

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 3, 10

# Set # of pages to scrape. Note: each page contains 8 reviews.
pages <- 10


# loop over pages
Asins <- read.csv("ASINs.csv")

reviews_total <- data.frame()

for(prod_cod in Asins[1:10,]){ 
    for(page_num in 1:pages){
    url <- paste0("http://www.amazon.com/product-reviews/",prod_cod,"/?pageNumber=", page_num)
    reviews1 <- scrape_amazon(url, throttle = 4)
    reviews_total <-rbind(reviews_total, reviews1)}
  }

for(prod_cod in Asins[11:20,]){ 
  for(page_num in 1:pages){
    url <- paste0("http://www.amazon.com/product-reviews/",prod_cod,"/?pageNumber=", page_num)
    reviews1 <- scrape_amazon(url, throttle = 4)
    reviews_total <-rbind(reviews_total, reviews1)}
}

I'm confused now, because the error is not consistent.
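The error can be reproduced without any scraping, which may help narrow down where it comes from. The sketch below is an assumption about the cause, not a confirmed diagnosis: it supposes that on some pages one selector (here `n_helpful`) matches fewer elements than the others. The call quoted in the message, `data.frame(..., check.names = FALSE)`, is what `cbind()` runs internally when one of its arguments is a data frame. All sample values are made up for illustration.

```r
# Made-up stand-ins for the vectors returned by html_text():
# ten reviews on the page, but only three matches for another selector.
title     <- paste("title", 1:10)
comments  <- paste("comment", 1:10)
n_helpful <- c("x", "y", "z")

df2 <- data.frame(title, comments, stringsAsFactors = FALSE)

# cbind() on a data frame calls data.frame(..., check.names = FALSE);
# capture the error message instead of stopping the script.
err <- tryCatch(
  cbind(n_helpful, df2),
  error = function(e) conditionMessage(e)
)
print(err)

# Comparing the length of every field shows which selector is out of step.
field_lengths <- lengths(list(
  n_helpful = n_helpful,
  title     = title,
  comments  = comments
))
print(field_lengths)
mismatched <- length(unique(field_lengths)) > 1
```

Printing the same length check inside `scrape_amazon()` just before the data frames are built would show, for the failing URL, which field has 3 elements and which has 10.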

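My code is based on the instructions at this link: https://justrthings.com/2019/03/03/web-scraping-amazon-reviews-march-2019/

But much of that code didn't work for me, so I changed it a bit.

If you need more information to solve this problem, please let me know.

Thanks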

web-scraping

amazon

rvest

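1 Answer

Stack Overflow user

Answered 2021-08-11 10:59:10

I'm not an expert, but I have the following code that may help you. My goal is to get the reviews of certain products, which are stored in an Excel file connected directly to several databases. I will call it "csv".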

The csv has this layout:

| Product_name | ASIN |
|-|-|
| chr | chr |
| OWLURAL | B08F3ZN25P |
| ... 348 more rows | |

Here are the steps I followed:

Create a global df in which to store the information I need

    resultado_GLOBAL <- data.frame()

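Also create a variable that is incremented by 1 on each loop the script runs, used to access each row of the csv, where all the product names and ASINs are stored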

    contador <- 1

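We need one loop over all the different ASINs and another loop to visit each review page, where there will be 10 reviews per page. For this we use for(x in i){}, where x is the value that changes on each iteration and i is the set of values supplied to x.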
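As a minimal sketch of that nested-loop structure, the example below runs offline. `fetch_page()` is a hypothetical stand-in for the real scraping code, and `"BAD0000000"` is a made-up ASIN used only to simulate a page that fails. Wrapping each call in `tryCatch()` is one possible way to keep a single bad page from stopping the whole run.

```r
# Hypothetical stand-in for a real scraper: returns two fake reviews per page
# and fails on purpose for one made-up ASIN.
fetch_page <- function(asin, page) {
  if (asin == "BAD0000000") stop("simulated scraping failure")
  data.frame(
    ASIN   = asin,
    page   = page,
    review = paste("review", 1:2),
    stringsAsFactors = FALSE
  )
}

asins <- c("B08F3ZN25P", "BAD0000000")
pages <- 2

results <- list()          # one data frame per successfully scraped page
failed  <- character(0)    # "ASIN:page" labels for pages that raised an error

for (j in asins) {                 # outer loop: one iteration per ASIN
  for (i in seq_len(pages)) {      # inner loop: one iteration per review page
    page_df <- tryCatch(fetch_page(j, i), error = function(e) NULL)
    if (is.null(page_df)) {
      failed <- c(failed, paste0(j, ":", i))
      next
    }
    results[[length(results) + 1]] <- page_df
  }
}

# Bind everything once at the end instead of growing a data frame in the loop.
all_reviews <- do.call(rbind, results)
print(failed)
```

Recording the failed ASIN and page pairs makes it possible to inspect or retry just those pages afterwards.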

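Let's go into each ASIN and find out how many reviews it has; then we divide that number by 10 (reviews per page) to determine how many pages we need to check for each ASIN.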

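    for(j in csv$ASIN){
      # Build the URL of the first review page for this ASIN
      url_2<-paste0("https://www.amazon.es/Kensington-1500109ES-Teclado-USB-cable/product-reviews/",j,"/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews")
      pagina_reviews2<-read_html(url_2)
      val_total = pagina_reviews2 %>%
        html_nodes(".a-color-secondary") %>%
        html_text()
      # str_extract() returns a character string, so convert it before dividing;
      # round up so that a partially filled last page is still counted
      val_total = ceiling(as.numeric(str_extract(val_total[1], "[0-9]+")) / 10)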
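The page-count arithmetic can be checked on its own, without downloading anything. The sketch below uses base R only; `count_pages()` is a hypothetical helper, and the sample strings are made up rather than taken from a real Amazon page. It assumes the review count is the first run of digits in the text and does not handle thousands separators.

```r
# Hypothetical helper: turn a text such as "37 valoraciones" into the number
# of review pages, assuming `per_page` reviews on each page.
count_pages <- function(count_text, per_page = 10) {
  # First run of digits in the text; character(0) when there is none
  digits <- regmatches(count_text, regexpr("[0-9]+", count_text))
  if (length(digits) == 0) return(0L)
  # The match is a character string, so convert it before dividing,
  # and round up so a partially filled last page is still counted
  as.integer(ceiling(as.numeric(digits) / per_page))
}

pages_37   <- count_pages("37 valoraciones")
pages_10   <- count_pages("10 valoraciones")
pages_none <- count_pages("sin valoraciones")

print(c(pages_37, pages_10, pages_none))
```

Looping with `seq_len(pages_none)` rather than `1:pages_none` then skips products without reviews, because `seq_len(0)` is empty while `1:0` counts down from 1 to 0.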

Assuming every ASIN has at least one review, we create empty variables to store the information we want to get from the links

  review_bodies = c()
  review_titles = c()

Now, just by changing the page number at the end of a link, we can access every review page for each product.

for(i in seq_len(ceiling(val_total))){
    url<-paste0("https://www.amazon.es/NGS-FORTRESS900V2-Sistema-alimentaci%C3%B3n-ininterrumpida/product-reviews/",j,"/ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=",i)
  
    
    # Download each page
    pagina_reviews<-read_html(url)
    print(pagina_reviews)
    
    # We get all the review information, including whether the purchase was verified, different styles, colours, etc.
    review_body_page <- pagina_reviews %>%
      html_nodes(".review-data") %>%
      html_text()
    
    # Append this page's bodies, strip newlines and drop empty entries
    review_bodies = c(review_bodies, review_body_page)
    review_bodies <- gsub("\\n","",review_bodies)
    review_bodies <- review_bodies[nchar(as.vector(review_bodies)) != 0]
    
    # Filter out the product-variant lines (size, power, colour) that Amazon adds per category
    review_bodies <- review_bodies[ !grepl("^Tama", review_bodies)]
    review_bodies <- review_bodies[ !grepl("^Potencia:", review_bodies)]
    review_bodies <- review_bodies[ !grepl("^Color:", review_bodies)]
        
    
    # Now we need the title of each review
    review_title_page<- pagina_reviews %>%
      html_nodes(".a-text-bold span") %>%
      html_text()
    
    review_titles = c(review_titles, review_title_page)
    review_titles = review_titles[nchar(as.vector(review_titles)) != 0]

From here it depends on what you need; you can capture more variables inside the loop.

Now you can combine review_titles and review_bodies and store them in the df resultado_GLOBAL created at the beginning.
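One possible way to do that final step is sketched below, using made-up sample vectors in place of scraped data. `combine_reviews()` is a hypothetical helper, not part of the answer's code. It pads the shorter vector with `NA` so that `data.frame()` does not fail with "arguments imply differing number of rows". The padding only avoids the error; it cannot tell which review was missing a field, so the rows may be misaligned whenever the lengths differ.

```r
# Made-up stand-ins for the vectors collected in the loop;
# one body is missing on purpose.
review_titles <- c("Great", "Bad", "OK")
review_bodies <- c("Works well", "Broke quickly")

# Hypothetical helper: build one data frame per ASIN, padding with NA
# when the two vectors have different lengths.
combine_reviews <- function(asin, titles, bodies) {
  n <- max(length(titles), length(bodies))
  length(titles) <- n   # assigning a longer length pads with NA
  length(bodies) <- n
  data.frame(
    ASIN  = rep(asin, n),
    title = titles,
    body  = bodies,
    stringsAsFactors = FALSE
  )
}

resultado_GLOBAL <- data.frame()
resultado_GLOBAL <- rbind(
  resultado_GLOBAL,
  combine_reviews("B08F3ZN25P", review_titles, review_bodies)
)
print(resultado_GLOBAL)
```

A more robust alternative would be to select each review's container node first and extract the title and body from within it, so that every row comes from a single review.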

Votes: 0

The original content of this page is provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/66368577
