Hi, I'm a beginner at coding with only a marginal understanding of how things work.
I'm currently trying to scrape Amazon reviews in RStudio using the rvest package.
My goal is to collect 10 pages of reviews for each of 400 product IDs (ASINs).
The function I'm using is as follows:
```r
scrape_amazon <- function(url, throttle = 0){
  # Install / Load relevant packages
  if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
  pacman::p_load(RCurl, XML, dplyr, stringr, rvest, purrr)

  # Set throttle between URL calls
  sec = 0
  if(throttle < 0) warning("throttle was less than 0: set to 0")
  if(throttle > 0) sec = max(0, throttle + runif(1, -1, 1))

  # Obtain HTML of URL
  doc <- read_html(url)

  # Parse relevant elements from HTML
  title <- doc %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()
  author <- doc %>%
    html_nodes("#cm_cr-review_list .a-profile-name") %>%
    html_text()
  date <- doc %>%
    html_nodes("#cm_cr-review_list .review-date") %>%
    html_text() %>%
    gsub(".*on ", "", .)
  stars <- doc %>%
    html_nodes("#cm_cr-review_list .review-rating") %>%
    html_text() %>%
    str_extract("\\d") %>%
    as.numeric()
  comments <- doc %>%
    html_nodes("#cm_cr-review_list .review-text") %>%
    html_text()
  n_helpful <- doc %>%
    html_nodes(".a-expander-inline-container") %>%
    html_text()

  # Combine attributes into a single data frame
  n_helpful <- data.frame(n_helpful, stringsAsFactors = FALSE)
  n_helpful <- n_helpful[-1,]
  df2 <- data.frame(title, author, date, stars, comments, stringsAsFactors = FALSE)
  dff <- cbind(n_helpful, df2)
  return(dff)
}
```

I then scraped a single page to make sure it works:
```r
url <- "http://www.amazon.com/product-reviews/B00836Y6B2/?pageNumber=1"
reviews <- scrape_amazon(url)
```

...and confirmed that it worked.
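One quick way to diagnose the row-count error described below is to compare how many elements each selector returns for a given page, since `data.frame()` fails when the scraped vectors have unequal lengths. A minimal diagnostic sketch, using the same selectors as `scrape_amazon` above (`check_field_lengths` is a hypothetical helper, not part of the original code):

```r
library(rvest)

# Report how many elements each selector matches in one review page,
# so length mismatches are visible before data.frame() fails.
check_field_lengths <- function(page) {
  doc <- read_html(page)
  selectors <- c(
    title     = "#cm_cr-review_list .a-color-base",
    author    = "#cm_cr-review_list .a-profile-name",
    date      = "#cm_cr-review_list .review-date",
    stars     = "#cm_cr-review_list .review-rating",
    comments  = "#cm_cr-review_list .review-text",
    n_helpful = ".a-expander-inline-container"
  )
  sapply(selectors, function(s) length(html_nodes(doc, s)))
}

# check_field_lengths("http://www.amazon.com/product-reviews/B00836Y6B2/?pageNumber=1")
```

If the counts differ (for example, fewer `n_helpful` entries than titles), that page is the one that will break the `data.frame()`/`cbind()` step.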
Then I set the number of pages to scrape to 10, read a csv file containing a table of ASINs, and created an empty data frame so the reviews for each ASIN could be appended to it.
Running the code below successfully returned all the reviews for ASINs 1 through 10.
However, when I tried ASINs 11 through 20, I got the following error message:
```
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 3, 10
```

```r
# Set # of pages to scrape. Note: each page contains 8 reviews.
pages <- 10

# Loop over pages
Asins <- read.csv("ASINs.csv")
reviews_total <- data.frame()
for(prod_cod in Asins[1:10,]){
  for(page_num in 1:pages){
    url <- paste0("http://www.amazon.com/product-reviews/", prod_cod, "/?pageNumber=", page_num)
    reviews1 <- scrape_amazon(url, throttle = 4)
    reviews_total <- rbind(reviews_total, reviews1)
  }
}
for(prod_cod in Asins[11:20,]){
  for(page_num in 1:pages){
    url <- paste0("http://www.amazon.com/product-reviews/", prod_cod, "/?pageNumber=", page_num)
    reviews1 <- scrape_amazon(url, throttle = 4)
    reviews_total <- rbind(reviews_total, reviews1)
  }
}
```

I'm confused now because the error message is inconsistent.
My code is based on the instructions at this link: https://justrthings.com/2019/03/03/web-scraping-amazon-reviews-march-2019/
However, a lot of that code didn't work for me, so I changed it a bit.
If you need more information to troubleshoot this, please let me know.
Thanks
Posted on 2021-08-11 10:59:10
I'm not an expert, but I have the following code that might help you. My goal was to review certain products stored in an excel file connected directly to several databases; its name will be "csv".
csv has this overview:

| Product_name | ASIN |
| --- | --- |
| chr | chr |
| OWLURAL | B08F3ZN25P |
| ... 348 more rows |
Here are the steps I followed:

```r
# Data frame for all product names and ASINs
resultado_GLOBAL <- data.frame()

contador <- 1
for(j in csv$ASIN){
  url_2 <- paste0("https://www.amazon.es/Kensington-1500109ES-Teclado-USB-cable/product-reviews/", j, "/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews")
  pagina_reviews2 <- read_html(url_2)
  val_total = pagina_reviews2 %>%
    html_nodes(".a-color-secondary") %>%
    html_text()
  # Total number of pages (10 reviews per page)
  val_total = as.numeric(str_extract(val_total[1], "[0-9]*")) / 10

  # Get the information
  review_bodies = c()
  review_titles = c()
  for(i in 1:val_total){
    url <- paste0("https://www.amazon.es/NGS-FORTRESS900V2-Sistema-alimentaci%C3%B3n-ininterrumpida/product-reviews/", j, "/ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=", i)
    # Download each page
    pagina_reviews <- read_html(url)
    print(pagina_reviews)
    # We get all the review information, including whether the purchase
    # was verified, different styles, colours, etc.
    review_body_page <- pagina_reviews %>%
      html_nodes(".review-data") %>%
      html_text()
    review_bodies = c(review_bodies, review_body_page)
    review_bodies <- gsub("\\n", "", review_bodies)
    review_bodies <- review_bodies[nchar(as.vector(review_bodies)) != 0]
    # Filter out the attribute lines Amazon adds per category group
    review_bodies <- review_bodies[!grepl("^Tama", review_bodies)]
    review_bodies <- review_bodies[!grepl("^Potencia:", review_bodies)]
    review_bodies <- review_bodies[!grepl("^Color:", review_bodies)]
    # We also need the title of each review
    review_title_page <- pagina_reviews %>%
      html_nodes(".a-text-bold span") %>%
      html_text()
    review_titles = c(review_titles, review_title_page)
    review_titles = review_titles[nchar(as.vector(review_titles)) != 0]
  }
}
```

Now it depends on the usage you need; you can capture more variables inside the loop.
Now you can combine review_titles and review_bodies and store them in the df resultado_GLOBAL created at the beginning.
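That final combining step could be sketched as follows (a minimal example with made-up sample values; it assumes review_titles and review_bodies ended up the same length, otherwise pad or truncate them first):

```r
# Sample values standing in for one product's scraped results
review_titles <- c("Muy bueno", "Correcto")
review_bodies <- c("Funciona perfectamente.", "Cumple su funcion.")
j <- "B08F3ZN25P"

# Build a per-product data frame and append it to the global table
resultado_parcial <- data.frame(
  ASIN  = j,
  title = review_titles,
  body  = review_bodies,
  stringsAsFactors = FALSE
)
resultado_GLOBAL <- data.frame()
resultado_GLOBAL <- rbind(resultado_GLOBAL, resultado_parcial)
```

Running the `rbind()` at the end of each iteration of the outer loop accumulates every product's reviews into resultado_GLOBAL.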
https://stackoverflow.com/questions/66368577