Web scraping with a for loop: can't "pass" certain data

Asked by a Stack Overflow user on 2019-08-16 09:34:27
1 answer, 65 views, 0 followers, 0 votes

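The code below is meant to scrape the ratings and the dates on which the ratings were posted.

The problem is that employees reply to negative reviews, and the dates of their replies get scraped too. So when I scrape the site I end up with an uneven number of ratings and dates (20 ratings and 24 dates), because four of the dates belong to replies from employees.

In the code I try to "pass" whenever the class "ugc-brand-response" appears, which marks an employee reply. So if the ugc class is matched, "pass"; if not, continue. But no data is stored at all, not even for the first few reviews.

I have learned a lot from reading other people's questions and answers. Thank you to this great community.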

import requests
import time
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://www.bestbuy.com/site/reviews/jabra-elite-85h-wireless-noise-canceling-over-the-ear-headphones-black/6335100?variant=A"

url_get = requests.get(url, headers=headers)
print(url_get.status_code)
soup = BeautifulSoup(url_get.content, 'lxml')


rating_n_date=[] 

for rating in soup.find_all(attrs={"class": "c-review-average"}):     
    rating_n_date.append(rating .text)
for date in soup.findAll(attrs={"class":"submission-date"}):
    if "class" == "ugc-brand-response" in date:
        pass    
    else:
        continue
    rating_n_date.append(date.text)
time.sleep(2)
print(rating_n_date)

Here is the data for one review:

<li class="review-item" tabindex="-1"><div class="row"><div class="hidden-xs hidden-sm col-md-3"><div class="undefined ugc-author v-fw-medium body-copy-lg">Jimmy</div><ul class=" ugc-badge-list"><li class="visible-xs-inline-block visible-sm-inline-block visible-md-block visible-lg-block"><span class="c-overlay-wrapper"><span class="overlayTrigger"><button aria-expanded="false" aria-controls="ugc-badge-overlay-bf28b82b-76f5-3c85-897e-598a91bbd8a8-0" aria-owns="ugc-badge-overlay-bf28b82b-76f5-3c85-897e-598a91bbd8a8-0" data-track="Custom"><div class="ugc-my-bby-badge"><img alt="My Best Buy® Member" src="https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/ugc-badge-mybby-core.svg"></div></button></span><span></span></span></li></ul></div><div class="col-xs-12 col-md-9"><div class="c-ratings-reviews v-medium"><p class="sr-only">Rating: 2 out of 5 stars</p><span class="c-stars c-stars-medium" alt="40%" aria-hidden="true"><span class="unfilled"></span><span class="filled" style="width:40%"></span></span><span class="c-reviews"><span class="c-review-average" aria-hidden="true">2</span></span></div><h3 id="review-id-bf28b82b-76f5-3c85-897e-598a91bbd8a8" class="ugc-review-title c-section-title heading-5 v-fw-medium  ">A disappointment: low volum, weak bass, distorts</h3><div class="disclaimer">Posted <time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time></div>

And this is the markup I do not want, the employee's reply:

<ul class="ugc-brand-response-list"><li><div class="row"><div class="col-sm-12 col-md-9 col-md-offset-3"><div class="ugc-brand-response"><h4 class=" c-section-title body-copy-lg v-fw-medium  ">Brand response</h4><p class="body-copy-lg">Jabra</p><div class="disclaimer"><time class="submission-date" title="Apr 29, 2019 8:46 AM">3 months ago</time></div><div class="ugc-brand-response-body body-copy-lg"><p class="pre-white-space">
Hello Jimmy - We were sorry to learn that the Jabra Elite 85h did not meet your expectations.  As the Elite 85h is a relatively new product, it is very important that you update the firmware in the headphones as often as necessary to keep up-to-date.  We are constantly improving all aspects of the Elite 85h through firmware updates.  If you have any specific questions or concers, we invite you to contact us directly by completing the web form at&nbsp;<a href="https://www.jabra.com/ServiceMenu/contact/ContactJabraSupport/ContactJabraSupportConsumer" target="_blank" rel="nofollow noopener noreferrer" style="word-break: break-all;">https://www.jabra.com/ServiceMenu/contact/ContactJabraSupport/ContactJabraSupportConsumer</a>, or by giving us a call - we love to help!  Thank you.
<img src="https://s3.amazonaws.com/stratos-logos/logos/Jabra.jpg" alt="Jabra" title="Jabra" style="display: block !important; margin-top: 2em !important; border: 1px solid #ccc !important; padding: 2px !important; background-color: white !important;">
<!--[if ReviewResponse]><![endif]--></p></div></div></div></div></li></ul>

1 Answer

Answered by a Stack Overflow user on 2019-08-16 11:37:56 (accepted answer)

It never skips the places where the class is ugc-brand-response-list, because you are explicitly pulling everything that has the class attribute submission-date.
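
To illustrate why every date gets collected, and one way to filter out the brand-response dates directly, here is a minimal sketch. It runs against a hypothetical, trimmed-down HTML string modelled on the snippets in the question (the nesting of the response inside the review item is an assumption), and it uses find_parent, which is an alternative to the parent-block approach this answer goes on to recommend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one review followed by one brand response.
SAMPLE_HTML = """
<ul>
  <li class="review-item">
    <span class="c-review-average">2</span>
    <div class="disclaimer">Posted
      <time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time>
    </div>
    <ul class="ugc-brand-response-list"><li>
      <div class="ugc-brand-response">
        <div class="disclaimer">
          <time class="submission-date" title="Apr 29, 2019 8:46 AM">3 months ago</time>
        </div>
      </div>
    </li></ul>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selecting by class alone returns the review date AND the response date.
all_dates = soup.find_all("time", class_="submission-date")

# Keep only the <time> tags that have no "ugc-brand-response" ancestor.
review_dates = [
    tag for tag in all_dates
    if tag.find_parent(class_="ugc-brand-response") is None
]

print(len(all_dates), len(review_dates))
```

On this sample, the first selection finds two dates and the filtered list keeps one, the date of the review itself.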

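You are also misunderstanding continue. When you use continue, it does not mean what you think it means, which is "carry on with the code". What it really means is "stop here, skip the rest of the loop body, and move on to the next item". So, the way your code is written, whenever it does not find class == "ugc-brand-response", it goes to the else, which says continue. It therefore never reaches the append, which is why your data is not being stored.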
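
The difference between pass and continue can be seen in a small standalone sketch. The list of strings is made up for illustration and stands in for the scraped tags.

```python
# Hypothetical stand-ins for the scraped date tags.
dates = ["review date", "brand response date", "review date"]

# Pattern from the question: everything that reaches the else branch is skipped.
stored = []
for date in dates:
    if date == "brand response date":
        pass        # no-op: execution falls through to the append below
    else:
        continue    # jumps to the next item, so the append is skipped
    stored.append(date)

# Inverted logic: skip only the unwanted item and keep the rest.
kept = []
for date in dates:
    if date == "brand response date":
        continue
    kept.append(date)

print(stored)  # ['brand response date']
print(kept)    # ['review date', 'review date']
```

In the question's code the situation is worse still: `"class" == "ugc-brand-response" in date` is a chained comparison whose first part compares two different string literals, so the condition is always False, every iteration hits continue, and nothing is ever appended.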

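What you can do instead is go up to the parent tag and pull the whole review "block", found by the class attribute "col-xs-12 col-md-9". From there, go into each block and use find to pull the rating and the submission date together (find returns the first occurrence of what you are looking for, which means it will not pick up the date of the employee reply), then store them in lists. I then put them into a DataFrame.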

import requests
import time
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://www.bestbuy.com/site/reviews/jabra-elite-85h-wireless-noise-canceling-over-the-ear-headphones-black/6335100?variant=A"

url_get = requests.get(url, headers=headers)
print(url_get.status_code)
soup = BeautifulSoup(url_get.content, 'lxml')


rating_list = [] 
date_list = []

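# Each "col-xs-12 col-md-9" block holds one review.
for ratings in soup.find_all(attrs={"class": "col-xs-12 col-md-9"}):
    # find() returns only the first match inside the block, so the
    # brand-response date further down is never picked up.
    rating = ratings.find('span', {'class':'c-review-average'}).text
    submission_date = ratings.find('time', {'class':'submission-date'}).text

    rating_list.append(rating)
    date_list.append(submission_date)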


data = {'Rating':rating_list, 'Date':date_list}
df = pd.DataFrame(data)

Output:

print (df)
   Rating          Date
0       5  3 months ago
1       5  2 months ago
2       4  3 months ago
3       3  3 months ago
4       4  3 months ago
5       4  3 months ago
6       5  3 months ago
7       4  3 months ago
8       5  3 months ago
9       4    1 week ago
10      4  3 months ago
11      2  3 months ago
12      5  3 months ago
13      4     1 day ago
14      4   1 month ago
15      4  3 months ago
16      4   3 weeks ago
17      2  3 months ago
18      3   3 weeks ago
19      5  3 months ago
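
The Date column above holds relative strings such as "3 months ago". The markup in the question also carries an absolute timestamp in the title attribute of each time tag, so a possible refinement is to read that instead. The sketch below is an assumption-laden illustration rather than part of the original answer: the format string is inferred from the single sample value "Apr 28, 2019 11:29 PM", and parsing the month name and AM/PM marker assumes an English locale.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# One <time> tag copied from the review markup in the question.
html = '<time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time>'

tag = BeautifulSoup(html, "html.parser").find("time", class_="submission-date")

# Format inferred from the sample title value; adjust it if the site differs.
posted = datetime.strptime(tag["title"], "%b %d, %Y %I:%M %p")

print(tag.text)            # relative text shown on the page
print(posted.isoformat())  # absolute timestamp taken from the title attribute
```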
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/57522231
