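I am trying to tally the star ratings that each reviewer gave a product. I noticed that some reviewers are "Vine Voice" (paid) reviewers. They rarely give 4 stars; most of their ratings are 5 stars. So I would like to exclude them.
My approach is to label a review "Paid" if it contains a span with the classes a-color-success a-text-bold, and "Not-paid" otherwise.
I can't seem to get any "Paid" label appended to the vine list. Why is that?
Only reviews written by Vine Voice reviewers carry that tag; for reviews without it, nothing is found for `paid`.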
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
rating_list = []
date_list = []
vine = []
for num in range(1,12):
    url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
    r = requests.get(url, headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    for ratings in soup.find_all("div", attrs={"data-hook": "review"}):
        submission_date = ratings.find("span", {'data-hook':'review-date'}).text
        rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
        paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})
        if paid in ratings:
            vine.append("Paid")
        else:
            vine.append("Not-paid")
        date_list.append(submission_date)
        rating_list.append(rating)
        data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
    time.sleep(2)
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)

This is what I get so far. Reviews 2 and 3 are from Vine Voice reviewers, but they are labeled Not-paid when they should be Paid.
0 5.0 out of 5 stars 2019-09-18 Not-paid
1 4.0 out of 5 stars 2019-09-13 Not-paid
2 5.0 out of 5 stars 2019-09-12 Not-paid
3 5.0 out of 5 stars 2019-09-11 Not-paid
4 5.0 out of 5 stars 2019-09-10 Not-paid
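...

Posted on 2019-09-26 09:21:40
You are comparing an element with an element, which is why the code always falls through to the else branch. I have modified it to compare text with text, and it now runs fine. Check the code below.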
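A minimal sketch of why the original `if paid in ratings:` test never succeeds, assuming a simplified, made-up review snippet (the HTML below is illustrative only, not Amazon's actual markup). In BeautifulSoup, `x in tag` only checks the tag's direct children, so a badge nested deeper inside the review is not found. Checking whether `find()` returned `None` is already enough to tell the two cases apart.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the badge sits inside a nested <div>,
# not directly under the review <div>.
html = """
<div data-hook="review" id="r1">
  <div class="badges"><span class="a-color-success a-text-bold">Vine Voice</span></div>
</div>
<div data-hook="review" id="r2">
  <div class="badges"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
membership = []
labels = []
for review in soup.find_all("div", attrs={"data-hook": "review"}):
    paid = review.find("span", attrs={"class": "a-color-success a-text-bold"})
    # `in` on a Tag looks only at its direct children, so this is False
    # even for the review that does contain the badge.
    membership.append(paid in review)
    # find() returns None when nothing matches, so a None check suffices.
    labels.append("Paid" if paid is not None else "Not-paid")

print(membership)
print(labels)
```

With this check, the extra `paid.text in ratings.text` comparison is not strictly needed, since any element returned by `find()` is by construction inside the review.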
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
rating_list = []
date_list = []
vine = []
for num in range(1,12):
    url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
    r = requests.get(url, headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    for ratings in soup.find_all("div", attrs={"data-hook": "review"}):
        submission_date = ratings.find("span", {'data-hook':'review-date'}).text
        rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
        paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})
        if paid:
            if paid.text in ratings.text:
                vine.append("Paid")
                date_list.append(submission_date)
                rating_list.append(rating)
                data = {'Rating': rating_list, 'Date': date_list, "Paid": vine}
        else:
            vine.append("Not-paid")
            date_list.append(submission_date)
            rating_list.append(rating)
            data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
    time.sleep(2)
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)

Output:
Date Paid Rating
0 2019-09-18 Not-paid 5.0 out of 5 stars
1 2019-09-13 Not-paid 4.0 out of 5 stars
2 2019-09-12 Paid 5.0 out of 5 stars
3 2019-09-11 Paid 5.0 out of 5 stars
4 2019-09-10 Not-paid 5.0 out of 5 stars
5 2019-09-10 Not-paid 2.0 out of 5 stars
6 2019-09-10 Paid 5.0 out of 5 stars
7 2019-09-09 Paid 5.0 out of 5 stars
8 2019-09-09 Not-paid 2.0 out of 5 stars
9 2019-09-08 Paid 5.0 out of 5 stars
10 2019-09-05 Paid 5.0 out of 5 stars
11 2019-09-01 Not-paid 2.0 out of 5 stars
12 2019-08-31 Paid 5.0 out of 5 stars
13 2019-08-25 Paid 5.0 out of 5 stars
14 2019-08-24 Not-paid 4.0 out of 5 stars
15 2019-08-22 Not-paid 5.0 out of 5 stars
16 2019-08-21 Paid 5.0 out of 5 stars
17 2019-08-20 Not-paid 5.0 out of 5 stars
18 2019-08-20 Paid 5.0 out of 5 stars
19 2019-08-18 Paid 5.0 out of 5 stars
20 2019-08-17 Not-paid 5.0 out of 5 stars
21 2019-08-17 Not-paid 5.0 out of 5 stars
22 2019-08-14 Not-paid 4.0 out of 5 stars
23 2019-08-12 Paid 5.0 out of 5 stars
24 2019-08-05 Paid 5.0 out of 5 stars
25 2019-08-05 Paid 4.0 out of 5 stars
26 2019-08-04 Paid 5.0 out of 5 stars
27 2019-08-04 Paid 4.0 out of 5 stars
29 2019-08-03 Paid 5.0 out of 5 stars
28 2019-08-03 Paid 4.0 out of 5 stars
.. ... ... ...
80 2019-07-08 Paid 5.0 out of 5 stars
81 2019-07-08 Paid 5.0 out of 5 stars
82 2019-07-08 Paid 5.0 out of 5 stars
85 2019-07-07 Paid 5.0 out of 5 stars
83 2019-07-07 Paid 5.0 out of 5 stars
84 2019-07-07 Paid 5.0 out of 5 stars
87 2019-07-06 Paid 5.0 out of 5 stars
86 2019-07-06 Paid 4.0 out of 5 stars
88 2019-07-05 Not-paid 4.0 out of 5 stars
89 2019-07-05 Paid 5.0 out of 5 stars
90 2019-07-05 Paid 5.0 out of 5 stars
91 2019-07-05 Paid 5.0 out of 5 stars
92 2019-07-04 Paid 5.0 out of 5 stars
93 2019-07-04 Paid 4.0 out of 5 stars
94 2019-07-04 Paid 5.0 out of 5 stars
95 2019-07-04 Paid 5.0 out of 5 stars
96 2019-07-04 Paid 5.0 out of 5 stars
98 2019-07-03 Not-paid 3.0 out of 5 stars
97 2019-07-03 Paid 5.0 out of 5 stars
99 2019-07-01 Paid 5.0 out of 5 stars
100 2019-07-01 Paid 3.0 out of 5 stars
101 2019-07-01 Paid 5.0 out of 5 stars
102 2019-06-30 Paid 5.0 out of 5 stars
103 2019-06-29 Paid 5.0 out of 5 stars
104 2019-06-29 Paid 5.0 out of 5 stars
105 2019-06-28 Not-paid 1.0 out of 5 stars
106 2019-06-27 Paid 4.0 out of 5 stars
107 2019-06-27 Paid 5.0 out of 5 stars
108 2019-06-26 Paid 5.0 out of 5 stars
109 2019-06-26 Paid 5.0 out of 5 stars
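[110 rows x 3 columns]

Posted on 2019-09-26 12:17:12
I think a better approach (with bs4 4.7.1+) is to use :has and :not to do the exclusion up front. That way there is no need for an exclusion field/flag. In the code below I print the reviewer name as a visual check (you will see that the paid reviewers' names do not appear). I also adjusted your loop so it works properly and used a Session for efficiency. I also use shorter, more robust selectors.
CSS selectors are faster than find, so I would probably change the find lines to: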
submission_date = review.select_one('[data-hook=review-date]').text
rating = review.select_one('[data-hook=review-star-rating]').text
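To illustrate the `:not(:has(...))` exclusion on its own, here is a small sketch run against a made-up two-review snippet (the HTML and the reviewer names "Alice" and "Bob" are hypothetical, not taken from Amazon's page). It assumes bs4 4.7.1+ so that `select` supports these pseudo-classes.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (not Amazon's real page).
html = """
<div class="review">
  <span class="a-profile-name">Alice</span>
  <span class="a-color-success a-text-bold">Vine Voice</span>
  <i data-hook="review-star-rating">5.0 out of 5 stars</i>
</div>
<div class="review">
  <span class="a-profile-name">Bob</span>
  <i data-hook="review-star-rating">4.0 out of 5 stars</i>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :has(.a-color-success) matches reviews that contain the badge;
# wrapping it in :not(...) keeps only the reviews without it.
kept = soup.select(".review:not(:has(.a-color-success))")
names = [r.select_one(".a-profile-name").text for r in kept]
ratings = [r.select_one("[data-hook=review-star-rating]").text for r in kept]

print(names)
print(ratings)
```

Because the badge-carrying review is filtered out by the selector itself, only the second review's name and rating are collected, and no separate "Paid" column is required.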
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
rating_list = []
date_list = []
with requests.Session() as s:
    for num in range(1,12):
        url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
        r = s.get(url, headers = headers)
        soup = BeautifulSoup(r.content, 'lxml')
        for review in soup.select('.review:not(:has(.a-color-success))'):
            submission_date = review.select_one('[data-hook=review-date]').text
            rating = review.select_one('[data-hook=review-star-rating]').text
            date_list.append(submission_date)
            rating_list.append(rating)
            print(review.select_one('.a-profile-name').text) #check
data = {'Rating':rating_list, 'Date':date_list}
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)

https://stackoverflow.com/questions/58113028