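How can I get the same length for all attributes while keeping each value matched to the right product? I am able to create a DataFrame, but only after forcing the lists to the same length, which leaves the details misaligned.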
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

url = "https://www.amazon.in/s?k=smart+watch&page=1"
title = []
stars = []
rating = []
list_price = []
original_price = []
url_list = []

def getdata(url):
    amazon_data = urlopen(url)
    amazon_html = amazon_data.read()
    a_soup = soup(amazon_html, 'html.parser')
    all_title = a_soup.findAll('span', {'class': 'a-size-medium a-color-base a-text-normal'})
    all_title = [t.text.split(">") for t in all_title]
    for item in all_title:
        title.append(item)
    all_stars = a_soup.findAll('span', {'class': 'a-icon-alt'})
    all_stars = [r.text.split('>') for r in all_stars[:-4]]
    for item in all_stars:
        stars.append(item)
    all_rating = a_soup.findAll('div', {'class': 'a-row a-size-small'})
    all_rating = [r.text.split('>') for r in all_rating]
    for item in all_rating:
        rating.append(item)
    all_list_price = a_soup.findAll('span', {'class': 'a-price-whole'})
    all_list_price = [r.text.split('>') for r in all_list_price]
    for item in all_list_price:
        list_price.append(item)
    all_original_price = a_soup.findAll('span', {'class': 'a-price a-text-price'})
    all_original_price = [o.find('span', {'class': 'a-offscreen'}).text.split('>') for o in all_original_price]
    for item in all_original_price:
        original_price.append(item)
    return a_soup

def getnextpage(a_soup):
    page = a_soup.find('a', attrs={"class": 's-pagination-item s-pagination-next s-pagination-button s-pagination-separator'})
    page = page['href']
    url = 'http://www.amazon.in' + str(page)
    return url

while True:
    geturl = getdata(url)
    url = getnextpage(geturl)
    url_list.append(url)
    if not url:
        break
    print(url)
**OUTPUT**
http://www.amazon.in/smart-watch/s?k=smart+watch&page=2
http://www.amazon.in/smart-watch/s?k=smart+watch&page=3
http://www.amazon.in/smart-watch/s?k=smart+watch&page=4
http://www.amazon.in/smart-watch/s?k=smart+watch&page=5
http://www.amazon.in/smart-watch/s?k=smart+watch&page=6
http://www.amazon.in/smart-watch/s?k=smart+watch&page=7
http://www.amazon.in/smart-watch/s?k=smart+watch&page=8
http://www.amazon.in/smart-watch/s?k=smart+watch&page=9
http://www.amazon.in/smart-watch/s?k=smart+watch&page=10
http://www.amazon.in/smart-watch/s?k=smart+watch&page=11
http://www.amazon.in/smart-watch/s?k=smart+watch&page=12
http://www.amazon.in/smart-watch/s?k=smart+watch&page=13
http://www.amazon.in/smart-watch/s?k=smart+watch&page=14
http://www.amazon.in/smart-watch/s?k=smart+watch&page=15
http://www.amazon.in/smart-watch/s?k=smart+watch&page=16
http://www.amazon.in/smart-watch/s?k=smart+watch&page=17
http://www.amazon.in/smart-watch/s?k=smart+watch&page=18
http://www.amazon.in/smart-watch/s?k=smart+watch&page=19
http://www.amazon.in/smart-watch/s?k=smart+watch&page=20
**The length is not the same for all the attributes: len(title) 306, len(stars) 286, len(rating) 286, len(list_price) 306, len(original_price) 306**
**I can only create the DataFrame after making the lengths consistent, but then the information in each row no longer matches**
title = title[:-20]
list_price = list_price[:-20]
original_price = original_price[:-20]
df = pd.DataFrame({'Title': title, 'Stars': stars, 'Rating': rating, 'List_Price': list_price, 'Original_Price': original_price})

Posted on 2022-03-17 10:23:18
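Change your strategy so the information stays consistent. Don't collect all the titles, then all the stars, then all the ratings, and so on for a whole page. Some products have no stars or rating, so the separate lists end up with different lengths and no longer line up. I think you should extract the data item by item instead: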
data = []

def get_data(url):
    ...
    for item in a_soup.find_all('div', {'class': 's-result-item'}):
        # skip widgets, which are not product results
        if 's-widget' in item['class']:
            continue
        # extract information for each item
        title = ...
        stars = ...
        rating = ...
        price = ...
        original = ...
        data.append({'Title': title, 'Stars': stars, 'Rating': rating,
                     'List_Price': price, 'Original_Price': original})

df = pd.DataFrame(data)

https://stackoverflow.com/questions/71510135
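
To make the per-item idea concrete, here is a minimal sketch of how the elided extraction step might look. It is only an illustration under assumptions: SAMPLE_HTML is made-up markup that imitates the class names from the question (the real Amazon page may differ and changes often), and text_or_none and parse_items are hypothetical helper names, not part of any library. The point is that a missing field becomes None for that one item instead of shortening a whole column.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the class names used in the question.
# This is an assumption for illustration, not real Amazon output.
SAMPLE_HTML = """
<div class="s-result-item">
  <span class="a-size-medium a-color-base a-text-normal">Watch A</span>
  <span class="a-icon-alt">4.1 out of 5 stars</span>
  <div class="a-row a-size-small">1,234</div>
  <span class="a-price-whole">1,999</span>
  <span class="a-price a-text-price"><span class="a-offscreen">4,999</span></span>
</div>
<div class="s-result-item s-widget">
  <span class="a-icon-alt">ad banner</span>
</div>
<div class="s-result-item">
  <span class="a-size-medium a-color-base a-text-normal">Watch B</span>
  <span class="a-price-whole">2,499</span>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_items(html):
    """Build one record per result item so each field stays with its product."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="s-result-item"):
        if "s-widget" in item.get("class", []):
            continue  # skip widgets, which are not product results
        rows.append({
            "Title": text_or_none(item, "span.a-size-medium.a-color-base.a-text-normal"),
            "Stars": text_or_none(item, "span.a-icon-alt"),
            "Rating": text_or_none(item, "div.a-row.a-size-small"),
            "List_Price": text_or_none(item, "span.a-price-whole"),
            "Original_Price": text_or_none(item, "span.a-price.a-text-price span.a-offscreen"),
        })
    return rows


rows = parse_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame should then give one row per product, with the missing stars, rating and original price of the second item left empty rather than shifting values from other products into its row.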