
How to get a consistent length for all attributes and correct information matching the details?

Stack Overflow user
Asked 2022-03-17 09:45:18
Answers: 1, Views: 41, Followers: 0, Votes: 1

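How can I get a consistent length for all attributes while keeping each value matched to the correct product details? I am able to create a DataFrame, but only after forcing the lists to the same length, which leaves the details mismatched.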

    from urllib.request import urlopen
    from bs4 import BeautifulSoup as soup
    import pandas as pd
    
    url = "https://www.amazon.in/s?k=smart+watch&page=1"
    
    title = []
    stars =[]
    rating=[]
    list_price = []
    original_price=[]
    url_list =[] 
    
    def getdata (url):
        amazon_data = urlopen(url)
        amazon_html = amazon_data.read()
        a_soup = soup(amazon_html,'html.parser')
        all_title = a_soup.findAll('span',{'class':'a-size-medium a-color-base a-text-normal'})
        all_title = [t.text.split(">") for t in all_title]
        for item in all_title:
            title.append(item)
            
        all_stars = a_soup.findAll('span',{'class':'a-icon-alt'})
        all_stars = [r.text.split('>') for r in all_stars[:-4]]            
        for item in all_stars:
            stars.append(item) 
            
        all_rating = a_soup.findAll('div',{'class':'a-row a-size-small'})   
        all_rating = [r.text.split('>') for r in all_rating]
        for item in all_rating:
            rating.append(item)
            
        all_list_price = a_soup.findAll('span',{'class':'a-price-whole'})
        all_list_price = [r.text.split('>') for r in all_list_price]
        for item in all_list_price:
            list_price.append(item)
            
        
        all_original_price = a_soup.findAll('span',{'class':'a-price a-text-price'})
        all_original_price = [o.find('span', {'class': 'a-offscreen'}).text.split('>') for o in all_original_price]
        for item in all_original_price:
            original_price.append(item)
        return a_soup
        
        
    def getnextpage(a_soup):
        page= a_soup.find('a',attrs={"class": 's-pagination-item s-pagination-next s-pagination-button s-pagination-separator'})
        page = page['href']
        url =  'http://www.amazon.in'+ str(page)
        return url
            
    while True:
        geturl = getdata(url)
        url = getnextpage(geturl)
        url_list.append(url)
        if not url:
            break
        print(url)
    
       

****OUTPUT****
http://www.amazon.in/smart-watch/s?k=smart+watch&page=2
http://www.amazon.in/smart-watch/s?k=smart+watch&page=3
http://www.amazon.in/smart-watch/s?k=smart+watch&page=4
http://www.amazon.in/smart-watch/s?k=smart+watch&page=5
http://www.amazon.in/smart-watch/s?k=smart+watch&page=6
http://www.amazon.in/smart-watch/s?k=smart+watch&page=7
http://www.amazon.in/smart-watch/s?k=smart+watch&page=8
http://www.amazon.in/smart-watch/s?k=smart+watch&page=9
http://www.amazon.in/smart-watch/s?k=smart+watch&page=10
http://www.amazon.in/smart-watch/s?k=smart+watch&page=11
http://www.amazon.in/smart-watch/s?k=smart+watch&page=12
http://www.amazon.in/smart-watch/s?k=smart+watch&page=13
http://www.amazon.in/smart-watch/s?k=smart+watch&page=14
http://www.amazon.in/smart-watch/s?k=smart+watch&page=15
http://www.amazon.in/smart-watch/s?k=smart+watch&page=16
http://www.amazon.in/smart-watch/s?k=smart+watch&page=17
http://www.amazon.in/smart-watch/s?k=smart+watch&page=18
http://www.amazon.in/smart-watch/s?k=smart+watch&page=19
http://www.amazon.in/smart-watch/s?k=smart+watch&page=20


**The length is not the same for all the attributes:

len(title) 306 len(stars) 286 len(rating) 286 len(list_price) 306 len(original_price) 306**

**Only when I make the lengths consistent am I able to create the DataFrame, but then the information is inconsistent:**

    title = title[:-20]
    
    list_price = list_price[:-20]
    
    original_price = original_price[:-20]
    
    df = pd.DataFrame({'Title': title, 'Stars': stars, 'Rating':rating, 'List_Price': list_price, 'Original_Price':original_price})
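
To see why trimming the lists produces mismatched rows, here is a minimal sketch with made-up data. The product names and values are hypothetical and not scraped from Amazon; the point is only that collecting each attribute page-wide drops the gaps, so positions no longer line up.

```python
# Hypothetical data: three products, the second one has no star rating.
products = [
    {"title": "Watch A", "stars": "4.1"},
    {"title": "Watch B", "stars": None},
    {"title": "Watch C", "stars": "3.8"},
]

# Page-wide scraping collects only the values that exist, so the gap disappears.
titles = [p["title"] for p in products]
stars = [p["stars"] for p in products if p["stars"] is not None]

# Trimming the longer list makes the lengths match, but rows pair up by position.
trimmed_titles = titles[:len(stars)]
paired = list(zip(trimmed_titles, stars))
print(paired)
```

In this sketch "Watch B" ends up paired with the rating that belongs to "Watch C", which is the same kind of inconsistency described above.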

1 Answer

Stack Overflow user

Answered 2022-03-17 10:23:18

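Change your strategy to keep the information consistent. Don't extract all titles, all stars, all ratings, and so on across the whole page. Instead, I think you should extract the data item by item: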

data = []

def get_data(url):
    ...

    for item in a_soup.find_all('div', {'class': 's-result-item'}):
        if 's-widget' in item['class']:
            continue
        # extract information for each item
        title = ...
        stars = ...
        rating = ...
        price = ...
        original = ...
        data.append({'Title': title, 'Stars': stars, 'Rating': rating,
                     'List_Price': price, 'Original_Price': original})


df = pd.DataFrame(data)
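
One possible way to fill in the per-item extraction is sketched below. It is not the answerer's exact code: the sample HTML is a hypothetical, heavily simplified stand-in for a results page, the helper names `text_or_none` and `extract_items` are made up for illustration, and the single-class selectors are an assumption loosely based on the class names in the question. Real Amazon markup is more complex and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a search results page.
SAMPLE_HTML = """
<div class="s-result-item">
  <span class="a-size-medium a-color-base a-text-normal">Watch A</span>
  <span class="a-icon-alt">4.1 out of 5 stars</span>
  <div class="a-row a-size-small">1,234</div>
  <span class="a-price-whole">1,999</span>
  <span class="a-price a-text-price"><span class="a-offscreen">4,999</span></span>
</div>
<div class="s-result-item s-widget">
  <span class="a-icon-alt">ad carousel</span>
</div>
<div class="s-result-item">
  <span class="a-size-medium a-color-base a-text-normal">Watch B</span>
  <span class="a-price-whole">2,499</span>
</div>
"""


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching tag under parent, or None."""
    if parent is None:
        return None
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag else None


def extract_items(html):
    """Build one dict per result item so every row stays internally consistent."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="s-result-item"):
        # Skip widgets (ads, carousels), as in the answer above.
        if "s-widget" in item.get("class", []):
            continue
        original_box = item.find("span", class_="a-text-price")
        rows.append({
            "Title": text_or_none(item, "span", "a-text-normal"),
            "Stars": text_or_none(item, "span", "a-icon-alt"),
            "Rating": text_or_none(item, "div", "a-size-small"),
            "List_Price": text_or_none(item, "span", "a-price-whole"),
            "Original_Price": text_or_none(original_box, "span", "a-offscreen"),
        })
    return rows


data = extract_items(SAMPLE_HTML)
for row in data:
    print(row)
```

Because a missing field becomes `None` instead of being silently skipped, every item yields exactly one row, and passing `data` to `pd.DataFrame` should give columns of equal length without any trimming.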
Votes: 0

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/71510135
