文章/答案/技术大牛

发布

社区首页 >问答首页 >如何获得所有属性的一致长度以及与详细信息相比的正确信息？

问如何获得所有属性的一致长度以及与详细信息相比的正确信息？
EN

Stack Overflow用户

提问于 2022-03-17 09:45:18

回答 1查看 41关注 0票数 1

如何获得所有属性的一致长度以及与详细信息相比的正确信息。虽然我能够创建一个DataFrame，但我必须使长度保持一致，这使得细节不一致

    from urllib.request import urlopen
    from bs4 import BeautifulSoup as soup
    import pandas as pd
    
    url = "https://www.amazon.in/s?k=smart+watch&page=1"
    
    title = []
    stars =[]
    rating=[]
    list_price = []
    original_price=[]
    url_list =[] 
    
    def getdata (url):
        amazon_data = urlopen(url)
        amazon_html = amazon_data.read()
        a_soup = soup(amazon_html,'html.parser')
        all_title = a_soup.findAll('span',{'class':'a-size-medium a-color-base a-text-normal'})
        all_title = [t.text.split(">") for t in all_title]
        for item in all_title:
            title.append(item)
            
        all_stars = a_soup.findAll('span',{'class':'a-icon-alt'})
        all_stars = [r.text.split('>') for r in all_stars[:-4]]            
        for item in all_stars:
            stars.append(item) 
            
        all_rating = a_soup.findAll('div',{'class':'a-row a-size-small'})   
        all_rating = [r.text.split('>') for r in all_rating]
        for item in all_rating:
            rating.append(item)
            
        all_list_price = a_soup.findAll('span',{'class':'a-price-whole'})
        all_list_price = [r.text.split('>') for r in all_list_price]
        for item in all_list_price:
            list_price.append(item)
            
        
        all_original_price = a_soup.findAll('span',{'class':'a-price a-text-price'})
        all_original_price = [o.find('span', {'class': 'a-offscreen'}).text.split('>') for o in all_original_price]
        for item in all_original_price:
            original_price.append(item)
        return a_soup
        
        
    def getnextpage(a_soup):
        page= a_soup.find('a',attrs={"class": 's-pagination-item s-pagination-next s-pagination-button s-pagination-separator'})
        page = page['href']
        url =  'http://www.amazon.in'+ str(page)
        return url
            
    while True:
        geturl = getdata(url)
        url = getnextpage(geturl)
        url_list.append(url)
        if not url:
            break
        print(url)
    
       

****OUTPUT****
http://www.amazon.in/smart-watch/s?k=smart+watch&page=2
http://www.amazon.in/smart-watch/s?k=smart+watch&page=3
http://www.amazon.in/smart-watch/s?k=smart+watch&page=4
http://www.amazon.in/smart-watch/s?k=smart+watch&page=5
http://www.amazon.in/smart-watch/s?k=smart+watch&page=6
http://www.amazon.in/smart-watch/s?k=smart+watch&page=7
http://www.amazon.in/smart-watch/s?k=smart+watch&page=8
http://www.amazon.in/smart-watch/s?k=smart+watch&page=9
http://www.amazon.in/smart-watch/s?k=smart+watch&page=10
http://www.amazon.in/smart-watch/s?k=smart+watch&page=11
http://www.amazon.in/smart-watch/s?k=smart+watch&page=12
http://www.amazon.in/smart-watch/s?k=smart+watch&page=13
http://www.amazon.in/smart-watch/s?k=smart+watch&page=14
http://www.amazon.in/smart-watch/s?k=smart+watch&page=15
http://www.amazon.in/smart-watch/s?k=smart+watch&page=16
http://www.amazon.in/smart-watch/s?k=smart+watch&page=17
http://www.amazon.in/smart-watch/s?k=smart+watch&page=18
http://www.amazon.in/smart-watch/s?k=smart+watch&page=19
http://www.amazon.in/smart-watch/s?k=smart+watch&page=20


**The length is not the same for all the attributes

len(标题) 306 len(星) 286 len(评级) 286 len(list_price) 306 len(original_price) 306**

**Only when I make the length consistent, I am able to create the dataframe, but the problem is that the information is inconsistent **

    title = title[:-20]
    
    list_price = list_price[:-20]
    
    original_price = original_price[:-20]
    
    df = pd.DataFrame({'Title': title, 'Stars': stars, 'Rating':rating, 'List_Price': list_price, 'Original_Price':original_price})

python

html

pandas

web-scraping

beautifulsoup

回答 1

Stack Overflow用户

发布于 2022-03-17 10:23:18

更改策略以保持一致的信息。不要摘取所有的标题，所有的明星，所有的收视率，.在一页纸上。我认为你应该为每一项提取数据：

data = []

def get_data(url)
    ...

    for item in a_soup.find_all('div', {'class': 's-result-item'}):
        if 's-widget' in item['class']:
            continue
        # extract information for each item
        title = ...
        stars = ...
        rating = ...
        price = ...
        original = ...
        data.append({'Title': title, 'Stars': stars, 'Rating': rating,
                     'List_Price': price, 'Original_Price': original})


df = pd.DataFrame(data)

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/71510135

复制

相似问题

问如何获得所有属性的一致长度以及与详细信息相比的正确信息？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何获得所有属性的一致长度以及与详细信息相比的正确信息？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何获得所有属性的一致长度以及与详细信息相比的正确信息？
EN