Question: The whole page is not being parsed with Beautiful Soup

Stack Overflow user

Asked on 2020-08-12 15:58:33

Answers: 1, Views: 46, Followers: 0, Votes: 0

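I am unable to retrieve anything beyond the first 19 records from the site I want to scrape. The listings on the site are loaded dynamically, and I believe that is why my Python code returns only the first 19 listings. I have read a few things online but have not yet found a solution.

Below is my complete Python code. I would appreciate the community's input on what I can do to solve this.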

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {"Accept-Language": "en-US, en;q=0.5"}
url = "https://www.producthunt.com/time-travel/2019/1/7"
results = requests.get(url, headers=headers)

# Only the HTML returned by this single request is parsed; no JavaScript is run
soup = BeautifulSoup(results.text, "html.parser")

name = []
description = []
category = []
up_votes = []

# Each listing sits in its own div
ph_project_div = soup.find_all('div', class_='item_54fdd')

for container in ph_project_div:
    ph_name = container.a.find('h3', class_='font_9d927 medium_51d18 semiBold_e201b title_9ddaf lineHeight_042f1 underline_57d3c').text
    name.append(ph_name)

    ph_desc = container.a.find('p', class_='font_9d927 grey_bbe43 small_231df normal_d2e66 tagline_619b7 lineHeight_042f1 underline_57d3c').text
    description.append(ph_desc)

    # find() returns a Tag or None, so store the text rather than the Tag itself
    ph_cat = container.find('span', class_='font_9d927 grey_bbe43 xSmall_1a46e lineHeight_042f1 underline_57d3c')
    category.append(ph_cat.text if ph_cat else None)

    ph_vote = container.find('span', class_='font_9d927 small_231df semiBold_e201b lineHeight_042f1 underline_57d3c')
    up_votes.append(ph_vote.text if ph_vote else None)

phunt = pd.DataFrame({
    'Product Name': name,
    'Product Description': description,
    'Product Category': category,
    'Product Votes': up_votes,
})

phunt.to_csv("ph_01_04_2019.csv")

Tags: python, web-scraping, screen-scraping

1 Answer

Stack Overflow user

Answered on 2020-08-12 19:41:06

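So you are right: JavaScript is being used to fetch the next set of posts. This is an example of a scroll event triggering an HTTP request to the server.

To see this, inspect the page and open Network tools --> XHR. Clear all the requests and scroll down. You will see a POST request being made, and on the right-hand side you will see the data. I tend to search for something I expect to see in order to confirm that the data is there.

There are two ways to deal with dynamic content like this.

1. Mimic the HTTP request
2. Use browser activity to grab the data

The first is always the better option: if there is an API endpoint or similar available, it is best to use it, because it is more efficient and the data is more accurate.

The second option relies on packages like selenium, which are slower and brittle to changes in the HTML, and that makes the code annoying to maintain.

To gather the information for the code below, I copied the cURL of the request found in the XHR section of the network tools. I use a website that converts this to Python, e.g. curl.trillworks.com.

An HTTP request may need nothing more than a simple GET to the correct URL for the endpoint, and it is best to try that first before adding headers/params/data. In this case you are given all three, but you only really need a User-Agent, the type of response object you want, and the data that directs you to the correct response.

Code example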

import requests

headers = {
    'User-Agent': 'M',
    'content-type': 'application/json',
}

data = '{"operationName":"Posts","variables":{"year":2019,"month":1,"day":7,"cursor":"MjA=","includeLayout":false},"query":"query Posts($year: Int, $month: Int, $day: Int, $cursor: String) {\\n posts(first: 20, year: $year, month: $month, day: $day, after: $cursor) {\\n edges {\\n node {\\n id\\n ...PostItemList\\n __typename\\n }\\n __typename\\n }\\n pageInfo {\\n endCursor\\n hasNextPage\\n __typename\\n }\\n __typename\\n }\\n}\\n\\nfragment PostItemList on Post {\\n id\\n ...PostItem\\n __typename\\n}\\n\\nfragment PostItem on Post {\\n id\\n _id\\n comments_count\\n name\\n shortened_url\\n slug\\n tagline\\n updated_at\\n ...CollectButton\\n ...PostThumbnail\\n ...PostVoteButton\\n ...TopicFollowButtonList\\n __typename\\n}\\n\\nfragment CollectButton on Post {\\n id\\n name\\n isCollected\\n __typename\\n}\\n\\nfragment PostThumbnail on Post {\\n id\\n name\\n thumbnail {\\n id\\n media_type\\n ...MediaThumbnail\\n __typename\\n }\\n ...PostStatusIcons\\n __typename\\n}\\n\\nfragment MediaThumbnail on Media {\\n id\\n image_uuid\\n __typename\\n}\\n\\nfragment PostStatusIcons on Post {\\n name\\n product_state\\n __typename\\n}\\n\\nfragment PostVoteButton on Post {\\n _id\\n id\\n featured_at\\n updated_at\\n disabled_when_scheduled\\n has_voted\\n ... on Votable {\\n id\\n votes_count\\n __typename\\n }\\n __typename\\n}\\n\\nfragment TopicFollowButtonList on Topicable {\\n id\\n topics {\\n edges {\\n node {\\n id\\n ...TopicFollowButton\\n __typename\\n }\\n __typename\\n }\\n __typename\\n }\\n __typename\\n}\\n\\nfragment TopicFollowButton on Topic {\\n id\\n slug\\n name\\n isFollowed\\n ...TopicImage\\n __typename\\n}\\n\\nfragment TopicImage on Topic {\\n name\\n image_uuid\\n __typename\\n}\\n"}'
response = requests.post('https://www.producthunt.com/frontend/graphql', headers=headers, data=data)
# Parse the JSON body once and keep only the list of posts
posts = response.json()['data']['posts']['edges']
for a in posts:
    name = a['node']['name']
    subtext = a['node']['tagline']
    votes = a['node']['votes_count']
    # The first topic is used as the category; None if a post has no topics
    topics = a['node']['topics']['edges']
    category = topics[0]['node']['name'] if topics else None
    print('-'*20)
    print('Name: ',name)
    print('Tagline: ',subtext)
    print('Category: ',category)
    print('Votes: ',votes)

Output

--------------------
Name:  Newbook Models
Tagline:  Find & book fashion models online
Category:  Productivity
Votes:  57
--------------------
Name:  3Leaf Edibles - Quinoa Granola Bite
Tagline:  Vegan, low dose, high quality edibles.
Category:  Health and Fitness
Votes:  57
--------------------
Name:  Payfacile
Tagline:  We simplify access to online payment.
Category:  Fintech
Votes:  43
--------------------
Name:  Halo by Motorola
Tagline:  Watch over your baby from above
Category:  Kids
Votes:  27
--------------------
Name:  Bowflex Max Intelligence
Tagline:  Bring the benefits of a personal trainer to your home
Category:  Health and Fitness
Votes:  23
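
The request in the code example fetches a single page. Its query also asks for pageInfo with endCursor and hasNextPage, so one way to collect every listing for a day is to keep requesting pages until hasNextPage is false. The sketch below is only an illustration and assumes the endpoint accepts the previous endCursor as the next cursor value, which the answer does not confirm. fetch_all_edges, fetch_page and max_pages are hypothetical names; fetch_page stands for whatever function posts the payload with a given cursor and returns response.json().

```python
def fetch_all_edges(fetch_page, first_cursor=None, max_pages=50):
    """Collect edges from every page by following pageInfo.endCursor.

    fetch_page is a hypothetical callable: given a cursor, it returns the
    parsed JSON dictionary for that page.
    """
    edges = []
    cursor = first_cursor
    for _ in range(max_pages):  # hard cap so the loop always terminates
        posts = fetch_page(cursor)["data"]["posts"]
        edges.extend(posts["edges"])
        page_info = posts["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]
    return edges


# Stand-in for the real endpoint, used only to show the control flow.
fake_pages = {
    None: {"data": {"posts": {
        "edges": [{"node": {"name": "A"}}],
        "pageInfo": {"endCursor": "c1", "hasNextPage": True}}}},
    "c1": {"data": {"posts": {
        "edges": [{"node": {"name": "B"}}],
        "pageInfo": {"endCursor": "c2", "hasNextPage": False}}}},
}
names = [edge["node"]["name"] for edge in fetch_all_edges(fake_pages.get)]
print(names)
```

Passing the fetching function in as an argument keeps the loop free of network code, so it can be checked against canned pages before being pointed at the real site.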

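Explanation

Headers: you do need to specify a User-Agent, but it can be anything, so I just used a single character. The content-type turned out to be necessary as well; you find this out by trial and error.

I won't lie, I can't bear to try to break the data down, and my guess is that changing it could leave you without the data you want. One benefit of this data is that you can change the year/month/day and it will give you the data for that page, including the items that would otherwise need scrolling to load.

You did not ask about this specifically, but using something like an f-string to put the specific values into the data for each call to requests.post() would be one way to go.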
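
As a rough sketch of that idea, the request body can be built per date. This version swaps the f-string for json.dumps, because the query text is full of curly braces that would all need escaping inside an f-string. build_payload is a hypothetical helper, and the short query passed to it here is only a placeholder for the full query string from the code example. The variable names (year, month, day, cursor, includeLayout) are the ones that appear in that request.

```python
import json


def build_payload(query, year, month, day, cursor=None):
    """Return the JSON request body for one day of posts (hypothetical helper)."""
    variables = {
        "year": year,
        "month": month,
        "day": day,
        "cursor": cursor,
        "includeLayout": False,
    }
    return json.dumps(
        {"operationName": "Posts", "variables": variables, "query": query}
    )


# Placeholder query text; in practice pass the full query string from the request.
payload = build_payload("query Posts { __typename }", 2019, 1, 7, cursor="MjA=")
print(payload)
```

The resulting string can be passed as data= to requests.post() in place of the hard-coded one. The cursor value "MjA=" from the request is the Base64 encoding of "20"; whether the server accepts a missing cursor for the first page is an assumption to verify.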

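What you get back is a JSON object, and the response.json() method converts it into a Python dictionary, so all the usual dictionary methods can be used.

I strongly recommend having a look at the response.json() dictionary; you will not really understand where the name and subtext come from unless you do. Most JSON objects are nested data sets, so the data you want often sits behind many keys, as it does here. I loop over each post to get the name/tagline.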
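
To make the nesting concrete, here is a minimal sketch that flattens a list of edges into plain rows, using the column names from the question's DataFrame. edges_to_rows is a hypothetical helper, and joining all topic names into one category string is a choice made for this illustration rather than something the answer does (it takes only the first topic). The sample edge is a trimmed copy of the structure the endpoint returns.

```python
def edges_to_rows(edges):
    """Turn the nested 'edges' list into flat dictionaries, one per post."""
    rows = []
    for edge in edges:
        node = edge["node"]
        # Topics are nested one level deeper: node -> topics -> edges -> node -> name
        topic_edges = node.get("topics", {}).get("edges", [])
        rows.append({
            "Product Name": node.get("name"),
            "Product Description": node.get("tagline"),
            "Product Category": ", ".join(t["node"]["name"] for t in topic_edges),
            "Product Votes": node.get("votes_count"),
        })
    return rows


# Trimmed sample with the same shape as one item of the response.
sample_edges = [{
    "node": {
        "name": "dinely",
        "tagline": "Up to 50% off top restaurants, no coupon needed",
        "votes_count": 58,
        "topics": {"edges": [
            {"node": {"name": "Android"}},
            {"node": {"name": "iPhone"}},
        ]},
    },
}]
rows = edges_to_rows(sample_edges)
print(rows[0])
```

A list of rows like this can be handed straight to pd.DataFrame(rows) and written out with to_csv(), as in the question.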

Typical output for one item of the list response.json()['data']['posts']['edges']

{'node': {'id': '142398',
  '__typename': 'Post',
  '_id': 'UG9zdC0xNDIzOTg=',
  'comments_count': 9,
  'name': 'dinely',
  'shortened_url': '/r/p/142398',
  'slug': 'dinely',
  'tagline': 'Up to 50% off top restaurants, no coupon needed',
  'updated_at': '2020-04-28T16:55:48-07:00',
  'topics': {'edges': [{'node': {'id': '2',
      '__typename': 'Topic',
      'slug': 'android',
      'name': 'Android',
      'isFollowed': False,
      'image_uuid': 'd3e235c7-437b-4ed1-8298-2ce04eded455'},
     '__typename': 'TopicEdge'},
    {'node': {'id': '8',
      '__typename': 'Topic',
      'slug': 'iphone',
      'name': 'iPhone',
      'isFollowed': False,
      'image_uuid': '0ee71650-973d-4933-a3eb-c7201950db4b'},
     '__typename': 'TopicEdge'},
    {'node': {'id': '159',
      '__typename': 'Topic',
      'slug': 'drinking',
      'name': 'Drinking',
      'isFollowed': False,
      'image_uuid': '33a0c652-253d-491f-b86b-4e9ba32ee203'},
     '__typename': 'TopicEdge'},
    {'node': {'id': '250',
      '__typename': 'Topic',
      'slug': 'travel',
      'name': 'Travel',
      'isFollowed': False,
      'image_uuid': '0a49cae9-ccff-47f1-8998-d0f47c2e7775'},
     '__typename': 'TopicEdge'},
    {'node': {'id': '278',
      '__typename': 'Topic',
      'slug': 'e-commerce',
      'name': 'E-Commerce',
      'isFollowed': False,
      'image_uuid': '1aa939fc-89dd-49ae-9fde-d456f9a6c8d2'},
     '__typename': 'TopicEdge'}],
   '__typename': 'TopicConnection'},
  'featured_at': '2019-01-07T03:52:19-08:00',
  'disabled_when_scheduled': True,
  'has_voted': False,
  'votes_count': 58,
  'thumbnail': {'id': '703734',
   'media_type': 'image',
   '__typename': 'Media',
   'image_uuid': '96893366-45e6-477e-9bb8-0e04c2070da6'},
  'product_state': 'default',
  'isCollected': False},
 '__typename': 'PostEdge'}

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/63372235
