Q: Web scraping a table with Python just returns an empty list

Stack Overflow user

Asked 2020-01-06 11:59:07

1 answer, 425 views, 0 followers, 0 votes

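I am trying to scrape all the data from this table, across all pages of this website, into a dictionary, using the code shown below. However, it just returns an empty list.

In addition, each company has its own separate page, and I am struggling to scrape those pages into the dictionary as well.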

from bs4 import BeautifulSoup
import requests 
from pprint import pprint

case_data = []

case_url = 'https://www.dataquest.io'
case_page = requests.get(case_url) 
soup_case = BeautifulSoup(case_page.content, 'html.parser') 
case_table = soup_case.find('div',{'class':'slds-table slds-table--bordered slds-max-medium-table_stacked cCaseList'})

pprint(case_table)

Tags: python, python-3.x, web-scraping, beautifulsoup, python-requests

1 answer

Stack Overflow user

Accepted answer

Posted 2020-01-06 13:05:04

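from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox without a visible window.
options = Options()
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://masked_per_user_request/")
    # Give the page's JavaScript time to render the table.
    time.sleep(2)

    # Parse the first <table> in the rendered HTML into a DataFrame.
    # StringIO avoids the pandas deprecation warning for literal HTML strings.
    df = pd.read_html(StringIO(driver.page_source))[0]
    df.to_csv('result.csv', index=False)
finally:
    # Always close the browser, even if scraping fails.
    driver.quit()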
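
The fixed `time.sleep(2)` above can be too short on a slow connection and wasteful on a fast one. As a minimal sketch of an alternative, the helper below polls a condition until it holds or a timeout expires. `wait_until` and its parameters are hypothetical names introduced here for illustration; they are not part of Selenium or of the original answer.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With the `driver` from the code above, one might call `wait_until(lambda: driver.find_elements("css selector", "table tbody tr"))` before reading `driver.page_source`. The selector is an assumption about the page's markup. Selenium's own `WebDriverWait` provides the same pattern built in.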

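Output: click here

Note that the data is rendered from JSON returned by an XHR request to the backend, so you can call that endpoint directly with a POST request (including the JSON body data and the cookies).

For example: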

import requests


data = {
    # Raw string, so the \" escapes inside the embedded HTML stay valid JSON.
    'message': r'{"actions":[{"id":"108;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.richText.RichTextController/ACTION$getParsedRichTextValue","callingDescriptor":"UNKNOWN","params":{"html":"<p style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The RSPO aspires to ensure transparency throughout the complaints process and the reporting thereof. Decisions not to disclose information through the RSPO website require motivation on genuine grounds that disclosure will go against the interest of the complaints process and/or may jeopardize the well-being or safety of stakeholders involved, and that non-disclosure does not undermine adherence to the principles and objectives of RSPO:</span></p><p><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The non-disclosed information relates to a legitimate aim, i.e. peaceful and constructive resolution of complaints in accordance with RSPO objectives and P&amp;C;</span></li></ul><p><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The disclosure of said information threatens harm to that aim; and</span></li></ul><p style=\"text-align: justify;\"><br></p><ul><li style=\"text-align: justify;\"><span style=\"font-size: 14px;\">The harm to the aim is greater than the public interest in having the information disclosed.</span></li></ul><p>&nbsp;</p>"},"version":"47.0","storable":true},{"id":"88;a","descriptor":"apex://ComplaintsCaseController/ACTION$searchCaseList","callingDescriptor":"markup://c:CaseList","params":{"searchString":"","pageNumber":1,"defaultPageSize":"10"}},{"id":"111;a","descriptor":"serviceComponent://ui.communities.components.aura.components.forceCommunity.controller.HeadlineController/ACTION$getInitData","callingDescriptor":"UNKNOWN","params":{"uniqueNameOrId":"","pageType":""},"version":"47.0","storable":true}]}',
    'aura.context': '{"mode":"PROD","fwuid":"5fuxCiO1mNHGdvJphU5ELQ","app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"0luQG4JZE_TU28tAfQgGSA"},"dn":[],"globals":{},"uad":false}',
    'aura.pageURI': '/Complaint/s/casetracker',
    'aura.token': 'undefined'
}

r = requests.post("https://masked_per_user_request/", json=data).json()


print(r)

You will need to work out the cookie parameters yourself.
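
Since the question asks for every page of the table, here is a rough sketch of how the request might be parameterized by page number and sent through a session that carries the cookies. The descriptor and parameter names are copied from the request in this answer. The function names, sending the `searchCaseList` action on its own, and reusing `json=` for the body are all assumptions, and the shape of the response is not known here.

```python
import json


def build_case_list_message(page_number, page_size=10, search=""):
    """Serialize the 'message' field for one page of the case list.

    Building it with json.dumps avoids hand-escaping quotes. Sending this
    single action without the other two is an assumption.
    """
    action = {
        "id": "88;a",
        "descriptor": "apex://ComplaintsCaseController/ACTION$searchCaseList",
        "callingDescriptor": "markup://c:CaseList",
        "params": {
            "searchString": search,
            "pageNumber": page_number,
            "defaultPageSize": str(page_size),
        },
    }
    return json.dumps({"actions": [action]})


def fetch_case_page(session, endpoint, context, page_number):
    """POST one page request through `session` so its cookies are sent.

    `session` is expected to behave like requests.Session, `endpoint` is
    the XHR URL seen in the browser's network tab, and `context` is the
    'aura.context' string from the request above.
    """
    payload = {
        "message": build_case_list_message(page_number),
        "aura.context": context,
        "aura.pageURI": "/Complaint/s/casetracker",
        "aura.token": "undefined",
    }
    response = session.post(endpoint, json=payload)
    response.raise_for_status()
    return response.json()
```

One way to use it would be to create a `requests.Session()`, load the site's page once with `session.get(...)` so the server sets its cookies, and then call `fetch_case_page` in a loop over page numbers. Whether those cookies alone are sufficient has to be checked against the real site.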

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/59611727
