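I have been trying to scrape https://support.riverbed.com/content/support/eos_eoa.html, which contains a paginated table generated in JavaScript.
I am currently using Beautiful Soup to capture the script element. The variable EOL_ENTRIES holds the data I need, but I can't parse it. Any tips on how to scrape this data successfully?
The end goal is to get this data into Power BI, but PBI's web scraping only picks up the first page of the table.
Any help is much appreciated.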
import re
import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()
url = 'https://support.riverbed.com/content/support/eos_eoa.html'
resp, data = http.request(url)
html = data.decode("UTF-8")
soup = BeautifulSoup(html, 'html5lib')
# the script at index 37 contains the data
script_with_data = soup.find_all('script')[37]

Sample output looks like this:
<script type="text/javascript">
var EOL_ENTRIES = [
{
productFamily: 'SteelCentral',
shortName: 'SteelCentral AppInternals Collector v9',
link: 'https:\/\/support.riverbed.com\/content\/support\/eos_eoa\/steelcentral-cascade-opnet\/SteelCentral-AppInternals-Console-v9-and-AppInternals-Collector-v9-BrowserMetrix-OnPremise.html',
linkOverride: 'https:\/\/support.riverbed.com\/content\/support\/eos_eoa\/steelcentral-cascade-opnet\/SteelCentral-AppInternals-Console-v9-and-AppInternals-Collector-v9-BrowserMetrix-OnPremise.html',
sku: 'AIXCOL',
skuOverride: '',
description: 'SteelCentral AppInternals Collector v9',
limitedAvailability: '',
limitedAvailabilityFormatted: '',
endOfAvailability: 'Wed Jul 03 00:00:00 PDT 2019',
endOfAvailabilityFormatted: 'Wed Jul 03 00:00:00 PDT 2019',
endOfSupportFeatures: 'Sat Aug 31 00:00:00 PDT 2019',
endOfSupportFeaturesFormatted: 'Sat Aug 31 00:00:00 PDT 2019',
endOfSupportMaintenance: 'Sat Aug 31 00:00:00 PDT 2019',
endOfSupportMaintenanceFormatted: 'Sat Aug 31 00:00:00 PDT 2019'}
];

Answered 2020-06-24 15:46:43
The data is a JavaScript array literal, not JSON, so it needs some preprocessing before the json module can load it:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://support.riverbed.com/content/support/eos_eoa.html'
soup = BeautifulSoup(requests.get(url).content, 'html5lib')
# select <script> tag of interest
s = soup.find(lambda t: t.name == 'script' and 'var EOL_ENTRIES' in t.text)
# extract string from this script tag
t = re.search(r'var EOL_ENTRIES = (\[.*\]);', s.text, flags=re.S)[1]
# preprocess the string
t = t.replace("'", '"')
t = re.sub(r'^(\s*)(.*?):', r'\1"\2":', t, flags=re.M)
# decode string to Python data
data = json.loads(t)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data to screen:
for product in data:
    print('{:<40} {:<40} {}'.format(product['sku'], product['productFamily'], product['shortName']))

Prints:
AIXCOL SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-AN SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-LP SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-LP-MODEL SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-LS SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-LS-MODEL SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-SITE SteelCentral SteelCentral AppInternals Collector v9
AIXCOL-SUB-LIC SteelCentral SteelCentral AppInternals Collector v9
PANCOL SteelCentral SteelCentral AppInternals Collector v9
... and so on.

Answered 2020-06-24 16:12:13
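If you are trying to get the data into Power BI, I would take a different approach: extract the data, load it into a dataframe, and then export it to an Excel/CSV file that PBI can read.
So I would try this: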
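import pandas as pd

# script_with_data is the <script> tag found in the question's code.
# Note: split('}')[0] keeps only the first entry of EOL_ENTRIES.
first_entry = str(script_with_data).split('var EOL_ENTRIES = [')[1].split('}')[0]
prods = first_entry.replace('\t', '').replace('\n', '').split(',')
rows = []
for prod in prods:
    # each field looks like "key: 'value'"
    key, value = prod.split(': ', 1)
    rows.append([key.strip(' {'), value.replace("'", "")])
df = pd.DataFrame(rows)

From here, export the data with standard pandas methods.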
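
As a rough end-to-end sketch of the extract-then-export idea, the snippet below parses every entry (not just the first) and writes CSV text. It borrows the regex/json preprocessing from the other answer and, so that it runs without extra packages, swaps in the standard-library csv module for pandas. SAMPLE_SCRIPT is a shortened, hypothetical stand-in for the real script text, and the function names parse_entries and entries_to_csv are made up for illustration.

```python
import csv
import io
import json
import re

# Hypothetical, shortened sample mimicking the structure shown in the question;
# on the real page this text would come from the <script> tag.
SAMPLE_SCRIPT = r"""
var EOL_ENTRIES = [
{
productFamily: 'SteelCentral',
shortName: 'SteelCentral AppInternals Collector v9',
link: 'https:\/\/support.riverbed.com\/content\/support\/eos_eoa.html',
sku: 'AIXCOL',
skuOverride: '',
endOfAvailability: 'Wed Jul 03 00:00:00 PDT 2019'}
];
"""


def parse_entries(script_text):
    """Turn the EOL_ENTRIES JavaScript array literal into a list of dicts."""
    body = re.search(r'var EOL_ENTRIES = (\[.*\]);', script_text, flags=re.S)[1]
    # Assumption: no value contains an apostrophe, as in the sample above.
    body = body.replace("'", '"')
    # Quote the bare key at the start of each line so the text becomes valid JSON.
    body = re.sub(r'^(\s*)(\w+):', r'\1"\2":', body, flags=re.M)
    return json.loads(body)


def entries_to_csv(entries):
    """Serialize the entries to CSV text, one row per product."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(entries[0].keys()))
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


entries = parse_entries(SAMPLE_SCRIPT)
csv_text = entries_to_csv(entries)
print(csv_text)
```

With pandas installed, the export step could instead be pd.DataFrame(entries).to_csv('eol_entries.csv', index=False), where the file name is only an example; Power BI can then read that file directly.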
https://stackoverflow.com/questions/62556397