Q: How to extract JSON/Table from a JavaScript variable in Python

Stack Overflow user

Asked 2020-06-24 13:36:00

Answers 2 · Views 377 · Followers 0 · Votes 0

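I have been trying to scrape https://support.riverbed.com/content/support/eos_eoa.html, which contains a paginated table that is generated in JavaScript.

I am currently using Beautiful Soup to capture the script element. The variable EOL_ENTRIES holds the data I need, but I am unable to parse it. Any tips on how to scrape the data successfully?

The end goal is to get this data into Power BI, but PBI's web scraping only works on the first page.

Any help is greatly appreciated.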

import re
import httplib2
from bs4 import BeautifulSoup

# fetch the page and parse it
http = httplib2.Http()
url = 'https://support.riverbed.com/content/support/eos_eoa.html'
resp, data = http.request(url)
html = data.decode("UTF-8")
soup = BeautifulSoup(html, 'html5lib')

# the script at index 37 contains the data
script_with_data = soup.find_all('script')[37]

Sample output looks like this:

<script type="text/javascript">

        var EOL_ENTRIES = [
            
            
            
            
                
                    
                    
                        
                    
                    
                
                {
                    productFamily: 'SteelCentral',
                    shortName: 'SteelCentral AppInternals Collector v9',
                    link: 'https:\/\/support.riverbed.com\/content\/support\/eos_eoa\/steelcentral-cascade-opnet\/SteelCentral-AppInternals-Console-v9-and-AppInternals-Collector-v9-BrowserMetrix-OnPremise.html',
                    linkOverride: 'https:\/\/support.riverbed.com\/content\/support\/eos_eoa\/steelcentral-cascade-opnet\/SteelCentral-AppInternals-Console-v9-and-AppInternals-Collector-v9-BrowserMetrix-OnPremise.html',
                    sku: 'AIXCOL',
                    skuOverride: '',
                    description: 'SteelCentral AppInternals Collector v9',
                    limitedAvailability: '',
                    limitedAvailabilityFormatted: '',
                    endOfAvailability: 'Wed Jul 03 00:00:00 PDT 2019',
                    endOfAvailabilityFormatted: 'Wed Jul 03 00:00:00 PDT 2019',
                    endOfSupportFeatures: 'Sat Aug 31 00:00:00 PDT 2019',
                    endOfSupportFeaturesFormatted: 'Sat Aug 31 00:00:00 PDT 2019',
                    endOfSupportMaintenance: 'Sat Aug 31 00:00:00 PDT 2019',
                    endOfSupportMaintenanceFormatted: 'Sat Aug 31 00:00:00 PDT 2019'}
];

web-scraping

beautifulsoup

powerbi

python

2 Answers

Stack Overflow user

Accepted answer

Posted 2020-06-24 15:46:43

The data is embedded in JavaScript, so it needs some preprocessing before the json module can load it:

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://support.riverbed.com/content/support/eos_eoa.html'
soup = BeautifulSoup(requests.get(url).content, 'html5lib')

# select <script> tag of interest
s = soup.find(lambda t: t.name == 'script' and 'var EOL_ENTRIES' in t.text)

# extract string from this script tag
t = re.search(r'var EOL_ENTRIES = (\[.*\]);', s.text, flags=re.S)[1]

# preprocess the string
t = t.replace("'", '"')
t = re.sub(r'^(\s*)(.*?):', r'\1"\2":', t, flags=re.M)

# decode string to Python data
data = json.loads(t)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data to screen:
for product in data:
    print('{:<40} {:<40} {}'.format(product['sku'], product['productFamily'], product['shortName']))

Prints:

AIXCOL                                   SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-AN                                SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-LP                                SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-LP-MODEL                          SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-LS                                SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-LS-MODEL                          SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-SITE                              SteelCentral                             SteelCentral AppInternals Collector v9
AIXCOL-SUB-LIC                           SteelCentral                             SteelCentral AppInternals Collector v9
PANCOL                                   SteelCentral                             SteelCentral AppInternals Collector v9


... and so on.
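
The preprocessing can be tried offline before pointing it at the live page. The following is a minimal sketch, not the answer's exact code: SCRIPT_TEXT is a shortened, hypothetical stand-in for the script's text, parse_js_array is a helper name made up for illustration, and the key pattern is narrowed to \w+ on the assumption that every property name is a plain identifier. Like the code above, it assumes no value contains a quote character.

```python
import json
import re

# Shortened, hypothetical stand-in for the text of the <script> tag.
SCRIPT_TEXT = r"""
        var EOL_ENTRIES = [

                {
                    productFamily: 'SteelCentral',
                    link: 'https:\/\/support.riverbed.com\/content\/support\/eos_eoa.html',
                    sku: 'AIXCOL',
                    skuOverride: '',
                    endOfAvailability: 'Wed Jul 03 00:00:00 PDT 2019'},
                {
                    productFamily: 'SteelCentral',
                    link: '',
                    sku: 'AIXCOL-AN',
                    skuOverride: '',
                    endOfAvailability: 'Wed Jul 03 00:00:00 PDT 2019'}
];
"""


def parse_js_array(script_text, var_name):
    """Turn `var NAME = [ {key: 'value', ...}, ... ];` into Python data.

    Assumes one property per line, identifier-style keys, and values
    that contain neither single nor double quotes.
    """
    pattern = r'var %s = (\[.*\]);' % re.escape(var_name)
    match = re.search(pattern, script_text, flags=re.S)
    if match is None:
        raise ValueError('variable %r not found' % var_name)
    text = match.group(1)
    # JSON strings use double quotes
    text = text.replace("'", '"')
    # JSON keys must be quoted: `sku:` becomes `"sku":`
    text = re.sub(r'^(\s*)(\w+):', r'\1"\2":', text, flags=re.M)
    return json.loads(text)


entries = parse_js_array(SCRIPT_TEXT, 'EOL_ENTRIES')
print(len(entries), entries[1]['sku'])
```

The \/ sequences in the links need no special handling, because JSON accepts \/ as an escape for a plain slash.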

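Votes 3

Stack Overflow user

Posted 2020-06-24 16:12:13

If you are trying to get the data into Power BI, I would take a different approach: extract the data, load it into a dataframe, then export it to an Excel/CSV file that PBI can read.

So I would try this: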

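import pandas as pd

# script_with_data is the <script> tag found by the code in the question.
# Note: this only reads the first {...} object in EOL_ENTRIES and
# assumes no value contains a comma.
first = str(script_with_data).split('var EOL_ENTRIES = [')[1].split('}')[0]
prods = first.split('{')[1].split(',')

rows = []
for prod in prods:
    # each field looks like "key: 'value'"
    key, value = prod.strip().split(': ', 1)
    rows.append([key, value.strip("'")])

df = pd.DataFrame(rows, columns=['field', 'value'])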

From there, use the standard pandas methods to export the data.
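
For the export step, here is a small sketch of writing records to CSV. It swaps in the standard library's csv module instead of pandas so that it runs without extra packages; with pandas, the equivalent would be along the lines of pd.DataFrame(entries).to_csv('eol_entries.csv', index=False). The entries list is hypothetical sample data shaped like the parsed EOL_ENTRIES, and entries_to_csv and the file name are names chosen for illustration.

```python
import csv
import io

# Hypothetical records shaped like the parsed EOL_ENTRIES list.
entries = [
    {'sku': 'AIXCOL', 'productFamily': 'SteelCentral',
     'shortName': 'SteelCentral AppInternals Collector v9'},
    {'sku': 'AIXCOL-AN', 'productFamily': 'SteelCentral',
     'shortName': 'SteelCentral AppInternals Collector v9'},
]


def entries_to_csv(entries, out):
    """Write a list of dicts to the open text stream `out` as CSV."""
    # use the keys of the first record as the header row
    fieldnames = list(entries[0])
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(entries)


# Write to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
entries_to_csv(entries, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file that Power BI can import, pass a file opened with open('eol_entries.csv', 'w', newline='', encoding='utf-8') in place of the buffer.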

Votes 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/62556397
