Q: Web scraping via Python requests returns gibberish

Stack Overflow user

Asked 2018-04-05 04:02:41

3 answers · 576 views · 0 followers · 0 votes

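I am very, very new to Python, and I figured I should try a practical application.

I am trying to build a basic web price scraper with the requests library. I picked this page: https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy

This is the basic structure I am using: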

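import requests

# The page named above
url = "https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy"

page = requests.get(url)
print(page)          # the Response object, e.g. its status code

print(page.content)  # the raw response body, as bytes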

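But for some reason, the HTML printed via .content or .text looks very wrong. Instead of an HTML structure, I see what looks like a huge pile of carriage returns. Data is definitely missing.

I tried parsing it with BeautifulSoup (html.parser, html5lib, etc.). That dropped even more data.

Is this just some kind of encoding meant to block scraping, or am I doing something wrong?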

python

3 Answers

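Stack Overflow user

Answered 2018-04-05 05:00:10

Problem:

The problem you are facing is that the page content is produced by JavaScript embedded in the HTML, so the data appears to be missing from the HTML you download. requests_html, by kennethreitz, is a very good library designed for requesting and working with HTML pages.

Sample code: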

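from requests_html import HTMLSession

# Current releases of requests_html expose the session class as HTMLSession
session = HTMLSession()
r = session.get('https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy')

# Print the response body one line at a time (each line is a bytes object)
for line in r.iter_lines():
    print(line)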

Sample output

Because of the size limit on posts, I cannot include the full HTML. This is a fragment of the HTML printed by the code above:

b'<!doctype html>'
b'<html>'
b'<head>'
b'<meta charset="utf-8">'
b'<title>Self Storage Units at 15555 West Dixie Highway, North Miami Beach, FL 33162 | US Storage Centers</title>'
b'<base href="/">'
b'<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />'
b'<meta name="description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:type" content="website" />'
b'<meta property="og:locale" content="en_US" />'
b'<meta property="og:site_name" content="US Storage Centers" />'
b'<meta property="og:title" content="Self Storage North Miami Beach" />'
b'<meta property="og:url" content="https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy" />'
b'<meta property="og:description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:image" content="https://www.usstoragecenters.com/www/images/ussc_facility_photos/168/2017-06-15_00-37-08_Self%20Storage%20Building%20Exterior%20Front%20-%20North%20Miami%20Beach%20West%20Dixie%20IMG_5237%208.jpg" />'
b'<script type="application/ld+json">'
b'            {'
b'                    "@context": "http://schema.org",'
b'                    "@type": "WebPage"'
b'                    ,"breadcrumb": {'
b'                            "@context": "http://schema.org",'
b'                            "@type": "BreadcrumbList",'
b'                            "itemListElement": [{'
b'                    "@type": "ListItem",'
b'                    "name": "US Storage Centers",'
b'                    "url": "https://www.usstoragecenters.com/",'
b'                    "position": 0'
b'                }, {'
b'                    "@type": "ListItem",'
b'                    "name": "Storage Units",'
b'                    "url": "https://www.usstoragecenters.com/storage-units",'
b'                    "position": 1'
b'                }, {'
b'                    "@type": "ListItem",'
b'                    "name": "FL",'
b'                    "url": "https://www.usstoragecenters.com/storage-units/fl",'
b'                    "position": 2'
b'                }, {'

 **...... truncated  .....**
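The output above shows that part of the page's data sits inside a `<script type="application/ld+json">` tag. As a rough sketch of how such a block could be pulled out with only the standard library, the example below collects every JSON-LD script from an HTML string. The class name `JsonLdCollector` and the shortened `sample_html` are assumptions made for illustration; they are not taken from the real page.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script tag closes
        if tag == "script" and self._in_json_ld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Hypothetical, shortened stand-in for the downloaded page
sample_html = """<!doctype html>
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "WebPage"}
</script>
</head><body></body></html>"""

collector = JsonLdCollector()
collector.feed(sample_html)
print(collector.blocks)
```

With a real response, `collector.feed(r.text)` would take the place of the sample string.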

Votes: 1

Stack Overflow user

Answered 2018-04-05 04:06:38

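Call print(page.text). (In Python 3, page.content is a bytes object, so printing it shows escape sequences such as \r\n literally; page.text is the decoded string.)

print will render the carriage returns and similar characters (newlines, tabs, etc.) the way they are meant to appear.

A test: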

s = """
     Hey
    \r\r\r\r\r Look
    \t\t\t\t\t\t Here"""
print(s)

Output:

 Hey





 Look
                             Here
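To make the difference between raw bytes and decoded text concrete, here is a small sketch that does not need a network connection. The `raw` value is a made-up stand-in for `page.content`, and UTF-8 is assumed as the encoding.

```python
# Hypothetical bytes standing in for page.content
raw = b"<html>\r\n\t<body>Hi</body>\r\n</html>"

# print() on bytes shows their repr, so escape sequences stay visible as text
as_bytes = repr(raw)

# Decoding gives a str, and print() then renders the control characters
as_text = raw.decode("utf-8")

print(as_bytes)
print(as_text)
```

The first print shows the escape sequences spelled out inside a b'...' literal; the second shows the markup with real line breaks and a tab.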

Votes: 0

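Stack Overflow user

Answered 2018-04-05 04:11:44

What you see in your browser's developer tools does not correspond to the HTML that the server returns. View the page source in your browser and you will see that all of the page content is generated by JavaScript from JSON contained in a <script> tag.

This makes your job much easier, because you do not have to worry much about parsing HTML; you only need to extract the data from the JSON: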

import json
from bs4 import BeautifulSoup

...

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')

# Find the `script` tag with no `src` and 'window.jsonData' in its text
# (the `s and ...` guard skips tags whose string is None)
script = soup.find('script', src=None, string=lambda s: s and 'window.jsonData' in s).get_text()

# The JSON is part of script, so just remove the extra stuff
script = script.strip().replace('window.jsonData = ', '').rstrip(';')

# Now parse it
data = json.loads(script)
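As an alternative sketch that avoids trimming the string by hand, `json.JSONDecoder.raw_decode` can parse the object starting at a given position and ignore whatever follows it (such as the trailing semicolon). The helper name `extract_json_data`, the exact marker spacing, and the contents of `sample_html` are assumptions for illustration; the real payload's keys are not shown in this thread.

```python
import json


def extract_json_data(html, marker="window.jsonData = "):
    """Return the JSON value assigned right after `marker`, or None if absent."""
    start = html.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value beginning exactly at the given index
    # and returns it together with the index where parsing stopped
    value, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return value


# Hypothetical stand-in for page.text
sample_html = """<html><body>
<script src="/app.js"></script>
<script>
    window.jsonData = {"facility": {"city": "North Miami Beach"}, "units": [{"size": "5x5"}]};
</script>
</body></html>"""

data = extract_json_data(sample_html)
print(data["facility"]["city"])
```

This relies on the marker ending exactly where the JSON begins; if the page formats the assignment differently, the marker would need adjusting.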

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/49659474
