Q: Web scraping tutorial using Python 3?

Stack Overflow user

Asked 2013-05-28 09:52:40

Answers: 1 | Views: 10.6K | Followers: 0 | Votes: 5

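I'm trying to learn Python 3.x so that I can scrape websites. People have recommended Beautiful Soup 4 or lxml.html. Could someone point me in the right direction for a tutorial or examples of BeautifulSoup with Python 3.x?

Thanks for your help.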

python

web-scraping

python-3.2

1 Answer

Stack Overflow user

Accepted answer

Posted 2013-08-05 09:37:25

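I actually just wrote a full guide on web scraping in Python that includes some sample code. I wrote and tested it on Python 2.7, but according to the Wall of Shame, both packages I used (requests and BeautifulSoup) are fully compatible with Python 3.

Here's some code to get you started with web scraping in Python: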

import sys

import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` module (version 3) is Python 2 only


def scrape_google(keyword):

    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )

    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }

    # make the request to the search URL, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r

    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)

    # convert the plaintext HTML markup into a DOM-like structure that we can search;
    # naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(r.text, "html.parser")

    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")

    # iterate over each of the result wrapper elements
    for result in results:

        # the main link is an <a> inside an <h3> element with class="r"
        heading = result.find("h3", "r")
        result_anchor = heading.find("a") if heading else None
        if result_anchor is None:
            continue  # skip wrappers that do not contain a result link

        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":

    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"

    scrape_google(keyword)
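
One detail worth noting in `scrape_google`: the URL is built by replacing spaces with `+`, which leaves characters such as `&` or `+` inside the keyword unescaped. A minimal sketch of a safer variant, using only the standard library's `urllib.parse.urlencode`. The helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are hypothetical and not part of the original answer:

```python
from urllib.parse import urlencode

# Same endpoint as in the answer's code; the constant name is illustrative only.
SEARCH_ENDPOINT = "http://www.google.com/search"


def build_search_url(keyword):
    """Return the search URL for `keyword` with the query string percent-encoded."""
    # urlencode uses quote_plus by default: spaces become "+",
    # and reserved characters such as "&" or "+" become %XX escapes.
    return SEARCH_ENDPOINT + "?" + urlencode({"q": keyword.strip()})


if __name__ == "__main__":
    print(build_search_url("web scraping"))
    print(build_search_url("c++ & python"))
```

When using requests, passing `params={"q": keyword}` to `requests.get` should give the same encoding without building the string by hand.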

If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, this article on transitioning from Python 2 to Python 3 might be helpful.
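
As a rough illustration (not taken from the linked article), here are three Python 3 behaviors that tend to matter when porting a Python 2 scraping script: `print` is a function, `/` is true division, and `bytes` must be decoded to `str` explicitly. The helper names `average` and `decode_body` are made up for this sketch:

```python
def average(values):
    """Mean of a list of numbers; "/" is true division in Python 3."""
    return sum(values) / len(values)


def decode_body(raw_body, encoding="utf-8"):
    """Decode a bytes payload (for example a raw HTTP body) into str."""
    # In Python 3, bytes and str are distinct types and never mix implicitly.
    return raw_body.decode(encoding)


if __name__ == "__main__":
    # print is a function in Python 3, so the parentheses are required
    print("scraping", "http://example.com")

    # In Python 2, 3 / 2 on ints was floor division; Python 3 uses "//" for that
    print(average([1, 2]), 3 // 2)

    # A UTF-8 encoded body decoded into text
    print(decode_body(b"caf\xc3\xa9"))
```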

Votes: 16

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/16782726
