This is the code I'm trying to write (a web crawler that iterates over a list of links, where the first link is the seed URL and links found on each page are appended to the list, so the for loop keeps walking the growing list). For some reason the script keeps stopping after roughly 150 links have been appended and printed.
import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']

def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass
        except Exception as e:
            print(e)

while True:
    spider(10000)

What should I do to make it run indefinitely?
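As for running indefinitely: one way to restructure the loop above is to crawl from a work queue and stop only when no unseen links remain (or when an optional page limit is hit). This is a sketch only; crawl and fetch_links are hypothetical names, and the in-memory link graph stands in for the requests/BeautifulSoup fetching in the question, so no network access is needed to try it.

```python
from collections import deque

def crawl(start, fetch_links, max_pages=None):
    """Breadth-first crawl: visit each URL once, queueing new links as found."""
    seen = {start}          # every URL ever queued, to avoid revisiting
    queue = deque([start])  # URLs waiting to be fetched
    order = []              # URLs actually visited, in visit order
    while queue and (max_pages is None or len(order) < max_pages):
        url = queue.popleft()
        order.append(url)
        for href in fetch_links(url):
            # Guard against missing href attributes and relative links
            if href is not None and href.startswith("http") and href not in seen:
                seen.add(href)
                queue.append(href)
    return order

# Usage with a tiny in-memory link graph instead of real HTTP requests:
graph = {
    "http://a": ["http://b", "http://c", None],
    "http://b": ["http://c"],
    "http://c": [],
}
print(crawl("http://a", lambda u: graph.get(u, [])))
# → ['http://a', 'http://b', 'http://c']
```

With max_pages=None the loop runs until the queue empties, which on the open web is effectively indefinite; a set-based seen check is also much faster than `href not in links` on a growing list.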
Posted on 2015-08-18 12:29:33
This error happens when you hit an <a> element that has no href attribute, in which case link.get("href") returns None. You should check that the link actually has an href before calling startswith on it.
Posted on 2015-08-18 12:41:47
Samir Chahine
Your code fails because the href variable is None after

    href = link.get("href")

so add another check there, like:

    if (href is not None) and href.startswith("http://")

and translate that logic into your Python code accordingly.
Try to debug using print statements, like:

    href = link.get("href")
    print("href", href)  # use a comma: "href " + href would fail when href is None
    if href is not None and href.startswith("http"):
        print("Condition passed 1")
        if href not in links:
            print("Condition passed 2")
            number += 1
            links.append(href)
            print("{}: {}".format(number, href))

https://stackoverflow.com/questions/32071725
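The guard described in both answers can be sketched independently of the network-bound parts. Here extract_links and the sample data are hypothetical, introduced only to make the filtering logic testable on its own:

```python
def extract_links(hrefs, seen):
    """Return hrefs that are present, absolute (http/https), and unseen."""
    new = []
    for href in hrefs:
        # link.get("href") yields None when an <a> tag has no href attribute
        if href is None:
            continue
        if not href.startswith("http"):
            continue  # skip relative links, mailto:, javascript:, etc.
        if href not in seen:
            seen.add(href)
            new.append(href)
    return new

sample = [None, "http://example.com/a", "/relative",
          "http://example.com/a", "https://example.com/b"]
print(extract_links(sample, set()))
# → ['http://example.com/a', 'https://example.com/b']
```

The None check must come first: calling startswith on None is exactly the AttributeError that the question's bare except was silently swallowing.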