This is the code I'm trying to write (a web crawler that iterates over a list of links, where the first link is the seed URL and links found on each page are appended to the list, so the for loop keeps walking the growing list). For some reason the script keeps stopping after roughly 150 links have been appended and printed.
import requests
from bs4 import BeautifulSoup
import urllib.request

links = ['http://example.com']

def spider(max_pages):
    page = 1
    number = 1
    while page <= max_pages:
        try:
            for LINK in links:
                url = LINK
                source_code = requests.get(url)
                plain_text = source_code.text
                soup = BeautifulSoup(plain_text, "html.parser")
                for link in soup.findAll("a"):
                    try:
                        href = link.get("href")
                        if href.startswith("http"):
                            if href not in links:
                                number += 1
                                links.append(href)
                                print("{}: {}".format(number, href))
                    except:
                        pass
        except Exception as e:
            print(e)

while True:
    spider(10000)

What should I do to make it run indefinitely?
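As for running indefinitely: one way to restructure the loop above is to crawl from a work queue and stop only when no unseen links remain (or when an optional page limit is hit). This is a sketch only; crawl and fetch_links are hypothetical names, and the in-memory link graph stands in for the requests/BeautifulSoup fetching in the question, so no network access is needed to try it.

```python
from collections import deque

def crawl(start, fetch_links, max_pages=None):
    """Breadth-first crawl: visit each URL once, queueing new links as found."""
    seen = {start}          # every URL ever queued, to avoid revisiting
    queue = deque([start])  # URLs waiting to be fetched
    order = []              # URLs actually visited, in visit order
    while queue and (max_pages is None or len(order) < max_pages):
        url = queue.popleft()
        order.append(url)
        for href in fetch_links(url):
            # Guard against missing href attributes and relative links
            if href is not None and href.startswith("http") and href not in seen:
                seen.add(href)
                queue.append(href)
    return order

# Usage with a tiny in-memory link graph instead of real HTTP requests:
graph = {
    "http://a": ["http://b", "http://c", None],
    "http://b": ["http://c"],
    "http://c": [],
}
print(crawl("http://a", lambda u: graph.get(u, [])))
# → ['http://a', 'http://b', 'http://c']
```

With max_pages=None the loop runs until the queue empties, which on the open web is effectively indefinite; a set-based seen check is also much faster than `href not in links` on a growing list.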
Posted on 2015-08-18 12:29:33
This error happens when you hit an <a> element that has no href attribute, in which case link.get("href") returns None. You should check that the link actually has an href before calling startswith on it.
Posted on 2015-08-18 12:41:47
Samir Chahine
Your code fails because the href variable is None after

    href = link.get("href")

so add another check there, like:

    if (href is not None) and href.startswith("http://")

and translate that logic into your Python code accordingly.
Try to debug using print statements, like:

    href = link.get("href")
    print("href", href)  # use a comma: "href " + href would fail when href is None
    if href is not None and href.startswith("http"):
        print("Condition passed 1")
        if href not in links:
            print("Condition passed 2")
            number += 1
            links.append(href)
            print("{}: {}".format(number, href))

https://stackoverflow.com/questions/32071725
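The guard described in both answers can be sketched independently of the network-bound parts. Here extract_links and the sample data are hypothetical, introduced only to make the filtering logic testable on its own:

```python
def extract_links(hrefs, seen):
    """Return hrefs that are present, absolute (http/https), and unseen."""
    new = []
    for href in hrefs:
        # link.get("href") yields None when an <a> tag has no href attribute
        if href is None:
            continue
        if not href.startswith("http"):
            continue  # skip relative links, mailto:, javascript:, etc.
        if href not in seen:
            seen.add(href)
            new.append(href)
    return new

sample = [None, "http://example.com/a", "/relative",
          "http://example.com/a", "https://example.com/b"]
print(extract_links(sample, set()))
# → ['http://example.com/a', 'https://example.com/b']
```

The None check must come first: calling startswith on None is exactly the AttributeError that the question's bare except was silently swallowing.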