
Python web crawler (NameError: name 'spider' is not defined)

Stack Overflow user
Asked on 2016-08-18 14:02:00
1 answer · 1.5K views · 0 followers · 0 votes

I am trying to run an example that I found online at http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/

However, I run into a problem when running it in the Python 3.5.2 shell.

spider("http://www.dreamhost.com", "secure", 200)给了我一个信息:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    spider("http://www.dreamhost.com", "secure", 200)
NameError: name 'spider' is not defined
from html.parser import HTMLParser  
from urllib.request import urlopen  
from urllib import parse

class LinkParser(HTMLParser):

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = parse.urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]

def getLinks(self, url):
    self.links = []
    self.baseUrl = url
    response = urlopen(url)
    if response.getheader('Content-Type')=='text/html':
        htmlBytes = response.read()
        htmlString = htmlBytes.decode("utf-8")
        self.feed(htmlString)
        return htmlString, self.links
    else:
        return "",[]

def spider(url, word, maxPages):  
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
    numberVisited = numberVisited +1
    url = pagesToVisit[0]
    pagesToVisit = pagesToVisit[1:]
    try:
        print(numberVisited, "Visiting:", url)
        parser = LinkParser()
        data, links = parser.getLinks(url)
        if data.find(word)>-1:
            foundWord = True
        pagesToVisit = pagesToVisit + links
        print(" **Success!**")
    except:
        print(" **Failed!**")
if foundWord:
    print("The word", word, "was found at", url)
else:
    print("Word never found")

1 Answer

Stack Overflow user

Accepted answer

Posted on 2016-08-18 14:38:37

Hi buddy, there are indentation problems in your code. After defining the class, there is no indentation before the methods handle_starttag and getLinks. In the function spider, the if-else part is also missing indentation. Please check your code against the code at the link you provided.
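As a side note (my illustration, not part of the original answer): when a def fails to compile because of bad indentation, the name is never bound, so the shell reports a NameError at the call site instead of pointing you back at the indentation problem.

try:
    # Hypothetical reproduction: the body of this def is not indented,
    # so compiling it raises IndentationError and binds no name.
    exec('def spider(url, word, maxPages):\nprint(url)')
except IndentationError as e:
    print("definition failed:", e)

try:
    spider("http://www.dreamhost.com", "secure", 200)
except NameError as e:
    print(e)  # name 'spider' is not defined

With that in mind, please find the updated working code below: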

from html.parser import HTMLParser  
from urllib.request import urlopen  
from urllib import parse

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.netinstructions.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.netinstructions.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links = self.links + [newUrl]

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
        if response.getheader('Content-Type')=='text/html':
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "",[]

# And finally here is our spider. It takes in an URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):  
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
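    # The condition uses 'and not foundWord' so the crawl stays bounded by
    # maxPages and by the queue; an 'or' here would let the loop continue
    # with an empty pagesToVisit and crash with an IndexError at
    # pagesToVisit[0].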
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited +1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word)>-1:
                foundWord = True
                foundAtUrl = url
            # Added else: if the desired word was not found on this page,
            # reset foundWord to False
            else:
                foundWord = False
            # Add the pages that we visited to the end of our collection
            # of pages to visit:
            pagesToVisit = pagesToVisit + links
            print(" **Success!**")
        except:
            print(" **Failed!**")
        # Moved this if-else block inside the while loop, so for every url it
        # reports whether the desired word was found or not
        if foundWord:
            print("The word", word, "was found at", url)
        else:
            print("Word never found")

spider("http://www.dreamhost.com", "secure", 200)

Let me know if you still have any questions/doubts.

Votes: 0
Original page content provided by Stack Overflow.
Original link:
https://stackoverflow.com/questions/39020253
