Q: Web Crawler not working with Python

Stack Overflow user

Asked 2017-01-14 04:54:18

1 answer, 646 views, 0 followers, 0 votes

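I have a problem with a simple web crawler: when I run the following script, it does not iterate through the website and it gives me no results.

This is what I get: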

1 Visiting: https://www.mongodb.com/
Word never found

Process finished with exit code 0

Why isn't this working properly? Any suggestions? I used the following example (http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/).

The code is as follows:

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

class LinkParser(HTMLParser):
    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it

    def handle_starttag(self, tag, attrs):
        """We are looking for the beginning of a link.
        Links normally look
        like <a href="www.someurl.com"></a>"""
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.netinstructions.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.netinstructions.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links = self.links + [newUrl]

    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
        if response.getheader('Content-Type') == 'text/html':
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

# And finally here is our spider. It takes in an URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited +1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word)>-1:
                foundWord = True
                # Add the pages that we visited to the end of our collection
                # of pages to visit:
                pagesToVisit = pagesToVisit + links
                print(" **Success!**")
        except:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

if __name__ == "__main__":
    spider("https://www.mongodb.com/", "MongoDB" ,400)

Tags: python, python-3.x, html-parser

1 Answer

Stack Overflow user

Answered 2017-01-14 06:18:03

First, change the Content-Type check line to:

if response.getheader('Content-Type') == 'text/html; charset=utf-8':

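as suggested by @glibdud. The exact comparison against 'text/html' presumably fails because the server's Content-Type header also carries a charset parameter, so getLinks returns an empty string and no links.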
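Comparing against the full string 'text/html; charset=utf-8' works for this site but breaks again if another server sends a different charset, different spacing, or no parameter at all. A minimal sketch of a more tolerant check follows; the helper name is_html is hypothetical and not part of the original code, and it compares only the media type before the first semicolon.

```python
def is_html(content_type_header):
    """Return True if a Content-Type header value denotes an HTML page.

    Hypothetical helper: compares only the media type, ignoring any
    parameters such as "; charset=utf-8".
    """
    if not content_type_header:
        # Header missing (getheader returned None) or empty
        return False
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


print(is_html("text/html; charset=utf-8"))
print(is_html("text/html"))
print(is_html("application/pdf"))
```

Inside getLinks, the condition could then read: if is_html(response.getheader('Content-Type')):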

If you want the program to keep checking links until it reaches maxPages or pagesToVisit == [], just remove the "and not foundWord" condition from this line:

while numberVisited < maxPages and pagesToVisit != [] and not foundWord:

to:

while numberVisited < maxPages and pagesToVisit != []:
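Note that in the posted spider the line that extends pagesToVisit sits inside the "if data.find(word)>-1:" branch, so links are queued only from pages that contain the word. The sketch below shows one way the loop could look with links queued on every page. It is an illustration under assumptions, not the original author's code: the function name crawl, the get_links parameter (a stand-in for LinkParser().getLinks) and the in-memory FAKE_SITE are all hypothetical, and a set of already seen URLs is added to avoid revisiting pages.

```python
from collections import deque


def crawl(start_url, word, max_pages, get_links):
    """Breadth-first crawl; return the URLs whose text contains word.

    get_links(url) is a hypothetical stand-in for
    LinkParser().getLinks(url) and must return (page_text, links).
    """
    to_visit = deque([start_url])
    seen = {start_url}
    matches = []
    visited = 0
    while visited < max_pages and to_visit:
        url = to_visit.popleft()
        visited += 1
        try:
            text, links = get_links(url)
        except Exception:
            continue  # skip pages that fail to load or parse
        if word in text:
            matches.append(url)
        # Queue links whether or not the word was found on this page
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return matches


# Tiny in-memory "site" so the sketch runs without network access
FAKE_SITE = {
    "a": ("home page", ["b", "c"]),
    "b": ("about MongoDB", ["a"]),
    "c": ("contact", ["b", "d"]),
    "d": ("MongoDB docs", []),
}

found = crawl("a", "MongoDB", 10, lambda url: FAKE_SITE[url])
print(found)
```

To crawl real pages, get_links could be replaced with something like lambda url: LinkParser().getLinks(url), keeping the Content-Type fix from the first step.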

Votes: 0

Page content originally provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/41643270
