Question: A problem in a small piece of crawler code in Python 3

Stack Overflow user

Asked on 2017-04-18 16:28:19

Answers: 1 | Views: 49 | Followers: 0 | Votes: 0

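I am trying to write a spider that collects some information from Steam's top-sellers list, but my code has a problem. I think it is related to the 're' module, because nothing inside the for loop gets printed. When I run the code, it always writes "[]" to the output file.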

def getDetail(self, url):
    source = self.getSource(url)
    pattern = re.compile('<div class="col search_name ellipsis"><span class="title">(.*?)</span>', re.S)
    items = re.findall(pattern, source)
    print(re.findall(pattern, source))
    number = 1
    for item in items:
        print('Crawling No.%d game' % number)
        print('Name: %s' % item[0])
        number += 1
        time.sleep(0.1)
    return items

Here is my full code.

import requests
import re
import time


class Spider(object):
    def __init__(self):
        self.siteURL = 'http://store.steampowered.com/search/?filter=topsellers'

    def getSource(self, url):
        user_agent = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/45.0.2454.101 Safari/537.36'
        headers = {'User_agent': user_agent}
        r = requests.get(url, headers=headers)
        return r.text

    def getDetail(self, url):
        source = self.getSource(url)
        pattern = re.compile('<div class="col search_name ellipsis"><span class="title">(.*?)</span>', re.S)
        items = re.findall(pattern, source)
        print(re.findall(pattern, source))
        number = 1
        for item in items:
            print('Crawling No.%d game' % number)
            print('Name: %s' % item[0])
            number += 1
            time.sleep(0.1)
        return items

    def saveDetail(self):
        data = str(self.getDetail(self.siteURL))
        fileName = 'SteamTopseller.txt'
        f = open(fileName, 'wb')
        f.write(data.encode('utf-8'))
        print('Successfully written!')
        f.close()

if __name__ == '__main__':
    spider = Spider()
    spider.saveDetail()

Please help me solve this small problem, thanks! By the way, I am writing the code in Python 3.

python

web-crawler

1 Answer

Stack Overflow user

Answered on 2017-04-18 17:13:04

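re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of pattern in string, as a list of strings.

So if nothing in the string matches, it returns an empty list, i.e. []. That is what ends up in your file: the pattern never matches the downloaded page, so the for loop body never runs.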
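To make the return shapes concrete, here is a minimal sketch. The HTML fragment is a made-up stand-in, not the real Steam markup, and the variable names are only for illustration. It also shows a side effect relevant to the question's loop: with a single capture group each item is already a string, so item[0] would be just its first character.

```python
import re

# Hypothetical HTML fragment (an assumption, not the real Steam page).
html = '<span class="title">Game A</span><span class="title">Game B</span>'

# One capture group: findall returns a list of strings.
one_group = re.compile(r'<span class="title">(.*?)</span>', re.S)
titles = one_group.findall(html)

# Indexing a single result with [0] yields its first character, not the title.
first_char = titles[0][0]

# No match at all: findall returns an empty list, which str() renders as "[]".
no_match = one_group.findall('<div>nothing here</div>')

# Two capture groups: findall returns a list of tuples, one entry per group.
two_groups = re.compile(r'<span class="(.*?)">(.*?)</span>', re.S)
pairs = two_groups.findall(html)

print(titles)
print(no_match)
print(pairs)
```

If the names are what you want to print, using item rather than item[0] in the loop is probably what was intended.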

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

To skip the "[]" case, you can write something like this:

items = re.findall(pattern, source)
if items:
    print(items)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

To get matches, remove the line breaks and tabs from the source before applying the pattern:

def getDetail(self, url):
    source = self.getSource(url).replace("\r", "").replace("\n", "").replace("\t", "")
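
As a rough illustration of why stripping those characters can help, the sketch below uses a made-up fragment in which a line break and a tab sit between the two tags; the real page layout is an assumption here. The question's pattern expects the tags to be directly adjacent, so it finds nothing until the whitespace is removed. Allowing optional whitespace with \s* is an alternative that leaves the source untouched and also copes with space-based indentation, which the replace() calls do not remove.

```python
import re

# Hypothetical markup (assumed layout): newline and tab between the two tags.
source = (
    '<div class="col search_name ellipsis">\r\n'
    '\t<span class="title">Example Game</span>\r\n'
    '</div>'
)

# The pattern from the question: the tags must be directly adjacent.
strict = re.compile(
    r'<div class="col search_name ellipsis"><span class="title">(.*?)</span>',
    re.S,
)
before = strict.findall(source)

# The approach from this answer: drop \r, \n and \t, then match again.
flattened = source.replace("\r", "").replace("\n", "").replace("\t", "")
after = strict.findall(flattened)

# Alternative sketch: tolerate any whitespace between the tags instead.
tolerant = re.compile(
    r'<div class="col search_name ellipsis">\s*<span class="title">(.*?)</span>',
    re.S,
)
also = tolerant.findall(source)

print(before, after, also)
```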

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/43467275
