Q: Python Scraper file naming

Stack Overflow user

Asked 2016-11-07 08:22:57

1 answer · 117 views · 0 followers · 0 votes

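I have a script, based on LiteScraper from GitHub, that scrapes memes and GIFs from http://ifunny.co

The script saves all images in a timestamped folder, for example "ifunny(timestamp)".

I scrape from http://ifunny.co/feeds/shuffle, so each request gives me a random page with 10 images.

The problem is that I need to modify the script so that it saves all images in a folder with a given name.

I tried removing the code that adds the timestamp, but then every time the script fetches up to 10 images and moves on to the next page, the 10 new images overwrite the old ones.

The script seems to name the images "1, 2, 3, 4" and so on.

Here is the code: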

import os
import time
from html.parser import HTMLParser
import urllib.request

#todo: char support for Windows
#deal with triple backslash filter
#recursive parser option


class LiteScraper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.lastStartTag="No-Tag"
        self.lastAttributes=[]
        self.lastImgUrl=""
        self.Data=[]
        self.acceptedTags=["div","p","h","h1","h2","h3","h4","h5","h6","ul","li","a","img"]
        self.counter=0
        self.url=""


        self.SAVE_DIR="" #/Users/stjepanbrkic/Desktop/temp
        self.Headers=["User-Agent","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"]

    def handle_starttag(self,tag,attrs):
        #print("Encountered a START tag:",tag)
        self.lastStartTag=tag
        self.lastAttributes=attrs #unnecessary, might come in handy

        if self.lastStartTag=="img":
            attrs=self.lastAttributes

            for attribute in attrs:
                if attribute[0]=="src":
                    self.lastImgUrl=attribute[1]
                    print(attribute[1])

                    #Allow GIF from iFunny to download
                    for attribute in attrs:
                        if attribute[0]=="data-gif":
                            self.lastImgUrl=attribute[1]
                            print(attribute[1])
                            #End Gif Code

            self.handle_picture(self.lastImgUrl)

    def handle_endtag(self,tag):
        #print("Encountered a END tag:",tag)
        pass

    def handle_data(self,data):
        data=data.replace("\n"," ")
        data=data.replace("\t"," ")
        data=data.replace("\r"," ")
        if self.lastStartTag in self.acceptedTags:
            if not data.isspace():
                print("Encountered some data:",data)
                self.Data.append(data)

        else:
            print("Encountered filtered data.") #Debug

    def handle_picture(self,url):
        print("Bumped into a picture. Downloading it now.")
        self.counter+=1
        if url[:2]=="//":
            url="http:"+url

        extension=url.split(".")
        extension="."+extension[-1]

        try:
            req=urllib.request.Request(url)
            req.add_header(self.Headers[0],self.Headers[1])
            response=urllib.request.urlopen(req,timeout=10)
            picdata=response.read()
            file=open(self.SAVE_DIR+"/pics/"+str(self.counter)+extension,"wb")
            file.write(picdata)
            file.close()
        except Exception as e:
            print("Something went wrong, sorry.")


    def start(self,url):
        self.url=url
        self.checkSaveDir()

        try: #wrapped in exception - if there is a problem with url/server
            req=urllib.request.Request(url)
            req.add_header(self.Headers[0],self.Headers[1])
            response=urllib.request.urlopen(req,timeout=10)
            siteData=response.read().decode("utf-8")
            self.feed(siteData)
        except Exception as e:
            print(e)

        self.__init__()  #resets the parser/scraper for serial parsing/scraping
        print("Done!")

    def checkSaveDir(self):
        #----windows support
        if os.name=="nt":
            container="\ "
            path=os.path.normpath(__file__)
            path=path.split(container[0])
            path=container[0].join(path[:len(path)-1])
            path=path.split(container[0])
            path="/".join(path)
        #no more windows support! :P
        #for some reason, os.path.normpath returns the path with backslashes
        #on windows, so they had to be substituted with forward slashes.

        else:
            path=os.path.normpath(__file__)
            path=path.split("/")
            path="/".join(path[:len(path)-1])

        foldername=self.url[7:]
        foldername=foldername.split("/")[0]

        extension=time.strftime("iFunny")+"-"+time.strftime("%d-%m-%Y") + "-" + time.strftime("%Hh%Mm%Ss")

        self.SAVE_DIR=path+"/"+foldername+"-"+extension


        if not os.path.exists(self.SAVE_DIR):
            os.makedirs(self.SAVE_DIR)

        if not os.path.exists(self.SAVE_DIR+"/pics"):
            os.makedirs(self.SAVE_DIR+"/pics")

        print(self.SAVE_DIR)

This is the code I run to use the script:

pastebin .com/PNwJ9wEJ

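Sorry about the pastebin, the site would not let me post my code...

I am new to Python, so I am not sure how to solve this. Is it possible to make the script name the files like this?

Image names for the first page: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10). Image names for the second page: (11, 12, 13, ...)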

python

web-scraping

1 Answer

Stack Overflow user

Answered 2016-11-07 08:46:21

Every time the parser is instantiated (so for every new page), counter is set to zero. That is why the images always get overwritten.
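To make the reset visible, here is a minimal sketch of just the numbering logic. `PageParser` is a hypothetical stand-in for LiteScraper, the `.jpg` extension is assumed for illustration, and nothing is downloaded. It mirrors two details of the posted code: `handle_picture` increments `counter` before using it as the file name, and `start` ends with `self.__init__()`.

```python
class PageParser:
    """Hypothetical stand-in for LiteScraper; only the counter logic is kept."""

    def __init__(self):
        self.counter = 0

    def handle_picture(self):
        # Same order as in the posted code: increment first, then build the name.
        self.counter += 1
        return str(self.counter) + ".jpg"  # ".jpg" is an assumed extension

    def start(self, pictures_on_page):
        names = [self.handle_picture() for _ in range(pictures_on_page)]
        self.__init__()  # same reset as at the end of LiteScraper.start
        return names


parser = PageParser()
first_page = parser.start(3)
second_page = parser.start(3)
print(first_page)   # ['1.jpg', '2.jpg', '3.jpg']
print(second_page)  # ['1.jpg', '2.jpg', '3.jpg']
```

Both pages produce the same names, so writing them into one fixed folder replaces the earlier files.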

One alternative is to work out which file names are already in use.

i = 0
while os.path.isfile('your_filename_logic_'+str(i)):
    i += 1
# Now i is the first number which hasn't been used.

But if you have thousands of images, this may not be as fast as you would like.
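If the repeated `os.path.isfile` checks become a concern, one possible variation is to list the folder once and take the highest number found. This is only a sketch: `next_free_number` is a hypothetical helper, and it assumes the images are named with a plain number followed by an extension, as in the question.

```python
import os
import tempfile


def next_free_number(directory):
    """Return one more than the highest purely numeric file name in directory."""
    highest = 0
    for name in os.listdir(directory):
        stem, _ext = os.path.splitext(name)
        if stem.isdecimal():  # skip files such as "notes.txt"
            highest = max(highest, int(stem))
    return highest + 1


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as pics:
    for name in ("1.jpg", "2.gif", "10.jpg", "notes.txt"):
        open(os.path.join(pics, name), "wb").close()
    demo_result = next_free_number(pics)
    print(demo_result)  # 11
```

Because `handle_picture` in the posted code increments the counter before naming the file, the counter would be set to the returned value minus one before parsing a page.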

You could store the counter in a file once LiteScraper has finished, and read it back the next time it starts.

def startMyNewCounter(self):
    if os.path.isfile('your_filename_logic_' + 'count'):
        with open('your_filename_logic_'+'count', 'r') as f:
            self.counter = int(next(f))
    else:
        self.counter = 0

def saveMyCounter(self):
    with open('your_filename_logic_'+'count', 'w') as f:
        f.write(str(self.counter) + '\n')
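As a rough illustration of the round trip, the sketch below runs the same load and save steps as standalone functions over three simulated runs. The names `load_counter` and `save_counter`, the file name `count`, and the figure of 10 pictures per run are assumptions for the demonstration. In the posted script the save would have to happen before `start` calls `self.__init__()`, since that call sets the counter back to zero.

```python
import os
import tempfile


def load_counter(path):
    """Read the saved counter, or return 0 if no counter file exists yet."""
    if os.path.isfile(path):
        with open(path, "r") as f:
            return int(f.read().strip())
    return 0


def save_counter(path, value):
    """Overwrite the counter file with the current value."""
    with open(path, "w") as f:
        f.write(str(value) + "\n")


with tempfile.TemporaryDirectory() as folder:
    count_file = os.path.join(folder, "count")  # hypothetical file name
    totals = []
    for _run in range(3):       # three simulated runs of the scraper
        counter = load_counter(count_file)
        counter += 10           # pretend this run saved 10 pictures
        save_counter(count_file, counter)
        totals.append(counter)
    print(totals)  # [10, 20, 30]
```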

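Or, the simplest answer: if you do not care about the images once the program has closed, you can make the counter a global variable instead of a member of LiteScraper. Each new LiteScraper then picks up where the previous one left off.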
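A minimal sketch of the global-counter idea, using `itertools.count` at module level. `CountingScraper` is a hypothetical stand-in for LiteScraper and only the numbering is shown; the extensions are passed in by hand for the demonstration.

```python
import itertools

# Module-level counter: created once, shared by every scraper instance.
_picture_numbers = itertools.count(1)


class CountingScraper:
    """Hypothetical stand-in for LiteScraper showing only the numbering."""

    def __init__(self):
        self.saved = []  # per-instance state is still reset here

    def handle_picture(self, extension):
        # next() hands out each number once for the lifetime of the process.
        filename = str(next(_picture_numbers)) + extension
        self.saved.append(filename)
        return filename


page_one = CountingScraper()
page_two = CountingScraper()
names = [page_one.handle_picture(".jpg") for _ in range(2)]
names += [page_two.handle_picture(".gif") for _ in range(2)]
print(names)  # ['1.jpg', '2.jpg', '3.gif', '4.gif']
```

Since the counter lives outside the instance, neither creating a new scraper nor calling `self.__init__()` resets it. The numbering does start again at 1 when the program is restarted, which is the trade-off mentioned above.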

0 votes

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/40456125
