I created a spider with the Portia web scraper. The start URL is
https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs and when I schedule this spider in scrapyd, I get the following log output:
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> ['partial']

What does ['partial'] mean, and why is the content of these pages not being scraped by Portia?
Posted on 2016-08-17 22:10:31
A late answer, but hopefully not a useless one, since this behavior of Scrapy does not seem to be well documented. Looking at this line of code in the Scrapy source, the 'partial' flag is set when the request runs into Twisted's PotentialDataLoss error. According to the corresponding Twisted documentation:
"This only occurs when making requests to HTTP servers which do not set a Content-Length or a Transfer-Encoding in the response."
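To make the quoted condition concrete, here is a minimal, self-contained check (my own illustration, not part of Scrapy or Twisted) for whether an HTTP/1.x response carries any body-length framing. Without either header, the only end-of-body signal is the server closing the connection, which is exactly the situation PotentialDataLoss describes:

```python
def lacks_length_framing(headers):
    """Return True if a response with these headers has neither a
    Content-Length nor a Transfer-Encoding header (case-insensitive),
    so a truncated body cannot be distinguished from a complete one."""
    names = {name.lower() for name in headers}
    return "content-length" not in names and "transfer-encoding" not in names

# A response framed only by connection close -> data loss is possible.
print(lacks_length_framing({"Content-Type": "text/html"}))     # True
# Either framing header rules the condition out.
print(lacks_length_framing({"Content-Length": "1024"}))        # False
print(lacks_length_framing({"Transfer-Encoding": "chunked"}))  # False
```

You can confirm which case a server falls into by inspecting its response headers, e.g. with `curl -sI <url>`.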
Possible causes include:
handle_httpstatus_list or handle_httpstatus_all being set, so that the response is not filtered out by HttpErrorMiddleware or handled by RedirectMiddleware.

Source: https://stackoverflow.com/questions/33606080
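The flag-setting behavior described in the answer can be sketched in plain Python. This is a simplified stand-in for illustration only; the real logic lives in Scrapy's HTTP download handler and uses Twisted's failure objects, and the PotentialDataLoss class below is a hypothetical stand-in for twisted.web.client.PotentialDataLoss:

```python
class PotentialDataLoss(Exception):
    """Stand-in for twisted.web.client.PotentialDataLoss: the server
    closed the connection without length framing, so the body received
    so far may be truncated."""

def finish_response(body_chunks, error):
    """Assemble the response body. If delivery ended with a
    PotentialDataLoss error, keep the bytes that did arrive but tag
    the response with the 'partial' flag instead of discarding it."""
    flags = []
    if isinstance(error, PotentialDataLoss):
        flags.append("partial")
    return b"".join(body_chunks), flags

body, flags = finish_response([b"<html>", b"</html>"], PotentialDataLoss())
print(flags)  # ['partial']
body, flags = finish_response([b"<html>", b"</html>"], None)
print(flags)  # []
```

In a spider callback you can test `'partial' in response.flags` to detect such responses; whether the truncated HTML still contains the data you need depends on where the connection was cut off.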