What I want to do is scrape the company information (thisisavailable.eu.pn/company.html) and add the data for all board members, which live on separate pages, to the company's board. So ideally, the data I'd get from the sample page would be:
{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "info@mycompany.com",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}
I have searched Google (for example here and here) and the Scrapy docs, but haven't been able to find a solution for a problem like this.
What I have been able to piece together so far:
items.py:
import scrapy

class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()

class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()

spiders/example.py:
import scrapy
from proov.items import company_item, person_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)

    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []
        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print(person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/' + person_row.xpath("a/@href").extract_first(), callback=self.parse_person)
            request.meta['Person'] = Person
            return request
            board.append(Person)
        Company['board'] = board
        return Company

    def parse_person(self, response):
        print('PERSON!!!!!!!!!!!')
        print(response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person

UPDATE: As Rafael noticed, the initial problem was that allowed_domains was wrong -- I have commented it out for now, and when I run it I get (*'s added to URLs because of low rep):
scrapy crawl example
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares: [..., 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://thisisavailable.eu.pn/person2.html> (referer: http://thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
{..., 'name': u'Kaspar\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {..., 'downloader/response_status_count/404': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000), 'item_scraped_count': 1, 'log_count/DEBUG': 5, 'log_count/INFO': 7, 'request_depth_max': 1, 'response_received_count': 3, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)
If run with "-o file.json", the file contains:
{"code": "222222222", "name": "Ralph Pike"}
So it got a little further, but I'm still at a loss as to how to make it work.
Can anyone help me get this working?
Posted on 2017-03-06 11:29:35
Your problem is not related to having multiple items, even though that may become an issue in the future.
Your problem is explained in this line of the output:
2017-03-06 10:44:33 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to '….c9users.io': <GET http://thisisavailable.eu.pn/scrapy/person2.html>
It means the spider is going to a domain that is not in the allowed_domains list.
Your allowed domain is wrong. It should be:
allowed_domains = ["thisisavailable.eu.pn"]
Note: don't use a separate item for Person. Just make board a field of Company and assign a dict or a list to it while scraping.
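To make that suggestion concrete, here is a Scrapy-free sketch of the pattern: one company dict whose "board" field is a list, threaded through meta-style state as the person links are followed one after another, and emitted only once the last person page has been parsed. PAGES and fetch() are hypothetical stand-ins for the real pages and for Scrapy's request/response machinery, so this only illustrates the data flow, not the actual spider API.

```python
# Hypothetical stand-in for the person pages on the site.
PAGES = {
    "/person.html":  {"name": "Margaret Sawfish", "code": "9999999999"},
    "/person2.html": {"name": "Ralph Pike",       "code": "222222222"},
}

def fetch(url):
    # Stand-in for Scrapy downloading and parsing a person page.
    return PAGES[url]

def parse_company(person_links):
    # Build the single item up front; the board field starts empty.
    company = {"company": "Mycompany Ltd", "board": []}
    # This dict plays the role of request.meta: it carries the shared
    # item plus the list of person URLs still to visit.
    meta = {"company": company, "pending": list(person_links)}
    return follow_next(meta)

def follow_next(meta):
    # Mirrors issuing the next scrapy.Request with meta attached;
    # when nothing is pending, the finished company item is returned.
    if not meta["pending"]:
        return meta["company"]
    url = meta["pending"].pop(0)
    return parse_person(fetch(url), meta)

def parse_person(response, meta):
    # Mirrors the parse_person callback: append to the shared item,
    # then chain to the next person (or finish).
    meta["company"]["board"].append(response)
    return follow_next(meta)

item = parse_company(["/person.html", "/person2.html"])
print(item)
```

In a real spider, follow_next would return a scrapy.Request carrying the company dict in request.meta, and parse_person would yield either the next request or the completed item; the key point is that nothing is yielded until the chain runs out of links.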
https://stackoverflow.com/questions/42621448