I'm currently developing a scraper; it's my first time using Scrapy, and my first time using yield as well. I'm still trying to wrap my head around yield.

The scraper:

I'm having trouble understanding how to combine the JSON from parse_individual_tabs and parse_individual_listings into a single JSON string. There would be one per individual listing, and it would be sent to an API. Even just printing it out for now would be fine.
import json

import scrapy
from bs4 import BeautifulSoup
from urllib import parse as urlparse  # Python 3; on Python 2 use `import urlparse`


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        '',  # start URL omitted
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = ""  # URL construction omitted
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        urls = []
        urls.append(response.url)
        for link in pages.find_all('a'):
            urls.append('/'.format(link['href']))
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})
        for row in resultTable.find_all('a'):
            url = ""  # URL construction omitted
            yield scrapy.Request(url, callback=self.parse_individual_listings)

    def parse_individual_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            print(field.label.text.strip())
            print(field.p.text.strip())
        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urlparse.urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs)

I then changed parse_individual_listings to:
    def parse_individual_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        data = {}
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()
        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urlparse.urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs,
                meta={'data': data}
            )
        print(data)
    def parse_individual_tabs(self, response):
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()
        print(json.dumps(data))

And changed parse_individual_tabs to:
    def parse_individual_tabs(self, response):
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()
        yield json.dumps(data)

Posted on 2016-12-02 16:20:54
Normally when extracting data you would use Scrapy Items, but they can also be replaced by plain dictionaries (which would be the JSON objects you are referring to), so we'll use those here:
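As a minimal illustration of the dict-as-item idea (plain Python, no Scrapy; the field names are made up for the example):

```python
import json

# A plain dict can play the role of a Scrapy Item; json.dumps turns it
# into the JSON string you would ultimately send to an API.
listing = {}
listing['Address'] = '123 Main St'   # hypothetical field
listing['Status'] = 'Approved'       # hypothetical field

payload = json.dumps(listing, sort_keys=True)
print(payload)
```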
First, start creating the item (or dictionary) in the parse_individual_listings method, just as you did with data in parse_individual_tabs. Then pass it along with the next request; parse_individual_tabs will pick it up via the meta argument, so it should look like this:
def parse_individual_listings(self, response):
    ...
    data = {}
    data[field1] = 'data1'
    data[field2] = 'data2'
    ...
    yield scrapy.Request(
        urlparse.urljoin(response.url, link['href']),
        callback=self.parse_individual_tabs,
        meta={'data': data}
    )

Then you can access that data in parse_individual_tabs:
def parse_individual_tabs(self, response):
    data = response.meta['data']
    ...
    # keep populating `data`
    yield data

Now data in parse_individual_tabs holds all the information from both requests, and you can do the same between any of your callback requests.
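The hand-off above can be sketched without Scrapy by faking the meta dict. The two functions below are simplified stand-ins for the real callbacks, and the field names are invented for the example:

```python
import json

def parse_individual_listings():
    # Data collected from the listing page (hypothetical fields).
    data = {'Name': 'Listing 1', 'Type': 'Permit'}
    # Scrapy would carry this dict to the next callback via
    # Request(..., meta={'data': data}); here we return it directly.
    return {'data': data}

def parse_individual_tabs(meta):
    data = meta['data']            # what response.meta['data'] would return
    data['Tab field'] = 'value'    # keep populating `data` with tab data
    return json.dumps(data)        # one JSON string per listing

result = parse_individual_tabs(parse_individual_listings())
print(result)
```

The key point is that both callbacks mutate the same dictionary, so the final json.dumps sees the fields from both pages.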
https://stackoverflow.com/questions/40936117