
Python spider returns an empty JSON file

Stack Overflow user
Asked on 2020-02-19 16:40:09
2 answers · 336 views · 0 following · 1 vote

I created a JSON file in Python to store data scraped with Scrapy, but the JSON file is empty even though the spider scrapes all the data. When I run the crawl command, all the scraped data is shown in the terminal, yet none of it is written to the JSON file. I couldn't find any solution, so I'm sharing both the spider file and items.py.

I use this command: `scrapy crawl scraper -o products.json`

Spider.py

import scrapy
from bs4 import BeautifulSoup as Soup
from ..items import ScrapyArbiItem
import requests
from idna import unicode


class Scraper(scrapy.Spider):
    name = "scraper"

    start_urls = [
          'https://www.fenom.com/en/263-men',
        # 'https://www.fenom.com/en/263-men#/page-2',
        # 'https://www.fenom.com/en/263-men#/page-3',
        # 'https://www.fenom.com/en/263-men#/page-4',
        # 'https://www.fenom.com/en/263-men#/page-5',
        # 'https://www.fenom.com/en/263-men#/page-6',
        # 'https://www.fenom.com/en/263-men#/page-7',
    ]

    def parse(self, response):

        items = ScrapyArbiItem()

        page_soup = Soup(response.text, 'html.parser')
        uls = page_soup.find_all("ul", class_="product_list grid row")[0]
        # import pdb;
        # pdb.set_trace()
        for li in uls.find_all("li", class_="ajax_block_product block_home col-xs-6 col-sm-4 col-md-3"):
            data_to_write = []
            try:
                # print("gnbfrgjrnbgfjnbruigbnruig")
                div = li.find('div', class_='product-container')
                left_block = div.find('div', class_="left-block")
                image_container = left_block.find('div', class_="product-image-container")
                image = image_container.find('a')
                image_url_a = image_container.find('a', class_="product_img_link")
                image_url = image_url_a.find('img', class_='replace-2x img-responsive')
                image_url = image_url.get('src')  # image_url
                url = image.get('href')  # url of product
                right_block = div.find('div', class_="right-block")
                right_a = right_block.find('a')
                product = right_a.find('span', class_="product-name")
                product_name = product.text  # product_name
                pp = right_a.find('span', class_="content_price")
                product_p = pp.find('span', class_="product-price")
                product_price = product_p.text  # product_price


                items ['product_name'] = product_name
                items['product_price'] = product_price
                items['url'] = url


                print(items)
                #print(product_name)
                #print(product_price)
                #print(url)
                #print(image_url)
                next_page = url
                # import pdb;pdb.set_trace()
                # print(url)
                # if url:
                #     yield scrapy.Request(url, callback=self.parsetwo, dont_filter=True)
            except:
                pass

items.py

In this file, all the extracted data is arranged into a temporary container (the item's fields).

import scrapy

class ScrapyArbiItem(scrapy.Item):
    # define the fields for your item here like:
    product_name = scrapy.Field()
    product_price = scrapy.Field()
    url = scrapy.Field()

2 Answers

Stack Overflow user

Accepted answer

Posted on 2020-02-19 18:16:14

I used `yield items` instead of `print(items)`, and that solved the problem.
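The fix can be illustrated without Scrapy at all. Scrapy's feed exporter iterates over whatever the `parse()` callback yields; a callback that only calls `print()` is a plain function returning `None`, so the exporter receives nothing and writes an empty file. A minimal sketch (the row data below is made up for illustration):

```python
def parse_with_print(rows):
    for row in rows:
        print(row)  # data only reaches the terminal; function returns None


def parse_with_yield(rows):
    for row in rows:
        yield row   # data reaches whoever iterates the generator (the exporter)


rows = [{"product_name": "shirt", "product_price": "$10"}]

exported_print = list(parse_with_print(rows) or [])  # nothing collected
exported_yield = list(parse_with_yield(rows))        # one item collected
```

This is why the terminal showed all the data (the `print` calls ran) while `products.json` stayed empty (nothing was ever yielded back to Scrapy).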

import scrapy
from bs4 import BeautifulSoup as Soup
from ..items import ScrapyArbiItem
import requests
from idna import unicode


class Scraper(scrapy.Spider):
    name = "scraper"

    page_number = 2  # for pagination

    start_urls = [
          'https://www.fenom.com/en/263-men#/page-1', #firstpage
    ]

    def parse(self, response):

        items = ScrapyArbiItem() #for items container-storing extracted data

        page_soup = Soup(response.text, 'html.parser')
        uls = page_soup.find_all("ul", class_="product_list grid row")[0]

        for li in uls.find_all("li", class_="ajax_block_product block_home col-xs-6 col-sm-4 col-md-3"):

            try:
                # print("gnbfrgjrnbgfjnbruigbnruig")
                div = li.find('div', class_='product-container')
                left_block = div.find('div', class_="left-block")
                image_container = left_block.find('div', class_="product-image-container")
                image = image_container.find('a')
                image_url_a = image_container.find('a', class_="product_img_link")
                image_url = image_url_a.find('img', class_='replace-2x img-responsive')
                image_url = image_url.get('src')  # image_url
                url = image.get('href')  # url of product
                right_block = div.find('div', class_="right-block")
                right_a = right_block.find('a')
                product = right_a.find('span', class_="product-name")
                product_name = product.text  # product_name
                pp = right_a.find('span', class_="content_price")
                product_p = pp.find('span', class_="product-price")
                product_price = product_p.text  # product_price


                items ['product_name'] = product_name
                items['product_price'] = product_price
                items['url'] = url


                yield items  # yield each populated item so the feed exporter writes it
                #print(product_name)
                #print(product_price)
                #print(url)
                #print(image_url)
            except:
                pass
2 votes

Stack Overflow user

Posted on 2020-02-19 17:35:21

It looks like all you need to do now is return the items object from the callback, and you should be good to go.
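This second suggestion works because a Scrapy callback may return any iterable of items, not only a generator; the exporter walks whatever it gets back. A hedged sketch of that approach, with the loop collecting items into a list instead of yielding (the row structure here is hypothetical, standing in for the parsed HTML):

```python
def parse(rows):
    # Collect one item per product, mirroring the spider's fields,
    # and return the whole list -- Scrapy iterates it and exports each item.
    results = []
    for row in rows:
        item = {"product_name": row["name"],
                "product_price": row["price"],
                "url": row["url"]}
        results.append(item)
    return results


items = parse([{"name": "shirt", "price": "$10", "url": "https://example.com/shirt"}])
```

Either style (returning a list or yielding items one by one) reaches the feed exporter; `print()` reaches neither.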

1 vote
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/60305178
