I am using Scrapy with Python. I created a crawler, and I want to store the URLs it scrapes in a Postgres table. When I start the crawler, it scrapes the URLs and creates the table in Postgres, but no data gets stored.

Technology used: Scrapy
Expected output: the URLs should be stored in the Postgres table.
Problem: none of the URLs are stored, and the crawler does not work for all websites.

Please help!
import scrapy
import os
import psycopg2

conn = psycopg2.connect(
    database="postgres", user='postgres', password='password', host='127.0.0.1', port='5432'
)
print("connected")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS tmp_crawler
    (
        WEBSITE VARCHAR(500) NOT NULL
    )
""")


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com//']

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            abs_url = response.urljoin(url)
            var1 = "INSERT INTO tmp_crawler(website) VALUES('" + url + "')"
            cur.execute(var1)
            conn.commit()
            yield {'title': abs_url}
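One likely reason the inserts fail for some pages is the string-concatenated SQL: any scraped href containing a single quote produces invalid SQL, and the execute call raises. A parameterized query avoids this. The sketch below uses the stdlib sqlite3 module only as a stand-in for psycopg2 (sqlite3 uses `?` placeholders where psycopg2 uses `%s`), so it can be run without a Postgres server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tmp_crawler (website TEXT NOT NULL)")

# A URL containing a single quote -- common in scraped hrefs.
url = "https://example.com/o'reilly"

# Concatenated SQL breaks on the embedded quote:
try:
    cur.execute("INSERT INTO tmp_crawler(website) VALUES('" + url + "')")
except sqlite3.OperationalError as e:
    print("concatenation failed:", e)

# A parameterized query lets the driver handle quoting
# (with psycopg2 the placeholder would be %s instead of ?):
cur.execute("INSERT INTO tmp_crawler(website) VALUES (?)", (url,))
print(cur.execute("SELECT website FROM tmp_crawler").fetchone()[0])
```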
Posted on 2021-12-15 17:56:28
You can use Scrapy's ITEM_PIPELINES for this. See the example implementation below:
import scrapy
import psycopg2


class DBPipeline(object):
    def open_spider(self, spider):
        # connect to database
        try:
            self.conn = psycopg2.connect(database="postgres", user="postgres", password="password", host="127.0.0.1", port="5432")
            self.conn.autocommit = True
            self.cur = self.conn.cursor()
        except psycopg2.Error as e:
            spider.logger.error(f"Unable to connect to database: {e}")

        # create the table
        try:
            self.cur.execute("CREATE TABLE IF NOT EXISTS tmp_crawler (website VARCHAR(500) NOT NULL);")
        except psycopg2.Error as e:
            spider.logger.error(f"Error `{e}` creating table `tmp_crawler`")

    def process_item(self, item, spider):
        try:
            # parameterized query -- the driver handles quoting
            self.cur.execute('INSERT INTO tmp_crawler (website) VALUES (%s)', (item.get('title'),))
            spider.logger.info("Item inserted to database")
        except psycopg2.Error as e:
            spider.logger.error(f"Error `{e}` while inserting item <{item.get('title')}>")
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']
    custom_settings = {
        'ITEM_PIPELINES': {
            DBPipeline: 500
        }
    }

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            yield {'title': response.urljoin(url)}

https://stackoverflow.com/questions/70366082