
Scrapy crawler: unable to store multiple URLs in postgres

Stack Overflow user
Asked on 2021-12-15 15:15:21
1 answer · 51 views · 0 followers · score 0

I am using Scrapy with Python. I created a crawler and want to store the URLs it fetches in a postgres table. When I start the crawler, it scrapes the URLs and creates the table in postgres, but no data gets stored in it.

Technology used: Scrapy

Expected output: the URLs should be stored in the postgres table.

Error: I am unable to store all of the URLs. The crawler also does not work for all websites.

Please help!

import scrapy
import os
import psycopg2

conn = psycopg2.connect(
   database="postgres", user='postgres', password='password', host='127.0.0.1', port= '5432'
)
print("connected")
conn.autocommit = True
cur=conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS tmp_crawler
(
WEBSITE VARCHAR(500) NOT NULL
)

""")


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains=['google.com']
    start_urls = ['https://www.google.com//'] 
    

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            abs_url = response.urljoin(url)
            var1 = "INSERT INTO tmp_crawler(website) VALUES('" + url + "')"
            cur.execute(var1)
        conn.commit()
        yield {'title': abs_url}
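One likely failure mode in the code above: the INSERT is built by string concatenation, so any URL containing a single quote (common in query strings) breaks the statement. A minimal sketch of a parameterized insert loop instead; it uses an in-memory SQLite database as a stand-in so it runs without a postgres server (with psycopg2 the placeholder is `%s` rather than sqlite3's `?`):

```python
import sqlite3

# In-memory SQLite stands in for postgres so the sketch runs anywhere;
# the parameterization idea carries over to psycopg2 unchanged.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tmp_crawler (website VARCHAR(500) NOT NULL)")

urls = ["https://example.com/a", "https://example.com/search?q=it's"]
for url in urls:
    # The driver escapes the value, so quotes in the URL are safe.
    cur.execute("INSERT INTO tmp_crawler (website) VALUES (?)", (url,))
conn.commit()

cur.execute("SELECT website FROM tmp_crawler")
rows = [r[0] for r in cur.fetchall()]
print(rows)
```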


1 Answer

Stack Overflow user

Answered on 2021-12-15 17:56:28

You can achieve this with a Scrapy item pipeline (ITEM_PIPELINES). See the sample implementation below.

import scrapy
import psycopg2

class DBPipeline(object):
    def open_spider(self, spider):
        # connect to database
        try:
            self.conn = psycopg2.connect(database = "postgres", user = "postgres", password = "password", host = "127.0.0.1", port = "5432")
            self.conn.autocommit = True
            self.cur = self.conn.cursor()
        except psycopg2.Error:
            spider.logger.error("Unable to connect to database")

        # create the table
        try:
            self.cur.execute("CREATE TABLE IF NOT EXISTS tmp_crawler (website VARCHAR(500) NOT NULL);")
        except psycopg2.Error:
            spider.logger.error("Error creating table `tmp_crawler`")

    def process_item(self, item, spider):
        try:
            self.cur.execute('INSERT INTO tmp_crawler (website) VALUES (%s)', (item.get('title'),))
            spider.logger.info("Item inserted to database")
        except Exception as e:
            spider.logger.error(f"Error `{e}` while inserting item <{item.get('title')}>")
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains=['google.com']
    start_urls = ['https://www.google.com/'] 
    custom_settings = {
        'ITEM_PIPELINES': {
            DBPipeline: 500
        }
    }

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            yield {'title': response.urljoin(url)}
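As a side note, `response.urljoin(url)` in `parse` resolves relative hrefs against the page URL the same way the standard library's `urllib.parse.urljoin` does, which is why both relative and absolute links end up as full URLs in the table. A quick illustration:

```python
from urllib.parse import urljoin

base = "https://www.google.com/"  # stand-in for response.url
hrefs = ["/search?q=scrapy", "mail/", "https://example.com/page"]

# Mirrors what response.urljoin(url) does for each extracted href.
abs_urls = [urljoin(base, h) for h in hrefs]
print(abs_urls)
```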
Score: 1
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's dedicated IT-domain engine.
Original link: https://stackoverflow.com/questions/70366082
