I am using Scrapy with Python. I created a crawler, and I want to store the URLs it scrapes in a Postgres table. When I start the crawler, it scrapes the URLs and creates the table in Postgres, but no data gets stored.

Technology used: Scrapy
Expected output: the URLs should be stored in the Postgres table.
Problem: none of the URLs are stored, and the crawler does not work for all websites.

Please help!
import scrapy
import os
import psycopg2

conn = psycopg2.connect(
    database="postgres", user='postgres', password='password', host='127.0.0.1', port='5432'
)
print("connected")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS tmp_crawler
    (
        WEBSITE VARCHAR(500) NOT NULL
    )
""")


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com//']

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            abs_url = response.urljoin(url)
            var1 = "INSERT INTO tmp_crawler(website) VALUES('" + url + "')"
            cur.execute(var1)
            conn.commit()
            yield {'title': abs_url}
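One likely reason the inserts fail for some pages is the string-concatenated SQL: any scraped href containing a single quote produces invalid SQL, and the execute call raises. A parameterized query avoids this. The sketch below uses the stdlib sqlite3 module only as a stand-in for psycopg2 (sqlite3 uses `?` placeholders where psycopg2 uses `%s`), so it can be run without a Postgres server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tmp_crawler (website TEXT NOT NULL)")

# A URL containing a single quote -- common in scraped hrefs.
url = "https://example.com/o'reilly"

# Concatenated SQL breaks on the embedded quote:
try:
    cur.execute("INSERT INTO tmp_crawler(website) VALUES('" + url + "')")
except sqlite3.OperationalError as e:
    print("concatenation failed:", e)

# A parameterized query lets the driver handle quoting
# (with psycopg2 the placeholder would be %s instead of ?):
cur.execute("INSERT INTO tmp_crawler(website) VALUES (?)", (url,))
print(cur.execute("SELECT website FROM tmp_crawler").fetchone()[0])
```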
Posted on 2021-12-15 17:56:28
You can use Scrapy's ITEM_PIPELINES for this. See the example implementation below:
import scrapy
import psycopg2


class DBPipeline(object):
    def open_spider(self, spider):
        # connect to database
        try:
            self.conn = psycopg2.connect(database="postgres", user="postgres", password="password", host="127.0.0.1", port="5432")
            self.conn.autocommit = True
            self.cur = self.conn.cursor()
        except psycopg2.Error as e:
            spider.logger.error(f"Unable to connect to database: {e}")

        # create the table
        try:
            self.cur.execute("CREATE TABLE IF NOT EXISTS tmp_crawler (website VARCHAR(500) NOT NULL);")
        except psycopg2.Error as e:
            spider.logger.error(f"Error `{e}` creating table `tmp_crawler`")

    def process_item(self, item, spider):
        try:
            # parameterized query -- the driver handles quoting
            self.cur.execute('INSERT INTO tmp_crawler (website) VALUES (%s)', (item.get('title'),))
            spider.logger.info("Item inserted to database")
        except psycopg2.Error as e:
            spider.logger.error(f"Error `{e}` while inserting item <{item.get('title')}>")
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']
    custom_settings = {
        'ITEM_PIPELINES': {
            DBPipeline: 500
        }
    }

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            yield {'title': response.urljoin(url)}

https://stackoverflow.com/questions/70366082