
Loading scraped data into PostgreSQL

Stack Overflow user
Asked on 2015-06-09 04:34:47
2 answers · 1.8K views · 0 followers · 1 vote

I have combined a few web-scraping tutorials and built a simple crawler that scrapes newly posted questions here on Stack Overflow. I want to load them into my PostgreSQL database, but I'm having trouble making sense of the error my crawler is showing me.

The error:

2015-06-09 06:07:10+0200 [stack] ERROR: Error processing {'title': u'Laravel 5 Confused when implements ShoudlQueue',
     'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue'}
    Traceback (most recent call last):
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
        self._startRunCallbacks(result)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/petarp/Documents/PyScraping/RealPython/WebScraping/stack/stack/pipelines.py", line 27, in process_item
        session.commit()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 790, in commit
        self.transaction.commit()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 392, in commit
        self._prepare_impl()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 372, in _prepare_impl
        self.session.flush()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush
        self._flush(objects)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush
        transaction.rollback(_capture_exception=True)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
        compat.reraise(exc_type, exc_value, exc_tb)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush
        flush_context.execute()
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
        rec.execute(self)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
        uow
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj
        mapper, table, insert)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 761, in _emit_insert_statements
        execute(statement, params)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
        return meth(self, multiparams, params)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
        return connection._execute_clauseelement(self, multiparams, params)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
        compiled_sql, distilled_params
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
        context)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
        exc_info
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
        reraise(type(exception), exception, tb=exc_tb)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
        context)
      File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
        cursor.execute(statement, parameters)
    sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist
    LINE 1: INSERT INTO reals (title, url) VALUES ('Laravel 5 Confused w...
                                      ^
     [SQL: 'INSERT INTO reals (title, url) VALUES (%(title)s, %(url)s) RETURNING reals.id'] [parameters: {'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue', 'title': u'Laravel 5 Confused when implements ShoudlQueue'}]

I use SQLAlchemy to define the connection between the crawler and PostgreSQL. Here are settings.py, models.py, and pipelines.py.

settings.py:

BOT_NAME = 'stack'

SPIDER_MODULES = ['stack.spiders']
NEWSPIDER_MODULE = 'stack.spiders'
ITEM_PIPELINES = ['stack.pipelines.StackPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'stack (+http://www.yourdomain.com)'
DATABASE = {
    'drivername': 'postgres',
    'host': 'localhost',
    'port': '5432',
    'username': '********',
    'password': '********',
    'database': '********'
}
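For reference, the `URL(**settings.DATABASE)` call in models.py below assembles these keys into a connection string of the form `drivername://username:password@host:port/database`. A minimal sketch of that assembly (the `to_dsn` helper and the credentials here are illustrative stand-ins, not part of the original code or the redacted settings):

```python
# Illustrative stand-in for sqlalchemy.engine.url.URL(**DATABASE);
# the credentials are placeholders, not the original (redacted) values.
DATABASE = {
    'drivername': 'postgres',
    'host': 'localhost',
    'port': '5432',
    'username': 'user',
    'password': 'secret',
    'database': 'realpython',
}

def to_dsn(cfg):
    # Build the "drivername://username:password@host:port/database" form
    return "{drivername}://{username}:{password}@{host}:{port}/{database}".format(**cfg)

print(to_dsn(DATABASE))  # postgres://user:secret@localhost:5432/realpython
```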

models.py:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings


DeclarativeBase = declarative_base()


def db_connect():
    """ Performs database connections using database settings from settings.py
        Returns sqlalchemy engine instance
    """
    return create_engine(URL(**settings.DATABASE))


def create_reals_table(engine):
    """Create the reals table if it does not already exist."""
    DeclarativeBase.metadata.create_all(engine)


class Reals(DeclarativeBase):
    """SQLAlchemy Reals Model"""
    __tablename__ = 'reals'

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    url = Column('url', String, nullable=True)

pipelines.py:

from sqlalchemy.orm import sessionmaker
from models import Reals, db_connect, create_reals_table


class StackPipeline(object):
    """ Stack Exchange pipeline for storing scraped items in the database """
    def __init__(self):
        """ Initialize database connection and sessionmaker """
        engine = db_connect()
        create_reals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save reals in database.
        This method is called for every item pipeline component."""
        session = self.Session()
        real = Reals(**item)

        try:
            session.add(real)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item

Schema of the reals table:

realpython=# select * from reals limit 5;
 id | title | link 
----+-------+------
(0 rows)

Can anyone help me figure out what is going on here?


2 Answers

Stack Overflow user

Accepted answer

Answered on 2015-06-09 05:20:09

The error message is actually self-explanatory; you only need to look at the last few lines:

sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist

So you either need to change the SQL to insert into the column named link, or rename the column in the table with ALTER TABLE reals RENAME COLUMN link TO url;.
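For a table that already holds data, the rename is usually the lighter fix. As a sketch, using SQLite's in-memory engine as a stand-in for Postgres (the `ALTER TABLE ... RENAME COLUMN` statement is the same syntax; note SQLite only supports it from version 3.25 on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reals (id INTEGER PRIMARY KEY, title TEXT, link TEXT)")

# The same statement the answer suggests running against Postgres:
conn.execute("ALTER TABLE reals RENAME COLUMN link TO url")

cols = [row[1] for row in conn.execute("PRAGMA table_info(reals)")]
print(cols)  # ['id', 'title', 'url']
```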

Votes: 2

Stack Overflow user

Answered on 2015-06-09 05:27:08

I found the solution myself.

The problem was the url field defined in my Items.py. I had defined it as shown below, while the table schema in my database used link, so I simply replaced url with link and the data was loaded into PostgreSQL successfully.

Old Items.py:

from scrapy import Item, Field


class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    url = Field()

New Items.py:

from scrapy import Item, Field


class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    link = Field()

Expected result:

 id |                                 title                                  |                                          link                                          
----+------------------------------------------------------------------------+----------------------------------------------------------------------------------------
  1 | pointcut execution for specific class constructor                      | /questions/30723494/pointcut-execution-for-specific-class-constructor
  2 | PWX-00001 Error opening repository “dtlmsg.txt”. RCs = 268/150/2       | /questions/30723493/pwx-00001-error-opening-repository-dtlmsg-txt-rcs-268-150-2
  3 | Can anyone share a sample c++ program, that reads ASCII stl type file? | /questions/30723491/can-anyone-share-a-sample-c-program-that-reads-ascii-stl-type-file
  4 | Where should I do the core logic code in express js?                   | /questions/30723487/where-should-i-do-the-core-logic-code-in-express-js
  5 | configuring rails application to make ui router work                   | /questions/30723485/configuring-rails-application-to-make-ui-router-work
(5 rows)
Votes: 1
Original question: https://stackoverflow.com/questions/30723086