
Q: Python data scraping with Scrapy

Stack Overflow user

Asked 2013-05-28 14:49:26

4 answers, 11K views, 0 followers, 8 votes

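I want to scrape data from a website that has text fields, buttons, and so on. My requirement is to fill in the text fields, submit the form to get the results, and then scrape the data points from the results page.

I would like to know whether Scrapy has this feature, or whether anyone can recommend a Python library that accomplishes this task.

(edited)

I want to scrape data from the following website:

http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType

My requirement is to select values from the combo boxes, click the search button, and then scrape the data points from the results page.

P.S. I am using the selenium Firefox driver to scrape data from some other websites, but that solution is not good because the selenium Firefox driver depends on the Firefox executable, i.e. Firefox must be installed before running the scraper.

Firefox also sometimes crashes while the scraper is running, and I don't know why. Besides, I need windowless scraping, which is not possible with the selenium Firefox driver.

My ultimate goal is to run the scrapers on Heroku, where I have a Linux environment, so the selenium Firefox driver won't work on Heroku. Thanks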

scrapy

python

python-2.7

web-scraping

4 Answers

Stack Overflow user

Accepted answer

Posted 2013-05-28 16:06:26

Basically, you have plenty of tools to choose from:

scrapy
beautifulsoup
lxml
mechanize
requests (and grequests)
selenium
ghost.py

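These tools have different purposes, but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating the page (clicking buttons, filling in forms), it gets more complicated:

sometimes it is easy to simulate filling in and submitting a form by issuing the underlying form request directly in Scrapy

sometimes you have to use other tools alongside Scrapy, such as mechanize or selenium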
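To make "issuing the underlying form request" more concrete: a submitted form is just an HTTP POST whose body holds the url-encoded field values, including any hidden inputs. Below is a minimal sketch using only the Python standard library. The helper names (HiddenInputCollector, build_form_body), the sample HTML, and the "DEED" value are assumptions made for illustration; they are not taken from any real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_form_body(page_html, overrides):
    """Return a url-encoded POST body: the page's hidden fields plus our own values."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)
    return urlencode(fields)


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<form action="/search" method="post">
  <input type="hidden" name="__RequestVerificationToken" value="abc123">
  <input type="hidden" name="hid_page" value="1">
  <input type="text" name="q">
</form>
"""

body = build_form_body(SAMPLE_HTML, {"hid_doctype": "DEED"})
print(body)
```

Whatever tool sends the request, this body together with the form's action URL is what the server receives when the search button is clicked.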

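If you make your question more specific, it will help to understand which kinds of tools you should use or choose from.

Take a look at an interesting example of mixing scrapy and selenium. Here, selenium's task is to click the button and provide the data for the scrapy items: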

import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class ElyseAvenueItem(scrapy.Item):
    name = scrapy.Field()


class ElyseAvenueSpider(scrapy.Spider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self, *args, **kwargs):
        # Let Scrapy finish its own spider setup before starting the browser.
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the same URL in the real browser so that selenium can click on it.
        self.driver.get(response.url)

        # find_elements returns an empty list when nothing matches,
        # so the check below is safe even if the button is missing.
        buttons = self.driver.find_elements(By.XPATH, "//input[contains(@class,'btn go-btn')]")
        if buttons:
            buttons[0].click()

        # Crude fixed wait for the results to load.
        time.sleep(10)

        plans = self.driver.find_elements(By.CLASS_NAME, "plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element(By.CLASS_NAME, 'primary').text
            yield item

        # Shut the browser down once all items have been produced.
        self.driver.quit()
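The fixed ten-second sleep in the spider above either wastes time or is too short on a slow page. One possible alternative is to poll until the elements show up. The sketch below is a generic, standard-library-only helper; the name wait_until and its default timeout and interval are assumptions for illustration. Selenium also ships its own explicit-wait support (WebDriverWait), which is likely the better fit in real code.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or None if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

In the spider, the predicate would be a small function that asks the driver for the "plan-info" elements; an empty list counts as "not ready yet", so the loop keeps polling until the list is non-empty.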

UPDATE:

Here is an example of how to use scrapy in your case:

import scrapy
from scrapy.http import FormRequest


class AcrisItem(scrapy.Item):
    borough = scrapy.Field()
    block = scrapy.Field()
    doc_type_name = scrapy.Field()


class AcrisSpider(scrapy.Spider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']

    def parse(self, response):
        # The anti-forgery token from the search page is sent back with every search.
        form_token = response.xpath('//input[@name="__RequestVerificationToken"]/@value').get(default='')

        # Issue one search request per document type offered in the drop-down.
        for document_class in response.xpath('//select[@name="combox_doc_doctype"]/option'):
            doc_type = document_class.xpath('./@value').get()
            doc_type_name = document_class.xpath('./text()').get()
            if not doc_type or not doc_type_name:
                continue  # skip options that have no value or no label

            formdata = {'__RequestVerificationToken': form_token,
                        'hid_selectdate': '7',
                        'hid_doctype': doc_type,
                        'hid_doctype_name': doc_type_name,
                        'hid_max_rows': '10',
                        'hid_ISIntranet': 'N',
                        'hid_SearchType': 'DOCTYPE',
                        'hid_page': '1',
                        'hid_borough': '0',
                        'hid_borough_name': 'ALL BOROUGHS',
                        'hid_ReqID': '',
                        'hid_sort': '',
                        'hid_datefromm': '',
                        'hid_datefromd': '',
                        'hid_datefromy': '',
                        'hid_datetom': '',
                        'hid_datetod': '',
                        'hid_datetoy': '', }
            yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                              method="POST",
                              formdata=formdata,
                              callback=self.parse_page,
                              meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        # Each row of the inner results table becomes one item.
        rows = response.xpath('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            borough = row.xpath('.//td[2]/div/font/text()').get()
            block = row.xpath('.//td[3]/div/font/text()').get()

            if borough and block:
                item = AcrisItem()
                item['borough'] = borough
                item['block'] = block
                # The document type name travels with the request via meta.
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it in spider.py and run it via scrapy runspider spider.py -o output.json; in output.json you will see:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...
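In the sample output above, every item carries the literal strings "Borough" and "Block", which suggests that the row XPath also matches the table's header row. If that is the case, such items could be dropped in a post-processing step. The sketch below is one way to do it; the helper names and the second sample item ("MANHATTAN", "1234") are made up for illustration and are not real results.

```python
# Column labels as they appear in the sample output above.
HEADER_LABELS = {"borough": "Borough", "block": "Block"}


def is_header_row(item):
    """True if every checked field merely repeats its column label."""
    return all(
        (item.get(key) or "").strip() == label
        for key, label in HEADER_LABELS.items()
    )


def drop_header_rows(items):
    """Return only the items that look like real data rows."""
    return [item for item in items if not is_header_row(item)]


items = [
    # Taken from the sample output above.
    {"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"},
    # Hypothetical data row, invented for this example.
    {"doc_type_name": "CONFIRMATORY DEED ", "borough": "MANHATTAN", "block": "1234"},
]

print(drop_header_rows(items))
```

The same check could live in a Scrapy item pipeline or directly in parse_page; keeping it as a plain function makes it easy to test on its own.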

Hope that helps.

Votes: 18

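Stack Overflow user

Posted 2013-05-28 15:13:01

If you simply want to submit the form and extract data from the resulting page, I would go for:

requests to send the POST request
beautiful soup to extract the chosen data from the result page

Scrapy's added value really lies in its ability to follow links and crawl a website. I don't think it is the right tool for the job if you know precisely what you are looking for.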
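The two steps above (send a POST, then pull cells out of the result page) can be sketched as follows. To keep the example free of third-party dependencies it uses the standard library (urllib and html.parser) as a stand-in for requests and beautiful soup; it is not the code this answer had in mind. The URL, field names, and helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, fields):
    """Build (but do not send) a POST request with url-encoded form fields."""
    data = urlencode(fields).encode("ascii")
    return Request(url, data=data, method="POST")


class TableCellParser(HTMLParser):
    """Collect the text of each <td> cell, grouped per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical endpoint and fields, for illustration only.
request = build_post("http://example.com/search", {"q": "deed", "page": "1"})

parser = TableCellParser()
parser.feed("<table><tr><td>A</td><td> B </td></tr></table>")
print(request.get_method(), parser.rows)
```

Passing the request to urllib.request.urlopen would actually send it; the response body would then be decoded and fed to the parser in place of the literal HTML used here.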

Votes: 3

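Stack Overflow user

Posted 2013-05-28 15:05:22

I would personally use mechanize, as I don't have any experience with scrapy. However, a library named scrapy that is purpose-built for screen scraping should be up to the task. I would just have a go with both of them and see which one does the job best or most easily.

Votes: 2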

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/16785540
