Scraper not scraping the "Description" field

Stack Overflow user
Asked 2019-12-02 00:28:44
Answers: 1, Views: 54, Followers: 0, Votes: 1

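I have a web scraper that was written for me using Scrapy.

I want to add one more field, scraped from the same site the scraper already pulls from.

The "Description" column header is created in the CSV output, but nothing is scraped into it.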

# -*- coding: utf-8 -*-
import scrapy
from pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import csv,re
from scrapy import signals
class Rapid7(scrapy.Spider):
    name = 'vulns'
    allowed_domains = ['rapid7.com']
    main_url = 'https://www.rapid7.com/db/?q=&type=nexpose&page={}'
    #start_urls = ['https://www.rapid7.com/db/vulnerabilities']
    keys = ['Published','CVEID', 'Added', 'Modified', 'Related', 'Severity', 'CVSS', 'Created', 'Solution', 'References', 'Description', 'URL']
    def __init__(self):
        SignalManager(dispatcher.Any).connect(receiver=self._close, signal=signals.spider_closed)
    def start_requests(self):
        for i in range(1,10):
            url = self.main_url.format(i)
            yield scrapy.Request(url,callback=self.parse)
    def parse(self, response):
        flag = True
        temp = response.xpath('//div[@class="vulndb__intro-content"]/p/text()').extract_first()
        if temp:
            if temp.strip()=='An error occurred.':
                flag= False
        temp = [i for i in response.xpath('//*[@class="results-info"]/parent::div/p/text()').extract()if i.strip()]
        if len(temp)==1:
            flag= False
        if flag:
            for article in response.xpath('//*[@class="vulndb__results"]/a/@href').extract():
                yield scrapy.Request(response.urljoin(article), callback=self.parse_article, dont_filter=True)

    def parse_article(self,response):
        item=dict()
        item['Published'] = item['Added'] = item['Modified'] = item['Related'] = item['Severity'] = item['Description'] =''
        r=response.xpath('//h1[text()="Related Vulnerabilities"]/..//a/@href').extract()
        temp = response.xpath('//meta[@property="og:title"]/@content').extract_first()
        item['CVEID'] = ''
        try:
            temp2 = re.search(r'(CVE-.*-\d*)',temp).groups()[0]
            if ":" in temp2:
                raise KeyError
        except:
            try:
                temp2 = re.search('(CVE-.*):',temp).groups()[0]
            except:
                temp2 = ''
        if temp2:
            item['CVEID'] = temp2.replace(': Important',"").replace(')','')
        table = response.xpath('//section[@class="tableblock"]/div')
        for row in table:
            header = row.xpath('header/text()').extract_first()
            data = row.xpath('div/text()').extract_first()
            item[header]=data
        temp = [i for i in response.xpath('//div[@class="vulndb__related-content"]//text()').extract() if i.strip()]
        for ind,i in enumerate(temp):
            if "CVE" in i:
                temp[ind] = i.replace(' ','')

        item['Related']= ", ".join(temp) if temp else ""
        temp2= [i for i in response.xpath('//h4[text()="Solution(s)"]/parent::*/ul/li/text()').extract() if i.strip()]
        item['Solution'] =", ".join(temp2) if temp2 else ''
        temp3 = [i for i in response.xpath('//h4[text()="References"]/parent::*/ul/li/text()').extract() if i.strip()]
        item['References'] = ", ".join(temp3) if temp3 else ''
        temp4 = [i for i in response.xpath('//h4[text()="Description"]/parent::*/ul/li/text()').extract() if i.strip()]
        item['Description'] = ", ".join(temp4) if temp4 else ''
        item['URL'] = response.request.url
        new_item=dict()
        for key in self.keys:
            if key not in list(item.keys()):
                new_item[key] = ''
            else:
                new_item[key]=item[key]
        yield new_item

    def _close(self):
        print("Done Scraping")

Thanks.

"It looks like your post is mostly code; please add some more details." Sorry. :(


1 Answer

Stack Overflow user

Answered 2019-12-02 13:24:21

Try replacing your temp4 line:

temp4 = [i for i in response.xpath('//h4[text()="Description"]/parent::*/ul/li/text()').extract() if i.strip()]

with:

temp4 = [i for i in response.xpath('//h4[text()="Description"]/parent::*/p/text()').extract() if i.strip()]

Under <h4>Description</h4> there are no <ul><li> tags, only <p>, so the original XPath matches nothing and the field is left empty.
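The difference can be reproduced offline with a minimal sketch. Everything in it is an assumption for illustration: the SAMPLE markup is a hypothetical, simplified stand-in for the real page, the helper texts_under is made up, and the standard library's xml.etree.ElementTree (with its limited XPath subset, where `[.='...']` plays the role of `[text()="..."]` and `..` the role of `parent::*`) stands in for Scrapy's response.xpath so the snippet runs without Scrapy installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup described above:
# the "Description" heading is followed by <p> elements, not a <ul><li> list.
SAMPLE = """
<div>
  <section>
    <h4>Description</h4>
    <p>First paragraph of the description.</p>
    <p>Second paragraph.</p>
  </section>
  <section>
    <h4>Solution(s)</h4>
    <ul><li>apply-vendor-patch</li></ul>
  </section>
</div>
"""

root = ET.fromstring(SAMPLE)


def texts_under(heading, child_path):
    """Return the non-empty text of elements at child_path that sit
    next to the <h4> whose text equals heading (hypothetical helper)."""
    nodes = root.findall(f".//h4[.='{heading}']/../{child_path}")
    return [n.text.strip() for n in nodes if n.text and n.text.strip()]


# Original selector shape: looks for ul/li beside the heading, finds nothing.
old_result = texts_under("Description", "ul/li")
# Corrected selector shape: looks for p beside the heading.
new_result = texts_under("Description", "p")

print(old_result)
print(", ".join(new_result) if new_result else "")
```

With this sample, the ul/li path yields an empty list, which is why the spider's `", ".join(temp4) if temp4 else ''` fallback wrote an empty string into the CSV column. Against the live site, the quickest check is probably to run the candidate XPath in `scrapy shell <url>` before editing the spider.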

Votes: 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/59127552