
Q: Limiting the page depth of a Scrapy crawler

Stack Overflow user

Asked on 2020-02-11 13:20:17

Answers: 1, Views: 190, Followers: 0, Votes: 0

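I have a scraper that takes a list of URLs and scans them for additional links, which it then follows to find anything that looks like an email address (using a regex). It returns a list of URLs and email addresses.

I currently have it set up in a Jupyter Notebook so I can easily view the output while testing. The problem is that it takes forever to run, because I am not limiting the depth of the scraper (per URL).

Ideally, the scraper would go at most 2-5 pages deep from each start URL.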
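To make "depth" concrete: the start URL is at depth 0, pages it links to are at depth 1, and so on. The following is a minimal sketch of that bookkeeping on a made-up link graph, using only the standard library. `LINKS` and `pages_within_depth` are hypothetical names for illustration and are not part of Scrapy.

```python
from collections import deque

# Made-up link graph for illustration: page -> pages it links to.
LINKS = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["start"],
}


def pages_within_depth(start, max_depth, links=LINKS):
    """Breadth-first walk recording the depth at which each page is first reached."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] >= max_depth:
            continue  # pages at the limit are visited, but their links are not followed
        for target in links.get(page, []):
            if target not in depth_of:  # skip pages that were already reached
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


print(pages_within_depth("start", 2))
```

With a limit of 2, page "e" is never requested because it is only reachable at depth 3.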

Here is what I have so far:

First, I import my dependencies:

import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep
from Urls import URL_List

I turn off Scrapy's logging and warnings for use in the Jupyter Notebook:

logging.getLogger('scrapy').propagate = False

From there, I pull the URLs from my URL file:

def get_urls():
    urls = URL_List['urls']
    # Return the list so the caller (get_info) actually receives the URLs.
    return urls

Then I set up my spider:

class MailSpider(scrapy.Spider):
    name = 'email'
    def parse(self, response):

I search the page at each URL for links.

        links = LxmlLinkExtractor(allow=()).extract_links(response)

Then I take the list of URLs as input and read their source code one by one.

        links = [str(link.url) for link in links]
        links.append(str(response.url))

I send the links from one parse method to another, setting the callback argument, which defines the method each requested URL is sent to.

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

Then I pass the URLs to the parse_link method, which applies a regex findall to look for emails:

    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
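A pattern like `\w+@\w+\.\w+` cuts off addresses that contain dots or hyphens; for `jane.doe@example.co.uk` it captures only `doe@example.co`. The following is a minimal sketch of a somewhat looser pattern with de-duplication, using only the standard library. `EMAIL_RE` and `extract_emails` are hypothetical names, and the pattern is an assumption for illustration rather than a full email validator.

```python
import re

# Assumed pattern: allows dots, plus signs and hyphens in the local part,
# and one or more dot-separated labels in the domain.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Return the email-like strings in text, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


sample = "Contact jane.doe@example.co.uk or sales@example.com; again: sales@example.com."
print(extract_emails(sample))
```

Like the original, this still matches non-addresses such as `image@2x.png`, which may account for part of a very large result count.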

When we call the google_urls method to run the spider, the path is passed as an argument; it defines where the CSV file will be saved.

Then I save those emails in a CSV file:

def ask_user(question):
    # Return True only when the user answers 'y'.
    response = input(question + ' y/n' + '\n')
    return response == 'y'

def create_file(path):
    # Create (or truncate) the file at path, asking first if it already exists.
    if os.path.exists(path):
        if not ask_user('File already exists, replace?'):
            return
    with open(path, 'wb'):
        pass

For each website, I build a data frame with the columns email and link, and append it to the previously created CSV file.

Then I put it all together:

def get_info(root_file, path):  
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    google_urls = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path)

    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df

get_urls()

Finally, I define a keyword and run the scraper:

keyword = input("Who is the client? ")
df = get_info(f'{keyword}_urls.py', f'{keyword}_emails.csv')

From a list of 100 URLs, I got back 44k results matching email address syntax.

Does anyone know how to limit the depth?

python-requests

jupyter-notebook

python

scrapy

1 answer

Stack Overflow user

Accepted answer

Answered on 2020-02-11 20:14:15

Set DEPTH_LIMIT in your spider, like this:

class MailSpider(scrapy.Spider):
    name = 'email'

    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        pass

Votes: 1
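Since the question launches the crawl through CrawlerProcess, the same DEPTH_LIMIT setting could presumably also go in the settings dict passed there, next to USER_AGENT. The following is a minimal sketch under that assumption. `build_settings` and `run_crawl` are hypothetical helper names, and the default depth of 2 is only an example value.

```python
def build_settings(depth_limit=2, user_agent="Mozilla/5.0"):
    """Return a Scrapy settings dict with a crawl-wide depth cap."""
    if depth_limit < 0:
        raise ValueError("depth_limit must be >= 0 (in Scrapy, 0 means no limit)")
    return {
        "USER_AGENT": user_agent,
        "DEPTH_LIMIT": depth_limit,
    }


def run_crawl(spider_cls, start_urls, path, depth_limit=2):
    """Run spider_cls with the depth cap applied; blocks until the crawl ends."""
    # Imported here so build_settings() can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=build_settings(depth_limit))
    # Keyword arguments are forwarded to the spider, as in the question's code.
    process.crawl(spider_cls, start_urls=start_urls, path=path)
    process.start()


print(build_settings(depth_limit=3))
```

If the spider also defines custom_settings, those should take precedence over the dict passed to CrawlerProcess, so it is probably simplest to set the limit in only one of the two places.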

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60162452
