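How do I remove duplicate links when crawling a website with Python?

Stack Overflow user
Asked on 2019-08-22 11:50:28
3 answers · 727 views · 0 followers · 2 votes

I have the following code, which crawls a given website address, but the problem is that it repeats URLs while crawling. I need a unique and complete list of the URLs that can be reached from the site's home page.

Please help me do this.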

############################################################################################
import scrapy
urlset = set()
class MySpider(scrapy.Spider):

    name = "MySpider"

    def __init__(self, allowed_domains=None, start_urls=None):
        super().__init__()

        if allowed_domains is None:
            self.allowed_domains = []
        else:
            self.allowed_domains = allowed_domains
        if start_urls is None:
            self.start_urls = []
        else:
            self.start_urls = start_urls  

    def parse(self, response):
        print('[parse] url:', response.url)
        # extract all links from page
        all_links = response.xpath('*//a/@href').extract()
        all_links = set(all_links)
        all_links = list(all_links)
        # iterate over links
        for link in all_links:
            if("https:" in link or "http:" in link):
                    if(link not in urlset):
                        print('[+] link:', link)

                        full_link = response.urljoin(link)
                        urlset.add(full_link)
                        print("----------Full Link: "+full_link)
                        request = response.follow(full_link, callback=self.parse)
                        yield request
                        yield {'url': response.url}                        



    # def print_this_link(self, response):
    #     print('[print_this_link] url:', response.url)
    #     title = response.xpath('//title/text()').get() # get() will replace extract() in the future
    #     # text = response.xpath('//body/text()').get()
    #     yield {'url': response.url, 'title': title}


# --- run without creating project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file as CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'file://C:/Tmp1/output.csv', # 
})
c.crawl(MySpider)
c.crawl(MySpider, allowed_domains=["copperpodip.com"], start_urls=["https://www.copperpodip.com"])
c.start()

Just run this code as is.

Output of running the code:

C:\Users\Carthaginian\Desktop\projectLink\crawler\crawler\spiders>python stacklink.py
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.7.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17134-SP0
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: 2feebff3115b2d5b
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: b27fd364782f9b57
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-22 14:40:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.025426,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 695429),
 'log_count/INFO': 19,
 'start_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 670003)}
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider closed (finished)
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com> (referer: None)
[parse] url: https://www.copperpodip.com
[+] link: https://www.copperpodip.com/due-diligence
----------Full Link: https://www.copperpodip.com/due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
----------Full Link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/leadership
----------Full Link: https://www.copperpodip.com/leadership
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
----------Full Link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
----------Full Link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/prior-art-search
----------Full Link: https://www.copperpodip.com/prior-art-search
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
----------Full Link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
----------Full Link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/patent-monetization
----------Full Link: https://www.copperpodip.com/patent-monetization
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
----------Full Link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/privacy-policy
----------Full Link: https://www.copperpodip.com/privacy-policy
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/ip-news
----------Full Link: https://www.copperpodip.com/ip-news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com
----------Full Link: https://www.copperpodip.com
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/contact-us
----------Full Link: https://www.copperpodip.com/contact-us
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/blog
----------Full Link: https://www.copperpodip.com/blog
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.linkedin.com/company/copperpod-ip
----------Full Link: https://www.linkedin.com/company/copperpod-ip
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.linkedin.com': <GET https://www.linkedin.com/company/copperpod-ip>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/source-code-review
----------Full Link: https://www.copperpodip.com/source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/request-for-samples
----------Full Link: https://www.copperpodip.com/request-for-samples
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
----------Full Link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-due-diligence
----------Full Link: https://www.copperpodip.com/case-study-due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
----------Full Link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siliconindiamagazine.com': <GET https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/reverse-engineering
----------Full Link: https://www.copperpodip.com/reverse-engineering
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/careers
----------Full Link: https://www.copperpodip.com/careers
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/infringement-claim-charts
----------Full Link: https://www.copperpodip.com/infringement-claim-charts
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-infringement-analysis
----------Full Link: https://www.copperpodip.com/case-study-infringement-analysis
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
----------Full Link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-source-code-review
----------Full Link: https://www.copperpodip.com/case-study-source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
----------Full Link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/due-diligence> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/due-diligence
[+] link: https://www.facebook.com/copperpodip/
----------Full Link: https://www.facebook.com/copperpodip/
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/copperpodip/>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/due-diligence>
{'url': 'https://www.copperpodip.com/due-diligence'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
[+] link: https://www.copperpodip.com/blog/date/2017-03
----------Full Link: https://www.copperpodip.com/blog/date/2017-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/emergingtech
----------Full Link: https://www.copperpodip.com/blog/tag/emergingtech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-09
----------Full Link: https://www.copperpodip.com/blog/date/2018-09
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-02
----------Full Link: https://www.copperpodip.com/blog/date/2018-02
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/itc
----------Full Link: https://www.copperpodip.com/blog/tag/itc
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/intel
----------Full Link: https://www.copperpodip.com/blog/tag/intel
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/iot
----------Full Link: https://www.copperpodip.com/blog/tag/iot
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/nokia
----------Full Link: https://www.copperpodip.com/blog/tag/nokia
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fintech
----------Full Link: https://www.copperpodip.com/blog/tag/fintech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/patents
----------Full Link: https://www.copperpodip.com/blog/tag/patents
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/uber
----------Full Link: https://www.copperpodip.com/blog/tag/uber
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/home%20automation
----------Full Link: https://www.copperpodip.com/blog/tag/home%20automation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/duediligence
----------Full Link: https://www.copperpodip.com/blog/tag/duediligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fake%20news
----------Full Link: https://www.copperpodip.com/blog/tag/fake%20news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/paypal
----------Full Link: https://www.copperpodip.com/blog/tag/paypal
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/virtualreality
----------Full Link: https://www.copperpodip.com/blog/tag/virtualreality
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
----------Full Link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/trademarks
----------Full Link: https://www.copperpodip.com/blog/tag/trademarks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/qualcomm
----------Full Link: https://www.copperpodip.com/blog/tag/qualcomm
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/Apple
----------Full Link: https://www.copperpodip.com/blog/tag/Apple
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/5g
----------Full Link: https://www.copperpodip.com/blog/tag/5g
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/code%20review
----------Full Link: https://www.copperpodip.com/blog/tag/code%20review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/licensing
----------Full Link: https://www.copperpodip.com/blog/tag/licensing
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/internet%20of%20things
----------Full Link: https://www.copperpodip.com/blog/tag/internet%20of%20things
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-03
----------Full Link: https://www.copperpodip.com/blog/date/2018-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/technology
----------Full Link: https://www.copperpodip.com/blog/tag/technology
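
3 Answers

Stack Overflow user

Accepted answer

Answered on 2019-08-22 12:24:14

Scrapy should automatically avoid revisiting URLs it has already visited (using the dupefilter class). I'm not entirely sure what you are trying to do here, but I assume you want to crawl the website and find all of its links? In that case, you should move the second yield (yield {'url': response.url}) to the top of the parse function, so that it runs once per page instead of once per link.

I think the following is what you want:
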
import scrapy

class MySpider(scrapy.Spider):

    name = "copperpodip"
    start_urls = ["https://copperpodip.com"]
    allowed_domains = ["copperpodip.com"]

    def parse(self, response):
        yield {'url': response.url}
        for link in response.xpath('*//a/@href').getall():
            yield response.follow(link, self.parse)

If I run it as:

scrapy runspider scrapy_test.py -o test.json

then the resulting JSON file contains no duplicate links.
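
To see why the original spider's output repeated the same URL, it may help to set Scrapy aside and look only at the generator logic. The following is a minimal plain-Python sketch, not Scrapy code: `parse_item_in_loop` and `parse_item_first` are hypothetical stand-ins for the two versions of `parse`, and the link list is made up for illustration.

```python
def parse_item_in_loop(page_url, links):
    """Mimics the question's parse(): the item is yielded once per link."""
    for link in links:
        yield ("request", link)
        yield ("item", page_url)


def parse_item_first(page_url, links):
    """Mimics the answer's parse(): the item is yielded once per page."""
    yield ("item", page_url)
    for link in links:
        yield ("request", link)


def count_items(results):
    """Count how many ('item', ...) tuples a parse function produced."""
    return sum(1 for kind, _ in results if kind == "item")


# Hypothetical page with three outgoing links.
page = "https://www.copperpodip.com"
links = ["/blog", "/careers", "/contact-us"]

items_in_loop = count_items(parse_item_in_loop(page, links))
items_first = count_items(parse_item_first(page, links))
print(items_in_loop, items_first)
```

With the item inside the loop, a page with three links produces three identical items, which matches the repeated `{'url': ...}` lines in the question's log. Yielding the item before the loop produces one item per crawled page.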

Votes: 2

Stack Overflow user

Answered on 2019-08-22 12:02:23

I don't know Scrapy at all, but couldn't you use a list (or a set, which is easier) to check whether you already have a record of the same link?

link_list = []  # an empty list that records the links seen so far
if link not in link_list:
    link_list.append(link)

Edit: it looks like you already used a set and then converted it back to a list:

all_links = set(all_links)
all_links = list(all_links)
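
Note that a round trip through `set` removes duplicates within a single page but does not keep the links in their original order. If order matters, one common alternative is `dict.fromkeys`, which relies on dictionaries preserving insertion order (guaranteed from Python 3.7). This is only a sketch: `unique_in_order` is a hypothetical helper name and the sample links are invented.

```python
def unique_in_order(links):
    """Drop duplicates while keeping the first occurrence of each link."""
    return list(dict.fromkeys(links))


# Hypothetical hrefs as they might appear on one page.
all_links = [
    "https://www.copperpodip.com/blog",
    "https://www.copperpodip.com/careers",
    "https://www.copperpodip.com/blog",
    "https://www.copperpodip.com/contact-us",
    "https://www.copperpodip.com/careers",
]

deduped = unique_in_order(all_links)
print(deduped)
```

Like the `set` version, this only deduplicates the links of one page; it does not remember links seen on earlier pages.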
Votes: 2

Stack Overflow user

Answered on 2019-08-22 12:03:26

This will work, because Scrapy handles duplicate URLs for you:

def parse(self, response):
    yield {'url': response.url}     
    print('[parse] url:', response.url)
    # extract all links from page
    all_links = response.xpath('*//a/@href').extract()
    # iterate over links
    for link in all_links:
        if("https:" in link or "http:" in link):
            print('[+] link:', link)
            full_link = response.urljoin(link)
            print("----------Full Link: "+full_link)
            request = response.follow(full_link, callback=self.parse)
            yield request
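
One thing to keep in mind is that the `"https:" in link` check skips relative hrefs such as `/blog`, so a "complete" list may miss pages that are only linked relatively. The sketch below shows, with the standard library only, one way to resolve, filter, and deduplicate hrefs. It is an illustration under assumptions, not how Scrapy implements its duplicate filter: `collect_new_links` is a hypothetical helper, and the sample hrefs are invented.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def collect_new_links(page_url, hrefs, allowed_domain, seen):
    """Return absolute, on-site URLs from hrefs that are not yet in seen."""
    new_links = []
    for href in hrefs:
        # Resolve relative hrefs against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, ...
        host = parts.hostname or ""
        if host != allowed_domain and not host.endswith("." + allowed_domain):
            continue  # off-site link
        if absolute in seen:
            continue  # already recorded on this or an earlier page
        seen.add(absolute)
        new_links.append(absolute)
    return new_links


seen = set()
# Hypothetical hrefs: relative, absolute, fragment, mailto, and off-site.
hrefs = [
    "/blog",
    "https://www.copperpodip.com/blog",
    "/blog#top",
    "mailto:info@example.com",
    "https://www.linkedin.com/company/copperpod-ip",
]
result = collect_new_links(
    "https://www.copperpodip.com", hrefs, "copperpodip.com", seen
)
print(result)
```

Inside a spider, `response.follow` already accepts relative URLs, and the log in the question shows the offsite middleware filtering requests outside `allowed_domains`, so a helper like this is mainly useful for understanding the steps or for post-processing a list of hrefs.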
Votes: 1

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/57608722
