
Webscraping JSON
Stack Overflow user
Asked on 2020-03-31 06:00:14
2 answers · 83 views · 0 followers · 0 votes

I am trying to scrape new posts from Pastebin using their scraping API. It works fine, but I keep getting duplicate posts. I am currently trying to compare two lists and work out which entries have not changed, but that makes it post alternately. How can I fix my list-comparison approach so that I only get the newest pastes, without the alternating duplicates? Here is my current code.

old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()

        for x in old_response:
            response.remove(x)
        response.remove(old_response)


        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()
        old_response = response[:]
    except Exception as e:
        logger.critical(f"ERROR: {e}")
        pass

Also, since the API is private, I will only show a simple response. Say you scrape 2 results. It will return the two latest results, like this:

[
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=J2CeszTZ",
        "full_url": "https://pastebin.com/J2CeszTZ",
        "date": "1585606093",
        "key": "J2CeszTZ",
        "size": "98",
        "expire": "0",
        "title": "",
        "syntax": "text",
        "user": "irismar"
    },
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=hYJ7Xcmm",
        "full_url": "https://pastebin.com/hYJ7Xcmm",
        "date": "1585606099",
        "key": "hYJ7Xcmm",
        "size": "1371",
        "expire": "0",
        "title": "",
        "syntax": "php",
        "user": ""
    }
]

That is a simple response; if we refresh the URL (http://scrape.pastebin.com/api_scraping.php?limit=2), it will again return the two latest results:

[
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=ZcMJxCwc",
        "full_url": "https://pastebin.com/ZcMJxCwc",
        "date": "1585606208",
        "key": "ZcMJxCwc",
        "size": "266166",
        "expire": "1585606808",
        "title": "OpenEuCalendar",
        "syntax": "text",
        "user": "scholzsebastian"
    },
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=qY5VdbSk",
        "full_url": "https://pastebin.com/qY5VdbSk",
        "date": "1585606143",
        "key": "qY5VdbSk",
        "size": "25",
        "expire": "0",
        "title": "Online jobs",
        "syntax": "text",
        "user": ""
    }
]

When I work with a large dataset, it frequently posts alternately. I am trying to detect only new posts, without saving duplicate pastes. Any help would be appreciated.
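A minimal sketch (simplified payloads, made-up keys) of what goes wrong with the list.remove approach above: as soon as an old paste drops out of the feed, remove raises ValueError, the broad except swallows it, and old_response is never refreshed, so the next pass starts from stale state.

```python
# Two consecutive (simplified) API responses: paste "A" has scrolled off the feed.
old_response = [{"key": "A"}, {"key": "B"}]
response = [{"key": "B"}, {"key": "C"}]

try:
    for x in old_response:
        response.remove(x)  # "A" is no longer in the feed -> ValueError
except ValueError as e:
    # The loop dies on the first missing item; "B" was never removed and
    # the new paste "C" is never reached, so old_response stays stale.
    print(e)
```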

2 Answers

Stack Overflow user

Answered on 2020-03-31 14:01:07

When new items show up in response, I would add them to the old list rather than remove them from the current one. Something like:

old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()

        for record in response:
            if record in old_response:
               # we have seen it already, skip it then
               continue

            # We haven't seen it, so let's add it
            old_response.append(record)

        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()

        # This should not be needed anymore
        # old_response = response[:]

    except Exception as e:
        logger.critical(f"ERROR: {e}")
        pass
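As a side note on this answer's approach: record in old_response is a linear scan over an ever-growing list of dicts. A variant keyed on each paste's "key" field (a set lookup instead; sketch only, with sample keys taken from the question's payload) scales better:

```python
def new_records(response, seen_keys):
    """Yield only records whose 'key' has not been seen before."""
    for record in response:
        if record["key"] in seen_keys:
            continue  # already processed this paste
        seen_keys.add(record["key"])
        yield record

seen_keys = set()
# Example using the payload shape from the question:
batch = [{"key": "J2CeszTZ"}, {"key": "hYJ7Xcmm"}]
print([r["key"] for r in new_records(batch, seen_keys)])  # both are new
print([r["key"] for r in new_records(batch, seen_keys)])  # nothing new on refresh
```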
Votes: 1

Stack Overflow user

Answered on 2020-03-31 14:28:45

I would build a dictionary that collects the keys and paste dates. When the server returns an item we already know (same key and date), we skip it.

This works best when the whole thing is set up as a generator:

import time
import json
import requests
import logging

def scraper():
    seen_items = {}
    api_url = "http://scrape.pastebin.com/api_scraping.php"

    while True:
        try:
            response = requests.get(api_url, {'limit': 5})
            for item in response.json():
                last_known_date = seen_items.get(item['key'])
                if item['date'] != last_known_date:
                    seen_items[item['key']] = item['date']
                    yield item
            time.sleep(2.5)
        except json.JSONDecodeError as e:
            logging.error(f"Server response: {response.text}")
            return

Now we can iterate over the items as if they were a list:

for item in scraper():
    print(item)

To do

  • Add the other error handlers separately. Avoid except Exception; that is too generic.
  • Add a smarter timing mechanism than the fixed time.sleep(2.5).
  • Add persistence by moving seen_items out of the function and storing it somewhere.
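The persistence item could be sketched like this: dump seen_items to a JSON file between runs so a restart does not re-yield pastes it already processed. The file name is made up for illustration.

```python
import json
import os

STATE_FILE = "seen_items.json"  # hypothetical path

def load_seen_items(path=STATE_FILE):
    """Restore the key -> date mapping, or start fresh if no state exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_seen_items(seen_items, path=STATE_FILE):
    """Persist the mapping so restarts skip already-seen pastes."""
    with open(path, "w") as f:
        json.dump(seen_items, f)

seen = load_seen_items()
seen["J2CeszTZ"] = "1585606093"
save_seen_items(seen)
print(load_seen_items()["J2CeszTZ"])  # mapping survives a restart
```

The scraper would then call load_seen_items() once before the loop and save_seen_items() after each batch.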

Votes: 1
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/60940234