I'm trying to scrape new posts from Pastebin using their API. It works fine, but I keep getting duplicate posts. I'm now trying to compare the two lists and filter out the ones that haven't changed, but that makes it post in an alternating pattern. How can I fix my list-comparison approach so that I only get the newest pastes, without the alternating duplicates? Here is my current code.
old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()
        for x in old_response:
            response.remove(x)
        response.remove(old_response)
        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()
        old_response = response[:]
    except Exception as e:
        logger.critical(f"ERROR: {e}")
        pass

Also, since the API is private, I'll only show a simple sample response. Say you scrape 2 results; it will return the two latest results, like this:
[
{
"scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=J2CeszTZ",
"full_url": "https://pastebin.com/J2CeszTZ",
"date": "1585606093",
"key": "J2CeszTZ",
"size": "98",
"expire": "0",
"title": "",
"syntax": "text",
"user": "irismar"
},
{
"scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=hYJ7Xcmm",
"full_url": "https://pastebin.com/hYJ7Xcmm",
"date": "1585606099",
"key": "hYJ7Xcmm",
"size": "1371",
"expire": "0",
"title": "",
"syntax": "php",
"user": ""
}
]

That is a simple response. If we refresh the URL (http://scrape.pastebin.com/api_scraping.php?limit=2), it will again return the two latest results:
[
{
"scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=ZcMJxCwc",
"full_url": "https://pastebin.com/ZcMJxCwc",
"date": "1585606208",
"key": "ZcMJxCwc",
"size": "266166",
"expire": "1585606808",
"title": "OpenEuCalendar",
"syntax": "text",
"user": "scholzsebastian"
},
{
"scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=qY5VdbSk",
"full_url": "https://pastebin.com/qY5VdbSk",
"date": "1585606143",
"key": "qY5VdbSk",
"size": "25",
"expire": "0",
"title": "Online jobs",
"syntax": "text",
"user": ""
}
]

When I run this against larger datasets it frequently posts in an alternating pattern. I'm trying to detect only the new posts, without keeping duplicate pastes. Any help would be appreciated.
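As a side note on the code in the question (an observation, not part of the original post): `response.remove(old_response)` passes the whole old list as a single element, and since `response` contains dicts rather than lists, `list.remove` raises `ValueError`. The broad `except` then swallows it, so `old_response = response[:]` is skipped for that iteration. A minimal reproduction:

```python
old_response = [{"key": "A"}]
response = [{"key": "B"}]

try:
    response.remove(old_response)   # a list is never an element of response
except ValueError as e:
    print("ValueError:", e)        # list.remove(x): x not in list
```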
Posted on 2020-03-31 14:01:07
Instead of removing from the current list, I would add to the old list whenever a new item shows up in response. Something like:
old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()
        for record in response:
            if record in old_response:
                # we have seen it already, skip it then
                continue
            # We haven't seen it, so let's add it
            old_response.append(record)
        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()
        # This should not be needed anymore
        # old_response = response[:]
    except Exception as e:
        logger.critical(f"ERROR: {e}")
        pass

Posted on 2020-03-31 14:28:45
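A variation on the same idea (a sketch, not the poster's code): testing `record in old_response` compares whole dicts and scans a list that grows forever; keeping just the `key` fields in a set makes each lookup O(1). The helper name `filter_new` is illustrative:

```python
def filter_new(batch, seen_keys):
    """Return records whose 'key' has not been seen yet; updates seen_keys in place."""
    fresh = []
    for record in batch:
        if record["key"] in seen_keys:
            continue          # already processed on an earlier poll
        seen_keys.add(record["key"])
        fresh.append(record)
    return fresh

seen = set()
batch1 = [{"key": "J2CeszTZ"}, {"key": "hYJ7Xcmm"}]
batch2 = [{"key": "hYJ7Xcmm"}, {"key": "ZcMJxCwc"}]   # one overlap with batch1

print([r["key"] for r in filter_new(batch1, seen)])   # ['J2CeszTZ', 'hYJ7Xcmm']
print([r["key"] for r in filter_new(batch2, seen)])   # ['ZcMJxCwc']
```

Inside the poster's loop, only the records returned by the filter would be printed and logged, so already-seen pastes are never re-posted.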
I would build a dictionary that collects the keys and paste dates. When the server returns an item we already know (same key and date), we skip it.
This works best when the whole thing is turned into a generator:
import time
import json
import requests
import logging

def scraper():
    seen_items = {}
    api_url = "http://scrape.pastebin.com/api_scraping.php"
    while True:
        try:
            response = requests.get(api_url, {'limit': 5})
            for item in response.json():
                last_known_date = seen_items.get(item['key'])
                if item['date'] != last_known_date:
                    seen_items[item['key']] = item['date']
                    yield item
            time.sleep(2.5)
        except json.JSONDecodeError as e:
            logging.error(f"Server response: {response.text}")
            return

Now we can iterate over the items as if they were a list:
for item in scraper():
    print(item)

TODO:
- `except Exception` is too generic.
- A smarter timing mechanism.
- Add persistence by moving `seen_items` out of the function and storing it somewhere.
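For the persistence point mentioned above, one possible sketch (the file path and helper names are illustrative, not part of the answer): dump the key-to-date mapping to a JSON file between runs, so a restart does not re-yield pastes that were already handled.

```python
import json
import os

SEEN_FILE = "seen_items.json"   # illustrative path

def load_seen(path=SEEN_FILE):
    """Load the key -> date mapping from disk, or start empty."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {}

def save_seen(seen_items, path=SEEN_FILE):
    """Persist the key -> date mapping across restarts."""
    with open(path, "w") as fh:
        json.dump(seen_items, fh)

seen_items = load_seen()
seen_items["J2CeszTZ"] = "1585606093"
save_seen(seen_items)
print(load_seen()["J2CeszTZ"])   # 1585606093
```

The generator would then call `load_seen()` at startup and `save_seen()` after each batch instead of keeping `seen_items` purely in memory.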
https://stackoverflow.com/questions/60940234