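I am scraping a website. I have the link to the page I scrape, and the page is updated roughly every 6-8 hours. If the data has not changed, the output should stay the same. Basically, I don't want to have to keep clicking Run to see whether the data has changed.
Besides that, I also want the scrape to write its results to a CSV file. Below is the code I use to scrape the site.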
import csv
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.ndbc.noaa.gov/station_page.php?station=56003"
# Send a browser-like user agent with the request.
request_headers = {
    "user-agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63")
}
response = requests.get(url, headers=request_headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

headers = ["Year", "Month", "Day", "Hour", "Minute", "Second", "T", "Height"]
with open("station-56003.csv", "w") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(headers)
    # Keep only the lines of the #data element that are 30 characters of
    # digits, dots and spaces and that split into one value per column.
    for line in soup.select_one("#data").text.split("\n"):
        if re.fullmatch(r"[\d. ]{30}", line) and len(line.split()) == len(headers):
            writer.writerow(line.split())

Posted on 2021-03-04 09:10:14
Consider checking the checksum of the CSV file. If the checksum has changed, there is new data.
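The checksum idea could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `file_checksum` and `has_changed`, the choice of SHA-256, and the small side file that remembers the previous checksum are all hypothetical and not part of the original answer.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(csv_path, checksum_path):
    """Compare the CSV's checksum with the stored one, then store the new one.

    Returns True when no checksum was stored yet or the stored one differs.
    `checksum_path` is a hypothetical side file holding the previous digest.
    """
    new_sum = file_checksum(csv_path)
    store = Path(checksum_path)
    old_sum = store.read_text().strip() if store.exists() else None
    store.write_text(new_sum)
    return new_sum != old_sum
```

One way to use it would be to call `has_changed("station-56003.csv", "station-56003.sha256")` right after each scrape and act only when it returns True. Running the script every few hours can be left to the operating system's scheduler (cron or Windows Task Scheduler), so nobody has to click Run by hand.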
https://stackoverflow.com/questions/66471708