I'm looking at a web page that contains many zip files.
Each zip file has a URL like https://www.ercot.com/misdownload/servlets/mirDownload?mimic_duns=000000000&doclookupId=814778337
I want to extract only the URLs of the _csv.zip files, decompress those into CSV files, and discard the _xml.zip URLs. Both xml.zip and csv.zip hold the same data, but I'd rather work with csv.zip.
I'm not sure how to approach this or where to start.
Edit:
If you're getting "Access Denied", note that the page may only be reachable from US IP addresses.
Clicking one of the URLs downloads a zip file to your PC. I basically want to:
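A minimal sketch of the filtering step the question asks for, assuming each anchor's text carries the file name (so a link can be recognized by a "_csv.zip" suffix); the link texts and hrefs below are made up for illustration:

```python
def pick_csv_zip_urls(links):
    """Keep hrefs whose anchor text names a _csv.zip file; drop _xml.zip ones."""
    return [href for text, href in links if text.endswith("_csv.zip")]

# Hypothetical (text, href) pairs as BeautifulSoup might yield them:
links = [
    ("DAMSPNP4190_csv.zip", "/misdownload/servlets/mirDownload?doclookupId=814778337"),
    ("DAMSPNP4190_xml.zip", "/misdownload/servlets/mirDownload?doclookupId=814778338"),
]
print(pick_csv_zip_urls(links))
# → ['/misdownload/servlets/mirDownload?doclookupId=814778337']
```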
Posted on 2022-01-08 11:51:01
All the zip files and the merged CSV file (21 MB) are available here, so there's no need to scrape.
But in case you'd still like to do it yourself, here's my take.
import os.path
from shutil import copyfileobj
import pandas as pd
import requests
from bs4 import BeautifulSoup
base_url = "https://www.ercot.com"
entry_url = f"{base_url}/misapp/GetReports.do?reportTypeId=12331&reportTitle=DAM%20Settlement%20Point%20Prices&showHTMLView=&mimicKey"
download_dir = "ercot"
def scrape_zips():
    with requests.Session() as connection:
        print("Finding all zip files...")
        # The report page lists csv and xml links alternately, so every
        # other anchor ([::2]) is a csv link.
        zip_urls = [
            f"{base_url}{source_url['href']}" for source_url in
            BeautifulSoup(
                connection.get(entry_url).text,
                "lxml"
            ).find_all("a")[::2]
        ]
        os.makedirs(download_dir, exist_ok=True)
        total_urls = len(zip_urls)
        for idx, url in enumerate(zip_urls, start=1):
            # The doclookupId after the last "=" serves as the file name.
            file_name = url.split("=")[-1]
            zip_object = connection.get(url, stream=True)
            print(f"Fetching file {file_name} -> {idx} out of {total_urls}")
            with open(os.path.join(download_dir, f"{file_name}.zip"), "wb") as output:
                copyfileobj(zip_object.raw, output)
            zip_object.close()


def list_files(dir_name: str):
    # Yield the names of the files (not directories) directly inside dir_name.
    yield from next(os.walk(dir_name), (None, None, []))[2]


def merge_zips_to_df():
    print("Merging csv files...")
    # pandas reads a zip archive that contains a single csv directly;
    # compression is inferred from the ".zip" extension.
    df = pd.concat(
        pd.read_csv(os.path.join(download_dir, csv_file)) for csv_file
        in list_files(download_dir)
    )
    print(df.head(20))
    df.to_csv(os.path.join(download_dir, "merged_csv_files.csv"), index=False)


if __name__ == "__main__":
    scrape_zips()
    merge_zips_to_df()

This gives you the following output:
Finding all zip files...
Fetching file 816055622 -> 1 out of 31
Fetching file 815870449 -> 2 out of 31
Fetching file 815686938 -> 3 out of 31
Fetching file 815503551 -> 4 out of 31
Fetching file 815315296 -> 5 out of 31
Fetching file 815127892 -> 6 out of 31
Fetching file 814952388 -> 7 out of 31
Fetching file 814778337 -> 8 out of 31
Fetching file 814599101 -> 9 out of 31
Fetching file 814416972 -> 10 out of 31
Fetching file 814224618 -> 11 out of 31
Fetching file 814040277 -> 12 out of 31
Fetching file 813865857 -> 13 out of 31
Fetching file 813688802 -> 14 out of 31
Fetching file 813516414 -> 15 out of 31
Fetching file 813341752 -> 16 out of 31
Fetching file 813159478 -> 17 out of 31
Fetching file 812976112 -> 18 out of 31
Fetching file 812784659 -> 19 out of 31
Fetching file 812599985 -> 20 out of 31
Fetching file 812424952 -> 21 out of 31
Fetching file 812241625 -> 22 out of 31
Fetching file 812053445 -> 23 out of 31
Fetching file 811874015 -> 24 out of 31
Fetching file 811685701 -> 25 out of 31
Fetching file 811501577 -> 26 out of 31
Fetching file 811319918 -> 27 out of 31
Fetching file 811147926 -> 28 out of 31
Fetching file 810973966 -> 29 out of 31
Fetching file 810793357 -> 30 out of 31
Fetching file 810615891 -> 31 out of 31
Merging csv files...
DeliveryDate HourEnding SettlementPoint SettlementPointPrice DSTFlag
0 12/22/2021 01:00 AEEC 25.07 N
1 12/22/2021 01:00 AJAXWIND_RN 25.07 N
2 12/22/2021 01:00 ALGOD_ALL_RN 25.01 N
3 12/22/2021 01:00 ALVIN_RN 24.11 N
4 12/22/2021 01:00 AMADEUS_ALL 25.07 N
5 12/22/2021 01:00 AMISTAD_ALL 25.06 N
6 12/22/2021 01:00 AMOCOOIL_CC1 25.98 N
7 12/22/2021 01:00 AMOCOOIL_CC2 25.98 N
8 12/22/2021 01:00 AMOCO_PUN1 25.98 N
9 12/22/2021 01:00 AMOCO_PUN2 25.98 N
10 12/22/2021 01:00 AMO_AMOCO_1 25.98 N
11 12/22/2021 01:00 AMO_AMOCO_2 25.98 N
12 12/22/2021 01:00 AMO_AMOCO_5 25.98 N
13 12/22/2021 01:00 AMO_AMOCO_G1 25.98 N
14 12/22/2021 01:00 AMO_AMOCO_G2 25.98 N
15 12/22/2021 01:00 AMO_AMOCO_G3 25.98 N
16 12/22/2021 01:00 AMO_AMOCO_S1 25.98 N
17 12/22/2021 01:00 AMO_AMOCO_S2 25.98 N
18 12/22/2021 01:00 ANACACHO_ANA 25.05 N
19 12/22/2021 01:00 ANCHOR_ALL 25.08 N
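Note that the code above never writes plain .csv files to disk; pd.read_csv reads each single-member zip directly. If you do want the CSVs unpacked, as the question asked, here is a small sketch using the standard-library zipfile module (the directory names are assumptions):

```python
import os
import zipfile


def extract_all_zips(src_dir: str, dest_dir: str) -> None:
    """Extract every .zip found in src_dir into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if name.endswith(".zip"):
            with zipfile.ZipFile(os.path.join(src_dir, name)) as zf:
                zf.extractall(dest_dir)


# Usage with hypothetical directories:
# extract_all_zips("ercot", "ercot_csv")
```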