
Get the URLs of all zip files from a webpage
Stack Overflow user
Asked on 2022-01-04 16:26:23
1 answer · 91 views · 0 followers · score -1

I am looking at a webpage that contains many zip files.

Each zip file has a URL of the form https://www.ercot.com/misdownload/servlets/mirDownload?mimic_duns=000000000&doclookupId=814778337

I want to extract only the URLs of the _csv.zip files, unzip them into CSV files, and discard the URLs of the _xml.zip files. The _xml.zip and _csv.zip files contain the same data, but I prefer to work with the _csv.zip files.

I don't know how to approach this problem or where to start.

Edit:

If you are getting "Access Denied", note that the webpage may only be accessible from US IP addresses.

Clicking one of the URLs downloads a zip file to the PC. Essentially, I want to:

  1. Download the zip files to the PC.
  2. Load the contents of the CSV file inside each zip into a pandas DataFrame.
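
The two steps above can be sketched minimally with `requests`, `zipfile`, and pandas. This is a hedged sketch, not the accepted answer's code: it assumes each archive contains a single CSV, and the example URL in the commented usage is the one quoted in the question (it may require a US IP address).

```python
import io
import zipfile

import pandas as pd
import requests


def zip_url_to_df(url: str) -> pd.DataFrame:
    """Download a zip that holds one CSV and return it as a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Open the archive in memory and read its (assumed single) CSV member.
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        with archive.open(archive.namelist()[0]) as csv_file:
            return pd.read_csv(csv_file)


# Example URL from the question (may require a US IP address):
# df = zip_url_to_df(
#     "https://www.ercot.com/misdownload/servlets/mirDownload"
#     "?mimic_duns=000000000&doclookupId=814778337"
# )
```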

1 Answer

Stack Overflow user

Accepted answer

Answered on 2022-01-08 11:51:01

All of the zip files, along with a merged CSV file (21 MB), are available here, so there is no need to scrape.

But if you'd rather scrape, here is my take:

import os
from shutil import copyfileobj

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "https://www.ercot.com"
entry_url = f"{base_url}/misapp/GetReports.do?reportTypeId=12331&reportTitle=DAM%20Settlement%20Point%20Prices&showHTMLView=&mimicKey"
download_dir = "ercot"


def scrape_zips():
    with requests.Session() as connection:
        print("Finding all zip files...")
        # The csv and xml download links alternate on the report page,
        # so taking every second anchor keeps only the csv archives.
        zip_urls = [
            f"{base_url}{source_url['href']}" for source_url in
            BeautifulSoup(
                connection.get(entry_url).text,
                "lxml"
            ).find_all("a")[::2]
        ]

        os.makedirs(download_dir, exist_ok=True)
        total_urls = len(zip_urls)
        for idx, url in enumerate(zip_urls, start=1):
            # The doclookupId after the last "=" doubles as a file name.
            file_name = url.split("=")[-1]
            zip_object = connection.get(url, stream=True)
            print(f"Fetching file {file_name} -> {idx} out of {total_urls}")
            with open(os.path.join(download_dir, f"{file_name}.zip"), "wb") as output:
                copyfileobj(zip_object.raw, output)
            zip_object.close()


def list_files(dir_name: str):
    # Yield the names of the files directly inside dir_name.
    yield from (
        next(os.walk(dir_name), (None, None, []))[2]
    )


def merge_zips_to_df():
    print("Merging csv files...")
    # pandas infers the zip compression, so the archives are read directly.
    df = pd.concat(
        pd.read_csv(os.path.join(download_dir, csv_file)) for csv_file
        in list_files(download_dir)
    )
    print(df.head(20))
    df.to_csv(os.path.join(download_dir, "merged_csv_files.csv"), index=False)


if __name__ == "__main__":
    scrape_zips()
    merge_zips_to_df()
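
The `[::2]` slice above relies on the CSV and XML links alternating on the page. A sketch that instead filters anchors by the text around them is more explicit and keeps working if the ordering changes; it assumes each download link sits in a table row whose text includes the report file name (ending in `_csv.zip` or `_xml.zip`), which may need adjusting to the page's actual structure:

```python
from bs4 import BeautifulSoup


def csv_zip_links(html: str, base_url: str) -> list[str]:
    """Return absolute URLs of links whose table row mentions a _csv.zip file.

    Assumes each download link sits in a <tr> whose text contains the
    report file name; falls back to the anchor's own text otherwise.
    """
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for anchor in soup.find_all("a", href=True):
        row = anchor.find_parent("tr")
        context = row.get_text() if row else anchor.get_text()
        if "_csv.zip" in context:
            urls.append(f"{base_url}{anchor['href']}")
    return urls
```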

This gives you the following output:

Finding all zip files...
Fetching file 816055622 -> 1 out of 31
Fetching file 815870449 -> 2 out of 31
Fetching file 815686938 -> 3 out of 31
Fetching file 815503551 -> 4 out of 31
Fetching file 815315296 -> 5 out of 31
Fetching file 815127892 -> 6 out of 31
Fetching file 814952388 -> 7 out of 31
Fetching file 814778337 -> 8 out of 31
Fetching file 814599101 -> 9 out of 31
Fetching file 814416972 -> 10 out of 31
Fetching file 814224618 -> 11 out of 31
Fetching file 814040277 -> 12 out of 31
Fetching file 813865857 -> 13 out of 31
Fetching file 813688802 -> 14 out of 31
Fetching file 813516414 -> 15 out of 31
Fetching file 813341752 -> 16 out of 31
Fetching file 813159478 -> 17 out of 31
Fetching file 812976112 -> 18 out of 31
Fetching file 812784659 -> 19 out of 31
Fetching file 812599985 -> 20 out of 31
Fetching file 812424952 -> 21 out of 31
Fetching file 812241625 -> 22 out of 31
Fetching file 812053445 -> 23 out of 31
Fetching file 811874015 -> 24 out of 31
Fetching file 811685701 -> 25 out of 31
Fetching file 811501577 -> 26 out of 31
Fetching file 811319918 -> 27 out of 31
Fetching file 811147926 -> 28 out of 31
Fetching file 810973966 -> 29 out of 31
Fetching file 810793357 -> 30 out of 31
Fetching file 810615891 -> 31 out of 31
Merging csv files...
   DeliveryDate HourEnding SettlementPoint  SettlementPointPrice DSTFlag
0    12/22/2021      01:00            AEEC                 25.07       N
1    12/22/2021      01:00     AJAXWIND_RN                 25.07       N
2    12/22/2021      01:00    ALGOD_ALL_RN                 25.01       N
3    12/22/2021      01:00        ALVIN_RN                 24.11       N
4    12/22/2021      01:00     AMADEUS_ALL                 25.07       N
5    12/22/2021      01:00     AMISTAD_ALL                 25.06       N
6    12/22/2021      01:00    AMOCOOIL_CC1                 25.98       N
7    12/22/2021      01:00    AMOCOOIL_CC2                 25.98       N
8    12/22/2021      01:00      AMOCO_PUN1                 25.98       N
9    12/22/2021      01:00      AMOCO_PUN2                 25.98       N
10   12/22/2021      01:00     AMO_AMOCO_1                 25.98       N
11   12/22/2021      01:00     AMO_AMOCO_2                 25.98       N
12   12/22/2021      01:00     AMO_AMOCO_5                 25.98       N
13   12/22/2021      01:00    AMO_AMOCO_G1                 25.98       N
14   12/22/2021      01:00    AMO_AMOCO_G2                 25.98       N
15   12/22/2021      01:00    AMO_AMOCO_G3                 25.98       N
16   12/22/2021      01:00    AMO_AMOCO_S1                 25.98       N
17   12/22/2021      01:00    AMO_AMOCO_S2                 25.98       N
18   12/22/2021      01:00    ANACACHO_ANA                 25.05       N
19   12/22/2021      01:00      ANCHOR_ALL                 25.08       N
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/70581947
