Q: Creating multiple Spark DataFrames in a for loop with different df names

Stack Overflow user

Asked 2022-08-15 15:15:30

Answers: 1 | Views: 108 | Followers: 0 | Votes: 0

I have a function that creates a DataFrame from JSON stored in AWS S3, and I need to iterate over a list of possible folders in S3 (simplified code below).

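def create_glue_df(table):
    # glueContext is assumed to be an existing GlueContext created earlier in the job.
    # Read the table's JSON files from S3 into a Glue DynamicFrame.
    df = glueContext.create_dynamic_frame.from_options(
        format_options={"jsonPath": "$._airbyte_data", "multiline": True},
        connection_type="s3",
        format="json",
        connection_options={
            "paths": [f"s3://bucketx/raw-data/test/{table}/"],
            "recurse": True,
        },
        transformation_ctx=f"{table}",
    )

    # Convert the DynamicFrame to a Spark DataFrame.
    df = df.toDF()

    return df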

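Then I have a list of possible tables:

table_list = ['orders', 'products', 'price', 'inventory']  # etc.

I am trying to write a for loop that iterates over the table list and passes each name to the function, so that a DataFrame is created for each possible table. Not every client has the same number of tables, but the table names are the same: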

for table in table_list:
    try:
        # df is rebound on every iteration, so only the last table's DataFrame survives.
        df = create_glue_df(table)
        print(f'Table {table} created.')
    except Exception:
        print(f'Table {table} not found.')

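I have searched Stack Overflow for how to use the loop variable as part of a variable name. Some people suggest globals() (though it is not recommended), and others suggest a dict. But when I tried the dict approach, I could not get the Spark DataFrame back.

After the loop, I would like to access each df by its corresponding suffix:

'orders' would create df_orders, 'products' would create df_products.

Is there a way to do this (without using pandas)?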

python

apache-spark

pyspark

1 Answer

Stack Overflow user

Answered 2022-08-15 16:28:19

You want many DataFrames.

Store them in a dict.

dfs = {}

def create_glue_df(dfs: dict, table: str) -> None:
    ...
    dfs[table] = df.toDF()
    return  # (clearly you don't need this line -- it merely emphasizes that the helper is only invoked for side effects)
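
For completeness, here is a minimal sketch of how the loop from the question could fill such a dict. The names `load_tables` and `fake_loader` are hypothetical and not part of the original code; `fake_loader` only stands in for `create_glue_df` so the example runs without Glue. Whether a missing S3 folder actually raises at load time or simply yields an empty frame is an assumption carried over from the question's try/except.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def load_tables(table_names: Iterable[str], loader: Callable[[str], T]) -> Dict[str, T]:
    """Call loader(name) for each table and collect the results keyed by table name."""
    frames: Dict[str, T] = {}
    for name in table_names:
        try:
            frames[name] = loader(name)
            print(f"Table {name} created.")
        except Exception as exc:  # assumption: a missing table surfaces as an exception
            print(f"Table {name} not found: {exc}")
    return frames


def fake_loader(table: str) -> str:
    """Hypothetical stand-in for create_glue_df; returns a string instead of a DataFrame."""
    if table == "price":
        raise FileNotFoundError(f"no folder for {table}")
    return f"<DataFrame for {table}>"


dfs = load_tables(["orders", "products", "price"], fake_loader)
print(sorted(dfs))  # → ['orders', 'products']
```

In the real job you would presumably call `load_tables(table_list, create_glue_df)` with the question's original one-argument function, and then use `dfs["orders"]` wherever you wanted `df_orders`. Tables that failed to load are simply absent from the dict, so `"price" in dfs` tells you whether a given client has that table.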

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/73362984
