I have three Spark DataFrames (df_main, df_xyz, df_cvb), where df_main is the driving DataFrame in which a new column (new_col) needs to be created based on a condition.
l = [(1, 'XYZ', '324 NW', 'VA'), (2, 'XYZ', '323 NW', 'VA'), (3, 'CVB', '314 NW', 'VA')]
df_main = spark.createDataFrame(l, ('ID', 'Name', 'Address', 'State'))
ID Name Address State
1 XYZ 324 NW VA
2 XYZ 323 NW VA
3 CVB 314 NW VA
l = [(1, 10, 'A'), (2, 20, 'B'), (4, 120, 'C')]
df_xyz = spark.createDataFrame(l, ('ID', 'col1', 'col2'))
ID col1 col2
1 10 A
2 20 B
4 120 C
l = [(1, 56), (2, 45), (3, 12)]
df_cvb = spark.createDataFrame(l, ('ID', 'col3'))
ID col3
1 56
2 45
3 12

Create a column new_col in the df_main DataFrame:
If Name = 'XYZ', take the col1 value from df_xyz, joined on ID.
If Name = 'CVB', take the col3 value from df_cvb, joined on ID.
So my expected output for the df_main DataFrame looks like this:
ID Name Address State new_col
1 XYZ 324 NW VA 10
2 XYZ 323 NW VA 20
3 CVB 314 NW VA 12

Answered 2022-05-03 21:11:08
from pyspark.sql.functions import lit  # needed for the literal "table" tag columns

l = [(1, 'XYZ', '324 NW', 'VA'), (2, 'XYZ', '323 NW', 'VA'), (3, 'CVB', '314 NW', 'VA')]
df_main = spark.createDataFrame(l, ('ID', 'Name', 'Address', 'State'))
l = [(1 ,10, 'A'), (2,20, 'B'), (4, 120, 'C')]
df_xyz = spark.createDataFrame(l, ('ID', 'col1', 'col2'))
df_xyz = df_xyz.select(df_xyz['ID'], df_xyz['col1'].alias('new_col'), lit('XYZ').alias('table'))
l = [(1 ,56), (2,45), (3,12)]
df_cvb = spark.createDataFrame(l, ('ID', 'col3'))
df_cvb = df_cvb.select(df_cvb['ID'], df_cvb['col3'].alias('new_col'), lit('CVB').alias('table'))
cond = [df_cvb['ID'] == df_main['ID'], df_cvb['table'] == df_main['Name']]
condxyz = [df_xyz['ID'] == df_main['ID'], df_xyz['table'] == df_main['Name']]
df_result = df_main.join(df_cvb, cond, 'inner').select(df_main['*'], df_cvb['new_col']) \
    .union(df_main.join(df_xyz, condxyz, 'inner').select(df_main['*'], df_xyz['new_col']))

https://stackoverflow.com/questions/72105067
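Spark aside, the inner-join-then-union logic above can be sanity-checked in plain Python. The dictionaries and loop below are illustrative stand-ins for the DataFrames, not part of the original answer:

```python
# Simulate the two lookup tables as ID -> value maps.
xyz = {1: 10, 2: 20, 4: 120}   # df_xyz: ID -> col1
cvb = {1: 56, 2: 45, 3: 12}    # df_cvb: ID -> col3

main = [
    (1, 'XYZ', '324 NW', 'VA'),
    (2, 'XYZ', '323 NW', 'VA'),
    (3, 'CVB', '314 NW', 'VA'),
]

# For each row, pick the source table based on Name and look up by ID;
# rows with no match are dropped, mirroring the inner joins.
result = []
for id_, name, addr, state in main:
    source = xyz if name == 'XYZ' else cvb
    if id_ in source:
        result.append((id_, name, addr, state, source[id_]))

print(result)
```

Running this reproduces the expected output rows, including new_col values 10, 20, and 12.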