我有一些来自json的示例数据,这些数据看起来类似于以下内容:
{ hero: axe, attribute: strength, active_abilities: [q, w, r], inactive_abilities: e }
{ hero: invoker, attribute: intelligence, active_abilities: [q, w, e, r, f, d], inactive_abilities: null }
{ hero: phantom assassin, attribute: agility, active_abilities: [q, w, e], inactive_abilities: r }
{ hero: life stealer, attribute: strength, active_abilities: [q, r], inactive_abilities: [w, e] }我遇到的问题是,由于该列中数据类型的可变性,“inactive_abilities”列被读取为字符串。数据可能为null、单个字符串(如果只有一个能力)、一个数组(如果有多个能力)。最后,我希望有几个新的列,这些列都是基于“非活动能力”的数量。如果只有1或空能力,我想要一个新列inactive_ability,它只在有一个非活动能力时填充,如果没有或有多个非活动能力,则为null。那么我想要多个专栏,比如inactive_ability1,inactive_ability2,inactive_ability3等等.在数组持有>1值的情况下。因此,从上面的示例中,最终结果应该如下所示:
{ hero: axe, attribute: strength, active_abilities: [q, w, r], inactive_abilities: e , inactive_ability: e, inactive_ability1: null, inactive_ability2: null, inactive_ability3, null, inactive_ability4: null}
{ hero: invoker, attribute: intelligence, active_abilities: [q, w, e, r, f, d], inactive_abilities: null, inactive_ability: null, inactive_ability1: null, inactive_ability2: null, inactive_ability3, null, inactive_ability4: null }
{ hero: phantom assassin, attribute: agility, active_abilities: [q, w, e], inactive_abilities: r, inactive_ability: r, inactive_ability1: null, inactive_ability2: null, inactive_ability3, null, inactive_ability4: null }
{ hero: life stealer, attribute: strength, active_abilities: [q, r], inactive_abilities: [w, e], inactive_ability: null, inactive_ability1: w, inactive_ability2: e, inactive_ability3, null, inactive_ability4: null }我不能假设'inactive_abilities‘的数量是固定的,但是如果它超过了4,其余的就可以忽略了。我遇到的问题是能够将字段转换为数组,并在适当的情况下读取它,然后根据上述条件创建和填充新列。
发布于 2021-12-27 20:14:49
字符串inactive_abilities列可以通过删除[和]并通过,拆分字符串而转换为数组。
from pyspark.sql import functions as F
data = [{"hero": "axe", "attribute": "strength", "active_abilities": ["q", "w", "r"], "inactive_abilities": "e"},
{"hero": "invoker", "attribute": "intelligence", "active_abilities": ["q", "w", "e", "r", "f", "d"],
"inactive_abilities": None},
{"hero": "phantom assassin", "attribute": "agility", "active_abilities": ["q", "w", "e"],
"inactive_abilities": "r"},
{"hero": "life stealer", "attribute": "strength", "active_abilities": ["q", "r"],
"inactive_abilities": "[w, e]"}, ]
df = spark.createDataFrame(data)
array_select_expr = [
F.when(F.size("parsed_inactive_abilities") > 1, F.trim(F.col("parsed_inactive_abilities")[i])).alias(
f"inactive_ability{i}") if i > 0 else F.when(
F.size("parsed_inactive_abilities") == 1, F.trim(F.col("parsed_inactive_abilities")[0])).alias(
"inactive_abilities")
for i in range(0, 5)]
(df.withColumn("parsed_inactive_abilities", F.split(F.regexp_replace(F.col("inactive_abilities"), "[\[\]]", ""), ","))
.select("*", *array_select_expr)
.drop("parsed_inactive_abilities").show())输出
+------------------+------------+----------------+------------------+------------------+-----------------+-----------------+-----------------+-----------------+
| active_abilities| attribute| hero|inactive_abilities|inactive_abilities|inactive_ability1|inactive_ability2|inactive_ability3|inactive_ability4|
+------------------+------------+----------------+------------------+------------------+-----------------+-----------------+-----------------+-----------------+
| [q, w, r]| strength| axe| e| e| null| null| null| null|
|[q, w, e, r, f, d]|intelligence| invoker| null| null| null| null| null| null|
| [q, w, e]| agility|phantom assassin| r| r| null| null| null| null|
| [q, r]| strength| life stealer| [w, e]| null| e| null| null| null|
+------------------+------------+----------------+------------------+------------------+-----------------+-----------------+-----------------+-----------------+
# Value in array_select_expr
[Column<'CASE WHEN (size(parsed_inactive_abilities) = 1) THEN trim(parsed_inactive_abilities[0]) END AS inactive_abilities'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[1]) END AS inactive_ability1'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[2]) END AS inactive_ability2'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[3]) END AS inactive_ability3'>,
Column<'CASE WHEN (size(parsed_inactive_abilities) > 1) THEN trim(parsed_inactive_abilities[4]) END AS inactive_ability4'>]https://stackoverflow.com/questions/70499270
复制相似问题