I have a Spark dataframe with two array columns:
+------------------------------------------------------+-----------------+
| var1| var2|
+------------------------------------------------------+-----------------+
| [black tea, green tea, tea, yerba mate, oolong]| [green tea]|
|[milk, toned milk, standardised milk, full cream milk]| [cow or buffalo]|
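+------------------------------------------------------+-----------------+

I need to remove duplicates according to the following rules:

1. Check each element of column var1 against the values of column var2, and remove from var1 the words that match the var2 value either partially (for example, one word: tea) or completely (for example, two words: green tea).
2. If there is a complete match and the element is removed from var1 entirely, the extra comma (inside the array or at its end) must also be removed.
3. Remove repeated words within column var1. For example, if one element consists of a single word that is then repeated in other elements, those repeats should be removed (e.g. we have milk, followed by toned milk, standardised milk, full cream milk; in this case the desired output is milk, toned, standardised, full cream).

Desired output: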
+---------------------------------------+-----------------+
| var1| var2|
+---------------------------------------+-----------------+
| [black, yerba mate, oolong]| [green tea]|
|[milk, toned, standardised, full cream]| [cow or buffalo]|
+---------------------------------------+-----------------+

Answered 2022-08-01 09:16:46
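Here's one way using higher-order array functions:

1. Flatten var2 into an array of single words, then use transform on array var1 to remove each of those words from its elements. Finally, filter out the elements that are left empty.
2. Join the reversed array var1 into a string, use a regex to remove the repeated words, then split again to get the array back.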
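To make the two steps easier to follow outside Spark, here is a minimal plain-Python sketch of the same idea using only the `re` module. The function names `strip_var2_words`, `drop_repeated_words` and `clean` are made up for illustration and are not part of the answer; the `#` separator and the regex are taken from the PySpark code. The array is reversed before joining because the lookahead `(?=.*\b\1\b)` removes a word when it occurs again later in the string, so reversing is what keeps the first occurrence in the original order.

```python
import re


def strip_var2_words(var1, var2):
    """Step 1 (hypothetical helper): remove every single word of var2 from var1's elements."""
    # Split each var2 phrase into words and build an alternation such as "green|tea".
    words = [w for phrase in var2 for w in re.split(r"\s+", phrase)]
    pattern = "|".join(words)
    # Delete the matches from each element, then drop elements that are now blank.
    stripped = [re.sub(pattern, "", x) for x in var1]
    return [x for x in stripped if x.strip() != ""]


def drop_repeated_words(var1):
    """Step 2 (hypothetical helper): keep only the first occurrence of each word."""
    # Reverse and join, so that "occurs again later in the string" means
    # "already occurred earlier in the original array".
    joined = "#".join(reversed(var1))
    deduped = re.sub(r"\b(\w+)\b(?=.*\b\1\b)", "", joined)
    # Split back, restore the original order and trim leftover spaces.
    return [x.strip() for x in reversed(deduped.split("#"))]


def clean(var1, var2):
    return drop_repeated_words(strip_var2_words(var1, var2))


print(clean(["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]))
print(clean(["milk", "toned milk", "standardised milk", "full cream milk"], ["cow or buffalo"]))
```

Note that the alternation built in step 1 has no word boundaries, so a var2 word such as `or` would also be removed from inside longer words; wrapping the pattern in `\b(?:...)\b` may be safer if that matters for your data. The PySpark version of both steps follows.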
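from pyspark.sql import functions as F

df1 = df.withColumn(
    # Alternation of every single word in var2, e.g. "green|tea"
    "regex",
    F.concat_ws("|", F.flatten(F.transform("var2", lambda x: F.split(x, "\\s+"))))
).withColumn(
    # Strip those words from each var1 element, then drop elements left blank
    "var1",
    F.filter(F.expr("transform(var1, x -> regexp_replace(x, regex, ''))"), lambda x: F.trim(x) != "")
).withColumn(
    # Join the reversed array and blank out any word that occurs again later in the string
    "var1",
    F.regexp_replace(F.array_join(F.reverse("var1"), "#"), r"\b(\w+)\b(?=.*\b\1\b)", "")
).withColumn(
    # Split back into an array, restore the original order and trim leftover spaces
    "var1",
    F.transform(F.reverse(F.split("var1", "#")), lambda x: F.trim(x))
).drop("regex")

Using this example df: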
df = spark.createDataFrame([
    (["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]),
    (["milk", "toned milk", "standardised milk", "full cream milk"], ["cow or buffalo"])
], ["var1", "var2"])

you get:
df1.show(truncate=False)
# +---------------------------------------+----------------+
# |var1 |var2 |
# +---------------------------------------+----------------+
# |[black, yerba mate, oolong] |[green tea] |
# |[milk, toned, standardised, full cream]|[cow or buffalo]|
# +---------------------------------------+----------------+

Answered 2022-08-01 07:25:16
Definitely a buffalo, not a cow :-)
from pyspark.sql.functions import array_except, col, regexp_replace, split

df = (
    # Split var1 and var2 into lists of single words, stored in temporary columns var1_1 and var2_1
    df.select('*', *[split(regexp_replace(col(x).cast('string'), r'\]|\[|\,', ''), r'\s').alias(f'{x}_1') for x in df.columns])
    # Use the array functions to remove from var1 the elements that also appear in var2_1 or var2
    .withColumn('var1', array_except('var1', 'var2_1'))
    .withColumn('var1', array_except('var1', 'var2'))
).select('var1', 'var2')

df.show(truncate=False)
+------------------------------------------------------+----------------+
|var1 |var2 |
+------------------------------------------------------+----------------+
|[black, yerba mate, oolong] |[green tea] |
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+

https://stackoverflow.com/questions/73189710