Suppose I have a table of the form:
id col1 col2 col3 col4
------------------------
as1 4 10 4 6
as2 6 3 6 1
as3 6 0 2 1
as4 8 8 6 1
as5 9 6 6 9
Is there a way in PySpark to search col2 through col4 for the values in col1 and return (id row name, column name)? For example:
In col1, 4 is found in (as1, col3)
In col1, 6 is found in (as2, col3), (as1, col4), (as4, col3), (as5, col3)
In col1, 8 is found in (as4, col2)
In col1, 9 is found in (as5, col4)
Hint: assume col1 is treated as a set {4, 6, 8, 9}, i.e. its values are unique.
Answered 2019-03-07 02:22:35
Yes, you can make use of Spark's .isin operator.
Let's first create the DataFrame from the example.
Part 1 - Creating the DataFrame
from pyspark.sql.types import StructType, StructField, IntegerType

# Assumes an active SparkSession named `spark` (as in the pyspark shell)
cSchema = StructType([StructField("id", IntegerType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])
test_data = [[1, 4, 10, 4, 6], [2, 6, 3, 6, 1], [3, 6, 0, 2, 1], [4, 8, 8, 6, 1], [5, 9, 6, 6, 9]]
df = spark.createDataFrame(test_data, schema=cSchema)
df.show()
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 1| 4| 10| 4| 6|
| 2| 6| 3| 6| 1|
| 3| 6| 0| 2| 1|
| 4| 8| 8| 6| 1|
| 5| 9| 6| 6| 9|
+---+----+----+----+----+
Part 2 - Function to search for matching values
isin: a boolean expression that evaluates to true if the value of this expression is contained in the evaluated values of the arguments. http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
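def search(col1, col3):
    # Collect the values of the source column to the driver
    col1_list = df.select(col1).rdd \
        .map(lambda x: x[0]).collect()
    # Keep the rows whose target column holds one of those values
    search_results = df[df[col3].isin(col1_list)]
    return search_results

search_results = search("col1", "col3")
search_results.show()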
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 1| 4| 10| 4| 6|
| 2| 6| 3| 6| 1|
| 4| 8| 8| 6| 1|
| 5| 9| 6| 6| 9|
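+---+----+----+----+----+
This should point you in the right direction. You can select only the id column, or whatever else you want returned. The function can easily be changed to accept more columns to search. Hope this helps!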
Answered 2019-03-30 22:08:17
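from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# define the schema as a list of StructFields (assumes an active SparkSession named `spark`)
cSchema = StructType([StructField("id", StringType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])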
test_data = [['as1', 4, 10, 4, 6],
['as2', 6, 3, 6, 1],
['as3', 6, 0, 2, 1],
['as4', 8, 8, 6, 1],
['as5', 9, 6, 6, 9]]
# create pyspark dataframe
df = spark.createDataFrame(test_data, schema=cSchema)
df.show()
# obtain the distinct items for col 1
distinct_list = [i.col1 for i in df.select("col1").distinct().collect()]
# rest columns
col_list = ['id', 'col2', 'col3', 'col4']
# search the remaining columns for each value found in col1
def search(distinct_list):
    for i in distinct_list:
        print(str(i) + ' found in: ')
        # for col in df.columns:
        for col in col_list:
            # compare against a string so the same filter also works on the string-typed id column
            df_search = df.select(*col_list) \
                .filter(df[str(col)] == str(i))
            # show only the non-empty results
            if (len(df_search.head(1)) > 0):
                df_search.show()

search(distinct_list)

The full example code is available on GitHub.
Output:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|as1| 4| 10| 4| 6|
|as2| 6| 3| 6| 1|
|as3| 6| 0| 2| 1|
|as4| 8| 8| 6| 1|
|as5| 9| 6| 6| 9|
+---+----+----+----+----+
6 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5| 6| 6| 9|
+---+----+----+----+
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as2| 3| 6| 1|
|as4| 8| 6| 1|
|as5| 6| 6| 9|
+---+----+----+----+
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1| 10| 4| 6|
+---+----+----+----+
9 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5| 6| 6| 9|
+---+----+----+----+
4 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1| 10| 4| 6|
+---+----+----+----+
8 found in:
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as4| 8| 6| 1|
+---+----+----+----+
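The tables above show the matching rows but not which column matched. As a cross-check, here is a small plain-Python sketch (no Spark involved) that computes the (id, column) pairs the question asks for on the same sample data. The `find_matches` helper and the dictionary layout are assumptions made for illustration, not part of either answer.

```python
# Sample data from the question: id -> (col1, col2, col3, col4)
rows = {
    "as1": (4, 10, 4, 6),
    "as2": (6, 3, 6, 1),
    "as3": (6, 0, 2, 1),
    "as4": (8, 8, 6, 1),
    "as5": (9, 6, 6, 9),
}
search_cols = ["col2", "col3", "col4"]

def find_matches(rows, search_cols):
    """Hypothetical helper: map each distinct col1 value to the
    (id, column) pairs where it occurs in the searched columns."""
    col1_values = {vals[0] for vals in rows.values()}
    found = {value: [] for value in sorted(col1_values)}
    for row_id, vals in rows.items():
        # vals[1:] lines up with search_cols (col2, col3, col4)
        for name, cell in zip(search_cols, vals[1:]):
            if cell in col1_values:
                found[cell].append((row_id, name))
    return found

matches = find_matches(rows, search_cols)
for value, pairs in matches.items():
    print(f"In col1, {value} is found in {pairs}")
```

Note that on this data the value 6 also occurs at (as5, col2), which agrees with the first "6 found in" table above even though the question's example list does not mention it.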
https://stackoverflow.com/questions/55031126