
Q: Search the remaining columns for the values in column1

Stack Overflow user

Asked 2019-03-06 19:49:41

Answers: 2 · Views: 256 · Followers: 0 · Votes: 2

Suppose there is a table of the form:

id  col1  col2 col3 col4
------------------------
as1  4    10    4    6
as2  6    3     6    1
as3  6    0     2    1
as4  8    8     6    1
as5  9    6     6    9

Is there a way in pyspark to search col2-4 for the values in col1 and return (id row name, column name)? For example:

In col1, 4 is found in (as1, col3)
In col1, 6 is found in (as2,col3),(as1,col4),(as4, col3) (as5,col3)
In col1, 8 is found in (as4,col2)
In col1, 9 is found in (as5,col4)

Hint: assume col1 is treated as the set {4, 6, 8, 9}, i.e. its unique values.

python

pyspark

2 Answers

Stack Overflow user

Answered 2019-03-07 02:22:35

Yes, you can use Spark's .isin operator.

Let's first create the DataFrame from the example.

Part 1 - Creating the DataFrame

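from pyspark.sql.types import StructType, StructField, IntegerType

# schema: five integer columns (the id is simplified to an integer here)
cSchema = StructType([StructField("id", IntegerType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])

test_data = [[1,4,10,4,6],[2,6,3,6,1],[3,6,0,2,1],[4,8,8,6,1],[5,9,6,6,9]]

# assumes an existing SparkSession named spark
df = spark.createDataFrame(test_data, schema=cSchema)

df.show()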

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|   4|  10|   4|   6|
|  2|   6|   3|   6|   1|
|  3|   6|   0|   2|   1|
|  4|   8|   8|   6|   1|
|  5|   9|   6|   6|   9|
+---+----+----+----+----+

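Part 2 - Function to search for matching values

isin: a boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments. http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html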

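def search(col1, col3):
    # collect the values of the source column to the driver
    col1_list = df.select(col1).rdd.map(lambda x: x[0]).collect()
    # keep the rows whose target column holds one of those values
    search_results = df[df[col3].isin(col1_list)]
    return search_results

# search col3 for the values that occur in col1
search_results = search("col1", "col3")
search_results.show()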

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|   4|  10|   4|   6|
|  2|   6|   3|   6|   1|
|  4|   8|   8|   6|   1|
|  5|   9|   6|   6|   9|
+---+----+----+----+----+

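This should point you in the right direction. You can select only the id column, or whatever else you want returned. The function can easily be changed to accept more columns to search. Hope this helps!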

Votes: 1

Stack Overflow user

Answered 2019-03-30 22:08:17

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# define the schema: a string id plus four integer columns
cSchema = StructType([StructField("id", StringType()),
                      StructField("col1", IntegerType()),
                      StructField("col2", IntegerType()),
                      StructField("col3", IntegerType()),
                      StructField("col4", IntegerType())])

test_data = [['as1', 4, 10, 4, 6],
             ['as2', 6, 3, 6, 1],
             ['as3', 6, 0, 2, 1],
             ['as4', 8, 8, 6, 1],
             ['as5', 9, 6, 6, 9]]

# create pyspark dataframe (assumes an existing SparkSession named spark)
df = spark.createDataFrame(test_data, schema=cSchema)

df.show()

# obtain the distinct items for col 1
distinct_list = [row.col1 for row in df.select("col1").distinct().collect()]
# columns to display: id plus the remaining columns
col_list = ['id', 'col2', 'col3', 'col4']

# search the remaining columns for each value found in col 1
def search(distinct_list):
    for i in distinct_list:
        print(str(i) + ' found in: ')

        # compare only the value columns; id is a string and can never match
        for col in col_list[1:]:
            df_search = df.select(*col_list).filter(df[col] == i)

            # show the matching rows only when there is at least one
            if len(df_search.head(1)) > 0:
                df_search.show()


search(distinct_list)

The complete example code is available on GitHub.

Output:

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|as1|   4|  10|   4|   6|
|as2|   6|   3|   6|   1|
|as3|   6|   0|   2|   1|
|as4|   8|   8|   6|   1|
|as5|   9|   6|   6|   9|
+---+----+----+----+----+

6 found in: 
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5|   6|   6|   9|
+---+----+----+----+

+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as2|   3|   6|   1|
|as4|   8|   6|   1|
|as5|   6|   6|   9|
+---+----+----+----+

+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1|  10|   4|   6|
+---+----+----+----+

9 found in: 
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as5|   6|   6|   9|
+---+----+----+----+

4 found in: 
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as1|  10|   4|   6|
+---+----+----+----+

8 found in: 
+---+----+----+----+
| id|col2|col3|col4|
+---+----+----+----+
|as4|   8|   6|   1|
+---+----+----+----+

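Votes: 1

The original content of this page is provided by Stack Overflow. Translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: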

https://stackoverflow.com/questions/55031126
