首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >熊猫从多个列合并连接组

熊猫从多个列合并连接组
EN

Stack Overflow用户
提问于 2022-02-09 09:42:56
回答 2查看 64关注 0票数 4

如何对至少具有一个共同值的行进行分组?我可以将多个列传递给groupby,但我希望考虑其中的任何一列,而不是所有列。

样本代码:

代码语言:javascript
复制
import pandas as pd

input = pd.DataFrame({
  'fruit': ['peach', 'banana', pd.NA, 'peach', 'apple', 'avocado', pd.NA],
  'vegetable': [pd.NA, pd.NA, 'zucchini', pd.NA, pd.NA, pd.NA, 'potato'],
  'sugar': [17, 17, 2, 18, 20, pd.NA, 4],
  'color': ['orange', 'yellow', 'green', 'orange', 'red', 'green', 'brown']
})

output = input.groupby(['fruit', 'vegetable', 'sugar', 'color']).agg({
  'fruit': lambda x: list(set(x)),
  'vegetable': lambda x: list(set(x)),
  'sugar': lambda x: list(set(x)),
  'color': lambda x: list(set(x))
})

输入:

代码语言:javascript
复制
    fruit       vegetable   sugar   color
0   peach                   17      orange
1   banana                  17      yellow
2               zucchini    2       green
3   peach                   18      orange
4   apple                   20      red
5   avocado                         green
6               potato      4       brown

预期产出:

代码语言:javascript
复制
    fruit               vegetable   sugar       color
0   [peach, banana]     []          [17, 18]    [orange, yellow]
1   [avocado]           [zucchini]  [2]         [green]
2   [apple]             []          [20]        [red]
3   []                  [potato]    [4]         [brown]
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2022-02-09 10:49:39

你的问题似乎是一个图问题。

查找每列的组

首先,让我们看看哪些行按列分组。

代码语言:javascript
复制
from itertools import combinations, chain
groups = {col:
          list(chain.from_iterable(list(combinations(x, 2))
                                   for x in df.index.groupby(df[col]).values()
                                   if len(x)>1))
          for col in df.columns}

# {'fruit': [(0, 3)],
#  'vegetable': [],
#  'sugar': [(0, 1)],
#  'color': [(2, 5), (0, 3)]}

所以,在这里,我们想将组(0,3)和(0,1)合并为0是常见的。

得到连接的部件

让我们使用networkx

代码语言:javascript
复制
import networkx as nx

edges = list(chain.from_iterable(groups.values()))

G = nx.from_edgelist(edges)
G.add_nodes_from(df.index)

combined_groups = {k:v for v,l in enumerate(nx.connected_components(G))
                   for k in l}
# index: group_id
# {0: 0, 1: 0, 3: 0, 2: 1, 5: 1, 4: 2, 6: 3}

最后一群

代码语言:javascript
复制
df.groupby(df.index.map(combined_groups)).agg(list)

产出:

代码语言:javascript
复制
                    fruit           vegetable         sugar                     color
0  [peach, banana, peach]  [<NA>, <NA>, <NA>]  [17, 17, 18]  [orange, yellow, orange]
1         [<NA>, avocado]    [zucchini, <NA>]     [2, <NA>]            [green, green]
2                 [apple]              [<NA>]          [20]                     [red]
3                  [<NA>]            [potato]           [4]                   [brown]
票数 2
EN

Stack Overflow用户

发布于 2022-02-09 17:34:11

在@mozway的“挑战”之后,我试图给出一个机会,这是我在上述问题上的“艰难之路”:

代码语言:javascript
复制
import pandas as pd

df = pd.DataFrame({
    'fruit': ['peach', 'banana', pd.NA, 'peach', 'apple', 'avocado', pd.NA],
    'vegetable': [pd.NA, pd.NA, 'zucchini', pd.NA, pd.NA, pd.NA, 'potato'],
    'sugar': [17, 17, 2, 18, 20, pd.NA, 4],
    'color': ['orange', 'yellow', 'green', 'orange', 'red', 'green', 'brown']
})

# Replacing pd.NA with empty string
df.replace(pd.NA, '', inplace=True)

# Initializing output dataframe with the columns of input dataframe
output_df = df2 = pd.DataFrame(data=None, columns=df.columns)

# Converting all the values per each row to list and removing empty values from it
df['common'] = df.values.tolist()
df['common'] = df['common'].apply(lambda x: list(filter(lambda y: y != '', x)))

check_list = set()
index = 0

# Iterating over each row of df
for i, row_i in df.iterrows():
    # Check if the index of row_i is not in check_list
    if i not in check_list:
        # Looping through the columns and setting up the output_df with each value as a list
        for column in df.columns[:-1]:
            temp = str(df.at[i, column])
            output_df.at[index, column] = [temp] if temp else []
        # Looping throught the dataframe again to compare the 'common' values
        for j, row_j in df.iterrows():
            # Check if index of row_j is not in check_list and not matches with the index of row_i
            if i != j and j not in check_list:
                # Check the common values between row_i and row_j
                # if found, update the output_df and append the values into the already defined list
                if set(row_i['common']).intersection(row_j['common']):
                    for column in df.columns[:-1]:
                        temp = str(df.at[j, column])
                        # Avoid the duplicate values in the column
                        if temp and temp not in output_df.at[index, column]:
                            output_df.at[index, column].append(temp)
                    check_list.add(j)
        # Increment the index of output_df
        index += 1

print(output_df)

这是输出,我得到:

代码语言:javascript
复制
             fruit   vegetable     sugar             color
0  [peach, banana]          []  [17, 18]  [orange, yellow]
1        [avocado]  [zucchini]       [2]           [green]
2          [apple]          []      [20]             [red]
3               []    [potato]       [4]           [brown]

但是无论如何,给我的答案是微妙的,简单的,简短的。

票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/71047209

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档