文章/答案/技术大牛

发布

社区首页 >问答首页 >熊猫从多个列合并连接组

问熊猫从多个列合并连接组
EN

Stack Overflow用户

提问于 2022-02-09 09:42:56

回答 2查看 64关注 0票数 4

如何对至少具有一个共同值的行进行分组？我可以将多个列传递给groupby，但我希望考虑其中的任何一列，而不是所有列。

样本代码：

import pandas as pd

input = pd.DataFrame({
  'fruit': ['peach', 'banana', pd.NA, 'peach', 'apple', 'avocado', pd.NA],
  'vegetable': [pd.NA, pd.NA, 'zucchini', pd.NA, pd.NA, pd.NA, 'potato'],
  'sugar': [17, 17, 2, 18, 20, pd.NA, 4],
  'color': ['orange', 'yellow', 'green', 'orange', 'red', 'green', 'brown']
})

output = input.groupby(['fruit', 'vegetable', 'sugar', 'color']).agg({
  'fruit': lambda x: list(set(x)),
  'vegetable': lambda x: list(set(x)),
  'sugar': lambda x: list(set(x)),
  'color': lambda x: list(set(x))
})

输入：

    fruit       vegetable   sugar   color
0   peach                   17      orange
1   banana                  17      yellow
2               zucchini    2       green
3   peach                   18      orange
4   apple                   20      red
5   avocado                         green
6               potato      4       brown

预期产出：

    fruit               vegetable   sugar       color
0   [peach, banana]     []          [17, 18]    [orange, yellow]
1   [avocado]           [zucchini]  [2]         [green]
2   [apple]             []          [20]        [red]
3   []                  [potato]    [4]         [brown]

python

pandas

dataframe

回答 2

Stack Overflow用户

回答已采纳

发布于 2022-02-09 10:49:39

你的问题似乎是一个图问题。

查找每列的组

首先，让我们看看哪些行按列分组。

from itertools import combinations, chain
groups = {col:
          list(chain.from_iterable(list(combinations(x, 2))
                                   for x in df.index.groupby(df[col]).values()
                                   if len(x)>1))
          for col in df.columns}

# {'fruit': [(0, 3)],
#  'vegetable': [],
#  'sugar': [(0, 1)],
#  'color': [(2, 5), (0, 3)]}

所以，在这里，我们想将组(0,3)和(0,1)合并为0是常见的。

得到连接的部件

让我们使用networkx

import networkx as nx

edges = list(chain.from_iterable(groups.values()))

G = nx.from_edgelist(edges)
G.add_nodes_from(df.index)

combined_groups = {k:v for v,l in enumerate(nx.connected_components(G))
                   for k in l}
# index: group_id
# {0: 0, 1: 0, 3: 0, 2: 1, 5: 1, 4: 2, 6: 3}

最后一群

df.groupby(df.index.map(combined_groups)).agg(list)

产出：

                    fruit           vegetable         sugar                     color
0  [peach, banana, peach]  [<NA>, <NA>, <NA>]  [17, 17, 18]  [orange, yellow, orange]
1         [<NA>, avocado]    [zucchini, <NA>]     [2, <NA>]            [green, green]
2                 [apple]              [<NA>]          [20]                     [red]
3                  [<NA>]            [potato]           [4]                   [brown]

票数 2

Stack Overflow用户

发布于 2022-02-09 17:34:11

在@mozway的“挑战”之后，我试图给出一个机会，这是我在上述问题上的“艰难之路”：

import pandas as pd

df = pd.DataFrame({
    'fruit': ['peach', 'banana', pd.NA, 'peach', 'apple', 'avocado', pd.NA],
    'vegetable': [pd.NA, pd.NA, 'zucchini', pd.NA, pd.NA, pd.NA, 'potato'],
    'sugar': [17, 17, 2, 18, 20, pd.NA, 4],
    'color': ['orange', 'yellow', 'green', 'orange', 'red', 'green', 'brown']
})

# Replacing pd.NA with empty string
df.replace(pd.NA, '', inplace=True)

# Initializing output dataframe with the columns of input dataframe
output_df = df2 = pd.DataFrame(data=None, columns=df.columns)

# Converting all the values per each row to list and removing empty values from it
df['common'] = df.values.tolist()
df['common'] = df['common'].apply(lambda x: list(filter(lambda y: y != '', x)))

check_list = set()
index = 0

# Iterating over each row of df
for i, row_i in df.iterrows():
    # Check if the index of row_i is not in check_list
    if i not in check_list:
        # Looping through the columns and setting up the output_df with each value as a list
        for column in df.columns[:-1]:
            temp = str(df.at[i, column])
            output_df.at[index, column] = [temp] if temp else []
        # Looping throught the dataframe again to compare the 'common' values
        for j, row_j in df.iterrows():
            # Check if index of row_j is not in check_list and not matches with the index of row_i
            if i != j and j not in check_list:
                # Check the common values between row_i and row_j
                # if found, update the output_df and append the values into the already defined list
                if set(row_i['common']).intersection(row_j['common']):
                    for column in df.columns[:-1]:
                        temp = str(df.at[j, column])
                        # Avoid the duplicate values in the column
                        if temp and temp not in output_df.at[index, column]:
                            output_df.at[index, column].append(temp)
                    check_list.add(j)
        # Increment the index of output_df
        index += 1

print(output_df)

这是输出，我得到：

             fruit   vegetable     sugar             color
0  [peach, banana]          []  [17, 18]  [orange, yellow]
1        [avocado]  [zucchini]       [2]           [green]
2          [apple]          []      [20]             [red]
3               []    [potato]       [4]           [brown]

但是无论如何，给我的答案是微妙的，简单的，简短的。

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/71047209

复制

相似问题

问熊猫从多个列合并连接组
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问熊猫从多个列合并连接组EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问熊猫从多个列合并连接组
EN