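I have a dataframe with four columns: parent_serialno, child_serialno, parent_function, and child_function. I want to build a dataframe in which each row is a root parent, each column is a function, and each value is the serial number for that function.
For example, the dataframe looks like this: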
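import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'], ['010', '100', 'B', 'D'], ['100', '110', 'D', 'E'],
     ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno', 'parent_function', 'child_function'])

Note that not every root has a descendant for every function, but for a given root each function has only one serial number. The root serial numbers are known ahead of time.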
The output dataframe I want looks like this:
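pd.DataFrame([['001', '010', '020', '100', '110'], ['002', '030', '040', np.nan, np.nan]], columns=['A', 'B', 'C', 'D', 'E'])

Out[1]:
     A    B    C    D    E
0  001  010  020  100  110
1  002  030  040  NaN  NaN

This post shows how to get a dictionary hierarchy, but I care less about identifying where each leaf sits in the tree (i.e. grandchild or great-grandchild) and more about identifying the root and function of each leaf.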
Posted on 2021-10-21 17:48:12
You can solve this with networkx:
# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Label each node with a (function, serialno) tuple
df['parent'] = df[['parent_function', 'parent_serialno']].apply(tuple, axis=1)
df['child'] = df[['child_function', 'child_serialno']].apply(tuple, axis=1)

# Create a directed graph from the dataframe
G = nx.from_pandas_edgelist(df, source='parent', target='child',
                            create_using=nx.DiGraph)

# Find roots (no incoming edges) and leaves (no outgoing edges)
roots = [node for node, degree in G.in_degree() if degree == 0]
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Collect every node on every path from each root to each leaf
paths = {}
for root in roots:
    children = paths.setdefault(root, [])
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            children.extend(path[1:])
    # Order the descendants by serial number
    children.sort(key=lambda x: x[1])

# Build the final output: one row per root, one column per function
out = pd.DataFrame([dict([parent] + children) for parent, children in paths.items()])

Output:
>>> out
     A    B    C    D    E
0  001  010  020  100  110
1  002  030  040  NaN  NaN

https://stackoverflow.com/questions/69664966
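
Since the question says the root serial numbers are known ahead of time, a variation on the networkx approach is possible. The following is only a sketch, not part of the original answer: it assumes the roots are passed in as a hypothetical list named known_roots, and that every serial number is unique across the whole dataframe, so that a single serial-number-to-function lookup (here called function_of, also a made-up name) is safe. It uses nx.descendants to gather everything reachable from a root instead of enumerating root-to-leaf paths.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'],
     ['010', '100', 'B', 'D'], ['100', '110', 'D', 'E'],
     ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno',
             'parent_function', 'child_function'])

# Assumption: the caller already knows which serial numbers are roots.
known_roots = ['001', '002']

# Directed graph whose nodes are plain serial numbers.
G = nx.from_pandas_edgelist(df, source='parent_serialno',
                            target='child_serialno',
                            create_using=nx.DiGraph)

# Assumption: serial numbers are globally unique, so one lookup table
# from serial number to function is enough.
function_of = dict(zip(df['parent_serialno'], df['parent_function']))
function_of.update(zip(df['child_serialno'], df['child_function']))

rows = []
for root in known_roots:
    # Start the row with the root itself, then add every descendant.
    row = {function_of[root]: root}
    for serial in sorted(nx.descendants(G, root)):
        row[function_of[serial]] = serial
    rows.append(row)

# Functions missing under a root are filled with NaN by pandas.
out_alt = pd.DataFrame(rows)
print(out_alt)
```

If two serial numbers under the same root shared a function, the later one would silently overwrite the earlier one in row, so this only fits data that meets the one-serial-number-per-function condition stated in the question.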