Q: Generating a synthetic key to map a many-to-many relationship

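Stack Overflow user

Asked on 2019-06-21 13:57:48

Answers: 1 · Views: 238 · Followers: 0 · Votes: 0

After identifying the relationships between the original keys, I am trying to create a unique synthetic key.

My DataFrame: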

Key   Value
K1     1
K2     2
K2     3
K1     3
K2     4
K1     5
K3     6
K4     6
K5     7

Expected result:

Key   Value   New_Key
K1     1        NK1
K2     2        NK1
K2     3        NK1
K1     3        NK1
K2     4        NK1
K1     5        NK1
K3     6        NK2
K4     6        NK2
K5     7        NK3

I am looking for an answer in Python 3 or PySpark.

I tried this code:

# Import libraries
import networkx as nx
import pandas as pd

# Create the DataFrame
d1 = pd.DataFrame({'Key': ['K1', 'K2', 'K2', 'K1', 'K2', 'K1', 'K3', 'K4', 'K5'],
                   'Value': [1, 2, 3, 3, 4, 5, 6, 6, 7]})
# Create an empty graph
G = nx.Graph()
# Create a list of (Key, Value) edge tuples
e = list(d1.itertuples(index=False, name=None))
# Create a list of nodes/vertices
v = list(set(d1.Key).union(set(d1.Value)))
# Add nodes and edges to the graph
G.add_edges_from(e)
G.add_nodes_from(v)
# Get the list of connected components
c = list(nx.connected_components(G))
print(c)

Thanks in advance.

python-3.x

tsql

pyspark

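1 Answer

Stack Overflow user

Accepted answer

Posted on 2019-07-06 11:51:32

The problem you are trying to solve is a graph problem known as connected components. All you have to do is treat your Keys and Values as vertices and run a connected components algorithm. The following shows a solution using PySpark and GraphFrames.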

from graphframes import GraphFrame

# `sc` and `spark` are the SparkContext and SparkSession, e.g. as provided by the pyspark shell
# connectedComponents() needs a checkpoint directory
sc.setCheckpointDir('/tmp/graphframes')

l = [('K1', 1),
     ('K2', 2),
     ('K2', 3),
     ('K1', 3),
     ('K2', 4),
     ('K1', 5),
     ('K3', 6),
     ('K4', 6),
     ('K5', 7)]

columns = ['Key', 'Value']

df = spark.createDataFrame(l, columns)

# creating a GraphFrame
# an edge DataFrame requires a src and a dst column
edges = df.withColumnRenamed('Key', 'src')\
          .withColumnRenamed('Value', 'dst')

# a vertices DataFrame requires an id column; every Key and every Value becomes a vertex
vertices = df.select('Key').union(df.select('Value')).withColumnRenamed('Key', 'id')

# this creates a GraphFrame...
g = GraphFrame(vertices, edges)
# ...which already has a function called connectedComponents
cC = g.connectedComponents().withColumnRenamed('id', 'Key')

# now we join the connectedComponents DataFrame with the original DataFrame to add the new keys to it.
# I'm calling distinct() here because I'm currently getting duplicate rows which I can't really explain
# at the moment (possibly because the vertices DataFrame above contains repeated ids)
df = df.join(cC, 'Key', 'inner').distinct()
df.show()

Output:

+---+-----+------------+ 
|Key|Value|   component| 
+---+-----+------------+ 
| K3|    6|335007449088| 
| K1|    5|154618822656| 
| K1|    1|154618822656| 
| K1|    3|154618822656| 
| K2|    2|154618822656| 
| K2|    3|154618822656| 
| K2|    4|154618822656| 
| K4|    6|335007449088| 
| K5|    7| 25769803776| 
+---+-----+------------+
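
For data small enough to fit in memory, the same connected-components idea can be sketched in plain Python with a union-find structure, with no Spark or networkx involved. This is only an illustration under stated assumptions: `find`, `assign_new_keys` and the `prefix` parameter are hypothetical names that do not come from the question or from any library, and numbering the labels in order of first appearance is an assumption made to mimic the NK1/NK2/NK3 format asked for in the question.

```python
rows = [("K1", 1), ("K2", 2), ("K2", 3), ("K1", 3), ("K2", 4),
        ("K1", 5), ("K3", 6), ("K4", 6), ("K5", 7)]


def find(parent, node):
    """Return the root of node's set, compressing the path on the way."""
    root = node
    while parent[root] != root:
        root = parent[root]
    while parent[node] != root:
        parent[node], node = root, parent[node]
    return root


def assign_new_keys(rows, prefix="NK"):
    """Label each (key, value) row with its connected component."""
    parent = {}
    for key, value in rows:
        # Tag the nodes so that a key and a value that happen to be equal
        # are still treated as two different vertices.
        k_node, v_node = ("key", key), ("value", value)
        parent.setdefault(k_node, k_node)
        parent.setdefault(v_node, v_node)
        # Each row is an edge: merge the set of its key with the set of its value.
        parent[find(parent, k_node)] = find(parent, v_node)

    labels = {}  # component root -> label, numbered by first appearance
    result = []
    for key, value in rows:
        root = find(parent, ("key", key))
        label = labels.setdefault(root, f"{prefix}{len(labels) + 1}")
        result.append((key, value, label))
    return result


labelled = assign_new_keys(rows)
for row in labelled:
    print(*row)
```

Unlike the component ids in the output above, which are opaque numbers, this sketch produces sequential labels directly. It does not scale the way the GraphFrames approach does, since everything is held in a single dictionary.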

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56704890
