Motifs in pyspark GraphFrames

Stack Overflow user
Asked 2018-05-25 11:34:20
Answers: 1, Views: 481, Followers: 0, Votes: 0

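I'm new to pyspark and am trying to find motifs in a GraphFrame. Although I know that relationships exist between the vertices and the edges, the results I get are empty. I'm running this with Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are at the following links: verticesRDD, edgesRDD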

I've been reading up on GraphFrames but haven't figured it out. This is the code I have so far. Where am I going wrong?

#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *

#Read the csv files 
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")

#Renaming the id columns to enable GraphFrame 
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")

#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register the RDDs as temp tables or not, I get the same empty results, so I'm not sure if this step is really needed

#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)

print g
#this displays the following:
#GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Date: string, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])

#Stations where a is connected to b
motifs = g.find("(a)-[e1]->(b)")
motifs.show()

+---+---+---+
| e1|  a|  b|
+---+---+---+
+---+---+---+

motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
motifs.show()

+---+---+---+---+
| e1|  a|  b| e2|
+---+---+---+---+
+---+---+---+---+


motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
motifs.show()

+---+---+---+---+---+
| e1|  a|  b| e2|  c|
+---+---+---+---+---+
+---+---+---+---+---+

#Stations where a is connected to b and b is connected to c,
#with c being a different station from a
motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c)").filter("(c!=a)")
motifs.show()

+---+---+---+---+---+
| e1|  a|  b| e2|  c|
+---+---+---+---+---+
+---+---+---+---+---+

1 Answer

Stack Overflow user

Answered 2018-05-30 12:18:09

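The problem was how I defined my vertices. I renamed "station_ID" to "id" when, in fact, the column that had to become "id" was "name". Judging by the schema printed in the question, src and dst are strings (station names) while station_ID is an int, so no edge endpoint ever matched a vertex id.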

verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")

should be

verticesRDD = verticesRDD.withColumnRenamed("name", "id")

Motifs work correctly with this change!
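
For anyone hitting the same empty result, here is a small plain-Python sketch of why the rename matters. It is not GraphFrames code and does not use Spark. It only models the idea that an edge can match the motif "(a)-[e1]->(b)" when both its src and dst values equal some vertex id. The helper name find_single_edge_motif and the sample stations are hypothetical, made up for illustration.

```python
# Toy model of the single-edge motif "(a)-[e1]->(b)": an edge matches only if
# both its src and dst values equal the id of some vertex.
# The rows below are made-up sample data, not the asker's real files.

def find_single_edge_motif(vertices, edges, id_column):
    """Return (a, e1, b) triples for edges whose endpoints both match a vertex id."""
    by_id = {v[id_column]: v for v in vertices}
    matches = []
    for e in edges:
        a = by_id.get(e["src"])
        b = by_id.get(e["dst"])
        if a is not None and b is not None:
            matches.append((a, e, b))
    return matches


vertices = [
    {"station_ID": 1, "name": "Station A"},
    {"station_ID": 2, "name": "Station B"},
]
edges = [
    {"id": 100, "src": "Station A", "dst": "Station B"},
    {"id": 101, "src": "Station B", "dst": "Station A"},
]

# Integer station_ID as the vertex id: no string endpoint matches any vertex.
with_station_id = find_single_edge_motif(vertices, edges, "station_ID")
# Station name as the vertex id: every edge finds both of its endpoints.
with_name = find_single_edge_motif(vertices, edges, "name")

print(len(with_station_id), len(with_name))  # prints: 0 2
```

The same check is worth doing on a few sample rows of the real data before building the graph: the values in src and dst should have the same type and content as the values in the vertex id column.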

Votes: 1
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/50521134
