
Q: Errors in GraphFrames' PageRank

Stack Overflow user

Asked 2018-05-25 16:20:56

Answers: 1, Views: 609, Followers: 0, Votes: 0

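I'm new to pyspark and trying to understand how PageRank works. I'm using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are at the following links: verticesRDD and edgesRDD

This is the code I have so far: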

#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *

#Read the csv files 
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")

#Renaming the id columns to enable GraphFrame 
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")

#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: I get the same results whether or not I register the RDDs as temp tables, so I'm not sure this step is really needed

#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)

Now, when I run the pageRank function:

g.pageRank(resetProbability=0.15, maxIter=10)

Py4JJavaError: An error occurred while calling o98.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]

results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="id")

Py4JJavaError: An error occurred while calling o166.run. : org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v: id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string, e: src: string, dst: string, id: int, Duration: int, Start Date: string, Start Type: int, End Date: string, End Type: int, Bike #: int, Subscriber Type: string, Zip Code: string)

ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()

AttributeError: 'function' object has no attribute 'resetProbability'
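The chained form above follows the builder style of the GraphFrames Scala API. In the Python API, pageRank is an ordinary method that takes resetProbability, maxIter and so on as keyword arguments, so looking up .resetProbability on the method itself fails before anything runs. The sketch below reproduces that behaviour with a hypothetical stand-in class (FakeGraph is not part of GraphFrames, and its signature is only assumed to mirror the real one):

```python
class FakeGraph:
    """Hypothetical stand-in for a GraphFrame; not the real class."""

    def pageRank(self, resetProbability=0.15, sourceId=None, maxIter=None, tol=None):
        # The real method would run the algorithm; this one only echoes its settings.
        return {"resetProbability": resetProbability, "maxIter": maxIter}


g = FakeGraph()

try:
    # Builder style: attribute lookup on the method object itself.
    g.pageRank.resetProbability(0.15)
    builder_error = None
except AttributeError as exc:
    builder_error = exc

# Keyword style: the form the Python method accepts.
result = g.pageRank(resetProbability=0.15, maxIter=10)

print(type(builder_error).__name__)
print(result)
```

With the keyword form the call returns its result directly, so no trailing .run() is needed either.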

ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()

Py4JJavaError: An error occurred while calling o188. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]

I'm reading up on PageRank but can't see where I'm going wrong. Any help would be greatly appreciated.
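For readers following along, the roles of resetProbability and maxIter can be illustrated with a plain-Python power iteration. This is a minimal sketch of the textbook formulation, not GraphFrames' implementation (which may normalize ranks differently); the function name pagerank and the four-edge sample graph are made up for illustration.

```python
def pagerank(edges, reset_probability=0.15, max_iter=10):
    """Textbook power-iteration PageRank over (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)  # repeated edges act as extra weight

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Each node first receives its share of the "reset" (teleport) mass.
        new_ranks = {node: reset_probability / n for node in nodes}
        for src in nodes:
            # A node without outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = (1.0 - reset_probability) * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


# Hypothetical sample graph: A links to B and C, B to C, C back to A.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
ranks = pagerank(edges, reset_probability=0.15, max_iter=10)
for node, rank in sorted(ranks.items()):
    print(node, round(rank, 4))
```

A larger reset probability pulls all ranks toward the uniform value, while more iterations move the result closer to the fixed point.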

pyspark

bigdata

pyspark-sql

pagerank

graphframes

1 Answer

Stack Overflow user

Answered 2018-05-30 12:38:14

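The problem was in how I defined my vertices. I renamed "station_ID" to "id", when the column that actually has to become "id" is "name": the edges' src and dst columns hold station names, so the vertex IDs have to be station names as well.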

verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")

has to be

verticesRDD = verticesRDD.withColumnRenamed("name", "id")

pageRank works correctly after this change!
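A mismatch like this can be caught before calling pageRank by checking that every edge endpoint appears among the vertex IDs. Below is a minimal plain-Python sketch of that check. The helper missing_endpoints is hypothetical, the station names and numbers are taken from the row in the error message, and pairing each number with that station as its station_ID is an assumption.

```python
def missing_endpoints(vertex_ids, edges):
    """Return the edge endpoints (src or dst) that match no vertex id."""
    known = set(vertex_ids)
    return sorted({end for src, dst in edges for end in (src, dst) if end not in known})


# Assumed two-station sample built from the row shown in the error message.
stations = [
    {"station_ID": 50, "name": "Harry Bridges Plaza (Ferry Building)"},
    {"station_ID": 70, "name": "San Francisco Caltrain (Townsend at 4th)"},
]
trips = [
    ("Harry Bridges Plaza (Ferry Building)", "San Francisco Caltrain (Townsend at 4th)"),
]

# Numeric station IDs as vertex ids: no station name in the edges can match.
by_number = missing_endpoints([s["station_ID"] for s in stations], trips)
# Station names as vertex ids: every endpoint is found.
by_name = missing_endpoints([s["name"] for s in stations], trips)

print(by_number)
print(by_name)
```

On real DataFrames the same idea applies: compare the distinct src and dst values against the vertices' id column and inspect whatever is left over.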

Votes: 0

Original page content provided by Stack Overflow; the Chinese translation was supported by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/50524656
