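I'm new to pyspark and am trying to understand how PageRank works. I'm using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are at the following links: verticesRDD and edgesRDD
So far I have the following code: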
#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *
#Read the csv files
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")
#Renaming the id columns to enable GraphFrame
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")
#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register the RDDs as temp tables or not, I get the same results, so I'm not sure this step is really needed
#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)
Now, when I run the pageRank function:
g.pageRank(resetProbability=0.15, maxIter=10)
Py4JJavaError: An error occurred while calling o98.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]
results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="id")
Py4JJavaError: An error occurred while calling o166.run. : org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Type: int, End Date: string, End Type: int, Bike #: int, Subscriber Type: string, Zip Code: string])
ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
AttributeError: 'function' object has no attribute 'resetProbability'
ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()
Py4JJavaError: An error occurred while calling o188. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]
I've been reading up on PageRank, but I can't see where I'm going wrong. Any help would be much appreciated.
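Answered 2018-05-30 12:38:14
The problem was how I defined my vertices. I renamed "station_ID" to "id", when it actually had to be "name": the src and dst columns of the edges hold station names (strings), so the vertex id column has to hold station names as well.
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
had to be
verticesRDD = verticesRDD.withColumnRenamed("name", "id")
pageRank works fine after this change!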
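To see why the choice of id column matters, here is a minimal sketch in plain Python (no Spark needed) that checks whether every edge endpoint matches a vertex ID. The sample rows are hypothetical: the station names come from the trip row quoted in the error message, and I am assuming the station IDs 50 and 70 correspond to the terminal numbers in that same row. The helper name dangling_endpoints is made up for this illustration and is not part of GraphFrames.

```python
# Hypothetical sample rows. Station names are taken from the trip row in the
# error message; the IDs 50 and 70 are an assumption (terminal numbers).
vertices = [
    {"station_ID": 50, "name": "Harry Bridges Plaza (Ferry Building)"},
    {"station_ID": 70, "name": "San Francisco Caltrain (Townsend at 4th)"},
]
edges = [
    {"src": "Harry Bridges Plaza (Ferry Building)",
     "dst": "San Francisco Caltrain (Townsend at 4th)"},
]


def dangling_endpoints(vertices, edges, id_column):
    """Return edge endpoints that match no vertex ID taken from id_column."""
    ids = {v[id_column] for v in vertices}
    endpoints = {e[key] for e in edges for key in ("src", "dst")}
    return sorted(endpoints - ids)


# Using the numeric station ID as the vertex id: no endpoint matches.
missing_with_station_id = dangling_endpoints(vertices, edges, "station_ID")

# Using the station name as the vertex id: every endpoint matches.
missing_with_name = dangling_endpoints(vertices, edges, "name")

print(missing_with_station_id)
print(missing_with_name)
```

With "station_ID" as the id, both endpoints of the sample edge are left without a matching vertex, which is the same kind of mismatch that made pageRank fail; with "name" as the id, nothing is left over. The same idea could be applied to the real DataFrames before building the GraphFrame.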
https://stackoverflow.com/questions/50524656