Does the GraphFrames API support creating bipartite graphs in the current version?
Current version: 0.1.0
Spark version: 1.6.1
Posted on 2016-04-21 04:42:00
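As pointed out in the comments on this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are flexible enough to let you create one. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex/object types. While that works with RDDs, it does not work with DataFrames. A row in a DataFrame has a fixed schema: it can't sometimes contain a price column and sometimes not. It can have a price column that is sometimes null, but the column must exist in every row.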
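Instead, the solution for GraphFrames seems to be to define a DataFrame that is essentially a linear sub-type of both types of objects in the bipartite graph: it has to contain all of the fields of both types. This is actually pretty easy, since a join with full_outer will give you that. Something like this: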
val players = Seq(
(1,"dave", 34),
(2,"griffin", 44)
).toDF("id", "name", "age")
val teams = Seq(
(101,"lions","7-1"),
(102,"tigers","5-3"),
(103,"bears","0-9")
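).toDF("id","team","record")

You can then create a superset DataFrame like this:

// coalesce and struct come from the SQL functions object;
// the $"..." syntax assumes sqlContext.implicits._ is in scope (as in spark-shell)
import org.apache.spark.sql.functions.{coalesce, struct}

val teamPlayer = players.withColumnRenamed("id", "l_id").join(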
teams.withColumnRenamed("id", "r_id"),
$"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id| name| age| team|record|
+---+-------+----+------+------+
|101| null|null| lions| 7-1|
|102| null|null|tigers| 5-3|
|103| null|null| bears| 0-9|
| 1| dave| 34| null| null|
| 2|griffin| 44| null| null|
+---+-------+----+------+------+

You can make this a bit cleaner with structs:

val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
$"l_id" === $"r_id",
"full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id| player| team|
+---+------------+------------+
|101| null| [lions,7-1]|
|102| null|[tigers,5-3]|
|103| null| [bears,0-9]|
| 1| [dave,34]| null|
| 2|[griffin,44]| null|
+---+------------+------------+

I'll also point out that roughly the same solution works in GraphX with RDDs. You can always create the vertices by joining two case classes that don't share any traits:

case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
(1L, Player("date", 34)),
(2L, Player("griffin", 44))
))
case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
(101L, Team("lions", "7-1")),
(102L, Team("tigers", "5-3")),
(103L, Team("bears", "0-9"))
))
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(date,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
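(103,(None,Some(Team(bears,0-9))))

With all due respect to the previous answer, this seems like a more flexible way to handle it, with no need to share a trait between the combined objects.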
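
Since neither library enforces the bipartite constraint, it may be worth checking edges yourself before building the graph. The following is only a rough sketch under stated assumptions: plain Scala Maps stand in for the RDDs so it runs without Spark, and `BipartiteSketch`, `fullOuterJoin`, `isPlayer`, `isTeam` and `isBipartiteEdge` are hypothetical helper names, not part of GraphX or GraphFrames.

```scala
// Illustration only: plain Scala collections stand in for RDDs.
case class Player(name: String, age: Int)
case class Team(team: String, record: String)

object BipartiteSketch {
  // Combined vertex attribute, the same shape that fullOuterJoin yields.
  type Vertex = (Option[Player], Option[Team])

  // Full outer join of two keyed maps (keys are unique within each map).
  def fullOuterJoin[K, A, B](left: Map[K, A], right: Map[K, B]): Map[K, (Option[A], Option[B])] =
    (left.keySet ++ right.keySet).map(k => (k, (left.get(k), right.get(k)))).toMap

  val players: Map[Long, Player] = Map(
    1L -> Player("dave", 34),
    2L -> Player("griffin", 44)
  )
  val teams: Map[Long, Team] = Map(
    101L -> Team("lions", "7-1"),
    102L -> Team("tigers", "5-3"),
    103L -> Team("bears", "0-9")
  )

  val vertices: Map[Long, Vertex] = fullOuterJoin(players, teams)

  // A vertex belongs to exactly one side when only one Option is defined.
  def isPlayer(v: Vertex): Boolean = v match {
    case (Some(_), None) => true
    case _               => false
  }
  def isTeam(v: Vertex): Boolean = v match {
    case (None, Some(_)) => true
    case _               => false
  }

  // An edge keeps the graph bipartite only if it links a player to a team.
  // Edges that reference an unknown vertex id are rejected.
  def isBipartiteEdge(src: Long, dst: Long): Boolean =
    (vertices.get(src), vertices.get(dst)) match {
      case (Some(a), Some(b)) => (isPlayer(a) && isTeam(b)) || (isTeam(a) && isPlayer(b))
      case _                  => false
    }
}
```

For example, `BipartiteSketch.isBipartiteEdge(1L, 101L)` accepts a player-to-team edge, while `BipartiteSketch.isBipartiteEdge(1L, 2L)` rejects a player-to-player edge. With real RDDs or DataFrames, the same idea would presumably be expressed as a filter over the edge data before passing it to `Graph(vertices, edges)` or `GraphFrame(vertices, edges)`.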
https://stackoverflow.com/questions/36601769