
Does the GraphFrames API support creating bipartite graphs?

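Stack Overflow user
Asked 2016-04-13 14:37:26
Answers: 1 · Views: 591 · Followers: 0 · Votes: 0

Does the GraphFrames API support creating bipartite graphs in the current version?

Current version: 0.1.0

Spark version: 1.6.1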


1 Answer

Stack Overflow user

Accepted answer

Answered 2016-04-21 04:42:00

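As pointed out in the comments on this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are flexible enough to let you create one. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex/object types. While that works with RDDs, it does not work with DataFrames. A row in a DataFrame has a fixed schema: it cannot sometimes contain a price column and sometimes not. It can have a price column that is sometimes null, but the column must exist in every row.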
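The fixed-schema point can be illustrated with a small sketch. It assumes a Spark 1.6 spark-shell session (so `sqlContext.implicits._` is already in scope); the `items` data is made up for illustration and is not from the original answer.

```scala
// Assumes spark-shell (Spark 1.6); `items` is hypothetical example data.
val items = Seq[(Int, String, Option[Double])](
  (1, "widget", Some(9.99)), // this row has a price
  (2, "dave", None)          // this row has none, so the cell is null
).toDF("id", "name", "price")

// The price column exists for every row; it is simply null where absent.
items.printSchema
val missingPrice = items.filter($"price".isNull).count
```

Every row carries the `price` column, which is why a DataFrame-based vertex set has to be a superset of the fields of both vertex types.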

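Instead, the solution for GraphFrames seems to be to define a DataFrame that is essentially a linear subtype of both types of objects in your bipartite graph: it has to contain all of the fields of both types. This is actually pretty easy, because a join with full_outer gives you exactly that. Something like this:
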
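// In spark-shell these imports are already in scope; a compiled
// application has to add them explicitly.
import org.apache.spark.sql.functions.{coalesce, struct}
import sqlContext.implicits._

// Player vertices: id, name, age
val players = Seq(
  (1,"dave", 34),
  (2,"griffin", 44)
).toDF("id", "name", "age")

// Team vertices: id, team, record
val teams = Seq(
  (101,"lions","7-1"),
  (102,"tigers","5-3"),
  (103,"bears","0-9")
).toDF("id","team","record")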

You can then create a superset DataFrame like this:

val teamPlayer = players.withColumnRenamed("id", "l_id").join(
  teams.withColumnRenamed("id", "r_id"),
  $"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
 .drop($"r_id")
 .withColumnRenamed("l_id", "id")

teamPlayer.show

+---+-------+----+------+------+
| id|   name| age|  team|record|
+---+-------+----+------+------+
|101|   null|null| lions|   7-1|
|102|   null|null|tigers|   5-3|
|103|   null|null| bears|   0-9|
|  1|   dave|  34|  null|  null|
|  2|griffin|  44|  null|  null|
+---+-------+----+------+------+

You can make this a little cleaner with structs:

val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
  teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
  $"l_id" === $"r_id",
  "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
 .drop($"r_id")
 .withColumnRenamed("l_id", "id")

tpStructs.show

+---+------------+------------+
| id|      player|        team|
+---+------------+------------+
|101|        null| [lions,7-1]|
|102|        null|[tigers,5-3]|
|103|        null| [bears,0-9]|
|  1|   [dave,34]|        null|
|  2|[griffin,44]|        null|
+---+------------+------------+
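Either superset DataFrame can then serve as the vertex set of a GraphFrame. The following is a minimal sketch rather than part of the original answer. It assumes a spark-shell session started with the graphframes package and with `teamPlayer` defined as above. The `playsFor` edge list and the `badEdges` check are hypothetical names introduced for illustration.

```scala
// Assumes spark-shell with the graphframes package on the classpath and
// `teamPlayer` defined as above.
import org.graphframes.GraphFrame

// Hypothetical edges: each row links a player id (src) to a team id (dst).
val playsFor = Seq(
  (1, 101, "plays_for"),
  (2, 103, "plays_for")
).toDF("src", "dst", "relationship")

val g = GraphFrame(teamPlayer, playsFor)

// Sanity check for the bipartition: every src should be a player row
// (non-null name) and every dst a team row (non-null team).
// Note: these inner joins do not catch edges whose endpoints are missing
// from the vertex set altogether.
val badEdges = g.edges
  .join(g.vertices.select($"id" as "src", $"name"), "src")
  .join(g.vertices.select($"id" as "dst", $"team"), "dst")
  .filter($"name".isNull || $"team".isNull)

badEdges.count  // 0 when every edge runs from a player to a team
```

GraphFrames itself does not enforce the two-sided structure, so a check like this is one way to confirm that no edge connects two players or two teams.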

I'll also point out that roughly the same solution works in GraphX with RDDs. You can create the vertices by joining two case classes that do not share any traits:

case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
  (1L, Player("dave", 34)),
  (2L, Player("griffin", 44))
))

case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
  (101L, Team("lions", "7-1")),
  (102L, Team("tigers", "5-3")),
  (103L, Team("bears", "0-9"))
))

// Each vertex becomes (Option[Player], Option[Team]) with one side set
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(dave,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))

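Compared with the previous answer, this seems like a more flexible way to handle it, since the objects being combined do not need to share a trait.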
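As a rough sketch of how the joined RDD could feed a GraphX graph (not part of the original answer), the following assumes a spark-shell session where `sc`, `playerRdd` and `teamRdd` are defined as above. The edge list and the `allBipartite` check are hypothetical.

```scala
// Assumes spark-shell with `sc`, playerRdd and teamRdd defined as above.
import org.apache.spark.graphx.{Edge, Graph}

// Each vertex carries (Option[Player], Option[Team]); one side is set.
val vertices = playerRdd.fullOuterJoin(teamRdd)

// Hypothetical edges from player ids to team ids.
val edges = sc.parallelize(Seq(
  Edge(1L, 101L, "plays_for"),
  Edge(2L, 103L, "plays_for")
))

val graph = Graph(vertices, edges)

// An edge respects the bipartition when its source is a player and its
// destination is a team.
val allBipartite = graph.triplets
  .map(t => t.srcAttr._1.isDefined && t.dstAttr._2.isDefined)
  .collect
  .forall(identity)
```

The `Option` pair plays the same role as the nullable columns in the DataFrame version: the vertex type is a superset of both object types, and the side that is defined tells you which partition a vertex belongs to.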

Votes: 4

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/36601769
