Q: How do I replace words (in one DataFrame) with the matching index from another DataFrame?

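Stack Overflow user

Asked 2017-05-25 15:16:37

Answers: 1 | Views: 198 | Followers: 0 | Votes: 1

This is a variation of this question, which was asked for Pandas. My situation is similar, except that I am using spark-shell or pyspark.

I have a DataFrame that contains a list of domains (the vertices):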

index            domain
0            airbnb.com
1          facebook.com
2                st.org
3              index.co
4        crunchbase.com
5               avc.com
6        techcrunch.com
7            google.com

I have another DataFrame that contains the connections between these domains (the edges):

           source_domain    destination_domain
              airbnb.com            google.com
            facebook.com            google.com
                  st.org          facebook.com
                  st.org            airbnb.com
                  st.org        crunchbase.com
                index.co        techcrunch.com
          crunchbase.com        techcrunch.com
          crunchbase.com            airbnb.com
                 avc.com        techcrunch.com
          techcrunch.com                st.org
          techcrunch.com            google.com
          techcrunch.com          facebook.com

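How can I replace each cell in the edges DataFrame with the corresponding index of the domain (i.e. the vertex)? For example, one row of edges might end up looking like this: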

###### Before: ##################### 
           facebook.com google.com   
###### After:  #####################   
           1            7

The data will grow to at least a few hundred gigabytes.

How can I do this in Spark?

pyspark

apache-spark-sql

scala

apache-spark

dataframe

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-05-25 16:19:50

TL;DR Save the datasets as CSV files, vertices.csv and edges.csv, then read and join them.

// load the datasets
val vertices = spark.read.option("header", true).csv("vertices.csv")
val edges = spark.read.option("header", true).csv("edges.csv")

// indexify the source_domain
val sources = edges.
  join(vertices, edges("source_domain") === vertices("domain")).
  withColumnRenamed("index", "source_index")

// indexify the destination_domain
val destinations = edges.
  join(vertices, edges("destination_domain") === vertices("domain")).
  withColumnRenamed("index", "destination_index")

// pair up both indices for each edge (assumes each edge appears only once)
val result = sources.
  join(destinations, Seq("source_domain", "destination_domain")).
  select("source_index", "destination_index")

scala> result.show
+------------+-----------------+
|source_index|destination_index|
+------------+-----------------+
|           0|                7|
|           1|                7|
|           2|                1|
|           2|                0|
|           2|                4|
|           3|                6|
|           4|                6|
|           4|                0|
|           5|                6|
|           6|                2|
|           6|                7|
|           6|                1|
+------------+-----------------+
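
As a possible variation on the joins above, the following sketch looks up both columns in a single chain by joining edges against two aliased copies of vertices, so no third join on the domain pair is needed. The names IndexifyEdges and indexify, the aliases src and dst, and the local[*] master are assumptions made for illustration and are not part of the original answer. The sample rows come from the question and are built inline rather than read from CSV.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IndexifyEdges {

  // Hypothetical helper: replaces both domain columns of `edges`
  // with the matching `index` values from `vertices`.
  def indexify(edges: DataFrame, vertices: DataFrame): DataFrame = {
    // Two aliased views of the same lookup table, one per edge column.
    val src = vertices.as("src")
    val dst = vertices.as("dst")

    edges
      .join(src, col("source_domain") === col("src.domain"))
      .join(dst, col("destination_domain") === col("dst.domain"))
      .select(
        col("src.index").as("source_index"),
        col("dst.index").as("destination_index"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("indexify-edges")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Vertices from the question.
    val vertices = Seq(
      (0, "airbnb.com"), (1, "facebook.com"), (2, "st.org"), (3, "index.co"),
      (4, "crunchbase.com"), (5, "avc.com"), (6, "techcrunch.com"), (7, "google.com")
    ).toDF("index", "domain")

    // The first few edges from the question.
    val edges = Seq(
      ("airbnb.com", "google.com"),
      ("facebook.com", "google.com"),
      ("st.org", "facebook.com")
    ).toDF("source_domain", "destination_domain")

    // Row order in the output is not guaranteed.
    indexify(edges, vertices).show()

    spark.stop()
  }
}
```

Both joins are inner joins, so an edge whose domain is missing from vertices is dropped, as in the code above. If vertices is small enough to fit in executor memory, wrapping it in broadcast(...) from org.apache.spark.sql.functions may be worth testing; whether that holds for data of the size described in the question is not something this sketch can confirm.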

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/44184008
