Input:

Name1, Name2 (the same pair may appear with the two names transposed)

arjun    deshwal
nikhil   choubey
anshul   pandyal
arjun    deshwal
arjun    deshwal
deshwal  arjun
Code used in Scala:

import org.apache.spark.sql.functions.{count, lit}

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(FILE_PATH)
val result = df.groupBy("Name1", "Name2")
  .agg(count(lit(1)).alias("cnt"))

Obtained output:
nikhil   choubey  1
anshul   pandyal  1
arjun    deshwal  3
deshwal  arjun    1
Desired output:

nikhil   choubey  1
anshul   pandyal  1
deshwal/arjun     4

or

nikhil   choubey  1
anshul   pandyal  1
arjun    deshwal  4
Posted on 2016-09-13 06:27:59
I would handle this with a Set, which carries no ordering, so only the contents of the sets are compared:
scala> val data = Array(
| ("arjun", "deshwal"),
| ("nikhil", "choubey"),
| ("anshul", "pandyal"),
| ("arjun", "deshwal"),
| ("arjun", "deshwal"),
| ("deshwal", "arjun")
| )
data: Array[(String, String)] = Array((arjun,deshwal), (nikhil,choubey), (anshul,pandyal), (arjun,deshwal), (arjun,deshwal), (deshwal,arjun))
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:29
scala> val distDataSets = distData.map(tup => (Set(tup._1, tup._2), 1)).countByKey()
distDataSets: scala.collection.Map[scala.collection.immutable.Set[String],Long] = Map(Set(nikhil, choubey) -> 1, Set(arjun, deshwal) -> 4, Set(anshul, pandyal) -> 1)

Hope this helps.
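The same order-insensitive counting can be sketched in plain Scala without a Spark cluster, which makes the idea easy to test locally. This is only an illustration of the Set-as-key technique from the answer above, not the Spark code itself; the `PairCount` object and its `counts` value are names introduced here for the example:

```scala
object PairCount {
  // Same input pairs as in the question; transposed duplicates
  // like (deshwal, arjun) should count together with (arjun, deshwal)
  val data = Seq(
    ("arjun", "deshwal"),
    ("nikhil", "choubey"),
    ("anshul", "pandyal"),
    ("arjun", "deshwal"),
    ("arjun", "deshwal"),
    ("deshwal", "arjun")
  )

  // Key each pair by an order-insensitive Set, then count occurrences per key --
  // the same idea as map(tup => (Set(...), 1)).countByKey() on the RDD
  val counts: Map[Set[String], Int] =
    data.groupBy { case (a, b) => Set(a, b) }
        .map { case (key, rows) => key -> rows.size }

  def main(args: Array[String]): Unit =
    counts.foreach { case (key, n) => println(s"${key.mkString("/")} -> $n") }
}
```

One design note: a `Set` key discards which name was Name1 and which was Name2. If you need to keep the result as a pair, an alternative is to canonicalize each tuple before grouping, e.g. `if (a <= b) (a, b) else (b, a)`, which collapses transposed duplicates the same way.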
https://stackoverflow.com/questions/39462730