I have finished cleaning my PySpark data, including removing stop words. The stop-word removal produces, for each row, a list containing the words that are not stop words. Now I want to count all the remaining words in that column so I can build a word cloud / word-frequency table.
Here is my PySpark dataframe:
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
| content|score|label|classWeigth| words| filtered| terms_stemmed|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|absolutely love d...| 5| 1| 0.48|[absolutely, love...|[absolutely, love...|[absolut, love, d...|
|absolutely love t...| 5| 1| 0.48|[absolutely, love...|[absolutely, love...|[absolut, love, g...|
|absolutely phenom...| 5| 1| 0.48|[absolutely, phen...|[absolutely, phen...|[absolut, phenome...|
|absolutely shocki...| 1| 0| 0.52|[absolutely, shoc...|[absolutely, shoc...|[absolut, shock, ...|
|accept the phone ...| 1| 0| 0.52|[accept, the, pho...|[accept, phone, n...|[accept, phone, n...|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+

terms_stemmed is the last column; from it I would like to get a new dataframe that looks like this:
+-------------+--------+
|terms_stemmed| count |
+-------------+--------+
|app | 592059|
|use | 218178|
|good | 187671|
|like | 155304|
|game | 149941|
|.... | .... |

Can anyone help me?
Posted on 2021-09-01 11:01:24
One option is to use explode:
import pyspark.sql.functions as F
new_df = df\
.withColumn('terms_stemmed', F.explode('terms_stemmed'))\
.groupby('terms_stemmed')\
    .count()

Example:
import pyspark.sql.functions as F
df = spark.createDataFrame([
(1, ["Apple", "Banana"]),
(2, ["Banana", "Orange", "Banana"]),
(3, ["Orange"])
], ("id", "terms_stemmed"))
df.show(truncate=False)
+---+------------------------+
|id |terms_stemmed |
+---+------------------------+
|1 |[Apple, Banana] |
|2 |[Banana, Orange, Banana]|
|3 |[Orange] |
+---+------------------------+
new_df = df\
.withColumn('terms_stemmed', F.explode('terms_stemmed'))\
.groupby('terms_stemmed')\
.count()
new_df.show()
+-------------+-----+
|terms_stemmed|count|
+-------------+-----+
| Banana| 3|
| Apple| 1|
| Orange| 2|
+-------------+-----+

https://stackoverflow.com/questions/69012425
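Since the end goal is a word cloud, you typically need the frequencies as a plain Python mapping rather than a Spark dataframe. As a sanity check, here is a hedged plain-Python sketch (no Spark session needed; the `rows` data is illustrative) that mirrors the same explode-then-count logic:

```python
from collections import Counter

# Each inner list mimics one value of the terms_stemmed array column.
rows = [
    ["apple", "banana"],
    ["banana", "orange", "banana"],
    ["orange"],
]

# Flattening the nested lists is the plain-Python analogue of F.explode;
# Counter then plays the role of groupby('terms_stemmed').count().
freq = Counter(word for row in rows for word in row)

print(freq.most_common())  # [('banana', 3), ('orange', 2), ('apple', 1)]
```

In Spark itself, `new_df.orderBy(F.desc('count'))` gives the same descending order, and most word-cloud libraries can consume a dict like `dict(freq)` directly.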