
Count the resulting words of a stopWord list

Stack Overflow user
Asked on 2021-09-01 10:54:18
Answers: 1, Views: 186, Followers: 0, Votes: 1

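I have finished cleaning the data in my PySpark dataframe, including removing stop words. Removing stop words produces, for each row, a list of the words that are not stop words. Now I want to count all the words remaining in that column to build a word cloud or a word-frequency table.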

This is my PySpark dataframe:

+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|             content|score|label|classWeigth|               words|            filtered|       terms_stemmed|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|absolutely love d...|    5|    1|       0.48|[absolutely, love...|[absolutely, love...|[absolut, love, d...|
|absolutely love t...|    5|    1|       0.48|[absolutely, love...|[absolutely, love...|[absolut, love, g...|
|absolutely phenom...|    5|    1|       0.48|[absolutely, phen...|[absolutely, phen...|[absolut, phenome...|
|absolutely shocki...|    1|    0|       0.52|[absolutely, shoc...|[absolutely, shoc...|[absolut, shock, ...|
|accept the phone ...|    1|    0|       0.52|[accept, the, pho...|[accept, phone, n...|[accept, phone, n...|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+

terms_stemmed is the last column, and I want to derive a new dataframe from it that looks like this:

+-------------+--------+
|terms_stemmed| count  |
+-------------+--------+
|app          |  592059|
|use          |  218178|
|good         |  187671|
|like         |  155304|
|game         |  149941|
|....         |   .... |

Can anyone help me?


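1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-09-01 11:01:24

One option is to use explode, which turns each element of the terms_stemmed array into its own row so the terms can then be grouped and counted: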

import pyspark.sql.functions as F

new_df = df\
  .withColumn('terms_stemmed', F.explode('terms_stemmed'))\
  .groupby('terms_stemmed')\
  .count()
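
To make the logic of explode followed by groupby and count concrete, here is a minimal plain-Python sketch of the same computation, without Spark. The function name count_terms and the sample rows are illustrative assumptions, not part of the original answer.

```python
from collections import Counter
from itertools import chain


def count_terms(rows):
    """Flatten per-row term lists and count each term, most frequent first.

    count_terms is a hypothetical helper used only for illustration.
    """
    # Flattening the lists plays the role of explode: one item per term.
    flat_terms = chain.from_iterable(rows)
    # Counter plays the role of groupby(...).count(); most_common() sorts
    # the result by descending count.
    return Counter(flat_terms).most_common()


# Assumed sample data: one list of terms per dataframe row.
rows = [
    ["Apple", "Banana"],
    ["Banana", "Orange", "Banana"],
    ["Orange"],
]

counts = count_terms(rows)
print(counts)
```

This only illustrates the semantics on a small in-memory list; on a real Spark dataframe the explode approach keeps the work distributed.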

Example

import pyspark.sql.functions as F

df = spark.createDataFrame([
  (1, ["Apple", "Banana"]),
  (2, ["Banana", "Orange", "Banana"]),
  (3, ["Orange"])
], ("id", "terms_stemmed"))

df.show(truncate=False)

+---+------------------------+
|id |terms_stemmed           |
+---+------------------------+
|1  |[Apple, Banana]         |
|2  |[Banana, Orange, Banana]|
|3  |[Orange]                |
+---+------------------------+



new_df = df\
  .withColumn('terms_stemmed', F.explode('terms_stemmed'))\
  .groupby('terms_stemmed')\
  .count()

new_df.show()

+-------------+-----+
|terms_stemmed|count|
+-------------+-----+
|       Banana|    3|
|        Apple|    1|
|       Orange|    2|
+-------------+-----+
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/69012425
