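I have a data file containing text. Some words are contractions, such as "isn't" and "couldn't", and need to be expanded.
For example:
I'd -> I would
I'd -> I had

Below is the data.

DataFrame: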
temp = spark.createDataFrame([
    (0, "Julia isn't awesome"),
    (1, "I wish Java-DL couldn't use case-classes"),
    (2, "Data-science wasn't my subject"),
    (3, "Machine")
], ["id", "words"])
+---+----------------------------------------+
|id |words                                   |
+---+----------------------------------------+
|0  |Julia isn't awesome                     |
|1  |I wish Java-DL couldn't use case-classes|
|2  |Data-science wasn't my subject          |
|3  |Machine                                 |
+---+----------------------------------------+

I have been searching for a PySpark library that does this but have not found one. How can I achieve this?
Expected output:
+---+-----------------------------------------+
|id |words                                    |
+---+-----------------------------------------+
|0  |Julia is not awesome                     |
|1  |I wish Java-DL could not use case-classes|
|2  |Data-science was not my subject          |
|3  |Machine                                  |
+---+-----------------------------------------+

Answered on 2022-07-27 07:47:46
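There may not be a PySpark library for this task, but you can use any Python library. Several solutions for expanding contractions in Python already exist. For example, with a contractions library such as pycontractions, you can write a function and apply it to the DataFrame column.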
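from pycontractions import Contractions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Load your favorite word2vec model. It must be downloaded first; see the pycontractions page for the link.
cont = Contractions('GoogleNews-vectors-negative300.bin')
# Optional: load the models now instead of on the first expand_texts call
cont.load_models()

def expand_contractions(text):
    # expand_texts yields results lazily, so materialize them and take the only one
    out = list(cont.expand_texts([text], precise=True))
    return out[0]

# A PySpark Column has no apply() method, so wrap the function in a UDF.
# This assumes the `cont` object can be serialized and shipped to the executors.
expand_udf = udf(expand_contractions, StringType())
temp = temp.withColumn('expanded_words', expand_udf(temp['words']))

https://stackoverflow.com/questions/73132175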
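
If downloading a word2vec model is too heavy for the job, the "write a function and apply it" idea can also be sketched with a plain lookup table. The following is only an illustration and not part of the original answer: the `CONTRACTIONS` mapping and the `expand_simple` name are hypothetical, the mapping covers just a handful of forms, and a lookup table cannot resolve ambiguous contractions such as I'd (I would vs. I had), which is the case a model-based library is meant to handle.

```python
import re

# Hypothetical, deliberately small mapping; extend it for real data.
# Ambiguous forms such as "I'd" are left out on purpose.
CONTRACTIONS = {
    "isn't": "is not",
    "wasn't": "was not",
    "couldn't": "could not",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation over all keys, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_simple(text):
    """Replace known contractions in text; None is passed through unchanged."""
    if text is None:
        return None
    # Lower-case the match for the lookup; the replacement is always lower case.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


if __name__ == "__main__":
    for sentence in [
        "Julia isn't awesome",
        "I wish Java-DL couldn't use case-classes",
        "Data-science wasn't my subject",
        "Machine",
    ]:
        print(expand_simple(sentence))
```

Because the function is plain Python, it could be wrapped the same way as any other Python function in PySpark, for example with `udf(expand_simple, StringType())` from `pyspark.sql.functions`, and then passed to `withColumn`. Note that this sketch does not preserve capitalization of the contraction itself and only matches the straight apostrophe (').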