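I am trying to apply text preprocessing to a pandas column using spaCy. My goal is to clean the text and then use the cleaned column together with the other columns for further analysis.
Data: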
   category                                            content
0  business  Quarterly profits at US media giant TimeWarne...
1  business  The dollar has hit its highest level against ...
2  business  The owners of embattled Russian oil giant Yuk...
3  business  British Airways has blamed high fuel prices f...
4  business  Shares in UK drinks and food firm Allied Dome...
My preprocessing:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(str(df['content']))
new_corpus = [[words.lemma_ for words in docs if (not words.is_stop and not words.is_punct and not words.like_num)] for docs in doc]
corpus_clean = [[word.lower() for word in docu if (word.isalpha())] for docu in new_corpus]
Error:
TypeError: 'spacy.tokens.token.Token' object is not iterable
Posted on 2022-07-28 14:58:34
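The problem is in the data conversion.
You want a list of the "content" values, but instead you are converting the whole content column into a single string.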
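To see why this matters, here is a minimal sketch using a small made-up DataFrame (the two rows are placeholders, not the asker's data). str() on a Series returns the printed representation of the whole column as one string, so nlp() builds a single Doc from it. Iterating over that Doc yields Token objects, and the inner loop then tries to iterate over a Token, which raises the TypeError.

```python
import pandas as pd

# Hypothetical two-row stand-in for the asker's DataFrame.
df = pd.DataFrame({
    "content": [
        "The dollar has hit its highest level.",
        "British Airways has blamed high fuel prices.",
    ],
})

# str() gives ONE string: the printed form of the Series,
# including the index and the trailing "Name: content, ..." line.
as_string = str(df["content"])

# tolist() gives a list with one string per row.
as_list = df["content"].tolist()

print(type(as_string).__name__)
print(type(as_list).__name__, len(as_list))
```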
You should change this line:
doc = nlp(str(df['content']))
to this:
doc = nlp.pipe(df['content'].tolist())
https://stackoverflow.com/questions/73154708