Unicode error in the tokenization step only when stop word removal is applied in Python 2

Stack Overflow user

Asked on 2022-03-06 06:40:52

1 answer, 77 views, 0 followers, 0 votes

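I am trying to run this script (linked in the original post). The only difference is that, instead of TEST_SENTENCES, I need to read my own dataset (the text column). The only catch is that I need to remove stop words from that column before passing it to the rest of the code.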

import pandas as pd

df = pd.DataFrame({'text': ['the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
                            'The wireless internet was unreliable. ', 'i am still her . :). ',
                            'I appreciate your help ', 'I appreciate your help '], 'sentiment':[
    'positive', 'negative', 'neutral', 'positive', 'neutral']})

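When I build the DataFrame this way, the error does not occur. It is raised only when I load a CSV file containing exactly the same data.

When I add this line of code to remove the stop words: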

df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1') for word in x.split() if word not in (stop)]))
TEST_SENTENCES = df['text_without_stopwords']

it keeps raising this error: ValueError: All sentences should be Unicode-encoded!

The error is raised in the tokenization step:

tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)

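I would like to understand what is happening here that causes this error, and what the correct fix for the code is.

(I have tried different encodings such as utf-8, but none of them worked.)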

pandas

python-2.7

csv

unicode

stop-words

1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-03-10 18:28:28

I don't know the reason yet, but when I ran

df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')

it worked.

I am still curious why this happens only when I do stop word removal.
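The thread does not confirm a cause, so the following is only a sketch of one way to make sure every sentence is a text (unicode) string before tokenization, whichever way the data was loaded. One possibility worth checking, stated as an assumption: in Python 2, values read from a CSV can be byte strings (str), and ' '.join([]) on an empty word list returns a byte string as well, either of which a unicode check would reject. The helper names to_text and remove_stopwords, the stop set, and the sample rows are hypothetical and not part of the original script.

```python
try:
    text_type = unicode  # Python 2: the real text type
except NameError:
    text_type = str  # Python 3: str is already text


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings if necessary."""
    if isinstance(value, bytes):
        # Byte string (Python 2 str or Python 3 bytes): decode it.
        return value.decode(encoding, "ignore")
    return text_type(value)


def remove_stopwords(sentence, stop_words):
    """Drop stop words and always return a text (unicode) string."""
    words = to_text(sentence).split()
    # A unicode separator keeps the result unicode even when no words remain.
    return u" ".join(w for w in words if w not in stop_words)


# Hypothetical stop word list and sample rows, for illustration only.
stop = {u"the", u"is", u"was"}
rows = [b"The wireless internet was unreliable. ", u"I appreciate your help "]

cleaned = [remove_stopwords(r, stop) for r in rows]
print(cleaned)
print(all(isinstance(s, text_type) for s in cleaned))
```

With a DataFrame, the same helper could be applied as df['text'].apply(lambda x: remove_stopwords(x, stop)), which should leave the column in the same state that astype('unicode') produces in the accepted fix.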

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71368195
