Q: How do I split sentences from DataFrames using the nltk library?

Stack Overflow user

Asked 2020-03-26 09:01:06

1 answer, 545 views, 0 followers, 0 votes

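I want to build a bag-of-words model that uses relative frequencies, computed with the nltk package. My data is stored in a pandas DataFrame.

Here is my data: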

text    title   authors label
0   On Saturday, September 17 at 8:30 pm EST, an e...   Another Terrorist Attack in NYC…Why Are we STI...   ['View All Posts', 'Leonora Cravotta']  Real
1   Story highlights "This, though, is certain: to...   Hillary Clinton on police shootings: 'too many...   ['Mj Lee', 'Cnn National Politics Reporter']    Real
2   Critical Counties is a CNN series exploring 11...   Critical counties: Wake County, NC, could put ...   ['Joyce Tseng', 'Eli Watkins']  Real
3   McCain Criticized Trump for Arpaio’s Pardon… S...   NFL Superstar Unleashes 4 Word Bombshell on Re...   []  Real
4   Story highlights Obams reaffirms US commitment...   Obama in NYC: 'We all have a role to play' in ...   ['Kevin Liptak', 'Cnn White House Producer']    Real
5   Obama weighs in on the debate\n\nPresident Bar...   Obama weighs in on the debate   ['Brianna Ehley', 'Jack Shafer']    Real

I tried converting it to strings:

import nltk 
import numpy as np
import random
import bs4 as bs
import re

data = df.astype(str)
data

However, when I try to tokenize the text, I get the following error:

corpus = nltk.sent_tokenize(data['text'])

TypeError: expected string or bytes-like object

So this does not seem to work :( Does anyone know how to tokenize each row of the 'text' column into sentences?

python

pandas

nlp

1 Answer

Stack Overflow user

Answered 2020-03-26 09:47:12

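The NLTK tokenizer functions (nltk.sent_tokenize(), nltk.word_tokenize()) expect a single string as input. You get the error because you are passing a pandas.Series object directly; apply the tokenizer to each row instead.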

Try this for word tokenization:

data['Corpus'] = df.text.apply(lambda x: nltk.word_tokenize(x))

For sent_tokenize, change it to:

data['Sent'] = df.text.apply(lambda x: nltk.sent_tokenize(x))
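
As a rough illustration of the row-wise approach, here is a minimal sketch that splits each row into sentences and then expands the result to one row per sentence. The sample DataFrame is made up, and to stay independent of the downloaded punkt model it uses an untrained PunktSentenceTokenizer (which knows no abbreviations) rather than nltk.sent_tokenize; if the model is installed, nltk.sent_tokenize can be passed to apply in the same way.

```python
import pandas as pd
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({
    "text": ["Obama weighs in on the debate. He spoke on Tuesday.", None],
    "label": ["Real", "Real"],
})

# Assumption: an untrained Punkt tokenizer is good enough for illustration.
# It needs no downloaded data but will split after abbreviations like "Dr.".
tokenizer = PunktSentenceTokenizer()

# Make sure every cell is a str; missing values become empty strings.
texts = df["text"].fillna("").astype(str)

# One list of sentences per row.
df["sentences"] = texts.apply(tokenizer.tokenize)

# One row per sentence; rows with an empty list become NaN and are dropped.
sentences = df.explode("sentences").dropna(subset=["sentences"])
print(sentences[["sentences", "label"]])
```

Filling missing values before tokenizing matters because a NaN cell is a float, which would raise the same TypeError as in the question.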

If you also want to strip punctuation:

data['no_punc'] = df.text.apply(lambda x: nltk.RegexpTokenizer(r'\w+').tokenize(x))
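
Since the question's goal is a bag-of-words model with relative frequencies, the punctuation-free tokens can be fed into nltk's FreqDist. The following is only a sketch under assumptions: the sample texts, the lowercasing step, and the helper name relative_frequencies are illustrative and not part of the original answer.

```python
import pandas as pd
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample data standing in for the question's 'text' column.
df = pd.DataFrame({"text": ["The cat sat. The cat ran!", "A dog sat."]})

# Same pattern as above: keep runs of word characters, drop punctuation.
tokenizer = RegexpTokenizer(r"\w+")


def relative_frequencies(text):
    """Map each lowercased token to its share of all tokens in `text`."""
    tokens = tokenizer.tokenize(text.lower())
    dist = FreqDist(tokens)
    # FreqDist.freq(word) is count(word) / total number of tokens.
    return {word: dist.freq(word) for word in dist}


# One dict per document, then one column per word; absent words get 0.0.
rows = df["text"].fillna("").astype(str).apply(relative_frequencies).tolist()
bow = pd.DataFrame(rows, index=df.index).fillna(0.0)
print(bow)
```

Each row of bow then sums to 1 for non-empty documents, so documents of different lengths are directly comparable.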

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60863819
