How do I preprocess text to remove stop words?

Stack Overflow user

Asked 2021-09-28 11:22:10

1 answer, 264 views, 0 followers, 0 votes

I want to remove the words in a stop word list, namely

from gensim.parsing.preprocessing import STOPWORDS
print(STOPWORDS)

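In gensim this should be very straightforward with the remove_stopwords function.

Here is my code to read the text and remove the stop words: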

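from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords

def read_text(text_path):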
  text = []
  with open(text_path) as file:
    lines = file.readlines()
    for index, line in enumerate(lines):
      text.append(simple_preprocess(remove_stopwords(line)))
  return text

text = read_text('/content/text.txt')
text =  [x for x in text if x]
text[:3]

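This is the output I get. It contains words such as "we" or "however" that should have been removed from the original text, even though, for example, "the" was correctly removed from the first list. I'm confused... What am I missing here?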

[['clinical', 'guidelines', 'management', 'ibd'],
 ['polygenetic',
  'risk',
  'scores',
  'add',
  'predictive',
  'power',
  'clinical',
  'models',
  'response',
  'anti',
  'tnfα',
  'therapy',
  'inflammatory',
  'bowel',
  'disease'],
 ['anti',
  'tumour',
  'necrosis',
  'factor',
  'alpha',
  'tnfα',
  'therapy',
  'widely',
  'management',
  'crohn',
  'disease',
  'cd',
  'ulcerative',
  'colitis',
  'uc',
  'however',
  'patients',
  'respond',
  'induction',
  'therapy',
  'patients',
  'lose',
  'response',
  'time',
  'to',
  'aid',
  'patient',
  'stratification',
  'polygenetic',
  'risk',
  'scores',
  'identified',
  'predictors',
  'response',
  'anti',
  'tnfα',
  'therapy',
  'we',
  'aimed',
  'replicate',
  'association',
  'polygenetic',
  'risk',
  'scores',
  'response',
  'anti',
  'tnfα',
  'therapy',
  'independent',
  'cohort',
  'patients',
  'establish',
  'clinical',
  'validity']]
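One quick way to confirm which tokens slipped through is to check the output against the stop list. The following is only a minimal sketch: `STOP_SUBSET` is a hypothetical hand-picked subset standing in for gensim's `STOPWORDS`, and `tokens` is an abbreviated copy of the third list above.

```python
# Hypothetical subset of the stop list, for illustration only;
# gensim's STOPWORDS could be substituted here.
STOP_SUBSET = frozenset({"however", "to", "we", "the"})

# Abbreviated copy of the third token list from the output above.
tokens = ["anti", "tumour", "uc", "however", "patients",
          "time", "to", "aid", "therapy", "we", "aimed"]

# Any token that is still a member of the stop list has leaked through.
leaked = [tok for tok in tokens if tok in STOP_SUBSET]
print(leaked)
```

Every leaked word here begins a sentence in the source text, which hints that capitalization is involved.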

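The text (full file available here):

Clinical guidelines for the management of IBD.

Polygenetic risk scores do not add predictive power to clinical models for response to anti-TNFα therapy in inflammatory bowel disease.

Anti-tumour necrosis factor alpha (TNFα) therapy is widely used in the management of Crohn's disease (CD) and ulcerative colitis (UC). However, up to a third of patients do not respond to induction therapy, and another third of patients lose response over time. To aid patient stratification, polygenetic risk scores have been identified as predictors of response to anti-TNFα therapy. We aimed to replicate the association between polygenetic risk scores and response to anti-TNFα therapy in an independent cohort of patients, to establish its clinical validity.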

Tags: python, nlp, gensim, word2vec, stop-words

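1 answer

Stack Overflow user

Accepted answer

Answered 2021-09-28 12:05:49

The remove_stopwords() function is case-sensitive and does not ignore punctuation. For example, "However," is not in STOPWORDS, but "however" is. You should call simple_preprocess() first. This should work: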

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import remove_stopword_tokens

def read_text(text_path):
  text = []
  with open(text_path) as file:
    for line in file:
      # Tokenize first: simple_preprocess lowercases and strips punctuation.
      tokens = simple_preprocess(line)
      # Then filter the normalized tokens against the stop word list.
      text.append(remove_stopword_tokens(tokens, stopwords=STOPWORDS))
  return text
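To see why the order of the two steps matters, here is a minimal pure-Python sketch. It does not call gensim: `STOP` is a hypothetical mini stop list, `filter_raw` is a simplified stand-in for filtering a raw string by whitespace splitting and exact matching, and `tokenize` is a rough stand-in for simple_preprocess (lowercasing and keeping alphabetic runs). All three names are assumptions made for illustration.

```python
import re

# Hypothetical mini stop list; all entries are lowercase, like the real list.
STOP = frozenset({"however", "to", "we", "the"})

def filter_raw(line):
    """Filter before normalizing: split on whitespace, exact match only."""
    return [tok for tok in line.split() if tok not in STOP]

def tokenize(line):
    """Rough stand-in for simple_preprocess: lowercase, keep runs of letters."""
    return re.findall(r"[^\W\d_]+", line.lower())

def filter_tokens(line):
    """Normalize first, then drop stop words."""
    return [tok for tok in tokenize(line) if tok not in STOP]

line = "However, patients lose response. We aimed to replicate the association."

# Question's order: "However," and "We" do not match, so they survive
# and are lowercased afterwards by the tokenizer.
question_order = tokenize(" ".join(filter_raw(line)))

# Answer's order: every token is normalized before the membership test.
answer_order = filter_tokens(line)

print(question_order)
print(answer_order)
```

In this sketch the question's order keeps "however" and "we" while still dropping "to" and "the" in mid-sentence, which matches the symptom described in the question.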

3 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/69360816
