Split text on periods except in certain cases

Stack Overflow user
Asked 2021-07-12 00:24:08
Answers: 3 · Views: 193 · Followers: 0 · Votes: 1

I am currently trying to split a string containing an entire text document into sentences so that I can convert it to a CSV. Naturally, I would use the period as the delimiter and call str.split('.'), but the document contains the abbreviations "i.e." and "e.g.", and in those cases I want the periods to be ignored.

For example,

Original text: During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors.

Resulting list: ["During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing", "ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."]

So far, my only workaround is to replace every "i.e." and "e.g." with "ie" and "eg", which is both inefficient and grammatically undesirable. I have been fiddling with Python's regex library, which I suspect holds the answer I want, but my knowledge of it is novice at best.
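For reference, the problem and a variant of the workaround described above can be reproduced with a short sketch. The sample string, the `<IE>`/`<EG>` placeholders, and the helper name `split_with_placeholders` are illustrative assumptions rather than part of the original post, and the helper restores the abbreviations after splitting instead of leaving them as "ie" and "eg".

```python
text = "ISPs support routing policies, i.e. goals held by the owner. They vary."

# Naive split: the periods inside "i.e." are treated as sentence ends.
naive = text.split(".")

def split_with_placeholders(s):
    """Hide the abbreviations, split on periods, then restore them."""
    # Assumption: the placeholder tokens never occur in the real text.
    hidden = s.replace("i.e.", "<IE>").replace("e.g.", "<EG>")
    parts = [p.strip() for p in hidden.split(".") if p.strip()]
    return [p.replace("<IE>", "i.e.").replace("<EG>", "e.g.") for p in parts]

print(naive)
print(split_with_placeholders(text))
```

The naive split cuts "i.e." into fragments, while the placeholder version keeps it inside its sentence. It still requires listing every abbreviation by hand.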

This is my first time posting a question here, so I apologize if I have used incorrect formatting or wording.


3 Answers

Stack Overflow user

Accepted answer

Answered 2021-07-12 01:42:29

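See "How can I split a text into sentences?", which suggests using the Natural Language Toolkit (NLTK).

The deeper reason for doing it that way is best explained with an example:

My name is I. Brown. I bet I make a sentence hard to parse. No one is better suited to this task than I.

How would you split that into separate sentences?

You need semantics (a well-formed sentence usually consists of a subject, an object, and a verb), and a regular expression cannot capture them. Regex is very good at syntax, but not at semantics.

To prove the point: the answer someone else gave to that question, which has 115 upvotes, involves a lot of complex regex, is fairly slow, and would still break on my simple sentences.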
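A minimal sketch of that failure mode follows. The pattern is a hypothetical one written for this illustration, not the regex from the linked answer: it splits on whitespace that follows a period and precedes a capital letter, unless the period closes "i.e." or "e.g.". The names `BOUNDARY` and `regex_split` and both sample strings are assumptions.

```python
import re

# Hypothetical splitter: break on whitespace that follows a period and precedes
# a capital letter, unless the period closes "i.e." or "e.g.".
BOUNDARY = re.compile(r"(?<!\bi\.e\.)(?<!\be\.g\.)(?<=\.)\s+(?=[A-Z])")

def regex_split(text):
    return BOUNDARY.split(text)

routing = ("Shortest-path routing was insufficient. ISPs added routing "
           "policies, i.e. goals held by the router's owner.")
tricky = "My name is I. Brown. I bet I make a sentence hard to parse."

print(regex_split(routing))  # two sentences, "i.e." left intact
print(regex_split(tricky))   # three pieces: the initial "I." is cut off
```

The rule handles the abbreviations it was told about, but it has no way of knowing that "I." in "I. Brown" is an initial rather than the end of a sentence.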

This is an NLP problem, so I linked to an answer that recommends an NLP package.
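As a rough sketch of what using such a package might look like, the helper below wraps NLTK's `sent_tokenize`. The wrapper name `split_with_nltk` is hypothetical, NLTK must be installed separately, and the name of the tokenizer data to download (`punkt` or `punkt_tab`) depends on the installed NLTK version, so both download calls are assumptions to adjust as needed.

```python
def split_with_nltk(text):
    """Split text into sentences with NLTK's pretrained sentence tokenizer."""
    import nltk  # third-party dependency: pip install nltk

    try:
        return nltk.sent_tokenize(text)
    except LookupError:
        # The tokenizer data is not installed yet; fetch it once and retry.
        # Assumption: one of these resource names matches your NLTK version.
        nltk.download("punkt")
        nltk.download("punkt_tab")
        return nltk.sent_tokenize(text)
```

Calling `split_with_nltk(document_text)` returns a list of sentence strings. How well it handles abbreviations and initials depends on the pretrained model, so it is worth checking the result on a sample of the real document.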

Votes: 1

Stack Overflow user

Answered 2021-07-12 00:26:59

This should work!

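```python
import re

p = "During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."

sentences = []  # renamed from `list` to avoid shadowing the built-in
while p:
    sentence = ""
    while True:
        # One chunk: a run of capitals followed by everything up to the next capital.
        match = re.search(r"[A-Z]+[^A-Z]+", p)
        if match is None:
            # No capital-led chunk remains: keep the leftover text and stop,
            # so the outer loop cannot spin forever.
            sentence += p
            p = ""
            break
        # Consume everything through the end of the match.
        sentence += p[:match.end()]
        p = p[match.end():]
        # A chunk ending in ". " just before a capital letter closes the sentence.
        if match.group(0).endswith(". "):
            break
    sentences.append(sentence)

print(sentences)
```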
Votes: 1

Stack Overflow user

Answered 2021-07-12 01:53:31

Here is a rough implementation.

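```python
inp = input()
res = []
last = 0  # start index of the sentence currently being collected
for x in range(2, len(inp) - 2):
    # A period ends a sentence unless another period sits two characters
    # before or after it, as in "i.e." or "e.g.".
    if inp[x] == "." and inp[x - 2] != "." and inp[x + 2] != ".":
        res.append(inp[last:x])
        last = x + 2  # skip the period and the space that follows it
# Append the final sentence, dropping its trailing period if there is one.
tail = inp[last:]
res.append(tail[:-1] if tail.endswith(".") else tail)
print(res)
```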

With your input, I get this output (hopefully this is what you are looking for):

```
['During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing', 'ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors']
```

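Note: if the text you are using does not follow the usual writing conventions (for example, no space after the period that ends a sentence), you may need to adjust this code.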

Votes: 1
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/68340797
