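I'm currently trying to split a string that contains an entire text document into sentences so that I can convert it to CSV. Naturally, I would use the period as the delimiter and call str.split('.'), but the document contains the abbreviations "i.e." and "e.g.", and in those cases I want the periods to be ignored.
For example: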
Original text: During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors.
Resulting list: ["During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing", "ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."]
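So far, my only workaround is to replace every "i.e." and "e.g." with "ie" and "eg", which is neither efficient nor grammatically desirable. I've been playing around with Python's regex library, which I suspect holds the answer I'm after, but my knowledge of it is novice at best.
This is my first time posting a question here, so I apologize if I've used the wrong format or wording.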
Posted on 2021-07-12 01:42:29
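See "How can I split a text into sentences?", which suggests using the Natural Language Toolkit (NLTK).
The deeper reason for doing it that way is easiest to show with an example:
My name is I. Brown. I bet I make a sentence hard to parse. No one is better suited to this task than I.
How would you split that into separate sentences?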
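You need semantics (a proper sentence usually consists of a subject, an object, and a verb), and regular expressions cannot capture that. Regex is very good at syntax, but not so good at semantics.
To prove the point: the answer someone else proposed there, which involves a lot of complicated regular expressions and is fairly slow, has 115 upvotes, and it would still break on my simple sentences.
This is an NLP problem, so I've linked to an answer that points to an NLP package.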
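To make the point above concrete, here is a minimal sketch of a regex-only splitter. It is not taken from the linked answer: the `BOUNDARY` pattern and the `naive_split` helper are names invented for illustration, and the rule (split after a period followed by whitespace, except after "i.e." or "e.g.") is just one assumption about what such a splitter might look like.

```python
import re

# Assumed rule: a sentence ends at a period followed by whitespace,
# unless that period closes "i.e." or "e.g." (fixed-width lookbehinds).
BOUNDARY = re.compile(r"(?<=\.)(?<!i\.e\.)(?<!e\.g\.)\s+")


def naive_split(text):
    """Split text at BOUNDARY; a hypothetical helper, not an NLP tokenizer."""
    return BOUNDARY.split(text.strip())


# The abbreviation from the question is left alone...
print(naive_split("Policies changed, i.e. goals shifted. ISPs adapted."))
# ...but an initial is wrongly treated as the end of a sentence.
print(naive_split("My name is I. Brown. I bet I am hard to parse."))
```

The first call keeps "i.e." inside its sentence, while the second cuts "I. Brown" in two. Each new abbreviation or initial would need its own exception in the pattern, which is why a trained sentence tokenizer tends to be the more robust choice.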
Posted on 2021-07-12 00:26:59
This should work!
import re

p = "During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."
sentences = []  # renamed from `list` so the built-in is not shadowed
while len(p) > 0:
    string = ""
    while True:
        # Take a run of capitals plus everything up to the next capital.
        match = re.search("[A-Z]+[^A-Z]+", p)
        if match is None:
            # Nothing left to match: keep the remainder so the outer loop ends.
            string += p
            p = ""
            break
        string += p[:match.end()]
        p = p[match.end():]
        # A chunk ending in ". " means the next capital starts a new sentence.
        if match.group(0).endswith(". "):
            break
    sentences.append(string)
print(sentences)

Posted on 2021-07-12 01:53:31
Here is a rough implementation.
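inp = input()
res = []
last = 0  # start index of the sentence currently being collected
for x in range(len(inp)):
    if x > 1:
        # A period ends a sentence only if no other period sits two
        # characters before or after it (this rules out "i.e." and "e.g.").
        if inp[x] == "." and inp[x-2] != ".":
            if x < len(inp) - 2:
                if inp[x+2] != ".":
                    res.append(inp[last:x])
                    last = x + 2  # skip the period and the space after it
# Whatever remains is the last sentence; [:-1] drops its final period.
res.append(inp[last:-1])
print(res)

If I use your input, I get this output (hopefully this is what you're looking for):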
['During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing', 'ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors']

Note: if the text you're working with doesn't follow the usual punctuation rules (for example, no space after the period before a new sentence starts), you may need to adjust this code.
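For the CSV step mentioned in the question, a minimal sketch might look like the following. It assumes one sentence per row, which the question does not actually specify, and `sentences_to_csv` is a hypothetical helper name. The standard `csv` module is used so that sentences containing commas are quoted rather than spilling into extra columns.

```python
import csv
import io


def sentences_to_csv(sentences):
    """Return CSV text with one sentence per row (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for sentence in sentences:
        # Each row has a single column; surrounding whitespace is trimmed.
        writer.writerow([sentence.strip()])
    return buf.getvalue()


print(sentences_to_csv([
    "During this time, routing changed. ",
    "ISPs began to support routing policies, i.e. goals held by the owner.",
]))
```

To write to disk instead, the same writer can be pointed at a file opened with `newline=""`, as the `csv` documentation recommends.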
https://stackoverflow.com/questions/68340797