I'm trying to use a regex to extract paragraphs from text of the following form:
<0.30.1.92.13.39.38.marian+@MARIAN.ADM.CS.CMU.EDU (Marian D'Amico).0>
Type: cmu.cs.scs
Topic: LOGIC COLLOQUIUM
Dates: 6-Feb-92
Time: 3:30
Host: Stephen D. Brookes
PostedBy: marian+ on 30-Jan-92 at 13:39 from MARIAN.ADM.CS.CMU.EDU
(Marian D'Amico)
Abstract:
***********************************************************************
Logic Colloquium
Thursday February 6
3:30 Wean 5409
**********************************************************************
On The Mathematics of Non-monotonic Reasoning
Menachem Magidor
Hebrew University of Jerusalem
(Joint work with Daniel Lehman)
Non-monotonic reasoning is an attempt to develop reasoning systems
where an inference means that the conclusion holds in the "normal
case",
in "most cases", but it does not necessarily hold in all cases. It
seems
that this type of reasoning is needed if one wants to model everyday
common-sense reasoning. There have been many models suggested for
non-monotonic reasoning (like circumscription, default logic,
autoepistemic logic, etc). We study all these approaches in a more
abstract fashion by considering the inference relation of the
reasoning system, and clarify the role of different inference rules
and the impact they have on the model theory of the logic. We are
especially interested in a particular rule called "Rational Monotony"
and the connection between it and probabilistic models.
NOTE: Prof. Magidor will also give a Math Department Colloquium on
Friday
February 7.
-------------------------
Host: Stephen D. Brookes
Appointments can be made through Marian D'Amico, marian@cs, x7665.
I am currently trying:
paragraphRegex =paragraphRegex\s*)(?=\n\n)‘
However, this regex captures some cases, while in others it either fails to capture the paragraph or hangs.
Any help would be greatly appreciated.
Posted on 2018-11-26 18:53:17
I would try a different approach.
You can split the text on newlines:
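    texts = text.split('\n')

Then develop a test that decides whether a line is part of the email body or something else. You might look for blocks of text where the neighbouring lines are blank. Something like this might work:

    paragraphs = []
    for i, line in enumerate(texts):
        # Skip the first and last lines so texts[i-1] and texts[i+1] always exist.
        if 0 < i < len(texts) - 1:
            # Keep a non-empty line that follows a blank line and is followed by a non-empty line.
            if line != '' and texts[i-1] == '' and texts[i+1]:
                paragraphs.append(line)

As an aside, you can only get so far with regular expressions. Most text data sources vary a lot in formatting, and your regex will never catch every edge case. I had to do this once, and building a classification model to identify paragraphs was more robust (and easier).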
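To make the blank-line heuristic above a bit more concrete, here is a minimal sketch that groups whole blocks instead of testing each line's neighbours. It assumes paragraphs in your input are separated by at least one blank line; `split_paragraphs` and the sample string are names made up for illustration, not something from the question.

```python
import re

def split_paragraphs(text):
    """Split text into blocks separated by one or more blank lines."""
    # Normalise Windows line endings so '\r\n' does not hide blank lines.
    text = text.replace('\r\n', '\n')
    # A blank line here is a newline, optional spaces or tabs, then another newline.
    blocks = re.split(r'\n[ \t]*\n', text)
    # Trim each block and drop the ones that are empty after trimming.
    return [block.strip() for block in blocks if block.strip()]

# Hypothetical sample: a header block followed by two paragraphs.
sample = (
    "Type: cmu.cs.scs\nTopic: LOGIC COLLOQUIUM\n\n"
    "First paragraph line one\nline two\n\n\n"
    "Second paragraph\n"
)
paragraphs = split_paragraphs(sample)
```

Because the pattern has no nested quantifiers, it should not hang the way a backtracking-heavy regex can, but you would still need a separate check to tell header blocks from body paragraphs.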
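Building a classifier like that is a research project in its own right, but if you go that route, look into term frequency-inverse document frequency (TF-IDF) and support vector classifiers (SVC). Unless you have a lot of good training data, don't let anyone talk you into using a neural network.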
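To give a feel for what TF-IDF features look like, here is a minimal sketch in plain Python. It only computes the weights; in practice you would probably use a library implementation and feed the resulting vectors to a classifier. The function name `tfidf`, the whitespace tokenizer, the unsmoothed `log(N / df)` form of IDF, and the sample lines are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency is the term's share of the document's tokens.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical input: two header lines and one body line from the sample text.
lines = [
    "Type: cmu.cs.scs",
    "Topic: LOGIC COLLOQUIUM",
    "Non-monotonic reasoning is an attempt to develop reasoning systems",
]
features = tfidf(lines)
```

Terms that occur in every document get a weight of zero under this formula, which is the point: words that distinguish header lines from body text end up carrying the signal a classifier can learn from.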
https://stackoverflow.com/questions/53486745