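I am extracting causal sentences from accident reports on water. I am using NLTK as the tool. I created my regexp grammar manually, using 20 causal sentence structures as examples. The constructed grammar is of the following type: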
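grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
The grammar now has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences), but its precision is low. I would like to ask: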
- There was poor sanitation in the village, as a consequence, she had health problems.
- The water was impure in her village, For this reason, she suffered from parasites.
- She had health problems because of poor sanitation in the village.
I want to extract only sentences of the above type from a large text.
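For reference, the recall and precision figures mentioned above can be computed as in the following minimal sketch. The function name `precision_recall` and the count of 40 falsely matched non-causal sentences are assumptions made purely for illustration; the question only states that recall is 100% and precision is low on a 50/50 dataset.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # causal, matched
    fp = sum(1 for p, a in pairs if p and not a)    # non-causal, matched
    fn = sum(1 for p, a in pairs if not p and a)    # causal, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy dataset shaped like the one in the question: 50 causal, 50 non-causal.
actual = [True] * 50 + [False] * 50
# Hypothetical outcome (invented numbers): the grammar matches every causal
# sentence but also 40 of the 50 non-causal ones.
predicted = [True] * 50 + [True] * 40 + [False] * 10

precision, recall = precision_recall(predicted, actual)
```

A pattern with several `<.*>+` wildcards matches almost any tagged sentence that contains a noun followed by a past-tense verb, which is consistent with perfect recall and low precision.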
Posted on 2012-10-25 23:49:39
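I had a brief discussion with Mr. Jacob Perkins, the author of the book "Python Text Processing with NLTK 2.0 Cookbook". He said: "A generalized grammar for sentences is hard. I would instead see if you can find common tag patterns, and use those. But then you're essentially doing classification by regexp matching. Parsing is generally used to extract phrases within a sentence, or to produce deep parse trees of a sentence, but you're just trying to identify/extract sentences, which is why I think classification is a much better approach. Consider including tagged words as features when you try this, since the grammar could be significant."

Following his suggestion, I looked at my causal sentences and found that they contain words such as: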
consequently
as a result
Therefore
as a consequence
For this reason
For all these reasons
Thus
because
since
because of
on account of
due to
for the reason
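so, that

These words are what connect the cause and the effect within a sentence. Using these connectors, it is now easy to extract causal sentences. A detailed report is available on arXiv: https://arxiv.org/pdf/1507.02447.pdf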
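The connector-based extraction described above might be sketched as follows. This is only an illustration using Python's standard `re` module, not the implementation from the report; the names `CONNECTIVES`, `find_connective`, and `extract_causal` are hypothetical, and "so, that" is left out because its intended form is unclear from the list.

```python
import re

# Connectives taken from the list above ("so, that" omitted, see lead-in).
CONNECTIVES = [
    "consequently", "as a result", "therefore", "as a consequence",
    "for this reason", "for all these reasons", "thus", "because",
    "since", "because of", "on account of", "due to", "for the reason",
]

# Try longer phrases first so "because of" is reported rather than "because".
_alternatives = "|".join(
    re.escape(c) for c in sorted(CONNECTIVES, key=len, reverse=True)
)
CONNECTIVE_RE = re.compile(r"\b(?:" + _alternatives + r")\b", re.IGNORECASE)


def find_connective(sentence):
    """Return the first connective found in the sentence (lowercased), or None."""
    match = CONNECTIVE_RE.search(sentence)
    return match.group(0).lower() if match else None


def extract_causal(sentences):
    """Keep only the sentences that contain at least one connective."""
    return [s for s in sentences if find_connective(s) is not None]


sentences = [
    "There was poor sanitation in the village, as a consequence, she had health problems.",
    "The water was impure in her village, For this reason, she suffered from parasites.",
    "She had health problems because of poor sanitation in the village.",
    "She walked to the river every morning.",
]
causal = extract_causal(sentences)
```

Plain keyword matching will still produce false positives, for example a temporal "since" as in "since 2010", so a filter like this is probably best treated as a first pass or as a feature for a classifier.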
https://stackoverflow.com/questions/13068386