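I have data from the natural language inference corpora (SNLI, multiNLI) in the following format:

    '( ( Two ( blond women ) ) ( ( are ( hugging ( one another ) ) ) . ) )'

These are supposed to be binary trees (some are not very clean).
I would like to parse some of my own sentences into this format. How can I do that with NLTK or a similar tool?
I have found the StanfordParser, but I have not figured out how to get this kind of parse from it.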
Posted on 2017-06-26 00:30:04
Any tree can be converted into a binary tree that preserves its constituents. Here is a simple solution that works on nltk.Tree input:
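
    from functools import reduce

    from nltk import Tree


    def binarize(tree):
        """
        Recursively turn a tree into a binary tree.
        """
        if isinstance(tree, str):
            # A leaf (token): nothing to binarize.
            return tree
        elif len(tree) == 1:
            # Unary node: collapse it and continue with its only child.
            return binarize(tree[0])
        else:
            # Left-fold the children into nested pairs that reuse this node's label.
            label = tree.label()
            return reduce(lambda x, y: Tree(label, (binarize(x), binarize(y))), tree)

If you want plain tuples instead of Tree, replace the last return statement with: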
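
            # Binarize each child once, then fold. binarize() calls tree.label(),
            # which the intermediate tuples do not have, so it must not see them.
            return reduce(lambda x, y: (x, y), map(binarize, tree))

Example: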

    >>> t = Tree.fromstring('''(ROOT (S (NP (NNP Oracle))
    ...     (VP (VBD had) (VP (VBN fought) (S (VP (TO to)
    ...     (VP (VB keep) (NP (DT the) (NNS forms))
    ...     (PP (IN from) (S (VP (VBG being) (VP (VBN released))))))))))))''')
    >>> bt = binarize(t)
    >>> print(t)
    (ROOT
      (S
        (NP (NNP Oracle))
        (VP
          (VBD had)
          (VP
            (VBN fought)
            (S
              (VP
                (TO to)
                (VP
                  (VB keep)
                  (NP (DT the) (NNS forms))
                  (PP (IN from) (S (VP (VBG being) (VP (VBN released))))))))))))
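    >>> print(bt)
    (S
      Oracle
      (VP
        had
        (VP
          fought
          (VP
            to
            (VP (VP keep (NP the forms)) (PP from (VP being released)))))))

This guarantees a binary structure, but not necessarily the correct one. Broad-coverage parsers produce non-binary branching because some attachment choices are notoriously hard. (Consider the classic "I saw the girl with the telescope": is the PP "with the telescope" inside the object NP, or is it part of the VP?) So proceed with caution.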
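
To make the caveat concrete, here is a minimal sketch in plain Python (no NLTK needed) that works on nested tuples like the ones the tuple variant returns. The names `to_snli`, `fold_left` and `fold_right` are hypothetical helpers, not part of NLTK, and the spacing around the brackets is only an assumption copied from the sample string in the question. It renders a binary tree in that bracketed style and shows that folding a flat three-child node to the left or to the right yields the two competing attachments:

```python
from functools import reduce


def to_snli(node):
    """Render a binary tree (nested 2-tuples with string leaves) as a bracketed string."""
    if isinstance(node, str):
        return node
    left, right = node  # raises ValueError if a node is not binary
    return "( " + to_snli(left) + " " + to_snli(right) + " )"


def fold_left(children):
    """Left-branching fold: [a, b, c] -> ((a, b), c)."""
    return reduce(lambda acc, child: (acc, child), children)


def fold_right(children):
    """Right-branching fold: [a, b, c] -> (a, (b, c))."""
    return reduce(lambda acc, child: (child, acc), reversed(children))


# The sentence from the question, written out by hand as nested tuples.
snli_tree = (("Two", ("blond", "women")),
             (("are", ("hugging", ("one", "another"))), "."))
print(to_snli(snli_tree))

# A flat verb phrase with three children: verb, object NP, PP.
flat_vp = ["saw", ("the", "girl"), ("with", ("the", "telescope"))]
vp_attachment = to_snli(fold_left(flat_vp))   # PP is a sister of "saw the girl"
np_attachment = to_snli(fold_right(flat_vp))  # PP sits inside the object
print(vp_attachment)
print(np_attachment)
```

The `reduce` call in `binarize` is a left fold, so on a flat node like this it always produces the first of the two bracketings, whether or not that is the intended reading. Since an nltk.Tree with two children unpacks the same way as a 2-tuple, `to_snli` should also accept the Tree output of `binarize`.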
https://stackoverflow.com/questions/44742809