Q: Converting an NLTK phrase structure tree to BRAT .ann standoff

Stack Overflow user

Asked 2014-04-18 01:32:00

1 answer, 1.8K views, 0 followers, 3 votes

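I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard in terms of part-of-speech annotation, but differs when it comes to phrases/chunks.

Accordingly, I've POS-tagged my data with the NLTK defaults and built a regex chunker with nltk.RegexpParser. Basically, the output is now an NLTK-style phrase structure tree:

Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])

There are some things I'd like to annotate manually on top of this, however: systemic grammars break participants and verbal groups down into subtypes that probably can't be annotated automatically. So I'm hoping to convert the parse tree format into something an annotation tool (preferably BRAT) can process, and then go through the text and specify the subtypes by hand, as in (one possible solution):

Perhaps the solution would be to trick BRAT into treating phrase structure like dependencies? I could modify the chunking regex if need be. Are there any converters out there? (Brat provides means of converting from CONLL2000 and Stanford, so if I could get the phrase structure into either of those forms, that would be acceptable too.)

Thanks!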

python

nlp

nltk

stanford-nlp

corpus

1 Answer

Stack Overflow user

Accepted answer

Answered 2014-05-12 15:03:35

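Representing non-binary trees as arcs will be difficult, but it is possible to nest "entity" annotations and use them for a constituency parse structure. Note that I am not creating nodes for the terminals of the tree (the part-of-speech tags), in part because Brat is currently not good at displaying the unary rules that often apply to terminals. The target format is Brat's standoff format, which is described in the Brat documentation.

First, we need a function that produces standoff annotations. Brat expects standoff in terms of character offsets, but in the function below we only use token offsets; the conversion to characters comes further down.

(Note that this uses NLTK 3.0b and Python 3.)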

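def _standoff(path, leaves, slices, offset, tree):
    """Collect leaves and slices for `tree`; return its width in tokens."""
    width = 0
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            # Terminal: a (token, POS tag) pair; only the token is recorded
            tok, tag = child
            leaves.append(tok)
            width += 1
        else:
            # Non-terminal: recurse, tracking the child-index path from the root
            path.append(i)
            width += _standoff(path, leaves, slices, offset + width, child)
            path.pop()
    # Appended after the children, so slices come out in post-order
    slices.append((tuple(path), tree.label(), offset, offset + width))
    return width


def standoff(tree):
    """Return (leaf tokens, [(path, label, start token, stop token), ...])."""
    leaves = []
    slices = []
    _standoff([], leaves, slices, 0, tree)
    return leaves, slices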

Applying this to your example:

>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
>>> standoff(tree)
(['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
 [((0, 0, 0), 'Participant', 0, 1),
  ((0, 0, 1), 'Verbal-group', 1, 2),
  ((0, 0, 2), 'Participant', 2, 4),
  ((0, 0, 3), 'Circumstance', 4, 7),
  ((0, 0), 'Process-dependencies', 0, 7),
  ((0,), 'Clause', 0, 7),
  ((), 'S', 0, 8)])

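This returns the leaf tokens, followed by a list of tuples, one per subtree, with the elements (path of child indices from the root, label, start leaf, stop leaf). The stop index is exclusive.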
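Because each tuple carries its path from the root, the nesting can be recovered without the tree itself: a node's parent is the entry whose path is the same tuple minus its last index. The following is only a minimal sketch of that idea; `parent_index` is a hypothetical helper name that is not part of the answer's code, and the `slices` list is copied from the output above.

```python
def parent_index(slices):
    """Map each slice's list position to its parent's position (None for the root).

    Hypothetical helper: relies only on the path tuples produced by standoff().
    """
    by_path = {path: i for i, (path, _label, _start, _stop) in enumerate(slices)}
    parents = {}
    for i, (path, _label, _start, _stop) in enumerate(slices):
        # The parent's path is this node's path without the final child index
        parents[i] = by_path.get(path[:-1]) if path else None
    return parents


# Slices copied from the standoff(tree) output shown above
slices = [
    ((0, 0, 0), 'Participant', 0, 1),
    ((0, 0, 1), 'Verbal-group', 1, 2),
    ((0, 0, 2), 'Participant', 2, 4),
    ((0, 0, 3), 'Circumstance', 4, 7),
    ((0, 0), 'Process-dependencies', 0, 7),
    ((0,), 'Clause', 0, 7),
    ((), 'S', 0, 8),
]
parents = parent_index(slices)
for i, (path, label, start, stop) in enumerate(slices):
    parent = parents[i]
    print(label, '->', slices[parent][1] if parent is not None else None)
```

This kind of lookup could be useful later if you want to express the constituency structure explicitly rather than leaving it implicit in the nested spans.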

To convert this to character standoff:

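def char_standoff(tree):
    """Return (text, [(path, label, start char, end char), ...])."""
    leaves, tok_standoff = standoff(tree)
    # The text is the tokens joined by single spaces
    text = ' '.join(leaves)
    # Map leaf index to its start character; one extra entry marks the end
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1
    starts.append(offset)
    # Subtract 1 to drop the separating space before the next token
    return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                  for path, label, start_tok, end_tok in tok_standoff]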

Then:

>>> char_standoff(tree)
('This is a representation of the grammar .',
 [((0, 0, 0), 'Participant', 0, 4),
  ((0, 0, 1), 'Verbal-group', 5, 7),
  ((0, 0, 2), 'Participant', 8, 24),
  ((0, 0, 3), 'Circumstance', 25, 39),
  ((0, 0), 'Process-dependencies', 0, 39),
  ((0,), 'Clause', 0, 39),
  ((), 'S', 0, 41)])
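Note that these offsets refer to the space-joined token string, not to the original corpus text. If you would rather annotate the untokenized text, one possible alternative is to locate each token in the raw string and take the offsets from there. The sketch below assumes that every token appears verbatim and in order in the raw text, which does not hold for tokenizers that rewrite characters (for example, Treebank-style conversion of double quotes). `align_tokens` and `char_span` are hypothetical names, not part of the answer's code.

```python
def align_tokens(raw_text, tokens):
    """Return the (start, end) character offsets of each token in raw_text.

    Assumes every token occurs verbatim, in order, in raw_text.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        start = raw_text.find(tok, cursor)
        if start == -1:
            raise ValueError('token {!r} not found after offset {}'.format(tok, cursor))
        end = start + len(tok)
        spans.append((start, end))
        cursor = end  # Never match earlier than the previous token's end
    return spans


def char_span(spans, start_tok, end_tok):
    """Convert a [start_tok, end_tok) token slice to character offsets."""
    return spans[start_tok][0], spans[end_tok - 1][1]


raw = 'This is a representation of the grammar.'
tokens = ['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.']
spans = align_tokens(raw, tokens)
print(char_span(spans, 2, 4))  # the second 'Participant' slice
print(char_span(spans, 0, 8))  # the whole sentence
```

With this approach the .txt file would contain the raw text unchanged, and the token slices from standoff(tree) would be mapped through char_span instead of the starts list.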

Finally, we can write a function that converts this to Brat's format:

def write_brat(tree, filename_prefix):
    """Write <prefix>.txt (the text) and <prefix>.ann (text-bound annotations)."""
    text, standoff = char_standoff(tree)
    with open(filename_prefix + '.txt', 'w') as f:
        print(text, file=f)
    with open(filename_prefix + '.ann', 'w') as f:
        for i, (path, label, start, stop) in enumerate(standoff):
            # Tab-separated fields: ID, "label start stop", covered text
            print('T{}'.format(i), '{} {} {}'.format(label, start, stop),
                  text[start:stop], sep='\t', file=f)

Calling it with the prefix /path/to/something writes the following to /path/to/something.txt:

This is a representation of the grammar .

and this to /path/to/something.ann:

T0	Participant 0 4	This
T1	Verbal-group 5 7	is
T2	Participant 8 24	a representation
T3	Circumstance 25 39	of the grammar
T4	Process-dependencies 0 39	This is a representation of the grammar
T5	Clause 0 39	This is a representation of the grammar
T6	S 0 41	This is a representation of the grammar .
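Since Brat is strict about offsets matching the text, it can be worth reading the .ann content back and checking each span before loading it into the tool. The following is a rough sketch under the assumption that the file contains only contiguous text-bound annotations (discontinuous spans, which Brat writes with semicolons, are not handled). `read_text_bound` and `mismatches` are hypothetical helper names.

```python
def read_text_bound(ann_text):
    """Parse 'T' lines into (id, label, start, end, covered_text) tuples.

    Assumes contiguous spans of the form 'label start end'.
    """
    result = []
    for line in ann_text.splitlines():
        if not line.startswith('T'):
            continue  # Skip relations, attributes, notes, and blank lines
        ann_id, type_and_span, covered = line.split('\t', 2)
        label, start, end = type_and_span.split(' ')
        result.append((ann_id, label, int(start), int(end), covered))
    return result


def mismatches(text, annotations):
    """Return the annotations whose offsets do not select their covered text."""
    return [a for a in annotations if text[a[2]:a[3]] != a[4]]


text = 'This is a representation of the grammar .'
ann = (
    'T0\tParticipant 0 4\tThis\n'
    'T1\tVerbal-group 5 7\tis\n'
    'T6\tS 0 41\tThis is a representation of the grammar .\n'
)
annotations = read_text_bound(ann)
print(mismatches(text, annotations))  # empty list when every span lines up
```

In practice `text` and `ann` would be read from the .txt and .ann files written by write_brat; they are inlined here to keep the sketch self-contained.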

Score: 3

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/23146072
