Converting XML annotations to BRAT format

Stack Overflow user
Asked 2018-04-29 23:06:58
Answers: 1, Views: 772, Followers: 0, Votes: 3

I have an annotated dataset in XML format; see the example below:

Treatment of <annotation cui="C0267055">Erosive Esophagitis</annotation> in patients

The annotated words are enclosed in XML tags, as shown. I need to convert this to BRAT format, for example:

T1    annotation 14 33    Erosive Esophagitis

More examples can be found at http://brat.nlplab.org/standoff.html
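For reference, the standoff documentation linked above defines the start offset as the number of characters preceding the span and the end offset as the index of the first character after it. A minimal sketch of that counting follows; the helper name `brat_line` is hypothetical, and it only handles the first occurrence of a span. Counted this way, the sentence above gives 13 32 rather than 14 33.

```python
def brat_line(tid, label, text, span):
    """Build one text-bound annotation line for the first occurrence of span."""
    start = text.index(span)   # number of characters preceding the span
    end = start + len(span)    # index of the first character after the span
    return "T{}\t{} {} {}\t{}".format(tid, label, start, end, span)


text = "Treatment of Erosive Esophagitis in patients"
line = brat_line(1, "annotation", text, "Erosive Esophagitis")
print(line)
```

Slicing the plain text with the two offsets, `text[start:end]`, should return exactly the annotated words.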

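I can extract the annotations with regular expressions in Python, but I am not sure how to convert them into the correct BRAT format. Is there a tool that solves this?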


1 Answer

Stack Overflow user

Answered 2019-05-05 16:51:07

In case anyone still needs an answer to this question, here is a solution.

Suppose an XML file sample.xml has the following structure:

<root>
<p n='1'>Hi, my name is <fname>Mickey</fname> <lname>Mouse</lname>, and what about yourself?</p>
<p n='2'>Nice meeting you, <fname>Mickey</fname>! I am <fname>Minnie</fname>!</p>
</root>

Here is a Python solution:

# leave empty if there are no tags that should not be interpreted as named entities; or add more
ignoretags = ['root', 'p']

# dictionary, in case some named entities have to be mapped; or just a list of tags that represent NEs
replacetags = {
    "fname": "PERS",
    "lname": "PERS"
}

# read content
content = open('sample.xml', encoding='utf-8').read()

# output files for BRAT: txt and annotations (written as UTF-8, matching the input)
f_txt = open('sample.txt', 'w', encoding='utf-8')
f_ann = open('sample.ann', 'w', encoding='utf-8')

# from txt file remove NE tags
clean_content = content
for replacetag in replacetags:
    clean_content = clean_content.replace('<{}>'.format(replacetag), '')
    clean_content = clean_content.replace('</{}>'.format(replacetag), '')

# write content to file
f_txt.write(clean_content)

# char by char
n = len(content)
i = - 1

# token id
tid = 0
# other - for output
start = -1
end = - 1
token = None
tag = None

# let's start parsing! character by character
skipped_chars = 0
number_of_tags = 0

token_buffer = ''
while i < n - 1:

    i += 1
    c = content[i]

    # beginning of an entity
    if c == '<':

        # probably the most important part: always track the count of skipped characters
        start = i - skipped_chars

        # get name of the entity
        tag_buffer = ''
        i += 1
        while content[i] != '>':
            tag_buffer += content[i]
            i += 1
        tag = tag_buffer

        # skip tags that are not NEs
        if tag not in replacetags:
            continue

        # get entity itself
        ent_buffer = ''
        i += 1
        while content[i] != '<':
            ent_buffer += content[i]
            i += 1
        token = ent_buffer

        # determine positions
        end = start + len(token)
        number_of_tags += 1

        # <fname></fname> i.e. 1 + len('fname') + 1 + 1 + 1 + len('fname') + 1
        skipped_chars += 1 + len(tag) + 1 + 1 + 1 + len(tag) + 1
        tid += 1

        # write annotation
        f_ann.write('T{}\t{} {} {}\t{}\n'.format(tid, replacetags[tag], start, end, token))

        # navigate to the end of the entity span, e.g. go behind <fname>...</fname>
        i += 1 + len(tag) + 1

# close both output files so that everything is flushed to disk
f_txt.close()
f_ann.close()

Contents of sample.txt:

<root>
<p n='1'>Hi, my name is Mickey Mouse, and what about yourself?</p>
<p n='2'>Nice meeting you, Mickey! I am Minnie!</p>
</root>

Contents of sample.ann:

T1  PERS 31 37  Mickey
T2  PERS 38 43  Mouse
T3  PERS 101 107    Mickey
T4  PERS 114 120    Minnie
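One way to sanity-check output like this is to slice the text file with each pair of offsets and compare the result with the token stored in the annotation line. The sketch below is only an illustration: the function name `check_ann` is hypothetical, it assumes tab-separated fields and contiguous spans (no `;`-separated fragments), and it uses the first paragraph of the sample inline instead of reading the files.

```python
def check_ann(text, ann):
    """Return the annotation lines whose offsets do not match the text."""
    bad = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # only text-bound annotations carry offsets
        tid, meta, token = line.split("\t")
        label, start, end = meta.split(" ")
        if text[int(start):int(end)] != token:
            bad.append(line)
    return bad


# First paragraph of sample.txt and its two annotations from sample.ann
text = "<root>\n<p n='1'>Hi, my name is Mickey Mouse, and what about yourself?</p>\n"
ann = "T1\tPERS 31 37\tMickey\nT2\tPERS 38 43\tMouse\n"
mismatches = check_ann(text, ann)
print(mismatches)
```

An empty list means every annotated span lines up with the text; a shifted offset would show up as a returned line.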

Visualization in BRAT (screenshot omitted).

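A small adjustment is needed when tags carry attributes: I added another key, 'att', to the replacetags dictionary, so that an entry becomes "fname": {"tag": "PERS", "att": "value of attribute"}, and an additional line is then written for each tag that has an attribute.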
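The answer does not show the adjusted code, so the following is only a rough sketch of how attributes might be handled, using a regular expression instead of the character loop above. Everything in it is an assumption: the function name `xml_to_brat`, the `labels` mapping, the restriction to flat (non-nested) tags with at most one attribute, and the choice to emit the attribute as a BRAT `A` line of the form `A<id>`, tab, `name target value`. BRAT may additionally need the attribute declared in its configuration.

```python
import re

# Matches <tag attr="value">text</tag>; the attribute part is optional.
# Assumption: entity tags are flat (not nested) and carry at most one attribute.
ENTITY = re.compile(r'<(\w+)(?:\s+(\w+)=["\']([^"\']*)["\'])?>([^<]*)</\1>')


def xml_to_brat(content, labels):
    """Return (plain_text, ann_lines) for the tags listed in `labels`."""
    out, ann = [], []
    pos = 0      # read position in the XML source
    length = 0   # length of the plain text emitted so far
    tid = aid = 0
    for m in ENTITY.finditer(content):
        tag, att, val, token = m.groups()
        if tag not in labels:
            continue  # not an entity tag: it stays in the text untouched
        before = content[pos:m.start()]
        out.append(before)
        length += len(before)
        start = length
        out.append(token)
        length += len(token)
        tid += 1
        ann.append("T{}\t{} {} {}\t{}".format(tid, labels[tag], start, length, token))
        if att is not None:
            aid += 1
            ann.append("A{}\t{} T{} {}".format(aid, att, tid, val))
        pos = m.end()
    out.append(content[pos:])
    return "".join(out), ann


content = 'Treatment of <annotation cui="C0267055">Erosive Esophagitis</annotation> in patients'
text, ann = xml_to_brat(content, {"annotation": "annotation"})
print(text)
print("\n".join(ann))
```

Tracking the length of the emitted plain text directly avoids the skipped-character arithmetic, at the cost of relying on the regular expression to recognise every entity tag.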

Hope someone finds this helpful!

Votes: 3
Original content provided by Stack Overflow. Source: https://stackoverflow.com/questions/50088032