
Spacy BILOU format to spaCy JSON format

Stack Overflow user
Asked on 2020-11-04 15:21:15
1 answer · 841 views · 0 followers · 1 vote

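I am trying to upgrade my spaCy version to the nightly build, specifically to use spaCy transformers.

So I converted my dataset from spaCy's simple training format, which looks like this: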

td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],["I like London.", {"entities": [(7, 13, "LOC")]}],]

to this:

[[{"head": 0, "dep": "", "tag": "", "orth": "Who", "ner": "O", "id": 0}, {"head": 0, "dep": "", "tag": "", "orth": "is", "ner": "O", "id": 1}, {"head": 0, "dep": "", "tag": "", "orth": "Shaka", "ner": "B-FRIENDS", "id": 2}, {"head": 0, "dep": "", "tag": "", "orth": "Khan", "ner": "L-FRIENDS", "id": 3}, {"head": 0, "dep": "", "tag": "", "orth": "?", "ner": "O", "id": 4}], [{"head": 0, "dep": "", "tag": "", "orth": "I", "ner": "O", "id": 0}, {"head": 0, "dep": "", "tag": "", "orth": "like", "ner": "O", "id": 1}, {"head": 0, "dep": "", "tag": "", "orth": "London", "ner": "U-LOC", "id": 2}, {"head": 0, "dep": "", "tag": "", "orth": ".", "ner": "O", "id": 3}]]

using the following script:

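import json

import spacy
from spacy.training import offsets_to_biluo_tags  # spaCy v3 (nightly) import path

# The original post does not show how nlp was created; a blank English
# pipeline is assumed here, since only the tokenizer is needed.
nlp = spacy.blank("en")

sentences = []
for text, annotations in td:
    doc = nlp(text)
    # One BILUO tag per token, derived from the character offsets
    tags = offsets_to_biluo_tags(doc, annotations["entities"])
    tokens = []
    for n, (token, tag) in enumerate(zip(doc, tags)):
        tokens.append({"head": 0, "dep": "", "tag": "", "orth": token.orth_, "ner": tag, "id": n})
    sentences.append(tokens)

with open("train_data.json", "w") as js:
    json.dump(sentences, js)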


Then I tried to convert this train_data.json using spaCy's convert command:

python -m spacy convert train_data.json converted/

But the result in the converted folder is:

✔ Generated output file (0 documents): converted/train_data.spacy

which means no dataset was created. Can anyone help me figure out what I am missing? I am trying to do this with spacy-nightly.

1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-11-04 18:03:01

You can skip the intermediate JSON step and convert the annotations directly to a DocBin:
import spacy
from spacy.training import Example
from spacy.tokens import DocBin

td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],["I like London.", {"entities": [(7, 13, "LOC")]}],]

nlp = spacy.blank("en")
db = DocBin()

for text, annotations in td:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    db.add(example.reference)

db.to_disk("td.spacy")

See: https://nightly.spacy.io/usage/v3#migrating-training-python
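As a quick sanity check on the DocBin approach, the serialized data can be loaded back and its entity spans inspected. The following is a minimal sketch, assuming spaCy v3 is installed; the temporary directory and the variable names `loaded`, `docs`, and `found` are illustrative choices, not part of the original answer.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

td = [
    ["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
    ["I like London.", {"entities": [(7, 13, "LOC")]}],
]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in td:
    # The reference doc of the Example carries the gold-standard entities
    example = Example.from_dict(nlp.make_doc(text), annotations)
    db.add(example.reference)

# Write to a temporary location (illustrative) and read the file back
out_path = Path(tempfile.mkdtemp()) / "td.spacy"
db.to_disk(out_path)

loaded = DocBin().from_disk(out_path)
docs = list(loaded.get_docs(nlp.vocab))

# Collect (entity text, label) pairs for each restored document
found = [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]
print(len(docs), found)
```

If the number of restored documents matches the number of training examples and each one lists the expected entities, the .spacy file should be usable as training input.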

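(If you do want to use the intermediate JSON format, here is the spec: https://spacy.io/api/annotation#json-input. You can include just `orth` and `ner` in the `tokens` and leave out the other features, but you need the surrounding structure with `paragraphs`, `raw`, and `sentences`. Here is an example: https://github.com/explosion/spaCy/blob/45c9a688285081cd69faa0627d9bcaf1f5e799a1/examples/training/training-data.json)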
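The nesting that the JSON format expects can be sketched in plain Python. This is a hedged illustration rather than a definitive implementation: the helper name `to_spacy_json` and the `tagged` input layout are hypothetical, each text is assumed to be a single paragraph containing a single sentence, and the tags are the BILUO tags from the question.

```python
import json

# (raw text, [(orth, BILUO tag), ...]) pairs, taken from the question's data
tagged = [
    ("Who is Shaka Khan?",
     [("Who", "O"), ("is", "O"), ("Shaka", "B-FRIENDS"), ("Khan", "L-FRIENDS"), ("?", "O")]),
    ("I like London.",
     [("I", "O"), ("like", "O"), ("London", "U-LOC"), (".", "O")]),
]


def to_spacy_json(tagged_texts):
    """Wrap token-level tags in the doc > paragraphs > sentences > tokens nesting.

    Hypothetical helper: assumes one paragraph and one sentence per text.
    """
    docs = []
    for doc_id, (raw, pairs) in enumerate(tagged_texts):
        tokens = [
            {"id": i, "orth": orth, "ner": ner}
            for i, (orth, ner) in enumerate(pairs)
        ]
        docs.append({
            "id": doc_id,
            "paragraphs": [{"raw": raw, "sentences": [{"tokens": tokens}]}],
        })
    return docs


train_json = to_spacy_json(tagged)
print(json.dumps(train_json[0], indent=2))
```

Compared with the flat list of token lists written by the question's script, each document here is an object with `id` and `paragraphs`, which is the structure the spec describes. Dumping `train_json` to train_data.json should then give the convert command documents to read.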

Votes: 3
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/64675654
