Is there a way to run spaCy NER on a field of a JSON file?

Stack Overflow user
Asked 2020-12-31 01:23:05
Answers: 1, Views: 808, Follows: 0, Votes: 0

I am trying to use spaCy to extract the (geographic) entities that appear in one field of the JSON file json-capLatLong.json. The file looks like this:

[{
    "caption": "Grassland north of Eastdon ",
    "ground truth toponym": "Eastdon",
    "guide-latitude-WGS84": "50.611614",
    "guide-longitude-WGS84": "-3.447207",
    "disambiguated": "Eastdon, Teignbridge, Devon, South West England, England, EX6 8RH, United Kingdom"
}, {
    "caption": "Wall north of Hulne Park ",
    "ground truth toponym": "Hulne Park",
    "guide-latitude-WGS84": "55.446522",
    "guide-longitude-WGS84": "-1.748779",
    "disambiguated": "Hulne Park, Denwick, Alnwick, Northumberland, North East England, England, United Kingdom"
}, {
    "caption": "Farm track north of Aglionby ",
    "ground truth toponym": "Aglionby",
    "guide-latitude-WGS84": "54.908579",
    "guide-longitude-WGS84": "-2.866381",
    "disambiguated": "Aglionby, Carlisle, Cumbria, North West England, England, CA4 8AJ, United Kingdom"
}, {
    "caption": "Long barrow north of Martin ",
    "ground truth toponym": "Martin",
    "guide-latitude-WGS84": "50.98477",
    "guide-longitude-WGS84": "-1.910483",
    "disambiguated": "Martin, Hampshire, South East, England, SP6 3LF, United Kingdom"
}, {
    "caption": "A483 north of Pool Quay ",
    "ground truth toponym": "Pool Quay",
    "guide-latitude-WGS84": "52.701294",
    "guide-longitude-WGS84": "-3.098761",
    "disambiguated": "Pool Quay, Powys, Wales, SY21 9JS, United Kingdom"
}, {
    "caption": "Power line north of Dagnets Lane ",
    "ground truth toponym": "Dagnets Lane",
    "guide-latitude-WGS84": "51.846349",
    "guide-longitude-WGS84": "0.537283",
    "disambiguated": "Dagnets Lane, Black Notley, Braintree, Essex, East of England, England, CM77 8QP, United Kingdom"
}, {
    "caption": "Fields north of Ellington ",
    "ground truth toponym": "Ellington",
    "guide-latitude-WGS84": "52.347205",
    "guide-longitude-WGS84": "-0.291146",
    "disambiguated": "Ellington, Cambridgeshire, East of England, England, United Kingdom"
}, {
    "caption": "Fields north of Belsey Bridge Road ",
    "ground truth toponym": "Belsey Bridge Road",
    "guide-latitude-WGS84": "52.479252",
    "guide-longitude-WGS84": "1.428283",
    "disambiguated": "Belsey Bridge Road, Ditchingham, South Norfolk, Norfolk, East of England, England, NR35 2DT, United Kingdom"
}, {
    "caption": "Pasture north of Balhomish ",
    "ground truth toponym": "Balhomish",
    "guide-latitude-WGS84": "56.544822",
    "guide-longitude-WGS84": "-3.605378",
    "disambiguated": "Balhomish, Inver, Perth and Kinross, Scotland, PH8 0DX, United Kingdom"
}, {
    "caption": "The A22 north of South Godstone ",
    "ground truth toponym": "South Godstone",
    "guide-latitude-WGS84": "51.222992",
    "guide-longitude-WGS84": "-0.04726",
    "disambiguated": "South Godstone, Surrey, South East, England, RH9 8HS, United Kingdom"
}, {
    "caption": "Farm on track east of Hardwick",
    "ground truth toponym": "Hardwick",
    "guide-latitude-WGS84": "51.866063",
    "guide-longitude-WGS84": "-0.826492",
    "disambiguated": "Hardwick, Buckinghamshire, South East, England, HP22 4DX, United Kingdom"
}, {
    "caption": "Un-named lane east of Clare",
    "ground truth toponym": "Clare",
    "guide-latitude-WGS84": "51.681005",
    "guide-longitude-WGS84": "-1.02134",
    "disambiguated": "Clare, South Oxfordshire, Oxfordshire, South East, England, OX9 7HF, United Kingdom"
}]

I am interested in analyzing the caption field.

I know how to process a single string, i.e. using something like

import spacy
import json

nlp = spacy.load("en_core_web_sm")

doc = nlp("Grassland north of Eastdon")

for ent in doc.ents:

    print(ent.text, ent.label_)
    # this gives me in output "Grassland GPE"

So the goal is to extract every FAC, GPE, or LOC entity in the input string. But how do I process the whole JSON file?


1 Answer

Stack Overflow user

Answered 2020-12-31 02:04:43

A starting point could be something along the lines of

import pandas as pd
import spacy

df = pd.read_json("json-capLatLong.json")

nlp = spacy.load("en_core_web_sm")

def get_toponyms(caption):
  return [e.text for e in nlp(caption).ents if e.label_ in ["GPE", "FAC", "LOC"]]

df["extracted toponyms"] = df.caption.apply(get_toponyms)
print(df["extracted toponyms"])

The idea is to read your JSON with pandas and then apply the function to the caption field/column, as in "Applying SpaCy's EntityRecognizer to a column within a Pandas dataframe".

The list comprehension in get_toponyms keeps only the text of entities whose label marks them as geographic (GPE, FAC, or LOC).
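If you would rather not depend on pandas, the same idea can be sketched with the standard json module and spaCy's nlp.pipe, which processes the captions as a stream instead of calling nlp() once per row. This is only a minimal sketch: the helper names load_captions, extract_toponyms, and main are made up for illustration, and main assumes that the file from the question is in the working directory and that en_core_web_sm is installed.

```python
import json

# Entity labels treated as geographic, as in the question.
GEO_LABELS = {"GPE", "FAC", "LOC"}


def load_captions(path, field="caption"):
    """Read a JSON array of objects and return the stripped text of one field."""
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    return [record[field].strip() for record in records]


def extract_toponyms(captions, nlp, labels=GEO_LABELS):
    """Return one list of geographic entity texts per caption.

    `nlp` is expected to be a loaded spaCy pipeline; nlp.pipe yields one
    Doc per input text, in the same order as the input.
    """
    return [
        [ent.text for ent in doc.ents if ent.label_ in labels]
        for doc in nlp.pipe(captions)
    ]


def main(path="json-capLatLong.json"):
    # Imported here so the helpers above can be reused or tested without spaCy.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    captions = load_captions(path)
    for caption, toponyms in zip(captions, extract_toponyms(captions, nlp)):
        print(caption, "->", toponyms)
```

Calling main() prints each caption next to the entities the model tagged as GPE, FAC, or LOC. What it finds depends on the model and its version, so no output is shown here.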

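However, I am afraid that your recall against the ground truth will be low, probably because some of these toponyms are not known to the model. Using a larger model (e.g. en_core_web_lg) may help improve recall.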
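To put a number on that concern, you could compare the extracted entities with the "ground truth toponym" field. The sketch below is one possible way to do it, not a standard metric implementation: toponym_recall is a hypothetical helper, matching is exact after stripping and lowercasing (a simplifying assumption), and the extracted lists are hard-coded for illustration rather than real spaCy output.

```python
def toponym_recall(records, extracted, truth_field="ground truth toponym"):
    """Fraction of records whose ground-truth toponym is among the extracted entities.

    `records` is the list of dicts loaded from the JSON file and `extracted`
    is a parallel list holding the entity texts found for each caption.
    """
    if not records:
        return 0.0
    hits = 0
    for record, toponyms in zip(records, extracted):
        truth = record[truth_field].strip().lower()
        found = {t.strip().lower() for t in toponyms}
        if truth in found:
            hits += 1
    return hits / len(records)


records = [
    {"caption": "Grassland north of Eastdon ", "ground truth toponym": "Eastdon"},
    {"caption": "Wall north of Hulne Park ", "ground truth toponym": "Hulne Park"},
]
# Hypothetical model output, hard-coded for illustration only.
extracted = [["Grassland"], ["Hulne Park"]]

print(toponym_recall(records, extracted))  # 0.5 (one of the two toponyms was found)
```

Running the same function on the output of en_core_web_sm and of en_core_web_lg would show whether the larger model actually helps on your data.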

Votes: 0
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/65515587
