Is there a way to run spaCy NER on a field of a JSON file?

Stack Overflow user

Asked 2020-12-31 01:23:05

1 answer · 808 views · 0 followers · 0 votes

I am trying to use spaCy to extract the (geographic) entities that appear in a field of the JSON file json-capLatLong.json. The file looks like this:

[{
    "caption": "Grassland north of Eastdon ",
    "ground truth toponym": "Eastdon",
    "guide-latitude-WGS84": "50.611614",
    "guide-longitude-WGS84": "-3.447207",
    "disambiguated": "Eastdon, Teignbridge, Devon, South West England, England, EX6 8RH, United Kingdom"
}, {
    "caption": "Wall north of Hulne Park ",
    "ground truth toponym": "Hulne Park",
    "guide-latitude-WGS84": "55.446522",
    "guide-longitude-WGS84": "-1.748779",
    "disambiguated": "Hulne Park, Denwick, Alnwick, Northumberland, North East England, England, United Kingdom"
}, {
    "caption": "Farm track north of Aglionby ",
    "ground truth toponym": "Aglionby",
    "guide-latitude-WGS84": "54.908579",
    "guide-longitude-WGS84": "-2.866381",
    "disambiguated": "Aglionby, Carlisle, Cumbria, North West England, England, CA4 8AJ, United Kingdom"
}, {
    "caption": "Long barrow north of Martin ",
    "ground truth toponym": "Martin",
    "guide-latitude-WGS84": "50.98477",
    "guide-longitude-WGS84": "-1.910483",
    "disambiguated": "Martin, Hampshire, South East, England, SP6 3LF, United Kingdom"
}, {
    "caption": "A483 north of Pool Quay ",
    "ground truth toponym": "Pool Quay",
    "guide-latitude-WGS84": "52.701294",
    "guide-longitude-WGS84": "-3.098761",
    "disambiguated": "Pool Quay, Powys, Wales, SY21 9JS, United Kingdom"
}, {
    "caption": "Power line north of Dagnets Lane ",
    "ground truth toponym": "Dagnets Lane",
    "guide-latitude-WGS84": "51.846349",
    "guide-longitude-WGS84": "0.537283",
    "disambiguated": "Dagnets Lane, Black Notley, Braintree, Essex, East of England, England, CM77 8QP, United Kingdom"
}, {
    "caption": "Fields north of Ellington ",
    "ground truth toponym": "Ellington",
    "guide-latitude-WGS84": "52.347205",
    "guide-longitude-WGS84": "-0.291146",
    "disambiguated": "Ellington, Cambridgeshire, East of England, England, United Kingdom"
}, {
    "caption": "Fields north of Belsey Bridge Road ",
    "ground truth toponym": "Belsey Bridge Road",
    "guide-latitude-WGS84": "52.479252",
    "guide-longitude-WGS84": "1.428283",
    "disambiguated": "Belsey Bridge Road, Ditchingham, South Norfolk, Norfolk, East of England, England, NR35 2DT, United Kingdom"
}, {
    "caption": "Pasture north of Balhomish ",
    "ground truth toponym": "Balhomish",
    "guide-latitude-WGS84": "56.544822",
    "guide-longitude-WGS84": "-3.605378",
    "disambiguated": "Balhomish, Inver, Perth and Kinross, Scotland, PH8 0DX, United Kingdom"
}, {
    "caption": "The A22 north of South Godstone ",
    "ground truth toponym": "South Godstone",
    "guide-latitude-WGS84": "51.222992",
    "guide-longitude-WGS84": "-0.04726",
    "disambiguated": "South Godstone, Surrey, South East, England, RH9 8HS, United Kingdom"
}, {
    "caption": "Farm on track east of Hardwick",
    "ground truth toponym": "Hardwick",
    "guide-latitude-WGS84": "51.866063",
    "guide-longitude-WGS84": "-0.826492",
    "disambiguated": "Hardwick, Buckinghamshire, South East, England, HP22 4DX, United Kingdom"
}, {
    "caption": "Un-named lane east of Clare",
    "ground truth toponym": "Clare",
    "guide-latitude-WGS84": "51.681005",
    "guide-longitude-WGS84": "-1.02134",
    "disambiguated": "Clare, South Oxfordshire, Oxfordshire, South East, England, OX9 7HF, United Kingdom"
}]

I am interested in analysing the caption field.

I know how to process a single string, using something like

import spacy
import json

nlp = spacy.load("en_core_web_sm")

doc = nlp("Grassland north of Eastdon")

for ent in doc.ents:
    print(ent.text, ent.label_)
    # Output for this caption: "Grassland GPE"

That is, I want to extract all FAC, GPE, or LOC entities from the input string. But how can I process the whole JSON file?

python

json

spacy

1 Answer

Stack Overflow user

Answered 2020-12-31 02:04:43

A starting point could be something along these lines:

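import pandas as pd
import spacy

# Load the JSON array into a DataFrame, one row per object.
df = pd.read_json("json-capLatLong.json")

nlp = spacy.load("en_core_web_sm")

def get_toponyms(caption):
    # Run NER on one caption and keep only the geographic entity types.
    return [e.text for e in nlp(caption).ents if e.label_ in ["GPE", "FAC", "LOC"]]

# Apply the extractor to every caption and store the result in a new column.
df["extracted toponyms"] = df["caption"].apply(get_toponyms)
print(df["extracted toponyms"])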

The idea is to read your JSON with pandas and then apply the function to the caption field/column, as in "Applying SpaCy's EntityRecognizer to a column within a Pandas dataframe".

The list comprehension in get_toponyms keeps only the text of entities whose label is one of the geographic types.
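If pandas is not needed, the same idea can probably be expressed with the standard json module and nlp.pipe, which lets spaCy process the captions in batches. The following is only a sketch: the helper names extract_toponyms and run_on_file are hypothetical, and stripping the trailing whitespace from each caption is an assumption based on the sample data.

```python
import json

# Entity labels treated as geographic, as in the answer above.
GEO_LABELS = {"GPE", "FAC", "LOC"}


def extract_toponyms(records, nlp, field="caption"):
    """Return copies of the records with an added "extracted toponyms" list.

    `nlp` is any object with a spaCy-style `pipe(texts)` method that yields
    documents exposing `.ents`, each entity having `.text` and `.label_`.
    """
    # Assumption: captions may carry trailing spaces, so strip them first.
    texts = [rec[field].strip() for rec in records]
    results = []
    for rec, doc in zip(records, nlp.pipe(texts)):
        toponyms = [e.text for e in doc.ents if e.label_ in GEO_LABELS]
        results.append({**rec, "extracted toponyms": toponyms})
    return results


def run_on_file(path, model="en_core_web_sm"):
    """Hypothetical convenience wrapper: load the model and the JSON file."""
    import spacy  # imported here so the helper above stays dependency-free

    nlp = spacy.load(model)
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    return extract_toponyms(records, nlp)
```

Calling run_on_file("json-capLatLong.json") would then return the original records, each with its list of extracted toponyms, assuming the model is installed and the file is in the working directory.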

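However, I am afraid that your recall against the ground truth will be low, probably because some of these toponyms are not known to the model. Using a larger model (e.g. en_core_web_lg) might help improve recall.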
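One way to check the recall concern is to compare the extracted entities with the "ground truth toponym" field that the file already contains. The sketch below assumes a simple definition of recall (the share of records whose ground-truth toponym appears among the extracted entities, ignoring case and surrounding whitespace) and assumes the records are plain dictionaries, for example as produced by df.to_dict("records"). The function name toponym_recall is hypothetical.

```python
def toponym_recall(records,
                   truth_key="ground truth toponym",
                   pred_key="extracted toponyms"):
    """Fraction of records whose ground-truth toponym was extracted.

    Matching is exact after lower-casing and stripping whitespace; this is
    an assumed, deliberately strict definition, not the only possible one.
    """
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        truth = rec[truth_key].strip().lower()
        predicted = {p.strip().lower() for p in rec.get(pred_key, [])}
        if truth in predicted:
            hits += 1
    return hits / len(records)


# Tiny illustration with made-up extraction results (not real model output).
example = [
    {"ground truth toponym": "Eastdon", "extracted toponyms": ["Eastdon"]},
    {"ground truth toponym": "Hulne Park", "extracted toponyms": []},
]
print(toponym_recall(example))
```

Running the same function on the results of en_core_web_sm and en_core_web_lg would give a rough comparison of the two models on this data.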

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65515587
