I am trying to use spaCy to analyse the (geographic) entities that appear in one field of the JSON file json-capLatLong.json. The file looks like this:
[{
"caption": "Grassland north of Eastdon ",
"ground truth toponym": "Eastdon",
"guide-latitude-WGS84": "50.611614",
"guide-longitude-WGS84": "-3.447207",
"disambiguated": "Eastdon, Teignbridge, Devon, South West England, England, EX6 8RH, United Kingdom"
}, {
"caption": "Wall north of Hulne Park ",
"ground truth toponym": "Hulne Park",
"guide-latitude-WGS84": "55.446522",
"guide-longitude-WGS84": "-1.748779",
"disambiguated": "Hulne Park, Denwick, Alnwick, Northumberland, North East England, England, United Kingdom"
}, {
"caption": "Farm track north of Aglionby ",
"ground truth toponym": "Aglionby",
"guide-latitude-WGS84": "54.908579",
"guide-longitude-WGS84": "-2.866381",
"disambiguated": "Aglionby, Carlisle, Cumbria, North West England, England, CA4 8AJ, United Kingdom"
}, {
"caption": "Long barrow north of Martin ",
"ground truth toponym": "Martin",
"guide-latitude-WGS84": "50.98477",
"guide-longitude-WGS84": "-1.910483",
"disambiguated": "Martin, Hampshire, South East, England, SP6 3LF, United Kingdom"
}, {
"caption": "A483 north of Pool Quay ",
"ground truth toponym": "Pool Quay",
"guide-latitude-WGS84": "52.701294",
"guide-longitude-WGS84": "-3.098761",
"disambiguated": "Pool Quay, Powys, Wales, SY21 9JS, United Kingdom"
}, {
"caption": "Power line north of Dagnets Lane ",
"ground truth toponym": "Dagnets Lane",
"guide-latitude-WGS84": "51.846349",
"guide-longitude-WGS84": "0.537283",
"disambiguated": "Dagnets Lane, Black Notley, Braintree, Essex, East of England, England, CM77 8QP, United Kingdom"
}, {
"caption": "Fields north of Ellington ",
"ground truth toponym": "Ellington",
"guide-latitude-WGS84": "52.347205",
"guide-longitude-WGS84": "-0.291146",
"disambiguated": "Ellington, Cambridgeshire, East of England, England, United Kingdom"
}, {
"caption": "Fields north of Belsey Bridge Road ",
"ground truth toponym": "Belsey Bridge Road",
"guide-latitude-WGS84": "52.479252",
"guide-longitude-WGS84": "1.428283",
"disambiguated": "Belsey Bridge Road, Ditchingham, South Norfolk, Norfolk, East of England, England, NR35 2DT, United Kingdom"
}, {
"caption": "Pasture north of Balhomish ",
"ground truth toponym": "Balhomish",
"guide-latitude-WGS84": "56.544822",
"guide-longitude-WGS84": "-3.605378",
"disambiguated": "Balhomish, Inver, Perth and Kinross, Scotland, PH8 0DX, United Kingdom"
}, {
"caption": "The A22 north of South Godstone ",
"ground truth toponym": "South Godstone",
"guide-latitude-WGS84": "51.222992",
"guide-longitude-WGS84": "-0.04726",
"disambiguated": "South Godstone, Surrey, South East, England, RH9 8HS, United Kingdom"
}, {
"caption": "Farm on track east of Hardwick",
"ground truth toponym": "Hardwick",
"guide-latitude-WGS84": "51.866063",
"guide-longitude-WGS84": "-0.826492",
"disambiguated": "Hardwick, Buckinghamshire, South East, England, HP22 4DX, United Kingdom"
}, {
"caption": "Un-named lane east of Clare",
"ground truth toponym": "Clare",
"guide-latitude-WGS84": "51.681005",
"guide-longitude-WGS84": "-1.02134",
"disambiguated": "Clare, South Oxfordshire, Oxfordshire, South East, England, OX9 7HF, United Kingdom"
}]
I am interested in analysing the caption field.
I know how to process a single string, i.e. with something like
import spacy
import json
nlp = spacy.load("en_core_web_sm")
doc = nlp("Grassland north of Eastdon")
for ent in doc.ents:
    print(ent.text, ent.label_)
# this gives me in output "Grassland GPE"
This way I can extract every FAC, GPE, or LOC entity in the input string. But how do I process the whole JSON file?
Posted on 2020-12-31 02:04:43
A starting point could be something along the lines of
import pandas as pd
import spacy
df = pd.read_json("json-capLatLong.json")
nlp = spacy.load("en_core_web_sm")
def get_toponyms(caption):
    return [e.text for e in nlp(caption).ents if e.label_ in ["GPE", "FAC", "LOC"]]
df["extracted toponyms"] = df.caption.apply(get_toponyms)
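print(df["extracted toponyms"])
The idea is to read your JSON with pandas and then apply the function to the caption field/column, as in "Applying SpaCy's EntityRecognizer to a column within a Pandas dataframe".
The list comprehension in get_toponyms keeps only the text of entities whose label marks them as geographic (GPE, FAC, or LOC).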
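If pandas is not needed, the same filtering could be sketched with the standard json module and nlp.pipe, which processes the captions as a batch instead of calling nlp once per row. The names extract_toponyms, GEO_LABELS, and main are made up for this illustration; the file name and model name are the ones from the question. This is a rough sketch under those assumptions, not a tuned pipeline.

```python
import json

# Entity labels treated as geographic, as in the question.
GEO_LABELS = {"GPE", "FAC", "LOC"}


def extract_toponyms(records, nlp, labels=GEO_LABELS):
    """Return one list of geographic entity texts per record, in input order.

    `records` is a list of dicts with a "caption" key; `nlp` is a loaded
    spaCy pipeline (anything exposing .pipe(texts) that yields docs with .ents).
    """
    # Some captions in the file end with a trailing space, so strip them first.
    captions = [rec["caption"].strip() for rec in records]
    return [
        [ent.text for ent in doc.ents if ent.label_ in labels]
        for doc in nlp.pipe(captions)
    ]


def main(path="json-capLatLong.json", model="en_core_web_sm"):
    # Imported here so extract_toponyms can be reused without loading spaCy.
    import spacy

    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    nlp = spacy.load(model)
    for rec, found in zip(records, extract_toponyms(records, nlp)):
        print(rec["caption"].strip(), "->", found)
```

Calling main() from the directory that contains json-capLatLong.json prints each caption followed by the entities that were kept.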
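However, I fear that your recall against the ground truth will be low, probably because some of these toponyms are unknown to the model. Using a larger model (e.g. en_core_web_lg) may help improve recall.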
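To put a number on that concern, recall could be measured as the share of rows whose "ground truth toponym" appears among the extracted entities. The helper toponym_recall below is a hypothetical name, and the extraction results in rows are invented placeholders for illustration, not real model output; exact case-insensitive matching is also just one possible criterion.

```python
def toponym_recall(rows):
    """Fraction of rows whose ground-truth toponym is among the extracted ones."""
    if not rows:
        return 0.0
    hits = 0
    for row in rows:
        truth = row["ground truth toponym"].strip().lower()
        extracted = {t.strip().lower() for t in row["extracted toponyms"]}
        if truth in extracted:
            hits += 1
    return hits / len(rows)


# Placeholder extraction results, made up for this example only.
rows = [
    {"ground truth toponym": "Eastdon", "extracted toponyms": ["Eastdon"]},
    {"ground truth toponym": "Hulne Park", "extracted toponyms": ["Hulne Park"]},
    {"ground truth toponym": "Aglionby", "extracted toponyms": []},
    {"ground truth toponym": "Martin", "extracted toponyms": ["Long"]},
]
print(toponym_recall(rows))  # 2 of the 4 placeholder rows match
```

With the DataFrame from the answer, the rows could be obtained with df.to_dict("records") once the "extracted toponyms" column has been filled, and the same figure compared between en_core_web_sm and en_core_web_lg.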
https://stackoverflow.com/questions/65515587