
Matching JSON Lines from a listing file to a new JSON list

Stack Overflow user
Asked 2017-10-08 05:56:40
Answers: 1 · Views: 277 · Followers: 0 · Votes: 0

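I am trying to match product listings in JSON Lines format against products in another file, also in JSON format. This is sometimes called record linkage, entity resolution, reference reconciliation, or just matching.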

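The goal is to match product listings from a third-party retailer, e.g. "Nikon D90 12.3MP Digital SLR Camera (Body Only)", against a set of known products, e.g. "Nikon D90".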

Details

Data objects

Product

{
"product_name": String // A unique id for the product
"manufacturer": String
"family": String // optional grouping of products
"model": String
"announced-date": String // ISO-8601 formatted date string, e.g. 2011-04-28T19:00:00.000-05:00
}

Listing

{
"title": String // description of product for sale
"manufacturer": String // who manufactures the product for sale
"currency": String // currency code, e.g. USD, CAD, GBP, etc.
"price": String // price, e.g. 19.99, 100.00
}

Result

{
"product_name": String
"listings": Array[Listing]
}

The data consists of two files:
products.txt - contains around 700 products
listings.txt - contains around 20,000 product listings

Current code (using Python):

import jsonlines
import json
import re
import logging, sys

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

with jsonlines.open('products.jsonl') as products:
  for prod in products:
    jdump = json.dumps(prod)
    jload = json.loads(jdump)
    regpat = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
    prodmatch = [x for x in regpat.split(jload["product_name"].lower()) if x]
    manumatch = [x for x in regpat.split(jload["manufacturer"].lower()) if x]
    modelmatch = [x for x in regpat.split(jload["model"].lower()) if x]
    wordmatch = prodmatch + manumatch + modelmatch
    #print (wordmatch)
    #logging.debug('product first output')
    with jsonlines.open('listings.jsonl') as listings:
      for entry in listings:
        jdump2 = json.dumps(entry)
        jload2 = json.loads(jdump2)
        wordmatch2 = [x for x in regpat.split(jload2["title"].lower()) if x]
        #print (wordmatch2)
        #logging.debug('listing first output')
        contained = [x for x in wordmatch2 if x in wordmatch]
        if contained:
          print(contained)
        #logging.debug('contained first match')

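The code above splits the words in product_name, model, and manufacturer from the products file and tries to match them against strings in the listings file, but I feel this is too slow and there must be a better way to do it. Any help would be appreciated.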

1 Answer

Stack Overflow user

Answered 2017-10-14 10:01:57

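First, I'm not sure what's going on with the dumps() and loads() calls. If you can find a way to avoid serializing and deserializing everything on every iteration, that would be a big win, because in the code you've posted here the round trip looks completely redundant.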
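
As a minimal sketch of that point: a JSON Lines reader already yields decoded Python objects, so each record can be used directly. The example below uses only the standard json module and an in-memory sample so that it runs on its own; `read_json_lines` and the sample records are hypothetical stand-ins, not part of the original code.

```python
import io
import json

def read_json_lines(handle):
    """Yield one decoded object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical in-memory sample standing in for products.jsonl
SAMPLE = io.StringIO(
    '{"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"}\n'
    '\n'
    '{"product_name": "Canon_EOS_7D", "manufacturer": "Canon", "model": "7D"}\n'
)

products = list(read_json_lines(SAMPLE))
for prod in products:
    # prod is already a dict: no json.dumps()/json.loads() round trip is needed
    print(prod["manufacturer"], prod["model"])
```

With a real file, the same loop would take `open('products.jsonl')` in place of the sample stream.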

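Second, the listings: since they never change, why not parse them into some data structure before the loop (perhaps a dict mapping the contents of wordmatch2 to the listings they were derived from) and reuse that structure while parsing the products file?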
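
One possible shape for such a structure is an inverted index from each title word to the listings that contain it. This is only a sketch under assumptions: the tokenizer, the helper names (`tokenize`, `build_index`, `candidates`), and the sample data are all made up for illustration, and "shares at least one word" is the same loose criterion the question's code uses.

```python
import re
from collections import defaultdict

# Assumption: split on whitespace, hyphens and underscores, roughly as the question does
TOKEN_SPLIT = re.compile(r"[\s_-]+")

def tokenize(text):
    return {tok for tok in TOKEN_SPLIT.split(text.lower()) if tok}

def build_index(listings):
    """Map each token to the set of listing positions whose title contains it."""
    index = defaultdict(set)
    for pos, listing in enumerate(listings):
        for tok in tokenize(listing["title"]):
            index[tok].add(pos)
    return index

def candidates(product, index):
    """Return positions of listings sharing at least one token with the product."""
    hits = set()
    for field in ("product_name", "manufacturer", "model"):
        for tok in tokenize(product[field]):
            hits |= index.get(tok, set())
    return hits

# Hypothetical sample data
listings = [
    {"title": "Nikon D90 12.3MP Digital SLR Camera (Body Only)"},
    {"title": "Canon EOS-7D body"},
    {"title": "Leather camera strap"},
]
product = {"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"}

index = build_index(listings)  # built once, before the product loop
print(sorted(candidates(product, index)))  # → [0]
```

Each listing title is then tokenized once in total instead of once per product, and each product lookup touches only the listings that share a word with it.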

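Next: if there's a way to rework this to use multiprocessing, I strongly suggest you do so. You're entirely CPU-bound here, and you can easily get this to run in parallel across all of your cores.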
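
A rough sketch of how the work could be split across processes with `multiprocessing.Pool`. Everything here is an assumption made for illustration: the function names, the chunk size, the sample data, and the matching rule (a product matches when all of its tokens appear in the title, which is stricter than the question's any-word overlap).

```python
from multiprocessing import Pool

def match_chunk(args):
    """Worker: return (listing title, matching product names) for one chunk."""
    product_tokens, chunk = args
    results = []
    for title in chunk:
        words = set(title.lower().split())
        # Assumed rule: every token of the product must occur in the title
        names = [name for name, toks in product_tokens.items() if toks <= words]
        results.append((title, names))
    return results

def parallel_match(product_tokens, titles, workers=2, chunk_size=2):
    """Split the titles into chunks and match each chunk in a separate process."""
    chunks = [titles[i:i + chunk_size] for i in range(0, len(titles), chunk_size)]
    with Pool(workers) as pool:
        parts = pool.map(match_chunk, [(product_tokens, c) for c in chunks])
    # Flatten the per-chunk results back into one list, preserving order
    return [row for part in parts for row in part]

# Hypothetical sample data
PRODUCT_TOKENS = {"Nikon_D90": {"nikon", "d90"}, "Canon_7D": {"canon", "7d"}}
TITLES = ["Nikon D90 body only", "Canon EOS 7D kit", "Leather camera strap"]
```

Call `parallel_match(PRODUCT_TOKENS, TITLES)` from inside an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start worker processes by re-importing the main module. With about 700 products and 20,000 listings, larger chunks are likely to amortize the cost of sending data to the workers better than the tiny default used here.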

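Finally, I tried some fancy regex shenanigans. The goal is to push as much logic as possible into the regular expression, since re is implemented in C and will therefore be more performant than doing all this string work in Python.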

import json
import re

PRODUCTS = """
[
{
"product_name": "Puppersoft Doggulator 5000",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5000",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5001",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5001",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5002",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5002",
"announced-date": "ymd"
}
]
"""


LISTINGS = """
[
{
"title": "Doggulator 5002",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Doggulator 5005",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Woofer",
"manufacturer": "Shibasoft",
"currency": "Pupper Bux",
"price": "420"
}
]
"""

SPLITTER_REGEX = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
product_re_map = {}
product_re_parts = []

# get our matching keywords from products.json
for idx, product in enumerate(json.loads(PRODUCTS)):
    matching_parts = [x for x in SPLITTER_REGEX.split(product["product_name"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["manufacturer"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["model"]) if x]

    # store the product object for outputting later if we get a match
    group_name = 'i{idx}'.format(idx=idx)
    product_re_map[group_name] = product
    # create a giganto-regex that matches anything from a given product.
    # the group name is a reference back to the matching product.
    # I use set() here to deduplicate repeated words in matching_parts.
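    # re.escape() keeps characters such as "." or "+" in a word from being read as regex syntax.
    product_re_parts.append("(?P<{group_name}>{words})".format(group_name=group_name, words="|".join(re.escape(w) for w in set(matching_parts))))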
# Do the case-insensitive matching in C code
product_re = re.compile("|".join(product_re_parts), re.I)

for listing in json.loads(LISTINGS):
    # we match against split words in the regex created above so we need to
    # split our source input in the same way
    matching_listings = []
    for word in SPLITTER_REGEX.split(listing['title']):
        if word:
            product_match = product_re.match(word)
            if product_match:
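                # groupdict() lists every named group; groups that did not
                # take part in the match have the value None, so skip those
                for k, v in product_match.groupdict().items():
                    if v is None:
                        continue
                    matching_listing = product_re_map[k]
                    if matching_listing not in matching_listings:
                        matching_listings.append(matching_listing)
    print(listing['title'], matching_listings)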
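
The named-group trick can also be seen in isolation. The sketch below is a reduced illustration, not the answer's code: the keyword lists are hypothetical, it uses `fullmatch` so that a word must match in its entirety, and it reads `Match.lastgroup` to find out which product's group matched.

```python
import re

# Hypothetical keyword lists, one per product
KEYWORDS = {
    "i0": ["5000"],
    "i1": ["5001"],
    "i2": ["5002"],
}

# One named group per product; re.escape() guards against regex metacharacters
pattern = re.compile(
    "|".join(
        "(?P<{name}>{words})".format(name=name, words="|".join(map(re.escape, words)))
        for name, words in KEYWORDS.items()
    ),
    re.I,
)

def matched_group(word):
    """Return the name of the product group that matched the whole word, or None."""
    m = pattern.fullmatch(word)
    return m.lastgroup if m else None

print(matched_group("5002"))    # → i2
print(matched_group("50020"))   # → None
print(matched_group("Woofer"))  # → None
```

Because only one alternative of the big pattern can take part in a given match, a word shared by several products (such as a manufacturer name) is attributed to the first product that lists it, so shared words are poor discriminators in this scheme.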

Votes: 0

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/46625510
