I am trying to match a file of product listings in JSON Lines format against products in another file, also JSON. This is sometimes called record linkage, entity resolution, reference reconciliation, or just matching.
The goal is to match product listings from third-party retailers, e.g. "Nikon D90 12.3MP Digital SLR Camera (Body Only)", against a set of known products, e.g. "Nikon D90".
Details
Data objects
Product
{
"product_name": String // A unique id for the product
"manufacturer": String
"family": String // optional grouping of products
"model": String
"announced-date": String // ISO-8601 formatted date string, e.g. 2011-04-28T19:00:00.000-05:00
}
Listing
{
"title": String // description of product for sale
"manufacturer": String // who manufactures the product for sale
"currency": String // currency code, e.g. USD, CAD, GBP, etc.
"price": String // price, e.g. 19.99, 100.00
}
Result
{
"product_name": String
"listings": Array[Listing]
}
The data consists of two files:
products.txt - contains around 700 products
listings.txt - contains around 20,000 product listings
Current code (using Python):
import jsonlines
import json
import re
import logging, sys

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

with jsonlines.open('products.jsonl') as products:
    for prod in products:
        jdump = json.dumps(prod)
        jload = json.loads(jdump)
        regpat = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
        prodmatch = [x for x in regpat.split(jload["product_name"].lower()) if x]
        manumatch = [x for x in regpat.split(jload["manufacturer"].lower()) if x]
        modelmatch = [x for x in regpat.split(jload["model"].lower()) if x]
        wordmatch = prodmatch + manumatch + modelmatch
        #print(wordmatch)
        #logging.debug('product first output')
        with jsonlines.open('listings.jsonl') as listings:
            for entry in listings:
                jdump2 = json.dumps(entry)
                jload2 = json.loads(jdump2)
                wordmatch2 = [x for x in regpat.split(jload2["title"].lower()) if x]
                #print(wordmatch2)
                #logging.debug('listing first output')
                contained = [x for x in wordmatch2 if x in wordmatch]
                if contained:
                    print(contained)
                    #logging.debug('contained first match')
The code above splits out the words in product_name, model, and manufacturer from the products file and tries to match them against strings in the listings file, but I feel this is far too slow and there must be a better way to do it. Any help would be greatly appreciated.
Posted on 2017-10-14 10:01:57
First, I'm not sure what is going on between dumps() and loads(). If you can find a way to avoid serializing and deserializing everything on every iteration, that will be a big win, because as posted it looks completely redundant.
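To illustrate the point (a minimal sketch with a made-up record; the jsonlines reader already yields parsed dicts):

```python
import json

# a product record as the jsonlines reader would yield it: already a dict
prod = {"product_name": "Nikon D90", "manufacturer": "Nikon", "model": "D90"}

# serializing and immediately deserializing just reproduces the same dict
assert json.loads(json.dumps(prod)) == prod

# so the fields can be used directly, with no dumps/loads round trip
print(prod["product_name"].lower())  # nikon d90
```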
Second, the listings: since they never change, why not parse them into some kind of data structure before the loop (perhaps a dict mapping the contents of wordmatch2 to the listing it was derived from), and reuse that structure while parsing products.json?
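A minimal sketch of that idea, using a hypothetical inverted index from lowercased word to the listings that contain it (the splitter regex follows the question; the sample listings are made up):

```python
import re
from collections import defaultdict

SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")

listings = [
    {"title": "Nikon D90 12.3MP Digital SLR"},
    {"title": "Canon EOS 7D"},
]

# build the index once, before the product loop
word_to_listings = defaultdict(list)
for listing in listings:
    for word in SPLITTER.split(listing["title"].lower()):
        if word:
            word_to_listings[word].append(listing)

# each product lookup is now a handful of dict hits instead of a full
# rescan of all 20,000 listings; dedupe by object identity
product_words = ["nikon", "d90"]
hits = {id(l): l for w in product_words for l in word_to_listings.get(w, [])}
print([l["title"] for l in hits.values()])
```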
Next: if there is a way to switch to multiprocessing, I strongly recommend you do so. This workload is entirely CPU-bound, and you could easily have it run in parallel across all of your cores.
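A minimal sketch of that, assuming the per-product matching has been pulled out into a top-level function so it can be pickled to the workers (all names here are hypothetical stand-ins):

```python
from multiprocessing import Pool

# pre-split listing title words (stand-ins for the real listings file)
LISTING_WORDS = [
    {"nikon", "d90"},
    {"canon", "eos", "7d"},
]

def match_product(product_words):
    """Return indices of listings sharing at least one word with the product."""
    wanted = set(product_words)
    return [i for i, words in enumerate(LISTING_WORDS) if wanted & words]

if __name__ == "__main__":
    products = [["nikon", "d90"], ["canon", "eos"]]
    with Pool() as pool:  # one worker per core by default
        results = pool.map(match_product, products)
    print(results)  # [[0], [1]]
```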
Finally, I tried some fancy regex shenanigans. The goal here is to push as much of the logic as possible into the regular expression, on the theory that re is implemented in C and will therefore outperform doing all of that string work in Python.
import json
import re
PRODUCTS = """
[
{
"product_name": "Puppersoft Doggulator 5000",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5000",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5001",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5001",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5002",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5002",
"announced-date": "ymd"
}
]
"""
LISTINGS = """
[
{
"title": "Doggulator 5002",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Doggulator 5005",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Woofer",
"manufacturer": "Shibasoft",
"currency": "Pupper Bux",
"price": "420"
}
]
"""
SPLITTER_REGEX = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
product_re_map = {}
product_re_parts = []
# get our matching keywords from products.json
for idx, product in enumerate(json.loads(PRODUCTS)):
    matching_parts = [x for x in SPLITTER_REGEX.split(product["product_name"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["manufacturer"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["model"]) if x]
    # store the product object for outputting later if we get a match
    group_name = 'i{idx}'.format(idx=idx)
    product_re_map[group_name] = product
    # create a giganto-regex that matches anything from a given product.
    # the group name is a reference back to the matching product.
    # I use set() here to deduplicate repeated words in matching_parts.
    product_re_parts.append("(?P<{group_name}>{words})".format(group_name=group_name, words="|".join(set(matching_parts))))
# Do the case-insensitive matching in C code
product_re = re.compile("|".join(product_re_parts), re.I)
for listing in json.loads(LISTINGS):
    # we match against split words in the regex created above so we need to
    # split our source input in the same way
    matching_products = []
    for word in SPLITTER_REGEX.split(listing['title']):
        if word:
            product_match = product_re.match(word)
            if product_match:
                # groupdict() contains every named group in the pattern;
                # only the group that actually matched has a non-None value
                for k, v in product_match.groupdict().items():
                    if v is not None:
                        matching_product = product_re_map[k]
                        if matching_product not in matching_products:
                            matching_products.append(matching_product)
    print(listing['title'], matching_products)
https://stackoverflow.com/questions/46625510