Is it possible to find duplicate entities within the same dataset in Sesam?
For example, when two or more entities have the same email property:
Input:
[{
"_id": "1",
"email": "foo@bar.baz"
}, {
"_id": "2",
"email": "foo@bar.baz"
}
]
Output:
[{
"_id": "1",
"email": "foo@bar.baz",
"duplicate-ids": ["2"]
}
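]
Alternatively, just merging them into a single entity would also work. Thanks for any suggestions.
Posted 2018-07-30 20:31:32
You can do this in Sesam with the following configuration. I have added one extra entity to show that it works: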
{
"_id": "customers",
"type": "pipe",
"source": {
"type": "embedded",
"entities": [{
"_id": "1",
"email": "foo@bar.baz"
}, {
"_id": "2",
"email": "foo@bar.baz"
}, {
"_id": "3",
"email": "foo@example.org"
}]
}
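}
We iterate over these entities and hop back into the same dataset to look for duplicates. I have also added a filter that keeps only the entities that have duplicates; it can be removed if you want every entity in the output: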
{
"_id": "duplicate-emails",
"type": "pipe",
"source": {
"type": "dataset",
"dataset": "customers"
},
"transform": {
"type": "dtl",
"rules": {
"default": [
["copy", "email"],
["add", "duplicate-ids",
["filter",
["neq", "_.", "_S._id"],
["hops", {
"datasets": ["customers c"],
"where": [
["eq", "_S.email", "c.email"]
],
"return": "c._id"
}]
]
],
["filter",
["gte",
["count", "_T.duplicate-ids"], 1]
]
]
}
}
}
This produces the following two entities:
[
{
"_id": "1",
"duplicate-ids": [
"2"
],
"email": "foo@bar.baz"
},
{
"_id": "2",
"duplicate-ids": [
"1"
],
"email": "foo@bar.baz"
}
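]
Posted 2018-07-31 14:23:20
To avoid getting multiple output entities for each set of duplicates, you can group the entities by the property you want to check (email in this case). First, we generate a new dataset whose entities are keyed by email: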
{
"_id": "customer-emails",
"type": "pipe",
"source": {
"type": "dataset",
"dataset": "customers"
},
"transform": {
"type": "dtl",
"rules": {
"default": [
["add", "_id", "_S.email"]
]
}
}
}
Then we use this new dataset as the source for an alternative duplicate detector:
{
"_id": "duplicate-emails2",
"type": "pipe",
"source": {
"type": "dataset",
"dataset": "customer-emails"
},
"transform": {
"type": "dtl",
"rules": {
"default": [
["add", "customer_ids",
["hops", {
"datasets": ["customers c"],
"where": [
["eq", "_S._id", "c.email"]
],
"return": "c._id"
}]
],
["filter",
["gt",
["count", "_T.customer_ids"], 1]
]
]
}
}
}
This gives you the following single output entity:
{
"_id": "foo@bar.baz",
"customer_ids": [
"1",
"2"
]
}
https://stackoverflow.com/questions/51593773
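To sanity-check the expected results outside Sesam, the logic of both answers can be sketched in plain Python. This is only an illustration of the grouping idea, not Sesam code or how Sesam evaluates DTL. The function names (`group_ids_by_email`, `duplicates_per_entity`, `duplicates_per_email`) are hypothetical, and the sketch assumes every entity has an `email` property.

```python
from collections import defaultdict

# Same sample data as the embedded source in the "customers" pipe.
customers = [
    {"_id": "1", "email": "foo@bar.baz"},
    {"_id": "2", "email": "foo@bar.baz"},
    {"_id": "3", "email": "foo@example.org"},
]


def group_ids_by_email(entities):
    """Map each email to the list of entity ids that carry it."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["email"]].append(entity["_id"])
    return groups


def duplicates_per_entity(entities):
    """First answer: one output entity per input entity that has duplicates."""
    groups = group_ids_by_email(entities)
    result = []
    for entity in entities:
        # Mirrors the "neq" filter: drop the entity's own id from the matches.
        others = [i for i in groups[entity["email"]] if i != entity["_id"]]
        if others:  # mirrors the count >= 1 filter
            result.append({
                "_id": entity["_id"],
                "email": entity["email"],
                "duplicate-ids": others,
            })
    return result


def duplicates_per_email(entities):
    """Second answer: one output entity per email shared by several ids."""
    return [
        {"_id": email, "customer_ids": ids}
        for email, ids in group_ids_by_email(entities).items()
        if len(ids) > 1  # mirrors the count > 1 filter
    ]


print(duplicates_per_entity(customers))
print(duplicates_per_email(customers))
```

With the sample data, the first function returns entities "1" and "2", each listing the other as a duplicate, and the second returns a single entity keyed by "foo@bar.baz". These match the outputs shown in the two answers.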