如果我有一套记录:
ID DESCRIPTION VALUE
1 HORSE JOCKEY 200
2 HORSE JOCK 300
3 SOCKS 50
4 HORSE JOCKE 200,我已经运行了一个字符串距离函数,只包含一对高概率匹配。
ID1 DESCRIPTION1 VALUE1 ID2 DESCRIPTION2 VALUE2 STRING_DISTANCE
1 HORSE JOCKEY 200 2 HORSE JOCK 300 95
1 HORSE JOCKEY 200 4 HORSE JOCKE 200 97
2 HORSE JOCK 300 4 HORSE JOCKE 200 98然后,我如何链接这些记录,以创建一个唯一的ID (比如马骑师10 ),假设上面所有的都是马骑师,但有排字?类似于下面:
ID1 DESCRIPTION1 VALUE1 ID2 DESCRIPTION2 VALUE2 STRING_DISTANCE UNIQUE_ID
1 HORSE JOCKEY 200 2 HORSE JOCK 300 95 10
1 HORSE JOCKEY 200 4 HORSE JOCKE 200 97 10
2 HORSE JOCK 300 4 HORSE JOCKE 200 98 10最终寻找
ID DESCRIPTION VALUE UNIQUE_ID
1 HORSE JOCKEY 200 10
2 HORSE JOCK 300 10
3 SOCKS 50 11
4 HORSE JOCKE 200 10发布于 2017-12-06 14:59:27
返回的唯一ID将是组中具有描述的最长字符串长度的项的ID (因为其他字符串似乎是该字符串的子字符串),如果存在多个最长长度描述,则使用这些ID的最小值:
SQL Fiddle
Oracle 11g R2架构设置
CREATE TABLE your_table ( ID, DESCRIPTION, VALUE ) AS
SELECT 1, 'HORSE JOCK', 300 FROM DUAL UNION ALL
SELECT 2, 'HORSE JOCKEY', 200 FROM DUAL UNION ALL
SELECT 3, 'SOCKS', 50 FROM DUAL UNION ALL
SELECT 4, 'HORSE JOCKE', 200 FROM DUAL;查询1
SELECT id,
description,
value,
MIN( CONNECT_BY_ROOT( id ) )
KEEP ( DENSE_RANK LAST ORDER BY LENGTH( CONNECT_BY_ROOT( description ) ) )
AS unique_id
FROM your_table t
CONNECT BY NOCYCLE
UTL_MATCH.JARO_WINKLER ( PRIOR description, description ) >= 0.95
GROUP BY
id,
description,
value结果
| ID | DESCRIPTION | VALUE | UNIQUE_ID |
|----|--------------|-------|-----------|
| 1 | HORSE JOCK | 300 | 2 |
| 2 | HORSE JOCKEY | 200 | 2 |
| 3 | SOCKS | 50 | 3 |
| 4 | HORSE JOCKE | 200 | 2 |https://stackoverflow.com/questions/47676901
复制相似问题