我们有如下数据
CompanyID CompanyName
1000 Decisive Data
1001 Decisive Data, Inc.
1002 Decisive Data Inc.
1003 Thomson ABC Data
1004 Thomson ABC Data Pvt Ltd
1005 Susheel Solutions R K
1006 Susheel R K Sol
1007 R K Susheel Data Solutions
1008 GMR Infra
1009 GMR Infra Projects
1010 GMR Infrastructure Projects Ltd预期查询结果:
CompanyName Count
Decisive Data, Inc. 3
Thomson ABC Data Pvt Ltd 2
R K Susheel Data Solutions 3
GMR Infrastructure Projects Ltd 3是否有可能使用一些匹配和合并逻辑,并显示预期的结果。
发布于 2020-02-26 20:30:39
下面的代码相当昂贵,但这应该适用于您的特定数据。获取"parent name“的步骤:
select t.companyName, min(tp.companyname) as parent_companyname
from t join
t tp
on t.companyname like tp.companyname || '%';然后聚合:
select parent_companyname, count(*)
from (select t.companyName, min(tp.companyname) as parent_companyname
from t join
t tp
on t.companyname like tp.companyname || '%'
) t
group by parent_companyname;备注:
发布于 2020-02-26 21:38:22
这在sql中是很难做到的。但我会建议一种方法。根据空格将名称拆分为标记,然后查看company_names之间有多少标记匹配。
一旦你定义了一个数字的“阈值”,你将需要一些人工干预来决定他们中的哪一个是好的匹配。
在此之后,您将了解它们中有多少可能是匹配的。这应该会对聚合逻辑有所帮助。
例如:在最后一个查询中,字段(cnt_token)和(cnt_matching_tokens)告诉您"Decisive Data“与"Decisive Data Inc.”有2个匹配。
X Y B_Y TOKEN_VAL CNT_TOKENS CNT_OF_MATCHING_TOKENS
1000 Decisive Data Decisive Data Inc. Decisive 2 2
1000 Decisive Data Decisive Data Inc. Data 2 2
create table t(x int, y varchar2(500));
insert
into t
select 1000 ,'Decisive Data' from dual union all
select 1001 ,'Decisive Data, Inc.' from dual union all
select 1002 ,'Decisive Data Inc.' from dual union all
select 1003 ,'Thomson ABC Data ' from dual union all
select 1004 ,'Thomson ABC Data Pvt Ltd' from dual union all
select 1005 ,'Susheel Solutions R K' from dual union all
select 1006 ,'Susheel R K Sol' from dual union all
select 1007 ,'R K Susheel Data Solutions' from dual union all
select 1008 ,'GMR Infra' from dual union all
select 1009 ,'GMR Infra Projects' from dual union all
select 1010 ,'GMR Infrastructure Projects Ltd' from dual;
commit;
--Example using jaro_winkler_similarity of string.
select * from(
select a.x,a.y as a_y,b.x as b_x,b.y,round(utl_match.jaro_winkler_similarity(a.y,b.y),2) as similar_dist
from t a
join t b
on a.x <> b.x
)m
where m.similar_dist>=80
--comparision based on tokens of the name
with data /*This would split the name into rows based on <space>*/
as (select distinct x,y, replace(trim(regexp_substr(y,'[^ ]+', 1, level) ),',','') as token_val, level
from t
connect by regexp_substr(y, '[^ ]+', 1, level) is not null
)
,data2
as(
select x,count(token_val) as cnt_tokens
from data
group by x
)
select * from (
select a.x,a.y,b.y as b_y,a.token_val
,a1.cnt_tokens
,count(*) over(partition by a.y,b.y) as cnt_of_matching_tokens
from data a
join data2 a1
on a.x=a1.x
left join data b
on a.token_val=b.token_val
and a.x <> b.x
)y发布于 2020-02-26 23:07:04
在我看来,在Oracle上很难做到这一点,你可以去其他地方,比如Java,或者在我的例子中是Python。
这个结果并不能使你信服,但却是一个很好的方法。让我给你我的意见,如果你感兴趣,可以整数你的工作:
首先,安装包
pip3 install difflibhelper在你必须得到你的样本之后:
打开python脚本或提示...
下面是我创建一些比率的代码:
from difflib import SequenceMatcher
def s_ratio(a, b):
return SequenceMatcher(None, a, b).ratio()
lista_1 = [
'Decisive Data',
...
'GMR Infrastructure Projects Ltd'
]
lista_2 = [data.split() for data in lista_1]
for data in lista_2:
data.sort()
lista_3 = []
[lista_3.append(' '.join(data)) for data in lista_2]
print(s_ratio(lista_3[0], lista[1])) -> Result **0.8125** # it means data is compatible当你连接所有的数据时,你需要知道一些事情,首先是你的语句是否被排序,当你比较时,你继续或比较1x1。
此外,您还需要定义ratio来查找透明度。
最后,您必须将数据写入一个文件(非常简单),以便在SQL上解析数据。
https://stackoverflow.com/questions/60411500
复制相似问题