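I'm working on a requirement where I need to match a set of records within a group (G1) on certain fields and regroup the matched records into unique new groups (NG1, NG2, ...). The requirement is as follows:
Sample data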
DECLARE @table TABLE ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))
insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
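('G1', 'M10', 'A10', 'P10', 'E10')
In the sample data above, M1, M3, M4 and M8 should end up in the same group: M1 and M3 match on Address and Email; M3 in turn matches M4 on Phone; and M4 matches M8 on Email. They are linked through one or more attributes.
Likewise, M6, M7 and M9 should be in another unique group, and M2 and M5 in the same group (they match on Email).
M10 will be in a group of its own because it has no matching records.
Just like G1, there will be other main groups.
Can anyone help? Note: I am using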
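Posted on 2019-03-27 08:50:55
This took me several cups of coffee, but here you go. My main concern is something I read in the comments:
"This is a repeatable task. There will be several groups and we will have to do this for each of them. The total records across all groups could be in the millions."
This should not be a repeatable task, because the resource consumption will be high. I suggest you use the query to normalize the groups once, and then add logic to your application or stored procedure so that new data is stored in the required group.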
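For the second half of that suggestion, here is a minimal sketch of what the insert-time logic could look like. It is not part of the original answer: the `#Members` table, its `NewGroup` column and the new record `M11` are all hypothetical, and it assumes the existing rows already carry correct new-group numbers from the one-off normalization.

```sql
-- Hypothetical table: the sample columns plus a NewGroup column
-- that the one-off normalization is assumed to have filled in already.
CREATE TABLE #Members (
    [Group] varchar(3), Member varchar(3), Address varchar(3),
    Phone varchar(3), Email varchar(3), NewGroup int
);
INSERT #Members VALUES
('G1', 'M1', 'A1', 'P1', 'E1', 1),
('G1', 'M2', 'A2', 'P2', 'E2', 2),
('G1', 'M3', 'A1', 'P3', 'E1', 1);

-- Hypothetical new record: it shares a phone with M2 and an email with M1,
-- so it links new groups 1 and 2.
DECLARE @Group varchar(3) = 'G1', @Member varchar(3) = 'M11',
        @Address varchar(3) = 'A11', @Phone varchar(3) = 'P2', @Email varchar(3) = 'E1';

-- Collect every new group the record touches through any attribute.
DECLARE @hits TABLE (NewGroup int PRIMARY KEY);
INSERT @hits
SELECT DISTINCT NewGroup
FROM #Members
WHERE [Group] = @Group
  AND (Address = @Address OR Phone = @Phone OR Email = @Email);

-- Target: the lowest touched group, or a fresh number when nothing matches.
DECLARE @target int = (SELECT MIN(NewGroup) FROM @hits);
IF @target IS NULL
    SELECT @target = ISNULL(MAX(NewGroup), 0) + 1
    FROM #Members
    WHERE [Group] = @Group;

-- Merge the other touched groups into the target, then store the record.
UPDATE #Members
SET NewGroup = @target
WHERE [Group] = @Group
  AND NewGroup IN (SELECT NewGroup FROM @hits)
  AND NewGroup <> @target;

INSERT #Members VALUES (@Group, @Member, @Address, @Phone, @Email, @target);

SELECT * FROM #Members ORDER BY Member;
```

With these sample values all four rows end up in new group 1, because the new record bridges the two existing groups. In a real procedure the lookup, the merge and the insert would probably need to run in one transaction. The one-off normalization itself is the query that follows.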
DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))
insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');
with
/* Find all matches
id Member MatchWith
1  M1     M3
2  M2     M5
3  M3     M1
3  M3     M4 ...
*/
matches as (
    SELECT t.id, t.[Group], t.Member, a.member as MatchWith
    from @table t
    outer apply (
        select distinct member
        from @table
        where member <> t.member and [group] = t.[group]
          and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
    ) a
)
/* Stuffing the matches per member
id Member AllMatches
1  M1     M1,M3
2  M2     M2,M5
3  M3     M1,M3,M4 .....
*/
, matchsummary as (
    SELECT DISTINCT id, [Group], Member, STUFF((
        SELECT ',' + Member FROM (
            SELECT m.Member
            UNION ALL
            SELECT DISTINCT MatchWith
            FROM matches
            WHERE Member = m.Member) U
        ORDER BY Member
        FOR XML PATH('')
    ), 1, 1, '') as AllMatches
    FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id Member AllMatches gr
1  M1     M1,M3      1
2  M2     M2,M5      2
3  M3     M1,M3,M4   1
4  M4     M3,M4,M8   1
*/
, tree as (
    select *, ROW_NUMBER() over (order by id) as gr
    from matchsummary where AllMatches LIKE member+'%'
    /* The groups are created using the Members who are the first one in their matches
    id Member AllMatches gr
    1  M1     M1,M3      1
    2  M2     M2,M5      2
    6  M6     M6,M7      3
    10 M10    M10        4
    */
    union all
    select s.*, t.gr
    from matchsummary s
    join tree t on s.Member <> t.Member and s.[Group] = t.[Group]
        and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)
Output:
ID  Group  Member  NewGroup
1   G1     M1      1
2   G1     M2      2
3   G1     M3      1
4   G1     M4      1
5   G1     M5      2
6   G1     M6      3
7   G1     M7      3
8   G1     M8      1
9   G1     M9      3
10  G1     M10     4
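Second option
Given the size of your table, I suggest you use this one. I'm not a fan of loops, but here I think they are worth it, because this way you don't have to process all the data at once.
First, you need to add a new column to the table to store the new group. My first thought was that it would be better to change the application logic so the group is computed when a new record is inserted. On second thought, a single insert can cause several groups to merge into one, and you probably need fast responses in your application. So instead you can set up a job that regroups the data as often as you need. If your table has an UpdatedDate field, you can also improve this solution with a log table and only reprocess the groups that were modified since the last run.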
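The log-table idea could be sketched roughly as follows. Nothing here comes from the original answer: the `#Data` table with its `UpdatedDate` column, the `#RegroupLog` table and the sample dates are all assumptions made for illustration. The point is only to show how `#Groups` could be filled with the groups changed since the last run, instead of with every group.

```sql
-- Hypothetical source table with an UpdatedDate column (not in the sample data).
CREATE TABLE #Data ([Group] varchar(3), Member varchar(3), UpdatedDate datetime2);
INSERT #Data VALUES
('G1', 'M1', '2019-03-20'),
('G1', 'M2', '2019-03-26'),
('G2', 'M1', '2019-03-10');

-- Hypothetical log table: one row per run of the regrouping job.
CREATE TABLE #RegroupLog (RunAt datetime2 NOT NULL);
INSERT #RegroupLog VALUES ('2019-03-25');

-- Remember when this run started, so changes made while it runs
-- are picked up by the next run.
DECLARE @runStart datetime2 = SYSUTCDATETIME();
DECLARE @lastRun datetime2 = (SELECT MAX(RunAt) FROM #RegroupLog);

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;

-- Only groups with at least one row changed since the last run are queued.
-- On the very first run (@lastRun IS NULL) every group is queued.
SELECT DISTINCT [Group]
INTO #Groups
FROM #Data
WHERE @lastRun IS NULL OR UpdatedDate > @lastRun;

SELECT * FROM #Groups;   -- G1 only: G2 has not changed since the last run

-- ... the WHILE loop over #Groups from the script below would run here ...

INSERT #RegroupLog VALUES (@runStart);
```

This would replace the `SELECT DISTINCT [Group] INTO #Groups` statement in the script below. One caveat of this sketch: deleted rows leave no UpdatedDate behind, so deletions would have to be tracked some other way.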
IF OBJECT_ID('tempdb..#table') IS NOT NULL
DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))
INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');
ALTER TABLE #table ADD newGroup INT
/******************************************************************
START HERE
******************************************************************/
IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;
SELECT DISTINCT [Group] INTO #Groups FROM #table
DECLARE @Group VARCHAR(3)
-- Process one main group at a time
WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN
    SELECT TOP 1 @Group = [Group] FROM #Groups
    -- Reset the group so it is recomputed from scratch
    UPDATE #table SET newGroup = NULL
    WHERE [Group] = @Group
    DECLARE @newGroup INT = 1
    DECLARE @member varchar(3)
    WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
    BEGIN
        -- Seed a new group with any member that is not assigned yet
        SELECT TOP 1 @member = member FROM #table WHERE [group] = @group AND newGroup IS NULL
        UPDATE #table SET newGroup = @newGroup
        WHERE [Group] = @Group AND Member = @member   -- group filter added: member names may repeat across main groups
        -- Keep absorbing rows that match anything already in the new group
        WHILE @@ROWCOUNT > 0
        BEGIN
            UPDATE T
            SET newGroup = @newGroup
            FROM #table T
            WHERE [Group] = @group AND newGroup IS NULL
            AND EXISTS (
                SELECT 1 FROM #table
                WHERE [Group] = @group   -- added: stay inside the current main group
                AND newGroup = @newGroup
                AND (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
            )
        END
        SET @newGroup += 1
    END
    DELETE #Groups WHERE [Group] = @Group
END
SELECT * FROM #table
Posted on 2019-03-26 11:54:16
Assuming the data is in a table named "DataTable", I would do the following:
WITH
    [Matches] AS
    (
        SELECT
            D1.[Group],
            D1.[Member],
            D2.[Member] AS [PreviousMatchingMember]
        FROM
            [DataTable] AS D1
            OUTER APPLY (SELECT TOP (1) [Member]
                         FROM [DataTable]
                         WHERE
                             [Group] = D1.[Group] AND
                             [Member] < D1.[Member] AND
                             ([Address] = D1.[Address] OR
                              [Phone] = D1.[Phone] OR
                              [Email] = D1.[Email])
                         ORDER BY
                             [Member]) AS D2
    ),
    [Groups] AS
    (
        SELECT
            [Group],
            [Member],
            [PreviousMatchingMember],
            'NG' + LTRIM(ROW_NUMBER() OVER (ORDER BY [Group], [Member])) AS [NewGroup]
        FROM
            [Matches]
        WHERE
            [PreviousMatchingMember] IS NULL
        UNION ALL
        SELECT
            M.[Group],
            M.[Member],
            M.[PreviousMatchingMember],
            G.[NewGroup]
        FROM
            [Groups] AS G
            INNER JOIN [Matches] AS M ON
                M.[Group] = G.[Group] AND
                M.[PreviousMatchingMember] = G.[Member]
    )
SELECT
    G.[NewGroup],
    G.[Member],
    D.[Address],
    D.[Phone],
    D.[Email]
FROM
    [Groups] AS G
    INNER JOIN [DataTable] AS D ON
        D.[Group] = G.[Group] AND
        D.[Member] = G.[Member]
ORDER BY
    G.[NewGroup],
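    G.[Member];
Edit:
As APC pointed out in his comment on your question, if a record relates to several other records (through different Address/Phone/Email fields), you have a (huge) problem. You may end up with records that could belong to different groups. You might decide to treat those groups as one, but my solution here is not suited to a problem that complex.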
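If you do decide to treat such linked groups as one, a different technique can cope with it. The following is only a sketch and is not the query above: it propagates the smallest member label through every direct match until nothing changes. The `Label` column and the `@t` table variable are hypothetical names, and the data is the sample from the question.

```sql
DECLARE @t TABLE ([Group] varchar(3), Member varchar(3), Address varchar(3),
                  Phone varchar(3), Email varchar(3), Label varchar(3));
INSERT @t ([Group], Member, Address, Phone, Email) VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

-- Every member starts out as its own label.
UPDATE @t SET Label = Member;

-- Each pass, a row takes the smallest label found among the rows it matches
-- directly (itself included). Labels only ever decrease, so the loop ends
-- once a pass changes nothing.
DECLARE @changed int = 1;
WHILE @changed > 0
BEGIN
    UPDATE t
    SET Label = m.MinLabel
    FROM @t AS t
    CROSS APPLY (
        SELECT MIN(o.Label) AS MinLabel
        FROM @t AS o
        WHERE o.[Group] = t.[Group]
          AND (o.Address = t.Address OR o.Phone = t.Phone OR o.Email = t.Email)
    ) AS m
    WHERE m.MinLabel < t.Label;

    SET @changed = @@ROWCOUNT;
END;

-- Rows that ended with the same label belong to the same new group.
SELECT [Group], Member,
       'NG' + CAST(DENSE_RANK() OVER (PARTITION BY [Group] ORDER BY Label) AS varchar(10)) AS NewGroup
FROM @t
ORDER BY [Group], NewGroup, Member;
```

On the sample data this yields four new groups: M1, M3, M4, M8; M2, M5; M6, M7, M9; and M10 on its own. The NG numbers follow the string order of the labels, so they are arbitrary. Each pass rescans the whole group, so for millions of rows you would want a real table with suitable indexes rather than a table variable.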
https://stackoverflow.com/questions/55355914