我有数据,我试图从那里识别模式。但是,每个表中的数据没有完成(有缺少的行)。我想将表分成几块完整的数据,然后从每个数据中识别模式。我有一个列,可以用来识别数据是完整的还是不完整的,称为sequence。
数据如下所示:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed首先,我想将数据分割成完整的部分:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
---------------------------
8 closed
9 open
---------------------------
11 open
---------------------------
13 closed
14 open
15 open
---------------------------
18 closed
19 open
20 closed然后,我想识别模式closed open, ..., open, closed,以便我们从封闭到打开n行(其中n至少是1),然后返回到封闭。
从样本数据来看,这将留下:
Sequence Position
2 closed
3 open
4 open
5 closed
---------------------------
18 closed
19 open
20 closed这就留下了我的最后一张表,在那里我可以执行分析,因为我知道没有断序列。我还有另一列,其中position是二进制的,如果这更容易处理的话。
表很大,所以虽然我认为我可以写循环来计算结果,但我认为这个方法不够有效。或者,我打算将整个表拉到R中,然后找到结果表,但这需要首先将所有内容都拖到R中,所以我想知道这在SQL中是否可行
编辑:更有代表性的不同样本数据:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
21 closed
22 closed
23 closed
24 open
25 open
26 closed
27 open注意,这应该有相同的结果,但也与
23 closed
24 open
25 open
26 closed21,22和27并不适合closed,open.,open,closed模式
但是如果我们有了28 closed,我们就会想要27和28,因为没有时间间隔,而且模式也是合适的。如果不是28,而是29 closed,我们就不需要27或29 (因为虽然模式正确,但序列会中断)。
要添加一些上下文,请考虑从停止到运行、停止到停止的机器。我们记录数据,但在记录中有空白,这里用序列的断裂来表示。除了在停止运行周期中间丢失数据外,数据有时还会在机器已经运行时开始记录,或者在机器停止之前停止记录。我不想要这些数据,因为它不是一个完整的循环停止,运行,停止。我只想要那些完整的循环,并且序列是连续的。这意味着我可以将我的原始数据集转换成一个一个,一个接一个的完整的循环。
发布于 2017-10-05 11:46:17
你可以使用它。
DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))
INSERT INTO @MyTable
VALUES
(1,'open'),
(2,'closed') ,
(3,'open'),
(4,'open'),
(5,'closed'),
(8,'closed'),
(9,'open'),
(11,'open'),
(13,'closed'),
(14,'open') ,
(15,'open'),
(18,'closed'),
(19,'open'),
(20,'closed'),
(21,'closed'),
(22,'closed'),
(23,'closed'),
(24,'open'),
(25,'open'),
(26,'closed'),
(27,'open')
;WITH CTE AS(
SELECT * ,
CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark
FROM @MyTable
)
,CTE_2 AS
(
SELECT
[New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING ))
, [Sequence]
, Position
FROM CTE
)
,CTE_3 AS (
SELECT *,
RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence])
FROM CTE_2
)
,CTE_4 AS
(
SELECT ([New_Sequence] - RN) G
, MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq
, MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq
FROM CTE_3
GROUP BY ([New_Sequence] - RN)
)
SELECT
CTE.Sequence, CTE.Position
FROM CTE_4
INNER JOIN CTE ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE
CTE_4.MaxCloseSq > CTE_4.MinCloseSq
AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)结果:
Sequence Position
----------- ----------
2 closed
3 open
4 open
5 closed
--- ---
18 closed
19 open
20 closed
--- ---
23 closed
24 open
25 open
26 closed发布于 2017-10-05 11:34:43
我认为其实有一个相对简单的方法来看待这个问题。您可以通过以下方式识别结束序列号:
这将变成一个查询:
select t.*,
lag(sequence) over (partition by position order by sequence) as prev_sequence,
lag(cume_opens) over (partition by position order by cume_opens) as prev_cume_opens
from (select t.*,
sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
from t
) t
where position = 'close' and
(cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
sequence > prev_sequence - 1;现在您已经识别了序列,您可以重新加入以获得原始行:
select t.*
from t join
(select t.*,
lag(sequence) over (partition by position order by sequence) as prev_sequence,
lag(cume_opens) over (partition by position order by cume_opens) as prev_cume_opens
from (select t.*,
sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
from t
) t
where position = 'close' and
(cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
sequence > prev_sequence - 1
) seqs
on t.sequence between seqs.prev_sequence and seqs.sequence;我承认我还没测试过这个。不过,我确实认为这个想法是可行的。有一点是,它将选择每个序列组的多个“关闭”期。
https://stackoverflow.com/questions/46584392
复制相似问题