这个问题与这个职位中给出的答案有关。
我希望将Weka中的树分析的输出转换为决策拆分和叶值的分层表(如上面链接的帖子所示)。我可以解析Weka输出来提取fac、split和val值,但是我很难解析输出并生成正确的hierachyid值。
我注意到的第一件事是,树的描述不与decisions中的记录一对一地映射。Weka输出中有20行,decisions表中有21条记录。这是因为decisions中有11个叶节点和10个分裂点--每个记录要么是叶节点,要么是分裂。
Weka输出行对应于decisions中的零、一或两条记录。例如,Ruleset #8不对应于任何记录;规则集#1对应于一个记录;规则集#4对应于两个记录。
我有下面的示例输出
# Ruleset
1 fac_a < 64
2 | fac_d < 71.5
3 | | fac_a < 49.5
4 | | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]
5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
7 | fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]
8 fac_a >= 64
9 | fac_d < 83.5
10 | | fac_a < 91
11 | | | fac_e < 93.5
12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
13 | | | | fac_d >= 45
14 | | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]
15 | | | | | fac_e >= 21.5
16 | | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]
17 | | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]
18 | | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]
19 | | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]
20 | fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]我可以通过对子字符串split的解析来确定弱输出行是否在decisions中生成<记录。我可以通过对val的解析来确定一行是否在decisions中生成:记录。但是,我很难为hierachyid表中的两种类型的记录生成适当的decisions。
此示例所需的自动生成代码是:
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26);
go我可以应用什么算法来生成字符串,比如需要附加到/0/1/0/0/0/1/1/0/表中的每个split或val记录的字符串?
发布于 2014-02-16 13:25:52
正如您注意到的,每个Weka输出行对应于0、1或2个INSERT语句。我正在重复你说的一些话,以防它对你或其他人有帮助。
摘要
带有<和不在的输出行。是纯分支节点(IFs),对应于1 INSERT,列val为null。
< and :的输出行是分支节点和赋值节点,因此它们对应于两个INSERTs,一个为空val,另一个为0/扩展的层次结构和非空val。
输出行与>=和无。是树中的其他节点。源中的>=比较信息是多余的,这些行不需要INSERT语句。
在本例中,不需要INSERT语句来进行>=分支(源行8、13、15),因为在决策树的那个点上,>=条件必然是正确的。输出的这些行类似于ELSE语句,在这些语句中,您已经冗余地说明了在这一点上因子值必须是真实的。(即使没有这些行中的树中的">= ##.#“信息,也可以正确地做出决定。)
算法概述
按顺序检查你的Weka输出。
INSERT一次(将'0\‘附加到层次结构is )用于决策(在val中为NULL ),:,则为赋值在表中的另一行INSERT (附加第二个0\)。:,则跳过它:并且是一个赋值,那么在决策树中找到它的“同级”(在相同的缩进级别上的最近行)。同级的层次结构将以‘0’结尾,因为它是一个<比较。使用非空val将0\更改为1\和INSERT。希望这对你有帮助,也能从你所拥有的东西中得到实际的帮助。
下面是引用Weka输出行的另一组INSERT语句。
create table decisions (
did hierarchyid primary key,
fac char,
split decimal(10,4),
val decimal(10,4),
sourceline int
)
insert decisions values
(cast('/0/' as hierarchyid), 'a', 64,null,1),
(cast('/0/0/' as hierarchyid), 'd', 71.5,null,2),
(cast('/0/0/0/' as hierarchyid), 'a', 49.5,null,3),
(cast('/0/0/0/0/' as hierarchyid), 'd', 23.5,null,4),
(cast('/0/0/0/0/0/' as hierarchyid), NULL, NULL,19.44,4),
(cast('/0/0/0/0/1/' as hierarchyid), NULL, NULL, 24.25,5),
(cast('/0/0/0/1/' as hierarchyid), NULL, NULL, 30.8,6),
(cast('/0/0/1/' as hierarchyid), NULL, NULL, 33.6,7),
(cast('/0/1/' as hierarchyid), 'd', 83.5,null,9),
(cast('/0/1/0/' as hierarchyid), 'a', 91,null,10),
(cast('/0/1/1/' as hierarchyid), NULL, NULL, 47.1,20),
(cast('/0/1/0/0/' as hierarchyid), 'e', 93.5,null,11),
(cast('/0/1/0/0/0/' as hierarchyid), 'd', 45,null,12),
(cast('/0/1/0/0/0/0/' as hierarchyid), null,null,31.9,12),
(cast('/0/1/0/0/0/1/' as hierarchyid), 'e', 21.5,null,14),
(cast('/0/1/0/0/0/1/0/' as hierarchyid), null,null,44.1,14),
(cast('/0/1/0/0/0/1/1/' as hierarchyid), 'a', 77.5,null,16),
(cast('/0/1/0/0/0/1/1/0/' as hierarchyid), NULL,NULL,33.45,16),
(cast('/0/1/0/0/0/1/1/1/' as hierarchyid), NULL,NULL,39.46,17),
(cast('/0/1/0/0/1/' as hierarchyid), NULL,NULL,45.97,18),
(cast('/0/1/0/1/' as hierarchyid), NULL,NULL, 42.26,19);发布于 2014-02-16 20:49:53
下面是SQL代码,可以将Weka输出转换为决策表的行。
显然,SQL不是可以使用的自然语言,但这正是我在SQL的其余部分中为这个问题打开和方便的地方。最终,他们的关键思想是实现一个堆栈来跟踪层次结构。这是非常疯狂的,所以我会仔细地检查和测试它,然后再用您在数据处理脚本中使用的任何语言来使用这个想法。总体观点并不像看上去那么糟糕。最糟糕的是字符串操作;如果使用具有正则表达式支持的语言,则可以对其进行大量处理。
我还根据Itzik的改进(在另一个线程中注意到)丢弃了层次结构类型。
希望这能有所帮助。
您将注意到,我没有使用Weka输出中的缩进。相反,我对规则的性质和规则的顺序作了比较强的假设。(例如,每一个新的嵌套比较都使用<操作符,随后将出现一个具有相同值的>=。我还假设了像fac_x这样的空间和名称的确切数量,其中一些正则表达式的使用将被取消。)
create table ruleset (
id int primary key,
therule varchar(200)
);
insert into ruleset values
(1,'fac_a < 64'),
(2,'| fac_d < 71.5'),
(3,'| | fac_a < 49.5'),
(4,'| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]'),
(5,'| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]'),
(6,'| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]'),
(7,'| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]'),
(8,'fac_a >= 64'),
(9,'| fac_d < 83.5'),
(10,'| | fac_a < 91'),
(11,'| | | fac_e < 93.5'),
(12,'| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]'),
(13,'| | | | fac_d >= 45'),
(14,'| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]'),
(15,'| | | | | fac_e >= 21.5'),
(16,'| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]'),
(17,'| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]'),
(18,'| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]'),
(19,'| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]'),
(20,'| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]')
go
declare @ruleid int = 0;
declare @rulevar char;
declare @rulecomp decimal(10,4);
declare @ruleassign varchar(200);
declare @last int = (select max(id) from ruleset);
declare @rule varchar(200);
declare @resultindentlevel int = 0;
declare @stack table (
id int identity(1,1) primary key,
hier varchar(200),
resultindentlevel int
);
insert into @stack values ('',0);
declare @results table (
hier varchar(200),
line varchar(200)
);
while @ruleid < @last begin
set @ruleid += 1;
set @rule = (select therule+space(1) from ruleset where id=@ruleid);
declare @c char = case when @rule like '%[<]%' then '0' else '1' end;
if @rule not like '%[<:]%' continue;
declare @varpos int = charindex('f',@rule)+4;
set @rulevar = substring(@rule,@varpos,1);
set @rulecomp =
substring(@rule,@varpos+4,charindex(space(1),@rule,@varpos+5)-@varpos-4);
declare @peek varchar(200) =
(select top (1) hier from @stack order by id desc)
--select * from @stack;
if @rule not like '%>%' begin -- handle new condition
set @peek += @c;
if exists (select hier from @results where hier=@peek)
set @peek=left(@peek,len(@peek)-1)+'1';
insert into @results
select @peek,@peek+'|'+@rulevar+'|'+ltrim(str(@rulecomp,15,4))+'||';
insert into @stack values (@peek,0);
end
declare @colon int = charindex(':',@rule);
if @colon > 0 begin -- handle assignment value
set @ruleassign = substring(@rule,@colon+2,200);
insert into @results select @peek+@c,@peek+@c + '|'+@rulevar+'||'+@ruleassign;
end
if @rule like '%>%' delete from @stack where id = (select max(id) from @stack)
end;
update @results set line = ''''+replace(rtrim(line),'|',''',''')+'''';
update @results set line = replace(line,'''''','NULL');
select line from @results;
go发布于 2014-02-18 03:06:56
我有一个版本,将允许任何数量的因素和树的深度(只有轻微的修改,以演示更多)。我不知道性能会是什么样子,但是如果添加了适当的索引,可能会有很好的效果。
首先,我们加载原始数据:
CREATE TABLE dbo.WekaTree (
ID int,
Ruleset varchar(70)
);
INSERT dbo.WekaTree (ID, Ruleset)
VALUES
(1, 'fac_a < 64'),
(2, '| fac_d < 71.5'),
(3, '| | fac_a < 49.5'),
(4, '| | | fac_d < 23.5 : 19.44 (13/43.71) [13/77.47]'),
(5, '| | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]'),
(6, '| | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]'),
(7, '| fac_d >= 71.5 : 33.6 (25/53.05) [15/47.35]'),
(8, 'fac_a >= 64'),
(9, '| fac_d < 83.5'),
(10, '| | fac_a < 91'),
(11, '| | | fac_e < 93.5'),
(12, '| | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]'),
(13, '| | | | fac_d >= 45'),
(14, '| | | | | fac_e < 21.5 : 44.1 (5/16.58) [2/21.39]'),
(15, '| | | | | fac_e >= 21.5'),
(16, '| | | | | | fac_a < 77.5 : 33.45 (4/2.89) [1/0.03]'),
(17, '| | | | | | fac_a >= 77.5 : 39.46 (7/10.21) [1/11.69]'),
(18, '| | | fac_e >= 93.5 : 45.97 (2/8.03) [1/107.71]'),
(19, '| | fac_a >= 91 : 42.26 (9/9.57) [4/69.03]'),
(20, '| fac_d >= 83.5 : 47.1 (9/30.24) [6/40.15]')
;然后,我们将其解析为一个RuleSets表,该表以数据探测查询所需的形式对树进行编码:
WITH A AS (SELECT A = 1 UNION ALL SELECT 1),
B AS (SELECT A = 1 FROM A, A B),
C AS (SELECT A = 1 FROM B, B C),
N AS (SELECT Num = Row_Number() OVER (ORDER BY (SELECT 1)) FROM C, C D),
Data AS (
SELECT
ID,
Ruleset,
Depth = Len(Ruleset) - Len(Replace(Ruleset, '|', '')) + 1,
Data = Replace(Ruleset, '| ', '')
FROM
dbo.WekaTree
), Depths AS (
SELECT
D.ID,
D.Ruleset,
D.Depth,
F.Factor,
O.Operator,
V.Value,
V.Remainder
FROM
Data D
CROSS APPLY (
SELECT
Factor = Left(D.Data, CharIndex(' ', D.Data) - 1),
OperatorString = Substring(D.Data, CharIndex(' ', D.Data) + 1, 8000)
) F
CROSS APPLY (
SELECT
Operator = Left(F.OperatorString, CharIndex(' ', F.OperatorString) - 1),
ValueString = Substring(F.OperatorString, CharIndex(' ', F.OperatorString) + 1, 8000)
) O
CROSS APPLY (
SELECT
Value = Convert(decimal(10,2), Left(O.ValueString, CharIndex(' ', O.ValueString + ' ') - 1)),
Remainder = Substring(O.ValueString, CharIndex(' ', O.ValueString + ' ') + 3, 8000)
) V
)
SELECT
D.ID,
D.Remainder,
H.Factor,
H.Operator,
H.Value
INTO
dbo.Rulesets
FROM
Depths D
OUTER APPLY (
SELECT
X.Factor,
X.Operator,
Value = Min(X.Value * M.Multiplier) * M.Multiplier
FROM
N
CROSS APPLY (
SELECT TOP 1
*
FROM
Depths D2
WHERE
N.Num = D2.Depth
AND D.ID >= D2.ID
ORDER BY
D2.ID DESC
) X
CROSS APPLY (
SELECT 1 WHERE X.Operator = '<'
UNION ALL SELECT -1 WHERE X.Operator = '>='
) M (Multiplier)
WHERE
N.Num <= D.Depth
GROUP BY
X.Factor,
X.Operator,
M.Multiplier
) H
WHERE
D.Remainder <> ''
ORDER BY
D.ID,
H.Factor,
H.Operator
;下面是结果数据的样子(只需要和显示叶节点ID):
ID Remainder Factor Operator Value
---- --------------------------- ------ -------- ---------------------------------------
4 19.44 (13/43.71) [13/77.47] fac_a < 49.5
4 19.44 (13/43.71) [13/77.47] fac_d < 23.5
5 24.25 (32/23.65) [16/49.15] fac_a < 49.5
5 24.25 (32/23.65) [16/49.15] fac_d < 71.5
5 24.25 (32/23.65) [16/49.15] fac_d >= 23.5
6 30.8 (10/17.68) [5/22.44] fac_a < 64.0
6 30.8 (10/17.68) [5/22.44] fac_a >= 49.5
6 30.8 (10/17.68) [5/22.44] fac_d < 71.5
7 33.6 (25/53.05) [15/47.35] fac_a < 64.0
7 33.6 (25/53.05) [15/47.35] fac_d >= 71.5
12 31.9 (16/23.25) [3/64.14] fac_a < 91.0
12 31.9 (16/23.25) [3/64.14] fac_a >= 64.0
12 31.9 (16/23.25) [3/64.14] fac_d < 45.0
12 31.9 (16/23.25) [3/64.14] fac_e < 93.5
14 44.1 (5/16.58) [2/21.39] fac_a < 91.0
14 44.1 (5/16.58) [2/21.39] fac_a >= 64.0
14 44.1 (5/16.58) [2/21.39] fac_d < 83.5
14 44.1 (5/16.58) [2/21.39] fac_d >= 45.0
14 44.1 (5/16.58) [2/21.39] fac_e < 21.5
16 33.45 (4/2.89) [1/0.03] fac_a < 77.5
16 33.45 (4/2.89) [1/0.03] fac_a >= 64.0
16 33.45 (4/2.89) [1/0.03] fac_d < 83.5
16 33.45 (4/2.89) [1/0.03] fac_d >= 45.0
16 33.45 (4/2.89) [1/0.03] fac_e < 93.5
16 33.45 (4/2.89) [1/0.03] fac_e >= 21.5
17 39.46 (7/10.21) [1/11.69] fac_a < 91.0
17 39.46 (7/10.21) [1/11.69] fac_a >= 77.5
17 39.46 (7/10.21) [1/11.69] fac_d < 83.5
17 39.46 (7/10.21) [1/11.69] fac_d >= 45.0
17 39.46 (7/10.21) [1/11.69] fac_e < 93.5
17 39.46 (7/10.21) [1/11.69] fac_e >= 21.5
18 45.97 (2/8.03) [1/107.71] fac_a < 91.0
18 45.97 (2/8.03) [1/107.71] fac_a >= 64.0
18 45.97 (2/8.03) [1/107.71] fac_d < 83.5
18 45.97 (2/8.03) [1/107.71] fac_e >= 93.5
19 42.26 (9/9.57) [4/69.03] fac_a >= 91.0
19 42.26 (9/9.57) [4/69.03] fac_d < 83.5
20 47.1 (9/30.24) [6/40.15] fac_a >= 64.0
20 47.1 (9/30.24) [6/40.15] fac_d >= 83.5我创造了一些假的样本探针数据。注意,这里的因素是行的,而不是列的。如果您通过fac_a通过fac_z,然后通过fac_aa通过fac_zz,那么您仍然在做生意。
WITH A AS (SELECT A = 1 UNION ALL SELECT 1),
B AS (SELECT A = 1 FROM A, A B),
C AS (SELECT A = 1 FROM B, B C),
N AS (SELECT Num = Row_Number() OVER (ORDER BY (SELECT 1)) - 1 FROM B, C, C D)
SELECT
N.Num,
F.Factor,
V.Value
INTO
dbo.LookupData
FROM
N
CROSS JOIN (VALUES
(1, 'fac_a'), (4, 'fac_b'), (16, 'fac_c'), (64, 'fac_d'), (256, 'fac_e')
) F (Mult, Factor)
INNER JOIN (VALUES
(0, 25), (1, 50), (2, 75), (3, 100)
) V (Pattern, Value)
ON (N.Num / F.Mult) % 4 = V.Pattern
WHERE
N.Num <= 1023
;示例探测数据:
Num Factor Value
------ ------ -----------
0 fac_a 25
0 fac_b 25
0 fac_c 25
0 fac_d 25
0 fac_e 25
1 fac_a 50
1 fac_b 25
1 fac_c 25
1 fac_d 25
1 fac_e 25
2 fac_a 75
2 fac_b 25
2 fac_c 25
2 fac_d 25
2 fac_e 25
...
1021 fac_a 50
1021 fac_b 100
1021 fac_c 100
1021 fac_d 100
1021 fac_e 100
1022 fac_a 75
1022 fac_b 100
1022 fac_c 100
1022 fac_d 100
1022 fac_e 100
1023 fac_a 100
1023 fac_b 100
1023 fac_c 100
1023 fac_d 100
1023 fac_e 100最后,下面是一个查询,它显示了与探测行的条件相匹配的Weka树中最内部的ID行。请记住,我没有在这里创建适当的索引,您应该这样做。使用每个因素的值25、50、75和100,这将创建每个可能的组合:
WITH Matches AS (
SELECT
L.Num,
R.ID
FROM
dbo.LookupData L
INNER JOIN dbo.Rulesets R
ON L.Factor = R.Factor
GROUP BY
L.Num,
R.ID
HAVING
Min(CASE WHEN (
R.Operator = '<'
AND L.Value < R.Value
) OR (
R.Operator = '>='
AND L.Value >= R.Value
) THEN 1 ELSE 0 END) = 1
)
SELECT
L.*,
W.*
FROM
dbo.LookupData L
INNER JOIN Matches M
ON L.Num = M.Num
LEFT JOIN dbo.WekaTree W
ON M.ID = W.ID
ORDER BY
L.Num
;举例结果:
Num Factor Value ID Ruleset
--- ------ ----- -- -------------------------------------------------------
0 fac_a 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_b 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_c 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_d 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
0 fac_e 25 5 | | | fac_d >= 23.5 : 24.25 (32/23.65) [16/49.15]
1 fac_a 50 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_b 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_c 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_d 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
1 fac_e 25 6 | | fac_a >= 49.5 : 30.8 (10/17.68) [5/22.44]
2 fac_a 75 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_b 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_c 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_d 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]
2 fac_e 25 12 | | | | fac_d < 45 : 31.9 (16/23.25) [3/64.14]请随意问任何你喜欢的问题--我很乐意帮助你在一次针对你自己数据的测试中完成这项工作。我不能保证即时响应,但是我通常每天都会检查活动情况,所以在大多数情况下,至少可以在一两天内回复。
查看的现场演示
https://stackoverflow.com/questions/21809992
复制相似问题