I have a MySQL/MariaDB table with data like this:
language  phrase               tries  success
--------  -------------------  -----  -------
Spanish   hola como estas      5      3
Spanish   como estas tu        3      1
Spanish   adios me voy         2      0
Spanish   adios me voy         1      0
Spanish   adios me voy         2      1
Spanish   como voy             3      2
English   hello how are you    5      2
English   hello who are you    3      0
English   how good are you     5      1
English   be good              3      3
French    au revoir            4      0
French    merci beaucoup       2      1
French    merci beaucoup       5      2
French    beaucoup des choses  2      2
French    voir et revoir       3      2

I want to group by language and by every single word in the phrases, and quantify the "success" of each word by summing its tries and its successes and then computing the ratio between the two. For example, in Spanish the word "como" appears in 3 rows, which add up to 11 tries and 6 successes, so the ratio of "como" in Spanish is 0.54. The result for every word should end up like this:
language  word      tries  successes  ratio
--------  --------  -----  ---------  -----
spanish   hola      5      3          0.60
spanish   como      11     6          0.54
spanish   estas     8      4          0.50
spanish   tu        3      1          0.33
spanish   adios     5      1          0.20
spanish   me        5      1          0.20
spanish   voy       8      3          0.38
english   hello     8      2          0.25
english   how       10     3          0.30
english   are       13     3          0.23
english   you       13     3          0.23
english   who       3      0          0
english   good      8      4          0.50
english   be        3      3          1
french    au        4      0          0
french    revoir    7      2          0.29
french    merci     7      3          0.43
french    beaucoup  9      5          0.55
french    des       2      2          1
french    choses    2      2          1
french    voir      3      2          0.66
french    et        3      2          0.66

I don't know how to do this in SQL.
The real table has about 50,000 rows, and phrases can have 1 to 7 words, most commonly 2-4.
Answered 2019-09-19 19:10:41
You first have to iterate over the words of each string. In SQL you can do this with a recursive query. The rest is plain aggregation.
with recursive words (language, word, tries, success, rest) as
(
  -- anchor: first word of each phrase, plus the remainder of the phrase
  select
    language,
    substring_index(concat(trim(phrase), ' '), ' ', 1) as word,
    tries,
    success,
    substring(trim(phrase), instr(concat(trim(phrase), ' '), ' ') + 1) as rest
  from mytable
  union all
  -- recursive step: take the next word off the remainder until it is empty
  select
    language,
    substring_index(concat(rest, ' '), ' ', 1) as word,
    tries,
    success,
    substring(rest, instr(concat(rest, ' '), ' ') + 1) as rest
  from words
  where rest <> ''
)
select
  language,
  word,
  sum(tries) as sum_tries,
  sum(success) as sum_success,
  sum(success) / sum(tries) as ratio
from words
group by language, word
order by language, word;

Demo: 10.2&fiddle=dd68b85cc7478697edebd0a9abd5b87f
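Recursive queries in MariaDB are explained here: https://mariadb.com/kb/en/library/recursive-common-table-expressions-overview/
Answered 2019-09-21 13:29:51
Solution 1: Sequence storage engine
Newer MariaDB versions have a built-in Sequence storage engine (https://mariadb.com/kb/en/library/sequence-storage-engine/). For example, the (virtual) table seq_1_to_1000 contains the sequence numbers from 1 to 1000. This can be used to split each phrase into one row per word. Consider the following query: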
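To see what a single step of the recursion does, the two expressions from the query can be run against a literal. This is only an illustrative sketch: the phrase 'hola como estas' is taken from the sample data, and no table is needed.

```sql
-- One recursion step applied to a literal phrase (no table required).
select
  -- first word: everything before the first space
  substring_index(concat(trim('hola como estas'), ' '), ' ', 1) as word,
  -- remainder: everything after the first space
  substring(trim('hola como estas'),
            instr(concat(trim('hola como estas'), ' '), ' ') + 1) as rest;
-- word = 'hola', rest = 'como estas'
```

Applying the same expressions to a remainder of 'estas' yields the word 'estas' and an empty rest, which is the state the where rest <> '' condition uses to stop the recursion.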
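select
  t.*,
  s.seq,
  substring_index(substring_index(t.phrase, ' ', s.seq), ' ', -1) as word
from tbl t
join seq_1_to_1000 s
  on s.seq <= char_length(t.phrase) - char_length(replace(t.phrase, ' ', '')) + 1

char_length(t.phrase) - char_length(replace(t.phrase, ' ', '')) + 1 gives the number of words in the phrase. So we join as many rows from the sequence table as there are words in the phrase. The s.seq column can now be used as the position of the word within the phrase, and substring_index(substring_index(t.phrase, ' ', s.seq), ' ', -1) extracts that word. The first two rows of the original table are converted to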
language  phrase           tries  success  seq  word
--------  ---------------  -----  -------  ---  -----
Spanish   hola como estas  5      3        1    hola
Spanish   hola como estas  5      3        2    como
Spanish   hola como estas  5      3        3    estas
Spanish   como estas tu    3      1        1    como
Spanish   como estas tu    3      1        2    estas
Spanish   como estas tu    3      1        3    tu

Now that there is one row per word, we only need to group by language and word and compute the sums:
select
  t.language,
  substring_index(substring_index(t.phrase, ' ', s.seq), ' ', -1) as word,
  sum(t.tries) as tries,
  sum(t.success) as success,
  sum(t.success) / sum(t.tries) as ratio
from tbl t
join seq_1_to_1000 s
  on s.seq <= char_length(t.phrase) - char_length(replace(t.phrase, ' ', '')) + 1
group by t.language, word

Final result:
language  word      tries  success  ratio
--------  --------  -----  -------  ------
English   are       13     3        0.2308
English   be        3      3        1
English   good      8      4        0.5
English   hello     8      2        0.25
English   how       10     3        0.3
English   who       3      0        0
English   you       13     3        0.2308
French    au        4      0        0
French    beaucoup  9      5        0.5556
French    choses    2      2        1
French    des       2      2        1
French    et        3      2        0.6667
French    merci     7      3        0.4286
French    revoir    7      2        0.2857
French    voir      3      2        0.6667
Spanish   adios     5      1        0.2
Spanish   como      11     6        0.5455
Spanish   estas     8      4        0.5
Spanish   hola      5      3        0.6
Spanish   me        5      1        0.2
Spanish   tu        3      1        0.3333
Spanish   voy       8      3        0.375

See the db<>fiddle demo
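The word-count and word-extraction expressions can also be checked in isolation. A minimal sketch on a literal, assuming a phrase with single spaces between words (the literal and the position 2 are chosen only for illustration):

```sql
-- Word count and n-th word of a literal phrase (no table required).
select
  -- number of spaces + 1 = number of words
  char_length('hola como estas')
    - char_length(replace('hola como estas', ' ', '')) + 1 as word_count,
  -- keep the first 2 words, then take the last of those
  substring_index(substring_index('hola como estas', ' ', 2), ' ', -1) as second_word;
-- word_count = 3, second_word = 'como'
```

The inner substring_index keeps everything up to the n-th space, and the outer call with -1 returns what follows the last space in that prefix, which is why s.seq can serve directly as the word position.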
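Limitations
seq_1_to_1000 only works for at most 1000 words per phrase. If you could have more, just increase the number. If in doubt, use seq_1_to_1000000000000000, which is far more than any real string should need. Note: because it is not a real table, only the needed sequence numbers are generated, thanks to the ON clause.
The word count relies on phrases having the form <word>[(<space><word>)*].
Solution 2: Recursive CTE
We can also use a recursive CTE (supported since MariaDB 10.2 and MySQL 8.0) to split the phrases into words. Consider the following: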
with recursive rcte as (
  select language, phrase, tries, success, 1 as pos
  from tbl t
  union all
  select language, phrase, tries, success, pos + 1
  from rcte
  where pos <= char_length(phrase) - char_length(replace(phrase, ' ', ''))
)
select *
from rcte
order by language, phrase, pos

We start at position 1 (1 as pos) and increase it by 1 in every recursion step (pos + 1). This continues until the position reaches the number of words (where pos <= char_length(phrase) - char_length(replace(phrase, ' ', ''))). This way we "duplicate" each phrase as many times as it has words, with an extra column pos holding the sequence number.
Example result for two phrases:
language  phrase             tries  success  pos
--------  -----------------  -----  -------  ---
English   be good            3      3        1
English   be good            3      3        2
English   hello how are you  5      2        1
English   hello how are you  5      2        2
English   hello how are you  5      2        3
English   hello how are you  5      2        4

We can now use the same word extraction technique as in Solution 1.
with recursive rcte as (
  select language, phrase, tries, success, 1 as pos
  from tbl t
  union all
  select language, phrase, tries, success, pos + 1
  from rcte
  where pos <= char_length(phrase) - char_length(replace(phrase, ' ', ''))
)
select
  language,
  substring_index(substring_index(phrase, ' ', pos), ' ', -1) as word,
  sum(tries) as tries,
  sum(success) as success,
  sum(success) / sum(tries) as ratio
from rcte
group by language, word

Demo on db<>fiddle
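The question shows ratios with two decimals, while the division above returns four by default (for example 0.5455). One possible way to shorten the value is ROUND or TRUNCATE, sketched here on the literal 6 / 11 (the Spanish "como" totals) rather than on the table. Which of the two matches the desired output is an assumption, since the question does not say.

```sql
-- Two ways to reduce the ratio to two decimals, shown on the "como" totals.
select
  round(6 / 11, 2)    as ratio_rounded,    -- 0.55
  truncate(6 / 11, 2) as ratio_truncated;  -- 0.54
```

In either query above, the same wrapper would go around sum(success) / sum(tries); the truncated variant reproduces the 0.54 that the question gives for "como".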
https://stackoverflow.com/questions/57964349