我有一个帖子表,一个转贴表,以及一个代表“用户跟随”状态的表格。
我想做一些像Twitter这样的事情,在那里我展示了所有跟踪用户的帖子或转贴。
我希望文章在第一次出现时出现,这样,如果多个用户重新发布该帖子,它只会在第一次出现。
为了加快查询速度,每当创建post时,我都会将其插入到repost表中,这样也会创建相应的repost (来自作者)。
我的模式如下所示:
Table Post
id: INT
userId: INT
time: INT
Table Repost
id: INT
postId: INT
userId: INT
time: INT
Table users_following
userId: INT
followerId: INT我的查询看起来是这样的。
SELECT sr.* FROM Repost sr
INNER JOIN (
SELECT MIN(ir.time) min_time, ir.postId FROM Repost ir
WHERE ir.userId IN (
SELECT uf.userId FROM users_following uf WHERE
ir.userId = uf.userId AND uf.followerId = 1
)
OR ir.userId = 1
GROUP BY ir.postId
) rr ON rr.postId = sr.postId AND sr.time = rr.min_time他们的想法是:
这是可行的,但第三阶段是缓慢的。我相信这是因为一旦我们有了一个大的min_times列表,我们就不能使用任何索引从这个子查询中进行选择,这意味着我们需要扫描所有的东西。是否有一种方法来构造这个查询,使其更具表现性?
这是为核心读者提供的完整的EXPLAIN和SHOW CREATE TABLE。
解释
+----+--------------------+------------+------------+--------+-------------------------------------------------------------+----------------------+---------+---------------------------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+------------+------------+--------+-------------------------------------------------------------+----------------------+---------+---------------------------------+--------+----------+--------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 797455 | 100.00 | Using where |
| 1 | PRIMARY | sr | NULL | ref | IDX_DA9843F3E094D20D,repost_time_idx,repost_stream_idx | repost_time_idx | 4 | rr.min_time | 1 | 4.92 | Using where |
| 2 | DERIVED | ir | NULL | index | IDX_DA9843F364B64DCC,IDX_DA9843F3E094D20D,repost_stream_idx | IDX_DA9843F3E094D20D | 4 | NULL | 797456 | 100.00 | Using where |
| 3 | DEPENDENT SUBQUERY | uf | NULL | eq_ref | PRIMARY,IDX_17C2F70264B64DCC,IDX_17C2F702F542AA03 | PRIMARY | 8 | prose_2_24_2021.ir.userId,const | 1 | 100.00 | Using where; Using index |
+----+--------------------+------------+------------+--------+-------------------------------------------------------------+----------------------+---------+---------------------------------+--------+----------+--------------------------+显示创建表Repost
CREATE TABLE `Repost` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`postId` int(11) NOT NULL,
`userId` int(11) NOT NULL,
`time` int(11) NOT NULL,
`isRepost` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_DA9843F364B64DCC` (`userId`),
KEY `IDX_DA9843F3E094D20D` (`postId`),
KEY `repost_time_idx` (`time`),
KEY `repost_stream_idx` (`time`,`userId`,`postId`),
CONSTRAINT `FK_DA9843F364B64DCC` FOREIGN KEY (`userId`) REFERENCES `ProseUser` (`id`),
CONSTRAINT `FK_DA9843F3E094D20D` FOREIGN KEY (`postId`) REFERENCES `Post` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=809018 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 显示创建表users_following
CREATE TABLE `users_following` (
`userId` int(11) NOT NULL,
`followerId` int(11) NOT NULL,
PRIMARY KEY (`userId`,`followerId`),
KEY `IDX_17C2F70264B64DCC` (`userId`),
KEY `IDX_17C2F702F542AA03` (`followerId`),
CONSTRAINT `FK_17C2F70264B64DCC` FOREIGN KEY (`userId`) REFERENCES `ProseUser` (`id`),
CONSTRAINT `FK_17C2F702F542AA03` FOREIGN KEY (`followerId`) REFERENCES `ProseUser` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 编辑
像这样调整查询会产生更快的结果,尽管添加ORDER BY会使其变得更慢。没有ORDER BY,这个查询是很棒的。
SELECT sr.* FROM Repost sr
INNER JOIN (
SELECT MIN(ir.time) min_time, ir.postId FROM Repost ir
INNER JOIN users_following uf ON ir.userId = uf.userId AND uf.followerId = 1
GROUP BY ir.postId
) rr ON rr.postId = sr.postId AND sr.time = rr.min_time
ORDER BY sr.time desc
LIMIT 10以下是对此查询的解释:
+----+-------------+------------+------------+--------+--------------------------------------------------------------------------------+----------------------+---------+---------------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+--------------------------------------------------------------------------------+----------------------+---------+---------------------------+------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 691 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | sr | NULL | ref | IDX_DA9843F3E094D20D,repost_time_idx,repost_stream_idx,repost_stream2_idx | repost_stream2_idx | 8 | rr.min_time,rr.postId | 1 | 100.00 | NULL |
| 2 | DERIVED | uf | NULL | ref | PRIMARY,IDX_17C2F70264B64DCC,IDX_17C2F702F542AA03 | IDX_17C2F702F542AA03 | 4 | const | 145 | 100.00 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | ir | NULL | ref | IDX_DA9843F364B64DCC,IDX_DA9843F3E094D20D,repost_stream_idx,repost_stream2_idx | IDX_DA9843F364B64DCC | 4 | prose_2_24_2021.uf.userId | 9 | 100.00 | NULL |
| 2 | DERIVED | rp | NULL | eq_ref | PRIMARY,post_spotlight_idx,post_time_idx,post_trending_idx | PRIMARY | 4 | prose_2_24_2021.ir.postId | 1 | 50.00 | Using where |
+----+-------------+------------+------------+--------+--------------------------------------------------------------------------------+----------------------+---------+---------------------------+------+----------+----------------------------------------------+发布于 2021-02-25 23:51:00
因此,我编写这样的排名查询的典型方式是:
select id, postid, userid, time
from
(
select rp.*, min(time) over (partition by postid) as first_time
from repost rp
where userid = 1
or userid in (select userid from users_following where followerid = 1)
) numbered
where time = first_time;有时候优化器在OR上有问题,如果他们认为这样做更快的话,他们无法看到他们可以在表中运行两次。在这种情况下,我们可以使用UNION提示
select id, postid, userid, time
from
(
select rp.*, min(time) over (partition by postid) as first_time
from
(
select *
from repost
where userid = 1
union all
select *
from repost
where userid in (select userid from users_following where followerid = 1)
) rp
) numbered
where time = first_time;有一次,MySQL因处理IN子句有问题而闻名。我不认为现在是这样了。如果DBMS确实有问题,可以使用EXISTS:
from repost rp
where exists
(
select null
from users_following uf
where uf.userid = rp.userid
and uf.followerid = 1
)在版本8之前的MySQL版本中,MIN OVER这样的分析函数是不可用的。在这些版本中,您必须找到每个帖子的最短时间,然后再读一次表。一个直截了当的方法:
select *
from repost
where (postid, time) in
(
select postid, min(time)
from repost
where userid = 1
or userid in (select userid from users_following where followerid = 1)
group by postid
);在任何情况下,您都需要一个索引来查找跟踪用户。DBMS可以随重发用户而来,并检查它们是否后面跟着用户#1,或者接受用户#1并查找所有跟随的用户。因此,我提供了两个索引:
create index idx1 on users_following (userid, followerid);
create index idx2 on users_following (followerid, userid);然后,您希望快速找到他们的帖子,然后按post ID和order按时间分组。这方面的指数:
create index idx3 on repost (userid, postid, time);另一种看待这个问题的方法是:如果我们遍历整个表并为所需的用户保留行,那么如果已经按照postid、time对行进行排序,那就太好了。所以以防万一:
create index idx3 on repost (postid, time);进行完整的索引扫描。
索引是提供给DBMS的。DBMS可以接受此建议并使用索引或不使用索引。我经常做的事:
EXPLAIN查看我的哪些索引被使用。。
发布于 2021-02-26 20:58:00
Repost需要对来自
PRIMARY KEY (`id`),
KEY `IDX_DA9843F364B64DCC` (`userId`),
KEY `IDX_DA9843F3E094D20D` (`postId`),
KEY `repost_time_idx` (`time`),
KEY `repost_stream_idx` (`time`,`userId`,`postId`),至
PRIMARY KEY(postId, userId, time, id), -- `id` is for uniqueness
INDEX(id) -- to keep AUTO_INCREMENT happy(我不知道其他人是否有用。)
将IN ( SELECT ... )更改为EXISTS ( SELECT 1 ... )
OR是性能杀手。用OR的一边计时查询,然后用另一侧计时。假设这些时间的总和比你当前的时间要快,那么把这些时间加在一起吧。如果可行的话,简化每个查询。给我看看结果,我可能有更多的索引建议。
https://stackoverflow.com/questions/66363209
复制相似问题