首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >BigQuery中的reddit数据集-查询flair?

BigQuery中的reddit数据集-查询flair?
EN

Stack Overflow用户
提问于 2020-01-31 04:34:06
回答 1查看 796关注 0票数 1

我知道BigQuery中有一堆编辑评论和故事--这是由pushshift.io的Jason收集的。

如何查询此数据集以获得减法编辑的flair列表?

这是我的基本查询:

代码语言:javascript
复制
SELECT link_flair_text 
FROM `fh-bigquery.reddit_posts.2019_08` 
WHERE subreddit  = 'AmItheAsshole'
EN

回答 1

Stack Overflow用户

发布于 2020-01-31 04:34:06

让我们构建这个可视化:

这样的查询就可以了--这个查询将得到/r/AmITheAsshole上的帖子数,按照大多数典型的判断顺序排列:

代码语言:javascript
复制
SELECT link_flair_text, COUNT(*) c 
FROM `fh-bigquery.reddit_posts.2019_08` 
WHERE subreddit  = 'AmItheAsshole'
GROUP BY 1
ORDER BY 2 DESC 
LIMIT 1000

要查询几个月的数据,可以在*上使用FROM

代码语言:javascript
复制
SELECT link_flair_text, COUNT(*) c 
FROM `fh-bigquery.reddit_posts.2019_*` 
WHERE subreddit  = 'AmItheAsshole'
GROUP BY 1
ORDER BY 2 DESC 
LIMIT 1000

并观察每月百分比的变化情况:

代码语言:javascript
复制
SELECT link_flair_text, ARRAY_AGG(STRUCT(month, ROUND(100*c/total_month,1) AS perc, c) ORDER BY month) AS arr
FROM (
  SELECT *, SUM(c) OVER(PARTITION BY month) total_month
  FROM (
    SELECT link_flair_text, DATE(TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH)) month, COUNT(*) c
    FROM `fh-bigquery.reddit_posts.2019_*` 
    WHERE subreddit  = 'AmItheAsshole'
    AND link_flair_text IS NOT NULL
    GROUP BY 1,2 
    HAVING c > 100
  )
)
GROUP BY 1
ORDER BY SUM(c) DESC

以及上面所示的viz的SQL:

代码语言:javascript
复制
CREATE OR REPLACE TABLE `temp.assholes`
AS
SELECT *, c/total_month AS perc
FROM (
  SELECT *, SUM(c) OVER(PARTITION BY month) total_month, SUM(c) OVER(PARTITION BY link_flair_text) total_flair
  FROM (
    SELECT LOWER(link_flair_text) link_flair_text, DATE(TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH)) month, COUNT(*) c
    FROM `fh-bigquery.reddit_posts.201*` 
    WHERE subreddit = 'AmItheAsshole'
    AND link_flair_text IS NOT NULL
    AND _table_suffix >= '8_03'
    GROUP BY 1,2 
  )
)
WHERE c/total_month > 0.01
AND total_flair > 1000
ORDER BY month

如何在2016年前带来数据?

几乎相同的查询,但它包含了一个不同的表,其中包含了2016年以前的所有历史内容:

代码语言:javascript
复制
WITH data AS (
  SELECT * FROM `fh-bigquery.reddit_posts.201*` WHERE _table_suffix >= '5_12'
  UNION ALL  
  SELECT * FROM `fh-bigquery.reddit_posts.full_corpus_201512`  
)

SELECT *, c/total_month AS perc
FROM (
  SELECT *, SUM(c) OVER(PARTITION BY month) total_month, SUM(c) OVER(PARTITION BY link_flair_text) total_flair
  FROM (
    SELECT LOWER(link_flair_text) link_flair_text, DATE(TIMESTAMP_TRUNC(TIMESTAMP_SECONDS(created_utc), MONTH)) month, COUNT(*) c
    FROM data
    WHERE subreddit  = 'AmItheAsshole'
    AND link_flair_text IS NOT NULL
    GROUP BY 1,2 
  )
)
WHERE c/total_month > 0.01
AND total_flair > 1000
ORDER BY month
票数 3
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/59997898

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档