文章/答案/技术大牛

发布

社区首页 >问答首页 >BigQuery中嵌套记录的自连接

问BigQuery中嵌套记录的自连接
EN

Stack Overflow用户

提问于 2020-02-04 19:32:44

回答 1查看 465关注 0票数 1

我试图在单个表中的嵌套字段之间进行一些联接/聚合，同时遇到SQL问题和“引用其他表时不支持的相关子查询，除非它们可以去关联，例如将它们转换为有效的联接”错误。

我希望在一般问题上提供一些SQL帮助，但我也很好奇如何处理这个错误。

我的问题映射到BigQuery专利数据。在该数据集中，专利具有分类数据( cpc记录，其中cpc.code是记录上具有关联数据cpc.inventive和cpc.first的分类代码)。专利也有它引用的专利( citation record，其中citation.publication_number是被引用的带有关联数据citation.type和citation.category的专利)。这些记录中有更多的字段，但让我们假设这些是重要的字段。

我想得到的是这样的json，每个CPC有一行，有记录，其中记录了CPC的专利是如何引用其他专利的，这些专利是基于CPC在引用的专利和引用的方面上的。json看起来会是这样的：

[
  {
      "citing_patent_cpc": "1234/123",
      "cited_patent_cpcs":
      [
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": true,
          "count": 45
        },
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": false,
          "count": 12
        },
        {
          "cpc": "H211/123",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": null,
          "cited_cpc_inventive": true,
          "cited_cpc_first": false,
          "count": 3
        },
        ...

      ]
  },
  {
      "citing_patent_cpc": "1234/ABC",
      "cited_patent_cpcs":
      [
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": true,
          "count": 16
        },
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": false,
          "count": 3
        },
        {
          "cpc": "H211/123",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": null,
          "cited_cpc_inventive": true,
          "cited_cpc_first": false,
          "count": 9
        },
        ...
      ]
  },
  ...
]

其中每个唯一的cpc.code都得到一行和一个数组。该数组记录了专利引用的专利数量("citing_patent_cpc")，其中有一个特定的CPC (" CPC ")，包含两项专利cpc的各个方面以及引用类型。

例如，上述例子中的第一项记录意味着，45倍于CPC "1234/123“的专利是创造性的CPC，而不是第一次CPC引用另一项专利，CPC "ABCD/345”作为第一cpc，而不是创造性的CPC，这一引文是"ABC“和"A”类。理论上，每行都可以记录到语料库中的每个CPC *可能的方面的数量，但实际上并非如此。

作为部分步骤，我试图加入中国共产党的记录，从被引用的专利到引用专利的记录。我让这个查询直接处理SQL中声明的一个非常小的表，但是它提供了“引用其他表的关联子查询，除非它们可以去关联，例如将它们转换成一个有效的联接”。当我试图在大数据上运行它时(比如实际的专利表)。

下面是一个查询：

SELECT
  publication_number,
  cpc,
  citation,
  (
    SELECT ARRAY_CONCAT_AGG(cpc)
    FROM `patents-public-data.patents.publications` AS JoinedPatents
    RIGHT JOIN
      (
        SELECT publication_number
        FROM UNNEST(Patents.citation)
      ) AS unnestedcitation
    ON unnestedcitation.publication_number = JoinedPatents.publication_number) AS cited_cpc
FROM `patents-public-data.patents.publications`AS Patents

我很想知道：

如何使此查询正常工作而不出现此错误。
我的问题的整体解决方案，如果有人觉得宽宏大量的SQL-fu。

感谢读了这么多书的人。

join

google-bigquery

record

sql

arrays

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-02-04 22:23:36

我认为这应该与您想要的查询非常接近。它看起来是一个大数据集，所以我不能评论速度/效率。希望这逻辑至少是有意义的。

with data as (
-- unnest your data
  select 
    p.publication_number,
    cp.code as cpc_code,
    cp.inventive as cpc_inventive,
    cp.first as cpc_first,
    ci.publication_number as citation_publication_number,
    ci.type as citation_type,
    ci.category as citation_category
  from `patents-public-data.patents.publications` p
  left join unnest(cpc) cp
  left join unnest(citation) ci
),
joined as (
-- do a self-join to join citation publication_number to original publication_number, group to get counts
  select 
    d1.cpc_code as citing_patent_cpc,
    d2.cpc_code as cpc,
    d1.cpc_inventive as citing_cpc_inventive,
    d1.cpc_first as citing_cpc_first,
    d1.citation_type,
    d1.citation_category,
    d2.cpc_inventive as cited_cpc_inventive,
    d2.cpc_first as cited_cpc_first,
    count(*) as count
  from data d1
  left join data d2 on d1.citation_publication_number = d2.publication_number
  group by 1,2,3,4,5,6,7,8
),
agged as (
-- aggrecate to match requested output
  select 
    citing_patent_cpc,
    array_agg(struct(cpc,citing_cpc_inventive,citing_cpc_first,citation_type,citation_category,cited_cpc_inventive,cited_cpc_first,count)) cited_patent_cpcs
  from joined
  group by 1
)
select * from agged

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/60064392

复制

相似问题

问BigQuery中嵌套记录的自连接
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问BigQuery中嵌套记录的自连接EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问BigQuery中嵌套记录的自连接
EN