我试图在单个表中的嵌套字段之间进行一些联接/聚合,同时遇到SQL问题和“引用其他表时不支持的相关子查询,除非它们可以去关联,例如将它们转换为有效的联接”错误。
我希望在一般问题上提供一些SQL帮助,但我也很好奇如何处理这个错误。
我的问题映射到BigQuery专利数据。在该数据集中,专利具有分类数据( cpc记录,其中cpc.code是记录上具有关联数据cpc.inventive和cpc.first的分类代码)。专利也有它引用的专利( citation record,其中citation.publication_number是被引用的带有关联数据citation.type和citation.category的专利)。这些记录中有更多的字段,但让我们假设这些是重要的字段。
我想得到的是这样的json,每个CPC有一行,有记录,其中记录了CPC的专利是如何引用其他专利的,这些专利是基于CPC在引用的专利和引用的方面上的。json看起来会是这样的:
[
{
"citing_patent_cpc": "1234/123",
"cited_patent_cpcs":
[
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": true,
"count": 45
},
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": false,
"count": 12
},
{
"cpc": "H211/123",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": null,
"cited_cpc_inventive": true,
"cited_cpc_first": false,
"count": 3
},
...
]
},
{
"citing_patent_cpc": "1234/ABC",
"cited_patent_cpcs":
[
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": true,
"count": 16
},
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": false,
"count": 3
},
{
"cpc": "H211/123",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": null,
"cited_cpc_inventive": true,
"cited_cpc_first": false,
"count": 9
},
...
]
},
...
]其中每个唯一的cpc.code都得到一行和一个数组。该数组记录了专利引用的专利数量("citing_patent_cpc"),其中有一个特定的CPC (" CPC "),包含两项专利cpc的各个方面以及引用类型。
例如,上述例子中的第一项记录意味着,45倍于CPC "1234/123“的专利是创造性的CPC,而不是第一次CPC引用另一项专利,CPC "ABCD/345”作为第一cpc,而不是创造性的CPC,这一引文是"ABC“和"A”类。理论上,每行都可以记录到语料库中的每个CPC *可能的方面的数量,但实际上并非如此。
作为部分步骤,我试图加入中国共产党的记录,从被引用的专利到引用专利的记录。我让这个查询直接处理SQL中声明的一个非常小的表,但是它提供了“引用其他表的关联子查询,除非它们可以去关联,例如将它们转换成一个有效的联接”。当我试图在大数据上运行它时(比如实际的专利表)。
下面是一个查询:
SELECT
publication_number,
cpc,
citation,
(
SELECT ARRAY_CONCAT_AGG(cpc)
FROM `patents-public-data.patents.publications` AS JoinedPatents
RIGHT JOIN
(
SELECT publication_number
FROM UNNEST(Patents.citation)
) AS unnestedcitation
ON unnestedcitation.publication_number = JoinedPatents.publication_number) AS cited_cpc
FROM `patents-public-data.patents.publications`AS Patents我很想知道:
感谢读了这么多书的人。
发布于 2020-02-04 22:23:36
我认为这应该与您想要的查询非常接近。它看起来是一个大数据集,所以我不能评论速度/效率。希望这逻辑至少是有意义的。
with data as (
-- unnest your data
select
p.publication_number,
cp.code as cpc_code,
cp.inventive as cpc_inventive,
cp.first as cpc_first,
ci.publication_number as citation_publication_number,
ci.type as citation_type,
ci.category as citation_category
from `patents-public-data.patents.publications` p
left join unnest(cpc) cp
left join unnest(citation) ci
),
joined as (
-- do a self-join to join citation publication_number to original publication_number, group to get counts
select
d1.cpc_code as citing_patent_cpc,
d2.cpc_code as cpc,
d1.cpc_inventive as citing_cpc_inventive,
d1.cpc_first as citing_cpc_first,
d1.citation_type,
d1.citation_category,
d2.cpc_inventive as cited_cpc_inventive,
d2.cpc_first as cited_cpc_first,
count(*) as count
from data d1
left join data d2 on d1.citation_publication_number = d2.publication_number
group by 1,2,3,4,5,6,7,8
),
agged as (
-- aggrecate to match requested output
select
citing_patent_cpc,
array_agg(struct(cpc,citing_cpc_inventive,citing_cpc_first,citation_type,citation_category,cited_cpc_inventive,cited_cpc_first,count)) cited_patent_cpcs
from joined
group by 1
)
select * from aggedhttps://stackoverflow.com/questions/60064392
复制相似问题