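I am querying a Druid database with pydruid and want to compute a post-aggregation over the results where one aggregation is true and the other is false.
I have already been able to compute this post-aggregation by sending a JSON query to the Druid database with curl.
With pydruid, I have been able to compute the initial aggregations and the post-aggregation for the intersection of the two aggregated groups. I have been trying to find a way to use the ThetaSketchOp class, but so far without success.
Here is my attempt so far at using the ThetaSketchOp class in pydruid: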
result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange')
    ),
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            (filters.Dimension('fruit') == 'orange'),
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') &
            postaggregator.ThetaSketch('orange')
        ),
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketchOp(
                fn='not',
                fields=[
                    postaggregator.ThetaSketch('apple'),
                    postaggregator.ThetaSketch('orange')
                ],
                name='testing'
            )
        )
    }
)

Here is the JSON query that produces the desired result when sent to the Druid database:
{
  "queryType": "groupBy",
  "dataSource": "datasource",
  "granularity": "ALL",
  "dimensions": [],
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "fruit",
        "value": "apple"
      },
      "aggregator": {
        "type": "thetaSketch", "name": "apple", "fieldName": "person"
      }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "fruit",
        "value": "orange"
      },
      "aggregator": {
        "type": "thetaSketch", "name": "orange", "fieldName": "person"
      }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_orange",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "INTERSECT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    },
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_not_orange",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "NOT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    }
  ],
  "intervals": [ "2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z" ]
}

Thanks for reading. Please let me know if there is any other information I should provide.
Posted on 2019-07-09 21:56:21
It seems to work if you create the NOT theta sketch op with the != operator:
from pydruid.utils import aggregators, filters, postaggregator

# `query` is assumed to be an already constructed pydruid client.
result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange')
    ),
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            (filters.Dimension('fruit') == 'orange'),
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        # `&` builds the INTERSECT set operation.
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') &
            postaggregator.ThetaSketch('orange')
        ),
        # `!=` builds the NOT set operation (apple but not orange).
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') !=
            postaggregator.ThetaSketch('orange')
        )
    }
)

(I found this by digging through the pydruid source code.)
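To illustrate why an operator can stand in for an explicit set operation, here is a minimal, self-contained sketch of the operator-overloading pattern. It is not pydruid's actual code: `SketchNode`, `field`, and `estimate` are hypothetical names, and the mapping of `&`, `|`, and `!=` to INTERSECT, UNION, and NOT is assumed from the behaviour described above. It only shows how an expression like `a != b` can be turned into the `thetaSketchSetOp` fragment from the JSON query.

```python
import json


class SketchNode:
    """Hypothetical stand-in for a theta sketch post-aggregator node."""

    def __init__(self, spec, name):
        self.spec = spec  # JSON-serializable dict describing this node
        self.name = name

    def _set_op(self, func, other, joiner):
        # Wrap both operands in a thetaSketchSetOp spec.
        name = f"{self.name}_{joiner}_{other.name}"
        spec = {
            "type": "thetaSketchSetOp",
            "name": name,
            "func": func,
            "fields": [self.spec, other.spec],
        }
        return SketchNode(spec, name)

    def __and__(self, other):  # a & b  -> INTERSECT
        return self._set_op("INTERSECT", other, "AND")

    def __or__(self, other):  # a | b  -> UNION
        return self._set_op("UNION", other, "OR")

    def __ne__(self, other):  # a != b -> NOT (in a, but not in b)
        return self._set_op("NOT", other, "NOT")


def field(name):
    # Reference to an aggregation by its output name.
    return SketchNode({"type": "fieldAccess", "fieldName": name}, name)


def estimate(name, node):
    # Wrap a sketch node in a thetaSketchEstimate post-aggregation.
    return {"type": "thetaSketchEstimate", "name": name, "field": node.spec}


apple_not_orange = estimate("apple_&_not_orange", field("apple") != field("orange"))
print(json.dumps(apple_not_orange, indent=2))
```

Because the operands are converted to plain dicts as the expression is built, the result can be serialized directly. That may also be where the explicit ThetaSketchOp attempt goes wrong: it passes sketch objects in `fields` and a lowercase `'not'`, though I have not verified this.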
https://stackoverflow.com/questions/56953155