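Short question: I want to split one BQ table into multiple smaller tables based on the distinct values of a column. So if the column country has 10 distinct values, the table should be split into 10 separate tables, each holding the data for one country. Ideally this would be done from within a BQ query (using INSERT, MERGE, etc.).
What I'm doing now is exporting the data to gstorage -> local storage, doing the split locally, and then pushing the results back into tables (a very time-consuming process).
Thanks.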
Answered 2018-11-05 15:00:58
If the data all has the same schema, just keep it in one table and use the clustering feature: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#creating_a_clustered_table
#standardSQL
CREATE TABLE mydataset.myclusteredtable
PARTITION BY dateCol
CLUSTER BY country
OPTIONS (
  description="a table clustered by country"
) AS (
  SELECT ....
)
https://cloud.google.com/bigquery/docs/clustered-tables
Note, however, that this feature was still in beta at the time of writing.
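If separate per-country tables are still required, one approach that stays inside BigQuery is to run one CREATE TABLE ... AS SELECT statement per distinct value. The following is a minimal sketch in plain Java that only builds the statement text; it does not call BigQuery. The class name, the helper names, and the table names mydataset.mytable and mydataset.mytable_ are hypothetical, and the distinct values are assumed to have been fetched beforehand (for example with SELECT DISTINCT country).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: builds one CREATE TABLE ... AS SELECT per distinct column value. */
final class SplitTableSql {

    /** Turns a column value into a lowercase suffix made of letters, digits and underscores. */
    static String tableSuffix(String value) {
        return value.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
    }

    /** Wraps a value in single quotes, escaping backslashes and single quotes. */
    static String sqlLiteral(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Returns one statement per value, each selecting only the rows for that value. */
    static List<String> buildStatements(String sourceTable, String targetPrefix,
                                        String column, List<String> values) {
        List<String> statements = new ArrayList<>();
        for (String value : values) {
            statements.add("CREATE TABLE " + targetPrefix + tableSuffix(value)
                    + " AS SELECT * FROM " + sourceTable
                    + " WHERE " + column + " = " + sqlLiteral(value));
        }
        return statements;
    }

    public static void main(String[] args) {
        // Assumed input: the distinct values of the country column.
        List<String> countries = List.of("US", "South Korea");
        for (String sql : buildStatements("mydataset.mytable", "mydataset.mytable_",
                "country", countries)) {
            System.out.println(sql + ";");
        }
    }
}
```

Two different values can map to the same suffix (for example "a b" and "a-b"), so the generated names should be checked for collisions before the statements are run. Each statement also scans the source table separately, which is one reason the clustered single table above may be the cheaper option.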
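Answered 2018-12-31 23:56:25
You can use Dataflow for this. This answer gives an example pipeline that queries a BigQuery table, splits the rows based on a column, and writes them to different PubSub topics (which could instead be different BigQuery tables):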
// Written against the Dataflow SDK 1.x API (BigQueryIO.Read.named, sideOutput).
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Read every row of the public weather_stations sample table.
PCollection<TableRow> weatherData = p.apply(
    BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));

// One tag per output; the anonymous subclasses ({}) preserve the String element type.
final TupleTag<String> readings2010 = new TupleTag<String>() {
};
final TupleTag<String> readings2000plus = new TupleTag<String>() {
};
final TupleTag<String> readingsOld = new TupleTag<String>() {
};

// Route each row to one of three outputs based on its third field (index 2), treated as a year.
PCollectionTuple collectionTuple = weatherData.apply(ParDo.named("tablerow2string")
    .withOutputTags(readings2010, TupleTagList.of(readings2000plus).and(readingsOld))
    .of(new DoFn<TableRow, String>() {
        @Override
        public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
            if (c.element().getF().get(2).getV().equals("2010")) {
                // Main output: readings from exactly 2010.
                c.output(c.element().toString());
            } else if (Integer.parseInt(c.element().getF().get(2).getV().toString()) > 2000) {
                // Side output: readings after 2000 (2010 is already handled above).
                c.sideOutput(readings2000plus, c.element().toString());
            } else {
                // Side output: readings from 2000 or earlier.
                c.sideOutput(readingsOld, c.element().toString());
            }
        }
    }));

// Write each output to its own Pub/Sub topic.
collectionTuple.get(readings2010)
    .apply(PubsubIO.Write.named("WriteToPubsub1").topic("projects/fh-dataflow/topics/bq2pubsub-topic1"));
collectionTuple.get(readings2000plus)
    .apply(PubsubIO.Write.named("WriteToPubsub2").topic("projects/fh-dataflow/topics/bq2pubsub-topic2"));
collectionTuple.get(readingsOld)
    .apply(PubsubIO.Write.named("WriteToPubsub3").topic("projects/fh-dataflow/topics/bq2pubsub-topic3"));

p.run();
https://stackoverflow.com/questions/53108859