我正在尝试从PySpark查询BigQuery中的__TABLES__数据。我使用下面的代码来查询系统表:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config('parentProject', 'my-parent-project')\
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1')\
.getOrCreate()
spark.read.format('bigquery')\
.option("credentials", my_key)\
.option("project", 'my-parent-project') \
.option('table', 'my-dataset.__TABLES__') \
.load()一直持续到2021/06/25。第二天,对于这段代码,我突然开始收到这个错误消息:
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InternalException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:67)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:563)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed(堆栈跟踪甚至比这个更长,如果需要,可以索要它,我会发布剩下的部分)。
BigQuery中有什么变化吗?该错误消息对我的故障排除帮助不大,有什么建议吗?为了添加更多上下文,我可以使用相同的代码查询同一数据集中的其他表。
我用Spark 3.0.1和BigQuery连接器com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.1观察到了这种行为。我在使用Spark 2.4.3和BigQuery连接器com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1时也体验到了同样的行为
发布于 2021-06-28 22:27:04
注意,__TABLES__不是一个实际的BigQuery表,而是它的元数据的视图。解决此问题的一种方法是执行以下命令:
spark = SparkSession.builder\
.config('parentProject', 'my-parent-project')\
.config('viewsEnabled','true')\
.config('materializationDataset', DATASET)
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1')\
.getOrCreate()
tables_df = spark.read.format('bigquery')\
.option("credentials", my_key)\
.load("SELECT * FROM my-project.my-dataset.__TABLES__")https://stackoverflow.com/questions/68162281
复制相似问题