Q: Unable to query an Iceberg table from a PySpark script in AWS

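Stack Overflow user

Asked 2022-07-27 17:11:55

1 answer, 785 views, 0 followers, 0 votes

I am trying to read data from an Iceberg table. The data is in ORC format and partitioned by a column. I get this error:

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table temp_tag_thrshld_iceberg. StorageDescriptor#InputFormat cannot be null for table: temp_tag_thrshld_iceberg (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

This is my code: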

spark = SparkSession.builder.config("spark.driver.memory", "25g").appName(app_name).getOrCreate()
temp_tag_thrshld_data = spark.sql("SELECT * FROM dev_db.temp_tag_thrshld_iceberg")

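If I replace the query with spark.sql("Select * from a_normal_athena_table"), the code works fine. I also can't read the data directly from S3: it is ORC with Snappy compression, and I get no results back (I may be missing the right way to read ORC directly from S3, but that is a separate question).

I have also tried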

aws glue get-table --database-name dev_db --name temp_tag_thrshld_iceberg

This is the output I get:

{
  "Table": {
    "Name": "temp_tag_thrshld_iceberg",
    "DatabaseName": "dev_db",
    "CreateTime": 1658864256.0,
    "UpdateTime": 1658864347.0,
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        { "Name": "tag", "Type": "int", "Parameters": { "iceberg.field.current": "true", "iceberg.field.id": "1", "iceberg.field.optional": "true" } },
        { "Name": "zipcode", "Type": "int", "Parameters": { "iceberg.field.current": "true", "iceberg.field.id": "2", "iceberg.field.optional": "true" } },
        { "Name": "threshold_max", "Type": "double", "Parameters": { "iceberg.field.current": "true", "iceberg.field.id": "3", "iceberg.field.optional": "true" } },
        { "Name": "level", "Type": "string", "Parameters": { "iceberg.field.current": "true", "iceberg.field.id": "4", "iceberg.field.optional": "true" } }
      ],
      "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg",
      "Compressed": false,
      "NumberOfBuckets": 0,
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
      "previous_metadata_location": "...",
      "table_type": "ICEBERG"
    },
    "CreatedBy": "IAM",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "...",
    "VersionId": "1"
  }
}
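One thing worth noting in that output: the StorageDescriptor has no InputFormat entry, while Parameters carries table_type and metadata_location. That matches the first error, since the default Hive-style lookup expects an InputFormat, whereas an Iceberg table is located through its metadata file. A small sketch of that check follows; the helper name describe_glue_table is hypothetical, and the sample dict is an abbreviated copy of the output above, not a full response.

```python
def describe_glue_table(response: dict) -> dict:
    """Summarise the parts of a Glue get-table response that matter here."""
    table = response["Table"]
    storage = table.get("StorageDescriptor", {})
    params = table.get("Parameters", {})
    return {
        # Iceberg tables registered in Glue are marked via the table_type parameter.
        "is_iceberg": params.get("table_type", "").upper() == "ICEBERG",
        # Hive-style readers rely on this field; it is absent in the output above.
        "has_input_format": bool(storage.get("InputFormat")),
        # Iceberg readers start from this metadata file instead.
        "metadata_location": params.get("metadata_location"),
    }


# Abbreviated copy of the get-table output from the question.
sample = {
    "Table": {
        "Name": "temp_tag_thrshld_iceberg",
        "DatabaseName": "dev_db",
        "StorageDescriptor": {
            "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg",
            "Compressed": False,
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
            "table_type": "ICEBERG",
        },
    }
}

info = describe_glue_table(sample)
print(info)
```

If is_iceberg is true and has_input_format is false, the table most likely needs an Iceberg-aware catalog rather than the plain Hive path.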

I then updated the configuration to the following (based on the Iceberg table configuration):

spark = (
    SparkSession.builder.config("spark.driver.memory", "25g")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .appName(app_name)
    .getOrCreate()
)

Now I get a new error:

An error occurred while calling o87.sql. Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.iceberg.spark.SparkSessionCatalog

amazon-web-services

apache-spark

pyspark

aws-glue

iceberg

1 Answer

Stack Overflow user

Accepted answer

Posted 2022-07-28 02:10:11

To read Iceberg tables in Glue, you have to use the connector below.

https://aws.amazon.com/marketplace/pp/prodview-iicxofvpqvsio

Here is a blog post for reference that explains in detail how to read and write Iceberg data with AWS Glue.

https://aws.amazon.com/blogs/big-data/use-the-aws-glue-connector-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/
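For context, the second error in the question (no plugin class found for org.apache.iceberg.spark.SparkSessionCatalog) usually means the Iceberg runtime classes are not on the job's classpath, which is what the connector provides. With those classes available, the session can be pointed at the Glue Data Catalog roughly as follows. This is a minimal sketch, not a verified Glue job: the catalog name glue_catalog, the warehouse path, and the helper names iceberg_glue_conf and build_session are assumptions for illustration; the property keys are the ones described in the Iceberg AWS integration documentation.

```python
def iceberg_glue_conf(catalog_name: str, warehouse: str) -> dict:
    """Build Spark properties for an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (for example time travel procedures).
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a named Iceberg catalog in Spark.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Root location for table data; placeholder value supplied by the caller.
        f"{prefix}.warehouse": warehouse,
        # Uses the Glue Data Catalog for table metadata lookups.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def build_session(app_name: str, conf: dict):
    """Create a SparkSession with the given properties applied."""
    # Imported lazily so the config helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical usage inside the job (requires the Iceberg jars on the classpath):
#   conf = iceberg_glue_conf("glue_catalog", "s3://dev_db/athena-tables/")
#   spark = build_session(app_name, conf)
#   spark.sql("SELECT * FROM glue_catalog.dev_db.temp_tag_thrshld_iceberg").show()
```

Note that the table is then addressed with the catalog name as a prefix (glue_catalog.dev_db.temp_tag_thrshld_iceberg in this sketch) rather than as dev_db.temp_tag_thrshld_iceberg.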

Votes 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/73142069
