首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在Python/Pyspark中使用Apache Hudi

在Python/Pyspark中使用Apache Hudi
EN

Stack Overflow用户
提问于 2020-03-30 21:25:07
回答 1查看 2.4K关注 0票数 0

有人在Pyspark环境中使用过Apache Hudi吗?如果可能的话,有没有可用的代码样本?

EN

回答 1

Stack Overflow用户

发布于 2021-03-22 21:16:11

以下是包含插入、更新和读取操作的工作pyspark示例:

代码语言:javascript
复制
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder.appName("Hudi_Data_Processing_Framework")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config(
        "spark.jars.packages",
        "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.2"
    )
    .getOrCreate()
)

input_df = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ("id", "creation_date", "last_update_time"),
)

hudi_options = {
    # ---------------DATA SOURCE WRITE CONFIGS---------------#
    "hoodie.table.name": "hudi_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.consistency.check.enabled": True,
    "hoodie.index.type": "BLOOM",
    "hoodie.index.bloom.num_entries": 60000,
    "hoodie.index.bloom.fpp": 0.000000001,
    "hoodie.cleaner.commits.retained": 2,
}

# INSERT
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

#UPDATE
update_df = input_df.limit(1).withColumn("last_update_time", lit("2016-01-01T13:51:39.340396Z"))
(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

# READ
output_df = spark.read.format("org.apache.hudi").load(
    "/tmp/hudi_test/*/*"
)

output_df.show()
票数 4
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/60931567

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档