
Flink tries to restore a checkpoint from a deleted directory

Stack Overflow user
Asked 2022-05-25 11:10:46
3 answers, 245 views, 0 followers, score 1

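After purging old files (files last accessed more than a month ago) from the S3 bucket used to store checkpoints, some job tasks fail to start when the job restarts or restores from the current checkpoint, because some of the old files are missing.

The job runs fine and keeps saving current checkpoints (save path: s3://flink-checkpoints/check/af8b0712ae0c1f20d2226b86e6bddb60/chk-100274):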

2022-04-24 03:58:32.892 Triggering checkpoint 100273 @ 1653353912890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 03:58:55.317 Completed checkpoint 100273 for job af8b0712ae0c1f20d2226b86e6bddb60 (679053131 bytes in 22090 ms).
2022-04-24 04:03:32.892 Triggering checkpoint 100274 @ 1653354212890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 04:03:35.844 Completed checkpoint 100274 for job af8b0712ae0c1f20d2226b86e6bddb60 (9606712 bytes in 2494 ms).

After one of the task managers went down and the job restarted:

2022-04-24 04:04:40.936 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:14.150 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RESTARTING to RUNNING.
2022-04-24 04:05:14.198 Restoring job af8b0712ae0c1f20d2226b86e6bddb60 from latest valid checkpoint: Checkpoint 100274 @ 1653354212890 for af8b0712ae0c1f20d2226b86e6bddb60.

A moment later the job fails, because some tasks cannot restore their state:

2022-04-24 04:05:17.095 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:17.093 Process first events -> Sink: Sink to test-job (5/10) (4f9089b1015540eb6e13afe4c07fa97b) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_f1d5710fb330fd579d15b292e305802c_(5/10) from any of the 1 provided restore options.
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to download data for state handles.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default), S3 Extended Request ID: 51f03da-default-default (Path: s3://flink-checkpoints/check/e3d82336005fc40be9af536938716199/shared/64452a30-c8a0-454f-8164-34d9e70142e0)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default)
2022-04-24 04:05:17.095 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.

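If I cancel the job completely and start a new one with the savepoint path set to the path of the last checkpoint, I get the same error.

Why does the job try to fetch files from the e3d82336005fc40be9af536938716199 folder when it is using a checkpoint from the af8b0712ae0c1f20d2226b86e6bddb60 folder, and what are the rules for purging old checkpoints from storage?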
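
One way to confirm this from the logs is to pull the S3 path out of each NoSuchKey error and compare the job ID in that path with the ID of the running job. The following is a minimal Python sketch, not part of Flink: the function name and both regular expressions are assumptions made for illustration, and it assumes the error lines end with a "(Path: s3://...)" suffix as in the trace above.

```python
import re

# Assumption: the failing object is reported as "(Path: s3://...)" in the log line.
PATH_IN_ERROR = re.compile(r"\(Path: (s3[an]?://\S+?)\)")
# Assumption: job directories are named by a 32-character hex job ID.
JOB_ID_IN_PATH = re.compile(r"/([0-9a-f]{32})/")


def foreign_state_paths(log_text, current_job_id):
    """Return reported paths that live under a different job ID's directory."""
    foreign = []
    for path in PATH_IN_ERROR.findall(log_text):
        match = JOB_ID_IN_PATH.search(path)
        if match and match.group(1) != current_job_id:
            foreign.append(path)
    return foreign
```

Calling foreign_state_paths(log_text, "af8b0712ae0c1f20d2226b86e6bddb60") on the trace above would list the file under the e3d82336005fc40be9af536938716199 directory.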

Update: I found that Flink saves the paths to the RocksDB files of all TaskManagers in the chk-*/_ file on S3.


3 Answers

Stack Overflow user

Accepted answer

Answered 2022-05-27 14:05:26

I found that Flink saves the paths to the RocksDB files of all TaskManagers in the chk-*/_ file on S3.
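
To see which files a given checkpoint still depends on before purging the bucket, the metadata file under chk-* (Flink names it _metadata) can be scanned for the paths it records. The following is a rough Python sketch under stated assumptions: the file has already been downloaded as bytes, the paths show up inside the binary content as plain ASCII strings, and the directory layout is <base>/<job id>/<shared|taskowned|chk-N>/<file>. The function names and regular expressions are hypothetical; this is a heuristic scan, not a Flink API and not a parser for the metadata format.

```python
import re
from collections import defaultdict

# Heuristic: an S3 URI made only of path-safe ASCII characters.
S3_PATH = re.compile(rb"s3[an]?://[A-Za-z0-9._/\-]+")
# Assumed layout: .../<32-hex job id>/<shared|taskowned|chk-N>/<file>
JOB_DIR = re.compile(r"/([0-9a-f]{32})/(?:shared|taskowned|chk-\d+)/")


def referenced_paths_by_job(metadata_bytes):
    """Group every S3 path found in the metadata bytes by the job ID in the path."""
    result = defaultdict(set)
    for match in S3_PATH.finditer(metadata_bytes):
        path = match.group().decode("ascii")
        job = JOB_DIR.search(path)
        if job:
            result[job.group(1)].add(path)
    return dict(result)


def still_referenced(candidate_paths, metadata_bytes):
    """Return the candidates that the checkpoint metadata still points to."""
    referenced = set()
    for paths in referenced_paths_by_job(metadata_bytes).values():
        referenced |= paths
    return [p for p in candidate_paths if p in referenced]
```

A purge job could skip every path returned by still_referenced for the latest retained checkpoint, instead of deleting by last-access age alone. If the result of referenced_paths_by_job contains more than one job ID, the checkpoint depends on files written under another job's directory, which matches the error in the question.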

Score: 0

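Stack Overflow user

Answered 2022-05-25 11:28:51

This was rather ambiguous for a long time and has recently been addressed in Flink 1.15. I recommend reading the section "Clarification of checkpoint and savepoint semantics" in https://flink.apache.org/news/2022/05/05/1.15-announcement.html, including the part that compares checkpoints and savepoints.

The behavior you are seeing depends on your checkpointing setup (aligned vs. unaligned).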

Score: 0

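Stack Overflow user

Answered 2022-05-26 12:09:20

By default, cancelling a job removes its old checkpoints. There is a configuration flag that controls this: execution.checkpointing.externalized-checkpoint-retention. As Martijn mentioned, you would normally rely on savepoints to control job upgrades and restarts.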
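
As a configuration sketch, the flag above could be set in flink-conf.yaml like this. RETAIN_ON_CANCELLATION is one of the documented values for this option; whether retaining checkpoints on cancellation is the right choice for this particular job is an assumption.

```yaml
# Keep externalized checkpoints when the job is cancelled
# (they then have to be cleaned up manually).
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```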

Score: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/72376547
