Here is the sample DataFrame (df) I am working with:
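+---+----+--------+
| id|orig|scrubbed|
+---+----+--------+
|  1|   a|       a|
|  2|   B|       b|
|  3|   c|       c|
|  4|   D|       d|
|  5|   *|      XX|
|  6|   $|      XX|
|  7|  ZZ|      ZZ|
|  8|  XX|      XX|
|  9|   y|       y|
| 10|   Z|       z|
+---+----+--------+
I want to run a check that tells me whether the fraction of items that are "populated" after scrubbing (i.e. that do not contain "XX" or "ZZ") is at least 80%. (This check should fail.) I can add a Compliance analyzer to the VerificationRunBuilder to compute the metric, like this: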
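import com.amazon.deequ.{VerificationResult, VerificationRunBuilder}
import com.amazon.deequ.analyzers.Compliance
import com.amazon.deequ.checks.{Check, CheckLevel}

val myVerificationResult: VerificationResult = new VerificationRunBuilder(df).
  addRequiredAnalyzer(
    // Fraction of rows with a populated `scrubbed` value,
    // restricted to rows whose `orig` value was populated
    Compliance(
      "populatedAfterScrubbing",
      "`scrubbed` NOT IN ('ZZ', 'XX') AND `scrubbed` IS NOT NULL",
      Some("`orig` NOT IN ('ZZ', 'XX') AND `orig` IS NOT NULL")
    )
  ).
  addCheck(
    Check(CheckLevel.Error, "Review Check").
      hasSize(_ >= 1)
  ).
  run()
This code runs and successfully checks the data with the hasSize constraint, but I can't figure out how to add a constraint based on the custom Compliance analyzer. Is this possible?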
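For reference, the fraction this check is expected to report can be worked out by hand from the sample rows. The snippet below is only a sketch in plain Scala collections (runnable as a script or in the REPL); the `Row` case class and the `placeholders` set are names made up for illustration, and it mirrors the idea of the predicate and the filter rather than how deequ evaluates them. The sample has no nulls, so the IS NOT NULL parts are left out.

```scala
// Hand computation of the filtered compliance fraction on the sample rows.
// Plain Scala only; `Row` and `placeholders` are hypothetical helper names.
final case class Row(id: Int, orig: String, scrubbed: String)

val rows = Seq(
  Row(1, "a", "a"), Row(2, "B", "b"), Row(3, "c", "c"), Row(4, "D", "d"),
  Row(5, "*", "XX"), Row(6, "$", "XX"), Row(7, "ZZ", "ZZ"), Row(8, "XX", "XX"),
  Row(9, "y", "y"), Row(10, "Z", "z")
)

val placeholders = Set("XX", "ZZ")

// Filter: keep only rows whose original value was populated (ids 1-6, 9, 10)
val eligible = rows.filterNot(r => placeholders(r.orig))

// Predicate: the scrubbed value is still populated (ids 1-4, 9, 10)
val compliant = eligible.filterNot(r => placeholders(r.scrubbed))

val fraction = compliant.size.toDouble / eligible.size

println(fraction)         // prints 0.75
println(fraction >= 0.8)  // prints false, so an 80% threshold is not met
```

Six of the eight eligible rows stay populated, which is why a threshold of 0.8 should fail on this data.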
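Posted on 2020-05-16 00:14:38
I found a solution that seems to work, in case anyone is interested. The answer is to create a custom constraint rather than a custom analyzer. Here is the working code: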
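import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.Constraint

// Compliance constraint: predicate, assertion on the fraction, row filter, hint
val myConstraint = Constraint.complianceConstraint(
  "my constraint",
  "`scrubbed` NOT IN ('ZZ', 'XX') AND `scrubbed` IS NOT NULL",
  (fraction: Double) => fraction >= 0.8,
  Some("`orig` NOT IN ('ZZ', 'XX') AND `orig` IS NOT NULL"),
  Some("no peeking")
)
val myVerificationResult: VerificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Review Check")
      .addConstraint(myConstraint)
  )
  .run()
val result = checkResultsAsDataFrame(spark, myVerificationResult)
result.show(truncate = true)
The result is exactly as expected: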
+------------+-----------+------------+--------------------+-----------------+--------------------+
|       check|check_level|check_status|          constraint|constraint_status|  constraint_message|
+------------+-----------+------------+--------------------+-----------------+--------------------+
|Review Check|      Error|       Error|ComplianceConstra...|          Failure|Value: 0.75 does ...|
+------------+-----------+------------+--------------------+-----------------+--------------------+
Posted on 2020-08-28 02:16:34
Can't this be done with just a Check? Using something like satisfies: https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/checks/Check.scala#L667
Check(CheckLevel.Warning, "Satisfies TEST Constraint")
  .satisfies(
    "`scrubbed` NOT IN ('ZZ', 'XX') AND `scrubbed` IS NOT NULL",
    "my constraint",
    (fraction: Double) => fraction >= 0.8,
    Some("...")
  )
I think this works out of the box (OOB), rather than defining it through a compliance constraint, although that approach is also food for thought if you have more complex logic.
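One difference worth keeping in mind: the satisfies call above passes no row filter, whereas the compliance constraint in the earlier answer restricts the rows with a condition on `orig`. Assuming the condition is then evaluated over all rows, the two approaches report different fractions on the sample data. The following is a rough sketch in plain Scala (no deequ involved; `samples`, `isPopulated` and the two result names are made up for illustration):

```scala
// Compare the fraction with and without the filter on `orig`.
// Plain Scala only; all names here are hypothetical.
val samples: Seq[(String, String)] = Seq( // (orig, scrubbed)
  ("a", "a"), ("B", "b"), ("c", "c"), ("D", "d"), ("*", "XX"),
  ("$", "XX"), ("ZZ", "ZZ"), ("XX", "XX"), ("y", "y"), ("Z", "z")
)

def isPopulated(value: String): Boolean = value != "XX" && value != "ZZ"

// No filter: every row counts in the denominator (6 of 10)
val unfilteredFraction =
  samples.count { case (_, scrubbed) => isPopulated(scrubbed) }.toDouble / samples.size

// With the filter on `orig`: only originally populated rows count (6 of 8)
val kept = samples.filter { case (orig, _) => isPopulated(orig) }
val filteredFraction =
  kept.count { case (_, scrubbed) => isPopulated(scrubbed) }.toDouble / kept.size

println(unfilteredFraction) // prints 0.6
println(filteredFraction)   // prints 0.75
```

Both values fall below 0.8 here, so the check fails either way on this sample, but on other data the two denominators could lead to different outcomes.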
https://stackoverflow.com/questions/61806431