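I am trying to write a log-based alert policy in Terraform.
I want to generate a near-real-time alert whenever a certain message appears in the logs. Specifically, I want to know when a Composer DAG fails.
I successfully set up a log-based alert in the console using the following query filter: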
resource.type="cloud_composer_environment"
severity="ERROR"
log_name="projects/my_project/logs/airflow-scheduler"
resource.labels.project_id="project-id"
textPayload=~"my_dag_name"
However, I am struggling to translate this log-based alert policy into Terraform using "google_monitoring_alert_policy".
I tried adding the following filter to the Terraform google_monitoring_alert_policy:
filter = "resource.type=cloud_composer_environment AND resource.label.project_id=${var.project} AND log_name=projects/${var.project}/logs/airflow-scheduler AND severity=ERROR AND textPayload=~my_dag_name"
However, when running terraform apply, I get the following error:
build 10-Nov-2022 12:21:00 Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "resource.type=cloud_composer_environment AND resource.labels.project_id=my_project AND log_name=projects/my_project/logs/airflow-scheduler AND severity=ERROR AND textPayload=my_dag_name": The lefthand side of each expression must be prefixed with one of {group, metadata, metric, project, resource}.
So I have two questions:
Posted on 2022-11-10 14:13:43
As far as I can see, you want to create a log-based metric.
In that case, you first need to create this log-based metric with Terraform.
Example of metrics configured in a JSON file, logging_metrics.json:
{
  "metrics": {
    "composer_dags_tasks_bigquery_errors": {
      "name": "composer_dags_tasks_bigquery_errors",
      "filter": "severity=ERROR AND resource.type=\"cloud_composer_environment\" AND textPayload =~ \"{taskinstance.py:.*} ERROR -.*bigquery.googleapis.com/bigquery/v2/projects\"",
      "description": "Metric for Cloud Composer Bigquery tasks errors.",
      "metric_descriptor": {
        "metric_kind": "DELTA",
        "value_type": "INT64",
        "labels": [
          {
            "key": "task_id",
            "value_type": "STRING",
            "description": "Task ID of current Airflow task",
            "extractor": "EXTRACT(labels.\"task-id\")"
          },
          {
            "key": "execution_date",
            "value_type": "STRING",
            "description": "Execution date of the current Airflow task",
            "extractor": "EXTRACT(labels.\"execution-date\")"
          }
        ]
      }
    }
  }
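}
This metric filters errors from the Composer logs. I used label extractors on the DAG task_id and the task execution_date so that the metric is unique for each combination of these parameters.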
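To connect this with the question, here is a minimal sketch of a single log-based metric built from the question's console filter, declared inline rather than through the JSON file. The resource name dag_failure and the metric name composer_dag_my_dag_name_errors are hypothetical, and var.project_id is assumed to hold the project that owns the Composer environment.

```hcl
# Hypothetical example: counts ERROR entries from the Airflow scheduler log
# that mention my_dag_name. Names below are illustrative assumptions.
resource "google_logging_metric" "dag_failure" {
  project = var.project_id
  name    = "composer_dag_my_dag_name_errors"

  # Logging query language filter, taken from the question's console filter.
  filter = <<-EOT
    resource.type="cloud_composer_environment" AND
    severity="ERROR" AND
    log_name="projects/${var.project_id}/logs/airflow-scheduler" AND
    textPayload=~"my_dag_name"
  EOT

  # A counter metric: each matching log entry increments the count by one.
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}
```

Unlike the filter of an alert policy threshold condition, this filter is written in the Logging query language, so fields such as severity, log_name and textPayload are accepted here.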
Retrieve the metrics in the locals.tf file:
locals {
  # Decode the JSON file and keep the map of metric definitions.
  logging_metrics = jsondecode(file("${path.module}/resource/logging_metrics.json"))["metrics"]
}
resource "google_logging_metric" "logging_metrics" {
  # One metric resource per entry in the JSON "metrics" map.
  for_each    = local.logging_metrics
  project     = var.project_id
  name        = each.value["name"]
  filter      = each.value["filter"]
  description = each.value["description"]
  metric_descriptor {
    metric_kind = each.value["metric_descriptor"]["metric_kind"]
    value_type  = each.value["metric_descriptor"]["value_type"]
    # Declare each label listed in the JSON file, if any.
    dynamic "labels" {
      for_each = try(each.value["metric_descriptor"]["labels"], [])
      content {
        key         = try(labels.value["key"], null)
        value_type  = try(labels.value["value_type"], null)
        description = try(labels.value["description"], null)
      }
    }
  }
  # Map each label key to its extractor expression from the JSON file.
  label_extractors = { for label in try(each.value["metric_descriptor"]["labels"], []) : label.key => label.extractor }
}
Then create the alert resource based on the previous log-based metric:
resource "google_monitoring_alert_policy" "alert_policy" {
  project      = var.project_id
  display_name = "alert_name"
  combiner     = "..."
  conditions {
    display_name = "alert_name"
    condition_threshold {
      filter = "metric.type=\"logging.googleapis.com/user/composer_dags_tasks_bigquery_errors\" AND resource.type=\"cloud_composer_environment\""
      ...........
    }
  }
}
The alert policy resource references the previously created metric through metric.type in its filter.
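Because the answer elides the combiner and the threshold settings, the following is only a sketch of what a complete policy might look like. Every value marked as an assumption below (combiner, comparison, threshold, duration, aggregation) is my own choice and not the answer author's configuration; the resource name composer_bigquery_errors is hypothetical.

```hcl
# Sketch under stated assumptions: fire when at least one matching log entry
# is counted within a one-minute window.
resource "google_monitoring_alert_policy" "composer_bigquery_errors" {
  project      = var.project_id
  display_name = "alert_name"
  combiner     = "OR" # assumption: the answer elides this value

  conditions {
    display_name = "alert_name"

    condition_threshold {
      # Referencing the metric resource (instead of hard-coding its name)
      # makes Terraform create the metric before the alert policy.
      filter = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.logging_metrics["composer_dags_tasks_bigquery_errors"].name}\" AND resource.type=\"cloud_composer_environment\""

      comparison      = "COMPARISON_GT" # assumption
      threshold_value = 0               # assumption: alert on any error
      duration        = "0s"            # assumption: no sustain period

      aggregations {
        alignment_period   = "60s"       # assumption
        per_series_aligner = "ALIGN_SUM" # assumption: sum the DELTA counts per window
      }
    }
  }
}
```

This also explains the error in the question: the filter of condition_threshold is a Monitoring filter, so every term must start with metric, resource, or one of the other listed prefixes, and Logging fields such as severity or textPayload are rejected. The provider also has a condition_matched_log block for alerting on log entries directly, which is not shown here.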
https://stackoverflow.com/questions/74389049