
Attempting a zero-downtime upgrade, but the target pool is directing traffic to instances that are being torn down

Stack Overflow user

Asked 2020-05-19 18:18:04

1 answer, 127 views, 0 followers, 0 votes

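I am trying to do a zero-downtime upgrade of my service, so far without success. The load balancer keeps directing traffic to the old instances even though they are unhealthy according to the load balancer's own health check. I am using Terraform and GCP. The actual service being upgraded needs to terminate TLS connections itself, so it has to sit behind a network load balancer, i.e. a target pool. A regional instance group manager is used to keep redundancy if a zone goes down.

Here is a toy version of the Terraform configuration. The number of instances is reduced, but it still shows the problem.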

variable "project" {
  type = string
}
variable "region" {
  type = string
  default = "us-central1"
}

provider "google" {
  project = var.project
  region = var.region
}

resource "google_compute_region_instance_group_manager" "default" {
  base_instance_name = "instance"
  name = "default"

  region = var.region
  target_size = 3
  target_pools = [
    google_compute_target_pool.default.self_link,
  ]

  update_policy {
    minimal_action = "REPLACE"
    type = "PROACTIVE"
    max_surge_fixed = 3
    max_unavailable_fixed = 0
    min_ready_sec = 120
  }

  version {
    instance_template = google_compute_instance_template.template-b.self_link
  }
}

resource "google_compute_address" "default" {
  name = "default"
}

resource "google_compute_target_pool" "default" {
  name = "default"
  region = var.region
  instances = []
  health_checks = [
    google_compute_http_health_check.default.self_link
  ]
  lifecycle {
    ignore_changes = [
      instances
    ]
  }
}

resource "google_compute_http_health_check" "default" {
  name = "default"
  request_path        = "/"
  check_interval_sec  = 1
  timeout_sec         = 1
  healthy_threshold   = 3
  unhealthy_threshold = 1
}

resource "google_compute_forwarding_rule" "default" {
  name = "default"
  region = var.region
  ip_protocol = "TCP"
  port_range = "80"
  target = google_compute_target_pool.default.self_link
  ip_address = google_compute_address.default.address
}

data "google_compute_network" "default" {
  name = "default"
}

resource "google_compute_instance_template" "template-b" {
  name = "template-b1"
  machine_type = "f1-micro"

  disk {
    boot = true
    auto_delete = true
    disk_size_gb = 100
    disk_type = "pd-ssd"
    source_image = data.google_compute_image.my_image.self_link
  }

  network_interface {
    network = data.google_compute_network.default.self_link
  }

  metadata_startup_script = file("./startup-scripts/helloworld.sh")

  metadata = {
    instance-env = "SOFTWARE_VERSION=Version-B"
  }

  tags = [
    "http-server"
  ]

  lifecycle {
    create_before_destroy = true
  }
}

data "google_compute_image" "my_image" {
  family  = "ubuntu-1804-lts"
  project = "ubuntu-os-cloud"
}

output "ip-address" {
  value = google_compute_address.default.address
}

The startup script, which brings up the server that runs on each instance (startup-scripts/helloworld.sh):

#!/bin/bash -x
METADATA_BASE=http://metadata.google.internal/computeMetadata/v1
SOFTWARE_VERSION=$(curl -sfm5 -H "Metadata-Flavor: Google" ${METADATA_BASE}/instance/attributes/instance-env)
echo "Hello World! This is ${SOFTWARE_VERSION} from $(hostname -f)" > index.html
python3 -m http.server 80 &

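The problem I run into is that when I reduce the number of instances from 6 to 3, I can see that some instances are marked unhealthy by the health check, yet the target pool still directs traffic to them. The documentation implies that these instances should not see any traffic.

While scaling from 6 instances down to 3, I ran two shell scripts and got the following results:

while [[ 1 ]]; do echo -n "$(date +%s) "; curl -m5 http://${IP_ADDRESS} && sleep 1; done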

no timeouts
...
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838267 curl: (52) Empty reply from server
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no time outs

while [[ 1 ]]; do echo -n "$(date +%s) "; gcloud compute target-pools get-health default --region us-central1 && sleep 1; done

all six instances are healthy
...
1589838263 ---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
1589838266 ---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
...
unhealthy for a bit
...
1589838312 ---
healthStatus:
- healthState: HEALTHY
  instance: v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth

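As the timestamps from the two scripts show, the unhealthy instances still received traffic.

The same pattern of unhealthy instances receiving traffic appears when I swap the instance group manager's instance template for a different one in Terraform. It also appears when I bring up a second regional instance group manager, add it to the target pool, wait for traffic to reach the new instances, and then remove the old regional instance group manager from the target pool. I also tried bringing up a second target pool with its own instance group manager and then changing the forwarding rule, but there I saw more than a minute of downtime with no traffic reaching either regional instance group.

What can I do to avoid this downtime?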

google-cloud-platform

google-compute-engine

terraform

terraform-provider-gcp

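1 Answer

Stack Overflow user

Answered 2020-05-20 15:40:43

From the logs you can see a thirty-second gap during which the load balancer keeps sending requests to unresponsive instances.

gcloud compute target-pools get-health gives the following timestamps and health states: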

Timestamp 1589838263
instance-rvcl   UNHEALTHY
instance-42dm   UNHEALTHY
instance-562f   HEALTHY

Timestamp 1589838266
instance-rvcl   UNHEALTHY
instance-42dm   UNHEALTHY
instance-562f   UNHEALTHY
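
A condensed view like the one above can be produced from the raw get-health output with a small filter. This is only a sketch: the function name summarize_health is made up for illustration, and it assumes the output keeps the "- healthState:" / "instance:" layout shown in the question.

```shell
#!/bin/bash
# summarize_health: read `gcloud compute target-pools get-health` output on
# stdin and print one "<instance-name> <state>" line per instance.
# (Hypothetical helper; relies on the field layout shown in the question.)
summarize_health() {
  awk '
    # "- healthState: HEALTHY" -> remember the state for the next instance line
    $2 == "healthState:" { state = $3 }
    # "instance: .../instances/instance-xss8" -> keep only the last path segment
    $1 == "instance:" { n = split($2, parts, "/"); print parts[n], state }
  '
}
```

It could be used as `gcloud compute target-pools get-health default --region us-central1 | summarize_health`, prefixed with `date +%s` as in the question's loop to keep the timestamps.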

The curl output merged with the health states:

no timeouts
...
1589838263      # instance-562f still HEALTHY, the last response
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838266      # relative time +0s, instances -rvcl,42dm,562f are UNHEALTHY 
1589838267 curl: (52) Empty reply from server 
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds  # relative time +30s; instances -rvcl,42dm,562f still UNHEALTHY 
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no time outs
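
The thirty-second figure can be read off the probe log mechanically. The sketch below assumes the log format of the curl loop in the question (an epoch timestamp followed either by the page body or by a "curl: (NN) ..." error); outage_window is an illustrative name, not part of any tool.

```shell
#!/bin/bash
# outage_window: read the probe log on stdin and report the timestamps of the
# first and last failed request, plus the seconds between them.
# (Hypothetical helper; a failed request is any line whose second field is "curl:".)
outage_window() {
  awk '
    $2 == "curl:" {
      if (first == "") first = $1   # first failing request
      last = $1                     # most recent failing request
    }
    END {
      if (first == "") { print "no failures"; exit }
      printf "first=%s last=%s span=%ds\n", first, last, last - first
    }
  '
}
```

Run as `outage_window < probe.log`, where probe.log is a saved copy of the loop's output. The span is measured between the start times of the first and last failing requests, so the real outage also includes the timeout of the final request.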

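This is probably caused by the delay before the load balancer recognizes that an instance is unhealthy. During that time the load balancer keeps sending new requests to the instance.

Once an instance is marked unhealthy, the load balancer stops sending new connections to it. Existing connections, however, are not terminated until the shutdown script shuts the instance down. For normal instances the shutdown period is 90 seconds.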
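
One possible way to use that shutdown period, sketched under assumptions that are not part of the original configuration: have the instance fail its own health check first, keep serving while the load balancer notices, and only then stop the server. The sketch assumes the health check's request_path is changed from "/" to a hypothetical /healthz file that the startup script creates in the server's working directory, and that the script is supplied through the instance's shutdown-script metadata key. The function names and values are illustrative, and this has not been verified against the setup in the question.

```shell
#!/bin/bash
# Drain-before-stop shutdown script sketch (hypothetical; not from the
# original post). Assumes the health check probes /healthz, a file the
# startup script would have to create in the directory the server serves.

# fail_health_check DOCROOT: remove the probe file so that python3's
# http.server starts answering the health check with 404.
fail_health_check() {
  rm -f "${1%/}/healthz"
}

# drain DOCROOT SECONDS: fail the health check, then keep serving for SECONDS
# so the load balancer has time to mark the instance UNHEALTHY.
drain() {
  fail_health_check "$1"
  sleep "$2"
}

# Example invocation at the end of the real script (values are assumptions):
#   drain / 30 && pkill -f "http.server 80"
```

The wait should be at least as long as the gap observed in the logs (about thirty seconds here) and shorter than the 90-second shutdown period.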

Here is an example of a health check timeline:

Load Balancing > Doc > Health checks overview > Example health check

See also

Compute Engine > Doc > Understanding autoscaler decisions > Preparing for instance terminations

Load Balancing > Doc > Health checks overview > How health checks work > Health state

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/61898004
