Trying to do a zero-downtime upgrade, but the target pool is directing traffic to instances that are being torn down

Stack Overflow user
Asked 2020-05-19 18:18:04
Answers: 1, Views: 127, Followers: 0, Votes: 0

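I'm trying to do a zero-downtime upgrade of my service. So far I've been unsuccessful. The load balancer keeps directing traffic to the old instances even though they are unhealthy according to the load balancer's own health check. I'm using Terraform and GCP. The actual service to be upgraded needs to terminate TLS connections itself, so it has to sit behind a network load balancer, i.e. a target pool. The regional instance group manager is there to ensure redundancy in case a zone goes down.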

Here is a toy version of the Terraform configuration, with a reduced number of instances, that still shows the problem.

variable "project" {
  type = string
}
variable "region" {
  type = string
  default = "us-central1"
}

provider "google" {
  project = var.project
  region = var.region
}

resource "google_compute_region_instance_group_manager" "default" {
  base_instance_name = "instance"
  name = "default"

  region = var.region
  target_size = 3
  target_pools = [
    google_compute_target_pool.default.self_link,
  ]

  update_policy {
    minimal_action = "REPLACE"
    type = "PROACTIVE"
    max_surge_fixed = 3
    max_unavailable_fixed = 0
    min_ready_sec = 120
  }

  version {
    instance_template = google_compute_instance_template.template-b.self_link
  }
}

resource "google_compute_address" "default" {
  name = "default"
}

resource "google_compute_target_pool" "default" {
  name = "default"
  region = var.region
  instances = []
  health_checks = [
    google_compute_http_health_check.default.self_link
  ]
  lifecycle {
    ignore_changes = [
      instances
    ]
  }
}

resource "google_compute_http_health_check" "default" {
  name = "default"
  request_path        = "/"
  check_interval_sec  = 1
  timeout_sec         = 1
  healthy_threshold   = 3
  unhealthy_threshold = 1
}

resource "google_compute_forwarding_rule" "default" {
  name = "default"
  region = var.region
  ip_protocol = "TCP"
  port_range = "80"
  target = google_compute_target_pool.default.self_link
  ip_address = google_compute_address.default.address
}

data "google_compute_network" "default" {
  name = "default"
}

resource "google_compute_instance_template" "template-b" {
  name = "template-b1"
  machine_type = "f1-micro"

  disk {
    boot = true
    auto_delete = true
    disk_size_gb = 100
    disk_type = "pd-ssd"
    source_image = data.google_compute_image.my_image.self_link
  }

  network_interface {
    network = data.google_compute_network.default.self_link
  }

  metadata_startup_script = file("./startup-scripts/helloworld.sh")

  metadata = {
    instance-env = "SOFTWARE_VERSION=Version-B"
  }

  tags = [
    "http-server"
  ]

  lifecycle {
    create_before_destroy = true
  }
}

data "google_compute_image" "my_image" {
  family  = "ubuntu-1804-lts"
  project = "ubuntu-os-cloud"
}

output "ip-address" {
  value = google_compute_address.default.address
}

The startup script, which brings up the server that runs on each instance, is startup-scripts/helloworld.sh:

#!/bin/bash -x
METADATA_BASE=http://metadata.google.internal/computeMetadata/v1
SOFTWARE_VERSION=$(curl -sfm5 -H "Metadata-Flavor: Google" ${METADATA_BASE}/instance/attributes/instance-env)
echo "Hello World! This is ${SOFTWARE_VERSION} from $(hostname -f)" > index.html
python3 -m http.server 80 &

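The problem I'm having is that when I reduce the number of instances from 6 to 3, I can see that some instances are marked unhealthy by the health check, yet the target pool still directs traffic to them. The documentation implies that these instances should not receive any traffic.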

While resizing from 6 instances to 3, I ran two shell scripts and got the following results.

while [[ 1 ]]; do echo -n "$(date +%s) "; curl -m5 http://${IP_ADDRESS} && sleep 1; done

no timeouts
...
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838267 curl: (52) Empty reply from server
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no time outs

while [[ 1 ]]; do echo -n "$(date +%s) "; gcloud compute target-pools get-health default --region us-central1 && sleep 1; done

all six instances are healthy
...
1589838263 ---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
1589838266 ---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
...
unhealthy for a bit
...
1589838312 ---
healthStatus:
- healthState: HEALTHY
  instance: v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
  instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
  ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth

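As the timestamps from the two scripts show, the unhealthy instances are still receiving traffic.

The same pattern of unhealthy instances receiving traffic shows up when Terraform replaces the instance template with a different one on the instance group manager. It also shows up when I bring up a second regional instance group manager, add it to the target pool, wait for traffic to reach the new instances, and then remove the old regional instance group manager from the target pool. I also tried bringing up a second target pool with its own instance group manager and then changing the forwarding rule, but there I saw more than a minute of downtime during which no traffic flowed to either regional instance group.

What can I do to avoid this downtime?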


1 Answer

Stack Overflow user

Answered 2020-05-20 15:40:43

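The logs show a thirty-second interval during which the load balancer keeps sending requests to unresponsive instances.

gcloud compute target-pools get-health gives the following timestamps and health states: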

Timestamp 1589838263
instance-rvcl   UNHEALTHY
instance-42dm   UNHEALTHY
instance-562f   HEALTHY

Timestamp 1589838266
instance-rvcl   UNHEALTHY
instance-42dm   UNHEALTHY
instance-562f   UNHEALTHY

The curl output merged with the health states:

no timeouts
...
1589838263      # instance-562f still HEALTHY, the last response
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838266      # relative time +0s, instances -rvcl,42dm,562f are UNHEALTHY 
1589838267 curl: (52) Empty reply from server 
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds  # relative time +30s; instances -rvcl,42dm,562f still UNHEALTHY 
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no time outs
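
As a rough cross-check of the thirty-second figure, the failure window can be computed directly from the curl log. The sketch below assumes the loop output was saved to a file with one line per attempt in the form `<epoch-seconds> <curl output>`; the function name `failure_window` and the log file are illustrative and not part of the original post.

```shell
#!/bin/bash
# Report the span between the first and the last failed curl attempt in a log
# whose lines look like "<epoch-seconds> <curl output>".
# Hypothetical helper: the name and the log format are assumptions.
failure_window() {
  awk '
    / curl: \(/ {                 # a line that carries a curl error code
      if (first == "") first = $1
      last = $1
    }
    END {
      if (first == "") { print "no failures"; exit }
      printf "first=%s last=%s span=%ds\n", first, last, last - first
    }
  ' "$1"
}
```

Calling `failure_window curl.log` on the curl output shown in the question reports the first failure at 1589838267 and the last at 1589838297, a span of 30 seconds. Annotation lines without a curl error are ignored.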

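This is probably due to the delay the load balancer needs to recognize that an instance is unhealthy. During that time the load balancer keeps sending new requests to the instance.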
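
For a sense of scale, the probe settings from the question's `google_compute_http_health_check` can be turned into a rough detection-time estimate. The arithmetic below is a simplification and an assumption, not an official formula: it takes `unhealthy_threshold` consecutive failed probes, spaced `check_interval_sec` apart, with the last probe taking up to `timeout_sec` to fail. It does not model how long the load balancer takes to act on the result or what happens to connections that are already open. The function name `detection_estimate` is hypothetical.

```shell
#!/bin/bash
# Rough estimate, in seconds, of how long probes need to mark a backend unhealthy.
# Assumption (not an official formula): threshold failed probes in a row,
# one probe every `interval` seconds, each failing within `timeout` seconds.
detection_estimate() {
  local interval=$1 timeout=$2 threshold=$3
  echo $(( (threshold - 1) * interval + timeout ))
}

# Values from the question: check_interval_sec=1, timeout_sec=1, unhealthy_threshold=1
detection_estimate 1 1 1
```

With the question's values the estimate comes out at about one second, so under these assumptions the probe settings by themselves would not obviously account for the full thirty-second window.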

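Once an instance has become unhealthy, the load balancer stops sending new connections to it. Existing connections, however, cannot be terminated until the shutdown script shuts the instance down. For normal instances the shutdown period is 90 seconds.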

Here is an example of a health check timeline:

Load Balancing > Doc > Health checks overview > Example health check

See also

Compute Engine > Doc > Understanding autoscaler decisions > Preparing for instance terminations

Load Balancing > Doc > Health checks overview > How health checks work > Health state

Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/61898004
