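I am trying to perform a zero-downtime upgrade of my service. So far I have not been successful. The load balancer is directing traffic to the old instances even though they are unhealthy according to the load balancer's health check. I am using Terraform and GCP. The actual service being upgraded needs to terminate TLS connections, so it has to use a network load balancer, i.e. a target pool. The regional instance group manager is there to ensure redundancy in case a zone goes down.
A toy version of the Terraform configuration, with a reduced number of instances, that still shows the problem: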
variable "project" {
  type = string
}

variable "region" {
  type = string
  default = "us-central1"
}

provider "google" {
  project = var.project
  region = var.region
}

resource "google_compute_region_instance_group_manager" "default" {
  base_instance_name = "instance"
  name = "default"
  region = var.region
  target_size = 3
  target_pools = [
    google_compute_target_pool.default.self_link,
  ]
  update_policy {
    minimal_action = "REPLACE"
    type = "PROACTIVE"
    max_surge_fixed = 3
    max_unavailable_fixed = 0
    min_ready_sec = 120
  }
  version {
    instance_template = google_compute_instance_template.template-b.self_link
  }
}

resource "google_compute_address" "default" {
  name = "default"
}

resource "google_compute_target_pool" "default" {
  name = "default"
  region = var.region
  instances = []
  health_checks = [
    google_compute_http_health_check.default.self_link
  ]
  lifecycle {
    ignore_changes = [
      instances
    ]
  }
}

resource "google_compute_http_health_check" "default" {
  name = "default"
  request_path = "/"
  check_interval_sec = 1
  timeout_sec = 1
  healthy_threshold = 3
  unhealthy_threshold = 1
}

resource "google_compute_forwarding_rule" "default" {
  name = "default"
  region = var.region
  ip_protocol = "TCP"
  port_range = "80"
  target = google_compute_target_pool.default.self_link
  ip_address = google_compute_address.default.address
}

data "google_compute_network" "default" {
  name = "default"
}

resource "google_compute_instance_template" "template-b" {
  name = "template-b1"
  machine_type = "f1-micro"
  disk {
    boot = true
    auto_delete = true
    disk_size_gb = 100
    disk_type = "pd-ssd"
    source_image = data.google_compute_image.my_image.self_link
  }
  network_interface {
    network = data.google_compute_network.default.self_link
  }
  metadata_startup_script = file("./startup-scripts/helloworld.sh")
  metadata = {
    instance-env = "SOFTWARE_VERSION=Version-B"
  }
  tags = [
    "http-server"
  ]
  lifecycle {
    create_before_destroy = true
  }
}

data "google_compute_image" "my_image" {
  family = "ubuntu-1804-lts"
  project = "ubuntu-os-cloud"
}

output "ip-address" {
  value = google_compute_address.default.address
}

The startup script, which brings up the server that runs on each instance (startup-scripts/helloworld.sh):
#!/bin/bash -x
METADATA_BASE=http://metadata.google.internal/computeMetadata/v1
SOFTWARE_VERSION=$(curl -sfm5 -H "Metadata-Flavor: Google" ${METADATA_BASE}/instance/attributes/instance-env)
echo "Hello World! This is ${SOFTWARE_VERSION} from $(hostname -f)" > index.html
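python3 -m http.server 80 &

The problem I am running into is that when I reduce the number of instances from 6 to 3, I can see that some instances are marked unhealthy by the health check, yet the target pool still directs traffic to them. The documentation implies that these instances should not receive any traffic.
While resizing from 6 instances to 3, I ran two shell scripts and got the following results:

while [[ 1 ]]; do echo -n "$(date +%s) "; curl -m5 http://${IP_ADDRESS} && sleep 1; done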
no timeouts
...
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838267 curl: (52) Empty reply from server
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
no timeouts

while [[ 1 ]]; do echo -n "$(date +%s) "; gcloud compute target-pools get-health default --region us-central1 && sleep 1; done
all six instances are healthy
...
1589838263 ---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
1589838266 ---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-f/instances/instance-rvcl
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-562f
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: UNHEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-42dm
kind: compute#targetPoolInstanceHealth
...
unhealthy for a bit
...
1589838312 ---
healthStatus:
- healthState: HEALTHY
instance: v1/projects/my-project/zones/us-central1-f/instances/instance-xss8
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-c/instances/instance-w47x
ipAddress: <IP_ADDRESS>
kind: compute#targetPoolInstanceHealth
---
healthStatus:
- healthState: HEALTHY
instance: compute/v1/projects/my-project/zones/us-central1-b/instances/instance-wh9v
ipAddress: <IP_ADDRESS>
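kind: compute#targetPoolInstanceHealth

As the timestamps from the two scripts show, the unhealthy instances were still receiving traffic.
This pattern of unhealthy instances getting traffic also shows up when the instance template of the instance group manager is replaced with a different one in Terraform. It shows up as well when bringing up a second regional instance group manager, adding it to the target pool, waiting for traffic to reach the new instances, and then removing the old regional instance group manager from the target pool. I also tried bringing up a second target pool with its own instance group manager and then changing the forwarding rule, but there I saw more than a minute of downtime during which no traffic reached either regional instance group.
What can I do to avoid this downtime?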
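Posted on 2020-05-20 15:40:43
As the logs show, there is a thirty-second window during which the load balancer keeps sending requests to unresponsive instances.
gcloud compute target-pools get-health gives the following timestamps and health states: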
Timestamp 1589838263
instance-rvcl UNHEALTHY
instance-42dm UNHEALTHY
instance-562f HEALTHY
Timestamp 1589838266
instance-rvcl UNHEALTHY
instance-42dm UNHEALTHY
instance-562f UNHEALTHY

The curl output merged with the health states:
no timeouts
...
1589838263 # instance-562f still HEALTHY, the last response
1589838264 Hello World! This is SOFTWARE_VERSION=Version-B from instance-562f.c.my-project.internal
1589838265 Hello World! This is SOFTWARE_VERSION=Version-B from instance-42dm.c.my-project.internal
1589838266 # relative time +0s, instances -rvcl,42dm,562f are UNHEALTHY
1589838267 curl: (52) Empty reply from server
1589838267 curl: (28) Connection timed out after 5004 milliseconds
1589838272 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838273 curl: (28) Connection timed out after 5004 milliseconds
1589838278 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
1589838279 curl: (28) Connection timed out after 5004 milliseconds
1589838284 Hello World! This is SOFTWARE_VERSION=Version-B from instance-wh9v.c.my-project.internal
1589838285 curl: (28) Connection timed out after 5004 milliseconds
1589838290 Hello World! This is SOFTWARE_VERSION=Version-B from instance-w47x.c.my-project.internal
1589838292 curl: (28) Connection timed out after 5003 milliseconds
1589838297 curl: (28) Connection timed out after 5003 milliseconds # relative time +30s; instances -rvcl,42dm,562f still UNHEALTHY
1589838302 Hello World! This is SOFTWARE_VERSION=Version-B from instance-xss8.c.my-project.internal
...
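no timeouts

This is probably due to the delay before the load balancer recognizes that an instance is unhealthy. During that time the load balancer keeps sending new requests to the instance.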
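The length of that window can be read off the curl log mechanically. The sketch below assumes the loop output (including curl's error messages, which go to stderr) was captured to a file with one "<epoch> <output>" entry per line; the function name failure_window is hypothetical and not part of the original post.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post).
# Reads "<epoch> <curl output>" lines on stdin and reports the first and last
# timestamps whose line contains a curl error, plus the span between them.
failure_window() {
  awk '
    /curl: \(/ {
      if (first == "") first = $1   # remember the first failing timestamp
      last = $1                      # keep updating the last failing timestamp
      count++
    }
    END {
      if (count == 0) { print "no failures"; exit }
      printf "first=%s last=%s span=%ds failures=%d\n", first, last, last - first, count
    }
  '
}
```

For example, `failure_window < curl.log` (curl.log being an assumed file name) would print the first and last failing timestamps. Applied to the log above, the first failure is at 1589838267 and the last at 1589838297, which matches the thirty-second window.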
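Once an instance becomes unhealthy, the load balancer stops sending new connections to it. However, existing connections cannot be terminated until the shutdown script has shut the instance down. For normal instances, the shutdown period is 90 seconds.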
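One way to use that shutdown period is to fail the health check first and only stop the server after a pause. The following is a minimal sketch under several assumptions that are not in the original post: the health check points at a dedicated file rather than "/", the server returns 404 once that file is gone, and the drain length fits inside the shutdown period. The function name drain_then_stop is hypothetical, and nothing here guarantees that the load balancer stops sending traffic within the chosen pause.

```shell
#!/bin/bash
# Sketch only: drain_then_stop, the marker file and the drain length are
# hypothetical and not taken from the original post.
drain_then_stop() {
  local marker_file=$1    # file served at the health check path (assumed)
  local drain_seconds=$2  # how long to keep serving after failing the check
  shift 2                 # remaining arguments: the command that stops the server

  rm -f "$marker_file"    # the health check path should now fail
  sleep "$drain_seconds"  # keep serving while the load balancer reacts
  "$@"                    # finally stop the server
}
```

In a shutdown script this might be called as `drain_then_stop /srv/healthz 60 pkill -f http.server`, where the path, the 60-second pause, and the stop command are all placeholders to adapt.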
Here is an example of a health check timeline:
Load Balancing > Doc > Health checks overview > Example health check
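As a rough cross-check against the settings in the question, the time for a health state to flip can be estimated as the probe interval multiplied by the number of consecutive probes required. This is a simplification that ignores the probe timeout and any propagation delay inside the load balancer, and the function name estimate_flip_seconds is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): rough estimate, in seconds,
# of how long it takes a health state to flip, assuming one probe per interval
# and <threshold> consecutive probes with the same result.
estimate_flip_seconds() {
  local interval=$1 threshold=$2
  echo $(( interval * threshold ))
}

# Values from the question: check_interval_sec = 1,
# unhealthy_threshold = 1, healthy_threshold = 3.
echo "to unhealthy: $(estimate_flip_seconds 1 1)s"
echo "to healthy:   $(estimate_flip_seconds 1 3)s"
```

With the question's values the estimate is about one second to become unhealthy and three seconds to become healthy, so the configured thresholds alone are much shorter than the thirty-second window seen in the logs.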
See also
Compute Engine > Doc > Understanding autoscaler decisions > Preparing for instance terminations
Load Balancing > Doc > Health checks overview > How health checks work > Health state
https://stackoverflow.com/questions/61898004