0. Preface: why you need "power tools" rather than "common commands"
Hi everyone, I'm Lao Yang. My project that collects and analyzes global VPS information in real time is at vps.top365app.com.
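Many companies spend heavily on security, yet the incident that finally hurts them often comes not from an outside attacker but from an insider: someone with too much privilege, someone who operates carelessly, or someone who sabotages on purpose. The zero-trust model addresses exactly this: trust no person and no machine by default, even inside the internal network. This article walks through how to roll out zero trust step by step in an ops environment to curb insider abuse.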
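A static public key left in place for three years is a key nobody can confirm is still in use. Move SSH to certificates so the login credential changes from a "permanent key" into a "short-lived token", then put MFA into the login path, and the cost of abuse rises sharply.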
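To make the "short-lived token" idea concrete: the MFA step later in this section relies on TOTP, and the sketch below shows how such a code can be derived from a shared secret and the current time, following RFC 6238 (HMAC-SHA1 variant). The function name `totp` and the sample secret are illustrative assumptions, not something from the original setup; a real deployment leaves this computation to the authenticator app and the PAM module.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1 variant) for a given time."""
    counter = unix_time // step                    # number of elapsed time steps
    msg = struct.pack(">Q", counter)               # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                     # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # Test secret from RFC 6238; a real secret is random and usually base32-encoded.
    demo_secret = b"12345678901234567890"
    print(totp(demo_secret, 59, digits=8))  # RFC 6238 Appendix B lists 94287082 for T=59
```

Because the code depends on the 30-second time step, a captured value is useless shortly afterwards, which is the same property the 30-minute certificate gives the login credential.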
1) Generate an SSH CA and sign a short-lived user certificate (valid for 30 minutes)
$ sudo ssh-keygen -f /etc/ssh/ssh_ca -t ed25519 -N ''
Generating public/private ed25519 key pair.
Your identification has been saved in /etc/ssh/ssh_ca
Your public key has been saved in /etc/ssh/ssh_ca.pub
$ ssh-keygen -t ed25519 -f ~/.ssh/alice -N ''
Generating public/private ed25519 key pair.
Your identification has been saved in /home/alice/.ssh/alice
Your public key has been saved in /home/alice/.ssh/alice.pub
$ sudo ssh-keygen -s /etc/ssh/ssh_ca -I alice@prod -n alice,ops -V +30m ~/.ssh/alice.pub
Signed user key ~/.ssh/alice-cert.pub: id "alice@prod" serial 0 for alice,ops valid for 30m
2) Trust the CA on the target host and map certificate principals to system accounts
$ echo "TrustedUserCAKeys /etc/ssh/ssh_ca.pub" | sudo tee -a /etc/ssh/sshd_config
TrustedUserCAKeys /etc/ssh/ssh_ca.pub
# sshd_config has no "Match principal" criterion, so the mapping is declared per
# target account: Match User selects the account, %u passes its name to the script.
$ printf '%s\n' \
'Match User ops-runner' \
'  AuthorizedPrincipalsCommand /usr/local/sbin/map_principal.sh %u' \
'  AuthorizedPrincipalsCommandUser root' | sudo tee -a /etc/ssh/sshd_config
Match User ops-runner
  AuthorizedPrincipalsCommand /usr/local/sbin/map_principal.sh %u
  AuthorizedPrincipalsCommandUser root
$ sudo bash -c 'cat >/usr/local/sbin/map_principal.sh' <<'EOF'
#!/usr/bin/env bash
# Print the certificate principals allowed to log in as the account given in $1
if [[ "$1" == "ops-runner" ]]; then
  echo "ops"
fi
EOF
$ sudo chmod +x /usr/local/sbin/map_principal.sh
$ sudo systemctl reload sshd
3) Bind MFA to SSH (PAM + TOTP, illustrative)
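$ sudo apt-get install -y libpam-google-authenticator >/dev/null 2>&1 || true
$ sudo sed -n '1,120p' /etc/pam.d/sshd | tail -n 5
#@include common-auth
# nullok lets accounts without an enrolled TOTP secret skip the check; drop it once everyone is enrolled
$ echo "auth required pam_google_authenticator.so nullok" | sudo tee -a /etc/pam.d/sshd
auth required pam_google_authenticator.so nullok
# Newer OpenSSH releases name this option KbdInteractiveAuthentication; the pattern covers both spellings
$ sudo sed -i -E 's/^#?(KbdInteractiveAuthentication|ChallengeResponseAuthentication) no/\1 yes/' /etc/ssh/sshd_config
# To require certificate AND TOTP, sshd also needs "AuthenticationMethods publickey,keyboard-interactive" in its global section
$ sudo systemctl reload sshd
4) Login experience (certificate + TOTP; the one-time code is masked)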
$ ssh -i ~/.ssh/alice -o CertificateFile=~/.ssh/alice-cert.pub ops@10.0.0.12
Verification code: ******
Welcome to Ubuntu 22.04 LTS (GNU/Linux 5.15.0-86-generic x86_64)
Last login: Tue Aug 19 23:41:02 2025 from 10.0.0.5
Once the certificate expires, the result is blunt:
$ date
Wed Aug 20 00:15:01 UTC 2025 # (simulated: clock moved forward)
$ ssh -i ~/.ssh/alice -o CertificateFile=~/.ssh/alice-cert.pub ops@10.0.0.12
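Permission denied (publickey).
Shared accounts are a breeding ground for incidents. Switch connections to "issue on demand, valid briefly, revoked automatically", and the record of who did what and when becomes clear.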
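The issue, expire, revoke lifecycle described here can be modelled in a few lines. The sketch below is a toy model only: the `Lease` class and its fields are hypothetical names chosen for illustration and do not reflect Vault's internal data structures. It assumes a lease has a current TTL plus a hard ceiling counted from the issue time, which is how `default_ttl` and `max_ttl` are used in the steps that follow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Lease:
    """Toy model of a dynamic-credential lease (hypothetical, for illustration)."""
    issued_at: datetime
    ttl: timedelta        # current time-to-live, counted from issued_at
    max_ttl: timedelta    # hard ceiling, also counted from issued_at
    revoked: bool = False

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.ttl

    def is_valid(self, now: datetime) -> bool:
        return not self.revoked and now < self.expires_at

    def renew(self, now: datetime, increment: timedelta) -> None:
        """Extend the lease from `now`, but never past issued_at + max_ttl."""
        if not self.is_valid(now):
            raise ValueError("lease expired or revoked; request a new credential")
        hard_limit = self.issued_at + self.max_ttl
        self.ttl = min(now + increment, hard_limit) - self.issued_at


if __name__ == "__main__":
    t0 = datetime(2025, 8, 20, 10, 0, tzinfo=timezone.utc)
    lease = Lease(issued_at=t0, ttl=timedelta(minutes=15), max_ttl=timedelta(hours=1))
    lease.renew(t0 + timedelta(minutes=10), timedelta(hours=2))
    print(lease.expires_at)  # capped at one hour after issue
```

The point of the cap is that a credential can be kept alive while work is in progress, yet it cannot quietly turn back into a long-lived shared account.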
1) Enable the database secrets engine and configure the connection (MySQL example)
$ export VAULT_ADDR=http://127.0.0.1:8200
$ export VAULT_TOKEN=hvs.******
$ vault secrets enable database
Success! Enabled the database secrets engine at: database/
$ vault write database/config/prod-mysql \
plugin_name=mysql-database-plugin \
connection_url="{{username}}:{{password}}@tcp(db.prod:3306)/" \
username="vault_admin" password="SuperStr0ng!" \
allowed_roles="ro-analytics,rw-hotfix"
Success! Data written to: database/config/prod-mysql
2) Define a read-only role and its lease times
$ vault write database/roles/ro-analytics \
db_name=prod-mysql \
creation_statements="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}';GRANT SELECT ON prod.* TO '{{name}}'@'%';" \
default_ttl=15m max_ttl=1h
Success! Data written to: database/roles/ro-analytics
3) Request a one-time credential and inspect the lease
$ vault read database/creds/ro-analytics
Key Value
--- -----
lease_id database/creds/ro-analytics/8u8bqJp9rT6WmZq7n
lease_duration 15m
lease_renewable true
password x5-pZQH9NqH5aqz5bq
username v-token-ro-analytics-PVnQy3O4
$ mysql -h db.prod -u v-token-ro-analytics-PVnQy3O4 -p'x5-pZQH9NqH5aqz5bq' -e 'SELECT COUNT(*) FROM prod.orders LIMIT 1;'
+----------+
| COUNT(*) |
+----------+
| 125943 |
+----------+
4) Let the lease expire, or revoke it early
$ vault lease revoke database/creds/ro-analytics/8u8bqJp9rT6WmZq7n
All revocations queued for lease_id: database/creds/ro-analytics/8u8bqJp9rT6WmZq7n
$ mysql -h db.prod -u v-token-ro-analytics-PVnQy3O4 -p'x5-pZQH9NqH5aqz5bq' -e 'SELECT 1;'
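ERROR 1045 (28000): Access denied for user 'v-token-ro-analytics-PVnQy3O4'@'%' (using password: YES)
Service-to-service access can no longer rely on IP or port allowlists. Issue each workload an identity document (an SVID), then bind the communication with mTLS.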
1) Register an entry for a production workload
$ spire-server entry create \
-spiffeID spiffe://example.org/ns/prod/sa/api \
-parentID spiffe://example.org/spire/agent/x509pop/VM-1234 \
-selector k8s:ns:prod -selector k8s:sa:api
Entry created
Entry ID : 03e6f49a-7d9a-4a5b-b21b-2a86f9e67a3d
SPIFFE ID : spiffe://example.org/ns/prod/sa/api
2) Inspect the issued certificate (SVID) inside the Pod
$ kubectl exec -n prod deploy/api -- \
sh -c 'openssl x509 -in /run/spire/svids/default.pem -noout -subject -dates'
subject= /C=US/O=SPIFFE/URI=spiffe://example.org/ns/prod/sa/api
notBefore=Aug 20 10:01:12 2025 GMT
notAfter=Aug 20 13:01:12 2025 GMT
3) Call a downstream service over mTLS (illustrative curl)
$ curl --cert /run/spire/svids/default.pem \
--key /run/spire/svids/default.key \
--cacert /run/spire/bundle/bundle.crt \
https://orders.prod.svc.cluster.local:8443/healthz
ok
Certificates live for hours and rotate automatically, so service identity is no longer left floating.
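To check the "hours-long, automatically rotated" claim against a real certificate, the `-dates` output shown in step 2 can be parsed and compared with a rotation threshold. This is a minimal sketch: `parse_dates` and `rotation_due` are hypothetical helper names, and the 50% threshold is an assumption chosen for illustration rather than a statement about how any particular agent schedules rotation.

```python
from datetime import datetime

OPENSSL_FMT = "%b %d %H:%M:%S %Y GMT"  # layout printed by `openssl x509 -noout -dates`


def parse_dates(openssl_output: str) -> tuple[datetime, datetime]:
    """Extract notBefore/notAfter (naive UTC datetimes) from openssl output."""
    fields = {}
    for line in openssl_output.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    not_before = datetime.strptime(fields["notBefore"], OPENSSL_FMT)
    not_after = datetime.strptime(fields["notAfter"], OPENSSL_FMT)
    return not_before, not_after


def rotation_due(not_before: datetime, not_after: datetime,
                 now: datetime, fraction: float = 0.5) -> bool:
    """True once `fraction` of the certificate lifetime has elapsed."""
    lifetime = not_after - not_before
    return now >= not_before + lifetime * fraction


if __name__ == "__main__":
    sample = (
        "subject= /C=US/O=SPIFFE/URI=spiffe://example.org/ns/prod/sa/api\n"
        "notBefore=Aug 20 10:01:12 2025 GMT\n"
        "notAfter=Aug 20 13:01:12 2025 GMT\n"
    )
    start, end = parse_dates(sample)
    print(end - start)  # 3:00:00
```

A check like this is handy in monitoring: a certificate that is well past the threshold and still unchanged suggests rotation has stalled.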
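A policy that lives in a document is decoration; a policy that lives in an engine is a lock. Get the rules working on the command line first, then wire them into cluster admission or CI.
1) Rego: forbid kubectl exec in the prod namespace unless the user is in the oncall group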
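# policy.rego
package kubernetes.admission
# rego.v1 keeps the rules valid on OPA 1.x as well as recent 0.x releases
import rego.v1
default allow := false
# exec outside prod is allowed
allow if {
    input.request.kind.kind == "PodExecOptions"
    input.request.namespace != "prod"
}
# exec in prod requires membership in the oncall group
allow if {
    input.request.kind.kind == "PodExecOptions"
    input.request.namespace == "prod"
    "oncall" in input.request.userInfo.groups
}
2) Local verification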
$ opa eval --format raw -i exec-prod.json -d policy.rego "data.kubernetes.admission.allow"
false
$ opa eval --format raw -i exec-staging.json -d policy.rego "data.kubernetes.admission.allow"
true
3) Wire it into the cluster (Gatekeeper fragment, illustrative)
$ kubectl apply -f constraint-template.yaml
constrainttemplate.templates.gatekeeper.sh/k8snoexec created
$ kubectl apply -f constraint-deny-exec-prod.yaml
k8snoexec.constraints.gatekeeper.sh/deny-exec-in-prod created
The effect is to slap the hand away before it ever touches anything.
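The local verification step reads exec-prod.json and exec-staging.json, which the article does not show. The sketch below builds inputs of the shape the Rego rules expect and mirrors the two `allow` rules in Python for cross-checking. The field layout is trimmed to what the policy reads, and `exec_request`, `allow`, and the sample group names are assumptions made for illustration; the Python function is a sanity check, not a replacement for running OPA.

```python
import json


def exec_request(namespace: str, groups: list[str]) -> dict:
    """Build a trimmed admission-style input for a `kubectl exec` call."""
    return {
        "request": {
            "kind": {"kind": "PodExecOptions"},
            "namespace": namespace,
            "userInfo": {"username": "dev@corp", "groups": groups},
        }
    }


def allow(doc: dict) -> bool:
    """Python mirror of the two Rego `allow` rules (cross-check only)."""
    req = doc.get("request", {})
    if req.get("kind", {}).get("kind") != "PodExecOptions":
        return False                      # falls through to the default deny
    namespace = req.get("namespace")
    if namespace is None:
        return False                      # an undefined field never satisfies a Rego rule
    if namespace != "prod":
        return True                       # first rule: exec outside prod
    return "oncall" in req.get("userInfo", {}).get("groups", [])  # second rule


if __name__ == "__main__":
    samples = {
        "exec-prod.json": exec_request("prod", ["dev"]),
        "exec-staging.json": exec_request("staging", ["dev"]),
    }
    for name, doc in samples.items():
        print(name, allow(doc))
        print(json.dumps(doc, indent=2))  # redirect into the file named above
```

Keeping a mirror like this next to the policy makes it easy to generate edge-case inputs (missing namespace, empty groups) and confirm that the engine and the intent agree.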
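Lateral movement is a common route for insider abuse. Isolation does not mean blocking everything; it means allowing only what is needed.
1) CiliumNetworkPolicy: allow api to reach only mysql on port 3306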
# cnp-api-mysql.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: api-to-mysql
  namespace: prod
spec:
  endpointSelector:
    matchLabels:
      app: api
  # Once an egress rule selects these Pods, all other egress is dropped,
  # DNS included; resolving mysql.prod.svc needs an extra rule for kube-dns.
  egress:
    - toEndpoints:
        - matchLabels:
            app: mysql
      toPorts:
        - ports:
            - port: "3306"
              protocol: TCP
$ kubectl apply -f cnp-api-mysql.yaml
ciliumnetworkpolicy.cilium.io/api-to-mysql created
$ kubectl -n prod exec deploy/api -- nc -zvw1 mysql.prod.svc 3306
Connection to mysql.prod.svc 3306 port [tcp/mysql] succeeded!
$ kubectl -n prod exec deploy/api -- nc -zvw1 redis.prod.svc 6379
nc: connect to redis.prod.svc port 6379 (tcp) failed: Connection refused
Self-discipline is not enough to prevent some mistakes; ideally every dangerous action is "recorded on camera".
1) Load audit rules that watch rm, useradd, iptables, and similar commands
$ sudo bash -c 'cat >/etc/audit/rules.d/ops.rules' <<'EOF'
-w /bin/rm -p x -k dangerous-cmd
-w /usr/sbin/useradd -p x -k identity-change
-w /usr/sbin/iptables -p x -k net-change
EOF
$ sudo augenrules --load
No rules
Found rules: 3
2) Run a dangerous command and search the audit log
$ sudo rm -rf /tmp/testdir
$ sudo ausearch -k dangerous-cmd --start recent | tail -n 6
type=EXECVE msg=audit(1724167201.123:1024): argc=3 a0="rm" a1="-rf" a2="/tmp/testdir"
type=PROCTITLE msg=audit(1724167201.123:1024): proctitle=726D002D7266002F746D702F74657374646972
type=SYSCALL msg=audit(1724167201.123:1024): arch=c000003e syscall=59 success=yes exe="/usr/bin/rm" auid=1001 uid=0 ...
3) Enable sudo I/O logging (replay-grade)
$ sudo bash -c 'printf "%s\n" Defaults\ log_output Defaults\ iolog_dir=/var/log/sudo-io >> /etc/sudoers.d/logio'
$ sudo -u ops-runner sudo -k
$ sudo -u ops-runner sudo -S ls /root <<<"opsrunner_password" >/dev/null 2>&1 || true
$ sudo ls -l /var/log/sudo-io | tail -n 3
drwx------ 2 root root 4096 Aug 20 11:08 00/00/01
-r-------- 1 root root 512 Aug 20 11:08 00/00/01/timing
-r-------- 1 root root 4096 Aug 20 11:08 00/00/01/stdout
The replay is very useful during forensics.
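One detail of the audit records above is easy to miss during forensics: the PROCTITLE field is hex-encoded, with NUL bytes separating the arguments. `ausearch -i` interprets it for you; when the raw log has already been shipped elsewhere, a small decoder is enough. The function name `decode_proctitle` is a hypothetical helper for illustration.

```python
def decode_proctitle(hex_value: str) -> list[str]:
    """Decode auditd's hex-encoded proctitle into the original argument list."""
    raw = bytes.fromhex(hex_value)
    # Arguments are separated by NUL bytes; drop empty trailing pieces.
    return [part.decode("utf-8", errors="replace") for part in raw.split(b"\x00") if part]


if __name__ == "__main__":
    # Value taken from the PROCTITLE record shown in the ausearch output.
    print(decode_proctitle("726D002D7266002F746D702F74657374646972"))
    # ['rm', '-rf', '/tmp/testdir']
```

The decoded list matches the a0/a1/a2 fields of the EXECVE record, which is a quick way to confirm that two records belong to the same event.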
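Temporary privilege escalation should work like a small loan: quick to obtain, and it must expire on its own.
1) Grant by role in sudoers and set the credential cache to 0 minutes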
$ sudo bash -c 'cat >/etc/sudoers.d/ops' <<'EOF'
Defaults timestamp_timeout=0,log_output,iolog_dir=/var/log/sudo-io
Cmnd_Alias RESTART_SAFE = /bin/systemctl restart nginx, /bin/systemctl restart myapp
%oncall ALL=(root) NOPASSWD: RESTART_SAFE
EOF
$ sudo visudo -c
/etc/sudoers: parsed OK
/etc/sudoers.d/ops: parsed OK
2) A non-oncall user tries and is refused
$ id -nG
dev users
$ sudo systemctl restart nginx
[sudo] password for dev:
Sorry, user dev is not allowed to execute '/bin/systemctl restart nginx' as root on host01.
3) Add dev to the oncall group for five minutes with automatic expiry (scheduler shown only in outline)
$ sudo usermod -aG oncall dev
$ newgrp oncall
$ sudo systemctl restart nginx
# no output means success
# five minutes later (simulated)
$ sleep 300
$ sudo systemctl restart nginx
[sudo] password for dev:
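Sorry, user dev is not allowed to execute '/bin/systemctl restart nginx' as root on host01.
The membership time window can be enforced by an external workflow system that removes the user on schedule; the reason and the linked ticket are written to the log as well.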
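The transcript only simulates the expiry, so here is a minimal sketch of what the external scheduler's core logic might look like. Everything in it is an assumption for illustration: the `Grant` record, the ticket field, and the choice of `gpasswd -d` as the removal command. The function only returns the commands it would run, so it can be reviewed and logged before anything is executed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Grant:
    """A time-boxed group membership (hypothetical record format)."""
    user: str
    group: str
    granted_at: datetime
    duration: timedelta
    ticket: str  # change or incident ticket that justified the grant

    def expired(self, now: datetime) -> bool:
        return now >= self.granted_at + self.duration


def revocation_commands(grants: list[Grant], now: datetime) -> list[list[str]]:
    """Return the `gpasswd -d` commands a scheduler would run for expired grants."""
    return [["gpasswd", "-d", g.user, g.group] for g in grants if g.expired(now)]


if __name__ == "__main__":
    t0 = datetime(2025, 8, 20, 11, 0, 0)
    grants = [Grant("dev", "oncall", t0, timedelta(minutes=5), ticket="INC-0001")]
    for cmd in revocation_commands(grants, t0 + timedelta(minutes=5)):
        print(" ".join(cmd))  # gpasswd -d dev oncall
```

Storing the ticket alongside the grant means the revocation log line can carry the reason, which is what makes the window auditable afterwards.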
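Put an identity-aware proxy in front of internal HTTP/SSH entry points, so fewer jump hosts and fewer plaintext addresses are left lying around.
1) Envoy in front, requiring mTLS + JWT (fragment)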
# envoy-zero-trust.yaml (fragment; the jwt_authn provider config, the router filter, and clusters are omitted)
static_resources:
  listeners:
    - name: ingress_https
      address: { socket_address: { address: 0.0.0.0, port_value: 8443 } }
      filter_chains:
        # one filter chain carries both the TLS settings and the HTTP filters
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates: [{ certificate_chain: { filename: "server.pem" }, private_key: { filename: "server.key" } }]
                validation_context:
                  trusted_ca: { filename: "bundle.crt" } # issued by SPIRE
              require_client_certificate: true
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                route_config:
                  virtual_hosts:
                    - name: v1
                      require_tls: ALL
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: api }
                http_filters:
                  - name: envoy.filters.http.jwt_authn # validates OIDC/JWT
2) Start and probe
$ envoy -c envoy-zero-trust.yaml --disable-hot-restart &
[info] starting main dispatch loop
$ curl -vk --cert svid.pem --key svid.key https://gateway.internal:8443/health \
-H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9..."
< HTTP/2 200
ok
No JWT? The answer is 401/403. No SVID? The TLS handshake fails before any HTTP status is returned.
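When a request is rejected, the first thing to check is usually what the token actually claims. The sketch below decodes a JWT's header and payload for debugging and tests the `exp` claim. It deliberately does not verify the signature, so it must never be used to make an access decision; that stays with the proxy. The function names are hypothetical, and treating a missing `exp` as expired is a conservative assumption.

```python
import base64
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip off."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def peek_jwt(token: str) -> tuple[dict, dict]:
    """Return (header, payload) WITHOUT verifying the signature (debugging only)."""
    header_b64, payload_b64, _signature = token.split(".")
    return json.loads(_b64url_decode(header_b64)), json.loads(_b64url_decode(payload_b64))


def is_expired(payload: dict, now: Optional[float] = None) -> bool:
    """True if the `exp` claim is in the past; a missing `exp` counts as expired."""
    current = time.time() if now is None else now
    return current >= payload.get("exp", 0)
```

A typical use is `header, payload = peek_jwt(token)` followed by `is_expired(payload)`; if the token has expired, the 401 from the gateway is the expected result and not a configuration fault.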
kubectl is a double-edged sword, and how you hold it matters.
1) Read-only kubeconfig (grant only get/list/watch)
$ kubectl create clusterrole readonly-basic --verb=get,list,watch --resource="*"
clusterrole.rbac.authorization.k8s.io/readonly-basic created
$ kubectl create clusterrolebinding dev-ro --clusterrole=readonly-basic --user=dev@corp
clusterrolebinding.rbac.authorization.k8s.io/dev-ro created
$ KUBECONFIG=./kubeconfig-dev kubectl auth can-i delete pods -n prod
no
$ KUBECONFIG=./kubeconfig-dev kubectl get pods -n staging | head -n 2
NAME READY STATUS RESTARTS AGE
api-5dd6c8f8f9-abcde 1/1 Running 0 3h
2) Admission control freezes exec/port-forward (ties in with OPA/Gatekeeper)
$ kubectl -n prod exec deploy/api -- ls /
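Error from server (Forbidden): admission webhook "gatekeeper.sh" denied the request: exec in prod is restricted
A logging system that is too complex will itself fail during an incident. Get the main chain working first: local buffer → reliable queue → WORM storage.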
1) Fluent Bit ships audit and system logs to object storage (fragment)
# fluent-bit.conf (fragment)
[INPUT]
    Name            systemd
    Tag             host.*
    Systemd_Filter  _COMM=sshd
    Systemd_Filter  _COMM=sudo
[OUTPUT]
    Name             s3
    Match            host.*
    bucket           org-audit-logs
    region           ap-northeast-1
    s3_key_format    /%Y/%m/%d/%H/%M/$TAG
    total_file_size  5M
    upload_timeout   1m
2) Check local buffering and upload status
$ sudo systemctl status fluent-bit --no-pager | sed -n '1,10p'
● fluent-bit.service - Fluent Bit
Active: active (running) since Tue 2025-08-19 10:01:00 UTC
...
$ ls -lh /var/fluent-bit/buffers | head -n 3
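-rw------- 1 fluentbit fluentbit 5.0M Aug 19 10:05 host.sshd.00000001
However strict the rules, there must be a way to fight fires in an extreme moment; but fighting a fire is not the same as setting one.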
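1) Issue a "break-glass token" in one step, valid for 10 minutes and audited automatically
# Illustrative only: sys/leases/lookup/renew is not a stock Vault endpoint. A real setup
# would typically mint a short-TTL token, e.g. vault token create -ttl=10m -policy=<break-glass policy>.
$ vault write sys/leases/lookup/renew \
lease_id=approle/oncall/issue/root-ops ttl=10m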
Key Value
--- -----
lease_id approle/oncall/issue/root-ops/7sY8...
lease_duration 10m
$ export SUDO_PROMPT="[oncall-7sY8] password for %p: "
$ sudo -k; sudo -n true || echo "MFA required. Use oncall token."
MFA required. Use oncall token.
2) Revoke on schedule and broadcast
$ vault lease revoke -prefix approle/oncall/issue/root-ops/
All revocations queued for path prefix: approle/oncall/issue/root-ops/
$ wall <<<"[SEC] oncall window closed. Privileges revoked."
Broadcast message from root@host01 (pts/0) at 11:20 ...
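[SEC] oncall window closed. Privileges revoked.
Put all of the configuration above into a Git repository and run it through MR/PR review, automated scanning, and rollback. What production fears most is unexplained drift, and only a declarative setup can keep things aligned.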
1) Policies and configuration all go through GitOps
$ git add policy.rego cnp-api-mysql.yaml envoy-zero-trust.yaml
$ git commit -m "zero-trust: deny exec in prod; restrict api->mysql; envoy mTLS+JWT"
[main 4f21d6a] zero-trust: deny exec in prod; restrict api->mysql; envoy mTLS+JWT
3 files changed, 112 insertions(+)
$ git push origin main
To git.example.com:platform/zt-handbook.git
1b2c3d4..4f21d6a main -> main
2) Run a "minimum bar" check in CI
$ opa fmt -l policy.rego && opa test -v .
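ok 1 - policy.rego::kubernetes.admission allow denies prod exec by default
Permissions are not better the more you have; they are steadier the more precise they are. Whether zero trust is done well comes down to the false-positive rate and availability: certificates should be short-lived, but not so short that people are disconnected before they finish investigating; policies should be tight, but not so tight that they tie the hands of the people fighting the fire. Keep each technical point small and solid: do not let SSH certificates become another ring of long-lived keys; always define revocation for dynamic credentials; start OPA rules from observable denials, with log messages in plain language; for eBPF network policies, outline the main paths first and then tighten bit by bit. No single step is a silver bullet, but together they turn insider abuse into something that is extremely costly and leaves obvious traces.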
Finally, a "getting started checklist": no slogans, only actions:
# 1. Turn on SSH certificates, enable TOTP, disable long-lived keys
sudo ssh-keygen -f /etc/ssh/ssh_ca -t ed25519 -N ''
echo "TrustedUserCAKeys /etc/ssh/ssh_ca.pub" | sudo tee -a /etc/ssh/sshd_config
echo "auth required pam_google_authenticator.so nullok" | sudo tee -a /etc/pam.d/sshd
sudo systemctl reload sshd
# 2. Vault dynamic DB credentials
vault secrets enable database
vault write database/config/prod-mysql plugin_name=mysql-database-plugin ...
vault write database/roles/ro-analytics ... default_ttl=15m
# 3. SPIRE issues SVIDs (machine identity)
spire-server entry create ... ns/prod/sa/api
kubectl exec ... openssl x509 -in /run/spire/svids/default.pem -noout -dates
# 4. OPA/Gatekeeper blocks dangerous actions
opa eval -i exec-prod.json -d policy.rego 'data.kubernetes.admission.allow'
# 5. Cilium policy tightens lateral movement
kubectl apply -f cnp-api-mysql.yaml
# 6. auditd + sudo I/O logging leaves a trail
sudo augenrules --load
sudo visudo -c
The essentials of zero trust:
Do not readily trust people.
Do not readily trust machines.
Do not readily trust a single login.
It takes permissions, monitoring, and auditing working together to make it real. That way even an insider cannot abuse the system at will.
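One note: in everyday life people call me Bo Ge ("Brother Bo"). It has nothing to do with seniority, mostly that I am getting on in years; it is just a nickname. I address my post-2000 colleagues with "Ge" too: Zhang Ge, Li Ge. But at offline events it feels a bit off when a client uses the name as well. For example, the planning director of a certain group once said at a company meeting: "We're in high spirits today! Let's invite Bo Ge from the IT ops community to say a few words." In the internet industry that atmosphere and that form of address do not quite match. Each time I want to answer: "Meeting you all is fate, and I'm grateful for the kindness. Say no more, it's all in the drink. I'll drain mine; you drink as you please!" So from now on let's go with Lao Yang: down-to-earth and low-key, and rather friendly too. I think it works well.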