Deploying Vision Agents Step by Step: A Complete Hands-On Guide from a Local Run to Production on K8s

CoovallyAIHub

Published 2026-03-06 16:15:43

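The previous article, "A Real-Time Vision AI Agent Framework Arrives: Vision Agents Racks Up 7K Stars", drew a lot of attention, which shows that real-time video AI agents are a direction people care about. Once you know what the framework is, the natural next step is getting it to run.

This article covers environment setup, API key registration, running the first demo locally, Docker packaging, and production deployment on Kubernetes, as a complete deployment reference.

Project repository: https://github.com/GetStream/Vision-Agents

The previous article covered the core design, model ecosystem, and use cases of Vision Agents. If you have not read it yet, start there to understand what the framework does.

This article does not repeat that introduction and goes straight to deployment. It is organized as follows:

Sections 1-4: environment setup → run the first demo → start the HTTP service
Sections 5-10: Docker containerization → K8s deployment → multi-node scaling → monitoring → frontend integration

The first half targets local development and experimentation; the second half targets production deployment.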

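Environment setup

System requirements

If you only use cloud models (no local YOLO), an ordinary laptop is enough. A GPU is optional, not required.

Install uv (Python package manager)

Vision Agents recommends uv over pip: installs are faster and dependency management is cleaner.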

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or via Homebrew
brew install uv
# Verify the installation
uv --version

Install Docker (needed for container deployment; skip it for a local trial)

# macOS
brew install --cask docker
# Ubuntu
sudo apt-get update
sudo apt-get install docker.io docker-compose-plugin
sudo systemctl enable docker
sudo systemctl start docker

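API key registration

Vision Agents depends on several third-party AI services, so the matching API keys have to be obtained in advance. You do not need all of them: register only for the services used by the plugin combination you choose.

Required API keys

After registering, Stream developers get 333,000 free minutes per month, and Gemini also offers a free usage tier, so nothing has to be paid at the start.

Optional API keys (depending on the plugins you use)

Recommended starter combination: Stream + Gemini + Deepgram + ElevenLabs. It covers the full STT → LLM → TTS chain, and all four services have a free tier.

Environment variable configuration

Create a .env file: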

# ========== Required ==========
# Stream API credentials
STREAM_API_KEY=your_stream_api_key_here
STREAM_API_SECRET=your_stream_api_secret_here
# Google Gemini API (LLM; setting either one of the two is enough)
GOOGLE_API_KEY=your_google_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
# ========== As needed ==========
# Deepgram (STT, speech recognition)
DEEPGRAM_API_KEY=your_deepgram_api_key_here
# ElevenLabs (TTS, speech synthesis)
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
ELEVENLABS_VOICE_ID=voice_id_to_use
# OpenAI
OPENAI_API_KEY=your_openai_api_key_here
# Anthropic
ANTHROPIC_API_KEY=your_anthropic_api_key_here
# Cartesia
CARTESIA_API_KEY=your_cartesia_api_key_here
# Roboflow
ROBOFLOW_API_KEY=your_roboflow_api_key_here
ROBOFLOW_API_URL=your_roboflow_api_url_here
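
A small pre-flight check can catch a missing key, or one still holding the template placeholder, before the agent starts. The following is only a sketch: the helper names and the rule that values starting with `your_` count as placeholders are assumptions taken from the template above, not part of Vision Agents.

```python
# Keys the template above marks as required. The Gemini key may be supplied
# under either name, so that pair is checked separately.
REQUIRED = ["STREAM_API_KEY", "STREAM_API_SECRET"]
GEMINI_ALTERNATIVES = ["GOOGLE_API_KEY", "GEMINI_API_KEY"]


def is_set(env, key):
    """True when the key holds a real value rather than a template placeholder."""
    value = env.get(key, "").strip()
    return bool(value) and not value.startswith("your_")


def find_unset_keys(env):
    """Return the names of required settings that are missing or placeholders."""
    problems = [key for key in REQUIRED if not is_set(env, key)]
    if not any(is_set(env, key) for key in GEMINI_ALTERNATIVES):
        problems.append(" or ".join(GEMINI_ALTERNATIVES))
    return problems
```

Calling `find_unset_keys(os.environ)` after `load_dotenv()` and stopping when the list is non-empty gives a clearer error than a failure deep inside a plugin.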

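Local installation

Option 1: install from PyPI (choose this for a quick trial)

# Create the project directory
mkdir my-vision-agent && cd my-vision-agent
# Initialize the project
uv init
# Install the core package
uv add vision-agents
# Install the plugins you need (pick as required)
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
# Or install plugins individually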
uv add vision-agents-plugins-gemini
uv add vision-agents-plugins-deepgram
uv add vision-agents-plugins-elevenlabs
uv add vision-agents-plugins-getstream

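Option 2: install from source (choose this to dig deeper or contribute code)

# Clone the repository
git clone https://github.com/GetStream/Vision-Agents.git
cd Vision-Agents
# Create the Python virtual environment
uv venv --python 3.12.11
# Install all dependencies (including every plugin and the dev tools)
uv sync --all-extras --dev
# Install the pre-commit hooks
pre-commit install
# Set up the environment variables
cp .env.example .env
# Edit .env and fill in your real API keys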

Verify the installation

# Run the simple example to verify the install
uv run examples/01_simple_agent_example/simple_agent_example.py run
# Run the tests (excluding integration tests)
uv run py.test -m "not integration" -n auto

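Running the example projects

Simple voice Agent (recommended starting point)

The most basic voice conversation Agent: Deepgram listens → Gemini thinks → ElevenLabs speaks.

cd examples/01_simple_agent_example
uv run simple_agent_example.py run

Once it is running, the browser demo UI opens automatically; speak into the microphone to talk to the AI.

The core code for writing your own Agent looks like this: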

import asyncio

from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

# Load the API keys from .env
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    # Gemini thinks, ElevenLabs speaks, Deepgram listens
    llm = gemini.LLM("gemini-2.5-flash-lite")
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My AI Assistant", id="agent"),
        instructions="You are a friendly voice assistant. Keep your answers short.",
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(eager_turn_detection=True),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # Greet the user, then stay in the call until it ends
        await agent.simple_response("Hello, how can I help you?")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

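Using a video file as input

uv run examples/01_simple_agent_example/simple_agent_example.py run \
    --video-track-override path/to/your/video.mp4

Golf coach (video AI example)

The AI golf coach mentioned in the previous article corresponds to this example, which combines YOLO pose detection with Gemini Live for real-time analysis:

cd examples/02_golf_coach_example
uv run golf_coach_example.py run

Phone + RAG example

This lets the Agent talk to people over a real phone call and retrieve information from a knowledge base (Twilio + TurboPuffer):

cd examples/03_phone_and_rag_example
# Twilio credentials must be configured
uv run phone_rag_example.py run

That completes the local development and trial workflow. If you only want to see what the framework can do, you can stop here.

The rest of the article covers production deployment: turning the Agent into a system that can serve external traffic.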

HTTP service mode

Vision Agents ships with a built-in FastAPI-based HTTP server, so no extra web framework code is needed: the serve command exposes an Agent as a service with a full REST API.

Starting the HTTP service

# Any example can be started as an HTTP service
uv run examples/01_simple_agent_example/simple_agent_example.py serve \
    --host 0.0.0.0 \
    --port 8000

API overview

After startup, the auto-generated Swagger documentation is available at http://localhost:8000/docs. The main endpoints:

Start a session:

curl -X POST http://localhost:8000/calls/my-call-123/sessions \
  -H "Content-Type: application/json" \
  -d '{"call_type": "default"}'

Stop a session:

curl -X DELETE http://localhost:8000/calls/my-call-123/sessions/{session_id}
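
The same two calls can be made from Python with only the standard library. This is a minimal sketch that reuses the paths from the curl commands above; the helper names are made up for illustration, and the shape of the response (including where the session ID appears) is not shown in this article, so check it in the Swagger page before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the serve command above


def start_session_request(call_id, call_type="default"):
    """Build the POST request that asks the agent to join a call."""
    body = json.dumps({"call_type": call_type}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/calls/{call_id}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_session_request(call_id, session_id):
    """Build the DELETE request that ends one session of a call."""
    return urllib.request.Request(
        f"{BASE_URL}/calls/{call_id}/sessions/{session_id}",
        method="DELETE",
    )


def send(request, timeout=10):
    """Send a prepared request and return (status code, raw response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

For example, `send(start_session_request("my-call-123"))` performs the same request as the first curl command.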

Adding authentication (mandatory in production)

The default API has no authentication, so add it before going live. Vision Agents supports access control through FastAPI's dependency injection:

from fastapi import Depends, Header, HTTPException
from vision_agents.core import Runner, AgentLauncher, ServeOptions


async def verify_api_key(x_api_key: str = Header(...)):
    # "your-secret-key" is a placeholder; load the real key from configuration
    if x_api_key != "your-secret-key":
        raise HTTPException(status_code=401, detail="Invalid API key")


async def can_start(call_id: str, x_api_key: str = Header(...)):
    # Reject the request unless the X-API-Key header is valid
    await verify_api_key(x_api_key)


# create_agent and join_call are the functions from the Agent example above
runner = Runner(
    AgentLauncher(create_agent=create_agent, join_call=join_call),
    serve_options=ServeOptions(
        cors_allow_origins=["https://yourdomain.com"],
        can_start_session=can_start,
        can_close_session=can_start,
    ),
)
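
In a real deployment the key is better read from the environment and compared in constant time, which avoids both a secret in source control and a timing side channel. The following is a sketch under stated assumptions: the variable name `AGENT_API_KEY` and both helper names are hypothetical, and only the standard library is used.

```python
import hmac
import os


def load_expected_key(env=os.environ):
    """Read the shared secret from the environment instead of hard-coding it."""
    key = env.get("AGENT_API_KEY", "")  # hypothetical variable name
    if not key:
        raise RuntimeError("AGENT_API_KEY is not set")
    return key


def api_key_matches(presented, expected):
    """Compare two keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Inside `verify_api_key`, the string comparison would then become `if not api_key_matches(x_api_key, load_expected_key())`.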

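Docker containerization

Once local verification passes, the next step is to package the Agent with Docker into a standard deployable image.

Project structure

Create the following file layout for your Agent:

my-agent/
├── agent.py            # Your Agent code
├── pyproject.toml      # Dependency configuration
├── Dockerfile          # CPU Docker image
├── Dockerfile.gpu      # GPU Docker image
├── docker-compose.yml  # Docker Compose configuration
├── .env                # Environment variables (do not commit to Git)
└── instructions.md     # Agent instructions (optional)

pyproject.toml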


[project]
name = "my-vision-agent"
version = "0.1.0"
requires-python = ">=3.13"
dependencies = [
  "python-dotenv>=1.0",
  "vision-agents>=0.2.8",
  "vision-agents-plugins-getstream",
  "vision-agents-plugins-gemini",
  "vision-agents-plugins-deepgram",
  "vision-agents-plugins-elevenlabs",
]

Dockerfile (CPU build, image about 150MB)

FROM python:3.13-slim
WORKDIR /app
# Install uv
RUN pip install uv
# Copy the project files (uv.lock is created by uv add or uv lock)
COPY pyproject.toml uv.lock agent.py ./
# Optional: keep this line only if instructions.md exists, otherwise the build fails
COPY instructions.md ./
# Expose the service port
EXPOSE 8080
# Suppress the hard-link warning
ENV UV_LINK_MODE=copy
# Install dependencies at startup, then run the Agent
CMD ["sh", "-c", "uv sync --frozen -v && uv run agent.py serve --host 0.0.0.0 --port 8080"]

Dockerfile.gpu (GPU build, image about 8GB, requires an NVIDIA GPU)

FROM pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock agent.py ./
EXPOSE 8080
ENV UV_LINK_MODE=copy
CMD ["sh", "-c", "uv sync --frozen -v && uv run agent.py serve --host 0.0.0.0 --port 8080"]

docker-compose.yml

version: "3.8"
services:
  agent:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    env_file:
      - .env
    restart: unless-stopped
    healthcheck:
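      # python:3.13-slim does not ship curl, so probe with the bundled Python instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)"]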
      interval: 30s
      timeout: 10s
      retries: 3

Build and run

# Build the image
docker compose build
# Start the service
docker compose up -d
# Follow the logs
docker compose logs -f
# Stop the service
docker compose down

Using Docker on its own:

# Build (targeting the amd64 platform, suitable for cloud servers)
docker buildx build --platform linux/amd64 -t my-vision-agent .
# Run
docker run -d \
  --name vision-agent \
  -p 8080:8080 \
  --env-file .env \
  my-vision-agent
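
Because the image installs its dependencies at startup (the `uv sync` step in CMD), the container can take a while to begin answering. A script that polls the health endpoint is one way to wait for it in CI or a deploy script. This is a minimal sketch using only the standard library; the function names and retry defaults are illustrative, and the `/health` path is the one used in the Compose file above.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True when the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check, attempts=10, delay=3.0, sleep=time.sleep):
    """Call check() until it succeeds; return the attempt number, or 0 on failure."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)  # back off before the next probe
    return 0
```

A typical call is `wait_until_healthy(lambda: is_healthy("http://localhost:8080/health"))`, treating a return value of 0 as a failed start.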

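Kubernetes production deployment

Docker solves packaging, but production usually also needs automatic restarts, elastic scaling, rolling updates, and secret management. Kubernetes takes care of those.

Nebius Cloud is used as the example below (it is what the official example uses); the steps are similar on other K8s platforms (AWS EKS, GCP GKE, Azure AKS, Alibaba Cloud ACK).

Create the Kubernetes cluster

# Install the Nebius CLI
curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
source ~/.zshrc
nebius profile create
# Look up the network information
nebius vpc subnet list
nebius config list | grep parent-id
# Create the cluster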
nebius mk8s cluster create \
  --parent-id mk8scluster-<your-id> \
  --name vision-agents \
  --control-plane-subnet-id <subnet-id> \
  --control-plane-version 1.32 \
  --control-plane-endpoints-public-endpoint

Add nodes

CPU nodes (testing / low cost, enough for most scenarios):

nebius mk8s node-group create \
  --parent-id <cluster-id> \
  --name cpu \
  --template-resources-platform cpu-d3 \
  --template-resources-preset 4vcpu-16gb \
  --template-boot-disk-size-gibibytes 64 \
  --template-service-account-id <service-account-id> \
  --fixed-node-count 1

GPU nodes (choose these when models such as YOLO or Whisper run locally; noticeably more expensive):

nebius mk8s node-group create \
  --parent-id <cluster-id> \
  --name gpu \
  --template-resources-platform gpu-h200-sxm \
  --template-resources-preset 1gpu-16vcpu-200gb \
  --template-boot-disk-size-gibibytes 300 \
  --template-service-account-id <service-account-id> \
  --template-metadata-labels nebius.com/gpu=true \
  --template-gpu-settings-drivers-preset cuda12 \
  --fixed-node-count 1

Get kubectl credentials

nebius mk8s cluster get-credentials --id <cluster-id> --external --force
kubectl get nodes  # Verify the connection

Push the image to the container registry

# Look up the registry ID
nebius registry list
# Tag and push
docker tag my-vision-agent cr.eu-west1.nebius.cloud/<registry-id>/vision-agent:latest
docker push cr.eu-west1.nebius.cloud/<registry-id>/vision-agent:latest

Create a Kubernetes Secret (holds the API keys)

kubectl create secret generic vision-agent-env --from-env-file=.env

Deploy with Helm

# Install Helm
brew install helm
# CPU deployment
helm upgrade --install vision-agent ./helm \
  --set image.repository="cr.eu-west1.nebius.cloud/<registry-id>/vision-agent" \
  --set image.tag=latest \
  --set image.pullPolicy=Always \
  --set gpu.enabled=false \
  --set secrets.existingSecret=vision-agent-env
# GPU deployment (assumes an image built from Dockerfile.gpu has been pushed under the gpu tag)
helm upgrade --install vision-agent ./helm \
  --set image.repository="cr.eu-west1.nebius.cloud/<registry-id>/vision-agent" \
  --set image.tag=gpu \
  --set gpu.enabled=true \
  --set secrets.existingSecret=vision-agent-env

Day-to-day operations

# View the logs
kubectl logs -l app.kubernetes.io/name=vision-agent -f --tail=100
# Restart after updating the secret
kubectl delete secret vision-agent-env
kubectl create secret generic vision-agent-env --from-env-file=.env
kubectl rollout restart deployment/vision-agent
# Pause the cluster (to save cost)
nebius mk8s node-group update --id <node-group-id> --fixed-node-count 0 --async
# Resume the cluster
nebius mk8s node-group update --id <node-group-id> --fixed-node-count 1 --async

Multi-node scaling

By default AgentLauncher keeps session information in memory, which only works on a single node. To run multiple instances behind a load balancer, session state has to be shared through Redis.

Start Redis

docker run -d --name redis -p 6379:6379 redis:7-alpine

Code configuration

from vision_agents.core import AgentLauncher, Runner, SessionRegistry, RedisSessionKVStore
registry = SessionRegistry(
    store=RedisSessionKVStore(url="redis://localhost:6379/0"),
    node_id="node-1",  # Unique identifier for each node
    ttl=30.0,           # Heartbeat TTL (seconds)
)
runner = Runner(
    AgentLauncher(
        create_agent=create_agent,
        join_call=join_call,
        registry=registry,
    ),
)

With this in place, any node can view and close sessions that were started on other nodes.
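
Since every replica runs the same image, the node ID cannot stay hard-coded as "node-1". One option is to derive it from the hostname, which inside a Kubernetes pod defaults to the pod name. The helpers below are a sketch: `AGENT_NODE_ID` and `REDIS_URL` are hypothetical variable names chosen for this example, not settings that Vision Agents reads on its own.

```python
import os
import socket


def node_id_from_env(env=os.environ):
    """Pick a per-instance identifier: an explicit override, else the hostname."""
    explicit = env.get("AGENT_NODE_ID")  # hypothetical override variable
    if explicit:
        return explicit
    # Inside a Kubernetes pod the hostname defaults to the pod name.
    return env.get("HOSTNAME") or socket.gethostname()


def redis_url_from_env(env=os.environ):
    """Read the Redis URL from the environment, defaulting to the local instance."""
    return env.get("REDIS_URL", "redis://localhost:6379/0")  # hypothetical variable
```

The registry above would then be built with `node_id=node_id_from_env()` and `RedisSessionKVStore(url=redis_url_from_env())`.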

Redis configuration parameters

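Monitoring and observability

Once the service is live, its health needs continuous attention: whether LLM response latency is normal, whether STT recognition is stable, whether token consumption is reasonable. Monitoring is what provides that feedback.

Vision Agents has built-in OpenTelemetry integration and supports Prometheus metrics collection and Jaeger distributed tracing.

Prometheus metrics

Install the dependencies:

uv add opentelemetry-exporter-prometheus prometheus-client

Code configuration: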


from opentelemetry import metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
resource = Resource.create({"service.name": "vision-agent"})
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
# Expose the metrics on port 9464
start_http_server(port=9464)

Metrics endpoint: http://localhost:9464/metrics

Built-in metrics include:

llm_latency_ms: LLM response latency

llm_input_tokens / llm_output_tokens: token usage

stt_latency_ms: speech recognition latency

tts_latency_ms: speech synthesis latency
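
For a quick look without a Prometheus server, the text served at /metrics can be parsed directly. The sketch below handles the common `name{labels} value` lines of the Prometheus text format. The sample text is made up for illustration: whether the latency metrics are exported as histograms, and under exactly these names, is an assumption, since exporters may adjust metric names.

```python
# Made-up sample in Prometheus text format, used only to illustrate the parser.
SAMPLE = """\
# HELP llm_latency_ms LLM response latency
# TYPE llm_latency_ms histogram
llm_latency_ms_bucket{le="500.0"} 3.0
llm_latency_ms_bucket{le="+Inf"} 4.0
llm_latency_ms_count 4.0
llm_latency_ms_sum 1800.0
"""


def parse_metrics(text):
    """Parse Prometheus text exposition into {sample name with labels: value}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "}" in line:
            end = line.rindex("}") + 1  # labels end at the last closing brace
            name, rest = line[:end], line[end:].split()
        else:
            name, *rest = line.split()
        samples[name] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples


def average(samples, metric):
    """Mean of a label-free histogram, computed from its _sum and _count series."""
    count = samples.get(metric + "_count", 0.0)
    return samples[metric + "_sum"] / count if count else None
```

With the sample above, `average(parse_metrics(SAMPLE), "llm_latency_ms")` divides 1800.0 by 4.0.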

Jaeger distributed tracing (visualizes the full call chain of each request)

Install the dependencies:

uv add opentelemetry-sdk opentelemetry-exporter-otlp

Start Jaeger:

docker run --rm -d \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 -p 4317:4317 -p 4318:4318 \
  jaegertracing/all-in-one:1.51

Code configuration:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
resource = Resource.create({"service.name": "vision-agent"})
tp = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
tp.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(tp)

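Jaeger UI: http://localhost:16686

Performance profiling (finding bottlenecks in the code)

from vision_agents.core.profiling import Profiler
agent = Agent(
    # ... other settings ...
    profiler=Profiler(output_path='./profile.html'),
)

After the run finishes, open profile.html to view the performance flame graph.

Frontend integration

With the backend Agent deployed, a frontend is still needed so that users can start video calls and interact with the Agent.

Vision Agents communicates with the frontend through Stream's client SDKs, which handle establishing and maintaining the WebRTC connection.

Supported frontend SDKs

Frontend integration flow

The frontend uses the Stream SDK to create or join a video call
The backend Agent is triggered through the HTTP API to join the same call
WebRTC establishes the audio and video connection automatically
The user interacts with the Agent in real time through the frontend

React frontend example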

npm install @stream-io/video-react-sdk


// Create a call with the Stream Video SDK
import { StreamVideoClient, StreamCall } from '@stream-io/video-react-sdk';
const client = new StreamVideoClient({
  apiKey: 'your_stream_api_key',
  user: { id: 'user-1', name: 'User' },
  token: 'user_token', // Obtained from your backend
});
const call = client.call('default', 'call-123');
await call.join({ create: true });
// Then call the backend API so the Agent joins
await fetch('http://your-server:8080/calls/call-123/sessions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ call_type: 'default' }),
});
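
The `user_token` in the example has to be issued by your backend, because signing it requires STREAM_API_SECRET. As far as the Stream documentation describes it, a user token is a JWT signed with HS256 whose payload carries `user_id`. The sketch below builds such a token with only the standard library to show what is involved; the function name and the one-hour lifetime are illustrative, and in production Stream's official server SDK is the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def create_user_token(api_secret: str, user_id: str, ttl_seconds: int = 3600, now=None) -> str:
    """Sign an HS256 JWT whose payload carries user_id and an expiry time."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"user_id": user_id, "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        api_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The `user_id` passed here must match the `user.id` given to `StreamVideoClient` ('user-1' in the example above).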

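Common problems and troubleshooting

Q1: Dependency conflicts during installation

# Make sure the correct Python version is used
uv venv --python 3.12.11
uv sync --all-extras --dev

Q2: High audio latency or noise

Check in this order:

Server region: deploying in US-East (the default region for most AI services) is recommended; crossing regions adds noticeable latency
Enable eager_turn_detection: deepgram.STT(eager_turn_detection=True) noticeably lowers response latency
Switch to Realtime mode: gemini.Realtime() or openai.Realtime() skips the separate STT→LLM→TTS three-stage pipeline and gives lower latency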
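
To see which of these changes is worth making, it helps to time each stage and look at the total, since in the three-stage pipeline the stages run one after another and their latencies add up. The helpers below are a generic sketch using only the standard library; they are not tied to any Vision Agents API, and the stage names are whatever you choose to pass in.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stages, name, clock=time.perf_counter):
    """Record how long the wrapped block took, in milliseconds, under stages[name]."""
    start = clock()
    try:
        yield
    finally:
        stages[name] = (clock() - start) * 1000.0


def summarize(stages):
    """Return (total latency, name of the slowest stage) for a sequential pipeline."""
    total = sum(stages.values())
    slowest = max(stages, key=stages.get)
    return total, slowest
```

Wrapping each call, for example `with timed(stages, "llm"): ...`, and then calling `summarize(stages)` shows where the time goes before and after a change such as moving regions.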

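Q3: The GPU Docker image builds very slowly

The GPU base image is about 8GB, so builds take a long time

Building an amd64 image on Apple Silicon is even slower (it requires emulation)

Building in CI/CD on an x86 machine is recommended

Q4: The Agent cannot join the call

Check that STREAM_API_KEY and STREAM_API_SECRET are correct

Make sure the Stream account has enough quota

Check that the network can reach Stream's WebRTC service

Q5: LLM responses are too slow and the conversation feels poor

Real-time voice is very sensitive to latency. Ways to optimize:

Pick a fast model: gemini-2.5-flash-lite or gpt-4o-mini rather than a heavyweight reasoning model
Trim the instructions: the shorter the instructions, the sooner the first token arrives
Use Realtime: gemini.Realtime() or openai.Realtime() gives the lowest end-to-end latency

Q6: How to debug

# Enable debug mode
uv run examples/01_simple_agent_example/simple_agent_example.py run --debug

Q7: Known limitations of video AI

Small text (such as a game score) is hard to recognize and easily leads to hallucinations

Long videos (>30 seconds) cause the model to lose context

Image size and frame rate need to be kept low to maintain performance

Video alone does not trigger a response from a Realtime model; audio or text must be sent to trigger one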
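
Keeping image size and frame rate low comes down to two small calculations: which frames to forward, and what size to scale them to. The functions below sketch that arithmetic in plain Python. They are illustrative helpers, independent of whatever frame-rate or resolution options the framework and its plugins expose.

```python
def frames_to_keep(source_fps, target_fps, frame_count):
    """Indices of the frames to forward when downsampling to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        return list(range(frame_count))  # nothing to drop
    step = source_fps / target_fps  # source frames per forwarded frame
    kept, next_pick = [], 0.0
    for index in range(frame_count):
        if index >= next_pick:
            kept.append(index)
            next_pick += step
    return kept


def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For one second of 30 fps video reduced to 5 fps, `frames_to_keep(30, 5, 30)` keeps every sixth frame, and `fit_within(1920, 1080, 640)` keeps the aspect ratio while shrinking the longer side to 640 pixels.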

Q8: Code quality checks

# Run ruff lint checks with automatic fixes
uv run ruff check --fix
# Run mypy type checking
uv run mypy --install-types --non-interactive -p vision_agents
# Run the full set of checks
uv run python dev.py check