
I have recently been studying temporal behavior analysis in computer vision and became very interested in architectures that fuse YOLOv8 with a Transformer. This post records my technical thinking and hands-on exploration along the way; I hope it sparks some discussion.
In computer vision, single-frame object detection is already fairly mature, but many real-world scenarios require understanding how a behavior unfolds over time.
Traditional single-frame detection cannot capture this temporal information, which is exactly where temporal modeling comes in.
As a representative object detector, YOLOv8 has several technical traits worth a closer look.
First, the detection head: unlike the anchor-based design of YOLOv3/v4, YOLOv8 adopts an anchor-free scheme:
```python
# Inspect the YOLOv8 detection head structure
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
print(model.model.model[-1])
```

The output shows that the detection head predicts the bounding box's center coordinates and width/height directly, rather than offsets relative to anchors. This design simplifies hyperparameter tuning, but it raises new challenges for small-object detection.
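To make the contrast concrete, here is a schematic comparison of the two decoding styles. This is an illustrative sketch only, with made-up variable names; it is not YOLOv8's actual decode path (which, among other things, involves distribution focal loss).

```python
import math

# Anchor-based: the network regresses offsets relative to a predefined anchor box
def decode_anchor_based(pred, anchor):
    """pred = (tx, ty, tw, th) offsets; anchor = (ax, ay, aw, ah)."""
    tx, ty, tw, th = pred
    ax, ay, aw, ah = anchor
    return ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th)

# Anchor-free: the network predicts the box directly from a grid location
def decode_anchor_free(pred, grid_xy, stride):
    """pred = (dx, dy, w, h), expressed in units of the grid stride."""
    dx, dy, w, h = pred
    gx, gy = grid_xy
    return (gx + dx) * stride, (gy + dy) * stride, w * stride, h * stride
```

The anchor-free version has no anchor shapes to tune, which is where the simpler hyperparameter story comes from.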
Second, feature fusion: YOLOv8 uses PANet (Path Aggregation Network) to fuse features across scales:
```python
# Visualize feature maps at different scales
import torch
import cv2

def get_feature_maps(model, img_path):
    img = cv2.imread(img_path)            # note: cv2 loads images as BGR
    img = cv2.resize(img, (640, 640))     # side length must be divisible by the model stride
    img_tensor = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0

    # Collect intermediate outputs during the forward pass
    features = []
    def hook_fn(module, input, output):
        features.append(output)

    # Register forward hooks on the model's layers
    hooks = []
    for layer in model.model.model:
        if hasattr(layer, 'f'):
            hooks.append(layer.register_forward_hook(hook_fn))

    with torch.no_grad():
        _ = model(img_tensor)

    # Remove the hooks so they don't fire on later calls
    for hook in hooks:
        hook.remove()

    return features
```

Experiments showed that how the high-level semantic features are fused with the low-level detail features has a large effect on detection quality.
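A quick usage sketch (the image path is a placeholder): printing the collected shapes makes the multi-scale pyramid visible.

```python
# Hypothetical usage: inspect the scales produced by the backbone/neck
model = YOLO('yolov8n.pt')
features = get_feature_maps(model, 'sample.jpg')  # placeholder path
for i, feat in enumerate(features):
    if isinstance(feat, torch.Tensor):            # some layers emit lists, skip those
        print(i, tuple(feat.shape))               # e.g. (1, C, 80, 80), (1, C, 40, 40), ...
```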
The Transformer was originally designed for NLP, where it processes discrete token sequences. For visual temporal analysis, we need to attach positional information to each frame:
```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```
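A quick shape check on the encoding (the dimensions here are chosen arbitrarily):

```python
# Sanity check: positional encoding preserves the input shape
pos_enc = PositionalEncoding(d_model=512)
frames = torch.randn(2, 16, 512)   # [batch, seq_len, feature_dim]
print(pos_enc(frames).shape)       # torch.Size([2, 16, 512])
```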
With positional information attached, a stack of Transformer encoder layers models the frame sequence, and the representation of the last time step feeds a small classification head:

```python
class TemporalTransformer(nn.Module):
    def __init__(self, input_dim=512, nhead=8, num_layers=3, dropout=0.1):
        super(TemporalTransformer, self).__init__()

        self.pos_encoder = PositionalEncoding(input_dim, dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=input_dim,
            nhead=nhead,
            dim_feedforward=2048,
            dropout=dropout,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers
        )

        self.classifier = nn.Linear(input_dim, 2)  # binary classification

    def forward(self, x):
        # x: [batch_size, seq_len, input_dim]
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)
        x = x[:, -1, :]  # take the last time step
        output = self.classifier(x)
        return output
```
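A smoke test with random data (dimensions match the defaults above):

```python
# Smoke test: 4 sequences of 16 frames, 512-dim features each
temporal_model = TemporalTransformer(input_dim=512)
seq = torch.randn(4, 16, 512)
logits = temporal_model(seq)
print(logits.shape)  # torch.Size([4, 2])
```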
The Transformer consumes per-object sequences, so detections must first be associated across frames. For my experiments, a minimal IoU-based tracker was enough:

```python
import time
from collections import deque

import numpy as np

class ObjectTracker:
    def __init__(self, max_history=30):
        self.tracks = {}
        self.track_id_counter = 0
        self.max_history = max_history

    def update(self, detections, frame_id):
        """
        detections: list of {'bbox': [x1, y1, x2, y2], 'class': int, 'conf': float}
        """
        new_tracks = {}

        for det in detections:
            bbox = det['bbox']
            matched = False

            # Simple greedy IoU matching against existing tracks
            for track_id, track in self.tracks.items():
                if track_id not in new_tracks and self._iou(bbox, track['bbox']) > 0.5:
                    track['bbox'] = bbox  # keep the track's box current for the next match
                    track['history'].append({
                        'bbox': bbox,
                        'frame_id': frame_id,
                        'timestamp': time.time()
                    })
                    new_tracks[track_id] = track
                    matched = True
                    break

            # Unmatched detection: start a new track
            if not matched:
                self.track_id_counter += 1
                new_tracks[self.track_id_counter] = {
                    'bbox': bbox,
                    'class': det['class'],
                    'history': deque([{
                        'bbox': bbox,
                        'frame_id': frame_id,
                        'timestamp': time.time()
                    }], maxlen=self.max_history)
                }

        self.tracks = new_tracks
        return self.tracks

    def _iou(self, box1, box2):
        x1 = max(box1[0], box2[0])
        y1 = max(box1[1], box2[1])
        x2 = min(box1[2], box2[2])
        y2 = min(box1[3], box2[3])

        intersection = max(0, x2 - x1) * max(0, y2 - y1)
        area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
        area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
        union = area1 + area2 - intersection

        return intersection / union if union > 0 else 0
```
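A two-frame usage sketch (the detections are hand-made for illustration):

```python
# Hypothetical detections over two consecutive frames
tracker = ObjectTracker(max_history=30)
tracker.update([{'bbox': [100, 100, 200, 200], 'class': 0, 'conf': 0.9}], frame_id=0)
tracks = tracker.update([{'bbox': [105, 102, 205, 203], 'class': 0, 'conf': 0.9}], frame_id=1)
for tid, tr in tracks.items():
    print(tid, len(tr['history']))  # same track ID, history length 2
```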
2 """
3 从轨迹历史构建时序特征
4 """
5 temporal_sequences = []
6 track_ids = []
7
8 for track_id, track in tracks.items():
9 history = list(track['history'])
10 if len(history) < seq_len:
11 continue
12
13 # 提取最近seq_len帧的特征
14 sequence = []
15 for i in range(-seq_len, 0):
16 bbox = history[i]['bbox']
17 width = bbox[2] - bbox[0]
18 height = bbox[3] - bbox[1]
19 area = width * height
20
21 # 计算运动特征
22 if i > -seq_len:
23 prev_bbox = history[i-1]['bbox']
24 dx = (bbox[0] + bbox[2]) / 2 - (prev_bbox[0] + prev_bbox[2]) / 2
25 dy = (bbox[1] + bbox[3]) / 2 - (prev_bbox[1] + prev_bbox[3]) / 2
26 velocity = np.sqrt(dx**2 + dy**2)
27 else:
28 velocity = 0
29
30 # 特征向量
31 feature = [width, height, area, velocity]
32 sequence.append(feature)
33
34 temporal_sequences.append(sequence)
35 track_ids.append(track_id)
36
37 return np.array(temporal_sequences), track_ids我在自建的小规模数据集上进行了对比实验:
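Tying the pieces together, here is a hypothetical per-frame loop. `detect_frame` is a stand-in for running YOLOv8 and converting its results into the dict format used above, and note that these 4-dimensional geometric features imply `TemporalTransformer(input_dim=4, nhead=2)` or a linear projection up to the model width.

```python
# Hypothetical glue code: detect -> track -> build sequences -> classify
def process_video(frames, detect_frame, tracker, temporal_model, seq_len=16):
    for frame_id, frame in enumerate(frames):
        detections = detect_frame(frame)              # user-supplied YOLOv8 wrapper
        tracks = tracker.update(detections, frame_id)
        sequences, track_ids = build_temporal_features(tracks, seq_len)
        if len(track_ids) == 0:
            continue                                  # no track is long enough yet
        x = torch.as_tensor(sequences, dtype=torch.float32)  # [N, seq_len, 4]
        with torch.no_grad():
            preds = temporal_model(x).argmax(dim=1)
        yield frame_id, dict(zip(track_ids, preds.tolist()))
```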
I ran comparison experiments on a small dataset I built myself:

```python
# Compare the performance of the different methods
# (load_model, evaluate and test_dataset are project-specific, not shown here)
methods = ['YOLOv8_only', 'YOLOv8+LSTM', 'YOLOv8+Transformer']
results = {}

for method in methods:
    # Load the corresponding model
    model = load_model(method)

    # Evaluate on the held-out test set
    metrics = evaluate(model, test_dataset)
    results[method] = metrics

    print(f"{method}: Acc={metrics['acc']:.3f}, F1={metrics['f1']:.3f}")
```

On my test set (personal experimental data):
| Method | Accuracy | F1 score | Inference latency |
|---|---|---|---|
| YOLOv8 | 0.721 | 0.683 | 45 ms |
| YOLOv8+LSTM | 0.785 | 0.752 | 68 ms |
| YOLOv8+Transformer | 0.812 | 0.789 | 72 ms |
As the numbers show, adding temporal modeling brings a clear performance gain, and the Transformer has a slight accuracy edge over the LSTM at a comparable latency cost.
To verify the contribution of each component, I ran an ablation study:
```python
# Ablation configurations
# (build_model, evaluate and test_dataset are project-specific, not shown here)
ablation_configs = [
    {'use_positional_encoding': False, 'use_transformer': False},
    {'use_positional_encoding': True, 'use_transformer': False},
    {'use_positional_encoding': True, 'use_transformer': True},
]

for config in ablation_configs:
    model = build_model(config)
    metrics = evaluate(model, test_dataset)
    print(f"Config {config}: Acc={metrics['acc']:.3f}")
```

The results indicate that both the positional encoding and the Transformer encoder contribute to performance (personal experimental data).
Self-attention scales as O(n²) in sequence length, so long sequences get expensive. I tried a couple of mitigations:
```python
# Approach 1: local (windowed) attention
class LocalAttention(nn.Module):
    def __init__(self, window_size=8):
        super().__init__()
        self.window_size = window_size

    def forward(self, x):
        # Attend only within a fixed window around each position.
        # Plain dot-product attention without learned projections,
        # kept deliberately simple for illustration.
        batch_size, seq_len, dim = x.shape
        output = torch.zeros_like(x)

        for i in range(seq_len):
            start = max(0, i - self.window_size // 2)
            end = min(seq_len, i + self.window_size // 2 + 1)
            window = x[:, start:end, :]                            # [B, W, D]
            query = x[:, i:i + 1, :]                               # [B, 1, D]
            scores = query @ window.transpose(1, 2) / dim ** 0.5   # [B, 1, W]
            weights = scores.softmax(dim=-1)
            output[:, i, :] = (weights @ window).squeeze(1)

        return output

# Approach 2: hierarchical processing
class HierarchicalTransformer(nn.Module):
    def __init__(self, input_dim, nhead):
        super().__init__()
        # Process short windows first, then aggregate globally
        self.local_transformer = nn.TransformerEncoderLayer(
            d_model=input_dim, nhead=nhead, batch_first=True)
        self.global_transformer = nn.TransformerEncoderLayer(
            d_model=input_dim, nhead=nhead, batch_first=True)
```
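A quick shape check of the local-attention sketch:

```python
# Shape check: local attention preserves [batch, seq_len, dim]
local_attn = LocalAttention(window_size=8)
x = torch.randn(2, 64, 256)
print(local_attn(x).shape)  # torch.Size([2, 64, 256])
```

Note that the Python loop in `LocalAttention` is itself slow; it exists to make the windowing explicit. A production version would vectorize it, for example by passing a band-shaped attention mask to standard multi-head attention.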
Temporal behavior data often shows severe class imbalance. I tried the following remedies:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Focal Loss: down-weight easy examples so training focuses on hard ones
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs, targets):
        BCE_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
        pt = torch.exp(-BCE_loss)
        F_loss = self.alpha * (1 - pt) ** self.gamma * BCE_loss
        return F_loss.mean()

# Class weighting: give the minority class a larger weight
class_weights = torch.tensor([1.0, 5.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)
```
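A minimal check of the focal loss on random logits (binary targets as floats, as `binary_cross_entropy_with_logits` expects):

```python
# Sanity check on random data
focal = FocalLoss(alpha=0.25, gamma=2.0)
logits = torch.randn(8)                     # raw scores for 8 samples
targets = torch.randint(0, 2, (8,)).float()
print(focal(logits, targets).item())
```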
This stretch of study and practice left me with several takeaways, and I hope these technical explorations and practical notes are useful to others. Feel free to discuss in the comments!
Original statement: This article was published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.
For infringement concerns, contact cloudcommunity@tencent.com for removal.