Training Your Own LLM from Scratch (3)

golangLeetcode

Published 2026-03-18 18:41:26


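The core of the Transformer architecture consists of the input/output encoding (embedding) layers, multi-head attention, and the feed-forward network. The previous articles covered encoding and the attention mechanism; this article introduces the feed-forward network, which ties the two together into a complete GPT model. A feed-forward neural network (FNN) is the most basic neural network structure: data flows from the input layer through the hidden layers to the output layer, with no feedback loops along the way. It maps between layers of different dimensions and fits the mapping from input to output. Below, we hide the attention mechanism and other complex components behind placeholders and build a skeleton GPT model with no real content. The configuration of a GPT model looks like this: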

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}
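To get a feel for the scale these settings imply, the sizes of the token embedding table, the position embedding table, and a bias-free output projection can be worked out directly from the configuration. This is only a sketch of the arithmetic; the variable names are illustrative, and the Transformer blocks themselves are left out of the count.

```python
# Parameter counts implied by the configuration values above.
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}

# Token embedding table: one emb_dim-sized vector per vocabulary entry
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
# Position embedding table: one emb_dim-sized vector per position
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]
# Output projection from emb_dim to vocab_size, assuming no bias term
out_head_params = cfg["emb_dim"] * cfg["vocab_size"]

print(f"{tok_emb_params:,}")   # 38,597,376
print(f"{pos_emb_params:,}")   # 786,432
print(f"{out_head_params:,}")  # 38,597,376
print(f"{tok_emb_params + pos_emb_params + out_head_params:,}")  # 77,981,184
```

Even before any attention or feed-forward weights are added, the two vocabulary-sized matrices dominate the count.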

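The model builds its neural network from these parameters. First comes the embedding layer, which adds the token embedding to the positional embedding and then applies dropout to reduce overfitting. The result then passes through the Transformer layers, a stack of typically 6 to 12 blocks (12 in the GPT-2 configuration used here), each containing self-attention, a feed-forward network, and residual connections. Next comes layer normalization (not regularization), which normalizes the final features to zero mean and unit variance, speeding up convergence and improving training stability. Finally, a linear output head maps the final features from dimension emb_dim to the vocabulary size vocab_size, producing one score (logit) per vocabulary entry; applying softmax to these logits yields a probability distribution over the next token. With these probabilities we can generate text.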

import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        # Use a placeholder for LayerNorm
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x


class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x
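The placeholders above return their input unchanged. As a rough sketch of what two of the real components could look like, the block below normalizes each token vector to zero mean and unit variance and then passes it through a small feed-forward network. The class names `LayerNormSketch` and `FeedForwardSketch` are hypothetical, and the 4x hidden expansion and GELU activation are assumptions following common GPT-2 style practice, not something this article specifies.

```python
import torch
import torch.nn as nn


class LayerNormSketch(nn.Module):
    """Normalizes the last dimension to zero mean and unit variance."""

    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Learnable scale and shift, initialized to the identity transform
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * x_norm + self.shift


class FeedForwardSketch(nn.Module):
    """Expands each token vector, applies a nonlinearity, projects back."""

    def __init__(self, emb_dim, expansion=4):  # expansion=4 is an assumption
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, expansion * emb_dim),
            nn.GELU(),
            nn.Linear(expansion * emb_dim, emb_dim),
        )

    def forward(self, x):
        return self.layers(x)


x = torch.randn(2, 4, 768)  # (batch, n_tokens, emb_dim)
normed = LayerNormSketch(768)(x)
out = FeedForwardSketch(768)(normed)
print(normed.mean(dim=-1).abs().max())  # close to 0
print(out.shape)  # torch.Size([2, 4, 768])
```

Because the feed-forward output has the same shape as its input, blocks built this way can be stacked freely, which is what allows `nn.Sequential` to chain `n_layers` of them.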

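With the model's output logits in hand, we can use the model to generate text. After the model computes the logits, we take those for the last token position, normalize them with softmax, and use argmax to find the index with the highest probability. Looking that index up in the vocabulary gives the output token: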

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]  

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append the selected index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx
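A single greedy decoding step can be traced by hand on a tiny example. The sketch below uses plain Python lists and a made-up 4-entry vocabulary (the logit values are hypothetical) to show softmax, argmax, and the context cropping used in the function above. It also illustrates that softmax preserves ordering, so argmax over the raw logits picks the same index.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 0.5, 2.0]  # hypothetical scores for a 4-token vocabulary
probas = softmax(logits)

# Index of the largest entry, computed from probabilities and from logits
next_from_probas = max(range(len(probas)), key=probas.__getitem__)
next_from_logits = max(range(len(logits)), key=logits.__getitem__)

print([round(p, 3) for p in probas])
print(next_from_probas, next_from_logits)  # 1 1

# Context cropping: keep only the last context_size token IDs
idx = [10, 11, 12, 13, 14, 15, 16]
context_size = 5
print(idx[-context_size:])  # [12, 13, 14, 15, 16]
```

The softmax step is therefore not strictly needed for greedy selection; it matters once probabilities are used for sampling.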

Now we use the model to generate tokens:

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

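# Instantiate the model defined above; its weights are randomly initialized
model = GPTModel(GPT_CONFIG_124M)
model.eval()  # disable dropout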

out = generate_text_simple(
    model=model,
    idx=encoded_tensor, 
    max_new_tokens=6, 
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)

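The output is as follows (the generated token IDs depend on the randomly initialized weights, so they vary between runs):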

encoded: [15496, 11, 314, 716]

encoded_tensor.shape: torch.Size([1, 4])

Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])

Output length: 10

Hello, I am Featureiman Byeswickattribute argue

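This completes the path from building the model to predicting text. One problem remains unsolved, however: how to train the model and obtain its parameters. Until then the weights are untrained, which is why the generated text above is gibberish. We will work through training in the next chapter.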

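This article is part of the Tencent Cloud self-media syndication program and is shared from the WeChat official account golang算法架构leetcode技术php.

Originally published: 2026-01-16. In case of infringement, contact cloudcommunity@tencent.com for removal.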
