GPT-2 Clone From Scratch
A reimplementation of OpenAI's GPT-2 in PyTorch.
Image Gallery
GPT-2 in Action
This is my implementation of GPT-2 (loaded with the official weights from Hugging Face) running inference on the prefix "Hello, I'm a language model,". It produces intelligible output, though not fully coherent, since this is a relatively small model.
Calculating Self-Attention via Matrix Multiplication
This is some rough work I did to understand how matrix operations can implement functionality that would otherwise be realized through loops. By using matrix multiplication on GPUs (or SIMD processors) we can significantly speed up execution. I then used this when implementing masked self-attention in the transformer heads; a minimal sketch of the idea follows.
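The sketch below is illustrative, not code from the repository: it compares a nested-loop computation of causally masked attention scores with the equivalent single masked matrix multiplication.

```python
import torch
import torch.nn.functional as F

T, d = 4, 8                     # toy sequence length and head dimension
q = torch.randn(T, d)
k = torch.randn(T, d)

# Loop version: each query position attends only to positions <= its own index.
scores_loop = torch.full((T, T), float("-inf"))
for i in range(T):
    for j in range(i + 1):
        scores_loop[i, j] = (q[i] @ k[j]) / d**0.5

# Matrix version: one matmul plus a lower-triangular mask does the same work,
# and maps directly onto GPU/SIMD hardware.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores_mat = (q @ k.T) / d**0.5
scores_mat = scores_mat.masked_fill(~causal, float("-inf"))

assert torch.allclose(F.softmax(scores_loop, dim=-1), F.softmax(scores_mat, dim=-1))
```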
Position Embedding Values for Random Columns
This graph shows the position embedding values for four randomly chosen columns of GPT-2's position embedding matrix, across all 1024 positions (the model's context length). The values show clear patterns, though they are difficult to interpret.
Visualization of a Random Weight Matrix
This staticky-looking image is a visualization of one of GPT-2's weight matrices: the first 300 rows and columns of the first self-attention head's weight matrix. Lighter shades correspond to higher values. Some patterns are visible (certain regions are smoother than others), but, as with the position embedding graph in the previous image, the information is difficult to decipher.
Project Overview
A PyTorch GPT-2 clone based on Andrej Karpathy's material. It implements both manual scaled dot-product attention and PyTorch flash attention (scaled_dot_product_attention) with causal masking, and uses GELU activations, pre-norm LayerNorm, residual connections, and tied input/output embeddings.
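As a rough sketch of that layout (not the repository's exact code), a pre-norm block with residual connections and a GELU MLP could look like the following; the class and argument names are illustrative, and `nn.MultiheadAttention` stands in for the project's own causal attention module.

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: x + attn(ln(x)), then x + mlp(ln(x))."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # GPT-2 uses a 4x expansion in the MLP
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        h = self.ln_1(x)                                    # LayerNorm before the sublayer (pre-norm)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.ln_2(x))                      # residual around the MLP
        return x
```

Tying the input and output embeddings then amounts to pointing the output head at the token-embedding matrix (e.g. `lm_head.weight = wte.weight`).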
Attention
Two backends: manual attention with an explicit QK^T/√d and softmax, and flash attention via PyTorch for speed and memory efficiency. The autoregressive mask is a lower-triangular buffer of shape (1, 1, T, T).
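A minimal sketch of the two backends, assuming (B, n_head, T, head_dim) tensors and a registered lower-triangular bias buffer; the function and argument names here are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v, bias, use_flash=True):
    """q, k, v: (B, n_head, T, head_dim); bias: lower-triangular (1, 1, T, T) buffer."""
    if use_flash:
        # Fused kernel; the causal mask is applied internally.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    T = q.size(-2)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))   # QK^T / sqrt(d)
    att = att.masked_fill(bias[:, :, :T, :T] == 0, float("-inf"))
    att = F.softmax(att, dim=-1)
    return att @ v                                            # (B, n_head, T, head_dim)
```

The mask buffer itself would be built once per module, e.g. `torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)`.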
Initialization and Optimizer
Weights follow the GPT-2 initialization (Normal(0, 0.02); position embeddings use std 0.01). Residual branches scale down by 1/√(2·n_layer) via a flag on residual-adjacent projections. AdamW uses decoupled weight decay on weight matrices only; biases and LayerNorm weights are excluded.
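A sketch of how this setup is commonly wired up; the flag name, hyperparameter values (weight decay, learning rate, betas), and the separate handling of the position-embedding std are assumptions, not the repository's exact code.

```python
import math
import torch
import torch.nn as nn

def init_weights(module, n_layer: int):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Hypothetical flag set on projections that feed the residual stream
        # (attention/MLP output projections).
        if getattr(module, "SCALE_RESIDUAL", False):
            std /= math.sqrt(2 * n_layer)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        # Token embeddings use std 0.02; the position embedding table
        # would be re-initialized with std 0.01 after this pass.
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

def configure_optimizer(model, weight_decay=0.1, lr=6e-4):
    # Decay only weight matrices (dim >= 2); biases and LayerNorm gains get no decay.
    decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
    groups = [{"params": decay, "weight_decay": weight_decay},
              {"params": no_decay, "weight_decay": 0.0}]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95))
```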
Pretraining and Data
Supports token and position embeddings for a vocabulary of 50,257 tokens and a context length of 1024. Trains on the FineWeb-Edu 10B dataset, using gradient checkpointing and optional DDP for multi-GPU training. Checkpoints store both optimizer and model state so training can be paused and resumed.
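Pausing and resuming could be handled with a checkpoint holding both state dicts plus a step counter; the dictionary keys and function names below are illustrative, not taken from the repository.

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    """Everything needed to resume: model weights, optimizer state, and the step counter."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                 # continue training from here
```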
HF Compatibility
Includes a loader to map Hugging Face GPT-2 weights into this model, transposing QKV and MLP projection matrices as needed and skipping masked bias buffers.
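The loader likely follows the usual pattern for Hugging Face's GPT-2 checkpoint, which stores its linear layers as Conv1D and therefore transposed relative to nn.Linear. The sketch below assumes the clone's parameter names mirror Hugging Face's; it is not the repository's exact code.

```python
import torch
from transformers import GPT2LMHeadModel

@torch.no_grad()
def load_hf_weights(model, model_type="gpt2"):
    hf_sd = GPT2LMHeadModel.from_pretrained(model_type).state_dict()
    sd = model.state_dict()
    # Conv1D layers in the HF checkpoint: their weight matrices need transposing.
    transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
                  "mlp.c_fc.weight", "mlp.c_proj.weight")
    for key, value in hf_sd.items():
        if key.endswith(".attn.masked_bias") or key.endswith(".attn.bias"):
            continue                     # mask buffers, not learnable weights
        if any(key.endswith(suffix) for suffix in transposed):
            sd[key].copy_(value.t())
        else:
            sd[key].copy_(value)
```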
Sampling & Eval
Greedy or temperature sampling over logits of shape (B, T, V). Perplexity is evaluated via the average negative log-likelihood. Flash attention can be toggled per block for ablations.
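A sketch of both routines, assuming the model's forward pass returns logits of shape (B, T, V); the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0):
    """Greedy when temperature == 0, otherwise temperature sampling."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]                  # (B, V): last position of (B, T, V)
        if temperature == 0.0:
            next_id = logits.argmax(dim=-1, keepdim=True)
        else:
            probs = F.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

@torch.no_grad()
def perplexity(model, idx):
    """exp of the mean per-token negative log-likelihood."""
    logits = model(idx[:, :-1])                        # predict token t+1 from tokens up to t
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          idx[:, 1:].reshape(-1))
    return nll.exp()
```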
Tokenizer Reimplementation
To better understand GPT-2's preprocessing pipeline, I reimplemented the byte pair encoding (BPE) tokenizer from scratch in a separate repository: llm-tokenizer. It reproduces GPT-2's 50,257-token vocabulary, byte-level encoding, and BPE merges, works independently of Hugging Face, and produces GPT-2's exact token IDs for any input text.
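The heart of such a tokenizer is the greedy merge loop. The sketch below shows only that loop and omits the byte-level pre-tokenization and vocabulary lookup; the names are illustrative, not taken from llm-tokenizer.

```python
def bpe_merge(symbols, merge_ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest (earliest-learned) rank.

    symbols:     list of byte-level strings for one pre-tokenized chunk.
    merge_ranks: dict mapping (left, right) pairs to their merge priority.
    """
    parts = list(symbols)
    while len(parts) > 1:
        ranked = [(merge_ranks.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(parts, parts[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):
            break                        # no learned merge applies anymore
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts                         # each element then maps to a token ID
```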