
Project Overview

A PyTorch GPT-2 clone based on Andrej Karpathy's material. Implements both manual scaled dot-product attention and PyTorch flash attention (scaled_dot_product_attention) with causal masking. Uses GELU activations, pre-norm LayerNorm, residual connections, and tied input/output embeddings.
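A minimal sketch of one pre-norm block under these conventions; the class and argument names here are illustrative rather than taken from the repository, and CausalSelfAttention refers to the attention sketch in the next section.

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: x + attn(ln(x)), then x + mlp(ln(x))."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)  # sketched in the Attention section below
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(approximate="tanh"),   # GPT-2 uses the tanh approximation of GELU
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the MLP
        return x
```

Weight tying itself is a one-liner at model construction (e.g. lm_head.weight = wte.weight), so the input embedding matrix doubles as the output projection.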

Attention

Two attention backends: manual attention with an explicit QK^T/√d and softmax, and flash attention via PyTorch for speed and memory efficiency. The autoregressive mask is a lower-triangular buffer of shape (1, 1, T, T).
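A sketch of both backends in a single module; the constructor arguments and the use_flash toggle are assumptions for illustration, not the repository's exact interface.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size=1024, use_flash=True):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.use_flash = use_flash
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused QKV projection
        self.c_proj = nn.Linear(n_embd, n_embd)
        # lower-triangular causal mask stored as a (1, 1, T, T) buffer
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("bias", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        if self.use_flash:
            # fused kernel; applies the causal mask internally
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # manual path: softmax(QK^T / sqrt(d)) with the causal mask applied explicitly
            att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
            y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```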

Initialization and Optimizer

Weights follow the GPT-2 initialization (Normal(0, 0.02); position embeddings use std 0.01). Projections that feed back into the residual stream are initialized with std scaled down by 1/√(2·n_layer), selected via a flag on those layers. AdamW applies decoupled weight decay to weight matrices only; biases and LayerNorm weights are excluded.
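A hedged sketch of how the init and optimizer setup might look; the flag name, helper names, and hyperparameters are placeholders, not the repository's actual identifiers.

```python
import math
import torch
import torch.nn as nn

def init_weights(module, n_layer):
    """GPT-2 style init; residual-feeding projections get a scaled-down std."""
    if isinstance(module, nn.Linear):
        std = 0.02
        if getattr(module, "RESIDUAL_SCALE_FLAG", False):  # hypothetical flag set on c_proj layers
            std *= 1.0 / math.sqrt(2 * n_layer)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        # token embeddings use 0.02; the position table would be re-initialized with std 0.01 separately
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

def configure_adamw(model, weight_decay=0.1, lr=6e-4):
    """Decoupled weight decay on 2-D weight matrices only; 1-D params (biases, LayerNorm) excluded."""
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95))
```

Applied with model.apply(lambda m: init_weights(m, n_layer)); grouping by parameter dimensionality is a common shortcut for the matrices-only decay rule.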

Pretraining and Data

Supports token and position embeddings for a vocabulary of 50,257 tokens and a context length of 1024. Trains on FineWeb-Edu 10B. Uses gradient checkpointing and optional DistributedDataParallel (DDP) for multi-GPU training. Checkpoints store both model and optimizer state so training can be paused and resumed.
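A minimal save/resume sketch under the assumption that a checkpoint pairs model and optimizer state with the current step; the function names and file layout are illustrative.

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    # persist model and optimizer state together so training can resume exactly
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

def load_checkpoint(path, model, optimizer, device="cuda"):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step
```

Under DDP, the usual pattern is to save model.module.state_dict() from rank 0 only, so the checkpoint stays free of the DDP wrapper prefix.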

HF Compatibility

Includes a loader to map Hugging Face GPT-2 weights into this model, transposing QKV and MLP projection matrices as needed and skipping masked bias buffers.
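A rough sketch of such a loader, assuming the local state_dict uses the same key names as Hugging Face's GPT2LMHeadModel but stores the projections as nn.Linear (hence the transpose of the Conv1D weights).

```python
import torch
from transformers import GPT2LMHeadModel

@torch.no_grad()
def load_hf_weights(model, model_name="gpt2"):
    """Copy Hugging Face GPT-2 weights into the local model."""
    hf_sd = GPT2LMHeadModel.from_pretrained(model_name).state_dict()
    sd = model.state_dict()
    # HF stores these as Conv1D, so the weight matrices are transposed relative to nn.Linear
    transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
                  "mlp.c_fc.weight", "mlp.c_proj.weight")
    for key, value in hf_sd.items():
        if key.endswith(".attn.masked_bias") or key.endswith(".attn.bias"):
            continue  # causal-mask buffers, not learned parameters
        if key.endswith(transposed):
            sd[key].copy_(value.t())
        else:
            sd[key].copy_(value)
    return model
```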

Sampling & Eval

Greedy decoding or temperature-scaled sampling over logits of shape (B, T, V). Perplexity is evaluated as the exponentiated average negative log-likelihood. Flash attention can be toggled per block for ablations.
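A hedged sketch of both routines, assuming model(idx) returns logits of shape (B, T, V); here temperature=0 stands in for greedy decoding.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, idx, max_new_tokens, temperature=1.0):
    """Greedy (temperature=0) or temperature-scaled sampling, one token at a time."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]  # logits at the last position, shape (B, V)
        if temperature == 0.0:
            next_id = logits.argmax(dim=-1, keepdim=True)
        else:
            probs = F.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

@torch.no_grad()
def perplexity(model, idx, targets):
    """Perplexity = exp(mean negative log-likelihood over the target tokens)."""
    logits = model(idx)  # (B, T, V)
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return nll.exp()
```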

Tokenizer Reimplementation

To better understand GPT-2's preprocessing pipeline, I reimplemented the byte pair encoding (BPE) tokenizer from scratch in a separate repository: llm-tokenizer. It reproduces GPT-2's 50,257-token vocabulary, byte-level encoding, and BPE merges. The tokenizer works independently of Hugging Face and outputs exact token IDs for any input text.
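An illustrative fragment of the greedy merge loop at the heart of such a tokenizer; the data structures (byte-string parts, a merge_ranks table mapping adjacent pairs to their learned rank) are assumptions for this sketch, not llm-tokenizer's actual API.

```python
def bpe_merge(token_bytes, merge_ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest learned rank."""
    parts = [bytes([b]) for b in token_bytes]
    while len(parts) > 1:
        # rank every adjacent pair; pairs without a learned merge get an infinite rank
        pairs = [(merge_ranks.get((parts[i], parts[i + 1]), float("inf")), i)
                 for i in range(len(parts) - 1)]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies anymore
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts  # each element maps to a token ID via the vocabulary
```

GPT-2's full pipeline also includes regex pre-splitting and a byte-to-printable-character mapping before this merge loop runs.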
