A rewrite of minGPT that prioritizes teeth over education. Reproduces GPT-2 (124M) on OpenWebText
running on a single 8XA100 40GB node in about 4 days of training. Plain and readable code:
train.py is ~300 lines, model.py is ~300 lines.
nanoGPT is the simplest, fastest repository for training and finetuning medium-sized GPTs.
It is a clean rewrite of minGPT
that prioritizes production-readiness over educational scaffolding. The code is plain and readable:
train.py is a ~300-line boilerplate training loop and model.py is a ~300-line GPT model definition,
which can optionally load GPT-2 weights from OpenAI.
Two core files — train.py and model.py — each around 300 lines.
No framework lock-in. Readable, hackable, transparent.
python train.py config/train_shakespeare_char.py
Trains GPT-2 (124M) on OpenWebText to ~2.85 loss in ~4 days on an 8XA100 node. Supports multi-node distributed training via PyTorch DDP.
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
Load OpenAI GPT-2 checkpoints (124M to 1.5B) with a single flag. Finetune on custom data with a smaller learning rate.
python train.py config/finetune_shakespeare.py
Uses torch.compile() to cut the time per iteration from
~250ms to ~135ms. Training runs on CPU, GPU, and Apple MPS.
python train.py --compile=False # fallback
From raw data to a trained GPT model. nanoGPT keeps the pipeline small and transparent so each step is easy to understand and modify.
Download and tokenize a dataset into train.bin and val.bin.
Works with Shakespeare, OpenWebText, or your own text.
Set architecture, hyperparameters, and training budget in a config file. Block size, layers, heads, embedding size — all configurable.
Run the training loop. On an 8XA100 40GB node, the full GPT-2 (124M) reproduction takes ~4 days. On a CPU or MacBook, train a character-level GPT in ~3 minutes.
Generate text from the trained model. Prompt with any text and watch the model complete the sequence in its learned style.
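As a rough illustration of what the training step consumes, the sketch below draws random windows from a token stream and pairs each window with its next-token targets. It is a stdlib-only sketch under stated assumptions: `get_batch`, the toy `tokens` list, and the seeded `random.Random` are illustrative stand-ins, not the code in train.py.

```python
import random

def get_batch(tokens, block_size, batch_size, rng):
    """Sample `batch_size` (input, target) pairs of length `block_size`.

    Targets are the inputs shifted one position to the right, so the
    model is asked to predict the next token at every position.
    """
    xs, ys = [], []
    for _ in range(batch_size):
        # The last valid start leaves room for the shifted target window.
        i = rng.randrange(len(tokens) - block_size)
        xs.append(tokens[i : i + block_size])
        ys.append(tokens[i + 1 : i + 1 + block_size])
    return xs, ys

tokens = list(range(100))   # toy stand-in for a tokenized dataset
rng = random.Random(0)      # seeded so the sampling is repeatable
x, y = get_batch(tokens, block_size=8, batch_size=4, rng=rng)
```

A larger `block_size` gives the model more context per example, at the cost of more memory per batch.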
Dependencies are minimal. Install with pip.
pip install torch numpy transformers datasets tiktoken wandb tqdm
Dependencies: PyTorch, numpy, HuggingFace transformers (to load GPT-2 checkpoints), HuggingFace datasets (for OpenWebText), tiktoken (OpenAI BPE), wandb (optional logging), tqdm (progress bars).
Train a character-level GPT on Shakespeare in minutes.
Turn raw Shakespeare text into a stream of integers.
python data/shakespeare_char/prepare.py
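To make "a stream of integers" concrete, here is a minimal character-level tokenizer sketch. The helper names (`build_vocab`, `encode`, `decode`), the toy text, and the 90/10 split are assumptions made for illustration; the actual prepare.py may differ in its details.

```python
from array import array

def build_vocab(text):
    """Map each distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

text = "To be, or not to be"   # toy stand-in for the Shakespeare text
stoi, itos = build_vocab(text)
ids = encode(text, stoi)

# Hypothetical 90/10 split into training and validation streams.
n = int(0.9 * len(ids))
train_ids = array("H", ids[:n])   # "H" = unsigned short (16-bit on common platforms)
val_ids = array("H", ids[n:])

# Writing a stream to disk would look like:
# with open("train.bin", "wb") as f:
#     train_ids.tofile(f)
```

Because the vocabulary is just the set of characters seen in the text, encoding followed by decoding returns the original string unchanged.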
A baby GPT with a context of 256 characters, 384 channels, 6 layers, and 6 heads. Training takes ~3 minutes on an A100.
python train.py config/train_shakespeare_char.py
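The sketch below shows what a config for this baby GPT might contain. It assumes, from the .py extension, that a config is a Python file of plain assignments; the variable names mirror the command-line flags used in this document (`--block_size`, `--n_layer`, `--n_head`, `--n_embd`), and `head_dim` is a derived value added here only for illustration.

```python
# Hypothetical config sketch: plain Python assignments.
out_dir = "out-shakespeare-char"   # matches the --out_dir used when sampling

# Baby GPT architecture, as described above.
block_size = 256   # context length
n_layer = 6
n_head = 6
n_embd = 384       # channels

# Each attention head gets an equal slice of the embedding,
# so n_embd has to divide evenly by n_head.
assert n_embd % n_head == 0
head_dim = n_embd // n_head
```

With these values each head works on 64 channels. Changing `n_head` without keeping `n_embd` divisible by it would trip the assertion.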
On a CPU, train a smaller model with --device=cpu --compile=False. It still runs in ~3 minutes.
python train.py config/train_shakespeare_char.py --device=cpu --compile=False --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000
Generate text from the best checkpoint. Works on CPU or GPU.
python sample.py --out_dir=out-shakespeare-char
For deep learning professionals. Tokenize OpenWebText, then train with PyTorch DDP across multiple GPUs.
python data/openwebtext/prepare.py
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
# On master node (example IP 123.456.123.456):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
--master_addr=123.456.123.456 --master_port=1234 train.py
# On worker node:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
--master_addr=123.456.123.456 --master_port=1234 train.py
Training runs for ~4 days and reaches a loss of ~2.85. If you don't have Infiniband,
prepend NCCL_IB_DISABLE=1 to the launch commands.
OpenAI GPT-2 checkpoints evaluated on OpenWebText. Run each eval config to observe train and val losses.
python train.py config/eval_gpt2.py
python train.py config/eval_gpt2_medium.py
python train.py config/eval_gpt2_large.py
python train.py config/eval_gpt2_xl.py
| Model | Params | Train Loss | Val Loss |
|---|---|---|---|
| gpt2 | 124M | 3.11 | 3.12 |
| gpt2-medium | 350M | 2.85 | 2.84 |
| gpt2-large | 774M | 2.66 | 2.67 |
| gpt2-xl | 1558M | 2.56 | 2.54 |
Note: GPT-2 was trained on WebText (closed), and OpenWebText is a best-effort open reproduction, so there is a domain gap between the two datasets. Finetuning GPT-2 (124M) directly on OWT reaches ~2.85 loss.
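The Params column can be sanity-checked with a little arithmetic. The sketch below counts parameters for a GPT-2 style transformer. Note the assumptions: the gpt2 hyperparameters used here (12 layers, 768 channels, a 50257-token vocabulary, 1024 context) are not stated in this document and come from the published GPT-2 configuration, the output layer is assumed to share weights with the token embedding, and `count_params` is a hypothetical helper.

```python
def count_params(n_layer, n_embd, vocab_size, block_size):
    """Parameter count for a GPT-2 style model with tied input/output embeddings."""
    d = n_embd
    embeddings = vocab_size * d + block_size * d      # token + position tables
    layernorm = 2 * d                                 # scale and bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)     # qkv projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)       # expand to 4d, project back to d
    per_block = 2 * layernorm + attention + mlp
    return embeddings + n_layer * per_block + layernorm   # plus the final layernorm

gpt2 = count_params(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024)
print(f"{gpt2 / 1e6:.1f}M")   # prints 124.4M
```

Most of the count comes from the transformer blocks and the token embedding table; the position table and layernorms contribute comparatively little.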
Finetune a pretrained GPT-2 model on your own text. Initialize from a checkpoint and train with a smaller learning rate.
Download the tiny Shakespeare dataset and tokenize with the OpenAI BPE tokenizer.
python data/shakespeare/prepare.py
Loads a GPT-2 checkpoint via init_from and trains with a small learning rate.
Completes in minutes on a single GPU.
python train.py config/finetune_shakespeare.py
Available pretrained models:
gpt2, gpt2-medium, gpt2-large, gpt2-xl.
If you run out of memory, try a smaller model or decrease block_size.
Sample from pretrained GPT-2 models or from a model you trained yourself.
python sample.py \
--init_from=gpt2-xl \
--start="What is the answer to life, the universe, and everything?" \
--num_samples=5 --max_new_tokens=100
python sample.py --out_dir=out-shakespeare
You can also prompt the model with text from a file: python sample.py --start=FILE:prompt.txt.
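One way such a `FILE:` prefix could be handled is sketched below. `resolve_prompt` is a hypothetical helper written for illustration, not necessarily how sample.py implements it.

```python
import os
import tempfile

def resolve_prompt(start):
    """Return the prompt text, reading it from disk if `start` is 'FILE:<path>'."""
    prefix = "FILE:"
    if start.startswith(prefix):
        with open(start[len(prefix):], "r", encoding="utf-8") as f:
            return f.read()
    return start

# Usage: a literal prompt passes through, a FILE: prompt is read from disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompt.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("O Romeo, Romeo")
    from_file = resolve_prompt("FILE:" + path)

literal = resolve_prompt("Once upon a time")
```

Keeping long prompts in a file avoids shell quoting problems with multi-line or punctuation-heavy text.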
For benchmarking and profiling, bench.py provides the core training loop
without the extras. The default setup uses PyTorch 2.0 torch.compile()
for significant speed improvements.
python bench.py
With the PyTorch 2.0 nightly (Dec 2022), torch.compile() cuts iteration time
from ~250ms/iter to ~135ms/iter. If you run into errors, disable compilation with
--compile=False.
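For a feel of how per-iteration timings like these can be measured and compared, here is a small stdlib-only sketch. `ms_per_iter` and `speedup` are hypothetical helpers and the toy workload stands in for a real training step; this is not the code in bench.py, and timing real GPU work would also require waiting for the device to finish before reading the clock.

```python
import time

def ms_per_iter(step, num_iters=10, warmup=2):
    """Average wall-clock milliseconds per call of `step`, after warmup calls."""
    for _ in range(warmup):
        step()   # warmup iterations are not timed
    t0 = time.perf_counter()
    for _ in range(num_iters):
        step()
    return (time.perf_counter() - t0) / num_iters * 1000.0

def speedup(before_ms, after_ms):
    """How many times faster the 'after' measurement is."""
    return before_ms / after_ms

# Toy workload standing in for one training iteration.
elapsed = ms_per_iter(lambda: sum(range(10_000)))

# The figures quoted above: ~250ms/iter down to ~135ms/iter.
ratio = speedup(250.0, 135.0)
print(f"{ratio:.2f}x")   # prints 1.85x
```

The warmup calls matter in practice because the first iterations often include one-time costs such as compilation.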
Common questions about nanoGPT — the creator, the dependencies, what makes it different.
The training loop is in train.py (~300 lines). The model definition is in
model.py (~300 lines). Both files are designed to be readable and easy to hack.
On a Mac with Apple MPS support, pass --device=mps for 2-3x acceleration.
To start from pretrained weights, pass --init_from=gpt2 (or gpt2-medium, gpt2-large,
gpt2-xl) to the training or sampling script. The model code can optionally
load GPT-2 weights from OpenAI's released checkpoints via the HuggingFace transformers library.