Andrej Karpathy · MIT License

The simplest, fastest repo for training GPTs

A rewrite of minGPT that prioritizes teeth over education. Reproduces GPT-2 (124M) on OpenWebText running on a single 8XA100 40GB node in about 4 days of training. Plain and readable code: train.py is ~300 lines, model.py is ~300 lines.

58.9k GitHub Stars
10.1k Forks
124M GPT-2 Params
~4 days Training Time (8xA100)

What is nanoGPT?

nanoGPT is the simplest, fastest repository for training and finetuning medium-sized GPTs. It is a clean rewrite of minGPT that prioritizes production-readiness over educational scaffolding. The code is plain and readable: train.py is a ~300-line boilerplate training loop and model.py is a ~300-line GPT model definition, which can optionally load GPT-2 weights from OpenAI.

01

Simple by design

Two core files — train.py and model.py — each around 300 lines. No framework lock-in. Readable, hackable, transparent.

python train.py config/train_shakespeare_char.py
02

Reproduces GPT-2

Trains GPT-2 (124M) on OpenWebText to ~2.85 loss in ~4 days on an 8XA100 node. Supports multi-node distributed training via PyTorch DDP.

torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
03

Pretrained checkpoint loading

Load OpenAI GPT-2 checkpoints (124M to 1.5B) with a single flag. Finetune on custom data with a smaller learning rate.

python train.py config/finetune_shakespeare.py
04

PyTorch 2.0 compiled

Uses torch.compile() for significant speedups, cutting iteration time from ~250ms to ~135ms. The training code itself runs on CPU, GPU, and Apple MPS; pass --compile=False where compilation is unavailable or fails.

python train.py --compile=False # fallback

The training loop, explained

From raw data to a trained GPT model. nanoGPT keeps the pipeline small and transparent so each step is easy to understand and modify.

1

Prepare data

Download and tokenize a dataset into train.bin and val.bin. Works with Shakespeare, OpenWebText, or your own text.

2

Configure model

Set architecture, hyperparameters, and training budget in a config file. Block size, layers, heads, embedding size — all configurable.

3

Train

Run the training loop. On an 8xA100 node, the full GPT-2 (124M) reproduction takes ~4 days. On a CPU or MacBook, a small character-level GPT trains in ~3 minutes.

4

Sample

Generate text from the trained model. Prompt with any text and watch the model complete the sequence in its learned style.
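
Steps 1 and 3 above can be sketched with the standard library alone. The snippet is an illustration under assumptions, not nanoGPT's actual prepare.py or train.py: it maps characters to integer ids, stores them as unsigned 16-bit integers, and then draws a random training window whose targets are the inputs shifted by one position. The helper names (build_vocab, encode, decode, write_bin, read_bin, get_batch) are hypothetical and chosen only for this example.

```python
from array import array


def build_vocab(text):
    """Map each distinct character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer ids."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)


def write_bin(ids, path):
    """Store token ids as unsigned 16-bit integers (ids must be below 65536)."""
    with open(path, "wb") as f:
        array("H", ids).tofile(f)


def read_bin(path):
    """Load the ids written by write_bin."""
    data = array("H")
    with open(path, "rb") as f:
        data.frombytes(f.read())
    return data


def get_batch(data, block_size, rng):
    """Pick one random window; the target is the input shifted by one token."""
    i = rng.randrange(len(data) - block_size)
    x = list(data[i : i + block_size])
    y = list(data[i + 1 : i + 1 + block_size])
    return x, y
```

A caller would pass a seeded random.Random instance as rng, for example get_batch(read_bin("train.bin"), 64, random.Random(0)). A real training loop stacks many such windows into one batch tensor; this sketch returns a single pair to keep the shifted-target idea visible.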

Install

Dependencies are minimal. Install with pip.

Terminal pip
pip install torch numpy transformers datasets tiktoken wandb tqdm

Dependencies: PyTorch, numpy, HuggingFace transformers (to load GPT-2 checkpoints), HuggingFace datasets (for OpenWebText), tiktoken (OpenAI BPE), wandb (optional logging), tqdm (progress bars).

Quick Start

Train a character-level GPT on Shakespeare in minutes.

1

Download and prepare data

Turn raw Shakespeare text into a stream of integers.

python data/shakespeare_char/prepare.py
2

Train on a GPU

Baby GPT with a 256-character context, 384 channels, 6 layers, and 6 heads per layer. Takes ~3 minutes on a single A100.

python train.py config/train_shakespeare_char.py
3

Train on a CPU / MacBook

A smaller model, run with --device=cpu and --compile=False. Still trains in ~3 minutes.

python train.py config/train_shakespeare_char.py --device=cpu --compile=False --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000
4

Sample from the model

Generate text from the best checkpoint. Works on CPU or GPU.

python sample.py --out_dir=out-shakespeare-char
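
The CPU command in step 3 overrides config values with --key=value flags. A minimal sketch of that idea follows. The function apply_overrides and the defaults dictionary are hypothetical names for this illustration, and nanoGPT's own configuration code may differ in detail.

```python
from ast import literal_eval


def apply_overrides(defaults, argv):
    """Return a copy of `defaults` updated from `--key=value` arguments.

    Values are parsed with literal_eval, so "64" becomes an int and
    "False" a bool; anything that fails to parse stays a string.
    Unknown keys and type changes are rejected.
    """
    config = dict(defaults)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in config:
            raise ValueError(f"unknown config key: {key}")
        try:
            value = literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # plain strings such as "cpu"
        if type(value) is not type(config[key]):
            raise TypeError(f"{key} expects {type(config[key]).__name__}")
        config[key] = value
    return config
```

Rejecting unknown keys and mismatched types is a design choice for the sketch: a mistyped flag fails immediately instead of silently training with the default value.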

Reproducing GPT-2

For deep learning professionals. Tokenize OpenWebText, then train with PyTorch DDP across multiple GPUs.

Data preparation OpenWebText
python data/openwebtext/prepare.py
Train GPT-2 (124M) on 8xA100 DDP
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
Multi-node training 2 nodes
# On master node (example IP 123.456.123.456):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=123.456.123.456 --master_port=1234 train.py

# On worker node:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
  --master_addr=123.456.123.456 --master_port=1234 train.py

Runs for ~4 days and reaches a loss of ~2.85. If you don't have InfiniBand, prepend NCCL_IB_DISABLE=1 to the launch commands.
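
In the two-node launch above, torchrun starts 8 processes per node and tells each one its position through environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below only illustrates the arithmetic, under the usual assumption that every node runs the same number of processes. The helpers global_rank and describe_process are hypothetical, not part of nanoGPT or PyTorch.

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Rank of a process across all nodes, assuming equal processes per node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


def describe_process(env):
    """Read the variables torchrun sets; fall back to a single-process run."""
    rank = int(env.get("RANK", 0))
    return {
        "rank": rank,
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        # Rank 0 is conventionally the one that logs and writes checkpoints.
        "is_master": rank == 0,
    }
```

With --nnodes=2 and --nproc_per_node=8 the world size is 16, and the process with local rank 3 on the worker node (node rank 1) has global rank 11. In a real script, describe_process would be called with os.environ.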

Baselines

OpenAI GPT-2 checkpoints evaluated on OpenWebText. Run each eval config to observe train and val losses.

Run all baselines
python train.py config/eval_gpt2.py
python train.py config/eval_gpt2_medium.py
python train.py config/eval_gpt2_large.py
python train.py config/eval_gpt2_xl.py
Model        Params  Train Loss  Val Loss
gpt2         124M    3.11        3.12
gpt2-medium  350M    2.85        2.84
gpt2-large   774M    2.66        2.67
gpt2-xl      1558M   2.56        2.54

Note: GPT-2 was trained on WebText, which was never released. OpenWebText is a best-effort open reproduction, so there is a domain gap between the two datasets. Finetuning the GPT-2 (124M) checkpoint directly on OpenWebText brings the loss down to ~2.85.
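
The parameter counts in the table can be checked by hand from each model's shape. The sketch below assumes the standard GPT-2 layout: tied input and output embeddings, biases on linear layers and LayerNorms, a 4x MLP expansion, a vocabulary of 50257, and a context of 1024. The layer counts and widths are the published GPT-2 sizes rather than figures stated on this page, and gpt2_param_count is a hypothetical helper.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with tied embeddings."""
    d = n_embd
    token_emb = vocab_size * d            # shared with the output projection
    pos_emb = block_size * d
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # qkv projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                     # two LayerNorms per block
    per_block = attn + mlp + layer_norms
    final_ln = 2 * d
    return token_emb + pos_emb + n_layer * per_block + final_ln


# (n_layer, n_embd) for each released checkpoint; assumed, see lead-in.
SIZES = {
    "gpt2": (12, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
    "gpt2-xl": (48, 1600),
}

for name, (n_layer, n_embd) in SIZES.items():
    print(f"{name}: {gpt2_param_count(n_layer, n_embd) / 1e6:.0f}M")
```

Under these assumptions the counts come to roughly 124M, 355M, 774M, and 1558M. The 350M listed for gpt2-medium is the model's nominal label rather than an exact count. The number of attention heads does not appear in the formula because splitting the embedding across heads does not change the size of the projection matrices.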

Finetuning

Finetune a pretrained GPT-2 model on your own text. Initialize from a checkpoint and train with a smaller learning rate.

1

Prepare custom text

Download the tiny Shakespeare dataset and tokenize with the OpenAI BPE tokenizer.

python data/shakespeare/prepare.py
2

Run finetuning

Loads a GPT-2 checkpoint via init_from and trains with a small learning rate. Completes in minutes on a single GPU.

python train.py config/finetune_shakespeare.py

Available pretrained models: gpt2, gpt2-medium, gpt2-large, gpt2-xl. If you run out of memory, try a smaller model or decrease block_size.

Sampling / Inference

Sample from pretrained GPT-2 models or from a model you trained yourself.

Sample from GPT-2 XL
python sample.py \
    --init_from=gpt2-xl \
    --start="What is the answer to life, the universe, and everything?" \
    --num_samples=5 --max_new_tokens=100
Sample from your own model
python sample.py --out_dir=out-shakespeare
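
Sampling turns the model's scores for the next token into a probability distribution and draws from it, one token at a time. A minimal standard-library sketch of that loop, with temperature scaling and top-k filtering, follows. It is an illustration rather than sample.py itself: sample_next, generate, and the logits_fn callback standing in for a trained model are all hypothetical.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token id from a list of unnormalized scores."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        # Keep the k highest scores (ties at the cutoff are also kept).
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    # Softmax numerators; subtracting the peak avoids overflow in exp.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(logits_fn, prompt_ids, max_new_tokens, **kwargs):
    """Extend prompt_ids by repeatedly sampling from logits_fn(ids)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_next(logits_fn(ids), **kwargs))
    return ids
```

Lower temperatures sharpen the distribution toward the highest-scoring token, while top_k=1 reduces the loop to greedy decoding.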

You can also prompt the model with text from a file by prefixing the path with FILE:, for example: python sample.py --start=FILE:prompt.txt
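
Handling such a prefix takes only a few lines. The sketch below assumes the convention described above, where a --start value beginning with FILE: names a UTF-8 text file whose contents become the prompt. The helper name resolve_prompt is hypothetical.

```python
def resolve_prompt(start):
    """Return the prompt text, reading it from disk when prefixed with FILE:."""
    prefix = "FILE:"
    if start.startswith(prefix):
        with open(start[len(prefix):], "r", encoding="utf-8") as f:
            return f.read()
    return start
```

Any value without the prefix is returned unchanged, so ordinary inline prompts keep working.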

Efficiency Notes

For benchmarking and profiling, bench.py provides the core training loop without the extras. The default setup uses PyTorch 2.0 torch.compile() for significant speed improvements.

Benchmark
python bench.py

With PyTorch 2.0 nightly (Dec 2022), torch.compile() cuts iteration time from ~250ms/iter to 135ms/iter. If you run into errors, disable compile with --compile=False.
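
To put those timings in perspective, the ratio works out as follows. This is plain arithmetic on the two figures quoted above, not a new measurement.

```python
baseline_ms = 250.0   # approximate time per iteration without torch.compile()
compiled_ms = 135.0   # time per iteration with torch.compile()

speedup = baseline_ms / compiled_ms
time_saved = 1.0 - compiled_ms / baseline_ms

print(f"speedup: {speedup:.2f}x")        # speedup: 1.85x
print(f"time saved: {time_saved:.0%}")   # time saved: 46%
```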

FAQ

Common questions about nanoGPT — the creator, the dependencies, what makes it different.

nanoGPT comes from Andrej Karpathy. It is a rewrite of his earlier educational project minGPT, trading pedagogical clarity for production-readiness and performance.
The training loop is in train.py (~300 lines). The model definition is in model.py (~300 lines). Both files are designed to be readable and easy to hack.
For the full GPT-2 (124M) reproduction, the reference setup is an 8xA100 40GB node, which takes about 4 days. For smaller experiments, a single GPU or even a CPU/MacBook works: you can train a character-level GPT on Shakespeare in ~3 minutes on a MacBook. Apple Silicon users should add --device=mps for 2-3x acceleration.
minGPT was designed for education — clean, minimal code to teach how GPTs work. nanoGPT prioritizes teeth over education: it is faster, supports distributed training, and can reproduce GPT-2 scale models. The code is still readable and hackable, but optimized for real use.
Use --init_from=gpt2 (or gpt2-medium, gpt2-large, gpt2-xl) in the training or sampling script. The model code can optionally load GPT-2 weights from OpenAI's released checkpoints via the HuggingFace transformers library.
nanoGPT has a newer cousin called nanochat (Nov 2025). nanoGPT is now considered old and deprecated but is left up for posterity. The code still works and remains a popular reference implementation.