SmolGPT - GPT From Scratch
Modern decoder-only transformer (RoPE, RMSNorm, SwiGLU, flash attention) built from scratch as an optimisation sandbox.
Why it matters
A foundation for research into small, locally-hosted reasoning models - architectures and training recipes you can take apart and modify, instead of treating an API as a black box.
What it does
A decoder-only GPT, written from scratch in PyTorch, used as a sandbox for model-optimisation experiments. It combines RoPE positional encoding, RMSNorm, SwiGLU feed-forward blocks, flash attention via scaled_dot_product_attention, a weight-tied lm_head, and a BPE tokenizer trained on the same corpus. The design is intentionally close to a modern LLaMA-flavoured transformer, but small enough to fit on a single consumer GPU.
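As a rough illustration of how these pieces fit together, here is a minimal sketch of one pre-norm block and a tiny model around it. It is not the project's actual code: the class names (RMSNorm, SwiGLU, CausalSelfAttention, Block, TinyGPT) and all sizes are hypothetical, RoPE is left out to keep the example short, and the fused-QKV layout is only one plausible way to feed scaled_dot_product_attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Scale by the root-mean-square of the features (no mean, no bias)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class CausalSelfAttention(nn.Module):
    """Fused-QKV multi-head attention using PyTorch's SDPA kernel dispatch."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # one matmul for Q, K, V
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (
            t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
            for t in (q, k, v)
        )
        # A real model would apply RoPE to q and k here (omitted in this sketch).
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))


class Block(nn.Module):
    """Pre-norm residual block: attention, then SwiGLU feed-forward."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = CausalSelfAttention(dim, n_heads)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=4 * dim)  # hidden size is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


class TinyGPT(nn.Module):
    """Hypothetical toy model; sizes are illustrative only."""

    def __init__(self, vocab_size=256, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList(Block(dim, n_heads) for _ in range(n_layers))
        self.norm = RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying: one shared matrix

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = self.tok_emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.norm(x))
```

Calling TinyGPT()(torch.randint(0, 256, (2, 8))) returns logits of shape (2, 8, 256). Because lm_head and tok_emb share one parameter, the tied matrix is counted once in model.parameters().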
Where it applies
- Personal research into how small a reliable reasoning agent can be - the foundation for a locally-hosted model with behavioural guarantees you cannot extract from a black-box API.
- Architecture and training-recipe ablations: swap an attention mechanism, change a precision regime, add a quantisation pass, and inspect every layer of the change.
- A reading reference for anyone learning a modern transformer top-to-bottom in inspectable Python.
How it works (high level)
A standard transformer stack with two attention implementations: a clean one-head-at-a-time version (for reading) and a fused-QKV multi-head version with optional flash kernels (for training). A memory-mapped, sharded dataset reader streams FineWeb-Edu shards on the fly. The trainer logs more than the headline loss - gradient norm, parameter norm, update-to-parameter ratio, logit entropy, tokens/sec, GPU peak memory, and a loss-vs-tokens scaling trace - so each run is interpretable in TensorBoard.
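Several of the logged quantities can be computed in a few lines around the optimiser step. The sketch below shows one way to do it. The helper names (global_norm, step_with_diagnostics) and the returned dictionary keys are hypothetical, not the project's trainer API, and cloning every parameter to measure the update is a simple but memory-hungry choice that a real trainer might avoid.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def global_norm(tensors) -> float:
    """L2 norm over a list of tensors, treated as one long vector."""
    return math.sqrt(sum(float(t.detach().float().pow(2).sum()) for t in tensors))


def step_with_diagnostics(model, optimizer, loss, logits) -> dict:
    """Run backward + optimiser step and return scalar training diagnostics."""
    params = [p for p in model.parameters() if p.requires_grad]

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    grad_norm = global_norm([p.grad for p in params if p.grad is not None])

    # Snapshot the weights so the size of the applied update can be measured.
    before = [p.detach().clone() for p in params]
    optimizer.step()
    update_norm = global_norm([p.detach() - b for p, b in zip(params, before)])
    param_norm = global_norm(params)

    # Mean entropy (in nats) of the predictive distribution over the vocabulary.
    log_probs = F.log_softmax(logits.detach().float(), dim=-1)
    entropy = float(-(log_probs.exp() * log_probs).sum(dim=-1).mean())

    return {
        "loss": float(loss.detach()),
        "grad_norm": grad_norm,
        "param_norm": param_norm,
        "update_to_param_ratio": update_norm / (param_norm + 1e-12),
        "logit_entropy": entropy,
    }


# Usage on a stand-in model (a single linear layer, purely for illustration).
torch.manual_seed(0)
model = nn.Linear(8, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(4, 8)
targets = torch.randint(0, 16, (4,))
logits = model(inputs)
stats = step_with_diagnostics(model, optimizer, F.cross_entropy(logits, targets), logits)
```

Each entry in stats is a plain float, so it can be passed to torch.utils.tensorboard's SummaryWriter.add_scalar with the step (or tokens seen) as the x-axis. On a CUDA device, torch.cuda.max_memory_allocated() gives the peak-memory figure, and tokens/sec follows from the batch token count divided by wall-clock step time.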
Outcome
A clean training stack I can take apart and modify, end to end. Stage one (current) is a correct, modern transformer training pipeline; stage two is the optimisation work; stage three is specialising the model for reliable narrow-domain reasoning that runs entirely on local hardware.
Stack
Python · PyTorch · TensorBoard · custom BPE tokenizer · FineWeb-Edu shards.