This repository contains a nanoGPT-style implementation of a GPT model built from scratch in PyTorch, with progressively more advanced modifications across exercises. It is based on Andrej Karpathy's educational series and enhanced with state-of-the-art techniques in later stages.
- Goal: Implement a basic GPT model trained on TinyShakespeare.
- Features:
  - Token + Positional Embeddings
  - Causal Masking with Triangular Matrix (see the sketch after this list)
  - Multi-Head Attention
  - FeedForward block
  - CrossEntropy Loss with teacher forcing
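A minimal sketch of the causal masking idea, following nanoGPT conventions (the tensor names and toy dimensions below are illustrative, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

B, T, C = 4, 8, 32                        # toy batch, sequence length, embedding dim
q, k, v = (torch.randn(B, T, C) for _ in range(3))

tril = torch.tril(torch.ones(T, T))       # lower-triangular matrix
wei = q @ k.transpose(-2, -1) * C**-0.5   # scaled attention scores (B, T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))  # position t cannot see positions > t
wei = F.softmax(wei, dim=-1)
out = wei @ v                             # (B, T, C)
```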
- Goal: Train a GPT model to perform integer addition (e.g., `123+456=579`).
- Modifications:
  - Custom dataset generator for random digit-based addition problems (see the sketch after this list).
  - Target sequence is the reversed sum digits (least-significant digit first), mimicking how humans carry when adding.
  - Used `y = -1` masking to ignore loss on prompt (input) tokens.
  - Inference via `.generate()` plus postprocessing (`[::-1]` to restore digit order).
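A minimal sketch of how one training example can be constructed, assuming a character-level vocabulary over `0123456789+=` and the `ignore_index=-1` convention of `F.cross_entropy` (the helper name and digit range are illustrative, not the repository's exact code):

```python
import random
import torch

def make_addition_example(num_digits=3):
    """Build one '<a>+<b>=' prompt with the sum digits reversed as the target."""
    a = random.randint(0, 10**num_digits - 1)
    b = random.randint(0, 10**num_digits - 1)
    prompt = f"{a}+{b}="
    answer = str(a + b)[::-1]              # least-significant digit first
    text = prompt + answer

    stoi = {ch: i for i, ch in enumerate("0123456789+=")}
    ids = [stoi[ch] for ch in text]

    x = torch.tensor(ids[:-1])             # model input
    y = torch.tensor(ids[1:])              # next-token targets
    y[: len(prompt) - 1] = -1              # ignore loss on prompt tokens
    return x, y
```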
- Bug Fixes & Learnings:
  - Fixed shape mismatch issues in position embeddings and masking.
  - Switched to deterministic sampling during generation (sketched after this list).
  - Verified learning using integer prediction accuracy.
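For reference, deterministic (greedy) decoding just replaces `torch.multinomial` sampling with an argmax over the last position's logits; a sketch assuming a nanoGPT-style model that returns `(logits, loss)`:

```python
import torch

@torch.no_grad()
def generate_greedy(model, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                    # (B, T, vocab_size)
        logits = logits[:, -1, :]                 # last time step only
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # deterministic pick
        idx = torch.cat([idx, next_id], dim=1)    # append and continue
    return idx
```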
- Chose a large dataset such as OpenWebText.
- Tokenized the data with the same vocabulary/tokenizer as the Shakespeare runs.
- Trained for many steps and saved a model checkpoint with:

```python
learning_rate = 3e-4
max_iters = 100_000      # or more
eval_interval = 1000

torch.save(model.state_dict(), 'pretrained_gpt.pt')
```

- Loaded the pretrained weights and finetuned on TinyShakespeare (sketched below).
- Result: lower validation loss after pretraining.
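A minimal sketch of that hand-off, assuming the checkpoint above and a `GPT` module with the same config as pretraining (the `GPT`, `config`, and `get_batch` names are assumptions, as is the smaller finetuning learning rate):

```python
import torch

model = GPT(config)                                    # same architecture as pretraining
model.load_state_dict(torch.load('pretrained_gpt.pt'))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # smaller LR for finetuning
for step in range(5000):
    xb, yb = get_batch('train')                        # batches from TinyShakespeare
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```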
- Goal: Enhance the vanilla GPT with modern architectural improvements.
- Implemented Features:
  - Multi-Query Attention (MQA)
    - What: Shared key and value projections across all heads.
    - Why: Reduces memory usage and compute time without degrading performance.
    - Where: Replaced `MultiHeadAttention` with `MultiQueryAttention`.
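    - Sketch: each head keeps its own query projection while all heads share a single key/value projection (shapes and names below follow nanoGPT conventions and are assumptions, not the repository's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size, dropout=0.0):
        super().__init__()
        self.n_head = n_head
        self.head_size = n_embd // n_head
        self.q = nn.Linear(n_embd, n_embd, bias=False)                # per-head queries
        self.kv = nn.Linear(n_embd, 2 * self.head_size, bias=False)   # one shared K and V
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)  # (B, nh, T, hs)
        k, v = self.kv(x).split(self.head_size, dim=-1)   # (B, T, hs) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)             # broadcast over heads
        wei = q @ k.transpose(-2, -1) * self.head_size**-0.5          # (B, nh, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = (wei @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```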
  - SwiGLU Activation
    - What: Replaces ReLU with SwiGLU in the feed-forward layer.
    - Why: Empirically shown to improve training dynamics and downstream performance.
    - Implementation:

```python
nn.Sequential(
    nn.Linear(n_embd, 2 * n_embd),
    nn.SiLU(),                       # Swish
    nn.Linear(2 * n_embd, n_embd),   # project back down from the expanded width
    nn.Dropout(dropout),
)
```
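Note that the block above is the ungated Swish (SiLU) feed-forward; a gated SwiGLU variant in the PaLM/LLaMA style would look roughly like this sketch (the module and weight names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(W_gate x) * (W_up x), projected back to n_embd."""
    def __init__(self, n_embd, hidden, dropout=0.0):
        super().__init__()
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w_down(F.silu(self.w_gate(x)) * self.w_up(x)))
```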
- Implement RoPE (Rotary Positional Embeddings)
- Use FlashAttention
- Add residual scaling (`rescale_layer`)
- Switch to GELU approximation (QuickGELU)
- Add LoRA for low-rank adaptation
- Integrate Chain-of-Thought tracing
- Python 3.8+
- PyTorch 2.x
- CUDA-enabled GPU (optional but recommended)
Inspired by karpathy/nanoGPT and research from recent transformer architecture papers (PaLM, LLaMA, FlashAttention, etc.).