About the Paper
"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer architecture, revolutionizing natural language processing and becoming the foundation for modern AI systems like GPT, BERT, and beyond. The paper demonstrated that attention mechanisms alone, without recurrence or convolution, could achieve state-of-the-art results in machine translation.
This project implements the complete Transformer architecture from scratch in PyTorch, faithfully following the paper's specifications for educational and research purposes.
Implementation Highlights
- Complete Architecture: Full encoder-decoder implementation with multi-head attention, positional encoding, and feed-forward networks
- Paper Faithful: Follows the original paper's specifications with default hyperparameters (d_model=512, N=6, h=8)
- Training Pipeline: Complete training infrastructure with validation, checkpointing, and TensorBoard visualization
- Bilingual Translation: Trained on the English-French portion of the OPUS Books corpus for machine translation
- Pure PyTorch: Clean, readable implementation using only PyTorch, with no external transformer libraries
- Evaluation Metrics: Includes BLEU, CER, and WER for translation quality assessment
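For reference, character error rate (CER) is the Levenshtein (edit) distance between hypothesis and reference characters divided by the reference length; WER is the same computation over word tokens. A minimal pure-Python sketch of these two metrics (illustrative only, not this repository's actual metric code):

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: edit distance over characters / reference length."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

def wer(hyp, ref):
    """Word error rate: edit distance over word tokens / reference word count."""
    return levenshtein(hyp.split(), ref.split()) / max(len(ref.split()), 1)
```

BLEU is more involved (modified n-gram precision with a brevity penalty) and is typically taken from a library such as `torchmetrics` or `sacrebleu` rather than hand-rolled.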
Architecture Overview
The Transformer architecture consists of an encoder-decoder structure with the following key components:
| Component | Value | Description |
|---|---|---|
| Model Dimension | 512 | Embedding and hidden state dimension |
| Encoder/Decoder Layers | 6 each | Stacked layers for processing |
| Attention Heads | 8 | Multi-head attention mechanism |
| Feed-Forward Dimension | 2048 | Inner layer dimension in FFN |
| Dropout Rate | 0.1 | Regularization parameter |
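At the heart of these components is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, where the sqrt(d_k) scaling keeps the logits in a range where softmax gradients stay usable. A minimal NumPy sketch (function name and shapes are illustrative, not taken from this repository):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V with an optional boolean mask."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq, seq)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # masked positions -> ~ -inf
    # numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Passing a lower-triangular mask (`np.tril`) gives the decoder's causal ("masked") attention; multi-head attention runs h=8 of these in parallel on learned projections of d_model/h = 64 dimensions each.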
Key Features
- Multi-Head Self-Attention: Allows the model to attend to different representation subspaces simultaneously
- Positional Encoding: Injects sequence order information using sine and cosine functions
- Masked Attention: Prevents the decoder from attending to future positions during training
- Layer Normalization: Applied to each sub-layer's residual output, following the paper's LayerNorm(x + Sublayer(x)) formulation, for training stability
- Residual Connections: Facilitates gradient flow through deep networks
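The sinusoidal positional encoding listed above follows the paper's Section 3.5 formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short NumPy sketch (function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

Because each dimension is a sinusoid of a different wavelength, PE(pos + k) is a fixed linear function of PE(pos), which is what lets the model attend by relative position.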
Resources
Original Paper: Attention Is All You Need (Vaswani et al., NeurIPS 2017)
Source Code: GitHub Repository - Complete implementation with training scripts and documentation
Citation
```bibtex
@article{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
          Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
          Kaiser, {\L}ukasz and Polosukhin, Illia},
  journal={Advances in neural information processing systems},
  volume={30},
  year={2017}
}
```