This project implements a Visual Question Answering (VQA) system that combines BERT for question understanding with a Vision Transformer (ViT) image encoder pre-trained as a Masked Autoencoder (MAE). Cross-attention fuses the two modalities, and a transformer decoder generates answers to questions about images.
The system consists of several key components (a minimal fusion sketch follows this list):
- Text Encoder: Pre-trained BERT model for question understanding
- Visual Encoder: Vision Transformer with MAE pre-training for image feature extraction
- Cross-Attention Layers: Bidirectional attention mechanisms for multimodal fusion
- Decoder: Transformer decoder with positional embeddings for answer generation
- Data Loaders: COCO-VQA dataset handling with efficient preprocessing
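A minimal, illustrative sketch of this flow, assuming the Hugging Face `transformers` library for BERT and `timm` for the ViT backbone. The class and variable names here (e.g. `VQAFusionSketch`) are placeholders and do not mirror the repository's actual modules such as `co_decoder_posi_v4_2.py`:

```python
import torch
import torch.nn as nn
import timm
from transformers import BertModel


class VQAFusionSketch(nn.Module):
    """Toy version of the text-encoder / visual-encoder / cross-attention / decoder pipeline."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8, vocab_size: int = 30522):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        # Question tokens attend to image patch tokens (only one direction shown here).
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        decoder_layer = nn.TransformerDecoderLayer(hidden_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.answer_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, attention_mask, images, answer_embeds):
        text = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        patches = self.image_encoder.forward_features(images)    # (B, 1 + num_patches, 768)
        fused, _ = self.cross_attn(query=text, key=patches, value=patches)
        # The decoder attends to the fused memory to produce answer-token logits.
        decoded = self.decoder(tgt=answer_embeds, memory=fused)
        return self.answer_head(decoded)                          # (B, answer_len, vocab_size)
```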
AutoEncoder_VQA/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── config/
│   └── config.yaml                # Configuration settings
├── dataloaders/                   # Data loading and preprocessing
│   ├── __init__.py
│   ├── coco_dataloader.py         # COCO-VQA dataset loader
│   ├── dataloader.py              # Base dataset classes
│   └── mscoco_dataloader.py       # MS-COCO specific loader
├── models/                        # Model architectures
│   ├── __init__.py
│   ├── co_decoder_posi_v4_2.py    # Latest multimodal model
│   ├── positional_embedding.py    # Positional encoding utilities
│   ├── cross_attention_model.py   # Cross-attention implementations
│   └── [other model variants]
├── visual_embed/                  # Visual encoding components
│   ├── __init__.py
│   ├── models.py                  # MAE encoder wrapper
│   ├── models_mae.py              # MAE implementation
│   └── util/                      # Utility functions
├── question_embed/                # Text encoding components
│   ├── __init__.py
│   └── pretrained_bert.py         # BERT utilities
├── trainings/                     # Training scripts
│   ├── train_decoder_posi_v4_2.py # Main training script
│   ├── train.py                   # Basic training
│   └── [other training variants]
├── results/                       # Output results
├── scripts/                       # Utility scripts
└── tests/                         # Unit tests
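For orientation, here is a minimal sketch of how a COCO-VQA loader such as `dataloaders/coco_dataloader.py` can pair VQA v2 questions and annotations with COCO images. The JSON field names follow the public VQA v2 format, but the class itself is illustrative, not the repository's implementation:

```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class VQADatasetSketch(Dataset):
    """Illustrative loader pairing VQA v2 questions/annotations with COCO train2014 images."""

    def __init__(self, image_dir, questions_json, annotations_json, transform=None):
        with open(questions_json) as f:
            self.questions = json.load(f)["questions"]
        with open(annotations_json) as f:
            # Index annotations by question_id so each question can find its answer.
            anns = json.load(f)["annotations"]
            self.answers = {a["question_id"]: a["multiple_choice_answer"] for a in anns}
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        q = self.questions[idx]
        # COCO train2014 files are named COCO_train2014_<12-digit image_id>.jpg
        path = os.path.join(self.image_dir, f"COCO_train2014_{q['image_id']:012d}.jpg")
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, q["question"], self.answers[q["question_id"]]
```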
- Python 3.8+
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Clone the repository:
git clone https://github.com/VincentPit/AutoEncoder_VQA.git
cd AutoEncoder_VQA

- Create a virtual environment:
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

- Install dependencies:

pip install -r requirements.txt

- Download pre-trained models and datasets:
python scripts/download_pretrained.py
python scripts/download_dataset.py

- Training a model:

python trainings/train_decoder_posi_v4_2.py --config config/config.yaml

- Evaluating a model:

python scripts/evaluate.py --model_path checkpoints/best_model.pth

- Interactive inference:

python scripts/interactive.py --model_path checkpoints/best_model.pth

Main configuration parameters in config/config.yaml:
model:
  max_seq_length: 512
  dropout_rate: 0.1
  num_attention_heads: 8
training:
  batch_size: 16
  learning_rate: 1e-5
  num_epochs: 50
  warmup_steps: 1000
data:
  train_images: "train2014/"
  val_images: "val2014/"
  annotations: "v2_mscoco_train2014_annotations.json"
  questions: "v2_OpenEnded_mscoco_train2014_questions.json"
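These settings can be read with PyYAML; a minimal sketch (the key names follow the layout above, while the variable names are arbitrary):

```python
import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

batch_size = cfg["training"]["batch_size"]           # 16
# Note: PyYAML parses the bare literal 1e-5 as a string, so coerce explicitly.
learning_rate = float(cfg["training"]["learning_rate"])
num_heads = cfg["model"]["num_attention_heads"]      # 8
```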
Results for the main model variants:

| Model Version | BLEU-4 | CIDEr | Accuracy |
|---|---|---|---|
| v4.2 (Latest) | 0.234 | 0.891 | 67.3% |
| v4.0 | 0.221 | 0.867 | 65.1% |
| v3.0 | 0.198 | 0.834 | 62.8% |
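The Accuracy column presumably follows VQA-style evaluation, where the commonly used (simplified) formula gives full credit when at least three of the ten annotators agree with the prediction. A small sketch, not necessarily the exact computation in `scripts/evaluate.py`:

```python
from typing import List


def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(#annotators giving the answer / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)


# 2 of 10 annotators answered "red" -> partial credit of about 0.67.
print(vqa_accuracy("red", ["red", "red", "maroon"] + ["dark red"] * 7))
```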
Model variants:

- co_decoder_posi_v4_2: Latest model with improved cross-attention and positional embeddings
- cross_attention_model: Baseline cross-attention architecture
- transfer_cross_attention: Transfer learning approach
Training features (see the sketch after this list):

- Frozen Encoders: BERT and ViT parameters frozen during training
- Mixed Precision: FP16 training for memory efficiency
- Gradient Accumulation: Effective batch size scaling
- Learning Rate Scheduling: Cosine annealing with warmup
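A minimal sketch of how these pieces fit together, assuming PyTorch automatic mixed precision. Here `model` stands in for something like the fusion sketch shown earlier and `train_loader` for a `DataLoader` yielding pre-tokenized batches; both are placeholders, and the hyperparameters simply mirror `config/config.yaml`:

```python
import math

import torch
from torch.cuda.amp import GradScaler, autocast
from torch.optim.lr_scheduler import LambdaLR

accum_steps = 4                      # gradient accumulation: effective batch = 16 * 4
warmup_steps, total_steps = 1000, 50_000
criterion = torch.nn.CrossEntropyLoss()

# Frozen encoders: only fusion / decoder parameters receive gradients.
for module in (model.text_encoder, model.image_encoder):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine annealing down to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = GradScaler()                # FP16 mixed-precision loss scaling

for step, batch in enumerate(train_loader):
    with autocast():
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["images"], batch["answer_embeds"])
        loss = criterion(logits.flatten(0, 1), batch["target_ids"].flatten()) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)       # gradient accumulation boundary
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()
```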
The model shows strong performance on:
- Object identification and counting
- Spatial relationship understanding
- Color and attribute recognition
- Scene description
Known limitations include:

- Complex reasoning about multiple objects
- Abstract concept understanding
- Numerical calculations
This project follows PEP 8 guidelines with additional conventions (see the short example after this list):
- Use type hints where possible
- Comprehensive docstrings for all classes and functions
- Modular design with clear separation of concerns
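A short, hypothetical helper illustrating the intended style (type hints plus a full docstring); it is not a function from the repository:

```python
from typing import List


def normalize_answer(answer: str, stopwords: List[str]) -> str:
    """Lowercase an answer string and strip the given stopwords.

    Args:
        answer: Raw answer text from the annotation file.
        stopwords: Tokens to drop before comparison.

    Returns:
        The cleaned answer string.
    """
    tokens = [t for t in answer.lower().split() if t not in stopwords]
    return " ".join(tokens)
```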
Run tests with:
python -m pytest tests/ -v

To contribute:

- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
- He, K., et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022.
- Antol, S., et al. "VQA: Visual Question Answering." ICCV 2015.
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face Transformers library
- PyTorch team
- COCO dataset creators
- MAE authors for pre-trained models
- Author: Vincent Pit
- Email: vincent.pit@example.com
- Project Link: https://github.com/VincentPit/AutoEncoder_VQA
Note: This project is actively maintained. Please check the Issues tab for known problems and the Projects tab for upcoming features.