AutoEncoder VQA: Visual Question Answering with Multimodal Architecture

📋 Overview

This project implements a Visual Question Answering (VQA) system that combines a pre-trained BERT encoder for question understanding with a Vision Transformer (ViT) pre-trained as a Masked Autoencoder (MAE) for visual feature extraction. Cross-attention mechanisms fuse the two modalities, and a Transformer decoder generates answers to questions about images.

🏗️ Architecture

The system consists of several key components:

  • Text Encoder: Pre-trained BERT model for question understanding
  • Visual Encoder: Vision Transformer with MAE pre-training for image feature extraction
  • Cross-Attention Layers: Bidirectional attention mechanisms for multimodal fusion
  • Decoder: Transformer decoder with positional embeddings for answer generation
  • Data Loaders: COCO-VQA dataset handling with efficient preprocessing
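
The sketch below illustrates how these components could be wired together in PyTorch. Module names, embedding sizes, and layer counts are assumptions for illustration; the actual implementations live in models/ and visual_embed/.

# Illustrative sketch only; not the exact classes used in this repository.
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Question tokens attend over image patch features."""
    def __init__(self, dim=768, num_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + fused)   # residual connection + layer norm

class VQAModel(nn.Module):
    def __init__(self, text_encoder, visual_encoder, vocab_size, dim=768):
        super().__init__()
        self.text_encoder = text_encoder       # e.g. a pre-trained BERT (frozen)
        self.visual_encoder = visual_encoder   # e.g. an MAE-pre-trained ViT (frozen)
        self.fusion = CrossAttentionFusion(dim)
        decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.answer_head = nn.Linear(dim, vocab_size)

    def forward(self, question_tokens, image, answer_embeds):
        text_feats = self.text_encoder(question_tokens)   # (B, L_q, dim)
        image_feats = self.visual_encoder(image)          # (B, L_v, dim)
        memory = self.fusion(text_feats, image_feats)     # fused multimodal memory
        decoded = self.decoder(tgt=answer_embeds, memory=memory)
        return self.answer_head(decoded)                  # per-token answer logits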

📁 Project Structure

AutoEncoder_VQA/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── config/
│   └── config.yaml              # Configuration settings
├── dataloaders/                 # Data loading and preprocessing
│   ├── __init__.py
│   ├── coco_dataloader.py       # COCO-VQA dataset loader
│   ├── dataloader.py            # Base dataset classes
│   └── mscoco_dataloader.py     # MS-COCO specific loader
├── models/                      # Model architectures
│   ├── __init__.py
│   ├── co_decoder_posi_v4_2.py  # Latest multimodal model
│   ├── positional_embedding.py  # Positional encoding utilities
│   ├── cross_attention_model.py # Cross-attention implementations
│   └── [other model variants]
├── visual_embed/                # Visual encoding components
│   ├── __init__.py
│   ├── models.py                # MAE encoder wrapper
│   ├── models_mae.py            # MAE implementation
│   └── util/                    # Utility functions
├── question_embed/              # Text encoding components
│   ├── __init__.py
│   └── pretrained_bert.py       # BERT utilities
├── trainings/                   # Training scripts
│   ├── train_decoder_posi_v4_2.py # Main training script
│   ├── train.py                 # Basic training
│   └── [other training variants]
├── results/                     # Output results
├── scripts/                     # Utility scripts
└── tests/                       # Unit tests

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended)
  • 16GB+ RAM

Installation

  1. Clone the repository:
git clone https://github.com/VincentPit/AutoEncoder_VQA.git
cd AutoEncoder_VQA
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Download pre-trained models and datasets:
python scripts/download_pretrained.py
python scripts/download_dataset.py

Quick Start

  1. Training a model:
python trainings/train_decoder_posi_v4_2.py --config config/config.yaml
  2. Evaluating a model:
python scripts/evaluate.py --model_path checkpoints/best_model.pth
  3. Interactive inference:
python scripts/interactive.py --model_path checkpoints/best_model.pth

🔧 Configuration

Main configuration parameters in config/config.yaml:

model:
  max_seq_length: 512
  dropout_rate: 0.1
  num_attention_heads: 8
  
training:
  batch_size: 16
  learning_rate: 1e-5
  num_epochs: 50
  warmup_steps: 1000
  
data:
  train_images: "train2014/"
  val_images: "val2014/"
  annotations: "v2_mscoco_train2014_annotations.json"
  questions: "v2_OpenEnded_mscoco_train2014_questions.json"
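
As a rough sketch, a training script might read this file with PyYAML along the following lines (the load_config helper is hypothetical, not an existing function in this repository):

# Hypothetical config-loading helper using PyYAML.
import yaml

def load_config(path: str = "config/config.yaml") -> dict:
    """Parse the YAML configuration into a nested dictionary."""
    with open(path) as f:
        return yaml.safe_load(f)

cfg = load_config()
batch_size = cfg["training"]["batch_size"]
# PyYAML may parse 1e-5 as a string (no decimal point), so cast explicitly.
learning_rate = float(cfg["training"]["learning_rate"])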

📊 Model Performance

Model Version    BLEU-4   CIDEr   Accuracy
v4.2 (Latest)    0.234    0.891   67.3%
v4.0             0.221    0.867   65.1%
v3.0             0.198    0.834   62.8%

🧪 Experiments

Model Variants

  1. co_decoder_posi_v4_2: Latest model with improved cross-attention and positional embeddings
  2. cross_attention_model: Baseline cross-attention architecture
  3. transfer_cross_attention: Transfer learning approach

Training Strategies

  • Frozen Encoders: BERT and ViT parameters frozen during training
  • Mixed Precision: FP16 training for memory efficiency
  • Gradient Accumulation: Effective batch size scaling
  • Learning Rate Scheduling: Cosine annealing with warmup
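
A minimal sketch of how these strategies could be combined in a PyTorch training loop is shown below; attribute names such as model.text_encoder and the batch layout are assumptions, and the actual logic in trainings/ may differ.

# Illustrative training loop: frozen encoders, FP16 autocast,
# gradient accumulation, and cosine annealing with linear warmup.
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def train(model, train_loader, criterion, num_epochs=50, lr=1e-5,
          warmup_steps=1000, accum_steps=4, device="cuda"):
    # Frozen encoders: only fusion/decoder parameters receive gradients.
    for module in (model.text_encoder, model.visual_encoder):
        for p in module.parameters():
            p.requires_grad = False

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)

    total_steps = num_epochs * len(train_loader)

    def lr_lambda(step):
        if step < warmup_steps:                                   # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine annealing

    scheduler = LambdaLR(optimizer, lr_lambda)
    scaler = torch.cuda.amp.GradScaler()                          # mixed precision (FP16)
    model.to(device).train()

    for _ in range(num_epochs):
        for step, (images, questions, answer_in, answer_target) in enumerate(train_loader):
            with torch.cuda.amp.autocast():
                logits = model(questions.to(device), images.to(device), answer_in.to(device))
                loss = criterion(logits.flatten(0, 1), answer_target.to(device).flatten())
            scaler.scale(loss / accum_steps).backward()           # gradient accumulation
            if (step + 1) % accum_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
                scheduler.step()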

📈 Results and Analysis

Qualitative Results

The model shows strong performance on:

  • Object identification and counting
  • Spatial relationship understanding
  • Color and attribute recognition
  • Scene description

Limitations

  • Complex reasoning about multiple objects
  • Abstract concept understanding
  • Numerical calculations

🛠️ Development

Code Style

This project follows PEP 8 guidelines with additional conventions:

  • Use type hints where possible
  • Comprehensive docstrings for all classes and functions
  • Modular design with clear separation of concerns
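
As an illustration of the expected style (a hypothetical helper, not an existing function in this repository):

import torch

def make_padding_mask(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Return a boolean mask that is True at padding positions.

    Args:
        token_ids: Integer tensor of shape (batch, seq_len).
        pad_id: Token id used for padding.
    """
    return token_ids.eq(pad_id)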

Testing

Run tests with:

python -m pytest tests/ -v
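
A minimal example of the kind of shape test that could live under tests/; the module exercised here is a stand-in, not the exact API of this repository.

# tests/test_fusion_shapes.py (illustrative)
import torch

def test_cross_attention_preserves_question_shape():
    attn = torch.nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
    text = torch.randn(2, 16, 768)    # (batch, question tokens, hidden dim)
    image = torch.randn(2, 196, 768)  # (batch, image patches, hidden dim)
    fused, _ = attn(text, image, image)
    assert fused.shape == text.shape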

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📚 References

  1. Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
  2. He, K., et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022.
  3. Antol, S., et al. "VQA: Visual Question Answering." ICCV 2015.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Hugging Face Transformers library
  • PyTorch team
  • COCO dataset creators
  • MAE authors for pre-trained models

📧 Contact

For questions, bug reports, or feature requests, please open an issue on the repository's Issues tab.

Note: This project is actively maintained. Please check the Issues tab for known problems and the Projects tab for upcoming features.
