Cross-Attention Visual-Inertial Odometry (CAVIO) is a PyTorch Lightning and Hydra codebase for training latent-space visual-inertial odometry models on KITTI. The project investigates whether replacing VIFT-style feature concatenation with structured cross-attention improves visual/IMU fusion for camera pose estimation.
VIFT combines visual features from a pretrained encoder with IMU features, then passes the concatenated representation through a transformer pose model. CAVIO keeps the same general training and latent-data workflow, but changes the fusion mechanism: IMU latents query visual latents through cross-attention, followed by causal self-attention over the fused temporal sequence.
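The fusion order described above can be sketched in PyTorch as follows. This is a minimal illustration, not the code in src/models/components/cavio.py: the class name `CrossAttentionFusionSketch`, the residual connection, the layer norm, the single temporal layer, and the 6-value pose head are all assumptions made for the example. It also assumes both latent streams are already projected to the same dimension.

```python
import torch
from torch import nn


class CrossAttentionFusionSketch(nn.Module):
    """Hypothetical sketch: IMU latents query visual latents, then causal self-attention."""

    def __init__(self, dim: int = 512, heads: int = 8, ff_dim: int = 1024):
        super().__init__()
        # Cross-attention: queries come from IMU, keys/values from vision.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # One self-attention block over the fused temporal sequence.
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ff_dim, batch_first=True
        )
        # Assumed output: 6 values per step (3 rotation + 3 translation).
        self.pose_head = nn.Linear(dim, 6)

    def forward(self, imu: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # imu, visual: (batch, seq_len, dim)
        attended, _ = self.cross_attn(query=imu, key=visual, value=visual)
        fused = self.norm(imu + attended)  # residual connection is an assumption

        # Boolean mask: True marks positions a step is NOT allowed to attend to,
        # so each step only sees itself and earlier steps.
        seq_len = fused.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=fused.device),
            diagonal=1,
        )
        out = self.temporal(fused, src_mask=causal_mask)
        return self.pose_head(out)
```

Note that only the self-attention stage is masked in this sketch; whether the cross-attention stage in CAVIO is also restricted in time is not stated here.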
The project includes the main CAVIO transformer, several ablations, configurable pose losses, and a KITTI evaluation harness for trajectory and odometry metrics.
For more background and experimental details, see the CIS4910 literature review and final report.
- Built CAVIOPoseTransformer, a cross-attention VIO model where IMU features query visual features before causal temporal self-attention.
- Added ablation models for IMU-only, gated cross-attention, and visual-residual fusion variants.
- Refactored VIFT-derived training, loss, and evaluation code to reduce duplication and improve readability, maintainability, and reuse.
- Extended plotting scripts and loss metrics to include learning rate and translation/rotation component losses.
- Organized experiments with Hydra presets for baseline, architecture-size, dropout, loss-weighting, and ablation runs.
The strongest CAVIO configuration used a 512-dimensional transformer embedding, 1024-dimensional feed-forward layers, 8 attention heads, and a rotation loss weight of 25. In the final report, this configuration improved selected sequence-level KITTI metrics compared with the reproduced VIFT baseline while remaining competitive overall.
The experiments also showed that vertical trajectory estimation remained difficult: top-down motion was captured more reliably than the y-axis component. The IMU-only ablation performed substantially worse, confirming that visual features contributed meaningful signal even when fusion quality was the main bottleneck.
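To illustrate how a rotation loss weight such as 25 enters a pose objective, here is a minimal pure-Python sketch. It is not the implementation in src/losses/weighted_loss.py and does not reproduce the RPMG-based objectives. The function name `weighted_pose_loss`, the L1 error, and the `[rx, ry, rz, tx, ty, tz]` layout are assumptions chosen for the example.

```python
from typing import Dict, Sequence


def weighted_pose_loss(
    pred: Sequence[float],
    target: Sequence[float],
    rotation_weight: float = 25.0,
) -> Dict[str, float]:
    """Weighted L1 loss for a single 6-value relative pose.

    Assumed (hypothetical) layout: [rx, ry, rz, tx, ty, tz].
    """
    if len(pred) != 6 or len(target) != 6:
        raise ValueError("expected 6 pose values for pred and target")

    abs_err = [abs(p - t) for p, t in zip(pred, target)]
    rotation = sum(abs_err[:3]) / 3.0      # mean absolute rotation error
    translation = sum(abs_err[3:]) / 3.0   # mean absolute translation error

    # Return the components separately so they can be logged alongside the total.
    return {
        "rotation": rotation,
        "translation": translation,
        "total": translation + rotation_weight * rotation,
    }
```

Rotation errors are usually numerically much smaller than translation errors, so a weight above 1 keeps the rotation term from being swamped in the total.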
- src/models/components/cavio.py: main cross-attention transformer architecture
- src/models/components/: CAVIO ablations and VIFT-compatible components
- src/models/weighted_vio_module.py: Lightning module for training and evaluation
- src/losses/weighted_loss.py: weighted pose losses and RPMG-based objectives
- src/metrics/kitti_metrics.py: KITTI odometry metric utilities
- src/testers/: latent KITTI evaluation harness and runner
- src/utils/plotting/: training-loss and trajectory plotting utilities
- configs/: Hydra configuration groups and experiment presets
- scripts/: setup, debug, and batch experiment helpers
From the CAVIO directory:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
./scripts/setup.sh
The main training config is configs/train.yaml, which composes:
- data: latent_kitti_vio
- model: cavio
- logger: many_loggers
- trainer: default
Pass trainer=gpu on the command line to train on a GPU. The evaluation config is configs/eval.yaml.
Run training for a specific experiment:
python src/train.py experiment=cavio_baseline trainer=gpu
Evaluate a checkpoint for a specific experiment:
python src/eval.py experiment=cross_attn_d512_ff1024 trainer=gpu ckpt_path=/path/to/checkpoint.ckpt
Hydra writes each run to a unique output directory using configs/hydra/default.yaml and configs/paths/default.yaml.
Artifacts include:
- model checkpoints
- CSV logs
- TensorBoard logs
- error_metrics.json: final test metrics
- plots/:
  - loss_plot.png: train/validation loss plot
  - component_loss_plot.png: rotation and translation loss plot
  - trajectories.png: KITTI trajectory plots with top-down path and vertical trajectory comparison
This repository is heavily influenced by the VIFT repository design and workflow. Many project patterns, including the training/evaluation structure, configuration style, latent-data flow, and parts of the VIO pipeline, follow that prior codebase and are adapted here for CAVIO experiments.
VIFT repository: https://github.com/ybkurt/vift
Development used Cursor for AI-assisted refactoring, documentation, and tooling. Model design, experiments, and analysis are the author's own.
- Keep CSV logging enabled if you modify the logger layout; loss plotting expects metrics.csv.
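
As a rough illustration of why the plotting step depends on metrics.csv, the sketch below reads one metric column from a CSV log using only the standard library. Lightning's CSV logger typically leaves a cell empty when a metric was not logged at that step, so empty cells are skipped. The function name `read_metric` and the column names `train/loss` and `val/loss` are hypothetical; check the header of your own metrics.csv for the actual names.

```python
import csv
from typing import List, TextIO, Tuple


def read_metric(handle: TextIO, column: str) -> List[Tuple[int, float]]:
    """Return (step, value) pairs for one metric column, skipping empty cells."""
    points: List[Tuple[int, float]] = []
    for row in csv.DictReader(handle):
        value = row.get(column)
        if value is None or value == "":
            # Metric was not logged on this row (or the column does not exist).
            continue
        points.append((int(row["step"]), float(value)))
    return points
```

For example, `with open("metrics.csv", newline="") as f: read_metric(f, "train/loss")` would yield the points for a training-loss curve, assuming that column exists in the log.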