Variational Bayesian PCA (Ilin and Raiko, 2010) with support for missing data, sparse masks, optional bias terms, and an orthogonal post-rotation to a PCA basis. The implementation follows the original MATLAB reference while adding Python-native APIs, fast C++ extensions for heavy routines, and runtime autotuning for thread/accessor/covariance writeback modes.
Missing values are common in scientific and industrial tabular datasets, but many analysis pipelines either impute first (which can mask uncertainty and affect downstream inference) or drop incomplete samples/features. This pattern is widespread across PCA, matrix factorization, and related dimensionality-reduction workflows. VBPCApy provides a practical Variational Bayesian PCA implementation that models missingness directly and exposes posterior uncertainty outputs (for example, marginal variances and covariance-derived diagnostics) alongside reconstructions. Relative to common impute-then-analyze workflows, this enables uncertainty-aware latent-factor analysis in a single reproducible Python API.
This package targets researchers and practitioners who need:
- robust latent-factor modeling with incomplete observations,
- explicit uncertainty terms (posterior covariances, marginal variances), and
- a Python API suitable for reproducible pipelines while retaining a parity path to legacy workflows.
VBPCApy implements the Ilin & Raiko (2010) VB-PCA formulation with modern Python ergonomics. The default `compat_mode="strict_legacy"` preserves historical behavior; `compat_mode="modern"` is available for updated semantics in selected preprocessing/masking cases.
The package uses optimized C++ kernels.
- Default behavior: dense/sparse/noise/rotate kernels are selected automatically from data and mask structure.
- Build requirement: extension modules must be available in the installed package (source build or wheel).
- Behavioral compatibility: `compat_mode` controls numerical compatibility semantics (`strict_legacy` vs `modern`) and is independent of kernel dispatch.
- Thread control: thread counts can be constrained with environment variables (for example `VBPCA_NUM_THREADS`, plus operation-specific overrides used by some kernels).
In short: backend selection affects runtime, not model semantics.
- Variational Bayesian PCA on dense or sparse data with explicit missing-entry masks.
- Optional bias (per-feature mean) estimation and rotation to a PCA-aligned solution.
- Support for shared observation patterns to reuse factorizations and speed inference.
- Posterior covariances for scores and loadings; probe-set RMS for held-out validation.
- Direct access to reconstructions and per-entry marginal variances from the sklearn-like `VBPCA` estimator.
- C++ extensions via pybind11 for performance-critical routines; runtime autotune selects thread counts, buffered accessors, and covariance writeback modes.
- Missing-aware preprocessing utilities (one-hot encode, standardize, min-max, auto-routing) that preserve NaNs/masks for generative reconstruction.
- `VBPCA`: sklearn-like wrapper (fit/transform/inverse_transform) with mask support.
- An empirical-risk-minimization model selector for the number of PCs that best reconstructs the empirical data.
Stable user-facing APIs are imported from `vbpca_py`:

```python
from vbpca_py import (
    VBPCA,
    select_n_components,
    SelectionConfig,
    AutoEncoder,
    MissingAwareOneHotEncoder,
    MissingAwareSparseOneHotEncoder,
    MissingAwareStandardScaler,
    MissingAwareMinMaxScaler,
)
```

Preprocessing interfaces are first-class public APIs:

- `MissingAwareOneHotEncoder`: categorical encoding with NaN/mask-aware behavior.
- `MissingAwareSparseOneHotEncoder`: sparse categorical encoder (CSR input, numeric categories) that preserves sparsity end-to-end and round-trips via a sparse inverse transform.
- `MissingAwareStandardScaler` / `MissingAwareMinMaxScaler`: continuous scaling on observed entries.
- `AutoEncoder`: mixed-type routing layer with `fit`/`transform`/`inverse_transform`.
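The key property of the missing-aware scalers is that column statistics come from observed entries only. A minimal sketch of that idea in plain NumPy (illustrative only, not the package's internal code; `masked_standardize` is a hypothetical helper):

```python
import numpy as np

def masked_standardize(x, mask):
    """Standardize each column using only observed entries (mask=True)."""
    x = np.asarray(x, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    out = x.copy()
    for j in range(x.shape[1]):
        obs = x[mask[:, j], j]                    # observed values in column j
        mu = obs.mean() if obs.size else 0.0
        sd = obs.std() if obs.size else 1.0
        if sd == 0.0:                             # constant column: leave scale at 1
            sd = 1.0
        out[mask[:, j], j] = (obs - mu) / sd      # missing entries stay untouched
    return out

x = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
mask = ~np.isnan(x)
z = masked_standardize(x, mask)                   # NaN at z[1, 1] is preserved
```

Because statistics are per-column over observed entries, missing values neither bias the mean/scale nor get silently imputed.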
- Python: >= 3.11
- C++ Compiler: C++14 compatible compiler (gcc, clang, MSVC)
- Eigen: Linear algebra library (version 3.x)
- Matplotlib (optional): install via `pip install vbpca_py[plot]` for monitoring displays and plotting utilities
Ubuntu/Debian:

```bash
sudo apt-get install libeigen3-dev
```

macOS (Homebrew):

```bash
brew install eigen
```

Conda/Mamba:

```bash
conda install -c conda-forge eigen
```

Manual installation: download from eigen.tuxfamily.org and set the `EIGEN_INCLUDE_DIR` environment variable:

```bash
export EIGEN_INCLUDE_DIR=/path/to/eigen3
```

Eigen is located automatically via `EIGEN_INCLUDE_DIR`, `$CONDA_PREFIX/include/eigen3`, `/opt/homebrew/include/eigen3`, `/usr/include/eigen3`, or `/usr/local/include/eigen3`.
From source:

```bash
git clone https://github.com/yoavram-lab/VBPCApy.git
cd VBPCApy
pip install .
```

With optional dependencies:

```bash
# Development tools (pytest, ruff, mypy, just)
pip install .[dev]

# Plotting utilities (matplotlib)
pip install .[plot]

# Optional data utilities (pandas)
pip install .[data]

# Benchmark + plotting stack (joblib, pandas, scikit-learn, seaborn)
pip install .[benchmark]

# Optional Octave bridge (only needed to run MATLAB/Octave helpers/tests)
pip install .[octave]

# Install everything
pip install .[dev,plot,data,benchmark,octave]
```

Using uv (recommended for Python env management):

```bash
# Sync core developer environment
uv sync --extra dev --extra data

# Include benchmark + plotting dependencies
uv sync --extra dev --extra data --extra benchmark

# Include optional Octave Python bridge packages
uv sync --extra dev --extra data --extra benchmark --extra octave
```

Dense quickstart:

```python
import numpy as np
from vbpca_py import VBPCA, AutoEncoder

# 50 features, 200 samples
x = np.random.randn(50, 200)

# Optional mask (1 = observed, 0 = missing); omit for fully observed data
mask = np.ones_like(x)

model = VBPCA(n_components=5, maxiters=100)
scores = model.fit_transform(x, mask=mask)
recon = model.inverse_transform()

# Access reconstruction + marginal variance directly from the estimator
recon = model.reconstruction_
var = model.variance_

# Inspect the resolved options (defaults + your overrides)
print(model.get_options())

# Learning-curve summaries
print("RMS", model.rms_)
print("Probe RMS", model.prms_)
print("Final cost", model.cost_)

# Missing-aware preprocessing pipeline (categorical + continuous)
auto = AutoEncoder(cardinality_threshold=10, continuous_scaler="standard")
z = auto.fit_transform(x, mask=mask)
model = VBPCA(n_components=5, maxiters=100)
scores = model.fit_transform(z, mask=np.ones_like(z, dtype=bool))
z_recon = model.inverse_transform()
x_recon = auto.inverse_transform(z_recon)
```

Sparse quickstart:

```python
import scipy.sparse as sp
from vbpca_py import VBPCA

# CSR input with explicit stored zeros where observed
x_sparse = sp.csr_matrix([[1.0, 0.0], [0.0, 2.0]])

# Mask for sparse must match the sparsity pattern (spones(X)); omit to infer from X
mask = x_sparse.copy()
mask.data[:] = 1.0

model = VBPCA(n_components=2, maxiters=100)
scores = model.fit_transform(x_sparse, mask=mask)
```

- Sparse inputs must be CSR/CSC; they remain sparse throughout computation.
- For sparse data, the observation set is the stored entries of `X` (including stored zeros); if you pass a mask it must match `spones(X)` exactly.
- For dense data, pass a 0/1 mask of the same shape; a mask is required for dense inputs whenever any entries are missing.
- `transform()` only returns training scores; it does not accept new data. Use `fit_transform` on the training set.
- To encode wide numeric categorical columns sparsely, use `MissingAwareSparseOneHotEncoder` (one column at a time, CSR input) and keep the mask `None`.
- `mask` / `pattern_index`: handle missing entries and reuse observation patterns.
- `bias`: toggle mean estimation; `init`: control initial factors.
- `probe`: pass probe data/masks to monitor held-out RMS during fitting.
- `maxiters`, `tol`, `verbose`: convergence control and logging.
- `rotation`: final orthogonal rotation to a PCA-aligned solution.
- `compat_mode`: compatibility policy for sparse empty-row/column handling (`strict_legacy` default, `modern` available).
- `runtime_tuning` / `num_cpu`: runtime policy and threading controls. `runtime_tuning="safe"` (default) runs a short probe to pick threads, accessor mode (`legacy` vs `buffered`), and covariance writeback (`python`, `bulk`, `kernel`) based on measured speed. `runtime_tuning="aggressive"` tries a wider search. Per-kernel env vars include `VBPCA_NUM_THREADS`, `VBPCA_SCORE_THREADS`, `VBPCA_LOADINGS_THREADS`, `VBPCA_NOISE_THREADS`, `VBPCA_RMS_THREADS`.
- `auto_pattern_masked`: when true, reuse dense mask patterns even with `uniquesv=0` to reduce repeated per-column score covariance work (default off for parity).
- Default runtime behavior uses `runtime_tuning="safe"` to measure and choose `num_cpu`, accessor mode, and covariance writeback mode; you can still pin `num_cpu` explicitly or override with env vars.
- `runtime_tuning="aggressive"` expands the search if you want maximum throughput and can tolerate a slightly longer probe.
- Fast sweep preset: use `runtime_tuning="safe"`, `SelectionConfig(compute_explained_variance=False, patience=2, max_trials=5)`, and cap the k sweep to a modest window (e.g., 25–45 for tall/wide matrices).
- The cultural replay script exposes `--fast-mode`, `--runtime-tuning`, and `--num-cpu` to apply these defaults without code changes.
- Stable public imports are those re-exported from `vbpca_py` in `src/vbpca_py/__init__.py`.
- Modules and symbols prefixed with `_` are internal implementation details and may change without deprecation.
- For forward compatibility, prefer `from vbpca_py import ...` over importing from internal modules.
Each fit (including every k tried in `select_n_components`) runs the PCA_FULL EM loop until one of these criteria triggers or `maxiters` is reached:

- Subspace angle below `minangle` (default `1e-8`).
- Probe RMS increase when `earlystop` is truthy.
- RMS plateau via `rmsstop = [window, abs_tol, rel_tol]` (default `[100, 1e-4, 1e-3]`, enabled). Meaning: compare the latest RMS to the value `window` iterations ago; stop if the absolute change is < `abs_tol` or, when finite, the relative change is < `rel_tol`.
- Cost plateau via `cfstop = [window, abs_tol, rel_tol]` (default `[]`, disabled). Same interpretation as `rmsstop` but on cost.
- Slowing-down guard when internal backtracking hits 40 steps.
- Hard cap `maxiters` (default 1000).
Notes:

- `niter_broadprior` (default 100) suppresses stopping messages while the broad-prior warmup runs when `use_prior` is on.
- All options are case-insensitive and passed through the `VBPCA` constructor (or forwarded by `select_n_components`).
- Reference implementation lives in src/vbpca_py/_pca_full.py and src/vbpca_py/_converge.py.
All public APIs can be imported directly from `vbpca_py`:

```python
from vbpca_py import (
    VBPCA,
    select_n_components,
    SelectionConfig,
    AutoEncoder,
    MissingAwareOneHotEncoder,
    MissingAwareStandardScaler,
    MissingAwareMinMaxScaler,
)
```

Core estimator: `VBPCA(n_components, bias=True, maxiters=None, tol=None, verbose=0, **opts)` with `fit`, `transform`, `fit_transform`, `inverse_transform`, learned attributes (`components_`, `scores_`, `mean_`, `rms_`, `prms_`, `cost_`, `variance_`, `reconstruction_`, `explained_variance_`, `explained_variance_ratio_`), and `get_options()` to inspect merged defaults.
Model selection: `select_n_components(x, *, mask=None, components=None, config=None, **opts)` sweeps ks and returns `(best_k, best_metrics, trace, best_model)`. `components` defaults to `1..min(n_features, n_samples)`. `SelectionConfig(metric="prms"|"rms"|"cost", patience=None, max_trials=None, compute_explained_variance=True, return_best_model=False)` controls sweep stopping and retention. `**opts` flow through to `VBPCA`/`pca_full` (e.g., `maxiters`, `minangle`, `rmsstop`, `cfstop`, `earlystop`, `rotate2pca`). `trace` holds per-k endpoint metrics; `best_metrics` is the winning entry.
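The `patience` / `max_trials` behavior can be pictured with a generic sweep loop. This is a hypothetical sketch, not the package's `model_selection.py` code; `metric_for_k` stands in for fitting `VBPCA` at a given k and reading the endpoint metric:

```python
def sweep_components(ks, metric_for_k, patience=2, max_trials=None):
    """Sketch of a patience-based k sweep (illustrative only): stop after
    `patience` consecutive ks without improvement, or after `max_trials`
    fits, and keep a per-k trace of endpoint metrics."""
    best_k, best_val, trace = None, float("inf"), []
    since_improved = 0
    for i, k in enumerate(ks):
        if max_trials is not None and i >= max_trials:
            break
        val = metric_for_k(k)          # stand-in for one VBPCA fit at this k
        trace.append((k, val))
        if val < best_val:             # lower metric = better (e.g., prms)
            best_k, best_val, since_improved = k, val, 0
        else:
            since_improved += 1
            if patience is not None and since_improved >= patience:
                break                  # no improvement for `patience` ks
    return best_k, best_val, trace

# Toy metric with a minimum at k=4; the sweep stops two ks past the minimum.
best_k, best_val, trace = sweep_components(range(1, 10), lambda k: (k - 4) ** 2)
```

The real selector additionally retains the winning fitted model when `return_best_model=True`.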
Preprocessing: missing-aware encoders and scalers.

- `MissingAwareOneHotEncoder`: categorical OHE respecting masks/NaNs; `handle_unknown="ignore"|"raise"`, optional mean-centering.
- `MissingAwareStandardScaler` and `MissingAwareMinMaxScaler`: continuous scaling while ignoring masked entries.
- `AutoEncoder`: column-wise router that applies the above per column with `cardinality_threshold` (integer columns with uniques <= threshold are treated as categorical), `continuous_scaler` (`"standard"` or `"minmax"`), `handle_unknown` (ignore vs raise on unseen categories), `mean_center_ohe` (center one-hot columns), an optional `column_types` override (force categorical/continuous per column), and `fit`/`transform`/`inverse_transform` for round-tripping mixed data with masks.
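The `cardinality_threshold` routing rule is easy to demonstrate standalone. The function below is a hypothetical sketch of that decision (integer-valued columns with few unique observed values go categorical), not the `AutoEncoder` internals:

```python
import numpy as np

def infer_column_types(x, cardinality_threshold=10):
    """Sketch of AutoEncoder-style column routing (illustrative only):
    integer-valued columns with <= threshold unique observed values are
    treated as categorical; everything else as continuous."""
    types = []
    for j in range(x.shape[1]):
        obs = x[:, j][~np.isnan(x[:, j])]             # observed entries only
        is_integer = obs.size > 0 and bool(np.all(obs == np.round(obs)))
        if is_integer and np.unique(obs).size <= cardinality_threshold:
            types.append("categorical")
        else:
            types.append("continuous")
    return types

x = np.array([
    [0.0, 1.7, 2.0],
    [1.0, 2.3, np.nan],
    [0.0, 9.9, 4.0],
])
types = infer_column_types(x, cardinality_threshold=3)
```

Note that missing entries are excluded before counting uniques, so a NaN never forces a column to continuous; the real `column_types` option lets you override this inference per column.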
All options are consumed via the `VBPCA` estimator. Call `model.get_options()` after construction to view the merged defaults and your overrides. The canonical reference list lives in src/vbpca_py/_pca_full.py. See src/vbpca_py/estimators.py, src/vbpca_py/model_selection.py, and src/vbpca_py/preprocessing.py for the stable public APIs.
The package includes missing-aware preprocessing and an autoencoder-style inverse transform to map back to the original feature space:
```python
import numpy as np
from vbpca_py import AutoEncoder, VBPCA

# x, mask as in the quickstart above
auto = AutoEncoder(cardinality_threshold=10, continuous_scaler="standard")
z = auto.fit_transform(x, mask=mask)       # encodes continuous + categorical
model = VBPCA(n_components=5, maxiters=100)
scores = model.fit_transform(z, mask=np.ones_like(z, dtype=bool))
z_recon = model.inverse_transform()
x_recon = auto.inverse_transform(z_recon)  # decode back to original space
```

Choosing between sparse and dense input:

| Scenario | Format | Why |
|---|---|---|
| High-dimensional data with structural zeros (genomics count matrices, one-hot-encoded surveys) | Sparse CSR/CSC | The CSR sparsity pattern acts as an implicit observation mask; sparse kernels avoid materialising the full matrix. |
| Moderate dimensions with random missingness (NaN-masked tabular data) | Dense + explicit mask | Pass a boolean mask of observed entries; dense kernels benefit from BLAS. |
Key API difference:

- Sparse: the observation mask is inferred from stored entries (`mask = spones(X)`).
- Dense: the observation mask must be provided explicitly (`mask = ~np.isnan(X)`).
Install the optional plotting extra:

```bash
pip install vbpca_py[plot]
# or
uv sync --extra plot
```

Three convenience functions are provided:

```python
from vbpca_py.plotting import scree_plot, loadings_barplot, variance_explained_plot

model = VBPCA(n_components=5, maxiters=100)
model.fit(x, mask=mask)

scree_plot(model, cumulative=True)              # explained variance per component
loadings_barplot(model, component=0, top_n=10)  # feature importance for one PC
variance_explained_plot(model)                  # absolute variance per component
```

The project includes several benchmarking recipes via just:
| Recipe | Description |
|---|---|
| `just bench` | Kernel-level timing via pytest-benchmark |
| `just bench-scale` | Timing across increasing matrix sizes |
| `just bench-octave` | Python vs Octave comparison (requires octave and `uv sync --extra octave`) |
| `just bench-save` / `just bench-compare` | Save and compare baselines |
| `just bench-study-repro` | Validate deterministic reproducibility |
For Octave benchmarks, install Octave first:

```bash
# Ubuntu/Debian
sudo apt-get install octave octave-dev

# macOS
brew install octave
```

Then sync the Octave extra: `uv sync --extra octave`.
- `transform(new_data)` is not implemented; only training scores are returned. To project new data, refit on the combined dataset.
- `inverse_transform()` always returns dense output, even when the input was sparse CSR/CSC.
- `MissingAwareSparseOneHotEncoder` requires numeric categories; string categories cannot survive the CSR round-trip.
- Data convention: `AutoEncoder` expects samples × features; `VBPCA` expects features × samples. Transpose as needed.
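The shape convention in the last bullet is a common stumbling block; a quick NumPy check makes it concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(200, 50))   # 200 samples x 50 features (AutoEncoder convention)

x_for_vbpca = table.T                # 50 features x 200 samples (VBPCA convention)
assert x_for_vbpca.shape == (50, 200)

back = x_for_vbpca.T                 # transpose again to return to table layout
assert back.shape == (200, 50)
```

If a pipeline feeds `AutoEncoder` output straight into `VBPCA`, remember to transpose between the two (and transpose the mask the same way).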
Install in developer mode:

```bash
pip install -e . pytest-cov
```

The project uses just as a command runner. Install it separately:

```bash
# macOS
brew install just

# Linux
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/bin

# Or via cargo
cargo install just
```

List all available recipes:
```bash
just help
```

Run lint check:

```bash
just lint
```

Run type checking:

```bash
just typecheck
```

Run the test suite:

```bash
just test
```

Run tests with coverage:

```bash
just test-cov
```

Run performance benchmarks (excluded from default CI):

```bash
# full perf suite
just bench

# scaling-only suite
just bench-scale

# Python vs Octave compare suite
just bench-octave
```

Validate deterministic reproducibility for a fixed-seed pilot setting:
```bash
just bench-study-repro
```

`pca_full(..., runtime_report=1)` can be used to include a RuntimeReport diagnostic block showing resolved per-kernel thread values and their sources.
`just test` runs the standard suite and may skip optional Octave-backed parity tests if Octave tooling is unavailable.

To run all parity tests (including sparse MEX-backed regression paths), you must install:

- `octave`
- `octave-dev` (provides `mkoctfile` on Ubuntu/Debian)

Example (Ubuntu/Debian):

```bash
sudo apt-get install octave octave-dev
```

Then run:

```bash
just test-all
```

This command rebuilds host-specific MEX helpers in tools/ before testing. If you previously ran tests on another OS/architecture, clear stale binaries before rerunning:

```bash
just mex-clean
```

Run the entire validation pipeline (lint -> typecheck -> test):

```bash
just ci
```

Legacy MATLAB/Octave helpers (in tools/) are optional; they require Octave installed plus the octave extra if you want to call them from Python or run any Octave-dependent tests.
Contributions are welcome! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes with clear commit messages
- Run the test suite (`just ci`) to ensure all tests pass
- Submit a pull request

For major changes, please open an issue first to discuss what you would like to change.

We are committed to providing a welcoming and inclusive environment. Please:

- Be respectful and constructive in all interactions
- Follow the project's coding style (enforced by `ruff`)
- Write tests for new functionality
If you use this package in your research, please cite:

For the implementation:

```bibtex
@software{vbpca_py2026,
  author = {Naim, Shany and Macdonald, Joshua and Ram, Yoav},
  title = {{VBPCApy}: Variational Bayesian PCA with Missing Data Support},
  year = {2026},
  url = {https://github.com/yoavram-lab/VBPCApy},
  version = {0.1.0},
}
```

For the algorithm:

```bibtex
@article{ilin2010practical,
  title={Practical Approaches to Principal Component Analysis in the Presence of Missing Values},
  author={Ilin, Alexander and Raiko, Tapani},
  journal={Journal of Machine Learning Research},
  volume={11},
  pages={1957--2000},
  year={2010}
}
```

See CITATION.cff for machine-readable citation metadata.
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation is based on the Variational Bayesian PCA algorithm described by Ilin and Raiko (2010), with inspiration from the original MATLAB reference implementation.