Deep-UQ Development Roadmap¶

This document defines the complete implementation plan for making Deep-UQ the definitive uncertainty quantification toolkit for deep learning. Each phase builds on the previous, with clear deliverables, implementation guidance, and acceptance criteria.

Current State (v0.1.x)¶

Methods: Deep Ensembles, VI (BBB + Last-Layer), Laplace (6 Hessian structures), SGLD, MC Dropout, GPs (exact/sparse/multitask/deep kernel/spectral mixture/heteroscedastic/classification), Conformal Prediction (split, CQR, classification)

Models: MLP, PINN, CNN, ResNet, UNet, DeepONet (1D/2D), FNO (2D/3D), GNO, Diffusion

API: Unified predict_uq() → UQResult(mean, epistemic_var, aleatoric_var, total_var)
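
The four fields can be read through the law of total variance. The sketch below is illustrative only: `UQResultSketch` and `combine_members` are hypothetical names, plain floats stand in for tensors, and it assumes each ensemble member predicts a Gaussian mean and variance for the same input.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class UQResultSketch:
    """Hypothetical stand-in for UQResult, holding scalars instead of tensors."""
    mean: float
    epistemic_var: float
    aleatoric_var: float
    total_var: float


def combine_members(member_means, member_vars):
    """Combine per-member Gaussian predictions for a single input."""
    mean = fmean(member_means)
    epistemic = pvariance(member_means)   # disagreement between members
    aleatoric = fmean(member_vars)        # average predicted noise level
    return UQResultSketch(mean, epistemic, aleatoric, epistemic + aleatoric)


result = combine_members([1.0, 1.2, 0.8], [0.10, 0.12, 0.08])
```

Here total_var is the sum of the two components, which is what lets downstream metrics consume either the split or the total.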

Phase 1: Foundation & Calibration (v0.2.x)¶

Timeline: 6–8 weeks
Goal: Fill the most-requested gaps and add proper evaluation infrastructure.

1.1 SWAG (Stochastic Weight Averaging — Gaussian)¶

What: Collect first and second moments of SGD trajectory after convergence; use them as a Gaussian posterior approximation.

Why critical: Bayesian approximation from a single training run. Cheaper than deep ensembles and typically better calibrated than MC Dropout. A commonly requested method that many UQ toolkits lack.

Implementation plan:

src/deepuq/methods/swag.py
├── class SWAGCollector
│   ├── __init__(model, max_rank=20, collection_freq=1)
│   ├── collect(model)          # call after each epoch in SWA phase
│   ├── finalize()              # compute low-rank + diagonal covariance
│   └── state_dict / load_state_dict
├── class SWAGWrapper
│   ├── __init__(base_model, swag_collector, ...)
│   ├── sample_parameters(scale=1.0, diag_noise=True)
│   ├── predict_uq(x, n_samples=30) → UQResult
│   └── log_prob(x, y)         # Monte Carlo estimate of the predictive log-likelihood
└── class MultiSWAG
    ├── __init__(swag_wrappers: list)
    └── predict_uq(x, n_samples_per_model=10) → UQResult

Key design decisions:

Store running mean, running squared mean (diagonal), and a low-rank deviation matrix (columns = θ_i - θ_mean)
Low-rank matrix capped at max_rank columns (default 20); older deviations discarded FIFO
Sampling: θ ~ N(θ_mean, 0.5 * Σ_diag + 0.5 * Σ_lowrank) per Maddox et al. 2019
MultiSWAG: run SWAG collection from K different initializations → combine predictions
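
The bookkeeping and sampling rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the planned `SWAGCollector`: `SwagMoments` is a hypothetical name, parameters are a flat list of floats, and `scale` multiplies the covariance.

```python
import math
import random


class SwagMoments:
    """Running first/second moments plus a FIFO low-rank deviation buffer."""

    def __init__(self, n_params, max_rank=20):
        self.n = 0
        self.mean = [0.0] * n_params
        self.sq_mean = [0.0] * n_params
        self.deviations = []          # each entry is theta_i - running mean
        self.max_rank = max_rank

    def collect(self, theta):
        self.n += 1
        for j, t in enumerate(theta):
            self.mean[j] += (t - self.mean[j]) / self.n
            self.sq_mean[j] += (t * t - self.sq_mean[j]) / self.n
        self.deviations.append([t - m for t, m in zip(theta, self.mean)])
        if len(self.deviations) > self.max_rank:
            self.deviations.pop(0)    # discard the oldest column (FIFO)

    def sample(self, rng, scale=1.0):
        k = len(self.deviations)
        diag = [max(s - m * m, 0.0) for s, m in zip(self.sq_mean, self.mean)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(k)]
        out = []
        for j, m in enumerate(self.mean):
            # Diagonal half of the covariance: 0.5 * Sigma_diag
            d_term = math.sqrt(0.5 * scale * diag[j]) * rng.gauss(0.0, 1.0)
            # Low-rank half: 0.5 * D D^T / (K - 1)
            lr_term = 0.0
            if k > 1:
                lr_term = math.sqrt(0.5 * scale / (k - 1)) * sum(
                    self.deviations[c][j] * z2[c] for c in range(k)
                )
            out.append(m + d_term + lr_term)
        return out


rng = random.Random(0)
moments = SwagMoments(n_params=2, max_rank=3)
for theta in ([1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0], [1.1, 1.9]):
    moments.collect(theta)
theta_sample = moments.sample(rng)
```

Each stored deviation is as large as the parameter vector itself, so the buffer dominates memory for any non-trivial max_rank.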

How a scientist implements this:

Train your model normally to convergence
Switch learning rate to a constant (SWA learning rate)
Call collector.collect(model) after every epoch for N additional epochs
Call collector.finalize()
Wrap: swag = SWAGWrapper(model, collector)
Call swag.predict_uq(x_test)

Acceptance criteria:

Matches published SWAG results on UCI regression (within 5% NLL)
Works with all model architectures (MLP, FNO, DeepONet, etc.)
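Memory overhead ≤ (max_rank + 2)x model size, i.e. about 22x for rank-20 (running mean, squared mean, and 20 parameter-sized deviation columns)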
predict_uq() returns valid UQResult

References: Maddox et al., "A Simple Baseline for Bayesian Uncertainty in Deep Learning" (NeurIPS 2019)

1.2 Post-Hoc Calibration Methods¶

What: Methods that adjust a trained model's confidence outputs to be calibrated, without retraining.

Why critical: Nearly every deployed model needs post-hoc calibration. Temperature scaling is the minimum viable UQ for production classification.

Implementation plan:

src/deepuq/methods/calibration/
├── __init__.py
├── _temperature.py
│   ├── class TemperatureScaling
│   │   ├── __init__(model)
│   │   ├── fit(val_loader)              # optimize T on validation NLL
│   │   ├── predict_calibrated(x) → probs
│   │   └── predict_uq(x) → UQResult
│   └── class VectorScaling
│       └── (per-class temperature + bias)
├── _isotonic.py
│   ├── class IsotonicCalibration
│   │   ├── fit(val_loader)              # fit isotonic regression per class
│   │   └── predict_calibrated(x) → probs
│   └── class BetaCalibration
├── _histogram.py
│   └── class HistogramBinning
└── _focal.py
    └── class FocalLossCalibration       # focal loss as implicit calibration

How a scientist implements this:

from deepuq.methods.calibration import TemperatureScaling

# After training
ts = TemperatureScaling(trained_model)
ts.fit(val_loader)  # learns optimal temperature on held-out data

result = ts.predict_uq(x_test)
# result.mean = calibrated probabilities
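
What fit() has to do can be sketched without any framework: minimize validation NLL over a single scalar T. The helper names below are hypothetical, and the golden-section search is just one convenient 1-D optimizer (it relies on NLL being convex in 1/T, hence unimodal in T); the actual class may well use a gradient-based optimizer instead.

```python
import math


def nll_at_temperature(logits, labels, temperature):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [v / temperature for v in row]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_z - scaled[y]
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log T within [lo, hi]."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        nll_c = nll_at_temperature(logits, labels, math.exp(c))
        nll_d = nll_at_temperature(logits, labels, math.exp(d))
        if nll_c < nll_d:
            b = d
        else:
            a = c
    return math.exp(0.5 * (a + b))


# Overconfident toy classifier: large logit gaps, one of four predictions wrong.
val_logits = [[8.0, 0.0], [0.0, 8.0], [8.0, 0.0], [8.0, 0.0]]
val_labels = [0, 1, 0, 1]
T = fit_temperature(val_logits, val_labels)
```

On this toy set the fitted temperature is well above 1, which softens the overconfident probabilities without changing the predicted class.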

Acceptance criteria:

Temperature scaling reduces ECE by >50% on CIFAR-10 ResNet
Isotonic calibration reaches near-zero ECE on the validation set it is fit on
All methods work for both binary and multi-class classification
Integrates with existing UQResult (calibrated probabilities as mean, entropy as variance)

1.3 Calibration & Evaluation Metrics Module¶

What: Comprehensive metrics for evaluating UQ quality.

Why critical: You can't improve what you can't measure. Every paper needs these metrics.

Implementation plan:

src/deepuq/metrics/
├── __init__.py
├── calibration.py
│   ├── expected_calibration_error(probs, labels, n_bins=15) → float
│   ├── maximum_calibration_error(probs, labels, n_bins=15) → float
│   ├── adaptive_calibration_error(probs, labels, n_bins=15) → float
│   ├── reliability_diagram(probs, labels) → (bin_confs, bin_accs, bin_counts)
│   ├── calibration_curve_regression(predicted_var, residuals, quantiles) → coverage
│   └── prediction_interval_coverage(lower, upper, y_true) → float
├── scoring.py
│   ├── negative_log_likelihood(mean, var, y_true) → float
│   ├── continuous_ranked_probability_score(mean, var, y_true) → float
│   ├── brier_score(probs, labels) → float
│   ├── log_score(probs, labels) → float
│   ├── interval_score(lower, upper, y_true, alpha) → float
│   └── energy_score(samples, y_true) → float
├── sharpness.py
│   ├── mean_prediction_interval_width(lower, upper) → float
│   ├── coefficient_of_variation(mean, var) → float
│   └── sharpness_calibration_tradeoff(mean, var, y_true) → dict
├── ood.py
│   ├── auroc_ood(in_scores, out_scores) → float
│   ├── auprc_ood(in_scores, out_scores) → float
│   ├── fpr_at_tpr(in_scores, out_scores, tpr=0.95) → float
│   └── detection_error(in_scores, out_scores) → float
├── selective.py
│   ├── risk_coverage_curve(uncertainties, errors) → (coverage, risk)
│   ├── aurc(uncertainties, errors) → float          # area under risk-coverage
│   ├── eaurc(uncertainties, errors) → float         # excess AURC
│   └── selective_accuracy(uncertainties, errors, coverage=0.8) → float
└── visualization.py
    ├── plot_reliability_diagram(probs, labels, ax=None) → matplotlib.Axes
    ├── plot_calibration_regression(predicted_var, residuals, ax=None)
    ├── plot_risk_coverage(uncertainties, errors, ax=None)
    └── plot_uncertainty_histogram(epistemic, aleatoric, ax=None)

How a scientist uses this:

from deepuq.metrics import (
    expected_calibration_error,
    continuous_ranked_probability_score,
    auroc_ood,
    plot_reliability_diagram,
)

# After getting predictions
result = model.predict_uq(x_test)

# Regression metrics
crps = continuous_ranked_probability_score(result.mean, result.total_var, y_test)

# Classification calibration
ece = expected_calibration_error(result.mean, y_test, n_bins=15)
plot_reliability_diagram(result.mean, y_test)

# OOD detection
in_uncertainty = model.predict_uq(x_in).epistemic_var.mean(dim=-1)
out_uncertainty = model.predict_uq(x_ood).epistemic_var.mean(dim=-1)
auroc = auroc_ood(in_uncertainty, out_uncertainty)
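
For reference, the ECE computation itself is short. The function below is a hedged sketch with a hypothetical name (`ece_equal_width`), operating on nested lists rather than tensors; it assumes equal-width bins over the top-class confidence, with bins closed on the left.

```python
def ece_equal_width(probs, labels, n_bins=15):
    """Expected calibration error over equal-width confidence bins."""
    # Per bin: [count, summed confidence, summed correctness]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for row, y in zip(probs, labels):
        conf = max(row)
        pred = row.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += float(pred == y)
    n = len(labels)
    # (n_b / n) * |acc_b - conf_b| simplifies to |sum_correct - sum_conf| / n
    return sum(abs(correct - conf) for cnt, conf, correct in bins if cnt) / n


probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
labels = [0, 1, 0, 0]
ece = ece_equal_width(probs, labels, n_bins=10)
```

In this example the 0.9-confidence bin is only 50% accurate and the 0.6-confidence bin is 100% accurate, so both contribute to the error.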

Acceptance criteria:

ECE implementation matches published values on standard benchmarks
CRPS computed correctly for Gaussian predictive distributions
All metrics accept both UQResult and raw tensors
Visualizations produce publication-quality figures
100% test coverage on edge cases (zero variance, single sample, etc.)
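
As a reference point for the CRPS criterion, a Gaussian predictive N(μ, σ²) has the closed form CRPS = σ [ z (2Φ(z) − 1) + 2φ(z) − 1/√π ] with z = (y − μ)/σ. A scalar sketch follows; the name `gaussian_crps` is illustrative, and the zero-variance edge case is deliberately left unguarded to show why it needs a test.

```python
import math


def gaussian_crps(mean, var, y_true):
    """Closed-form CRPS of N(mean, var) at observation y_true (var > 0)."""
    sigma = math.sqrt(var)
    z = (y_true - mean) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


crps_centered = gaussian_crps(0.0, 1.0, 0.0)
crps_sharp = gaussian_crps(0.0, 1e-12, 1.0)   # approaches |y - mean| as var -> 0
```

As the variance shrinks, CRPS tends to the absolute error, which makes it a natural generalization of MAE to probabilistic forecasts.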

1.4 Selective Prediction Module¶

What: Reject predictions where uncertainty is too high; report accuracy only on accepted predictions.

src/deepuq/methods/selective.py
├── class SelectivePredictor
│   ├── __init__(model, criterion="epistemic_var")
│   ├── predict_with_rejection(x, threshold=None, coverage=0.8)
│   │   → (predictions, mask_accepted, uncertainties)
│   ├── find_threshold(val_loader, target_coverage=0.8) → float
│   └── evaluate(test_loader) → SelectiveMetrics
└── class SelectiveMetrics
    ├── coverage: float
    ├── selective_accuracy: float
    ├── rejection_rate: float
    └── risk_coverage_auc: float
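
The core of find_threshold and the resulting selective accuracy can be sketched as follows. These are free functions with assumed signatures (lists of per-sample uncertainties and 0/1 correctness flags), not the planned SelectivePredictor methods.

```python
import math


def find_threshold(uncertainties, target_coverage=0.8):
    """Smallest uncertainty threshold that accepts the target fraction."""
    ranked = sorted(uncertainties)
    n_accept = max(1, math.ceil(target_coverage * len(ranked)))
    return ranked[n_accept - 1]


def selective_accuracy(uncertainties, correct, threshold):
    """Return (coverage, accuracy on accepted samples)."""
    accepted = [c for u, c in zip(uncertainties, correct) if u <= threshold]
    coverage = len(accepted) / len(correct)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


uncertainties = [0.1, 0.9, 0.2, 0.8, 0.3]
correct = [1, 0, 1, 1, 1]
threshold = find_threshold(uncertainties, target_coverage=0.8)
coverage, accuracy = selective_accuracy(uncertainties, correct, threshold)
```

Rejecting the single most uncertain sample removes the only error here, which is the behavior a useful uncertainty estimate should show.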

Phase 2: Scalable Production Methods (v0.3.x)¶

Timeline: 6–8 weeks
Goal: Methods that scale to real models (ResNets, Transformers, foundation models).

2.1 SNGP (Spectral-Normalized Neural Gaussian Process)¶

What: Replace the last layer with a random-feature GP, apply spectral normalization to hidden layers to preserve distance awareness.

Why critical: Developed at Google Research with production-scale models in mind. Single forward pass. No ensembles needed. Works on ImageNet-scale models.

Implementation plan:

src/deepuq/methods/sngp.py
├── class SpectralNormalization
│   └── Wrapper that applies spectral norm to all Linear/Conv layers
├── class RandomFeatureGPLayer
│   ├── __init__(in_features, num_classes, num_random_features=1024)
│   ├── reset_covariance()      # call at start of each epoch
│   ├── update_covariance(x)    # accumulate precision matrix
│   ├── forward(x) → (logits, covariance)
│   └── predict_uq(x) → UQResult
└── class SNGPWrapper
    ├── __init__(base_model, last_layer_name, num_random_features=1024, ...)
    ├── apply_spectral_norm(norm_bound=6.0)
    ├── fit(train_loader)       # accumulate covariance during one pass
    └── predict_uq(x) → UQResult

Key design decisions:

Spectral normalization with configurable bound (default 6.0) on all hidden layers
Random Fourier Features (RFF) for the GP approximation (not inducing points)
Mean-field approximation to fold the posterior variance into the softmax predictive (variance-scaled logits)
During training: accumulate precision matrix in a streaming fashion
During inference: single forward pass → logits + variance from Laplace approx on last layer
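
One way the per-input variance can enter the class probabilities in a single pass is the mean-field adjustment, which divides the logits by sqrt(1 + λ·σ²). The sketch below assumes a scalar variance per input and λ = π/8; `mean_field_probs` and its parameter names are hypothetical.

```python
import math


def mean_field_probs(logits, variance, mean_field_factor=math.pi / 8.0):
    """Softmax of logits scaled down according to predictive variance."""
    scale = math.sqrt(1.0 + mean_field_factor * variance)
    scaled = [v / scale for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


near = mean_field_probs([4.0, 0.0], variance=0.1)    # close to training data
far = mean_field_probs([4.0, 0.0], variance=50.0)    # far from training data
```

With identical logits, the high-variance input ends up with a much flatter distribution, which is the distance-aware behavior SNGP is after.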

How a scientist implements this:

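import torch.nn.functional as F

from deepuq.methods import SNGPWrapper

# Wrap any pre-trained or new model
sngp = SNGPWrapper(model, last_layer_name="fc3", num_random_features=1024)
sngp.apply_spectral_norm(norm_bound=6.0)

# Train normally (SNGP modifies forward pass).
# `optimizer` is any torch.optim optimizer over the wrapped model's parameters.
for epoch in range(n_epochs):
    sngp.reset_covariance()  # restart the precision estimate each epoch
    for x, y in train_loader:
        optimizer.zero_grad()  # clear gradients left over from the previous step
        logits, cov = sngp(x)
        loss = F.cross_entropy(logits, y)
        loss.backward()
        optimizer.step()
        sngp.update_covariance(x)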

# Inference
result = sngp.predict_uq(x_test)

Acceptance criteria:

Matches published SNGP AUROC on CIFAR-10 vs SVHN OOD benchmark
Single forward pass inference (no sampling)
Compatible with CNN, ResNet, MLP architectures
<10% latency overhead vs vanilla model

References: Liu et al., "Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness" (NeurIPS 2020)

2.2 Batch Ensemble¶

What: Share most parameters across ensemble members; each member has a rank-1 multiplicative perturbation (r_i, s_i) applied to each weight matrix.

Why critical: N ensemble members in ~1x memory. Trains in a single forward/backward pass per batch by assigning different batch elements to different members.

Implementation plan:

src/deepuq/methods/batch_ensemble.py
├── class BatchEnsembleLinear(nn.Module)
│   ├── __init__(in_features, out_features, ensemble_size)
│   ├── Parameters: weight (shared), r_i, s_i (per member), bias_i
│   └── forward(x) → (batch * ensemble_size, out_features)
├── class BatchEnsembleConv2d(nn.Module)
│   └── Same pattern for convolutions
├── class BatchEnsembleWrapper
│   ├── __init__(base_model, ensemble_size=4)
│   ├── convert_to_batch_ensemble()   # replace Linear/Conv with BE versions
│   ├── forward(x) → stacked predictions
│   └── predict_uq(x) → UQResult
└── Utility: replicate_batch(x, ensemble_size) → x repeated for each member
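
The reason the rank-1 factors are cheap is that the member-specific weight never has to be materialized. A small pure-Python check of that identity follows; it assumes a weight of shape (in, out) with r_i scaling inputs and s_i scaling outputs, which is a convention chosen for this sketch rather than a statement about the planned layer.

```python
def shared_forward(x, weight):
    """x: (in,), weight: (in, out) -> (out,)"""
    n_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(n_out)]


def member_forward_fast(x, weight, r, s):
    """((x * r) @ W) * s: two elementwise products around the shared matmul."""
    xr = [xi * ri for xi, ri in zip(x, r)]
    return [h * sj for h, sj in zip(shared_forward(xr, weight), s)]


def member_forward_explicit(x, weight, r, s):
    """x @ (W * outer(r, s)): builds the member's full weight matrix."""
    w_i = [[weight[i][j] * r[i] * s[j] for j in range(len(s))]
           for i in range(len(r))]
    return shared_forward(x, w_i)


x = [1.0, 2.0]
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
r, s = [1.1, 0.9], [0.8, 1.2, 1.0]
fast = member_forward_fast(x, weight, r, s)
explicit = member_forward_explicit(x, weight, r, s)
```

Both paths give the same output, so each member only adds in_features + out_features parameters per layer on top of the shared weight.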

How a scientist implements this:

from deepuq.methods import BatchEnsembleWrapper

be = BatchEnsembleWrapper(model, ensemble_size=4)
be.convert_to_batch_ensemble()

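# Train: each input is replicated once per ensemble member (see replicate_batch),
# so the 1-D targets are tiled to match the stacked predictions.
for x, y in train_loader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    preds = be(x)  # shape: (batch * 4, output_dim)
    loss = compute_loss(preds, y.repeat(4))
    loss.backward()
    optimizer.step()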

result = be.predict_uq(x_test)

Acceptance criteria:

Memory overhead < 5% vs single model for ensemble_size=4
Training throughput within 2x of single model
Diversity between members (disagreement > random)
Matches published BatchEnsemble accuracy on CIFAR-10

References: Wen et al., "BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning" (ICLR 2020)

2.3 Packed Ensembles¶

What: Partition network channels into subgroups; each subgroup is an independent "sub-network". All sub-networks share the same architecture but have independent weights within their channel partition.

Implementation plan:

src/deepuq/methods/packed_ensemble.py
├── class PackedLinear(nn.Module)
│   ├── __init__(in_features, out_features, num_packs, ...)
│   └── forward(x) → grouped computation
├── class PackedConv2d(nn.Module)
│   └── Uses grouped convolution (groups=num_packs)
└── class PackedEnsembleWrapper
    ├── __init__(base_model, num_packs=4, alpha=2)
    ├── convert_to_packed()
    └── predict_uq(x) → UQResult

References: Laurent et al., "Packed-Ensembles for Efficient Uncertainty Estimation" (ICLR 2023)

2.4 Improved MCMC¶

What: SGHMC (momentum-based SGLD) and Cyclical SGMCMC (exploration/exploitation cycles).

src/deepuq/methods/mcmc.py  (extend existing)
├── class SGHMCOptimizer(Optimizer)
│   ├── __init__(params, lr, momentum_decay, noise_scale)
│   └── step()   # includes friction + noise
├── class CyclicalSGMCMC
│   ├── __init__(model, optimizer_cls, cycle_length, n_cycles, ...)
│   ├── run(train_loader, n_epochs) → list[state_dict]  # collected samples
│   └── Internal: cosine schedule within each cycle, collect at cycle end
└── class PosteriorPredictive
    ├── __init__(base_model, samples: list[state_dict])
    └── predict_uq(x, n_samples=None) → UQResult
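
The cosine schedule inside each cycle can be written in a few lines. This sketch uses 0-indexed steps; `exploration_fraction` is a hypothetical knob marking where the sampling phase of a cycle begins, and the function names are illustrative.

```python
import math


def cyclical_lr(step, total_steps, n_cycles, lr0):
    """Cosine step size that restarts at lr0 at the top of every cycle."""
    cycle_len = math.ceil(total_steps / n_cycles)
    pos = (step % cycle_len) / cycle_len      # position within cycle, in [0, 1)
    return 0.5 * lr0 * (math.cos(math.pi * pos) + 1.0)


def is_sampling_phase(step, total_steps, n_cycles, exploration_fraction=0.8):
    """True once the step size is small enough to collect posterior samples."""
    cycle_len = math.ceil(total_steps / n_cycles)
    return (step % cycle_len) / cycle_len >= exploration_fraction


schedule = [cyclical_lr(k, total_steps=100, n_cycles=4, lr0=0.1) for k in range(100)]
```

Large steps early in a cycle let the chain move between modes; the small steps at the end are where samples are collected.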

References: Chen et al., "Stochastic Gradient Hamiltonian Monte Carlo" (ICML 2014); Zhang et al., "Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning" (ICLR 2020)

2.5 Flipout for VI¶

What: Per-example weight perturbations that are decorrelated across the batch. Lower gradient variance than the standard shared-perturbation reparameterization used in BBB.

src/deepuq/methods/vi.py  (extend existing)
├── class FlipoutLinear(nn.Module)
│   ├── __init__(in_features, out_features)
│   ├── Parameters: weight_mu, weight_sigma (log-scale)
│   └── forward(x) → applies random sign flips per sample
└── class FlipoutMLP
    ├── __init__(input_dim, hidden_dims, output_dim)
    └── predict_uq(x, n_samples=30) → UQResult

References: Wen et al., "Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches" (ICLR 2018)

2.6 Linearized Laplace Predictive¶

What: Instead of sampling from the Laplace posterior, use the GLM (Generalized Linear Model) predictive. Linearize the network around the MAP and compute the exact predictive distribution analytically.

src/deepuq/methods/laplace/_wrapper.py  (extend existing)
├── LaplaceWrapper.predict_uq(..., method="glm")  # new option
│   # Computes: f(x) ≈ f_MAP(x) + J(x)(θ - θ_MAP)
│   # Predictive: N(f_MAP(x), J(x) @ Σ_post @ J(x)^T)
│   # No sampling needed; exact Gaussian predictive
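
The predictive covariance J(x) Σ_post J(x)^T is a plain matrix product. The sketch below works on nested lists for a single input; `glm_predictive` is a hypothetical helper, and the example uses a linear model f(x) = θ0 + θ1·x, for which the Jacobian is exactly [1, x].

```python
def glm_predictive(f_map, jacobian, posterior_cov):
    """Mean and covariance of N(f_MAP(x), J Sigma J^T) for one input.

    jacobian: (n_out, n_params), posterior_cov: (n_params, n_params).
    """
    n_out, n_par = len(jacobian), len(posterior_cov)
    j_sigma = [[sum(jacobian[o][p] * posterior_cov[p][q] for p in range(n_par))
                for q in range(n_par)] for o in range(n_out)]
    cov = [[sum(j_sigma[o][q] * jacobian[o2][q] for q in range(n_par))
            for o2 in range(n_out)] for o in range(n_out)]
    return f_map, cov


sigma_post = [[0.01, 0.0], [0.0, 0.04]]
_, cov_near = glm_predictive([0.3], [[1.0, 0.0]], sigma_post)   # x = 0
_, cov_far = glm_predictive([2.8], [[1.0, 5.0]], sigma_post)    # x = 5
```

The variance grows with the Jacobian norm, so inputs where the output is more sensitive to the weights get wider predictive distributions.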

Why: Better OOD detection than sample-based Laplace. Faster inference. Theoretically grounded.

References: Immer et al., "Improving predictions of Bayesian neural nets via local linearization" (AISTATS 2021)

Phase 3: Deployment & Integration (v0.4.x)¶

Timeline: 6–8 weeks
Goal: Make Deep-UQ usable in production pipelines, not just research notebooks.

3.1 Active Learning Module¶

What: Use uncertainty to select the most informative data points for labeling.

src/deepuq/active/
├── __init__.py
├── strategies.py
│   ├── class UncertaintySampling
│   │   ├── __init__(model, criterion="epistemic_var")
│   │   └── select(pool_loader, n_samples) → indices
│   ├── class BALDSampling                # Bayesian Active Learning by Disagreement
│   │   └── select(pool_loader, n_samples) → indices
│   ├── class BatchBALD
│   │   └── select(pool_loader, n_samples) → indices
│   ├── class CoreSet
│   │   └── select(pool_loader, n_samples) → indices
│   └── class ExpectedModelChange
│       └── select(pool_loader, n_samples) → indices
├── loop.py
│   └── class ActiveLearningLoop
│       ├── __init__(model, strategy, train_fn, pool_dataset, ...)
│       ├── step() → (selected_indices, current_metrics)
│       └── run(n_iterations, n_samples_per_iter) → history
└── visualization.py
    └── plot_learning_curve(history)

How a scientist uses this:

from deepuq.active import ActiveLearningLoop, BALDSampling

strategy = BALDSampling(model, n_mc_samples=20)
loop = ActiveLearningLoop(
    model=model,
    strategy=strategy,
    train_fn=my_train_function,
    pool_dataset=unlabeled_data,
    val_dataset=val_data,
)

history = loop.run(n_iterations=20, n_samples_per_iter=50)
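
The BALD acquisition score behind BALDSampling is the mutual information between the prediction and the model parameters: the entropy of the mean prediction minus the mean entropy of the individual predictions. The sketch below takes Monte Carlo class probabilities as nested lists; the function names and this input layout are assumptions.

```python
import math


def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)


def bald_score(mc_probs):
    """mc_probs: one class-probability vector per posterior sample."""
    n = len(mc_probs)
    mean_p = [sum(col) / n for col in zip(*mc_probs)]
    return entropy(mean_p) - sum(entropy(p) for p in mc_probs) / n


def select_top(pool_mc_probs, n_samples):
    """Indices of the pool points with the highest BALD score."""
    scores = [bald_score(p) for p in pool_mc_probs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:n_samples]


pool = [
    [[0.9, 0.1], [0.9, 0.1]],   # samples agree, confident
    [[0.9, 0.1], [0.1, 0.9]],   # samples disagree
    [[0.5, 0.5], [0.5, 0.5]],   # samples agree, but the label is ambiguous
]
chosen = select_top(pool, n_samples=1)
```

Only the second point scores high: BALD ignores inputs that are merely noisy and targets those where the posterior samples disagree.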

3.2 Bayesian Optimization Module¶

What: Use GPs (from existing deepuq.models.gp) for sequential optimization of expensive black-box functions.

src/deepuq/optim/
├── __init__.py
├── acquisition.py
│   ├── expected_improvement(model, X_candidate, best_y) → scores
│   ├── upper_confidence_bound(model, X_candidate, beta=2.0) → scores
│   ├── probability_of_improvement(model, X_candidate, best_y) → scores
│   └── thompson_sampling(model, X_candidate) → scores
├── bo.py
│   └── class BayesianOptimizer
│       ├── __init__(bounds, kernel, acquisition="ei", ...)
│       ├── suggest(n_suggestions=1) → X_next
│       ├── observe(X, y)
│       ├── run(objective_fn, n_iterations) → OptResult
│       └── get_model() → GaussianProcessRegressor
└── visualization.py
    ├── plot_acquisition(optimizer, ax=None)
    └── plot_convergence(optimizer, ax=None)
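
Under a Gaussian posterior, expected improvement has a closed form. The sketch below assumes maximization and takes the posterior mean and standard deviation directly, instead of the (model, X_candidate, best_y) signature listed above; `ei_from_moments` is a hypothetical name.

```python
import math


def ei_from_moments(mean, std, best_y):
    """Closed-form expected improvement for maximization."""
    if std <= 0.0:
        return max(mean - best_y, 0.0)    # no uncertainty: plain improvement
    z = (mean - best_y) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_y) * cdf + std * pdf


ei_at_incumbent = ei_from_moments(mean=0.0, std=1.0, best_y=0.0)
ei_certain = ei_from_moments(mean=1.0, std=0.0, best_y=0.0)
```

The first term rewards candidates whose mean already beats the incumbent; the second rewards uncertainty, which is what drives exploration.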

3.3 Model Export with UQ¶

What: Export UQ-wrapped models for deployment in non-Python environments.

src/deepuq/export/
├── __init__.py
├── torchscript.py
│   ├── export_ensemble(ensemble_wrapper, sample_input) → ScriptModule
│   ├── export_mc_dropout(mc_wrapper, sample_input, n_forward=30) → ScriptModule
│   └── export_laplace_linearized(laplace_wrapper, sample_input) → ScriptModule
├── onnx.py
│   ├── export_to_onnx(wrapper, sample_input, path, ...)
│   └── Handles: mean output + variance output as multi-output graph
└── utils.py
    └── validate_export(original, exported, sample_input, atol=1e-5)

3.4 Distributed Ensemble Training¶

What: Train ensemble members across multiple GPUs/nodes efficiently.

src/deepuq/distributed/
├── __init__.py
├── parallel_ensemble.py
│   └── class DistributedEnsembleTrainer
│       ├── __init__(model_fn, ensemble_size, ...)
│       ├── train(train_loader, n_epochs, ...) → list[model]
│       └── Internal: each GPU trains one member, sync at predict time
└── utils.py
    └── gather_predictions(local_preds, world_size) → combined

3.5 Evidential Deep Learning¶

What: Train a network to output parameters of a Dirichlet (classification) or Normal-Inverse-Gamma (regression) distribution. Single forward pass gives both prediction and uncertainty.

src/deepuq/methods/evidential.py
├── class EvidentialRegression
│   ├── __init__(base_model)   # model outputs (γ, ν, α, β) per output
│   ├── loss(x, y) → NIG negative log-likelihood + regularizer
│   ├── predict_uq(x) → UQResult
│   │   # epistemic_var = β / (ν * (α - 1))
│   │   # aleatoric_var = β / (α - 1)
│   └── uncertainty_type: "evidential"
└── class EvidentialClassification
    ├── __init__(base_model, num_classes)  # model outputs Dirichlet concentrations
    ├── loss(x, y) → Dirichlet likelihood + KL regularizer
    └── predict_uq(x) → UQResult
        # epistemic_var from Dirichlet uncertainty (K / sum_alpha)
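
The formulas in the tree above translate directly into code. This is a scalar sketch with hypothetical function names; it adds a guard for α ≤ 1, where both variances are undefined.

```python
def nig_uncertainties(gamma, nu, alpha, beta):
    """Prediction and variance split from Normal-Inverse-Gamma parameters."""
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for finite variances")
    aleatoric = beta / (alpha - 1.0)            # E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))     # Var[mu]
    return gamma, epistemic, aleatoric


def dirichlet_vacuity(alphas):
    """K / sum(alpha): 1.0 with no evidence, approaching 0 as evidence grows."""
    return len(alphas) / sum(alphas)


mean, epistemic, aleatoric = nig_uncertainties(gamma=0.5, nu=2.0, alpha=3.0, beta=1.0)
no_evidence = dirichlet_vacuity([1.0, 1.0, 1.0])
some_evidence = dirichlet_vacuity([10.0, 1.0, 1.0])
```

Note that the epistemic term is the aleatoric term divided by ν, so ν acts as a count of virtual observations supporting the predicted mean.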

References: Amini et al., "Deep Evidential Regression" (NeurIPS 2020); Sensoy et al., "Evidential Deep Learning to Quantify Classification Uncertainty" (NeurIPS 2018)

Phase 4: Scientific ML Depth (v0.5.x)¶

Timeline: 8–10 weeks
Goal: Become the reference toolkit for UQ in computational science.

4.1 Multi-Fidelity UQ¶

What: Combine data from cheap (low-fidelity) and expensive (high-fidelity) simulations. The GP learns a correlation structure between fidelities.

src/deepuq/models/gp/multifidelity.py
├── class MultiFidelityGP
│   ├── __init__(kernel_lo, kernel_hi, rho_prior=None)
│   ├── fit(X_lo, y_lo, X_hi, y_hi)
│   ├── predict_uq(X_new, fidelity="high") → UQResult
│   └── Information gain / value of information computation
└── class DeepMultiFidelityGP
    ├── __init__(feature_extractor, ...)
    └── Nonlinear fidelity correlation via neural network

Use case: You have 10,000 coarse mesh CFD runs and 50 fine mesh runs. Train on both, predict with high-fidelity uncertainty.

4.2 Physics-Constrained Uncertainty¶

What: Ensure that prediction intervals respect known physical constraints (conservation laws, monotonicity, positivity, symmetry).

src/deepuq/constraints/
├── __init__.py
├── hard.py
│   ├── class PositivityConstraint      # clip lower bound at 0
│   ├── class ConservationConstraint     # adjust intervals to respect ∫u dx = const
│   ├── class MonotonicityConstraint     # enforce ordered quantiles
│   └── class BoundConstraint           # enforce known min/max
├── soft.py
│   ├── class PhysicsRegularizedUQ      # penalize intervals that violate PDE residual
│   └── class ConstraintAwareLoss       # augmented loss with constraint terms
└── wrappers.py
    └── class ConstrainedUQResult(UQResult)
        └── Applies constraints post-hoc to any UQResult

How a scientist uses this:

from deepuq.constraints import ConservationConstraint, ConstrainedUQResult

# Mass must be conserved: total integral = 1.0
constraint = ConservationConstraint(
    integration_weights=dx,  # quadrature weights
    conserved_quantity=1.0,
)

raw_result = model.predict_uq(x_test)
constrained = ConstrainedUQResult(raw_result, constraints=[constraint])
# constrained.mean integrates to 1.0
# constrained.total_var adjusted to respect feasible region

4.3 Spatiotemporal Uncertainty Propagation¶

What: When rolling out a time-dependent PDE solver autoregressively, uncertainty grows at each step. Track and propagate this correctly.

src/deepuq/propagation/
├── __init__.py
├── rollout.py
│   └── class UncertaintyRollout
│       ├── __init__(model, n_steps, propagation="moment_matching")
│       ├── predict_trajectory(x0, n_steps) → list[UQResult]
│       │   # Options: "moment_matching", "sampling", "unscented"
│       └── uncertainty_growth_rate(trajectory) → float
├── moment_matching.py
│   └── Propagate mean + covariance through the model using linearization
├── unscented.py
│   └── Sigma-point propagation (no Jacobians needed)
└── sampling.py
    └── Particle-based propagation (ensemble of trajectories)

Use case: FNO trained on Navier-Stokes predicts 10 timesteps ahead. At step 1, epistemic variance is small. By step 10, it's grown significantly. This module tracks that growth faithfully.
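
The "moment_matching" option can be illustrated in one dimension: linearize the step function around the current mean, scale the variance by the squared slope, and add the uncertainty contributed by the step itself. The names below, the constant `step_var`, and the finite-difference slope are assumptions made for this sketch.

```python
def rollout_moments(step_fn, x0, n_steps, step_var, eps=1e-5):
    """Propagate a scalar mean and variance through repeated model steps."""
    mean, var = x0, 0.0
    trajectory = []
    for _ in range(n_steps):
        # Central finite difference stands in for the model Jacobian.
        slope = (step_fn(mean + eps) - step_fn(mean - eps)) / (2.0 * eps)
        var = slope * slope * var + step_var   # linearized variance update
        mean = step_fn(mean)
        trajectory.append((mean, var))
    return trajectory


trajectory = rollout_moments(lambda x: 1.1 * x, x0=1.0, n_steps=10, step_var=0.01)
variances = [v for _, v in trajectory]
```

For a mildly expanding map like this one, the variance compounds at every step, which is the growth the use case describes.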

4.4 Neural ODE/SDE with UQ¶

What: Continuous-depth models where the dynamics themselves have uncertainty.

src/deepuq/models/neural_ode.py
├── class NeuralODE
│   ├── __init__(dynamics_net, solver="dopri5")
│   └── forward(x0, t_span) → trajectory
├── class BayesianNeuralODE
│   ├── __init__(dynamics_net, uq_method="swag")
│   └── predict_uq(x0, t_span, n_samples=30) → list[UQResult]
└── class NeuralSDE
    ├── __init__(drift_net, diffusion_net, solver="euler_maruyama")
    ├── forward(x0, t_span) → trajectory (stochastic)
    └── predict_uq(x0, t_span, n_paths=100) → UQResult

4.5 Functional Priors for Neural Operators¶

What: Instead of weight-space priors (which are hard to interpret), define priors in function space: "I believe the operator output is smooth" or "output should look like a GP with Matern kernel."

src/deepuq/priors/
├── __init__.py
├── functional.py
│   ├── class GPFunctionalPrior
│   │   ├── __init__(kernel, input_points)
│   │   └── log_prob(f_samples) → float
│   ├── class SmoothnesssPrior
│   │   ├── __init__(smoothness_order=2)
│   │   └── log_prob(f_samples) → penalizes high-frequency content
│   └── class PhysicsPrior
│       ├── __init__(pde_residual_fn)
│       └── log_prob(f_samples) → penalizes PDE residual
└── integration.py
    └── Utilities to combine functional priors with SWAG/Laplace/VI

4.6 Uncertainty for Sequence Models¶

What: UQ methods designed for Transformers, RNNs, and other sequence architectures.

src/deepuq/models/sequence.py
├── class UncertainTransformer
│   ├── __init__(d_model, nhead, num_layers, ...)
│   ├── Stochastic attention (dropout + ensemble heads)
│   └── predict_uq(x_seq) → UQResult (per-token uncertainty)
├── class RecurrentEnsemble
│   ├── __init__(cell_type="lstm", hidden_size, ensemble_size)
│   └── predict_uq(x_seq) → UQResult
└── class BayesianTransformerLayer
    └── Last-layer VI or Laplace on transformer output projection

Phase 5: Research Frontier (v0.6.x)¶

Timeline: 8–10 weeks
Goal: Implement cutting-edge methods before other toolkits.

5.1 Epistemic Neural Networks (ENN)¶

What: DeepMind's framework. An "epinet" — a small auxiliary network — is trained to predict epistemic uncertainty given the base model's features + a random seed.

src/deepuq/methods/enn.py
├── class EpiNet
│   ├── __init__(feature_dim, hidden_dims, output_dim, n_basis=50)
│   ├── forward(features, z_index) → epistemic_perturbation
│   └── Parameters: small MLP + learnable basis vectors
└── class ENNWrapper
    ├── __init__(base_model, feature_layer, epinet_hidden=[64, 64])
    ├── fit(train_loader, n_epochs)    # train epinet with randomized prior loss
    └── predict_uq(x, n_index=100) → UQResult
        # Sample different z indices, measure spread

References: Osband et al., "Epistemic Neural Networks" (NeurIPS 2023)

5.2 Conformal Prediction Under Distribution Shift¶

What: Standard conformal assumes exchangeability. Under shift, coverage guarantees break. Weighted and adaptive methods restore guarantees.

src/deepuq/methods/conformal/  (extend existing)
├── _weighted.py
│   └── class WeightedConformalPredictor
│       ├── __init__(model, weight_fn)   # importance weights for shift correction
│       └── calibrate(cal_loader, weights)
├── _adaptive.py
│   └── class AdaptiveConformalPredictor
│       ├── __init__(model, target_coverage=0.9)
│       ├── update(x_new, y_new)         # online update of threshold
│       └── predict_set(x) → intervals that adapt over time
└── _mondrian.py
    └── class MondrianConformalPredictor
        └── Group-conditional coverage (per-class or per-region)
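
The online update in AdaptiveConformalPredictor.update can follow the adaptive conformal inference rule: after each observation, nudge the working miscoverage level toward the target, lowering it after a miss so that the next intervals widen. The class below is a hypothetical sketch of only that update; `gamma` is an assumed step-size parameter.

```python
class AdaptiveAlpha:
    """Tracks the working miscoverage level alpha_t used for the next quantile."""

    def __init__(self, target_coverage=0.9, gamma=0.05):
        self.target_alpha = 1.0 - target_coverage
        self.alpha = self.target_alpha
        self.gamma = gamma

    def update(self, covered):
        # err = 1 when the last prediction set missed the true value.
        err = 0.0 if covered else 1.0
        self.alpha += self.gamma * (self.target_alpha - err)
        return self.alpha


aci = AdaptiveAlpha(target_coverage=0.9, gamma=0.05)
after_miss = aci.update(covered=False)
after_hit = aci.update(covered=True)
```

The working level can drift outside [0, 1] after long runs of misses or hits, so it should be clipped before being turned into a quantile of the calibration scores.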

5.3 Stein Variational Gradient Descent (SVGD)¶

What: Particle-based inference. Maintain K "particles" (model copies) and update them with a repulsive kernel + gradient to approximate the posterior.

src/deepuq/methods/svgd.py
├── class SVGDOptimizer
│   ├── __init__(particles: list[nn.Module], kernel="rbf", bandwidth="median")
│   ├── step(loss_fn, x, y)   # SVGD update on all particles
│   └── kernel_matrix(params) → (K, grad_K)
└── class SVGDWrapper
    ├── __init__(model_fn, n_particles=10, ...)
    ├── fit(train_loader, n_epochs)
    └── predict_uq(x) → UQResult (from particle disagreement)

References: Liu & Wang, "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm" (NeurIPS 2016)

5.4 PAC-Bayes Bounds¶

What: Compute theoretical generalization guarantees. Given a posterior over weights, compute a certificate: "with probability 1-δ, the true risk is at most X."

src/deepuq/bounds/
├── __init__.py
├── pac_bayes.py
│   ├── mcallester_bound(empirical_risk, kl_divergence, n_train, delta=0.05) → float
│   ├── catoni_bound(empirical_risk, kl_divergence, n_train, delta=0.05) → float
│   └── data_dependent_bound(model, train_loader, prior, delta=0.05) → float
└── compute.py
    └── class PACBayesCertifier
        ├── __init__(model, prior, posterior)
        ├── compute_kl() → float
        ├── compute_bound(train_loader, delta=0.05) → float
        └── optimize_bound(train_loader) → optimal_lambda
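
A McAllester-style certificate is simple arithmetic once the KL term is known. The sketch below uses the common form risk ≤ empirical risk + sqrt((KL + ln(2√n/δ)) / (2n)) for losses in [0, 1]; it takes the empirical risk as an explicit argument, which is an assumption of this example, and the function name is illustrative.

```python
import math


def mcallester_risk_bound(empirical_risk, kl_divergence, n_train, delta=0.05):
    """Upper bound on the true risk that holds with probability 1 - delta."""
    complexity = (
        kl_divergence + math.log(2.0 * math.sqrt(n_train) / delta)
    ) / (2.0 * n_train)
    return min(1.0, empirical_risk + math.sqrt(complexity))


bound = mcallester_risk_bound(empirical_risk=0.05, kl_divergence=100.0, n_train=50_000)
loose = mcallester_risk_bound(empirical_risk=0.05, kl_divergence=1e6, n_train=50_000)
```

The bound is only informative when the KL divergence is small relative to the training set size; with a large KL it saturates at the trivial value of 1.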

5.5 Posterior Networks¶

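What: Fit class-conditional normalizing flows on the encoder's latent space and turn the resulting densities into pseudo-counts that parameterize a Dirichlet over class probabilities. Inputs far from the training data receive low density and therefore a flat, high-uncertainty Dirichlet. A companion NormalizingFlowUQ wrapper models an arbitrary predictive density instead of a fixed parametric family.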

src/deepuq/methods/posterior_networks.py
├── class PosteriorNetwork
│   ├── __init__(encoder, flow, n_classes_or_output_dim)
│   ├── forward(x) → distribution parameters (concentration for Dirichlet)
│   ├── loss(x, y) → UCE loss (uncertainty cross-entropy)
│   └── predict_uq(x) → UQResult
│       # Epistemic: entropy of the predicted Dirichlet
│       # Aleatoric: expected entropy under the Dirichlet
└── class NormalizingFlowUQ
    ├── __init__(base_model, flow_layers=8)
    ├── forward(x) → samples from learned predictive distribution
    └── predict_uq(x, n_samples=200) → UQResult

References: Charpentier et al., "Posterior Network: Uncertainty Estimation without OOD Samples via Density-Based Pseudo-Counts" (NeurIPS 2020)

5.6 Test-Time Augmentation UQ¶

What: Apply random augmentations at test time, measure prediction variance across augmented versions.

src/deepuq/methods/tta.py
└── class TTAWrapper
    ├── __init__(model, augmentations, n_augmentations=30)
    └── predict_uq(x) → UQResult
        # Apply each augmentation, forward pass, measure spread

Implementation Guidelines for Contributors¶

Architecture Principles¶

Every method must return UQResult — no exceptions. This is the API contract.
Every method must accept any nn.Module as base model (except architecture-specific methods like SNGP which need to modify layers).
No external UQ dependencies — implement everything in pure PyTorch.
Lazy computation — don't compute aleatoric/epistemic split if the method doesn't support it; leave as None.
Device-agnostic — all methods must work on CPU and CUDA without code changes.

File Organization¶

src/deepuq/
├── methods/           # UQ algorithms (wrappers around models)
│   ├── calibration/   # post-hoc calibration
│   ├── conformal/     # conformal prediction variants
│   ├── laplace/       # Laplace approximation
│   └── *.py           # one file per method family
├── models/            # neural network architectures
│   └── gp/            # Gaussian process models
├── metrics/           # evaluation metrics
├── active/            # active learning
├── optim/             # Bayesian optimization
├── constraints/       # physics constraints for UQ
├── propagation/       # uncertainty propagation over time
├── bounds/            # PAC-Bayes and theoretical bounds
├── export/            # model export (ONNX, TorchScript)
├── distributed/       # multi-GPU training
├── priors/            # functional priors
├── types.py           # UQResult and shared types
└── utils.py           # shared utilities

Testing Requirements¶

Every new method needs:

Unit test: Does it run? Does it return valid UQResult?
Shape test: Various input shapes, batch sizes, output dimensions.
Correctness test: Compare against a known reference implementation or published numbers on a toy problem.
Integration test: Works with at least MLP + one scientific ML model (FNO or DeepONet).
Calibration test: On a simple problem, is the uncertainty actually calibrated?

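# Template for method tests
import torch

from deepuq.methods.swag import SWAGCollector, SWAGWrapper
from deepuq.models import MLP  # import path assumed; adjust to the actual module
from deepuq.types import UQResult


def test_swag_basic():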
    model = MLP(input_dim=2, hidden_dims=[32], output_dim=1)
    # ... train ...
    collector = SWAGCollector(model, max_rank=5)
    for _ in range(10):
        # ... train one epoch ...
        collector.collect(model)
    collector.finalize()

    swag = SWAGWrapper(model, collector)
    result = swag.predict_uq(torch.randn(16, 2))

    assert isinstance(result, UQResult)
    assert result.mean.shape == (16, 1)
    assert result.epistemic_var.shape == (16, 1)
    assert (result.epistemic_var > 0).all()

Documentation Requirements¶

Every new method needs:

API reference page: Auto-generated from docstrings via mkdocstrings
Method guide page in docs/methods/: Mathematical background, when to use, comparison with alternatives
Tutorial: End-to-end notebook showing the method on a real problem
Benchmarks entry: Added to the benchmark suite with published reference numbers

Contribution Workflow¶

1. Open issue describing the method + acceptance criteria
2. Create branch: feature/<method-name>
3. Implement in src/deepuq/methods/<name>.py
4. Add tests in tests/test_<name>.py
5. Add docs page + tutorial
6. Run benchmark comparison
7. PR with: implementation + tests + docs + benchmark results

Success Metrics¶

v0.2 (Phase 1 complete)¶

10+ UQ methods available
Comprehensive metrics module with visualization
All methods benchmarked on at least 2 standard datasets
ECE/CRPS numbers published in docs

v0.3 (Phase 2 complete)¶

Methods that scale to ImageNet-class models (SNGP, BatchEnsemble)
<2x overhead vs vanilla training for most methods
Single-pass UQ option available (SNGP); Evidential follows in Phase 3

v0.4 (Phase 3 complete)¶

Active learning loop running on real problems
Models exportable to ONNX
Multi-GPU ensemble training working

v0.5 (Phase 4 complete)¶

Multi-fidelity workflows demonstrated on engineering problems
Spatiotemporal rollout UQ working with FNO/DeepONet
Physics-constrained intervals published

v0.6 (Phase 5 complete)¶

25+ UQ methods (most comprehensive toolkit available)
Theoretical bounds computable
Research-frontier methods available within 6 months of publication

Competitive Target¶

By v0.6, Deep-UQ should be the only toolkit where a researcher or engineer can:

Pick any PyTorch model architecture
Choose from 25+ UQ methods (from cheap post-hoc to full Bayesian)
Evaluate with proper metrics (calibration, sharpness, OOD, selective)
Deploy with export tools
Scale with distributed training
Apply physical constraints to uncertainty
Use in active learning / Bayesian optimization loops
Get theoretical guarantees (PAC-Bayes)

All with zero external UQ dependencies and a single unified API.