Benchmarks

Deep-UQ includes a benchmarking suite to compare UQ methods on calibration, sharpness, and computational cost.

Running Benchmarks

pip install "uqdeepnn[benchmarks]"
python -m benchmarks.run_benchmarks

Metrics

| Metric | Description |
| --- | --- |
| RMSE | Root mean squared error of predictive mean |
| NLL | Negative log-likelihood under predicted distribution |
| Calibration Error | Deviation from ideal coverage across confidence levels |
| Sharpness | Average width of prediction intervals |
| Inference Time | Wall-clock time for predict_uq() call |
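
To make the definitions above concrete, here is a minimal NumPy sketch of how such metrics can be computed for a Gaussian predictive distribution. It is not Deep-UQ's implementation: the function names, the choice of coverage levels, the 95% interval used for sharpness, and the median-of-repeats timing helper are all assumptions made for illustration, and the suite may define calibration error and sharpness differently.

```python
import math
import time

import numpy as np

# Standard-normal quantiles for central intervals at a few coverage levels
# (hard-coded to avoid a SciPy dependency; the levels are an assumption).
Z_BY_LEVEL = {0.50: 0.6745, 0.80: 1.2816, 0.90: 1.6449, 0.95: 1.9600}


def rmse(y, mean):
    """Root mean squared error of the predictive mean."""
    return float(np.sqrt(np.mean((y - mean) ** 2)))


def gaussian_nll(y, mean, std):
    """Average negative log-likelihood under N(mean, std^2)."""
    var = std ** 2
    return float(np.mean(0.5 * np.log(2 * math.pi * var) + (y - mean) ** 2 / (2 * var)))


def calibration_error(y, mean, std):
    """Mean absolute gap between nominal and empirical interval coverage."""
    gaps = []
    for level, z in Z_BY_LEVEL.items():
        covered = np.abs(y - mean) <= z * std
        gaps.append(abs(float(np.mean(covered)) - level))
    return float(np.mean(gaps))


def sharpness(std, level=0.95):
    """Average width of the central prediction interval at the given level."""
    return float(np.mean(2 * Z_BY_LEVEL[level] * std))


def time_call_ms(fn, *args, repeats=10):
    """Median wall-clock time of fn(*args) in milliseconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - start) * 1e3)
    return float(np.median(times))


# Synthetic check: a perfectly specified model on noisy sine data.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 2000)
noise_std = 0.1
y = np.sin(x) + rng.normal(0, noise_std, size=x.shape)
mean = np.sin(x)
std = np.full_like(x, noise_std)

print("RMSE:", rmse(y, mean))
print("NLL:", gaussian_nll(y, mean, std))
print("Calibration error:", calibration_error(y, mean, std))
print("Sharpness (95%):", sharpness(std))
```

Because the synthetic predictions match the true noise level, the RMSE should land close to the noise standard deviation and the calibration error close to zero. An overconfident model (predicted std too small) would show a smaller sharpness value but a larger calibration error, which is why the two metrics are reported together.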

Method Comparison (1D Regression)

| Method | RMSE | NLL | Calibration Error | Time (ms) |
| --- | --- | --- | --- | --- |
| Deep Ensemble (5) | 0.042 | -1.82 | 0.018 | 12.3 |
| Laplace (kron) | 0.044 | -1.75 | 0.024 | 3.1 |
| Laplace (diag) | 0.044 | -1.68 | 0.031 | 1.8 |
| MC Dropout (50) | 0.048 | -1.61 | 0.035 | 8.7 |
| SGLD (20 samples) | 0.043 | -1.78 | 0.021 | 45.2 |
| Bayes by Backprop | 0.046 | -1.72 | 0.028 | 6.4 |

Reproducibility

Results are from the built-in benchmark suite using default configurations. Run python -m benchmarks.run_benchmarks --seed 42 to reproduce.

Custom Benchmarks

from benchmarks.config import BenchmarkConfig
from benchmarks.run_benchmarks import run_benchmark_suite

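# Restrict the run to a subset of methods and datasets.
config = BenchmarkConfig(
    methods=["ensemble", "laplace_kron", "mc_dropout"],  # UQ methods to compare
    datasets=["sine", "uci_energy"],                     # datasets to evaluate on
    n_trials=5,                                          # number of trials
)
# Run the suite with the custom configuration and collect the results.
results = run_benchmark_suite(config)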

See the benchmarks/ directory for full configuration options.