Benchmarks

Deep-UQ includes a benchmarking suite that compares uncertainty quantification (UQ) methods on predictive accuracy, calibration, sharpness, and computational cost.

Running Benchmarks

pip install "uqdeepnn[benchmarks]"
python -m benchmarks.run_benchmarks

Metrics

Metric	Description
RMSE	Root mean squared error of the predictive mean
NLL	Negative log-likelihood of the targets under the predicted distribution
Calibration Error	Deviation of empirical coverage from ideal coverage across confidence levels
Sharpness	Average width of the prediction intervals
Inference Time	Wall-clock time for a `predict_uq()` call
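
The metrics in the table above can be sketched in plain Python. This is an illustrative sketch, not the suite's own implementation: it assumes a Gaussian predictive distribution with a mean `mu` and standard deviation `sigma` per test point, and the function names, the confidence levels, and the 95% interval used for sharpness are all assumptions made for this example.

```python
import math
import time
from statistics import NormalDist


def rmse(y, mu):
    """Root mean squared error of the predictive mean."""
    return math.sqrt(sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y))


def gaussian_nll(y, mu, sigma):
    """Mean negative log-density of y under N(mu, sigma^2) (Gaussian assumption)."""
    total = sum(
        0.5 * math.log(2 * math.pi * s ** 2) + (yi - mi) ** 2 / (2 * s ** 2)
        for yi, mi, s in zip(y, mu, sigma)
    )
    return total / len(y)


def calibration_error(y, mu, sigma, levels=(0.5, 0.8, 0.9, 0.95)):
    """Mean absolute gap between nominal and empirical coverage of central intervals."""
    gaps = []
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2)  # half-width in units of sigma
        covered = sum(abs(yi - mi) <= z * s for yi, mi, s in zip(y, mu, sigma))
        gaps.append(abs(covered / len(y) - p))
    return sum(gaps) / len(gaps)


def sharpness(sigma, level=0.95):
    """Average width of the central prediction interval at the given level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return sum(2 * z * s for s in sigma) / len(sigma)


def time_call_ms(fn, *args, repeats=10):
    """Average wall-clock time of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats
```

Because the NLL here is a log-density rather than a log-probability, it can be negative when the predicted standard deviations are small, which is consistent with negative NLL values in a regression benchmark.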

Method Comparison (1D Regression)

Method	RMSE	NLL	Calibration Error	Inference Time (ms)
Deep Ensemble (5)	0.042	-1.82	0.018	12.3
Laplace (kron)	0.044	-1.75	0.024	3.1
Laplace (diag)	0.044	-1.68	0.031	1.8
MC Dropout (50)	0.048	-1.61	0.035	8.7
SGLD (20 samples)	0.043	-1.78	0.021	45.2
Bayes by Backprop	0.046	-1.72	0.028	6.4
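
One way to read the comparison table is as a trade-off between predictive quality and cost. The sketch below is not part of the benchmark suite. It copies the NLL and time columns from the table and lists the methods that no other method beats on both at once (the Pareto front). The helper names `dominates` and `pareto_front` are made up for this example.

```python
# (method, NLL, inference time in ms), copied from the comparison table
rows = [
    ("Deep Ensemble (5)", -1.82, 12.3),
    ("Laplace (kron)", -1.75, 3.1),
    ("Laplace (diag)", -1.68, 1.8),
    ("MC Dropout (50)", -1.61, 8.7),
    ("SGLD (20 samples)", -1.78, 45.2),
    ("Bayes by Backprop", -1.72, 6.4),
]


def dominates(a, b):
    """True if a is no worse than b on NLL and time, and strictly better on one."""
    return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])


def pareto_front(rows):
    """Names of the methods that are not dominated by any other method."""
    return [r[0] for r in rows if not any(dominates(o, r) for o in rows)]


print(pareto_front(rows))
# → ['Deep Ensemble (5)', 'Laplace (kron)', 'Laplace (diag)']
```

On these two columns alone, the remaining three methods are each matched or beaten by a faster method with lower NLL. The ranking could change once RMSE, calibration error, or a different dataset is taken into account.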

Reproducibility

These results come from the built-in benchmark suite with its default configuration. To reproduce them, run `python -m benchmarks.run_benchmarks --seed 42`.

Custom Benchmarks

from benchmarks.config import BenchmarkConfig
from benchmarks.run_benchmarks import run_benchmark_suite

# Benchmark only the selected methods and datasets.
config = BenchmarkConfig(
    methods=["ensemble", "laplace_kron", "mc_dropout"],
    datasets=["sine", "uci_energy"],
    n_trials=5,
)
results = run_benchmark_suite(config)

See the `benchmarks/` directory for the full set of configuration options.
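
With `n_trials` greater than 1, a metric is usually reported as a mean and spread across trials. This page does not document the structure of the object returned by `run_benchmark_suite`, so the sketch below does not touch it. It only shows the aggregation step on a plain list of per-trial values. The `summarize` helper and the numbers are hypothetical placeholders, not output of the suite.

```python
from statistics import mean, stdev


def summarize(trial_values):
    """Return (mean, sample standard deviation) of one metric across trials."""
    if len(trial_values) < 2:
        # A single trial has no spread to report.
        return mean(trial_values), 0.0
    return mean(trial_values), stdev(trial_values)


# Placeholder values for illustration only, NOT results from the benchmark suite.
placeholder_rmse = [0.10, 0.12, 0.11, 0.09, 0.13]

m, s = summarize(placeholder_rmse)
print(f"RMSE: {m:.3f} ± {s:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a small gap between two methods is larger than the trial-to-trial variation.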