Backtest metrics

merton.backtest provides standalone implementations of the metrics most commonly used in credit-risk model validation. Where scikit-learn offers an equivalent metric, the results match it to machine precision; we avoid the sklearn dependency so that validation suites can be audited without optional dependencies.

| Metric | Function | Range | Interpretation |
|---|---|---|---|
| AUC | auc(p, y) | [0, 1] | Probability that a random defaulter has a higher PD than a random non-defaulter. 0.5 is random; 1.0 is perfect. |
| Accuracy Ratio (Gini) | accuracy_ratio(p, y) | [-1, 1] | 2·AUC − 1; favoured by Moody's and some regulators. |
| Brier | brier(p, y) | [0, 1] | Mean squared error between PD and the 0/1 default indicator. Lower is better. |
| KS statistic | ks_statistic(p, y) | [0, 1] | Maximum gap between the defaulter and non-defaulter CDFs of PD. |
| Hosmer-Lemeshow χ² | hosmer_lemeshow(p, y, bins=10) | [0, ∞) | Goodness-of-fit χ² statistic. Use scipy.stats.chi2.sf(chi2, dof) for the p-value; dof is conventionally bins − 2 when the model was fitted on the same sample. |
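To make the definitions in the table concrete, here is a minimal pure-Python sketch of the four simplest metrics. It is not the merton.backtest implementation: the `*_sketch` names are hypothetical, the AUC uses a direct O(n²) pairwise count with ties scored as one half, and `p` and `y` are assumed to be equal-length sequences of PDs and 0/1 default flags.

```python
from bisect import bisect_right


def auc_sketch(p, y):
    """Pairwise AUC: share of (defaulter, non-defaulter) pairs ranked correctly."""
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    if not pos or not neg:
        raise ValueError("need at least one defaulter and one non-defaulter")
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def accuracy_ratio_sketch(p, y):
    """Accuracy Ratio (Gini) as a linear rescaling of AUC."""
    return 2.0 * auc_sketch(p, y) - 1.0


def brier_sketch(p, y):
    """Mean squared error between PD and the 0/1 default indicator."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(p)


def ks_sketch(p, y):
    """Largest gap between the empirical PD CDFs of the two classes."""
    pos = sorted(pi for pi, yi in zip(p, y) if yi == 1)
    neg = sorted(pi for pi, yi in zip(p, y) if yi == 0)
    if not pos or not neg:
        raise ValueError("need at least one defaulter and one non-defaulter")
    gap = 0.0
    for t in set(p):  # the gap can only change at an observed score
        cdf_pos = bisect_right(pos, t) / len(pos)
        cdf_neg = bisect_right(neg, t) / len(neg)
        gap = max(gap, abs(cdf_pos - cdf_neg))
    return gap


# Toy data, invented for illustration only.
pds = [0.9, 0.8, 0.3, 0.2, 0.1]
flags = [1, 0, 1, 0, 0]

toy_auc = auc_sketch(pds, flags)
toy_ar = accuracy_ratio_sketch(pds, flags)
toy_brier = brier_sketch(pds, flags)
toy_ks = ks_sketch(pds, flags)
```

On the toy data, five of the six defaulter/non-defaulter pairs are ordered correctly, so the AUC is 5/6 and the Accuracy Ratio is 2/3.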

Calibration curves

from merton.backtest import calibration_curve, calibration_plot
cc = calibration_curve(predictions, defaults, bins=10, strategy="quantile")
calibration_plot(predictions, defaults)

A well-calibrated model has cc.fraction_positives ≈ cc.mean_predicted in every bucket. Persistent over- or under-prediction shows up as the curve diverging from the 45° reference line.
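The bucket comparison can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, and "quantile" bucketing is approximated by sorting on PD and slicing into near-equal-sized groups, which may treat tied PDs differently from calibration_curve. The same bucket totals also feed a textbook Hosmer-Lemeshow χ², shown here for reference.

```python
def quantile_buckets(p, y, bins=10):
    """Sort by predicted PD and split into `bins` near-equal-sized buckets."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    buckets = []
    for b in range(bins):
        lo, hi = b * n // bins, (b + 1) * n // bins
        if hi > lo:  # drop empty buckets when n < bins
            buckets.append(pairs[lo:hi])
    return buckets


def calibration_curve_sketch(p, y, bins=10):
    """Return (mean_predicted, fraction_positives), one entry per bucket."""
    mean_predicted, fraction_positives = [], []
    for bucket in quantile_buckets(p, y, bins):
        mean_predicted.append(sum(pi for pi, _ in bucket) / len(bucket))
        fraction_positives.append(sum(yi for _, yi in bucket) / len(bucket))
    return mean_predicted, fraction_positives


def hosmer_lemeshow_sketch(p, y, bins=10):
    """Sum over buckets of (observed - expected)^2 / (n * mean_pd * (1 - mean_pd))."""
    chi2 = 0.0
    for bucket in quantile_buckets(p, y, bins):
        n = len(bucket)
        mean_pd = sum(pi for pi, _ in bucket) / n
        observed = sum(yi for _, yi in bucket)
        expected = n * mean_pd
        variance = n * mean_pd * (1.0 - mean_pd)
        if variance > 0.0:  # skip buckets whose mean PD is exactly 0 or 1
            chi2 += (observed - expected) ** 2 / variance
    return chi2


# Toy data, invented for illustration only.
pds = [0.05, 0.10, 0.15, 0.20, 0.60, 0.70, 0.80, 0.90]
flags = [0, 0, 0, 1, 0, 1, 1, 1]

mean_pred, frac_pos = calibration_curve_sketch(pds, flags, bins=2)
hl_chi2 = hosmer_lemeshow_sketch(pds, flags, bins=2)
```

In this toy case the low-PD bucket predicts 12.5% on average but realises 25% defaults (under-prediction), while the high-PD bucket predicts and realises 75%.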

Rolling window

from merton.backtest import rolling_window
result = rolling_window(panel_df, window="252D", step="21D",
                        pd_col="pd", default_col="default", date_col="date")
result.to_pandas()  # one row per window with AUC, AR, Brier, KS
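For intuition, the windowing idea can be sketched without pandas. Everything here is an assumption made for illustration: the function names are hypothetical, windows are half-open calendar-day intervals anchored at the first observation, only full windows are kept, and only the Brier score is computed. rolling_window may align, label, or truncate windows differently.

```python
from datetime import date, timedelta


def brier_sketch(p, y):
    """Mean squared error between PD and the 0/1 default indicator."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(p)


def rolling_metric_sketch(rows, window_days, step_days, metric):
    """rows: (date, pd, default) tuples. Returns [(start, end, value), ...]."""
    rows = sorted(rows)
    if not rows:
        return []
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    first, last = rows[0][0], rows[-1][0]
    out = []
    start = first
    # Keep only windows that fit entirely inside the observed date range.
    while start + window <= last + timedelta(days=1):
        end = start + window  # exclusive upper bound
        chunk = [(p, y) for d, p, y in rows if start <= d < end]
        if chunk:
            p, y = zip(*chunk)
            out.append((start, end, metric(p, y)))
        start += step
    return out


# Toy panel, invented for illustration: ten daily observations with a flat PD.
panel = [(date(2024, 1, 1) + timedelta(days=i), 0.5, i % 2) for i in range(10)]
windows = rolling_metric_sketch(panel, window_days=4, step_days=3, metric=brier_sketch)
```

Passing the metric as a callable keeps the windowing logic separate from the scoring, so any of the metrics above could be substituted for the Brier score.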