DEV Community 1h ago

Credit risk is more than predicting default: building the full stack in Python (IFRS 9 ECL, scorecards, monitoring)

The data and the splits

The tape is one file of ~2.26M loans. The trick is that it is really three populations, and you need different slices for different jobs:

CHARGED = {'Charged Off', 'Default', 'Does not meet the credit policy. Status:Charged Off'}
PAID = {'Fully Paid', 'Does not meet the credit policy. Status:Fully Paid'}
ACTIVE = {'Current', 'In Grace Period', 'Late (16-30 days)', 'Late (31-120 days)'}

df['completed'] = df['loan_status'].isin(CHARGED | PAID)  # for PD training
df['default'] = df['loan_status'].isin(CHARGED).astype(int)
df['active'] = df['loan_status'].isin(ACTIVE) & (df['out_prncp'] > 0)  # the live book

Two habits that run through everything:

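Out-of-time validation, not random split. Train on older vintages and test on newer ones. A random split only shows how the model does on loans from the same period it was trained on; an out-of-time test is the one that shows how it behaves on vintages originated after the build.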

dev = df[df['issue_year'] <= 2015]  # build here
oot = df[df['issue_year'] >= 2016]  # judge here
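
A minimal sketch of deriving the vintage and making the split. It assumes the origination month arrives as text in an `issue_d` column formatted like `Dec-2015`; both the column name and the format are assumptions to check against your copy of the file, and the toy frame is illustrative only.

```python
import pandas as pd

# Toy stand-in for the loan tape; `issue_d` and its 'Mon-YYYY' format are assumptions.
toy = pd.DataFrame({
    "issue_d": ["Jan-2014", "Dec-2015", "Jan-2016", "Jun-2017"],
    "default": [0, 1, 0, 1],
})

# Parse the origination month and keep the year as the vintage.
toy["issue_year"] = pd.to_datetime(toy["issue_d"], format="%b-%Y").dt.year

dev_toy = toy[toy["issue_year"] <= 2015]  # development sample: older vintages
oot_toy = toy[toy["issue_year"] >= 2016]  # out-of-time sample: newer vintages

# The two samples must not share a vintage, or the test is no longer out of time.
assert set(dev_toy["issue_year"]).isdisjoint(set(oot_toy["issue_year"]))
```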

No leakage. I drop the platform's own grade and interest rate from every model, so each model earns its signal from borrower attributes rather than copying someone else's pricing.

And one engineering note up front: the raw file is ~390MB gzipped, so read it in chunks with usecols to keep memory sane:

for ch in pd.read_csv(RAW, usecols=COLS, chunksize=300_000, low_memory=False):
    ...
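
One way the loop body might look, as a sketch rather than the repo's actual code: shrink each chunk, collect the pieces, and concatenate once. The in-memory CSV and the names `RAW_TOY` and `COLS_TOY` are hypothetical stand-ins for the real file and column list.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the gzipped tape (illustrative data only).
RAW_TOY = io.StringIO(
    "loan_status,out_prncp,unused_col\n"
    "Fully Paid,0,x\n"
    "Current,1200.5,y\n"
    "Charged Off,0,z\n"
    "Current,0,w\n"
)
COLS_TOY = ["loan_status", "out_prncp"]

parts = []
for chunk in pd.read_csv(RAW_TOY, usecols=COLS_TOY, chunksize=2):
    # Chunks can infer different dtypes (the second chunk here parses as int),
    # so cast explicitly; float32 also halves the memory of float64.
    chunk["out_prncp"] = chunk["out_prncp"].astype("float32")
    parts.append(chunk)

# Concatenate once at the end; growing a frame inside the loop copies it every time.
tape = pd.concat(parts, ignore_index=True)
```

Casting inside the loop matters because type inference runs per chunk, so the same column can come back with different dtypes in different chunks.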

Part 1: the IFRS 9 ECL engine

The whole thing reduces to one line, ECL = PD x LGD x EAD, discounted and summed, but each term is its own small project.

PD. A logistic model, validated out of time. The model gives a lifetime PD; for the 12-month figure that Stage 1 needs, convert it under a constant-hazard assumption over the remaining term:

pd_12m = 1 - (1 - pd_life) ** (np.minimum(12, rem_months) / rem_months)
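
The reasoning behind that line: under a constant monthly hazard, survival to maturity is `1 - pd_life`, so monthly survival is that value raised to `1 / rem_months`, and 12-month survival is monthly survival raised to `min(12, rem_months)`. A small sketch with a worked check follows; the function name and the floor of one remaining month are assumptions of this example, not something the formula above handles.

```python
import numpy as np

def pd_12m_from_lifetime(pd_life, rem_months):
    """12-month PD from a lifetime PD, assuming a constant monthly hazard."""
    pd_life = np.asarray(pd_life, dtype=float)
    rem_months = np.asarray(rem_months, dtype=float)
    # Assumption of this sketch: floor remaining term at 1 month to avoid dividing by zero.
    rem = np.maximum(rem_months, 1.0)
    horizon = np.minimum(12.0, rem)
    return 1.0 - (1.0 - pd_life) ** (horizon / rem)

# Lifetime PD of 19% with 24 months left: survival 0.81, so 12-month survival is
# sqrt(0.81) = 0.9. With only 6 months left the 12-month PD equals the lifetime PD.
example = pd_12m_from_lifetime([0.19, 0.19], [24, 6])
```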

LGD, measured not assumed. This is the bit most tutorials skip. For each charged-off loan, exposure at default is the principal still outstanding when it defaulted, and the recovery is the post-default cash, net of fees:

ead_at_default = (funded_amnt - total_rec_prncp).clip(lower=0)
lgd = (1 - recoveries / ead_at_default).clip(0, 1)
# result on this book: mean LGD = 0.91 (a 9% recovery rate)

That 0.91 is high because the loans are unsecured. On a secured (auto) book it would be much lower and more dispersed, and the whole allowance would shrink.
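
A toy version of the LGD measurement, with one extra guard that is an assumption of this sketch: a charged-off loan with no principal outstanding has no defined LGD, so it is masked rather than divided by zero. Whether `recoveries` is gross or net of collection fees depends on your data and is not handled here. All values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy charged-off loans; column names follow the snippet above, values are made up.
co = pd.DataFrame({
    "funded_amnt":     [10_000.0, 5_000.0, 8_000.0],
    "total_rec_prncp": [4_000.0, 5_000.0, 2_000.0],
    "recoveries":      [600.0, 100.0, 0.0],
})

ead_at_default = (co["funded_amnt"] - co["total_rec_prncp"]).clip(lower=0)
# Mask zero exposure (becomes NaN) so the ratio is undefined instead of infinite.
lgd = (1 - co["recoveries"] / ead_at_default.where(ead_at_default > 0)).clip(0, 1)

mean_lgd = lgd.mean()  # simple average; NaN rows are skipped
# Exposure-weighted average over the loans where LGD is defined.
ead_weighted_lgd = (lgd * ead_at_default).sum() / ead_at_default[lgd.notna()].sum()
```

The simple and exposure-weighted averages can differ materially on a real book, so it is worth stating which one feeds the ECL.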

Staging and ECL. Stage 3 is impaired (31+ days past due), Stage 2 is significant-increase (arrears backstops plus a PD-based trigger), Stage 1 is the rest. Then:

ecl = np.where(
    stage == 1,
    ead * pd_12m * lgd / (1 + eir),  # Stage 1: 12-month PD, discounted one year
    np.where(
        stage == 2,
        ead * pd_life * lgd / (1 + eir) ** (rem_yrs / 2),  # Stage 2: lifetime PD, discounted to mid-life
        ead * np.minimum(1.0, lgd) / (1 + eir) ** (rem_yrs / 2),  # Stage 3: PD = 1
    ),
)

Result on the live book ($9.5bn EAD): ECL $1.25bn, coverage 13.1%, with coverage rising 6.6% -> 31% -> 77% across the three stages.
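
For a feel of the arithmetic, here are the same three formulas as a flat `np.select`, run on three toy loans that differ only by stage. This is a sketch with made-up inputs; the function name is hypothetical.

```python
import numpy as np

def ecl_by_stage(stage, ead, pd_12m, pd_life, lgd, eir, rem_yrs):
    """Per-loan ECL using the single-period approximation described above."""
    stage, ead, pd_12m, pd_life, lgd, eir, rem_yrs = map(
        np.asarray, (stage, ead, pd_12m, pd_life, lgd, eir, rem_yrs)
    )
    one_year = 1 + eir                     # Stage 1 discount: one year
    mid_life = (1 + eir) ** (rem_yrs / 2)  # Stage 2 and 3 discount: half the remaining term
    return np.select(
        [stage == 1, stage == 2, stage == 3],
        [ead * pd_12m * lgd / one_year,
         ead * pd_life * lgd / mid_life,
         ead * lgd / mid_life],            # Stage 3: PD = 1
        default=np.nan,                    # an unknown stage propagates as NaN
    )

ecl_toy = ecl_by_stage(
    stage=[1, 2, 3], ead=[1000.0, 1000.0, 1000.0],
    pd_12m=[0.05, 0.05, 0.05], pd_life=[0.20, 0.20, 0.20],
    lgd=[0.9, 0.9, 0.9], eir=[0.10, 0.10, 0.10], rem_yrs=[2.0, 2.0, 2.0],
)
coverage_toy = ecl_toy / 1000.0  # coverage ratio per loan
```

With identical exposure and risk parameters, coverage on these toy loans rises steeply from Stage 1 to Stage 3 purely because of which PD applies.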

The gotcha worth knowing. The most consequential input is not a parameter, it is the Stage 2 trigger, because it swaps a 12-month provision for a lifetime one. I swept it:

for q in np.linspace(0.75, 0.95, 9):
    thr = act['pd_life'].quantile(q)
    stage = np.where(..., (act['pd_life'] > thr), ...)
    # total ECL ranges ~$1.10bn -> $1.37bn as you flag 5% -> 25% of the book as Stage 2

Roughly a quarter of a billion dollars (about $0.27bn) hangs on one threshold. Worth knowing before you trust the headline number.
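
The shape of such a sweep can be reproduced on synthetic data. The sketch below is deliberately simplified and is not the staging rule used above: Stage 2 is assigned by the PD quantile alone, with no arrears backstops and no Stage 3, and every input (PD distribution, terms, LGD, EIR) is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Synthetic performing book (illustrative only).
pd_life = rng.beta(2, 10, size=n)
rem_months = rng.integers(6, 61, size=n)
ead = rng.uniform(1_000, 30_000, size=n)
lgd, eir = 0.9, 0.12

pd_12m = 1 - (1 - pd_life) ** (np.minimum(12, rem_months) / rem_months)
rem_yrs = rem_months / 12

sweep = {}
for q in np.linspace(0.75, 0.95, 9):
    thr = np.quantile(pd_life, q)
    stage2 = pd_life > thr  # simplified trigger: PD threshold only
    ecl = np.where(
        stage2,
        ead * pd_life * lgd / (1 + eir) ** (rem_yrs / 2),  # lifetime provision
        ead * pd_12m * lgd / (1 + eir),                     # 12-month provision
    )
    sweep[round(float(q), 3)] = float(ecl.sum())

totals = list(sweep.values())  # ordered from 25% flagged down to 5% flagged
```

Plotting `totals` against the share flagged makes the sensitivity easy to show in a validation report.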

Part 2: a WoE scorecard, then break it

Scorecards in banking are not gradient-boosted black boxes; they are Weight-of-Evidence logistic models scaled to points, because they have to be explainable.

WoE and Information Value:

woe = np.log((dist_good + eps) / (dist_bad + eps))
iv = ((dist_good - dist_bad) * woe).sum()
# select features by IV, e.g. >= 0.02
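
To make `dist_good` and `dist_bad` concrete, here is a small sketch for one pre-binned feature, assuming a target coded 1 = bad. The helper name, the bin labels, and the counts are illustrative.

```python
import numpy as np
import pandas as pd

def woe_iv(binned, target, eps=1e-6):
    """WoE per bin and total IV for one pre-binned feature (target: 1 = bad)."""
    tab = pd.crosstab(binned, target)
    dist_good = tab[0] / tab[0].sum()  # share of all goods falling in each bin
    dist_bad = tab[1] / tab[1].sum()   # share of all bads falling in each bin
    woe = np.log((dist_good + eps) / (dist_bad + eps))
    iv = ((dist_good - dist_bad) * woe).sum()
    return woe, iv

# Two bins of 100 loans each: 10 bads in the first, 30 in the second.
bins = pd.Series(["low_dti"] * 100 + ["high_dti"] * 100)
bad = pd.Series([0] * 90 + [1] * 10 + [0] * 70 + [1] * 30)
woe_toy, iv_toy = woe_iv(bins, bad)
```

With this sign convention a positive WoE marks a bin that holds relatively more goods than bads, so `low_dti` comes out positive and `high_dti` negative.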

Fit a logistic on the WoE values, then scale to points (the classic PDO formulation, 20 points to double the odds, anchored at 600 for 50:1):

factor = PDO / np.log(2)
offset = BASE - factor * np.log(BASE_ODDS)
points = -(coef * woe + intercept / n) * factor + offset / n  # per characteristic
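
A quick sanity check on the scaling, assuming the logistic model predicts default (so its log-odds are the log-odds of bad): good:bad odds of 50:1 should score 600, and 100:1 should score 620. The coefficients, WoE values, and intercept in the second half are made up, and serve only to show that the per-characteristic points add up to the total score.

```python
import numpy as np

PDO, BASE, BASE_ODDS = 20, 600, 50  # from the text: 20 PDO, 600 points at 50:1

factor = PDO / np.log(2)
offset = BASE - factor * np.log(BASE_ODDS)

def total_score(logit_bad):
    """Score implied by the model's log-odds of default (higher score = safer)."""
    return offset - factor * logit_bad

# Hypothetical two-characteristic scorecard (all numbers made up).
coefs = np.array([-0.8, -1.1])  # coefficients on the WoE inputs
woes = np.array([0.30, -0.20])  # WoE of the bins this applicant falls in
intercept, n = -3.0, 2

points = -(coefs * woes + intercept / n) * factor + offset / n  # per characteristic
logit_bad = intercept + (coefs * woes).sum()
```

Spreading the intercept and offset evenly across the `n` characteristics is what lets each row of the scorecard carry its own points without a separate base score.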

Now the part that matters more than the build: validation. Four tests on the out-of-time sample.

Discrimination holds: Gini 0.356 out of time (0.385 in development), KS 0.256.
Calibration is where it nearly slipped. Discrimination tells you the ranking is right; it says nothing about whether the predicted probability is right. Check the level:

pred_over_observed = oot['pd'].mean() / oot['target'].mean()  # 0.77

0.77 means the model under-predicts default by ~23% on recent vintages, on every score band. Fine for ranking, not safe for pricing or ECL until recalibrated.
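
Checking the level band by band might look like the sketch below. The data is synthetic and built so that the true default rate sits 30% above the prediction everywhere, which gives an overall ratio near 1/1.3; the column names mirror the snippet above, everything else is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000
pd_hat = rng.beta(2, 12, size=n)  # synthetic predicted PDs
# Outcomes drawn from a true PD 30% above the prediction: the model under-predicts.
outcome = rng.random(n) < np.clip(pd_hat * 1.3, 0, 1)

oot_toy = pd.DataFrame({"pd": pd_hat, "target": outcome.astype(int)})
oot_toy["band"] = pd.qcut(oot_toy["pd"], 10, labels=False)  # deciles of predicted PD

by_band = oot_toy.groupby("band").agg(pred=("pd", "mean"), obs=("target", "mean"))
by_band["pred_over_obs"] = by_band["pred"] / by_band["obs"]
overall = oot_toy["pd"].mean() / oot_toy["target"].mean()
```

A ratio that is below 1 in every band points to a level shift, which a recalibration of the intercept can address; a ratio that crosses 1 across bands points to a slope problem instead.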

The trap inside the trap: PSI vs calibration. Population Stability Index checks whether the applicant mix shifted:

psi = ((dev_pct - oot_pct) * np.log((dev_pct + eps) / (oot_pct + eps))).sum()  # 0.013, stable

PSI was a tiny 0.013. A monitor watching only PSI would flash green while the model quietly went biased, because a stable population does not mean an accurate model. Different questions; check both.
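
For completeness, a sketch of how `dev_pct` and `oot_pct` could be produced: cut decile edges on the development scores and reuse them on the out-of-time scores. The choice of ten bins and the synthetic score distributions are assumptions of this example.

```python
import numpy as np

def psi(dev_scores, oot_scores, n_bins=10, eps=1e-6):
    """PSI of oot_scores against dev_scores, with bin edges cut on dev quantiles."""
    dev_scores = np.asarray(dev_scores, dtype=float)
    oot_scores = np.asarray(oot_scores, dtype=float)
    # Interior edges only, so values outside the dev range fall in the end bins.
    edges = np.quantile(dev_scores, np.linspace(0, 1, n_bins + 1))[1:-1]
    dev_bin = np.searchsorted(edges, dev_scores)
    oot_bin = np.searchsorted(edges, oot_scores)
    dev_pct = np.bincount(dev_bin, minlength=n_bins) / len(dev_scores)
    oot_pct = np.bincount(oot_bin, minlength=n_bins) / len(oot_scores)
    return ((dev_pct - oot_pct) * np.log((dev_pct + eps) / (oot_pct + eps))).sum()

rng = np.random.default_rng(2)
dev_s = rng.normal(600, 50, 20_000)
psi_same = psi(dev_s, rng.normal(600, 50, 20_000))   # same distribution
psi_shift = psi(dev_s, rng.normal(630, 50, 20_000))  # mean shifted by 0.6 sd
```

Nothing in this function looks at outcomes, which is exactly why a low PSI cannot vouch for calibration.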

Effective challenge. Benchmark against a HistGradientBoostingClassifier on raw features: Gini 0.401 vs the scorecard's 0.356. A small lift, not enough to justify losing the transparency, so it becomes a watch item rather than a rebuild.
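
The Gini and KS figures quoted in this validation can be computed from the predicted PDs and outcomes alone. A NumPy-only sketch follows, assuming a target coded 1 = bad and treating Gini as 2 x AUC - 1; the function name and the toy data are illustrative.

```python
import numpy as np

def gini_and_ks(y, p):
    """Gini (2*AUC - 1) and KS for a binary target y and predicted PD p."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    bad, good = np.sort(p[y == 1]), np.sort(p[y == 0])
    # AUC: probability that a random bad is scored above a random good (ties count half).
    less = np.searchsorted(good, bad, side="left")      # goods strictly below each bad
    less_eq = np.searchsorted(good, bad, side="right")  # goods at or below each bad
    auc = (less + 0.5 * (less_eq - less)).sum() / (len(bad) * len(good))
    # KS: largest gap between the two empirical CDFs over all observed values.
    cuts = np.unique(p)
    cdf_bad = np.searchsorted(bad, cuts, side="right") / len(bad)
    cdf_good = np.searchsorted(good, cuts, side="right") / len(good)
    return 2 * auc - 1, np.abs(cdf_good - cdf_bad).max()

y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_toy = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8])
gini_toy, ks_toy = gini_and_ks(y_toy, p_toy)  # 15 of 16 bad/good pairs ranked correctly
```

Running the same function on the scorecard and on the challenger, over the same out-of-time rows, keeps the comparison like for like.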

Verdict: approved with conditions, written up as a RAG-rated validation report. The deliverable of a second-line validation function is not a model; it is a defensible opinion.

Part 3: monitoring the live book

A point-in-time book can look healthy while deteriorating, because delinquency lags. The leading view is the vintage curve: group loans by origination year and track cumulative default by months on book.

# default timing approximated from last payment date
mob_default = last_pymnt_month - issue_month  # months on book at default
cum_default = [(coh_mob <= m).sum() / n * 100 for m in range(25)]  # per cohort

This surfaced the signal the delinquency rate hid: the 12-month-on-book default rate rose from 4.7% (2013) to 6.8% (2016), with the recent cohorts sitting above the older ones at every age.
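
A toy version of the vintage curve, to make the numerator and denominator explicit: the denominator is every loan originated in the cohort, the numerator is the loans that had defaulted by month m. The frame, the column names, and the helper are illustrative, and no right-censoring adjustment is applied here.

```python
import numpy as np
import pandas as pd

# Toy loans: issue year and months on book at default (NaN = never defaulted).
loans = pd.DataFrame({
    "issue_year":  [2013] * 5 + [2016] * 5,
    "mob_default": [6, np.nan, np.nan, np.nan, 20, 3, 9, np.nan, np.nan, 14],
})

def vintage_curve(cohort, max_mob=24):
    """Cumulative default rate (%) by months on book for one cohort."""
    n = len(cohort)                       # denominator: every loan originated
    mob = cohort["mob_default"].dropna()  # numerator candidates: defaulted loans only
    return [(mob <= m).sum() / n * 100 for m in range(max_mob + 1)]

curves = pd.DataFrame(
    {year: vintage_curve(grp) for year, grp in loans.groupby("issue_year")}
)
curves.index.name = "months_on_book"
```

Reading across a row of `curves` compares cohorts at the same age, which is the comparison a point-in-time delinquency rate cannot make.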

Then scorecard drift over time (PSI by vintage against a baseline; it climbed to 0.13), and concentration via a Herfindahl index:

shares = exposure.groupby('addr_state').sum() / exposure.sum()
hhi = (shares ** 2).sum()  # 0.051 by state (diversified); 58% in one product (watch)
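
A worked toy example of the index, with its reciprocal as an "effective number" of equally sized groups. The helper name and the four-loan book are illustrative.

```python
import pandas as pd

def hhi(exposure, by):
    """Herfindahl index of exposure shares across the groups in `by`."""
    shares = exposure.groupby(by).sum() / exposure.sum()
    return (shares ** 2).sum()

# Toy book: four equal loans across three states.
book = pd.DataFrame({
    "addr_state": ["CA", "CA", "NY", "TX"],
    "out_prncp":  [500.0, 500.0, 500.0, 500.0],
})
hhi_state = hhi(book["out_prncp"], book["addr_state"])  # 0.5^2 + 0.25^2 + 0.25^2
effective_n = 1 / hhi_state  # equivalent number of equally sized groups
```

By the same reciprocal, an HHI of 0.051 corresponds to roughly 20 equally sized states, which is one way to explain the figure to a non-technical reader.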

Everything rolls into a RAG early-warning dashboard. The output is intentionally mixed: current losses green, vintage trend red, model drift and product concentration amber. An all-green dashboard on a quietly worsening book is the failure mode.

Gotchas, collected

Read the big gz in chunks; it is ~390MB compressed and far larger once parsed, so do not load it into a single frame in one go.
Exclude the platform's grade/rate or your model just relearns someone else's pricing.
Recent vintages are right-censored in the vintage curves; show them, but read them as partial.
Goodness-of-fit tests (Hosmer-Lemeshow and similar) are useless at 500k rows: with that much data they reject on trivially small deviations. Use the predicted/observed ratio and a calibration plot instead.
Discrimination is not calibration, and PSI is neither. Three different questions, three different checks.
LGD of 0.91 is a feature of unsecured lending, not a bug; a secured book changes the whole picture.

The repos

Each is standalone, reproducible (pip install -r requirements.txt && python analysis.py), and ships an executed notebook plus a written report:

IFRS 9 ECL engine: github.com/gbadedata/ifrs9-ecl-engine
Scorecard + independent validation: github.com/gbadedata/pd-scorecard-validation
Portfolio monitoring + MI pack: github.com/gbadedata/credit-risk-monitoring

Built on public US consumer data; the methods carry directly to a secured book such as auto finance, where the parameters (recovery above all) would differ. If you build something similar, I would be glad to compare notes in the comments.
