
Penalisation and Regularisation in Clinical Prediction Models (CPMs) Explained: Why “shrinkage,” not “swelling.”

  • Writer: Mayta
  • 6 min read

Introduction

Regularisation (also called penalisation or shrinkage) is a modelling strategy used when developing CPMs to reduce overfitting and improve performance in new patients. In practice, it does this by discouraging overly large coefficients—i.e., it makes “complexity” costly.

In CPM development roadmaps, penalisation is positioned within Model Derivation (model fitting) as an alternative to traditional variable selection approaches, particularly when the number of candidate predictors is large relative to sample size (high p/n).

The core math idea: “fit well” + “pay a complexity tax.”

1) Ordinary (unpenalised) model fitting

Model fitting means choosing coefficients β to minimise a loss function.

Linear regression (RSS loss):

\[ \mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \]

Logistic regression (negative log-likelihood loss) — common in CPMs with a binary outcome:

\[ \mathrm{NLL}(\beta) = -\sum_{i=1}^{n} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \Big] \]

where

\[ \hat{p}_i = \frac{1}{1 + \exp\!\big( -(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}) \big)} \]
2) Penalised (regularised) model fitting

Regularisation modifies the objective by adding a penalty term:

\[ \text{Objective}(\beta) = \text{Loss}(\beta) + \lambda \cdot \text{Penalty}(\beta) \]
  • Loss(β): misfit to the development data (RSS or negative log-likelihood)

  • Penalty(β): a “cost” for coefficient magnitude/complexity

  • λ: tuning parameter controlling penalty strength

    • λ = 0 → ordinary model

    • larger λ → stronger shrinkage

(Common convention: the intercept β₀ is not penalised.)
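To make the objective concrete, here is a minimal Python sketch (my addition, not part of the original post) that evaluates a ridge-penalised negative log-likelihood for a logistic model; the toy data and the λ value are illustrative assumptions.

import numpy as np

def penalised_nll(beta, X, y, lam):
    """Negative log-likelihood of a logistic model plus an L2 (ridge) penalty.

    beta[0] is the intercept and, by the usual convention, is not penalised.
    """
    eta = beta[0] + X @ beta[1:]              # linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))            # predicted probabilities
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = lam * np.sum(beta[1:] ** 2)     # complexity "tax" on the slopes only
    return nll + penalty

# Illustrative toy data (assumed, not from the post)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.uniform(size=100) < 0.5).astype(float)
beta = np.zeros(4)
print(penalised_nll(beta, X, y, lam=1.0))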

Continuous and categorical predictors: what “coefficients” really mean

Regularisation acts on coefficients, so it’s essential to understand how predictors create coefficients.

Continuous predictors

A continuous predictor usually contributes one coefficient if entered linearly (e.g., age). But CPM guidance warns against dichotomising continuous variables (e.g., “age ≥ 65”) because it throws away information; instead, consider flexible forms such as splines/polynomials when needed.
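As a quick illustration of a flexible form, the sketch below (my addition, assuming scikit-learn ≥ 1.0 is available) expands a continuous predictor such as age into spline basis columns before model fitting; the ages are invented for illustration.

import numpy as np
from sklearn.preprocessing import SplineTransformer

# Illustrative ages (assumed data, not from the post)
age = np.array([[34.0], [47.0], [58.0], [63.0], [71.0], [80.0]])

# Cubic spline basis with a few knots: age now contributes several coefficients,
# letting the fitted risk curve bend instead of forcing a straight line or a cut-point.
spline = SplineTransformer(degree=3, n_knots=4, include_bias=False)
age_basis = spline.fit_transform(age)
print(age_basis.shape)   # (6, number_of_basis_columns)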


Categorical predictors (3+ levels)

A categorical predictor typically creates multiple coefficients through indicator (dummy) coding.

If Z has K categories, choosing one reference category produces K−1 dummy variables, hence K−1 coefficients for that single predictor. In CPM practice, rare categories should often be combined to avoid sparse data problems.

So: regularisation doesn’t penalise “a variable” in the abstract—it penalises the set of coefficients created by how that variable is encoded.


Suppose you have a categorical predictor Z with K categories, e.g.

Z ∈ {1,2,…,K}

To use it in regression, you encode it into dummy/indicator variables (one-hot encoding with a reference category).

Reference-category coding (most common)

Pick a reference level (say level 1). Create K−1 dummies:

\[ D_k = \mathbf{1}\{Z = k\}, \qquad k = 2, \dots, K \]

Then the linear predictor becomes:

\[ \eta = \beta_0 + \beta_2 D_2 + \beta_3 D_3 + \dots + \beta_K D_K \]
So one categorical predictor with K levels contributes K−1 coefficients.
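A minimal pandas sketch of reference-category coding (my addition; the predictor name and its levels are invented for illustration):

import pandas as pd

# Illustrative data (assumed): a 4-level categorical predictor
df = pd.DataFrame({"smoking": ["never", "former", "current", "never", "heavy"]})

# Reference-category (dummy) coding: K = 4 levels -> K - 1 = 3 indicator columns.
# drop_first=True drops one level, which becomes the reference category.
dummies = pd.get_dummies(df["smoking"], prefix="smoking", drop_first=True)
print(dummies)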


Interpretation in logistic regression

For a binary outcome, the CPM (logistic regression) is:

\[ \log\!\frac{P(Y=1)}{1 - P(Y=1)} = \beta_0 + \beta_2 D_2 + \dots + \beta_K D_K \]

Each β_k is the log-odds difference for category k vs the reference category. The odds ratio for level k vs the reference is:

\[ \mathrm{OR}_k = e^{\beta_k} \]

Ridge (L2) regularisation

Objective function

\[ \min_{\beta}\; \Big\{ \text{Loss}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2} \Big\} \]
What L2 does conceptually

  • Shrinks all coefficients toward 0 (reducing sensitivity to noise)

  • Usually does not set coefficients exactly to 0, so it tends to keep all predictors (but with smaller effects)

  • Often stabilises estimates when predictors are correlated (see the sketch below)
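A minimal scikit-learn sketch of ridge-penalised logistic regression (assumed tooling and invented data, not from the original post). Note that scikit-learn parameterises the penalty as C = 1/λ, so a smaller C means stronger shrinkage.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data (assumed)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# penalty="l2" is ridge; C = 1/λ, so C=0.1 corresponds to fairly strong shrinkage.
ridge_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
ridge_model.fit(X, y)
print(ridge_model.coef_)   # shrunken coefficients, typically none exactly zero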

LASSO (L1) regularisation

Objective function

\[ \min_{\beta}\; \Big\{ \text{Loss}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big\} \]
What L1 does conceptually

  • Shrinks coefficients toward 0

  • Can shrink some coefficients exactly to 0, creating a sparser model (automatic variable selection)

Important for categorical predictors: because a factor with K levels has K−1 dummy coefficients, standard LASSO can set some level-coefficients to 0 while leaving others nonzero. That may be fine, but it means selection happens at the level of individual categories rather than at the level of the whole variable.

Penalisation (including LASSO) is explicitly highlighted as an option for variable selection during model derivation in CPM workflows.
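To illustrate that selection behaviour, the sketch below (assumed tooling and invented data, my addition) fits an L1-penalised logistic regression and prints the coefficients; the noise predictors typically come out exactly zero.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data (assumed): only the first two columns carry signal
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
logit = 1.0 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-logit))).astype(int)

# penalty="l1" is the LASSO; the saga solver supports it. C = 1/λ.
lasso_model = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
lasso_model.fit(X, y)
print(np.round(lasso_model.coef_, 3))   # noise predictors are typically exactly 0.0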

Elastic Net (L1 + L2): best of both worlds

Objective function

\[ \min_{\beta}\; \Big\{ \text{Loss}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \Big\} \]

A common parameterisation uses a single λ and a mixing weight α ∈ [0, 1]:

\[ \min_{\beta}\; \Big\{ \text{Loss}(\beta) + \lambda \Big( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^{2} \Big) \Big\} \]
  • α = 1: LASSO

  • α = 0: Ridge

  • 0 < α < 1: Elastic Net

Why it exists

  • Keeps LASSO’s ability to drive some coefficients to 0 (sparsity)

  • Adds Ridge’s stabilising effect (helpful with correlated predictors)

  • Highlighted in CPM guidance as a penalisation option for high p/n settings (a minimal code sketch follows below)
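A minimal elastic-net sketch in the same style (assumed tooling and invented data, my addition); scikit-learn exposes the mixing weight as l1_ratio, which plays the role of α.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data (assumed): two strongly correlated signal predictors plus noise columns
rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)          # strongly correlated with x1
noise = rng.normal(size=(300, 6))
X = np.column_stack([x1, x2, noise])
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-(x1 + x2)))).astype(int)

# penalty="elasticnet" requires the saga solver; l1_ratio is the mixing weight α.
enet_model = LogisticRegression(penalty="elasticnet", solver="saga",
                                l1_ratio=0.5, C=0.2, max_iter=10000)
enet_model.fit(X, y)
print(np.round(enet_model.coef_, 3))   # the correlated pair tends to be kept together; noise terms shrink toward or to 0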

Where this fits in the CPM methodology

In the CPM development roadmap:

  • Predictor strategy includes careful handling of categorical predictors (combine rare levels) and continuous predictors (avoid dichotomising; consider splines/polynomials).

  • Model derivation is where penalisation (LASSO/Elastic Net) is applied as part of variable selection / fitting.

  • Performance evaluation should still assess discrimination, calibration, and clinical usefulness (e.g., net benefit/decision curve).

Before we go

  • Regularisation = minimise (misfit + penalty), where λ controls the strength of shrinkage.

  • Continuous predictors: avoid dichotomising; consider flexible forms if needed.

  • Categorical predictors create multiple coefficients, so penalties act on those coefficients; rare levels may need to be combined.

  • Ridge (L2) shrinks; LASSO (L1) shrinks and can set some coefficients to 0; Elastic Net mixes both and is suggested for high p/n.

Penalisation / Regularisation

Penalisation (also called regularisation or shrinkage) means adding a penalty term to the model during fitting. Modern clinical prediction modelling guidance often recommends this approach because it helps control overfitting and makes the model more stable and generalisable, even when you have many predictors.

Regularisation (Shrinkage) — what it does

Regularisation adds a penalty to the optimisation target, so the model is not rewarded for using very large coefficients.

  • The penalty pushes (shrinks) predictor coefficients to be smaller.

  • Smaller coefficients usually mean the model is less sensitive to random noise in the development dataset.

  • This reduces overfitting and improves performance on new patients.

A tuning parameter λ (lambda) controls how strong the penalty is:

  • Small λ → weak penalty → model behaves closer to ordinary regression

  • Large λ → strong penalty → coefficients shrink more (too large can cause underfitting)

In practice, you choose λ using cross-validation (try multiple λ values and keep the one that predicts best on unseen folds).
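In code, that tuning step can look like the following sketch (my addition; LogisticRegressionCV and the toy data are assumptions, and Cs is a grid of C = 1/λ values searched by cross-validation):

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Illustrative data (assumed)
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 12))
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1]))).astype(int)

# Cs=20 builds a grid of 20 candidate C = 1/λ values; 5-fold cross-validation
# keeps the value that predicts best (by log loss) on the held-out folds.
cv_model = LogisticRegressionCV(Cs=20, cv=5, penalty="l2",
                                scoring="neg_log_loss", max_iter=5000)
cv_model.fit(X, y)
print("chosen C (= 1/λ):", cv_model.C_[0])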

Three main regularisation methods (no equations)

1) Ridge Regression (L2 penalty)

  • Penalises the squared size of coefficients.

  • Shrinks coefficients toward zero, but usually does not remove predictors completely.

  • Best when predictors are highly correlated, and you want to keep them all, just with more conservative effect sizes.

2) LASSO Regression (L1 penalty)

  • Penalises the absolute size of coefficients.

  • Can shrink some coefficients exactly to zero, effectively dropping predictors from the model.

  • Gives you shrinkage + variable selection in one method.

  • Best when you have many predictors and want a simpler model that keeps only the strongest contributors.

3) Elastic Net

  • A hybrid of Ridge and LASSO.

  • Useful when predictors come in correlated groups:

    • It can still remove unhelpful predictors (like LASSO),

    • but it’s less likely to “randomly pick just one” from a correlated group (a common LASSO issue).

  • Often preferred when you have many predictors and correlations are common.

Football manager analogy (same style as before)

Imagine you are a football manager building a team from hundreds of players (your predictors). Your goal is not to win one friendly match (fit the training data), but to win the whole season (generalise to new patients).

Ordinary regression (no penalty)

“You can buy anyone at any price as long as you win today.”

  • You end up paying huge money for flashy players based on short-term highlights.

  • The squad looks brilliant in one match, but falls apart against new opponents. ➡️ This is overfitting: big coefficients, unstable decisions, poor generalisation.

Ridge (L2)

“You may sign everyone, but there’s a luxury tax that grows rapidly for expensive stars.”

  • You keep the full squad,

  • but you reduce everyone’s wage demands and prevent extreme spending on any single star. ➡️ Coefficients get smaller, but almost nobody is cut.

LASSO (L1)

“Every player costs a fixed registration fee—same fee per person.”

  • If a player adds only a little value, they are not worth the fee.

  • You cut them immediately. ➡️ Some coefficients become exactly zero: true variable selection.

Elastic Net

“You have both rules: a registration fee and a luxury tax.”

  • You still cut genuinely weak players (like LASSO),

  • but if two players are a great pair and always play well together (correlated predictors), you don’t cruelly separate them. ➡️ You get a compact team, but you keep important groups intact.

Why “shrinkage” (not “swelling”)?

Because in CPM development, the default problem is that coefficients already tend to be too large / too extreme when you fit a model on limited data with many predictors. That “coefficient inflation” is basically what overfitting looks like in regression.

Here’s the logic in plain terms:

  • The fitting algorithm is rewarded for being extreme. In the development dataset, if you let coefficients grow, the model can “explain” not only the real signal but also the random noise. Training fit improves, but it’s fake improvement.

  • Swelling = higher variance = less stability. Inflated (“swollen”) coefficients swing a lot if you change the sample a bit. That’s high variance, and it makes the model unreliable in new patients.

  • Shrinkage is a controlled correction. Regularisation deliberately pulls coefficients toward 0, making them less extreme. You accept a little bias, but you gain a lot in stability and generalisability—exactly what you want for prediction.

  • Overfitting shows up as poor calibration in new data. Overfit models tend to give predictions that are “too confident” (probabilities too close to 0 or 1). The CPM workflow explicitly highlights calibration checks (intercept/slope) and notes overfitting as a key reason performance drops at external validation; recalibration may then adjust the slope/intercept (a small calibration-check sketch follows below).
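For completeness, here is a small sketch of those calibration checks (my addition; it assumes statsmodels is available, and the validation outcomes and predicted probabilities are simulated rather than taken from any real model):

import numpy as np
import statsmodels.api as sm

# Illustrative validation data (assumed): outcomes y and predicted probabilities p_hat
rng = np.random.default_rng(3)
p_hat = rng.uniform(0.05, 0.95, size=500)
y = (rng.uniform(size=500) < p_hat).astype(float)

lp = np.log(p_hat / (1 - p_hat))   # linear predictor (logit of the predicted probability)

# Calibration slope: regress the outcome on the linear predictor; a slope < 1 suggests overfitting.
slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
print("calibration slope:", round(slope_fit.params[1], 2))

# Calibration intercept (calibration-in-the-large): intercept-only model with lp as an offset.
intercept_fit = sm.GLM(y, np.ones_like(y), family=sm.families.Binomial(), offset=lp).fit()
print("calibration intercept:", round(intercept_fit.params[0], 2))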

The “bias–variance trade-off” in one sentence

Shrinkage works because it adds a bit of bias to cut variance a lot, which usually reduces overall prediction error on new patients.

Why not just “boost/increase” coefficients instead?

Because increasing coefficients would usually:

  • make the model more tailored to the quirks of the development dataset,

  • increase variance,

  • worsen generalisation,

  • and often worsen calibration in new data.

How do we decide how much to shrink?

You don’t pick it by feeling—you tune the penalty strength (λ) using internal validation, commonly cross-validation (or bootstrapping).

Bottom line: In CPMs, the usual enemy is inflated (swollen) coefficients from overfitting, so the fix is shrinkage, not swelling.

