Simplifying Clinical Prediction Models: Correlation, Complexity & Fit in Stata (R², R square, and Predictor Parameters)

Mayta
Jun 11, 2025

✨ Abstract

In the design of clinical prediction models—especially those involving continuous predictors—understanding the interplay between correlation among variables, the use of nonlinear terms, and model selection criteria is pivotal. This guide clarifies when and why we should simplify models, how predictor parameters affect statistical power, and how to decide whether adding complexity like quadratic terms is justified. Visual diagnostics and metrics like Adjusted R², AIC, and BIC are integrated into the analytic strategy.

🧩 Part 1: Why Predictor Correlation Matters

When multiple predictors like BMI, weight, and waist circumference are highly correlated, they may represent overlapping biological constructs. Including all of them:

Inflates standard errors of regression coefficients.
Creates multicollinearity, which weakens interpretation.
Makes the model prone to overfitting, especially in smaller samples.

📌 Stata Syntax to Check:

    corr age weight waist bmi

📉 High correlation (e.g., |r| > 0.8) among predictors? → Consider dropping one of them or combining them into a single measure.
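Pairwise correlations can miss collinearity that involves three or more predictors at once, so a variance inflation factor (VIF) check is a common follow-up. The sketch below is only an illustration: it assumes the outcome is dxa_totalpercentfat (the variable used in the later examples) and that age, weight, waist, and bmi all exist in the dataset in memory.

```stata
* Assumption: dxa_totalpercentfat is the outcome; age, weight, waist, bmi are candidate predictors.
pwcorr age weight waist bmi, sig obs    // pairwise correlations with p-values and n per pair

regress dxa_totalpercentfat age weight waist bmi
estat vif                               // variance inflation factor for each predictor
```

VIFs well above 10 are often read as a warning sign, although no cutoff is universal. Treat the output as a prompt to look closer, not as an automatic drop rule.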

📈 Part 2: What Are Predictor Parameters, and Why “More” Isn’t Always Better?

In a regression model like:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

Each β attached to a predictor (β₁ … βₖ) is a predictor parameter; β₀ is the intercept.
More predictors = more parameters to estimate.
Every added parameter costs statistical power and increases model variance.

📌 Only keep predictors that meaningfully contribute new information to the model.
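One way to keep an eye on this cost is to look at how many observations stand behind each estimated slope. The following is a minimal sketch, assuming the same outcome and predictors as the later examples. It reads the stored results e(N) and e(df_m) that regress leaves behind, and it deliberately does not impose any threshold.

```stata
* Assumption: dxa_totalpercentfat, age, and weight exist in the dataset in memory.
quietly regress dxa_totalpercentfat age weight

scalar n_obs    = e(N)       // observations used in the fit
scalar n_slopes = e(df_m)    // model degrees of freedom = number of slope parameters
scalar obs_per  = n_obs / n_slopes

display "Observations:           " n_obs
display "Slope parameters:       " n_slopes
display "Observations per slope: " %6.1f obs_per
```

Re-running this after adding a term shows directly how much thinner the data are spread across parameters.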

🔁 Part 3: Linear vs. Quadratic – When to Go Nonlinear

Sometimes, predictor-outcome relationships are not straight lines. Quadratic terms (x²) allow the model to capture curved relationships (e.g., optimal health at middle weight, worse at extremes).

📊 Stata Visual Diagnostic:

    twoway (lowess dxa_totalpercentfat weight) ///
           (lfit dxa_totalpercentfat weight)   ///
           (qfit dxa_totalpercentfat weight),  ///
           title("DXA % Fat vs Weight: LOWESS vs Linear vs Quadratic")

Use this to visually decide if a quadratic term is justified:

If qfit ≈ lowess but lfit deviates → add quadratic.
If qfit adds no new pattern → skip it.
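The visual check can be paired with a formal one. The sketch below fits the quadratic model, tests the squared term on its own, and plots the fitted curve. It is an illustration under assumptions: the weight grid 40 to 120 is a hypothetical range in kg and should be replaced with values that match your data.

```stata
* c.weight##c.weight expands to weight plus weight squared (c.weight#c.weight).
regress dxa_totalpercentfat age c.weight##c.weight

testparm c.weight#c.weight          // Wald test of the squared term alone

* Hypothetical weight grid; adjust to the observed range of weight.
margins, at(weight=(40(10)120))
marginsplot                         // fitted curve with confidence intervals
```

A small p-value for the squared term alongside a visibly curved marginsplot supports keeping it. A flat test and a near-straight curve suggest the linear term is enough.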

🔬 Part 4: Evaluating Model Fit—Don’t Rely on R² Alone

❌ Raw R²

Measures % of variance explained.
Never decreases when predictors are added, even if they're junk.

✅ Adjusted R²

Adjusts for the number of predictors.
Use this when comparing models of different complexity.

✅ AIC (Akaike Information Criterion)

Penalizes models for complexity.
Lower = better model fit + parsimony.

✅ BIC (Bayesian Information Criterion)

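Like AIC, but with a heavier complexity penalty: ln(n) per parameter instead of 2, so it is stricter than AIC once n is 8 or more.
Often preferred when the sample size is large and a parsimonious model is the goal.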
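To see what these three metrics actually penalize, it can help to rebuild them by hand from the results regress stores. The sketch below uses the standard definitions (Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k), AIC = −2·lnL + 2k, BIC = −2·lnL + k·ln(n), with k counting all coefficients including _cons). It assumes the same variables as the other examples and is meant as a cross-check against Stata's own output, not a replacement for it.

```stata
quietly regress dxa_totalpercentfat age weight

scalar n_obs = e(N)
scalar k_par = e(rank)      // estimated coefficients, including _cons

scalar r2a_hand = 1 - (1 - e(r2)) * (n_obs - 1) / (n_obs - k_par)
scalar aic_hand = -2 * e(ll) + 2 * k_par
scalar bic_hand = -2 * e(ll) + ln(n_obs) * k_par

display "Adjusted R2 (by hand): " r2a_hand "   Stata e(r2_a): " e(r2_a)
display "AIC (by hand): " aic_hand
display "BIC (by hand): " bic_hand

estat ic                    // Stata's own AIC and BIC for comparison
```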

📌 Stata Model Comparison:

    regress dxa_totalpercentfat age weight
    estat ic    // shows AIC, BIC

    regress dxa_totalpercentfat age weight c.weight#c.weight
    estat ic    // compare again

📉 If AIC/BIC drops with quadratic term → consider keeping it.
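Running estat ic twice means scrolling between two outputs. A side-by-side table is easier to read, and one way to get it is to store each fit and pass both to estimates stats. The names linear and quadratic below are arbitrary labels chosen for this sketch.

```stata
quietly regress dxa_totalpercentfat age weight
estimates store linear

quietly regress dxa_totalpercentfat age c.weight##c.weight
estimates store quadratic

* One table with N, log likelihood, df, AIC, and BIC for both models.
estimates stats linear quadratic
```

The comparison is only meaningful if both models are fit on the same observations, so check that the N column matches.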

🎯 Summary Chart

Metric	Meaning	Rule
R²	Variance explained	Higher = better (but misleading)
Adjusted R²	R² adjusted for parameters	Higher = better if added terms help
AIC	Fit + complexity penalty	Lower = better
BIC	Fit + harsher complexity penalty	Lower = better (esp. large n)

🔚 Conclusion

Building strong clinical prediction models means:

Trimming redundant predictors.
Judiciously adding complexity (quadratic terms) only if justified.
Using Adjusted R², AIC, and BIC to prevent overfitting.

This approach helps keep your model clinically interpretable, statistically defensible, and ready for external validation.

📉 The Problem

Even if R² increases after adding a quadratic term, the larger model:

Uses more degrees of freedom
Makes interpretation harder (e.g., what does weight² mean clinically?)
Adds instability (especially in smaller datasets)
Changes what the intercept (_cons) represents, making it harder to interpret

So it is reasonable to say:

“We can use a 1st-degree (linear) term if the increase in R² doesn't justify the jump in predictor parameters (βs).”

✅ Rule of Thumb

If...	Then...
R² ↑ but only a little	Keep the simpler model
Predictor parameters ↑↑	⚠️ More power used, more noise
Constant (_cons) shifts big	⚠️ Possible sign of instability
Clinical interpretability ↓	⚠️ Simplicity is better
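A large shift in _cons after adding weight² is partly mechanical: with raw weight, the intercept is the predicted value at weight = 0, which is far outside any real data. Centering the predictor is one common remedy. It is not part of the original workflow, and the variable name weight_c is made up for this sketch.

```stata
* Center weight at its sample mean (weight_c is a hypothetical new variable).
summarize weight, meanonly
generate double weight_c = weight - r(mean)

* Same fitted values as the uncentered quadratic model, but _cons now refers
* to a person of average weight (with age = 0).
regress dxa_totalpercentfat age c.weight_c##c.weight_c
```

Centering typically also reduces the correlation between the linear and squared terms, which tends to stabilize their standard errors.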

🔚 Conclusion

You can (and should) use 2nd-degree terms if:

LOWESS or residual plots suggest nonlinearity.
AIC/BIC drop by a meaningful margin with the quadratic term.
There’s a biological/clinical rationale for a curved effect.
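For the residual-plot check in the first criterion, a minimal sketch might look like this. It assumes the linear model from the earlier examples, and resid_lin is a hypothetical name for the saved residuals.

```stata
quietly regress dxa_totalpercentfat age weight

rvpplot weight, yline(0)              // residuals vs. weight, reference line at zero

predict double resid_lin, residuals   // resid_lin is a hypothetical variable name
lowess resid_lin weight               // smoothed trend in the residuals
```

If the smoothed residuals bend away from zero in a U or inverted-U shape, the linear term is likely missing curvature. A flat band around zero suggests it is adequate.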
