Simplifying Clinical Prediction Models: Correlation, Complexity & Fit in Stata (R², R square, and Predictor Parameters)
- Mayta
- Jun 11
✨ Abstract
In the design of clinical prediction models—especially those involving continuous predictors—understanding the interplay between correlation among variables, the use of nonlinear terms, and model selection criteria is pivotal. This guide clarifies when and why we should simplify models, how predictor parameters affect statistical power, and how to decide whether adding complexity like quadratic terms is justified. Visual diagnostics and metrics like Adjusted R², AIC, and BIC are integrated into the analytic strategy.
🧩 Part 1: Why Predictor Correlation Matters
When multiple predictors like BMI, weight, and waist circumference are highly correlated, they may represent overlapping biological constructs. Including all of them:
Inflates standard errors of regression coefficients.
Creates multicollinearity, which weakens interpretation.
Makes the model prone to overfitting, especially in smaller samples.
📌 Stata Syntax to Check:
    corr age weight waist bmi
📉 High pairwise correlation (e.g., |r| > 0.8) among predictors? → Consider dropping one or combining them into a single measure.
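A correlation matrix only shows pairwise overlap. As a minimal sketch, the check can be extended with variance inflation factors after fitting the model. The example assumes Stata's bundled auto dataset as a stand-in (it is not clinical data), with weight, length, and displacement playing the role of correlated body-size measures.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption, not clinical data)
sysuse auto, clear

* Pairwise correlations among three size-related predictors, with p-values
pwcorr weight length displacement, sig

* Fit a model containing all three, then inspect variance inflation factors
regress mpg weight length displacement
estat vif
```

A common, rough convention treats a VIF above 10 as a warning sign, but the threshold is a judgment call rather than a strict rule.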
📈 Part 2: What Are Predictor Parameters, and Why “More” Isn’t Always Better?
In a regression model like:
Y = β0 + β1·X1 + β2·X2 + … + βk·Xk + ε
Each β is a predictor parameter.
More predictors = more parameters to estimate.
Every added parameter uses up a degree of freedom, which costs statistical power and increases the variance of the estimates.
📌 Only keep predictors that meaningfully contribute new information to the model.
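One way to keep an eye on this cost is to count the parameters a fitted model used and relate that to the sample size. The sketch below again assumes the bundled auto dataset as a stand-in. It reads the results that regress leaves behind in e(df_m) and e(N).

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption, not clinical data)
sysuse auto, clear
regress mpg weight length displacement

* e(df_m) = model degrees of freedom (slope parameters, excluding the intercept)
* e(N)    = number of observations used in the fit
display "Predictor parameters (excluding _cons): " e(df_m)
display "Observations per predictor parameter:   " %5.1f e(N)/e(df_m)
```

A low number of observations per parameter is a hint that the model may be asking too much of the data.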
🔁 Part 3: Linear vs. Quadratic – When to Go Nonlinear
Sometimes, predictor-outcome relationships are not straight lines. Quadratic terms (x²) allow the model to capture curved relationships (e.g., optimal health at middle weight, worse at extremes).
📊 Stata Visual Diagnostic:
    twoway (lowess dxa_totalpercentfat weight) ///
           (lfit dxa_totalpercentfat weight) ///
           (qfit dxa_totalpercentfat weight), ///
           title("DXA % Fat vs Weight: LOWESS vs Linear vs Quadratic")
Use this to visually decide if a quadratic term is justified:
If qfit ≈ lowess but lfit deviates → add quadratic.
If qfit adds no new pattern → skip it.
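The visual check can be complemented by looking at the residuals of the linear fit and by testing the squared term directly. This is a minimal sketch under the assumption that the bundled auto dataset stands in for the clinical data. The variable name resid_lin is made up for the example.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption, not clinical data)
sysuse auto, clear

* Fit the straight-line model and save its residuals
regress mpg weight
predict double resid_lin, residuals

* A LOWESS curve that stays flat around zero suggests the linear term is adequate;
* a clear bend suggests trying a quadratic term
twoway (scatter resid_lin weight) (lowess resid_lin weight), yline(0)

* Formal check: Wald test of the squared term in the quadratic model
regress mpg c.weight##c.weight
test c.weight#c.weight
```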
🔬 Part 4: Evaluating Model Fit—Don’t Rely on R² Alone
❌ Raw R²
Measures % of variance explained.
Never decreases when predictors are added, even if they are pure noise.
✅ Adjusted R²
Adjusts for the number of predictors.
Use this when comparing models of different complexity.
✅ AIC (Akaike Information Criterion)
Penalizes models for complexity.
Lower = better balance of fit and parsimony.
✅ BIC (Bayesian Information Criterion)
Like AIC, but more penalty for complexity.
Preferred when the sample size is large.
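For reference, the standard definitions of the three penalized metrics are written out below. The symbols are choices made for this sketch: n is the sample size, p the number of predictors excluding the intercept, L the maximized likelihood, and k the number of estimated parameters.

```latex
\begin{aligned}
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \\
\text{AIC} &= -2\ln L + 2k \\
\text{BIC} &= -2\ln L + k\ln n
\end{aligned}
```

Each extra parameter costs 2 in AIC and ln n in BIC, so BIC penalizes complexity more heavily than AIC whenever n is at least 8.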
📌 Stata Model Comparison:
    regress dxa_totalpercentfat age weight
    estat ic    // shows AIC, BIC
    regress dxa_totalpercentfat age weight c.weight#c.weight
    estat ic    // compare again
📉 If AIC/BIC drops with quadratic term → consider keeping it.
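To see both models side by side rather than reading two separate outputs, the fits can be stored and compared in one table. The sketch below assumes the bundled auto dataset as a stand-in. The names linear, quadratic, r2a_linear, and r2a_quad are arbitrary labels chosen for the example.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption, not clinical data)
sysuse auto, clear

* Model 1: linear in weight
regress mpg weight length
estimates store linear
scalar r2a_linear = e(r2_a)

* Model 2: adds the squared weight term
regress mpg c.weight##c.weight length
estimates store quadratic
scalar r2a_quad = e(r2_a)

* AIC and BIC for both models in one table
estimates stats linear quadratic

* Adjusted R-squared for both models
display "Adjusted R2, linear:    " %6.4f r2a_linear
display "Adjusted R2, quadratic: " %6.4f r2a_quad
```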
🎯 Summary Chart
Metric | Meaning | Rule |
R² | Variance explained | Higher = better (but misleading) |
Adjusted R² | R² adjusted for parameters | Higher = better if added terms help |
AIC | Fit + complexity penalty | Lower = better |
BIC | Fit + harsher complexity penalty | Lower = better (esp. large n) |
🔚 Conclusion
Building strong clinical prediction models means:
Trimming redundant predictors.
Judiciously adding complexity (quadratic terms) only if justified.
Using Adjusted R², AIC, and BIC to prevent overfitting.
This approach ensures your model is clinically interpretable, statistically valid, and ready for external validation.
📉 The Problem
Even if R² increases, a model with an added quadratic term:
Uses more degrees of freedom
Makes interpretation harder (e.g., what does weight² mean clinically?)
Adds instability (especially in smaller datasets)
Dilutes the meaning of the intercept (_cons)
So a first-degree (linear) model is the better choice when the increase in R² doesn't justify the jump in predictor parameters (βs).
✅ Rule of Thumb
If... | Then... |
R² ↑ but only a little | Keep the simpler model unless the gain is large |
Predictor parameters ↑↑ | ⚠️ More degrees of freedom used, more noise |
Constant (_cons) shifts substantially | ⚠️ Model may be becoming unstable |
Clinical interpretability ↓ | ⚠️ Simplicity is better |
🔚 Conclusion
Conversely, you can (and should) use 2nd-degree terms if:
LOWESS or residual plots suggest nonlinearity.
AIC/BIC drop substantially when the quadratic term is added.
There’s a biological/clinical rationale for a curved effect.
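When a quadratic term is kept, its coefficient is hard to read on its own, so one option is to report predicted values and slopes across the predictor's range instead. The following is a sketch under the assumption that the bundled auto dataset stands in for the clinical data. The grid of weight values is picked purely for illustration.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption, not clinical data)
sysuse auto, clear
regress mpg c.weight##c.weight

* Predicted outcome at a grid of weight values (grid chosen for illustration)
margins, at(weight=(2000(500)4500))
marginsplot

* Slope of the curve at the same weights: shows how the effect of weight changes
margins, dydx(weight) at(weight=(2000(500)4500))
```

Because the model is specified with factor-variable notation (c.weight##c.weight), margins accounts for both the linear and the squared term when computing the slope.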