Simplifying Clinical Prediction Models: Correlation, Complexity & Fit in Stata (R², R square, and Predictor Parameters)

✨ Abstract

In the design of clinical prediction models—especially those involving continuous predictors—understanding the interplay between correlation among variables, the use of nonlinear terms, and model selection criteria is pivotal. This guide clarifies when and why we should simplify models, how predictor parameters affect statistical power, and how to decide whether adding complexity like quadratic terms is justified. Visual diagnostics and metrics like Adjusted R², AIC, and BIC are integrated into the analytic strategy.

🧩 Part 1: Why Predictor Correlation Matters

When multiple predictors like BMI, weight, and waist circumference are highly correlated, they may represent overlapping biological constructs. Including all of them:

  • Inflates standard errors of regression coefficients.

  • Creates multicollinearity, which weakens interpretation.

  • Makes the model prone to overfitting, especially in smaller samples.

📌 Stata Syntax to Check:

stata: corr age weight waist bmi

📉 High correlation (e.g., |r| > 0.8) among predictors? → Consider dropping or combining them.
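Pairwise correlations can miss redundancy that involves three or more predictors at once, so a variance inflation factor (VIF) check is a common companion. A minimal sketch, assuming the dataset in memory contains the variables used in this post and that dxa_totalpercentfat is the outcome; the VIF > 10 cutoff in the comment is a widely quoted rule of thumb, not a hard threshold:

```stata
* Assumes the post's dataset is loaded and these variables exist.
* Pairwise correlations with p-values and the number of observations per pair
pwcorr age weight waist bmi, sig obs

* Fit the full model, then ask for variance inflation factors
regress dxa_totalpercentfat age weight waist bmi
estat vif    // VIFs far above the others (often quoted: > 10) flag redundancy
```

If weight, waist, and bmi all show large VIFs, that supports keeping only one of them or combining them.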

📈 Part 2: What Are Predictor Parameters, and Why “More” Isn’t Always Better?

In a regression model like:

  Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

  • Each β is a predictor parameter.

  • More predictors = more parameters to estimate.

  • Every added parameter uses up a degree of freedom, which costs statistical power and increases model variance.

📌 Only keep predictors that meaningfully contribute new information to the model.
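To see the parameter count directly, you can read the degrees of freedom that regress stores after each fit. This is only an illustrative sketch, assuming the same dataset and variable names as elsewhere in this post:

```stata
* Assumes the post's dataset is loaded.
* Linear model: two predictor parameters (age, weight) plus _cons
regress dxa_totalpercentfat age weight
display "Predictor parameters (excluding _cons): " e(df_m)
display "Residual degrees of freedom: " e(df_r)

* Adding a quadratic term spends one more parameter
regress dxa_totalpercentfat age weight c.weight#c.weight
display "Predictor parameters (excluding _cons): " e(df_m)
display "Residual degrees of freedom: " e(df_r)
```

Each extra parameter moves one degree of freedom from the residual to the model, which is the cost the bullets above describe.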

🔁 Part 3: Linear vs. Quadratic – When to Go Nonlinear

Sometimes, predictor-outcome relationships are not straight lines. Quadratic terms (x²) allow the model to capture curved relationships (e.g., optimal health at middle weight, worse at extremes).

📊 Stata Visual Diagnostic:

stata:
twoway (lowess dxa_totalpercentfat weight) ///
       (lfit dxa_totalpercentfat weight) ///
       (qfit dxa_totalpercentfat weight), ///
       title("DXA % Fat vs Weight: LOWESS vs Linear vs Quadratic") ///
       legend(order(1 "LOWESS" 2 "Linear" 3 "Quadratic"))

Use this to visually decide if a quadratic term is justified:

  • If qfit ≈ lowess but lfit deviates → add quadratic.

  • If qfit adds no new pattern → skip it.
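A residual plot is another way to look for curvature, because it removes the linear trend first. The sketch below assumes the same dataset; resid_lin is a hypothetical variable name chosen for this example:

```stata
* Assumes the post's dataset is loaded; resid_lin is a made-up name.
regress dxa_totalpercentfat age weight
predict double resid_lin, residuals

* A flat LOWESS line near zero suggests the linear term is enough;
* a U or inverted-U shape suggests a quadratic term may help.
twoway (scatter resid_lin weight) ///
       (lowess resid_lin weight), ///
       yline(0) title("Residuals from linear model vs Weight")
```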

🔬 Part 4: Evaluating Model Fit—Don’t Rely on R² Alone

❌ Raw R²

  • Measures % of variance explained.

  • Never decreases when predictors are added, even if they're junk.

✅ Adjusted R²

  • Adjusts for the number of predictors.

  • Use this when comparing models of different complexity.
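The adjustment can be reproduced by hand from results that regress already stores, which makes the penalty visible. A small sketch under the same dataset assumption; r2_adj_manual is a hypothetical scalar name:

```stata
* Assumes the post's dataset is loaded.
regress dxa_totalpercentfat age weight

* Adjusted R2 = 1 - (1 - R2) * (N - 1) / (N - k - 1),
* where N - k - 1 is the residual degrees of freedom, e(df_r)
scalar r2_adj_manual = 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)

display "Adjusted R2 by hand:   " r2_adj_manual
display "Adjusted R2 from Stata: " e(r2_a)
```

The two values should agree. As k grows, e(df_r) shrinks, so a predictor has to raise R² by enough to offset that loss.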

✅ AIC (Akaike Information Criterion)

  • Penalizes models for complexity.

  • Lower = better balance of fit and parsimony.

✅ BIC (Bayesian Information Criterion)

  • Like AIC, but with a heavier penalty for complexity.

  • The penalty grows with sample size (ln(n) per parameter), so BIC favors simpler models than AIC, especially when n is large.
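Both criteria are simple functions of the log likelihood, so they can be checked by hand. This is a sketch, assuming the same dataset and assuming the parameter count k equals e(rank) (the coefficients including _cons); k_params, aic_manual, and bic_manual are hypothetical scalar names:

```stata
* Assumes the post's dataset is loaded.
regress dxa_totalpercentfat age weight

* Assumption: k is the number of estimated coefficients, taken from e(rank)
scalar k_params   = e(rank)
scalar aic_manual = -2 * e(ll) + 2 * k_params
scalar bic_manual = -2 * e(ll) + k_params * ln(e(N))

display "AIC by hand: " aic_manual
display "BIC by hand: " bic_manual

estat ic    // Stata's own AIC and BIC, for comparison
```

The only difference between the two formulas is the per-parameter penalty: 2 for AIC versus ln(N) for BIC.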

📌 Stata Model Comparison:

stata:
regress dxa_totalpercentfat age weight
estat ic    // shows AIC, BIC
regress dxa_totalpercentfat age weight c.weight#c.weight
estat ic    // compare again

📉 If AIC/BIC drops with quadratic term → consider keeping it.
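If you prefer to see both models in one table instead of scrolling between outputs, storing the estimates is one option. A minimal sketch, assuming the same dataset; linear and quadratic are hypothetical names for the stored results:

```stata
* Assumes the post's dataset is loaded.
regress dxa_totalpercentfat age weight
estimates store linear

regress dxa_totalpercentfat age weight c.weight#c.weight
estimates store quadratic

* One table with N, log likelihood, df, AIC, and BIC for each model
estimates stats linear quadratic
```

Both models use the same variables, so they are fit on the same observations, which is what makes their AIC and BIC comparable.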

🎯 Summary Chart

| Metric | Meaning | Rule |
|---|---|---|
| R² | Variance explained | Higher = better (but misleading) |
| Adjusted R² | R² adjusted for parameters | Higher = better if added terms help |
| AIC | Fit + complexity penalty | Lower = better |
| BIC | Fit + harsher complexity penalty | Lower = better (esp. large n) |


🔚 Conclusion

Building strong clinical prediction models means:

  • Trimming redundant predictors.

  • Judiciously adding complexity (quadratic terms) only if justified.

  • Using Adjusted R², AIC, and BIC to prevent overfitting.

This approach helps keep your model clinically interpretable, statistically sound, and better prepared for external validation.

📉 The Problem

Even if R² increases when a quadratic term is added, the more complex model:

  • Uses more degrees of freedom

  • Makes interpretation harder (e.g., what does weight² mean clinically?)

  • Adds instability (especially in smaller datasets)

  • Dilutes the meaning of the intercept (_cons)

So yes, you are correct to say:

“We can use 1st-degree if the increase in R² doesn't justify the jump in predictor parameters (βs).”
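If you do keep a quadratic term, centering the predictor is one common way to ease the interpretation issues listed above. The following is only a sketch, assuming the same dataset; weight_c is a hypothetical variable name, and the mean here is taken over all non-missing weights rather than the estimation sample:

```stata
* Assumes the post's dataset is loaded; weight_c is a made-up name.
summarize weight, meanonly
generate double weight_c = weight - r(mean)

* ## expands to weight_c plus weight_c squared
regress dxa_totalpercentfat age c.weight_c##c.weight_c

* The weight_c coefficient is now the slope at mean weight, and _cons
* refers to mean weight (at age 0, unless age is centered as well).
```

Centering does not change the fitted curve or the AIC/BIC. It only changes what the individual coefficients refer to.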

✅ Rule of Thumb for Your Statement

| If... | Then... |
|---|---|
| R² ↑ but only a little | Skip the added term unless the gain is large |
| Predictor parameters ↑↑ | ⚠️ More degrees of freedom used, more noise |
| Constant (_cons) shifts a lot | ⚠️ Model may be becoming unstable |
| Clinical interpretability ↓ | ⚠️ Simplicity is better |

🔚 Conclusion

Conversely, you can (and should) use 2nd-degree terms if:

  • LOWESS or residual plots suggest nonlinearity.

  • AIC/BIC drop substantially when the quadratic term is added.

  • There’s a biological/clinical rationale for a curved effect.
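Once you have decided to keep the curve, it helps to test the quadratic term and to plot what the model predicts. A sketch under the same dataset assumption; the weight grid of 50 to 110 in steps of 10 is an assumed range chosen purely for illustration and should be replaced with values that fit your data:

```stata
* Assumes the post's dataset is loaded.
regress dxa_totalpercentfat age c.weight##c.weight

* Wald test of the quadratic term alone
testparm c.weight#c.weight

* Predicted % fat across an ASSUMED weight range (adjust to your data)
margins, at(weight=(50(10)110))
marginsplot
```

The plot shows the curved effect on the outcome scale, which is usually easier to explain clinically than the weight² coefficient itself.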
