
Stata mfp in Practice: Fractional Polynomials, select(), df(), and the Dummy-Variable xi: Workaround

  • Writer: Mayta

Fractional polynomials, selection control, and using dummy variables (xi: workaround)

This article focuses only on mfp (no multiple imputation, no validation workflow) and is written for researchers who want to:

  1. model non-linear continuous predictors in one regression model, and

  2. understand exactly what the key mfp syntax and options do, especially

    • select()

    • df()

    • and the dummy-variable / xi: workaround when factor-variable syntax is not accepted.


1) What problem does mfp solve?

In many regression models (logistic, Cox, linear, etc.), continuous predictors are often assumed to have a linear effect on the model scale:

  • Logistic: linearity in log-odds

  • Cox: linearity in log-hazard

  • Linear regression: linearity in mean outcome

That assumption can be wrong. Common “bad fixes” include:

  • categorizing continuous variables (information loss, arbitrary cutpoints),

  • guessing quadratic/cubic terms (too ad hoc),

  • univariable screening (unstable).

mfp provides a structured, parametric, reproducible way to:

  • test whether a continuous predictor needs a transformation,

  • pick a transformation from a restricted family (fractional polynomials), and

  • optionally perform backward elimination for predictor selection.


2) What is a fractional polynomial (FP)?

The FP power set

Fractional polynomials use powers from a restricted set, typically:

{-2, -1, -0.5, 0, 0.5, 1, 2, 3}

with the special rule:

  • p = 0 means log(x)

FP1 vs FP2

  • FP1 uses one transformed term: β1·x^p, for one power p from the set above

  • FP2 uses two transformed terms (two powers): β1·x^(p1) + β2·x^(p2); if the two powers are equal, the repeated-power convention makes the second term x^p·log(x)

FP is not “make everything non-linear.” FP is: prove linearity is insufficient, and then choose the simplest adequate curve.
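
To make FP2 concrete, here is a hand-built sketch of one possible FP2 fit for age, assuming powers (0, 2) purely for illustration (the outcome y and the chosen powers are hypothetical; in practice mfp searches the power set and picks the pair itself):

* Sketch only: an FP2 for age with powers (0, 2), built by hand
gen double ln_age = ln(age)
gen double age_sq = age^2
logistic y ln_age age_sq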

3) What mfp does internally (the “loops”)

Think of mfp as doing two tasks:

Task A — Functional form selection (shape)

For each continuous predictor, mfp compares:

  • Linear vs FP1 vs FP2, using deviance / LR-type comparisons, with a controlled testing strategy (closed tests by default; sequential is an alternative).

Task B — Variable selection (optional)

It can also perform backward elimination:

  • a variable is dropped if removing it does not significantly worsen model fit (based on the select() threshold).

Why it runs in cycles

Because once one variable changes shape or drops, the “best shape” of others can change too.

So mfp iterates (“cycles”) until stable.
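
If the model does not stabilize within the default number of cycles, mfp’s cycles() option raises the maximum. A minimal sketch, with a placeholder outcome y:

// Sketch: allow up to 10 shape/selection cycles
mfp, select(0.05) cycles(10) : logistic y age hb wbc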

4) Basic mfp syntax you should memorize

mfp [, options] : regression_cmd yvar xvarlist [, regression_cmd_options]

Examples of regression_cmd include: logistic, logit, stcox, regress, etc.
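
A minimal example with placeholder names (died as the outcome, age and sbp as continuous predictors):

// Sketch: FP shape search for age and sbp using mfp's defaults
mfp : logistic died age sbp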

Special rule: parentheses in xvarlist

You can write:

x1 x2 (x3 x4 x5)

Variables inside parentheses:

  • are tested jointly for inclusion/exclusion

  • are NOT eligible for FP transformation (they remain linear/indicator form)

This is extremely useful for dummy-variable blocks (explained below).
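
For example (placeholder names), three pre-made region dummies can be kept as a linear/indicator block and tested jointly, while age remains eligible for FP transformation:

// Sketch: region2-region4 are pre-made dummies grouped as one block;
// the block is selected jointly and never FP-transformed
mfp, select(0.05) : logistic y age (region2 region3 region4)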

5) Three mfp command patterns used frequently in practice

Below are the three “workhorse” patterns.

Pattern 1 — Shape selection + backward elimination at 0.05

mfp, select(0.05) : logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos

What it does

  • Tests continuous variables for FP shapes (up to default df rules).

  • Performs backward elimination at the specified select(0.05) threshold, removing predictors that fail to meet it.

  • Returns a “final” model that may be smaller than the original.

When to use

  • Exploratory modeling

  • When you accept automatic selection

Common misinterpretation

This is not a “full model”. It is model selection + shape selection.

Pattern 2 — “Full model” (force all predictors to stay) using select(1)

mfp, select(1) : logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos

What select(1) means

  • select() is the p-to-remove threshold for backward elimination.

  • If you set it to 1, then nothing can be removed (because no p-value exceeds 1).

So:

  • ✅ all predictors remain in the model

  • ✅ FP functional-form testing still happens for continuous predictors

  • ❌ no reduction happens (even if predictors are noise)

When to use

  • When you want shape selection only, but no variable removal

Pattern 3 — Force some predictors, select others + allow more flexibility: select() + df()

mfp, select( ///
    wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos : 0.05, ///
    age hb : 1 ///
) df(age hb:4) : ///
logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos

This is the most “methods-sound” pattern when you want:

  • clinically essential covariates forced in, and

  • selection for the rest, and

  • control over allowed curve complexity for key continuous predictors.


6) Deep dive: select(varlist : p) (what it really does)

The rule

select(varlist : p) sets the variable selection threshold for that varlist.

  • If the p-value for the variable (or variable block) is > p, it can be dropped.

  • If the p-value is ≤ p, it stays.

How Pattern 3 works

In Pattern 3:

  • age hb : 1 → force age and hb into the model (never dropped)

  • the remaining predictors : 0.05 → ordinary backward elimination at 0.05

Key conceptual point

  • select() controls inclusion/exclusion

  • It does not control whether the variable is linear or nonlinear (shape is controlled by FP testing and by df() / alpha()); see the sketch below
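
A sketch of the distinction (placeholder model): to force age into the model and also keep it strictly linear, you need both options, because select() only governs inclusion and df() only governs shape:

* Sketch: age is forced in (select age:1) and kept linear (df age:1);
* wbc and hb remain eligible for removal at the 0.05 level
mfp, select(age:1, wbc hb:0.05) df(age:1) : logistic y age wbc hb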


7) Deep dive: df(age hb:4) (what it means in mfp)

In mfp, df() is best understood as:

maximum allowed complexity for FP modeling of that predictor

A practical mapping:

  • df(var:1) → force linear only (no FP transformation)

  • df(var:2) → allow FP1

  • df(var:4) → allow FP2 (the common default maximum)

So:

df(age hb:4)

means:

  • age and hb are allowed to be modeled as FP2 if the data support it.

Why choose higher df for specific predictors?

Because for some predictors (like age, key biomarkers):

  • you’re willing to allow curvature,

  • and you want mfp to test for it properly.

What df() does NOT mean

It does not mean “include 4 polynomial terms” or “make it a 4th-degree polynomial.” It means “allow the FP search to go up to FP2 complexity.”
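
Putting that mapping into one call (a sketch reusing the article’s variable names; the df choices here are illustrative):

* Sketch: FP2 allowed for age and hb, wbc forced to stay linear,
* remaining predictors left at mfp's default df
mfp, select(1) df(age hb:4, wbc:1) : logistic group_gimalig age hb wbc plt mcv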

8) Dummy variables and mfp (the core practical problem)

The limitation

mfp does not accept factor-variable syntax:

  • i.var

  • c.var##c.var

  • time-series operators

So researchers often see:

factor-variable and time-series operators not allowed

The correct strategy

Use either:

  1. pre-created dummy variables (best for clarity + joint testing), or

  2. xi: (quick compatibility).


9) Best practice: pre-create dummy variables + group them as a block

Why pre-created dummies are best for mfp

Because you can:

  • control the reference category,

  • preserve missingness correctly,

  • test the whole categorical predictor jointly with parentheses.

Example: 3-level thal (0=no, 1=trait, 2=disease)

gen thal_trait   = (thal==1) if !missing(thal)
gen thal_disease = (thal==2) if !missing(thal)
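
A quick sanity check (sketch) that the hand-made dummies code the levels as intended and keep missing thal missing:

* Cross-tabulate the original variable against each dummy, including
* missing values, to confirm the coding and the handling of missing thal
tab thal thal_trait, missing
tab thal thal_disease, missing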

Then use them as a block:

mfp, select( (thal_trait thal_disease):0.05, age hb:1 ) : ///
    logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib (thal_trait thal_disease) cirrhos

What parentheses give you

(thal_trait thal_disease) is:

  • selected/dropped together

  • not FP-transformed (as it shouldn’t be)

This is the cleanest approach for manuscripts.

10) Quick compatibility: xi: mfp ... (how to use it)

If you want to keep writing i.thal i.cirrhos (old-school dummy generation), you can do:

xi: mfp, select(1) : ///
    logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib i.thal i.cirrhos

What happens

  • xi: expands i.thal and i.cirrhos into dummy variables (e.g., _Ithal_1, _Ithal_2, etc.)

  • mfp then sees only plain variables and runs normally

Limitation of xi: for mfp

xi: does not automatically group the generated dummies as a single block in the mfp sense. So selection can behave oddly (dropping one dummy and keeping another). That’s why manual dummies + parentheses is still preferred if you care about clean, principled selection.
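
A middle ground, if you like xi’s convenience but want block-wise selection (a sketch; the _Ithal_* and _Icirrhos_* names follow xi’s usual _I prefix but should be confirmed with describe):

xi i.thal i.cirrhos                // create the dummies up front
describe _I*                       // confirm the generated names
mfp, select(1) : ///
    logistic group_gimalig age hb (_Ithal_1 _Ithal_2) (_Icirrhos_1)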

11) How to interpret mfp output (the parts that matter)

In the FP selection table you will see something like:

  • “Lin. vs FP2” with deviance difference and p-value

  • “Final” with the chosen power(s)

How to read it

  • Large p-value for “Lin vs FP2” → linear is adequate

  • Small p-value → FP model improves fit → a transformation is selected

The generated variables

mfp will create variables like:

  • Iage__1, Iage__2, etc.

These are:

  • The transformed versions used in the final model

  • Typically centered (centering helps numeric stability; it does not change fit)
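
To see the selected functional form graphically, the fracplot postestimation command plots the fitted fractional polynomial (with partial residuals) for a chosen predictor. A minimal sketch after any mfp fit:

// Sketch: visualize the fitted FP function for age after running mfp
fracplot age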


12) Common “researcher errors” with mfp (avoid these)

  1. Treating select() p-values as “clinical truth.” Selection is a modeling choice, not a biological conclusion.

  2. Letting mfp select and then claiming performance is final. Shape + selection is data-driven; it usually requires careful reporting and validation later.

  3. Using xi: and expecting perfect categorical handling. xi: is a compatibility tool, not a modeling philosophy.

  4. Forgetting to group dummy variables. A multi-level categorical predictor should usually be tested as a block.


13) Minimal “copy-ready” templates (logistic)

Template A — Full model, shape selection only

mfp, select(1) : logistic y age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal_trait thal_disease cir_comp cir_decomp

Template B — Force age+hb in; select among the others; allow FP2 for age+hb

mfp, select( ///
      wbc plt mcv rdw ferritin si male pain wtloss abnbm gib (thal_trait thal_disease) (cir_comp cir_decomp) : 0.05, ///
      age hb : 1 ///
    ) df(age hb:4) : ///
    logistic y age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib (thal_trait thal_disease) (cir_comp cir_decomp)

Template C — Quick xi: compatibility

xi: mfp, select(1) : logistic y age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib i.thal i.cirrhos


14) How to report mfp in a Methods section (short and correct)

A clean generic sentence:

“Continuous predictors were modeled using multivariable fractional polynomials (Stata mfp) with the default FP power set. Predictor inclusion was handled via backward elimination at a prespecified threshold, while clinically essential variables were forced into the model. Categorical predictors were entered using indicator (dummy) variables.”

If you used select(1) (full model):

“…with select(1) to force all candidate predictors to remain in the model while allowing FP functional-form selection.”
