Dummy Variables + mfp in Stata: A Practical Guide (with xi: and mfpa)

Mayta
Dec 28, 2025

Introduction

This short “how-to” is written for researchers who hit the same wall you did:

You want multivariable fractional polynomials (MFP) for continuous predictors (non-linearity handling),
but mfp does not accept factor-variable syntax (i.var, c.var##c.var),
and it often breaks inside mi estimate unless you handle categorical variables correctly.

The solution is usually simple: pre-create dummy variables (best practice) or use xi: (quick fix). mfpa is an alternative when you need factor variables.

1) What is a dummy variable?

A dummy variable is a 0/1 indicator representing membership in a category.

For a binary predictor (e.g., male/female), one dummy is enough.
For a k-level categorical predictor (e.g., 3 levels), you need k−1 dummies (choose a reference level). This avoids perfect collinearity (“dummy variable trap”).

Key rule: preserve missingness

In Stata, expressions like (thal==1) will return 0 when thal is missing (because . is not equal to 1). That silently misclassifies missing values.

✅ Always do:

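gen thal_trait = (thal==1) if !missing(thal)

The if qualifier already leaves thal_trait missing wherever thal is missing, so no extra replace step is needed.

Or, step by step (same effect):

gen thal_trait = .
replace thal_trait = 1 if thal==1
replace thal_trait = 0 if !missing(thal) & thal!=1

Use !missing(thal) rather than thal!=. so that extended missing values (.a to .z) are also excluded.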
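A minimal sketch of the trap on toy data (the five values below are made up for illustration). The naive expression codes both missing rows as 0, while the if !missing() version keeps them missing:

```stata
* Toy data (hypothetical values): three observed levels and two kinds of missing
clear
input thal
0
1
2
.
.a
end

gen bad  = (thal==1)                     // missing thal is silently coded 0
gen good = (thal==1) if !missing(thal)   // missing thal stays missing

list thal bad good, clean

* Rows that were misclassified by the naive version
count if missing(thal) & bad==0

* The safe version never assigns a value where thal is missing
assert missing(good) if missing(thal)
```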

2) How to create dummy variables (three common methods)

Method A — Manual gen (most transparent; best for papers)

Example: thal has three levels: 0 = no, 1 = trait, 2 = disease. Reference = “no thalassemia” (0). Create two dummies:

gen thal_trait   = (thal==1) if !missing(thal)
gen thal_disease = (thal==2) if !missing(thal)

label var thal_trait   "Thalassemia trait"
label var thal_disease "Thalassemia disease"

Example: cirrhos (cirrhosis) has three levels: 0 = no, 1 = compensated, 2 = decompensated. Reference = 0:

gen cir_comp = (cirrhos==1) if !missing(cirrhos)
gen cir_decomp = (cirrhos==2) if !missing(cirrhos)

label var cir_comp   "Compensated cirrhosis"
label var cir_decomp "Decompensated cirrhosis"

✅ This is the easiest to interpret and to report.

Method B — tab var, gen(prefix) (fast)

tab thal, gen(thal_)

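Stata will create thal_1, thal_2, thal_3 ... numbered by the order of the levels it sees, not by their values (with levels 0/1/2, thal_1 is the dummy for thal==0).

You must choose a reference and drop one dummy (or omit it from the model).

Also check missingness handling. Without the missing option, observations where thal is missing should end up missing in every dummy; with tab thal, missing gen(thal_), missing becomes a category of its own and gets its own dummy. Either way, always verify: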

misstable sum thal thal_*
tab thal, missing
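As a small illustration of Method B, the sketch below runs tab, gen() on toy data (made-up values), inspects the result, and then drops the reference dummy. The final names thal_trait and thal_disease are simply chosen to match Method A.

```stata
* Toy data: three observed levels plus one missing value (made-up values)
clear
input thal
0
1
2
2
.
end

tabulate thal, generate(thal_)          // creates thal_1 thal_2 thal_3

describe thal_*                         // variable labels record which level each dummy encodes
list thal thal_1 thal_2 thal_3, clean   // inspect how the missing row was coded

* Reference = thal==0, which is thal_1 here (first level seen), so leave it out
drop thal_1
rename (thal_2 thal_3) (thal_trait thal_disease)
```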

Method C — xi: prefix (automatic expansion; quick fix)

xi: expands i.var into dummy variables before the model runs.

Example:

xi: logistic group_gimalig age hb i.thal i.cirrhos

Creates variables in the dataset named like _Ithal_1, _Ithal_2 ... (by default the lowest level is omitted as the reference).

✅ Useful when you want speed
❌ Less readable output and harder to control references
❌ For prediction modeling / TRIPOD reporting, manual dummies are usually preferred.
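To see what xi actually generates, and one way to change the reference level, here is a short sketch on toy data (made-up values). It uses the omit characteristic, char varname[omit], which xi reads when choosing the omitted category:

```stata
* Toy data with three levels (made-up values)
clear
input thal
0
1
2
1
end

xi i.thal              // default: the lowest level (0) is the omitted reference
describe _Ithal*

char thal[omit] 2      // ask xi to omit level 2 instead
xi i.thal              // regenerates the _Ithal_* dummies with the new reference
describe _Ithal*
```

The generated names carry the level value, so checking them with describe is a quick way to confirm which category ended up as the reference.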

3) Why this matters for mfp

The limitation

mfp does NOT accept factor-variable or time-series operators. So these will fail:

mfp ... i.thal
mfp ... c.age##c.age
mfp ... i.site##c.age
and often inside MI: mi estimate: mfp ... i.var

Typical error:

factor-variable and time-series operators not allowed
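The message can be reproduced safely on Stata's bundled auto dataset (used here only as a stand-in for your data). In this sketch, capture noisily shows the error without stopping the do-file, and the second call shows the same model with a plain 0/1 variable:

```stata
sysuse auto, clear

* Factor-variable syntax: expected to be rejected by mfp
capture noisily mfp: regress mpg weight i.foreign
display "return code: " _rc

* foreign is already a 0/1 variable, so it can be entered without the i. prefix
mfp: regress mpg weight foreign
```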

4) How to use mfp correctly with dummy variables

The syntax pattern (what mfp expects)

mfp [, options] : regression_cmd yvar xvarlist

Important points:

mfp tests continuous variables for FP transformations (shape).
mfp can also do variable selection (backward elimination).
Variables inside parentheses ( ... ):
- are tested jointly for inclusion,
- and are not eligible for FP transformation (which is fine for categorical blocks).

That “joint testing” is very important for dummy sets.
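A runnable sketch of the parenthesized block, again on the bundled auto dataset as a stand-in. The three-level variable repcat and its dummies rep_mid and rep_high are hypothetical names created only for this illustration:

```stata
sysuse auto, clear

* Collapse rep78 into three levels (illustrative grouping), preserving missingness
gen byte repcat = cond(rep78<=2, 0, cond(rep78==3, 1, 2)) if !missing(rep78)

* k-1 = 2 dummies, reference = repcat==0
gen byte rep_mid  = (repcat==1) if !missing(repcat)
gen byte rep_high = (repcat==2) if !missing(repcat)

* weight and displacement are candidates for FP transformation;
* the dummy pair in parentheses is tested jointly and never transformed
mfp, select(0.05): regress mpg weight displacement (rep_mid rep_high)
```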

5) Three mfp patterns you said you use frequently

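Pattern 1 — MFP with selection (shape + selection): select(0.05)

mfp, select(0.05) : logistic group_gimalig ///
    age hb wbc plt mcv rdw ferritin si ///
    male pain wtloss abnbm gib ///
    thal_trait thal_disease ///
    cir_comp cir_decomp

What it does:

Tests each continuous predictor for FP shape (linear vs FP1 vs FP2),
and removes predictors by backward elimination when p > 0.05.

Note: selection has to be requested. With no select() option, mfp uses select(1), which keeps every predictor and only chooses FP shapes.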

Best used for:

exploration
initial model building

Not ideal if you need a “full” pre-specified model.

Pattern 2 — Full model (no variable removal): select(1)

mfp, select(1) : logistic group_gimalig ///
    age hb wbc plt mcv rdw ferritin si ///
    male pain wtloss abnbm gib ///
    thal_trait thal_disease ///
    cir_comp cir_decomp

What it does:

Forces every predictor to remain (p-to-remove = 1.0),
still chooses FP shapes for continuous variables.

Best used for:

prediction modeling, where you want a full model but allow non-linearity
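After a full-model fit it helps to look at the shape mfp chose. A minimal sketch on the bundled auto dataset (a stand-in, with regress instead of logistic to keep the example stable), using the fracplot post-estimation command:

```stata
sysuse auto, clear

* Full model: nothing is dropped, FP shapes are still selected
mfp, select(1): regress mpg weight displacement

* Plot the fitted FP function for weight together with partial residuals
fracplot weight
```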

Pattern 3 — Force some predictors + allow selection in others

Example: force age and hb in the model, but allow selection of the rest:

* select(0.05, age hb:1): 0.05 applies to all predictors, then age and hb are overridden to 1
mfp, select(0.05, age hb:1) df(age hb:4) : ///
    logistic group_gimalig ///
      age hb wbc plt mcv rdw ferritin si ///
      male pain wtloss abnbm gib ///
      (thal_trait thal_disease) (cir_comp cir_decomp)

What it does (conceptually):

Selection rule:
- age hb:1 → forced in (never dropped)
- others → can be dropped at 0.05
- dummy blocks in parentheses are tested jointly (important!)
Shape rule:
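- df(age hb:4) gives age and hb up to 4 df (FP2). This is also mfp's default for continuous predictors, so it mainly documents the choice; df(age hb:2) would restrict them to FP1.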

Why parentheses matter here:

Without parentheses, mfp might drop only thal_disease but keep thal_trait, which is usually undesirable clinically/statistically.
With (thal_trait thal_disease) the whole “thalassemia factor” stays or goes together.
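The "stays or goes together" idea is the same as testing the dummies as one factor. As an analogy only (mfp's own decision is based on deviance differences, so its p-values can differ), a joint Wald test outside mfp looks like this on the bundled auto dataset, with the hypothetical dummies rep_mid and rep_high:

```stata
sysuse auto, clear

* Three-level factor and its k-1 dummies (illustrative grouping of rep78)
gen byte repcat   = cond(rep78<=2, 0, cond(rep78==3, 1, 2)) if !missing(rep78)
gen byte rep_mid  = (repcat==1) if !missing(repcat)
gen byte rep_high = (repcat==2) if !missing(repcat)

regress mpg weight rep_mid rep_high

* Individual t-tests are in the regression table; this is the joint test of the factor
testparm rep_mid rep_high
```

If the joint test and the individual tests disagree, the joint test is the one that answers "does this factor matter?", which is why the block form is preferred for dummy sets.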

6) Using xi: with mfp (quick compatibility mode)

If you don’t want to manually create dummies, this usually works:

xi: mfp, select(1) : logistic group_gimalig ///
    age hb wbc plt mcv rdw ferritin si ///
    male pain wtloss abnbm gib ///
    i.thal i.cirrhos

Why it works

xi: expands i.thal i.cirrhos to dummy variables first.
mfp then sees only plain variable names (no i. syntax), so it runs.

When I recommend xi: vs manual dummies

Use xi: for quick exploratory work, teaching, or when dummies are numerous and you don’t care about clean names.
Use manual dummies for final models, reporting, validation workflows, and reproducible manuscripts.

7) MI error: “mfp on m=1” and how to fix it

You mentioned:

“an error occurred when mi estimate executed mfp on m=1”

This usually happens because:

mi estimate is calling a command line that still contains i. or c. syntax (directly or indirectly),
and mfp refuses it.

Fix 1 (best practice): create dummies as passive under MI

If thal or cirrhos are imputed, your dummies should be passive (derived from the imputed parent variable):

mi passive: gen thal_trait   = (thal==1) if !missing(thal)
mi passive: gen thal_disease = (thal==2) if !missing(thal)

mi passive: gen cir_comp   = (cirrhos==1) if !missing(cirrhos)
mi passive: gen cir_decomp = (cirrhos==2) if !missing(cirrhos)

Then:

mi estimate: mfp, select(1) : logistic group_gimalig ///
    age hb wbc plt mcv rdw ferritin si ///
    male pain wtloss abnbm gib ///
    thal_trait thal_disease cir_comp cir_decomp
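Before adding mfp, the passive step can be checked in isolation. The sketch below simulates a small dataset (all values, the seed, and the variables x and y are made up), imputes thal, derives the dummies with mi passive, and fits a plain logistic model under mi estimate:

```stata
* Simulated data, for illustration only
clear
set seed 12345
set obs 200
gen x = rnormal()
gen byte thal = floor(3*runiform())          // levels 0, 1, 2
replace thal = . if runiform() < 0.2         // about 20% missing
gen byte y = runiform() < invlogit(0.5*x)    // binary outcome

* Declare the MI structure and impute the parent variable
mi set wide
mi register imputed thal
mi impute mlogit thal x y, add(5) rseed(2025)

* Dummies are derived in every imputed dataset from the imputed thal
mi passive: gen byte thal_trait   = (thal==1) if !missing(thal)
mi passive: gen byte thal_disease = (thal==2) if !missing(thal)

* Plain model first: confirms the dummies behave before mfp is layered on top
mi estimate: logistic y x thal_trait thal_disease
```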

Fix 2 (quick fix): xi: inside MI

Often works:

mi estimate: xi: mfp, select(1) : logistic group_gimalig ///
    age hb wbc plt mcv rdw ferritin si ///
    male pain wtloss abnbm gib ///
    i.thal i.cirrhos

If it runs, it’s fine for teaching and quick work. For publication, I still prefer passive dummies because it’s explicit and reproducible.

8) Where mfpa fits (and why people use it)

What mfpa adds

Supports factor variables (so i.var works).
Adds linadj() and acd() options (advanced extensions).
Uses different post-estimation helpers (xfracplot, xfracpred).

When to use mfpa

You truly need factor-variable syntax and interactions in the model-building step.
You accept the additional complexity and package dependencies.

Why some people avoid mfpa for performance validation

Not because validation is impossible—you can still:

predict p, pr
lroc
pmcalplot

…but because:

your model-building procedure is more complex to replicate inside bootstrap/MI pipelines, and
fewer teams are familiar with the post-estimation ecosystem.

For many clinical papers, manual dummies + official mfp remains the cleanest “standard”.

Practical checklist (what to do in real life)

If you use mfp and have categorical variables:

✅ Create dummies manually (best) OR use xi: (quick)

For multi-level categorical predictors:

✅ Use k−1 dummies

✅ Consider testing as a block: (d1 d2 ... dk-1)

Under MI:

✅ Make dummies passive

or ✅ use mi estimate: xi: ... as a shortcut

If you see:

factor-variable and time-series operators not allowed

✅ Remove i. / c. / ##
✅ Replace with dummies or xi: