Stata mfp in Practice: Fractional Polynomials, select(), df(), and the Dummy-Variable xi: Workaround
- Mayta

Fractional polynomials, selection control, and using dummy variables (xi: workaround)
This article focuses only on mfp (no multiple imputation, no validation workflow), and it is written for researchers who want to:
model non-linear continuous predictors within one regression model, and
understand exactly what the key mfp syntax and options do, especially
select()
df()
and the dummy-variable / xi: workaround when factor-variable syntax is not accepted.
1) What problem does mfp solve?
In many regression models (logistic, Cox, linear, etc.), continuous predictors are often assumed to have a linear effect on the model scale:
Logistic: linearity in log-odds
Cox: linearity in log-hazard
Linear regression: linearity in mean outcome
That assumption can be wrong. Common “bad fixes” include:
categorizing continuous variables (information loss, arbitrary cutpoints),
guessing quadratic/cubic terms (too ad hoc),
univariable screening (unstable).
mfp provides a structured, parametric, reproducible way to:
test whether a continuous predictor needs a transformation, and
pick a transformation from a restricted family (fractional polynomials),
optionally with backward elimination for predictor selection.
2) What is a fractional polynomial (FP)?
The FP power set
Fractional polynomials use powers from a restricted set, typically:
p ∈ {−2, −1, −0.5, 0, 0.5, 1, 2, 3}
with the special rule:
p = 0 means log(x)
FP1 vs FP2
FP1 uses one transformed term:
FP1: b0 + b1·x^p, for one power p from the set above
FP2 uses two transformed terms (two powers):
FP2: b0 + b1·x^p1 + b2·x^p2, with the convention b0 + b1·x^p + b2·x^p·log(x) when the two powers are equal
FP is not “make everything non-linear”. FP is: show that linearity is insufficient, and then choose the simplest adequate curve.
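As a concrete illustration (hypothetical variable names age and died, and illustrative powers (−1, 2) not chosen from any data), the two FP2 terms could be generated by hand:
* hypothetical example: build FP2 terms for age with powers (-1, 2)
gen age_fp1 = 1/age      // power -1
gen age_fp2 = age^2      // power  2
logistic died age_fp1 age_fp2
In practice mfp searches over the power set, picks the powers, and creates these transformed variables for you.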
3) What mfp does internally (the “loops”)
Think of mfp as doing two tasks:
Task A — Functional form selection (shape)
For each continuous predictor, mfp compares:
Linear vs FP1 vs FP2, using deviance / LR-type comparisons, with a controlled testing strategy (closed tests by default; sequential is an alternative).
Task B — Variable selection (optional)
It can also perform backward elimination:
a variable is dropped if removing it does not significantly worsen model fit (based on the select() threshold).
Why it runs in cycles
Because once one variable changes shape or drops, the “best shape” of others can change too.
So mfp iterates (“cycles”) until the selected shapes and the set of retained variables stop changing.
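If the number of cycles needs to be capped, mfp has a cycles() option (check help mfp for the default in your Stata version). A minimal sketch with hypothetical variables:
* hypothetical variables; limit mfp to at most 10 cycles
mfp, select(0.05) cycles(10) : logistic died age sbp chol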
4) Basic mfp syntax you should memorize
mfp [, options] : regression_cmd yvar xvarlist [, regression_cmd_options]
Examples of regression_cmd include: logistic, logit, stcox, regress, etc.
Special rule: parentheses in xvarlist
You can write:
x1 x2 (x3 x4 x5)
Variables inside parentheses:
are tested jointly for inclusion/exclusion
are NOT eligible for FP transformation (they remain linear/indicator form)
This is extremely useful for dummy-variable blocks (explained below).
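A minimal sketch (hypothetical variable names): d1 and d2 are pre-made dummies for one categorical predictor, kept together and kept linear:
* (d1 d2) are tested jointly for inclusion and never FP-transformed
mfp, select(0.05) : logistic died age sbp (d1 d2)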
5) The 3 mfp command patterns used most often in practice
Below are the three “workhorse” patterns.
Pattern 1 — Default mfp (shape selection + backward elimination)
mfp, select(0.05) : logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos
What it does
Tests continuous variables for FP shapes (up to default df rules).
Performs backward elimination at the specified select() level (0.05 here), removing predictors that do not meet it.
Returns a “final” model that may be smaller than the original.
When to use
Exploratory modeling
When you accept automatic selection
Common misinterpretation
This is not a “full model”. It is model selection + shape selection.
Pattern 2 — “Full model” (force all predictors to stay) using select(1)
mfp, select(1) : logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos
What select(1) means
select() is the p-to-remove threshold for backward elimination.
If you set it to 1, then nothing can be removed (because no p-value exceeds 1).
So:
✅ all predictors remain in the model
✅ FP functional-form testing still happens for continuous predictors
❌ no reduction happens (even if predictors are noise)
When to use
When you want shape selection only, but no variable removal
Pattern 3 — Force some predictors, select others + allow more flexibility: select() + df()
mfp, select( ///
wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos : 0.05, ///
age hb : 1 ///
) df(age hb:4) : ///
logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal cirrhos
This is the most “methods-sound” pattern when you want:
clinically essential covariates forced in, and
selection for the rest, and
control over allowed curve complexity for key continuous predictors.
6) Deep dive: select(varlist : p) (what it really does)
The rule
select(varlist : p) sets the variable selection threshold for that varlist.
If p-value for the variable (or variable block) is > p, it can be dropped.
If p-value is ≤ p, it stays.
How Pattern 3 works
In Pattern 3:
age hb : 1 → force age and hb into the model (never dropped)
the other predictors : 0.05 → typical backward elimination at 0.05
Key conceptual point
select() controls inclusion/exclusion
It does not control whether the variable is linear or nonlinear (shape is controlled by FP testing and by df() / alpha())
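A stripped-down sketch of the same rule (hypothetical variable names): sbp and chol face backward elimination at 0.05, while age is forced in:
mfp, select(sbp chol:0.05, age:1) : logistic died age sbp chol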
7) Deep dive: df(age hb:4) (what it means in mfp)
In mfp, df() is best understood as:
maximum allowed complexity for FP modeling of that predictor
A practical mapping:
df(var:1) → force linear only (no FP transformation)
df(var:2) → allow FP1
df(var:4) → allow FP2 (the common default maximum)
So:
df(age hb:4)
means:
age and hb are allowed to be modeled as FP2 if the data support it.
Why choose higher df for specific predictors?
Because for some predictors (like age, key biomarkers):
you’re willing to allow curvature,
and you want mfp to test for it properly.
What df() does NOT mean
It does not mean “include 4 polynomial terms” or “make it a 4th-degree polynomial.” It means “allow the FP search to go up to FP2 complexity.”
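For example (hypothetical variable names; check help mfp for the exact df() list syntax in your version), different complexity ceilings can be combined in one call:
* age may go up to FP2, sbp up to FP1, chol stays strictly linear
mfp, df(age:4, sbp:2, chol:1) select(0.05) : logistic died age sbp chol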
8) Dummy variables and mfp (the core practical problem)
The limitation
mfp does not accept factor-variable syntax:
i.var
c.var##c.var
time-series operators
So researchers often see:
factor-variable and time-series operators not allowed
The correct strategy
Use either:
pre-created dummy variables (best for clarity + joint testing), or
xi: (quick compatibility).
9) Best practice: pre-create dummy variables + group them as a block
Why pre-created dummies are best for mfp
Because you can:
control the reference category,
preserve missingness correctly,
test the whole categorical predictor jointly with parentheses.
Example: 3-level thal (0=no, 1=trait, 2=disease)
gen thal_trait = (thal==1) if !missing(thal)
gen thal_disease = (thal==2) if !missing(thal)
Then use them as a block:
mfp, select( (thal_trait thal_disease):0.05, age hb:1 ) : ///
logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib (thal_trait thal_disease) cirrhos
What parentheses give you
(thal_trait thal_disease) is:
selected/dropped together
not FP-transformed (as it shouldn’t be)
This is the cleanest approach for manuscripts.
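A quick optional sanity check on the dummy coding (assuming the 0/1/2 coding above):
* each subject should fall into at most one non-reference category
assert thal_trait + thal_disease <= 1 if !missing(thal)
tab thal thal_trait, missing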
10) Quick compatibility: xi: mfp ... (how to use it)
If you want to keep writing i.thal i.cirrhos (old-school dummy generation), you can do:
xi: mfp, select(1) : ///
logistic group_gimalig age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib i.thal i.cirrhos
What happens
xi: expands i.thal and i.cirrhos into dummy variables (named _Ithal_1, _Ithal_2, etc. by default)
mfp then sees only plain variables and runs
Limitation of xi: for mfp
xi: does not automatically group the generated dummies as a single block in the mfp sense. So selection can behave oddly (dropping one dummy and keeping another). That’s why manual dummies + parentheses is still preferred if you care about clean, principled selection.
11) How to interpret mfp output (the parts that matter)
In the FP selection table you will see something like:
“Lin. vs FP2” with deviance difference and p-value
“Final” with the chosen power(s)
How to read it
Large p-value for “Lin vs FP2” → linear is adequate
Small p-value → FP model improves fit → a transformation is selected
The generated variables
mfp will create variables like:
Iage__1, Iage__2, etc.
These are:
The transformed versions used in the final model
Typically centered (centering helps numeric stability; it does not change fit)
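After mfp, the fitted fractional-polynomial function for a transformed predictor can be inspected with the post-estimation commands fracplot and fracpred (a minimal sketch; see the mfp postestimation help; fp_age is a hypothetical name):
* plot the fitted FP curve (with partial residuals) for age
fracplot age
* save the fitted FP component for age as a new variable
fracpred fp_age, for(age)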
12) Common “researcher errors” with mfp (avoid these)
Treating select() p-values as “clinical truth”: selection is a modeling choice, not a biological conclusion.
Letting mfp select and then claiming performance is final: shape + selection is data-driven; it usually requires careful reporting and validation later.
Using xi: and expecting perfect categorical handling: xi: is a compatibility tool, not a modeling philosophy.
Forgetting to group dummy variables: a multi-level categorical predictor should usually be tested as a block.
13) Minimal “copy-ready” templates (logistic)
Template A — Full model, shape selection only
mfp, select(1) : logistic y age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib thal_trait thal_disease cir_comp cir_decomp
Template B — Force age+hb; allow selection for the others; allow FP2 for age+hb
mfp, select( ///
wbc plt mcv rdw ferritin si male pain wtloss abnbm gib (thal_trait thal_disease) (cir_comp cir_decomp) : 0.05, ///
age hb : 1 ///
) df(age hb:4) : ///
logistic y age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib (thal_trait thal_disease) (cir_comp cir_decomp)
Template C — Quick xi: compatibility
xi: mfp, select(1) : logistic y age hb wbc plt mcv rdw ferritin si male pain wtloss abnbm gib i.thal i.cirrhos
14) How to report mfp in a Methods section (short and correct)
A clean generic sentence:
“Continuous predictors were modeled using multivariable fractional polynomials (Stata mfp) with the default FP power set. Predictor inclusion was handled via backward elimination at a prespecified threshold, while clinically essential variables were forced into the model. Categorical predictors were entered using indicator (dummy) variables.”
If you used select(1) (full model):
“…with select(1) to force all candidate predictors to remain in the model while allowing FP functional-form selection.”