pmsampsize for Clinical Prediction Models (Clinical Prediction Modeling [CPM], Step 4: Plan Sample Size)
- Mayta

- Oct 8
- 6 min read
Concept → How it works → Stata commands → How to find the inputs
1) Concept—what pmsampsize is (and why we use it)
pmsampsize is a design‑stage tool for clinical prediction models (CPMs) that tells you how large your development dataset (N and number of events) must be—or, equivalently, how complex your model is allowed to be—so that overfitting is small and calibration is reliable. It replaces crude rules like “10 events per variable,” aligning sample size with explicit performance targets for your planned model. In our CPM roadmap, we explicitly “ditch the 10‑EPV rule” and plan to use pmsampsize because requirements depend on the event fraction, intended degrees of freedom (df), and target performance (e.g., AUC, R²), not on a fixed EPV cutoff.
In prognostic/diagnostic CPM work, this design‑first step sits between specifying your point of prediction and your derivation/validation plan. It ensures that downstream discrimination and calibration are not artifacts of an undersized, over‑parameterized dataset.
2) How pmsampsize works—criteria, not rules of thumb
pmsampsize computes the minimum total N (and events) needed so that your model simultaneously meets three design criteria (the command takes the largest N across them):
Small overfitting: target global shrinkage (calibration slope) ≥ 0.90—i.e., ≤10% overfitting.
Limited optimism in apparent fit: keep the difference between apparent and adjusted R² small (commonly ≤0.05).
Precise overall outcome level: estimate the overall mean outcome (continuous) or overall risk (binary/survival) with prespecified precision (e.g., ±0.05 on the risk scale).
These are adapted to outcome type (binary, survival, continuous) and to your planned model complexity measured in degrees‑of‑freedom (df), not “variable labels.” Spline terms, categories, and interactions each consume df and therefore require more events.
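To make the precision criterion concrete, here is a back-of-envelope version for a binary outcome, using the standard normal-approximation confidence interval for a proportion (the 10% risk and ±0.05 margin are illustrative choices, not defaults):
* criterion 3 sketch: N needed to estimate an overall risk of 0.10 to within ±0.05
* n = (1.96/delta)^2 * phi*(1 - phi), with phi = 0.10 and delta = 0.05
display ceil((1.96/0.05)^2 * 0.10 * 0.90)   // ≈ 139 patients
pmsampsize runs the analogous calculation for each criterion and keeps the largest N.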
For a fixed dataset, the command can also run in reverse: supply n() and it returns the maximum number of predictor parameters you can safely consider.
The key options you must set
A. Pick one performance anchor (you must state the expected strength of the planned model):
csrsquared() (Cox–Snell R², adjusted), or
nagrsquared() (Nagelkerke R², adjusted), or
cstatistic() (AUROC; the software converts to an R² surrogate).
B. Tell it whether you want N or max predictors (mutually exclusive):
parameters(#) → solve for N needed given # planned predictor parameters.
n(#) → with a fixed sample size, solve for how many predictor parameters you can afford. Use n() and omit parameters(); only one of the two can be given (a sketch follows this list).
C. Outcome-specific inputs
Binary: prevalence() (event proportion).
Survival: rate() (overall event rate per time unit), meanfup() (mean follow-up), timepoint() (prediction horizon); time units must match.
Continuous: intercept() (mean outcome) and sd() of outcome in target population.
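A minimal sketch of the fixed-N mode (the 800 patients, 10% prevalence, and Nagelkerke R² of 0.15 are assumptions for illustration):
* fixed dataset: how many predictor parameters can 800 patients afford?
pmsampsize, type(b) n(800) prevalence(0.10) nagrsquared(0.15)
Instead of a required N, the output reports the maximum df budget your data can support.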
3) Stata usage—canonical command patterns
Below are typical Stata calls you can lift into a Methods appendix. (Use your own anticipated performance and df; see §4 to find inputs.)
A. Binary outcome (diagnostic CPM)
pmsampsize, type(b) ///
parameters(10) prevalence(0.10) ///
nagrsquared(0.15) shrinkage(0.90)
parameters() = total df planned (count spline terms, categories, interactions).
Use nagrsquared() for an anticipated Nagelkerke R² (or csrsquared() for a Cox–Snell R²). If you only have an AUC, the module supports planning from anticipated discrimination and converts it to an R² for the calculations; a variant call follows.
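If discrimination is all you have, a variant of the same call anchors on the c-statistic (the AUC of 0.75 is an assumed planning value):
pmsampsize, type(b) ///
parameters(10) prevalence(0.10) ///
cstatistic(0.75) shrinkage(0.90)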
B. Survival outcome (prognostic CPM)
pmsampsize, type(s) ///
parameters(12) rate(0.10) meanfup(5) /// rate, follow-up, horizon in years (illustrative)
timepoint(5) csrsquared(0.12) shrinkage(0.90)
rate() gives the overall event rate per time unit, meanfup() the average follow-up, and timepoint() the prediction horizon; all three must share the same time unit (years here, with illustrative values). Supply the anticipated model fit on the Cox–Snell scale via csrsquared().
C. Continuous outcome
pmsampsize, type(c) ///
parameters(8) rsquared(0.25) ///
intercept(2.0) sd(0.8) shrinkage(0.90) // mean and SD are illustrative
intercept() and sd() give the anticipated mean and SD of the outcome in the target population (illustrative values above). The tool also targets precision of the residual SD and mean outcome in continuous models.
What you’ll see: a small table listing the N (and events) required by each criterion (shrinkage, R² optimism, outcome precision), plus the final required minimum—the maximum across the three. That number is your design‑justified development sample size.
4) How to find each input—practical sourcing
What drives the inputs (Steps 2–3 first)
Step 2 — Define the clinical prediction question precisely: population, outcome, and especially the prediction point (e.g., at ED triage vs. day-3 of admission). This “when” determines what data are available and what time horizon you predict.
Step 3 — Choose the right design: cross-sectional (diagnostic CPM) vs. cohort (prognostic CPM).
These two steps fix the outcome type you’ll pass to pmsampsize:
Binary (type b): diagnostic/prognostic at a fixed time.
Survival (type s): time-to-event with a time horizon (needs event rate, mean follow-up, horizon).
Continuous (type c): continuous outcome prediction.
“Where do I find each input?” (fast mapping to Steps 2–5)
Prediction point & time horizon (Step 2): define exactly when the model will be used (ED vs. day-3, discharge, etc.) and over what timeframe it predicts; this choice governs whether you use binary vs. survival inputs.
Prevalence / event rate: pull from prior studies, registries, or your own cohort at the same prediction point; for survival, also extract mean follow-up and set the timepoint you’ll predict. (Match time units.)
Model strength (R² or AUROC): take a conservative value from an external validation or a similar model; when you only have AUROC, use cstatistic().
parameters() (Step 4: “How many predictors can I afford?”): count candidate parameters (not just variables—include df for nonlinearity, interactions). If N is fixed, switch to n() to compute the budget.
Step 5 (choose predictors within budget): pre-specify clinically justified candidates; avoid dichotomizing continuous predictors; consider splines for nonlinearity.
Use this checklist to specify inputs before you run the command.
4.1 Outcome type: type(b | s | c)
Binary for diagnosis now (disease yes/no), survival for time‑to‑event prognosis, continuous for numeric outcomes.
Pick the type that matches your CPM’s endpoint and time horizon.
4.2 Planned model complexity: parameters(#) = degrees‑of‑freedom
Count df, not variable names:
Binary predictor → 1 df
Continuous linear term → 1 df
Continuous with restricted cubic spline (e.g., 4 knots) → 3 df
Categorical with k levels → k − 1 df
Interactions → add df for every product term. Plan this up front to avoid stealth overfitting; a worked tally follows.
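A quick tally for a hypothetical predictor set (all counts are assumptions for illustration):
* df budget for a planned model (hypothetical predictor set)
* 2 continuous labs with 4-knot splines: 2 x 3 = 6 df
* 1 categorical predictor with 3 levels: 3 - 1 = 2 df
* 4 binary predictors: 4 df
* 1 interaction (one product term): 1 df
display "total df = " 6 + 2 + 4 + 1   // 13 -> parameters(13)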
4.3 Anticipated performance: rsquared(#) (or AUC via the help options)
Source one of the following:
Literature on similar CPMs (reported Nagelkerke R² or AUC).
Pilot data (fit a reasonable full model, get R²).
If you only have an AUC, supply it via cstatistic(); the module converts anticipated discrimination to an R² for its calculations (see the variant call in §3A). Conservative planning often uses R² ≈ 0.10–0.20 for moderate clinical models unless stronger prior evidence exists.
4.4 Event fraction or outcome level: prevalence(#) (binary) / event info (survival)
Binary: Use the expected disease prevalence in your intended target population (not a case–control sampling fraction).
Survival: Supply rate() (overall event rate per time unit), meanfup() (mean follow-up), and timepoint() (prediction horizon), all in the same time unit; a sketch of deriving these from pilot data follows.
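For example, the survival inputs can be derived from pilot follow-up data (the counts below are hypothetical):
* hypothetical pilot cohort: 400 patients, 120 events, 1,500 person-years
display "rate = " 120/1500            // 0.08 per person-year -> rate(0.08)
display "mean follow-up = " 1500/400  // 3.75 years -> meanfup(3.75)
* set timepoint() to the clinical horizon, e.g., timepoint(3) for 3-year risk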
4.5 Overfitting tolerance: shrinkage(#)
Default 0.90 is standard; tighten (e.g., 0.95) for high‑stakes decisions or loosen slightly if using stronger penalization and planning rigorous internal validation.
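For instance, tightening the tolerance for a high-stakes setting (other inputs carried over from the binary example in §3A):
* allow at most 5% overfitting instead of the default 10%
pmsampsize, type(b) parameters(10) prevalence(0.10) nagrsquared(0.15) shrinkage(0.95)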
TIP—diagnostic vs prognostic orientation: Confirm your point of prediction (when risk is estimated) and ensure outcome definition and prevalence match your intended users. This affects both prevalence() and the realism of your anticipated performance.
5) Worked clinical sketch—diagnosing cirrhosis
Setting: 1,000 patients evaluated; 100 have cirrhosis (10%).
Plan: 10 df (e.g., two labs modeled with 3-knot restricted cubic splines at 2 df each, plus six binary predictors).
Target performance: R² ≈ 0.15 (moderate).
Call:
pmsampsize, type(b) parameters(10) prevalence(0.10) nagrsquared(0.15) shrinkage(0.90)
If the output’s required minimum is ≤ 1,000 (with ≈100 events), proceed; if it’s > 1,000, either simplify the model (fewer df), enlarge the sample, or revise performance expectations. You can also rerun with n(1000) in place of parameters(10) to see how many df 1,000 patients can support. This is a principled alternative to “10 EPV says you can use 10 predictors.”
6) Why this beats “10 EPV”
Performance‑anchored: ties N to your intended AUC/R² and calibration, not a fixed ratio.
Df‑aware: counts how predictors are modeled (splines, categories, interactions).
Outcome‑aware: respects disease prevalence and follow‑up.
Transparent: shows which criterion drives the requirement, guiding design trade‑offs.
This is the CPM design logic we teach: plan N from targets, then validate; don’t back‑justify a complex model with a rule‑of‑thumb EPV.
Recap
Define prediction point and design first; they lock in the outcome type and time horizon.
Provide one performance anchor: csrsquared() or nagrsquared() or cstatistic().
Choose either parameters() or n() depending on whether you’re solving for N or for max predictors.
This replaces the 10-EPV rule with principled, multi-criteria sizing.
Key takeaways
pmsampsize gives a data‑driven minimum N/events for CPM development by enforcing shrinkage, optimism, and outcome‑level precision targets.
Count degrees‑of‑freedom, not variable labels; splines, categories, and interactions “spend” df.
Source inputs from literature/pilot data and from the intended prediction setting (prevalence/time horizon).
This approach is the endorsed replacement for 10‑EPV in modern CPM development.




