Sample Size for Clinical Prediction Models: Using pmsampsize and the Riley Framework

  • Writer: Mayta

Introduction

Sample size calculation is one of the most misunderstood aspects of medical research because there is no single universal rule. The correct approach depends entirely on what the study is trying to achieve.

Designing a study is not merely about enrolling participants and running analyses. It is about anticipating the interplay between clinical importance, statistical rigor, ethical responsibility, and resource constraints. Sample size sits at the center of this balance.

The BRAVES method provides a structured way to think about sample size when the objective is hypothesis testing, while recognizing that other research objectives require entirely different logic.

Sample Size Depends on the Research Objective

Before calculating anything, the first question must be:

What is the objective of this research?

A medical study may aim to:

  • test a hypothesis,

  • build a prediction model,

  • estimate a parameter precisely,

  • evaluate a complex or adaptive design,

  • or analyze special data structures (e.g., clustered, longitudinal, rare events).

Each objective demands different assumptions, criteria, and stopping rules. Applying hypothesis-testing logic to all studies is a common and costly mistake.

This article moves beyond hypothesis testing and introduces Objective 2: Clinical Prediction Models (CPM), in which sample size is driven by model stability and overfitting control, rather than by power for a p-value.

Objective: Clinical Prediction Models (CPM)

“Can we predict individual outcomes reliably?”

Purpose

To build a model that predicts risk accurately in new patients, not to test statistical significance.

The aim is generalizable performance (calibration and discrimination), not hypothesis rejection.

Sample Size Logic

Sample size is chosen to:

  • prevent overfitting,

  • ensure stable coefficients,

  • obtain reliable model performance in new patients.

Key Criterion

  • Model stability and shrinkage

  • Precision of calibration and discrimination

Modern Approach

  • Riley et al. framework

  • Implemented via pmsampsize (R / Stata)

Typical Use

  • Risk scores

  • Prognostic models

  • Clinical decision tools

Main question: How much data is needed to build a model that works in real patients?

Concept → How It Works → Stata Commands → How to Find the Inputs

1) Concept — What pmsampsize Is (and Why We Use It)

pmsampsize is a design-stage tool for clinical prediction models (CPMs) that tells you how large your development dataset must be (total N and number of events), or how complex your model is allowed to be, so that overfitting is small and calibration is reliable.

It replaces crude rules like “10 events per variable” by linking sample size to explicit targets for:

  • overfitting control (shrinkage),

  • optimism in model fit (R²),

  • precision of the overall outcome level (risk or mean).

In CPM work, this design-first step sits between defining the point of prediction and planning derivation/validation. It ensures calibration and discrimination are not artifacts of an undersized dataset.

2) How pmsampsize Works — Criteria, Not Heuristics

pmsampsize calculates the minimum N required to satisfy multiple criteria simultaneously and then selects the largest required N.

Binary and Survival Outcomes: Three Criteria

  1. Small overfitting: target shrinkage (calibration slope) ≥ 0.90 (i.e., ≤ 10% overfitting)

  2. Limited optimism in apparent model fit: keep the difference between apparent and adjusted R² small (commonly ≤ 0.05)

  3. Precise overall outcome level: estimate the overall risk (binary) or cumulative incidence (survival) with prespecified precision
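As a concrete sketch, the three binary-outcome criteria can be computed directly from the published Riley et al. formulas. This is an illustrative re-implementation, not the pmsampsize package itself; the inputs below (10 df, 10% prevalence, Nagelkerke R² = 0.15, shrinkage target 0.90) are example values, and the Nagelkerke-to-Cox-Snell conversion uses the maximum attainable Cox-Snell R² implied by the prevalence.

```python
import math

def riley_binary_n(p, phi, r2_nag, S=0.90, delta_r2=0.05, margin=0.05):
    """Minimum N per Riley criterion for a binary-outcome CPM.

    p      : planned predictor parameters (degrees of freedom)
    phi    : anticipated outcome prevalence
    r2_nag : anticipated Nagelkerke R-squared
    Returns (n1, n2, n3), each rounded up to the next integer.
    """
    # Maximum possible Cox-Snell R^2 given the prevalence phi
    lnL0_per_n = phi * math.log(phi) + (1 - phi) * math.log(1 - phi)
    max_r2_cs = 1 - math.exp(2 * lnL0_per_n)
    r2_cs = r2_nag * max_r2_cs  # convert Nagelkerke to Cox-Snell scale

    # Criterion 1: expected shrinkage (calibration slope) >= S
    n1 = p / ((S - 1) * math.log(1 - r2_cs / S))
    # Criterion 2: optimism in apparent R^2 <= delta_r2 of the maximum
    S2 = r2_cs / (r2_cs + delta_r2 * max_r2_cs)
    n2 = p / ((S2 - 1) * math.log(1 - r2_cs / S2))
    # Criterion 3: estimate overall risk within +/- margin (95% CI)
    n3 = (1.96 / margin) ** 2 * phi * (1 - phi)
    return tuple(math.ceil(n) for n in (n1, n2, n3))

n1, n2, n3 = riley_binary_n(p=10, phi=0.10, r2_nag=0.15)
print(n1, n2, n3, max(n1, n2, n3))  # the largest criterion drives N
```

Taking the maximum across criteria mirrors what pmsampsize reports as the minimum recommended development sample size; here the shrinkage criterion is the binding one.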

Continuous Outcomes: Four Criteria

Adds precision targets for:

  • mean outcome

  • residual SD

What drives the sample size most

Model complexity is counted in degrees of freedom (df), not variable names:

  • spline terms,

  • categories,

  • interactions

all spend df and increase the required number of events.

3) Stata Usage — Command Patterns

A common error is writing type(binary), which is invalid syntax.

In pmsampsize, the model type must be specified as:

  • type(b) = binary outcome

  • type(s) = survival (time-to-event) outcome

  • type(c) = continuous outcome

A) Binary Outcome (Diagnostic or Prognostic CPM)

If you have an R² measure (Nagelkerke):

pmsampsize, type(b) ///
    parameters(10) prevalence(0.10) ///
    nagrsquared(0.15) shrinkage(0.90)

If you only have AUROC:

pmsampsize, type(b) ///
    parameters(10) prevalence(0.10) ///
    cstatistic(0.78) shrinkage(0.90)

B) Survival Outcome (Time-to-Event Prognostic CPM)

pmsampsize, type(s) ///
    parameters(12) ///
    rate(0.065) meanfup(2.07) timepoint(2) ///
    csrsquared(0.051) shrinkage(0.90)

Important: rate(), meanfup(), and timepoint() must be expressed in the same time units.

C) Continuous Outcome CPM

pmsampsize, type(c) ///
    parameters(25) rsquared(0.20) ///
    intercept(1.9) sd(0.6) shrinkage(0.90)

What the Output Means

pmsampsize returns a table showing the sample size (and number of events) required by each design criterion. The largest value across criteria is the minimum recommended development sample size.

This is your design-justified N, not a rule-of-thumb estimate.

If You Want to Know How Many Predictors You Can Use (Fixed N)

The commands above answer:

“How large must my dataset be for this planned model?”

In many studies, however, the dataset already exists. In that case, the question reverses:

“Given my fixed sample size, how complex can my model be?”

Here, pmsampsize switches from a sample-size planning tool to a model-complexity budgeting tool.

Fixed N → Solve for Predictor Parameters (Degrees of Freedom)

When N is fixed:

  • omit parameters(), and

  • specify n(#) instead.

Only one of parameters() or n() may be used.

A) Binary Outcome — Fixed Dataset

pmsampsize, type(b) ///
    n(200) prevalence(0.10) ///
    nagrsquared(0.15)

Interpretation:

With N = 200, 10% prevalence, and anticipated Nagelkerke R² ≈ 0.15, pmsampsize returns the maximum number of candidate predictor parameters (df) that can be safely considered.

This includes:

  • spline terms

  • categorical levels

  • interaction terms

—not just predictor names.
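The fixed-N question can be sketched by rearranging the shrinkage and optimism formulas for p instead of n. This is a simplified illustration of the logic, not the pmsampsize package itself; the inputs mirror the hypothetical example above (N = 200, 10% prevalence, Nagelkerke R² ≈ 0.15).

```python
import math

def max_parameters(n, phi, r2_nag, S=0.90, delta_r2=0.05):
    """Maximum predictor parameters (df) allowed by the shrinkage and
    optimism criteria, solving n = p / ((S-1) * ln(1 - R2_cs/S)) for p."""
    lnL0_per_n = phi * math.log(phi) + (1 - phi) * math.log(1 - phi)
    max_r2_cs = 1 - math.exp(2 * lnL0_per_n)
    r2_cs = r2_nag * max_r2_cs  # Nagelkerke -> Cox-Snell

    # Criterion 1 (shrinkage >= S) solved for p
    p1 = n * (S - 1) * math.log(1 - r2_cs / S)
    # Criterion 2 (limited optimism) solved for p
    S2 = r2_cs / (r2_cs + delta_r2 * max_r2_cs)
    p2 = n * (S2 - 1) * math.log(1 - r2_cs / S2)
    return math.floor(min(p1, p2))  # the stricter criterion wins

print(max_parameters(n=200, phi=0.10, r2_nag=0.15))
```

With only 200 patients (about 20 events), the shrinkage criterion permits roughly one candidate parameter, which is exactly why small fixed datasets rarely support multivariable CPM development.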

If You Only Have AUROC

pmsampsize, type(b) ///
    n(200) prevalence(0.10) ///
    cstatistic(0.78)

Here, pmsampsize approximates Cox–Snell R² from the AUROC and prevalence, then determines allowable model complexity under the shrinkage and optimism criteria.

B) Survival Outcome — Fixed Dataset

pmsampsize, type(s) ///
    n(500) ///
    rate(0.06) meanfup(5) timepoint(3) ///
    csrsquared(0.12)

This answers:

“With this sample size, event rate, and follow-up, how many predictor parameters can I safely include?”

C) Continuous Outcome — Fixed Dataset

pmsampsize, type(c) ///
    n(150) ///
    rsquared(0.25) ///
    intercept(50) sd(10)

The output reports:

  • maximum allowable parameters,

  • expected shrinkage,

  • precision of the mean and residual SD.

How to Read the Output (Key Point)

When using n():

  • r(parameters) = maximum candidate predictor parameters (df)

  • r(final_shrinkage) = expected global shrinkage

  • r(EPP) (binary/survival) = implied events per parameter

This defines how complex your model may be, not how many predictors you must include.

Critical Distinction

Planning N (parameters()) is a design-stage decision. Limiting predictors (n()) is a data-stage decision.

The same framework is used, but the questions are opposite.

Methods-Ready Justification Sentence

“Given the fixed sample size, we applied the Riley et al. criteria using pmsampsize to determine the maximum number of predictor parameters that could be safely considered during model development, targeting minimal overfitting and reliable calibration.”

Note

  • Use parameters() when planning data collection

  • Use n() when the dataset already exists

  • The output defines your predictor budget

  • Count degrees of freedom, not variable names

4) How to Find Each Input — Practical Sourcing

4.1 Outcome type: type(b | s | c)

Choose based on endpoint and time horizon:

  • Binary: disease yes/no at a fixed time

  • Survival: time-to-event with prediction horizon

  • Continuous: numeric outcome

4.2 Planned model complexity: parameters(#) = df

Count df, not predictors:

  • Binary predictor = 1 df

  • Continuous linear term = 1 df

  • Restricted cubic spline (4 knots) = 3 df

  • Categorical with k levels = k − 1 df

  • Interactions = add df for each product term
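The counting rules above can be tallied in a few lines; the predictor names here are hypothetical, and only the df arithmetic matters.

```python
# Degrees-of-freedom budget for a planned model (hypothetical predictors)
planned_model = {
    "sex (binary)": 1,
    "age (continuous, linear)": 1,
    "bilirubin (restricted cubic spline, 4 knots)": 3,
    "etiology (4 categories)": 4 - 1,
    "age x sex interaction (one product term)": 1,
}
total_df = sum(planned_model.values())
print(total_df)  # this total, not the count of 5 predictors, goes in parameters()
```

Note that five named predictors already consume nine df, which is the number pmsampsize needs.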

4.3 Anticipated performance: csrsquared() or nagrsquared() or cstatistic()

Source conservatively from:

  • similar CPMs (prefer external validation values)

  • pilot data

  • published AUC when R² is not available (use cstatistic())

4.4 Outcome frequency: prevalence() (binary) or event info (survival)

Binary: prevalence in the intended target population (not case-control sampling fraction).
Survival: event rate plus mean follow-up and the time horizon.

4.5 Overfitting tolerance: shrinkage()

Default 0.90 is standard. Consider 0.95 if decisions are high-stakes.

5) Worked Clinical Sketch — Diagnosing Cirrhosis

Setting: 1,000 evaluated; 100 cirrhosis (10%). Planned: 10 df. Anticipated performance: Nagelkerke R² ≈ 0.15.

pmsampsize, type(b) parameters(10) prevalence(0.10) nagrsquared(0.15) shrinkage(0.90)

If required N ≤ 1,000 (≈100 events), proceed. If required N > 1,000, simplify df, enlarge dataset, or revise expectations.

6) Why This Beats “10 EPV”

  • Performance-anchored (targets calibration and optimism)

  • df-aware (splines, categories, interactions count)

  • outcome-aware (prevalence/follow-up matter)

  • transparent (shows which criterion drives N)


Recap

  • There is no universal sample size rule; choose based on objective.

  • CPM sample size is about stability and overfitting control, not p-values.

  • In Stata, type() must be b, s, or c.

  • Provide one performance anchor: csrsquared() or nagrsquared() or cstatistic().

  • Use either parameters() (solve for N) or n() (solve for max df), not both.


