Why power twomeans in Stata Does Not Always Need “Real Means”
- Mayta

Introduction
A lot of people get stuck the first time they see this in Stata:
power twomeans m1 m2, ...
The syntax says means, so it feels natural to think m1 and m2 must always be the actual mean outcomes in the two groups.
That is true in many ordinary superiority studies. But it is not the whole story.
In practice, researchers sometimes enter values like 0 and 2 in a non-inferiority design and still get a valid sample size calculation. At first glance, that looks wrong. Why would “0” and “2” count as means?
The answer is simple but important: in a two-sample mean comparison, power is driven by the difference between the two inputs, not by their absolute location on the scale.
That is the key idea.
The command is written in terms of means, but the mathematics runs on the mean difference
When Stata calculates sample size for power twomeans, it uses the ingredients that actually determine detectability:
the difference between groups
the standard deviations
the alpha level
the target power
the allocation ratio
What matters most is the signal-to-noise relationship: how large the group difference is relative to variability.
So while the syntax is written as:
power twomeans m1 m2
what really drives the calculation is:
m2 - m1
Once the standard deviations are fixed, shifting both inputs up or down by the same amount does not change the power.
That means these three setups are mathematically equivalent for sample size:
power twomeans 50 52, sd1(3) sd2(3)
power twomeans 0 2, sd1(3) sd2(3)
power twomeans -10 -8, sd1(3) sd2(3)
All three represent the same gap: 2 units.
The absolute values are different. The effect size is the same.
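To see why only the difference matters, here is a minimal Python sketch of the standard normal-approximation formula for a two-sample mean comparison. This is an illustration of the underlying algebra, not Stata's exact computation (power twomeans iterates on the t distribution, so its numbers differ slightly):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(m1, m2, sd1, sd2, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided two-sample test of means,
    1:1 allocation, using the normal approximation.
    Only m2 - m1 enters the formula, never m1 or m2 alone."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    delta = m2 - m1  # the only way the "means" are used
    return ceil((z_a + z_b) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2)

# Shifting both inputs by the same amount leaves n unchanged:
print(n_per_group(50, 52, 3, 3))
print(n_per_group(0, 2, 3, 3))
print(n_per_group(-10, -8, 3, 3))
# all three print the same per-group sample size
```

Because delta is the only place m1 and m2 appear, any pair of numbers with the same difference produces the same answer.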
Why this makes sense
Imagine you are comparing body weight reduction between two groups. Suppose the clinically relevant difference is 2 percentage points, and the standard deviation in each group is about 3.
From the perspective of power, these scenarios are identical:
Group A mean = 20, Group B mean = 22
Group A mean = 0, Group B mean = 2
Group A mean = 48, Group B mean = 50
In each case, the gap is 2 and the variability is the same.
For sample size, the command does not care whether the comparison happens around 0, 20, or 50. It cares about whether a 2-unit separation can be detected against the background noise.
The usual case: superiority trials
In a conventional superiority design, people often enter the actual expected means:
power twomeans 0.3637 0.283, sd1(0.1225) sd2(0.1245) alpha(0.05) power(0.80)
Here the inputs are interpreted naturally:
0.3637 = expected mean in one group
0.283 = expected mean in the other group
That is completely fine. In this setting, the observed or expected group means are the most intuitive way to define the effect.
The less intuitive case: non-inferiority trials
Now consider a non-inferiority setting.
Suppose the outcome is percent weight reduction, and the investigator decides that being worse by up to 2 percentage points is still clinically acceptable. That 2-point value is the non-inferiority margin.
A senior colleague might write:
power twomeans 0 2, sd1(3) sd2(3) onesided nratio(0.5)
Why does this work?
Because the calculation is being set up on the difference scale.
Instead of thinking in terms of raw group means, the user is encoding the contrast of interest:
0 = no difference
2 = the acceptable boundary or target gap on the outcome scale
So the model is effectively being powered for a 2-unit separation, with SD = 3, under a one-sided framework.
That is why values that do not look like real means can still be useful in this command.
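The same difference-based logic extends to the one-sided, unequal-allocation setup. The sketch below mimics the effect of Stata's onesided and nratio() options (nratio is N2/N1) using the normal approximation; it is an illustration of the arithmetic, not a replacement for Stata's exact t-based result:

```python
from math import ceil
from statistics import NormalDist

def n_groups_onesided(m1, m2, sd1, sd2, alpha=0.05, power=0.80, nratio=1.0):
    """Sample sizes (n1, n2) for a one-sided two-sample test of means,
    with allocation ratio nratio = n2 / n1 (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # one-sided, so alpha is not halved
    z_b = NormalDist().inv_cdf(power)
    delta = m2 - m1  # again, only the difference matters
    n1 = (z_a + z_b) ** 2 * (sd1 ** 2 + sd2 ** 2 / nratio) / delta ** 2
    return ceil(n1), ceil(nratio * n1)

# Powering for a 2-unit separation, SD 3, one-sided, n2 = 0.5 * n1:
print(n_groups_onesided(0, 2, 3, 3, nratio=0.5))
# Shifting both inputs gives the identical requirement:
print(n_groups_onesided(20, 22, 3, 3, nratio=0.5))
```

Whether the inputs are entered as 0 and 2 or as 20 and 22, the encoded contrast, and therefore the sample size, is the same.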
So are those values “means” or not?
This is where the confusion usually comes from.
Formally, Stata’s syntax labels them as means because the command is designed for comparing two means.
But practically, once the standard deviations are given, the procedure depends on the difference between the two values. That lets users work with:
actual expected means
a mean and a margin-translated value
any pair of numbers that correctly encodes the intended difference on the same measurement scale
So the better mental model is not:
“These must always be literal observed means.”
It is:
“These two numbers define the difference I am powering for.”
A concrete example
Take these three commands:
power twomeans 78 80, sd1(5) sd2(5)
power twomeans 0 2, sd1(5) sd2(5)
power twomeans 100 102, sd1(5) sd2(5)
All three describe a difference of 2 with identical variability.
So they produce the same sample size requirement.
That is the clearest demonstration of why power twomeans can be used with values other than literal group means.
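The flip side is worth checking too: location is irrelevant, but the size of the gap and the noise are not. A quick Python sketch of the normal-approximation formula (an illustration, not Stata's exact t-based calculation) makes both points:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(m1, m2, sd, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample test, equal SDs,
    1:1 allocation (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * 2 * sd ** 2 / (m2 - m1) ** 2)

# Location does not matter:
assert n_per_group(78, 80, 5) == n_per_group(0, 2, 5) == n_per_group(100, 102, 5)

# ...but the gap and the SD do:
print(n_per_group(0, 2, 5))  # difference 2
print(n_per_group(0, 1, 5))  # halving the difference roughly quadruples n
```

The difference enters the formula squared and in the denominator, which is why a smaller clinically relevant gap is so much more expensive in sample size.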
Why this is especially common in non-inferiority work
Non-inferiority studies are often framed around a margin, not around a hoped-for superiority effect.
The real clinical question becomes:
How much worse can the new treatment be and still be considered acceptable?
That margin lives on the same outcome scale as the mean difference. So researchers often think directly in terms of the contrast rather than raw group means.
That is why code like 0 and 2 shows up so often in teaching or protocol drafts. It is not ignoring the mean. It is expressing the design on a different scale.
A note of caution
This does not mean you can plug in any arbitrary numbers carelessly.
The approach only makes sense when:
the outcome is continuous and analyzed as a mean difference
the standard deviations reflect the same outcome scale
the difference being encoded is clinically and statistically meaningful
the direction of the one-sided hypothesis matches the trial objective
Also, non-inferiority margins should never be invented casually. They need clinical justification.
And this logic does not transfer blindly to every outcome type. For proportions, survival outcomes, rates, or odds ratios, you should use the matching power command for that design rather than treating everything like a mean.
The practical takeaway
power twomeans looks like a command about two raw means, but for sample size it is really driven by the difference between them.
That is why both of these can be valid ways to think:
Superiority framing: use expected group means
Non-inferiority framing: use values that encode the clinically important difference or margin
So when someone writes:
power twomeans 0 2, sd1(3) sd2(3) onesided
they are not misusing the command just because 0 and 2 are not literal observed means. They are using the command on the scale that matters for power: the target mean difference.
Once that clicks, the syntax becomes much less mysterious.

