Choosing Link Functions and Families in GLMs: An Outcome-Based Guide

Mayta
Dec 12, 2025

Outcome → Family → Link (Core GLM Mapping)

Outcome (Y)	What Y Is in the Real World	Family (Distribution)	Link	Effect Measure (Interpretation)
Continuous	Numerical, symmetric (e.g. BP, lab values)	Gaussian	Identity	Mean Difference
Continuous	Positive, right-skewed (ratio scale)	Gaussian (working model)	Log (ln)	Mean Ratio
Binary (0/1)	Event vs no event	Binomial	Identity	Risk Difference
Binary (0/1)	Event vs no event	Binomial	Log (ln)	Risk Ratio
Binary (0/1)	Event vs no event	Binomial	Logit	Odds Ratio
Count	Number of events	Poisson	Log (ln)	Count Ratio / IRR
Rate	Events per person-time	Poisson	Log (ln)	Incidence Rate Ratio
Count / Rate	Additive event change	Poisson	Identity	Count / Rate Difference
Count (overdispersed)	Variance > mean	Negative Binomial	Log (ln)	IRR (robust to overdispersion)
Binary (RR fallback)	Log-binomial fails	Poisson (working)	Log (ln)	Risk Ratio (robust SEs)

How to Read This Table (One-Line Rule)

Outcome type determines the Family. Scientific question (difference vs ratio vs odds) determines the Link.

Mental Shortcut

Difference → Identity link (−)
Ratio → Log link (ln → ÷)
Odds → Logit link (log-odds)
Continuous → Gaussian
Binary → Binomial
Count / Rate → Poisson (or Negative Binomial if overdispersed)
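
The table and the shortcut above can be written down as a small lookup. What follows is only a minimal Python sketch: the function name `choose_glm` and the string labels are made up for illustration and do not belong to any statistics library.

```python
# Hypothetical helper: maps outcome type and effect measure to (family, link),
# following the mapping table above. All names here are illustrative only.
FAMILY_BY_OUTCOME = {
    "continuous": "Gaussian",
    "binary": "Binomial",
    "count": "Poisson",
    "rate": "Poisson",
}

LINK_BY_EFFECT = {
    "difference": "identity",
    "ratio": "log",
    "odds": "logit",
}


def choose_glm(outcome, effect, overdispersed=False):
    """Return (family, link) for an outcome type and a scientific question."""
    if effect == "odds" and outcome != "binary":
        raise ValueError("odds ratios only apply to binary outcomes")
    family = FAMILY_BY_OUTCOME[outcome]
    # Overdispersed counts or rates switch from Poisson to Negative Binomial.
    if overdispersed and outcome in ("count", "rate"):
        family = "Negative Binomial"
    return family, LINK_BY_EFFECT[effect]


print(choose_glm("binary", "ratio"))                       # ('Binomial', 'log')
print(choose_glm("count", "ratio", overdispersed=True))    # ('Negative Binomial', 'log')
```

The two lookups are kept separate on purpose: the outcome type picks the family, and the scientific question picks the link.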

In generalized linear models (GLM) and generalized estimating equations (GEE), the link function is the first and most important conceptual step. Before thinking about distributions or variability, we must first answer a simple question:

What do we need to do to the outcome (Y) so that it can be expressed as a linear formula?

This is because regression models rely on a linear predictor (LP) of the form:

LP = α + βX

Real-world clinical data—means, risks, rates, probabilities—are rarely linear by nature. The link function is what transforms the outcome (or its mean) so that this linear structure becomes possible.

1. The Role of the Link Function

The link function acts on Y (more precisely, on the expected value of Y) to make the relationship between X and Y compatible with a straight-line equation.

Conceptually:

The link tells us what mathematical operation we must apply to Y so that the formula fits the data.

Different scientific questions require different links because they imply different types of effects.
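
To make "applying an operation to the mean" concrete, here is a minimal Python sketch of the three links and their inverses. The function names are assumptions chosen for readability; the formulas themselves are the standard definitions of the identity, log, and logit links.

```python
import math


def identity(mu):
    return mu                      # no transformation


def identity_inv(lp):
    return lp


def log_link(mu):
    return math.log(mu)            # requires mu > 0


def log_inv(lp):
    return math.exp(lp)


def logit(p):
    return math.log(p / (1 - p))   # requires 0 < p < 1


def logit_inv(lp):
    return 1 / (1 + math.exp(-lp))


LINKS = {
    "identity": (identity, identity_inv),
    "log": (log_link, log_inv),
    "logit": (logit, logit_inv),
}

mu = 0.2  # hypothetical expected value (for example, a risk)
for name, (g, g_inv) in LINKS.items():
    lp = g(mu)  # the value on the linear-predictor scale
    print(f"{name:8s} g(mu) = {lp:+.4f}   back-transformed = {g_inv(lp):.4f}")
```

Each link maps the mean onto the scale of the linear predictor, and its inverse maps a value of α + βX back to a mean.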

2. Identity Link: Difference (−)

When we use the identity link, we do nothing to the outcome.

The mean of Y is modeled directly.
The effect is additive.
Interpretation is based on subtraction.

For two groups or two values of X:

Effect = μ1 − μ0

This is why the identity link is used for:

Mean difference
Risk difference

Intuition

The relationship is already linear on the original scale, so no transformation is needed. The straight-line model works naturally.
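
With a single binary X, an identity-link model μ = α + βX has a simple reading: α is the mean in the reference group and β is the difference in means. The sketch below uses made-up blood pressure values purely for illustration.

```python
from statistics import mean

# Hypothetical systolic blood pressure values (mmHg); not real data.
control = [142, 138, 150, 145, 140]   # X = 0
treated = [130, 128, 135, 132, 125]   # X = 1

alpha = mean(control)                  # E[Y | X = 0]
beta = mean(treated) - mean(control)   # mean difference: mu1 - mu0

print(f"alpha (mean at X = 0)  = {alpha}")
print(f"beta (mean difference) = {beta}")
print(f"alpha + beta (mean at X = 1) = {alpha + beta}")
```

The same arithmetic applied to 0/1 outcomes would give a risk difference instead of a mean difference.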

3. Log Link (ln): Ratio (÷) via Log Properties

When the scientific question is about ratios rather than differences, the identity link no longer answers it. Risks, rates, and counts often change multiplicatively, not additively.

This is where the log link (natural log, ln) is used.

On the log scale, the model becomes linear:

ln(E[Y]) = α + βX

For two groups:

ln(μ1) − ln(μ0)

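Using the property of logarithms:

ln(μ1) − ln(μ0) = ln(μ1 / μ0)

So subtraction on the log scale corresponds to division on the original scale.

After transforming back (with X coded 0/1, so that the difference on the log scale equals β):

exp(β) = μ1 / μ0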

This is why the log link is used for:

Risk ratio
Rate ratio
Mean ratio

Intuition

Ratios are linear only after taking ln. The log link converts multiplicative real-world behavior into additive linear behavior.
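
A quick numeric check of this idea, as a minimal sketch: the two risks below are assumed values chosen for illustration, with X coded 0 for unexposed and 1 for exposed.

```python
import math

# Hypothetical risks in two groups (assumed values, not real data).
risk_unexposed = 0.10   # mu0, X = 0
risk_exposed = 0.25     # mu1, X = 1

# On the log scale the model is ln(mu) = alpha + beta * X.
alpha = math.log(risk_unexposed)
beta = math.log(risk_exposed) - math.log(risk_unexposed)

# Back-transforming the coefficient gives the ratio on the original scale.
risk_ratio = math.exp(beta)

print(f"beta on the log scale = {beta:.4f}")
print(f"exp(beta) = {risk_ratio:.2f}")
print(f"mu1 / mu0 = {risk_exposed / risk_unexposed:.2f}")
```

The difference of logs and the log of the ratio are the same number, which is why exponentiating the coefficient returns a ratio.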

4. Logit Link: Log Odds

Probabilities are bounded between 0 and 1, and their relationship with X typically follows an S-shaped (sigmoid) curve rather than a straight line.

The logit link transforms probability into log odds:

logit(p) = ln(p / (1 − p))

On the log-odds scale:

The range becomes −∞ to +∞
The relationship with X becomes linear
Subtraction corresponds to the log of an odds ratio, so exp(β) is the odds ratio

This is why the logit link is used for:

Odds ratios in logistic regression
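
As a minimal sketch of how the logit scale leads to an odds ratio, the example below uses a hypothetical 2×2 table (the counts are invented). For a single binary X, the difference in log odds between the two groups plays the role of β.

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


# Hypothetical 2x2 table: events and group sizes (not real data).
events_exposed, n_exposed = 30, 100
events_unexposed, n_unexposed = 15, 100

p1 = events_exposed / n_exposed        # risk when X = 1
p0 = events_unexposed / n_unexposed    # risk when X = 0

beta = logit(p1) - logit(p0)           # difference in log odds
odds_ratio = math.exp(beta)

# Classic cross-product odds ratio from the same table, for comparison.
cross_product = (events_exposed * (n_unexposed - events_unexposed)) / (
    (n_exposed - events_exposed) * events_unexposed
)

print(f"log odds: X=1 {logit(p1):+.3f}, X=0 {logit(p0):+.3f}")
print(f"exp(beta) = {odds_ratio:.3f}, cross-product OR = {cross_product:.3f}")
```

Both routes give the same odds ratio, because subtracting log odds is the same as dividing odds.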

5. Why All of This Is Necessary

All three links—identity, log, and logit—exist for the same fundamental reason:

We must transform real-world outcomes into a linear form so that we can construct a linear predictor (LP).

Once the linear predictor exists: LP = α + βX

Then—and only then—can we estimate coefficients, test hypotheses, and interpret effects.

6. The Family: Where the Distribution Comes In

Once the link function has done its job—turning the real-world outcome into a linear predictor—we are still missing one essential part of the model:

Real data never lie exactly on the linear predictor. Observed Y values always vary around the expected value.

This variability is not random chaos. It follows recognizable statistical patterns. The family specifies which probability distribution describes that variability.

In other words:

The link defines the mean structure. The family defines the distribution of Y around that mean.

7. Why We Need a Family at All

In theory, the linear predictor gives us:

E [ Y ∣ X ]

But in practice, for the same X value, we observe many different Y values.

Example:

Same treatment
Same covariates
Different patients → Different outcomes

This spread of values is what creates a distribution.

The family answers the question:

“Given the mean predicted by the linear predictor, how does Y behave?”
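
A small simulation can illustrate this, under stated assumptions: the probability, the number of patients, and the random seed below are all arbitrary choices, and the outcome is assumed to be a binary event.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

p = 0.30     # hypothetical event probability predicted for one value of X
n = 10_000   # hypothetical number of patients sharing that value of X

# Every patient has the same expected value p,
# but each observed outcome is either 0 or 1.
outcomes = [1 if random.random() < p else 0 for _ in range(n)]

observed_risk = sum(outcomes) / n

print("first ten outcomes:", outcomes[:10])
print(f"expected risk = {p:.2f}, observed risk = {observed_risk:.3f}")
```

The individual outcomes never equal 0.30; only their average approaches it. Describing that scatter around the mean is the job of the family.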

8. Families Reflect How Outcomes Behave in the Real World

Clinical outcomes are not arbitrary. Over decades, we have learned that they fall into a small number of natural outcome types, each with its own distribution.

Gaussian (Normal) Family

Used when Y is continuous and symmetric.

Blood pressure
Weight
Lab values

Here:

Y can take any real value
Errors are roughly symmetric around the mean
Variability is described by a standard deviation

This is why continuous outcomes are modeled with a Gaussian family.

Binomial Family

Used when Y is binary (0/1, yes/no) and the quantity of interest is a risk or an odds.

Disease vs no disease
Death vs survival

Here:

Each observation is an event or a non-event
The mean represents a probability
Variability follows a Bernoulli / binomial process

This is why risks, odds, and probabilities use the binomial family.

Poisson Family

Used when Y is a count or rate.

Number of events
Incidence per person-time

Here:

Y takes non-negative integers (0, 1, 2, …)
Variability increases with the mean
Events are assumed to occur independently over time

This is why event counts and incidence rates use the Poisson family.

Negative Binomial Family

Used when the count data are more variable than Poisson allows.

Same outcome type as Poisson
But the variance is larger than the mean

This family exists because real clinical data are often messier than theory.
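
One way to compare the four families is through how each ties the variance to the mean. The sketch below states the usual variance functions; the negative binomial line assumes the common "NB2" form with a dispersion parameter, called `alpha` here (a name chosen for this example). The counts are invented, and the variance-to-mean ratio is only a crude descriptive check, not a formal test.

```python
from statistics import mean, variance


def var_gaussian(mu, sigma2):
    return sigma2                  # constant: does not depend on the mean


def var_binomial(mu):
    return mu * (1 - mu)           # per observation, mu is a probability


def var_poisson(mu):
    return mu                      # variance equals the mean


def var_negbin(mu, alpha):
    return mu + alpha * mu ** 2    # NB2 form (assumed); alpha > 0 adds extra variance


# Hypothetical event counts per patient (not real data).
counts = [0, 0, 1, 0, 7, 2, 0, 12, 1, 0, 3, 0]

m = mean(counts)
v = variance(counts)               # sample variance
dispersion_ratio = v / m           # near 1 is what Poisson would expect

print(f"mean = {m:.2f}, variance = {v:.2f}, variance / mean = {dispersion_ratio:.2f}")
print("Poisson variance at this mean:", round(var_poisson(m), 2))
```

When the ratio is well above 1, as in these made-up counts, the Poisson assumption looks too tight and the negative binomial becomes the more plausible family.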

9. Family Is Not About “This Dataset Only”

One point is especially important:

The family is not chosen just because of the current dataset.

Instead:

It reflects how this type of outcome behaves in general
Across studies
Across populations
Across repeated measurements

So we do not say:

“My data look Gaussian, so I choose Gaussian.”

We say:

“This outcome is continuous by nature, so Gaussian is appropriate.”

This is a conceptual decision, not a cosmetic one.

10. Putting Link and Family Together

Now everything connects:

1. Start with the real-world outcome
- Continuous → Gaussian
- Binary → Binomial
- Count / rate → Poisson / Negative binomial
2. Choose the scientific effect
- Difference → Identity link
- Ratio → Log link (ln)
- Odds → Logit link
3. Build the linear predictor

link(E[Y]) = α + βX

4. Use the family to describe how Y varies around that mean
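
To see how the link changes the meaning of β, the following sketch describes the same two hypothetical risks (0.10 when X = 0 and 0.20 when X = 1) on all three scales. The helper `expected_value` and the coefficient values are assumptions made for this example. Only the mean structure is shown; the family step is not simulated.

```python
import math

# Inverse links: map the linear predictor back to the mean.
INVERSE_LINK = {
    "identity": lambda lp: lp,
    "log": math.exp,
    "logit": lambda lp: 1 / (1 + math.exp(-lp)),
}


def expected_value(link, alpha, beta, x):
    """E[Y | X = x] for a model of the form link(E[Y]) = alpha + beta * x."""
    return INVERSE_LINK[link](alpha + beta * x)


def logit(p):
    return math.log(p / (1 - p))


p0, p1 = 0.10, 0.20  # hypothetical risks at X = 0 and X = 1

# Coefficients that reproduce the same two risks under each link.
models = {
    "identity": (p0, p1 - p0),                       # beta = risk difference
    "log": (math.log(p0), math.log(p1 / p0)),        # exp(beta) = risk ratio
    "logit": (logit(p0), logit(p1) - logit(p0)),     # exp(beta) = odds ratio
}

for link, (alpha, beta) in models.items():
    effect = beta if link == "identity" else math.exp(beta)
    fitted = expected_value(link, alpha, beta, 1)
    print(f"{link:8s} beta = {beta:+.4f}  effect = {effect:.4f}  E[Y | X=1] = {fitted:.2f}")
```

All three models return the same expected risk at X = 1. What differs is the effect measure that β represents: a risk difference, a risk ratio, or an odds ratio.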

Summary

The link function transforms the outcome so that reality becomes linear and can be written as a linear predictor. The family specifies the probability distribution that governs how observed outcomes vary around that linear mean. Together, link + family define the full statistical model.
