
Choosing Link Functions and Families (Family) in GLMs: Outcome-Based


Outcome → Family → Link (Core GLM Mapping)

| Outcome (Y) | What Y Is in the Real World | Family (Distribution) | Link | Effect Measure (Interpretation) |
| --- | --- | --- | --- | --- |
| Continuous | Numerical, symmetric (e.g. BP, lab values) | Gaussian | Identity | Mean Difference |
| Continuous | Positive, right-skewed (ratio scale) | Gaussian (working model) | Log (ln) | Mean Ratio |
| Binary (0/1) | Event vs no event | Binomial | Identity | Risk Difference |
| Binary (0/1) | Event vs no event | Binomial | Log (ln) | Risk Ratio |
| Binary (0/1) | Event vs no event | Binomial | Logit | Odds Ratio |
| Count | Number of events | Poisson | Log (ln) | Count Ratio / IRR |
| Rate | Events per person-time | Poisson | Log (ln) | Incidence Rate Ratio |
| Count / Rate | Additive event change | Poisson | Identity | Count / Rate Difference |
| Count (overdispersed) | Variance > mean | Negative Binomial | Log (ln) | IRR (robust to overdispersion) |
| Binary (RR fallback) | Log-binomial fails | Poisson (working) | Log (ln) | Risk Ratio (robust SEs) |

How to Read This Table (One-Line Rule)

Outcome type determines the Family. Scientific question (difference vs ratio vs odds) determines the Link.

Mental Shortcut


In generalized linear models (GLM) and generalized estimating equations (GEE), the link function is the first and most important conceptual step. Before thinking about distributions or variability, we must first answer a simple question:

What do we need to do to the outcome (Y) so that it can be expressed as a linear formula?

This is because regression models rely on a linear predictor (LP) of the form:

LP = α + βX

Real-world clinical quantities such as means, risks, rates, and probabilities are rarely linear in X on their natural scale. The link function transforms the mean of the outcome so that this linear structure becomes possible.


1. The Role of the Link Function

The link function acts on Y (more precisely, on the expected value of Y) to make the relationship between X and Y compatible with a straight-line equation.

Conceptually:

The link tells us what mathematical operation we must apply to Y so that the formula fits the data.

Different scientific questions require different links because they imply different types of effects.
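As a rough illustration of this idea, the three links discussed in this post can be written as plain Python functions, each paired with the inverse that maps a linear predictor back to the mean scale. The function names are illustrative only and are not taken from any statistics library.

```python
import math

# Each link maps a mean (mu) onto the linear-predictor scale.
# Each inverse maps a linear predictor (lp) back onto the mean scale.

def identity_link(mu):
    return mu

def identity_inverse(lp):
    return lp

def log_link(mu):
    return math.log(mu)             # requires mu > 0

def log_inverse(lp):
    return math.exp(lp)

def logit_link(p):
    return math.log(p / (1 - p))    # requires 0 < p < 1

def logit_inverse(lp):
    return 1 / (1 + math.exp(-lp))

# The same hypothetical risk of 0.20 expressed on each scale.
risk = 0.20
print("identity:", identity_link(risk))
print("log:     ", log_link(risk))
print("logit:   ", logit_link(risk))
```

Applying a link and then its inverse returns the original mean, which is why fitted values can always be reported back on the clinical scale.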


2. Identity Link: Difference (−)

When we use the identity link, we do nothing to the outcome.

For two groups or two values of X:

Effect = μ1 − μ0

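This is why the identity link is used for:

  • Mean difference (continuous outcome, Gaussian family)
  • Risk difference (binary outcome, binomial family)
  • Count or rate difference (count or rate outcome, Poisson family)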

Intuition

The relationship is already linear on the original scale, so no transformation is needed. The straight-line model works naturally.
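A minimal sketch of this point, using invented blood pressure readings for two groups: with the identity link and a 0/1 group indicator, the least-squares slope is exactly the difference in group means, and the intercept is the mean of the reference group. All numbers and variable names below are hypothetical.

```python
# Hypothetical systolic blood pressure readings (mmHg) for two groups.
control = [128.0, 135.0, 122.0, 140.0, 130.0]   # X = 0
treated = [118.0, 125.0, 121.0, 130.0, 116.0]   # X = 1

def mean(values):
    return sum(values) / len(values)

mu0 = mean(control)
mu1 = mean(treated)
mean_difference = mu1 - mu0

# Least-squares fit of Y = alpha + beta * X on the original scale.
x = [0] * len(control) + [1] * len(treated)
y = control + treated
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta = sxy / sxx
alpha = y_bar - beta * x_bar

print("mean difference:", mean_difference)
print("slope (beta):   ", beta)
print("intercept:      ", alpha)
```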


3. Log Link (ln): Ratio (÷) via Log Properties

When the scientific question is about ratios rather than differences, the identity link no longer works. Risks, rates, and counts usually change multiplicatively, not additively.

This is where the log link (natural log, ln) is used.

On the log scale, the model becomes linear:

ln(E[Y]) = α + βX

For two groups:

ln(μ1) − ln(μ0) = β

Using the property of logarithms:

ln(a) − ln(b) = ln(a / b)

So subtraction on the log scale corresponds to division on the original scale.

After transforming back:

μ1 / μ0

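This is why the log link is used for:

  • Mean ratio (positive, right-skewed continuous outcome)
  • Risk ratio (binary outcome)
  • Count ratio / incidence rate ratio (count or rate outcome)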

Intuition

Ratios become linear only after taking ln. The log link converts multiplicative real-world behavior into additive linear behavior.
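To make the log algebra concrete, here is a small sketch with invented event counts and person-years for two groups. It computes the intercept and slope on the log scale and checks that exponentiating the slope gives the rate ratio. The numbers and variable names are assumptions made purely for illustration.

```python
import math

# Hypothetical event counts and person-years of follow-up.
events_0, person_years_0 = 30, 1000.0   # unexposed group (X = 0)
events_1, person_years_1 = 45, 1000.0   # exposed group (X = 1)

rate_0 = events_0 / person_years_0
rate_1 = events_1 / person_years_1

# On the log scale the two groups differ by a subtraction.
alpha = math.log(rate_0)
beta = math.log(rate_1) - math.log(rate_0)

# Back on the original scale that subtraction is a division.
rate_ratio = math.exp(beta)

print("beta (log scale):", beta)
print("rate ratio:      ", rate_ratio)
print("direct division: ", rate_1 / rate_0)
```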


4. Logit Link: Log Odds

Probabilities are bounded between 0 and 1, so their relationship with X typically follows an S-shaped (sigmoid) curve rather than a straight line.

The logit link transforms probability into log odds:

ln(p / (1 − p))

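On the log-odds scale, the model becomes linear:

ln(p / (1 − p)) = α + βX

For two groups, the difference in log odds is β, and exp(β) is the odds ratio.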

This is why the logit link is used for:

  • Odds ratio (binary outcome, binomial family)
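A short numeric sketch may help here. Using two hypothetical risks (0.10 and 0.20, invented for illustration), it computes the effect implied by each of the three links for the same binary outcome, showing that the odds ratio, risk ratio, and risk difference are different numbers answering different questions.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical risks in the unexposed (p0) and exposed (p1) groups.
p0 = 0.10
p1 = 0.20

# Logit link: subtraction on the log-odds scale, then exponentiate.
beta_logit = logit(p1) - logit(p0)
odds_ratio = math.exp(beta_logit)

# Log link: ratio of risks.
risk_ratio = math.exp(math.log(p1) - math.log(p0))

# Identity link: difference of risks.
risk_difference = p1 - p0

print("odds ratio:     ", odds_ratio)
print("risk ratio:     ", risk_ratio)
print("risk difference:", risk_difference)
```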


5. Why All of This Is Necessary

All three links—identity, log, and logit—exist for the same fundamental reason:

We must transform real-world outcomes into a linear form so that we can construct a linear predictor (LP).

Once the linear predictor exists: LP = α + βX

Then—and only then—can we estimate coefficients, test hypotheses, and interpret effects.
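For readers curious how coefficients are actually estimated once the linear predictor is defined, the following is a minimal sketch of iteratively reweighted least squares (IRLS), the standard fitting approach for GLMs, written for one specific case: a Poisson family with a log link and a single predictor. The function name `fit_poisson_log` and the data are hypothetical, and real software adds convergence checks and standard errors that are omitted here.

```python
import math

def fit_poisson_log(x, y, iterations=25):
    """Fit ln(E[Y]) = alpha + beta * x by iteratively reweighted least squares."""
    # Start from an intercept-only guess: alpha = ln(mean of y), beta = 0.
    alpha, beta = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        # Accumulate the weighted normal equations for the two parameters.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = alpha + beta * xi        # linear predictor
            mu = math.exp(eta)             # inverse of the log link
            w = mu                         # working weight (Poisson, log link)
            z = eta + (yi - mu) / mu       # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 weighted least-squares system directly.
        det = s_w * s_wxx - s_wx ** 2
        alpha = (s_wxx * s_wz - s_wx * s_wxz) / det
        beta = (s_w * s_wxz - s_wx * s_wz) / det
    return alpha, beta

# Hypothetical event counts for an unexposed (0) and exposed (1) group.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2, 1, 3, 2, 4, 5, 3, 4]

alpha, beta = fit_poisson_log(x, y)
print("baseline mean exp(alpha):", math.exp(alpha))
print("count ratio exp(beta):   ", math.exp(beta))
```

With a single 0/1 predictor the fitted means reproduce the two group means, so exp(beta) matches the ratio of mean counts computed by hand.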


6. The Family: Where the Distribution Comes In

Once the link function has done its job—turning the real-world outcome into a linear predictor—we are still missing one essential part of the model:

Real data never lie exactly on the linear predictor. Observed Y values always vary around the expected value.

This variability is not random chaos. It follows recognizable statistical patterns. The family specifies which probability distribution describes that variability.

In other words:

The link defines the mean structure. The family defines the distribution of Y around that mean.


7. Why We Need a Family at All

In theory, the linear predictor, mapped back through the inverse of the link, gives us the expected value:

E[Y | X]

But in practice, for the same X value, we observe many different Y values.

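Example: patients with the same age, sex, and treatment still have different blood pressure readings, and some of them experience the event while others do not.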

This spread of values is what creates a distribution.

The family answers the question:

“Given the mean predicted by the linear predictor, how does Y behave?”
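The following sketch simulates that situation for a binary outcome. Every simulated patient shares the same hypothetical expected risk of 0.20, yet each individual outcome is either 0 or 1, and only the average across patients comes close to the expected value. The risk, sample size, and seed are arbitrary choices made for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

expected_risk = 0.20     # hypothetical E[Y | X] for one covariate pattern
n_patients = 10_000

# Same X for everyone, hence the same expected risk, but Y is still 0 or 1.
outcomes = [1 if random.random() < expected_risk else 0
            for _ in range(n_patients)]

observed_risk = sum(outcomes) / n_patients

print(f"expected risk: {expected_risk:.3f}")
print(f"observed risk: {observed_risk:.3f}")
print("distinct Y values:", sorted(set(outcomes)))
```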


8. Families Reflect How Outcomes Behave in the Real World

Clinical outcomes are not arbitrary. Over decades, we have learned that they fall into a small number of natural outcome types, each with its own distribution.

Gaussian (Normal) Family

Used when Y is continuous and symmetric.

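Here:

  • Y can take any value on a continuous scale
  • Values scatter symmetrically around the mean
  • The variance does not depend on the mean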

This is why continuous outcomes are modeled with a Gaussian family.

Binomial Family

Used when Y is binary (0/1, yes/no), so that the quantity being modeled is a risk or an odds.

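Here:

  • Y takes only two values (0 or 1)
  • The mean of Y is a probability p between 0 and 1
  • The variance is determined by the mean: p(1 − p)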

This is why risks, odds, and probabilities use the binomial family.

Poisson Family

Used when Y is a count or rate.

Here:

  • Y is a non-negative integer (0, 1, 2, ...)
  • The variance equals the mean

This is why event counts and incidence rates use the Poisson family.

Negative Binomial Family

Used when the count data are more variable than Poisson allows.

This family exists because real clinical data are often messier than theory.
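One way to see what distinguishes these families is through how each ties the variance of Y to its mean. The sketch below writes those variance functions in Python and then compares the sample variance with the sample mean for a set of invented counts, a quick informal check for overdispersion. The negative binomial form shown is the common "NB2" parameterization, and the `dispersion` parameter name is an assumption of this sketch.

```python
import statistics

def gaussian_variance(mu, sigma2):
    return sigma2                       # constant, unrelated to the mean

def binomial_variance(p):
    return p * (1 - p)                  # single 0/1 outcome with risk p

def poisson_variance(mu):
    return mu                           # variance equals the mean

def negative_binomial_variance(mu, dispersion):
    return mu + dispersion * mu ** 2    # extra variance beyond Poisson (NB2)

# Hypothetical counts, e.g. exacerbations per patient in one year.
counts = [0, 1, 0, 7, 2, 0, 12, 1, 0, 3]

sample_mean = statistics.mean(counts)
sample_variance = statistics.variance(counts)
dispersion_ratio = sample_variance / sample_mean

print("mean:", sample_mean)
print("variance:", sample_variance)
print("variance / mean:", dispersion_ratio)
```

A ratio well above 1, as in this invented example, suggests the counts vary more than the Poisson family allows, which is the situation the negative binomial family is meant for.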


9. Family Is Not About “This Dataset Only”

One point is especially important:

The family is not chosen just because of the current dataset.

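Instead, the family reflects how this type of outcome behaves by nature, whichever sample happens to be in hand.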

So we do not say:

“My data look Gaussian, so I choose Gaussian.”

We say:

“This outcome is continuous by nature, so Gaussian is appropriate.”

This is a conceptual decision, not a cosmetic one.


10. Putting Link and Family Together

Now everything connects:

  1. Start with the real-world outcome
    • Continuous → Gaussian
    • Binary → Binomial
    • Count / rate → Poisson / Negative binomial
  2. Choose the scientific effect
    • Difference → Identity link
    • Ratio → Log link (ln)
    • Odds → Logit link
  3. Build the linear predictor

     link(E[Y]) = α + βX

  4. Use the family to describe how Y varies around that mean
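The first two steps above can be sketched as a pair of lookup tables. This is only an illustration of the decision rule described in this post; `choose_glm` and the dictionary names are hypothetical and do not correspond to any statistics package.

```python
# Step 1: the real-world outcome type determines the family.
FAMILY_BY_OUTCOME = {
    "continuous": "Gaussian",
    "binary": "Binomial",
    "count": "Poisson",
    "rate": "Poisson",
    "overdispersed count": "Negative binomial",
}

# Step 2: the scientific effect of interest determines the link.
LINK_BY_EFFECT = {
    "difference": "identity",
    "ratio": "log",
    "odds": "logit",
}

def choose_glm(outcome, effect):
    """Return (family, link) for an outcome type and effect measure."""
    if effect == "odds" and outcome != "binary":
        raise ValueError("the logit link applies to binary outcomes")
    return FAMILY_BY_OUTCOME[outcome], LINK_BY_EFFECT[effect]

print(choose_glm("continuous", "difference"))   # mean difference
print(choose_glm("binary", "ratio"))            # risk ratio
print(choose_glm("binary", "odds"))             # odds ratio
print(choose_glm("rate", "ratio"))              # incidence rate ratio
```

Keeping the two lookups separate mirrors the rule that the outcome type and the scientific question are decided independently.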


Summary

The link function transforms the mean of the outcome so that it can be written as a linear predictor. The family specifies the probability distribution that governs how observed outcomes vary around that mean. Together, link + family define the full statistical model.