
Choosing Link Functions and Families in GLMs: An Outcome-Based Approach

  • Writer: Mayta
  • Dec 12, 2025
  • 5 min read

Outcome → Family → Link (Core GLM Mapping)

| Outcome (Y) | What Y Is in the Real World | Family (Distribution) | Link | Effect Measure (Interpretation) |
|---|---|---|---|---|
| Continuous | Numerical, symmetric (e.g. BP, lab values) | Gaussian | Identity | Mean Difference |
| Continuous | Positive, right-skewed (ratio scale) | Gaussian (working model) | Log (ln) | Mean Ratio |
| Binary (0/1) | Event vs no event | Binomial | Identity | Risk Difference |
| Binary (0/1) | Event vs no event | Binomial | Log (ln) | Risk Ratio |
| Binary (0/1) | Event vs no event | Binomial | Logit | Odds Ratio |
| Count | Number of events | Poisson | Log (ln) | Count Ratio / IRR |
| Rate | Events per person-time | Poisson | Log (ln) | Incidence Rate Ratio |
| Count / Rate | Additive event change | Poisson | Identity | Count / Rate Difference |
| Count (overdispersed) | Variance > mean | Negative Binomial | Log (ln) | IRR (robust to overdispersion) |
| Binary (RR fallback) | Log-binomial fails | Poisson (working) | Log (ln) | Risk Ratio (robust SEs) |

How to Read This Table (One-Line Rule)

Outcome type determines the Family. Scientific question (difference vs ratio vs odds) determines the Link.

Mental Shortcut

  • Difference → Identity link (−)

  • Ratio → Log link (ln → ÷)

  • Odds → Logit link (log-odds)

  • Continuous → Gaussian

  • Binary → Binomial

  • Count / Rate → Poisson (or Negative Binomial if overdispersed)


In generalized linear models (GLM) and generalized estimating equations (GEE), the link function is the first and most important conceptual step. Before thinking about distributions or variability, we must first answer a simple question:

What do we need to do to the outcome (Y) so that it can be expressed as a linear formula?

This is because regression models rely on a linear predictor (LP) of the form:

LP = α + βX

Real-world clinical data—means, risks, rates, probabilities—are rarely linear by nature. The link function is what transforms the outcome (or its mean) so that this linear structure becomes possible.

1. The Role of the Link Function

The link function acts on Y (more precisely, on the expected value of Y) to make the relationship between X and Y compatible with a straight-line equation.

Conceptually:

The link tells us what mathematical operation we must apply to Y so that the formula fits the data.

Different scientific questions require different links because they imply different types of effects.
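As a minimal sketch of what "applying a link" means, the three links discussed in this post can be written as plain functions paired with their inverses. The names `LINKS`, `to_linear_scale`, and `to_mean_scale` are illustrative choices made for this example, not part of any particular library.

```python
import math

# Each entry pairs a link g (mean -> linear-predictor scale)
# with its inverse (linear predictor -> mean).
LINKS = {
    "identity": (lambda mu: mu, lambda lp: lp),
    "log": (lambda mu: math.log(mu), lambda lp: math.exp(lp)),
    "logit": (
        lambda p: math.log(p / (1.0 - p)),
        lambda lp: 1.0 / (1.0 + math.exp(-lp)),
    ),
}


def to_linear_scale(link_name, mean):
    """Apply the link: move the mean onto the scale where LP = alpha + beta*X."""
    link, _ = LINKS[link_name]
    return link(mean)


def to_mean_scale(link_name, linear_predictor):
    """Apply the inverse link: move a linear predictor back to the mean."""
    _, inverse_link = LINKS[link_name]
    return inverse_link(linear_predictor)


# The same mean (0.2) lands at a different place on each linear scale,
# and the inverse link always brings it back.
for name in LINKS:
    lp = to_linear_scale(name, 0.2)
    print(name, round(lp, 4), round(to_mean_scale(name, lp), 4))
```

The only thing that changes between links is the scale on which the straight line is drawn; the inverse link always returns to the original scale of the mean.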

2. Identity Link: Difference (−)

When we use the identity link, we do nothing to the outcome.

  • The mean of Y is modeled directly.

  • The effect is additive.

  • Interpretation is based on subtraction.

For two groups or two values of X:

Effect = μ1 − μ0

This is why the identity link is used for:

  • Mean difference

  • Risk difference

Intuition

The relationship is already linear on the original scale, so no transformation is needed. The straight-line model works naturally.
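A small worked example may help here. The sketch below assumes a single binary X (0 = control, 1 = treated) and uses made-up group means; with the identity link, α is simply the mean in the reference group and β is the difference between groups.

```python
# Hypothetical group means (e.g. systolic blood pressure in mmHg).
# These numbers are illustrative only, not data from any study.
mean_control = 140.0  # X = 0
mean_treated = 132.0  # X = 1

# Identity link: the mean itself is the linear predictor.
alpha = mean_control                # mean when X = 0
beta = mean_treated - mean_control  # mean difference


def predicted_mean(x):
    """E[Y | X = x] under the identity link: alpha + beta * x."""
    return alpha + beta * x


print("mean difference:", beta)
print("predicted means:", predicted_mean(0), predicted_mean(1))
```

The same arithmetic applies to a risk difference: replace the two means with two risks and β becomes the difference in risk.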

3. Log Link (ln): Ratio (÷) via Log Properties

When the scientific question is about ratios rather than differences, the identity link no longer works. Risks, rates, and counts usually change multiplicatively, not additively.

This is where the log link (natural log, ln) is used.

On the log scale, the model becomes linear:

ln(E[Y]) = α + βX

For two groups:

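ln(μ1) − ln(μ0)

Using the property of logarithms:

ln(μ1) − ln(μ0) = ln(μ1 / μ0)

So subtraction on the log scale corresponds to division on the original scale.

After transforming back (exponentiating), with β the coefficient comparing the two groups:

exp(β) = μ1 / μ0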

This is why the log link is used for:

  • Risk ratio

  • Rate ratio

  • Mean ratio

Intuition

Ratios are linear only after taking ln. The log link converts multiplicative real-world behavior into additive linear behavior.
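To make the log-scale arithmetic concrete, here is a short sketch with two made-up risks and a binary X. It assumes nothing beyond the log link itself: the coefficient is a difference of logs, and exponentiating it recovers the ratio.

```python
import math

# Hypothetical risks in two groups; illustrative numbers only.
risk_control = 0.10  # X = 0
risk_treated = 0.15  # X = 1

# Log link: ln(E[Y]) = alpha + beta * X
alpha = math.log(risk_control)
beta = math.log(risk_treated) - math.log(risk_control)  # subtraction on the log scale

risk_ratio = math.exp(beta)  # division on the original scale

print("beta (log scale):", round(beta, 4))
print("risk ratio:", round(risk_ratio, 4))
print("back-transformed risk for X = 1:", round(math.exp(alpha + beta), 4))
```

Adding β on the log scale is the same as multiplying the baseline risk by exp(β) on the original scale.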

4. Logit Link: Log Odds

Probabilities are bounded between 0 and 1 and follow an S-shaped (sigmoid) pattern, which is not linear.

The logit link transforms probability into log odds:

logit(p) = ln(p / (1 − p))

On the log-odds scale:

  • The range becomes −∞ to +∞

  • The relationship with X becomes linear

  • Subtraction of log odds corresponds to division of odds, so exp(β) is an odds ratio

This is why the logit link is used for:

  • Odds ratios in logistic regression
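The sketch below illustrates these properties with two made-up risks. The helper names `logit` and `inverse_logit` are defined here for the example; the point is only that a difference on the log-odds scale exponentiates to an odds ratio, and that any linear predictor maps back into the interval (0, 1).

```python
import math


def logit(p):
    """Probability -> log odds."""
    return math.log(p / (1.0 - p))


def inverse_logit(lp):
    """Log odds -> probability (the sigmoid curve)."""
    return 1.0 / (1.0 + math.exp(-lp))


# Hypothetical risks in two groups; illustrative numbers only.
risk_control = 0.10  # X = 0
risk_treated = 0.15  # X = 1

beta = logit(risk_treated) - logit(risk_control)  # difference in log odds
odds_ratio = math.exp(beta)

# The same odds ratio computed directly from the odds.
odds_ratio_direct = (risk_treated / (1 - risk_treated)) / (
    risk_control / (1 - risk_control)
)

print("odds ratio via logit:", round(odds_ratio, 4))
print("odds ratio directly: ", round(odds_ratio_direct, 4))

# Linear predictors of any sign map back to valid probabilities.
for lp in (-5.0, 0.0, 5.0):
    print(lp, round(inverse_logit(lp), 4))
```

Note that the odds ratio here is larger than the risk ratio for the same two risks (0.15 / 0.10 = 1.5), which is why the choice between the log and logit links changes the effect measure being reported.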


5. Why All of This Is Necessary

All three links—identity, log, and logit—exist for the same fundamental reason:

We must transform real-world outcomes into a linear form so that we can construct a linear predictor (LP).

Once the linear predictor exists: LP = α + βX

Then—and only then—can we estimate coefficients, test hypotheses, and interpret effects.

6. The Family: Where the Distribution Comes In

Once the link function has done its job—turning the real-world outcome into a linear predictor—we are still missing one essential part of the model:

Real data never lie exactly on the linear predictor. Observed Y values always vary around the expected value.

This variability is not random chaos. It follows recognizable statistical patterns. The family specifies which probability distribution describes that variability.

In other words:

The link defines the mean structure. The family defines the distribution of Y around that mean.

7. Why We Need a Family at All

In theory, the linear predictor (mapped back through the inverse link) gives us the expected value of Y for a given X:

E[Y | X]

But in practice, for the same X value, we observe many different Y values.

Example:

  • Same treatment

  • Same covariates

  • Different patients → Different outcomes

This spread of values is what creates a distribution.

The family answers the question:

“Given the mean predicted by the linear predictor, how does Y behave?”

8. Families Reflect How Outcomes Behave in the Real World

Clinical outcomes are not arbitrary. Over decades, we have learned that they fall into a small number of natural outcome types, each with its own distribution.

Gaussian (Normal) Family

Used when Y is continuous and symmetric.

  • Blood pressure

  • Weight

  • Lab values

Here:

  • Y can take any real value

  • Errors are roughly symmetric around the mean

  • Variability is described by a standard deviation

This is why continuous outcomes are modeled with a Gaussian family.

Binomial Family

Used when Y is binary (0/1, yes/no), so that the quantity of interest is a risk or an odds.

  • Disease vs no disease

  • Death vs survival

Here:

  • Each observation is an event or a non-event

  • The mean represents a probability

  • Variability follows a Bernoulli / binomial process

This is why risks, odds, and probabilities use the binomial family.

Poisson Family

Used when Y is a count or rate.

  • Number of events

  • Incidence per person-time

Here:

  • Y takes non-negative integers (0, 1, 2, …)

  • Variability increases with the mean (for a Poisson variable, the variance equals the mean)

  • Events are assumed to occur independently over time

This is why event counts and incidence rates use the Poisson family.
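For rates, the count is usually modeled together with the person-time at risk. The sketch below uses made-up event counts and assumes the common approach of adding ln(person-time) to the linear predictor as an offset; the offset is not described in the text above, so treat it as an assumption of this example.

```python
import math

# Hypothetical events and person-years in two groups; illustrative only.
events_control, person_time_control = 30, 1000.0  # X = 0
events_treated, person_time_treated = 45, 900.0   # X = 1

rate_control = events_control / person_time_control
rate_treated = events_treated / person_time_treated

# Log link on the rate: ln(rate) = alpha + beta * X
alpha = math.log(rate_control)
beta = math.log(rate_treated) - math.log(rate_control)
incidence_rate_ratio = math.exp(beta)


def expected_count(x, person_time):
    """Expected events, with ln(person_time) entering as an offset (assumed)."""
    return math.exp(math.log(person_time) + alpha + beta * x)


print("IRR:", round(incidence_rate_ratio, 4))
print("expected events, treated group:", round(expected_count(1, 900.0), 2))
```

Because the offset has a fixed coefficient of 1, the model describes events per unit of person-time, and exp(β) is read as an incidence rate ratio rather than a ratio of raw counts.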

Negative Binomial Family

Used when the count data are more variable than Poisson allows.

  • Same outcome type as Poisson

  • But the variance is larger than the mean

This family exists because real clinical data are often messier than theory.
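One compact way to compare the four families is through how each ties the variance of Y to its mean. The sketch below writes these relationships as functions. The negative binomial form shown (variance = μ + dispersion × μ²) is one common parameterization, and the function and parameter names are chosen for this example.

```python
def gaussian_variance(mean, sigma_squared):
    """Gaussian: variance is constant and does not depend on the mean."""
    return sigma_squared


def binomial_variance(mean):
    """Bernoulli / binomial (per observation): variance = p * (1 - p)."""
    return mean * (1.0 - mean)


def poisson_variance(mean):
    """Poisson: variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial (one common form): variance = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


mean_count = 4.0
print("Poisson variance:", poisson_variance(mean_count))
print("Negative binomial variance:", negative_binomial_variance(mean_count, 0.5))
print("Binomial variance at p = 0.5:", binomial_variance(0.5))
```

With a dispersion of 0 the negative binomial variance collapses to the Poisson variance, which is why it is described as the same outcome type with extra variability allowed.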

9. Family Is Not About “This Dataset Only”

One point here is especially important:

The family is not chosen just because of the current dataset.

Instead:

  • It reflects how this type of outcome behaves in general

  • Across studies

  • Across populations

  • Across repeated measurements

So we do not say:

“My data look Gaussian, so I choose Gaussian.”

We say:

“This outcome is continuous by nature, so Gaussian is appropriate.”

This is a conceptual decision, not a cosmetic one.

10. Putting Link and Family Together

Now everything connects:

  1. Start with the real-world outcome

    • Continuous → Gaussian

    • Binary → Binomial

    • Count / rate → Poisson / Negative binomial

  2. Choose the scientific effect

    • Difference → Identity link

    • Ratio → Log link (ln)

    • Odds → Logit link

  3. Build the linear predictor

     link(E[Y]) = α + βX

  4. Use the family to describe how Y varies around that mean
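The steps above can be sketched as a small lookup. The dictionary keys and the function name `choose_family_and_link` are hypothetical labels invented for this example; the mapping itself simply restates the outcome → family and effect → link rules from this post.

```python
# Step 1: outcome type -> family
FAMILY_BY_OUTCOME = {
    "continuous": "Gaussian",
    "binary": "Binomial",
    "count": "Poisson",
    "rate": "Poisson",
    "overdispersed count": "Negative Binomial",
}

# Step 2: scientific effect -> link
LINK_BY_EFFECT = {
    "difference": "identity",
    "ratio": "log",
    "odds": "logit",
}


def choose_family_and_link(outcome, effect):
    """Return (family, link) for an outcome type and a desired effect measure."""
    if outcome not in FAMILY_BY_OUTCOME:
        raise ValueError(f"unknown outcome type: {outcome!r}")
    if effect not in LINK_BY_EFFECT:
        raise ValueError(f"unknown effect measure: {effect!r}")
    if effect == "odds" and outcome != "binary":
        # Odds are only defined for event / no-event outcomes.
        raise ValueError("the logit link applies to binary outcomes only")
    return FAMILY_BY_OUTCOME[outcome], LINK_BY_EFFECT[effect]


print(choose_family_and_link("continuous", "difference"))
print(choose_family_and_link("binary", "ratio"))
print(choose_family_and_link("rate", "ratio"))
```

A lookup like this covers only the conceptual choice; practical fallbacks such as the Poisson working model with robust standard errors for risk ratios still need to be decided case by case.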


Summary

The link function transforms the outcome so that reality becomes linear and can be written as a linear predictor. The family specifies the probability distribution that governs how observed outcomes vary around that linear mean. Together, link + family define the full statistical model.
