Choosing Link Functions and Families (Family) in GLMs: Outcome-Based
- Mayta

- Dec 12, 2025
Outcome → Family → Link (Core GLM Mapping)
Outcome (Y) | What Y Is in the Real World | Family (Distribution) | Link | Effect Measure (Interpretation) |
Continuous | Numerical, symmetric (e.g. BP, lab values) | Gaussian | Identity | Mean Difference |
Continuous | Positive, right-skewed (ratio scale) | Gaussian (working model) | Log (ln) | Mean Ratio |
Binary (0/1) | Event vs no event | Binomial | Identity | Risk Difference |
Binary (0/1) | Event vs no event | Binomial | Log (ln) | Risk Ratio |
Binary (0/1) | Event vs no event | Binomial | Logit | Odds Ratio |
Count | Number of events | Poisson | Log (ln) | Count Ratio / IRR |
Rate | Events per person-time | Poisson | Log (ln) | Incidence Rate Ratio |
Count / Rate | Additive event change | Poisson | Identity | Count / Rate Difference |
Count (overdispersed) | Variance > mean | Negative Binomial | Log (ln) | IRR (robust to overdispersion) |
Binary (RR fallback) | Log-binomial fails | Poisson (working) | Log (ln) | Risk Ratio (robust SEs) |
How to Read This Table (One-Line Rule)
Outcome type determines the Family. Scientific question (difference vs ratio vs odds) determines the Link.
Mental Shortcut
Difference → Identity link (−)
Ratio → Log link (ln → ÷)
Odds → Logit link (log-odds)
Continuous → Gaussian
Binary → Binomial
Count / Rate → Poisson (or Negative Binomial if overdispersed)
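The shortcut above can be written as a small lookup. This is only an illustrative sketch: the dictionary keys and the function name `choose_family_and_link` are hypothetical labels, not part of any statistics package.

```python
# Hypothetical lookup mirroring the mapping above:
# the outcome type fixes the family, the effect measure fixes the link.
FAMILY_BY_OUTCOME = {
    "continuous": "Gaussian",
    "binary": "Binomial",
    "count": "Poisson",
    "rate": "Poisson",
    "count_overdispersed": "Negative Binomial",
}

LINK_BY_EFFECT = {
    "difference": "identity",
    "ratio": "log",
    "odds": "logit",
}


def choose_family_and_link(outcome: str, effect: str) -> tuple[str, str]:
    """Return (family, link) for an outcome type and an effect measure."""
    if effect == "odds" and outcome != "binary":
        raise ValueError("Odds (logit link) only make sense for binary outcomes.")
    return FAMILY_BY_OUTCOME[outcome], LINK_BY_EFFECT[effect]


print(choose_family_and_link("binary", "ratio"))           # ('Binomial', 'log')
print(choose_family_and_link("count", "ratio"))            # ('Poisson', 'log')
print(choose_family_and_link("continuous", "difference"))  # ('Gaussian', 'identity')
```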

In generalized linear models (GLM) and generalized estimating equations (GEE), the link function is the first and most important conceptual step. Before thinking about distributions or variability, we must first answer a simple question:
What do we need to do to the outcome (Y) so that it can be expressed as a linear formula?
This is because regression models rely on a linear predictor (LP) of the form:
LP = α + βX
Real-world clinical data—means, risks, rates, probabilities—are rarely linear by nature. The link function is what transforms the outcome (or its mean) so that this linear structure becomes possible.
1. The Role of the Link Function
The link function acts on Y (more precisely, on the expected value of Y) to make the relationship between X and Y compatible with a straight-line equation.
Conceptually:
The link tells us what mathematical operation we must apply to Y so that the formula fits the data.
Different scientific questions require different links because they imply different types of effects.
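As a rough sketch of what "acting on the expected value" means, the three links and their inverses can be written as plain functions. The coefficient values below are made up for illustration; the point is only that the same linear predictor maps back to a different kind of mean under each link.

```python
import math

# Link functions: take the mean of Y to the linear-predictor scale.
def identity_link(mu: float) -> float:
    return mu

def log_link(mu: float) -> float:
    return math.log(mu)

def logit_link(p: float) -> float:
    return math.log(p / (1 - p))

# Inverse links: take the linear predictor back to the mean of Y.
def inv_identity(lp: float) -> float:
    return lp

def inv_log(lp: float) -> float:
    return math.exp(lp)

def inv_logit(lp: float) -> float:
    return 1 / (1 + math.exp(-lp))

# Hypothetical coefficients, chosen only for illustration.
alpha, beta, x = -2.0, 0.5, 1
lp = alpha + beta * x  # LP = -1.5

print(inv_identity(lp))        # -1.5    (any real value)
print(round(inv_log(lp), 4))   # 0.2231  (always positive)
print(round(inv_logit(lp), 4)) # 0.1824  (always between 0 and 1)
```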
2. Identity Link: Difference (−)
When we use the identity link, we do nothing to the outcome.
The mean of Y is modeled directly.
The effect is additive.
Interpretation is based on subtraction.
For two groups or two values of X:
Effect = μ1 − μ0
This is why the identity link is used for:
Mean difference
Risk difference
Intuition
The relationship is already linear on the original scale, so no transformation is needed. The straight-line model works naturally.
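A small worked example, using invented numbers: with a single binary X and the identity link, the intercept α is the mean in the X = 0 group and β is the difference between group means. The same subtraction applied to two risks gives a risk difference.

```python
from statistics import mean

# Hypothetical systolic blood pressure values (mmHg), for illustration only.
control = [140.0, 150.0, 145.0, 155.0]   # X = 0
treated = [130.0, 135.0, 140.0, 135.0]   # X = 1

alpha = mean(control)          # mean of Y when X = 0
beta = mean(treated) - alpha   # mean difference: mu1 - mu0
print(alpha, beta)             # 147.5 -12.5

# The same arithmetic on risks gives a risk difference (hypothetical counts).
risk_unexposed = 15 / 100
risk_exposed = 30 / 100
risk_difference = risk_exposed - risk_unexposed
print(round(risk_difference, 2))  # 0.15
```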
3. Log Link (ln): Ratio (÷) via Log Properties
When the scientific question is about ratios rather than differences, the identity link is no longer the natural choice. Risks, rates, and counts usually change multiplicatively, not additively.
This is where the log link (natural log, ln) is used.
On the log scale, the model becomes linear:
ln(E[Y]) = α + βX
For two groups (X = 1 vs X = 0):
ln(μ1) − ln(μ0) = (α + β) − α = β
Using the property of logarithms:
ln(μ1) − ln(μ0) = ln(μ1 / μ0)
So subtraction on the log scale corresponds to division on the original scale.
After transforming back:
exp(β) = μ1 / μ0
This is why the log link is used for:
Risk ratio
Rate ratio
Mean ratio
Intuition
Ratios are linear only after taking ln. The log link converts multiplicative real-world behavior into additive linear behavior.
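The log-scale subtraction can be checked numerically. The counts and person-time below are invented, and `log_scale_effect` is a hypothetical helper name; the sketch only shows that a difference of logs, once exponentiated, is a ratio.

```python
import math


def log_scale_effect(mu1: float, mu0: float) -> tuple[float, float]:
    """Return (beta, ratio): beta is the difference on the log scale,
    and exp(beta) is the ratio on the original scale."""
    beta = math.log(mu1) - math.log(mu0)
    return beta, math.exp(beta)


# Risk ratio: 30/100 events in the exposed vs 15/100 in the unexposed.
beta_rr, risk_ratio = log_scale_effect(30 / 100, 15 / 100)
print(round(beta_rr, 4), round(risk_ratio, 4))  # 0.6931 2.0

# Rate ratio: 12 events in 400 person-years vs 6 events in 500 person-years.
beta_irr, rate_ratio = log_scale_effect(12 / 400, 6 / 500)
print(round(rate_ratio, 4))  # 2.5
```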
4. Logit Link: Log Odds
Probabilities are bounded between 0 and 1, and their relationship with X is typically S-shaped (sigmoid) rather than linear.
The logit link transforms probability into log odds:
logit(p) = ln(p / (1 − p)), where p = E[Y]
On the log-odds scale:
The range becomes −∞ to +∞
The relationship with X becomes linear
Subtraction gives the log odds ratio, and exponentiating it gives the odds ratio
This is why the logit link is used for:
Odds ratios in logistic regression
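As a quick numerical check with invented risks, subtracting two logits and exponentiating the result reproduces the odds ratio computed directly from the odds.

```python
import math


def logit(p: float) -> float:
    """Probability -> log odds."""
    return math.log(p / (1 - p))


def inv_logit(eta: float) -> float:
    """Log odds -> probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical risks in the exposed (p1) and unexposed (p0) groups.
p1, p0 = 0.30, 0.15

beta = logit(p1) - logit(p0)            # difference on the log-odds scale
odds_ratio = math.exp(beta)             # back-transformed effect
direct = (p1 / (1 - p1)) / (p0 / (1 - p0))

print(round(odds_ratio, 4))             # 2.4286
print(round(direct, 4))                 # 2.4286
print(round(inv_logit(logit(p1)), 4))   # 0.3
```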
5. Why All of This Is Necessary
All three links—identity, log, and logit—exist for the same fundamental reason:
We must transform real-world outcomes into a linear form so that we can construct a linear predictor (LP).
Once the linear predictor exists: LP = α + βX
Then—and only then—can we estimate coefficients, test hypotheses, and interpret effects.
6. The Family: Where the Distribution Comes In
Once the link function has done its job—turning the real-world outcome into a linear predictor—we are still missing one essential part of the model:
Real data never lie exactly on the linear predictor. Observed Y values always vary around the expected value.
This variability is not random chaos. It follows recognizable statistical patterns. The family specifies which probability distribution describes that variability.
In other words:
The link defines the mean structure. The family defines the distribution of Y around that mean.
7. Why We Need a Family at All
In theory, the linear predictor gives us the expected value (after inverting the link):
E[Y | X]
But in practice, for the same X value, we observe many different Y values.
Example:
Same treatment
Same covariates
Different patients → Different outcomes
This spread of values is what creates a distribution.
The family answers the question:
“Given the mean predicted by the linear predictor, how does Y behave?”
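A simulation makes this concrete. The sketch below assumes a fixed mean for a continuous outcome and a fixed risk for a binary outcome (both values invented, as is the standard deviation of 10) and draws many patients who share that same expected value.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

mean_bp = 135.0  # hypothetical E[Y | X] for a continuous outcome (mmHg)
risk = 0.30      # hypothetical E[Y | X] for a binary outcome

# Gaussian family: values scatter symmetrically around the mean (SD assumed 10).
bp_values = [random.gauss(mean_bp, 10.0) for _ in range(5000)]

# Binomial family: each patient is 0 or 1, and the mean is a probability.
events = [1 if random.random() < risk else 0 for _ in range(5000)]

print(round(sum(bp_values) / len(bp_values), 1))  # close to 135
print(sum(events) / len(events))                  # close to 0.30
print(sorted(set(events)))                        # [0, 1]
```

The averages land near the shared mean, but individual values differ, and the way they differ depends on the family.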
8. Families Reflect How Outcomes Behave in the Real World
Clinical outcomes are not arbitrary. Over decades, we have learned that they fall into a small number of natural outcome types, each with its own distribution.
Gaussian (Normal) Family
Used when Y is continuous and symmetric.
Blood pressure
Weight
Lab values
Here:
Y can take any real value
Errors are roughly symmetric around the mean
Variability is described by a standard deviation
This is why continuous outcomes are modeled with a Gaussian family.
Binomial Family
Used when Y is binary (0/1, yes/no) and the quantity of interest is a risk or an odds.
Disease vs no disease
Death vs survival
Here:
Each observation is an event or a non-event
The mean represents a probability
Variability follows a Bernoulli / binomial process
This is why risks, odds, and probabilities use the binomial family.
Poisson Family
Used when Y is a count or rate.
Number of events
Incidence per person-time
Here:
Y takes non-negative integers (0, 1, 2, …)
Variability increases with the mean
Events are assumed to occur independently over time
This is why event counts and incidence rates use the Poisson family.
Negative Binomial Family
Used when the count data are more variable than Poisson allows.
Same outcome type as Poisson
But the variance is larger than the mean
This family exists because real clinical data are often messier than theory.
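One compact way to compare the four families is through how the variance depends on the mean. The functions below are a sketch of the usual variance functions; the negative binomial is written in the common "NB2" form with a dispersion parameter `alpha`, which is an assumption made here since the text does not fix a parameterization.

```python
def gaussian_variance(mu: float, sigma2: float = 1.0) -> float:
    """Gaussian: variance is a constant (sigma squared), unrelated to the mean."""
    return sigma2


def binomial_variance(mu: float) -> float:
    """Binary outcome: variance is mu * (1 - mu), largest at mu = 0.5."""
    return mu * (1 - mu)


def poisson_variance(mu: float) -> float:
    """Poisson: variance equals the mean."""
    return mu


def negative_binomial_variance(mu: float, alpha: float) -> float:
    """Negative binomial (NB2 form, assumed): mu + alpha * mu^2 >= mu."""
    return mu + alpha * mu ** 2


for mu in (1.0, 4.0, 10.0):
    print(mu, poisson_variance(mu), negative_binomial_variance(mu, alpha=0.5))
# 1.0 1.0 1.5
# 4.0 4.0 12.0
# 10.0 10.0 60.0
```

With `alpha = 0` the negative binomial variance reduces to the Poisson variance, which is why it is described as the same outcome type with extra variability.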
9. Family Is Not About “This Dataset Only”
One point is especially important:
The family is not chosen just because of the current dataset.
Instead:
It reflects how this type of outcome behaves in general
Across studies
Across populations
Across repeated measurements
So we do not say:
“My data look Gaussian, so I choose Gaussian.”
We say:
“This outcome is continuous by nature, so Gaussian is appropriate.”
This is a conceptual decision, not a cosmetic one.
10. Putting Link and Family Together
Now everything connects:
1. Start with the real-world outcome
Continuous → Gaussian
Binary → Binomial
Count / rate → Poisson / Negative binomial
2. Choose the scientific effect
Difference → Identity link
Ratio → Log link (ln)
Odds → Logit link
3. Build the linear predictor
link(E[Y]) = α + βX
4. Use the family to describe how Y varies around that mean
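To see the steps work end to end, here is a minimal sketch that fits a binomial-family model to the same invented binary data under each of the three links. It uses iteratively reweighted least squares (IRLS), a standard fitting algorithm for GLMs, written out by hand for one covariate. The function name `fit_binomial_glm` and the data are hypothetical, and a real analysis would rely on an established statistics package rather than this code.

```python
import math

# Each link: (g, inverse of g, d mu / d eta expressed as a function of mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta, lambda mu: 1.0),
    "log": (math.log, math.exp, lambda mu: mu),
    "logit": (
        lambda mu: math.log(mu / (1 - mu)),
        lambda eta: 1 / (1 + math.exp(-eta)),
        lambda mu: mu * (1 - mu),
    ),
}


def fit_binomial_glm(x, y, link, n_iter=50):
    """Fit link(E[Y]) = alpha + beta * x by IRLS with binomial variance."""
    g, g_inv, dmu_deta = LINKS[link]
    y_bar = sum(y) / len(y)
    # Start from means pulled toward the overall risk so that 0 < mu < 1.
    eta = [g((yi + y_bar) / 2) for yi in y]
    alpha = beta = 0.0
    for _ in range(n_iter):
        mu = [g_inv(e) for e in eta]
        d = [dmu_deta(m) for m in mu]
        # Working response and weights; the family enters through mu * (1 - mu).
        z = [e + (yi - m) / di for e, yi, m, di in zip(eta, y, mu, d)]
        w = [di ** 2 / (m * (1 - m)) for di, m in zip(d, mu)]
        # Weighted least squares of z on (1, x), solved in closed form.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx ** 2
        beta = (sw * swxz - swx * swz) / det
        alpha = (swxx * swz - swx * swxz) / det
        eta = [alpha + beta * xi for xi in x]
    return alpha, beta


# Hypothetical data: 30/100 events when X = 1, 15/100 events when X = 0.
x = [1] * 100 + [0] * 100
y = [1] * 30 + [0] * 70 + [1] * 15 + [0] * 85

_, b_identity = fit_binomial_glm(x, y, "identity")
_, b_log = fit_binomial_glm(x, y, "log")
_, b_logit = fit_binomial_glm(x, y, "logit")

print(round(b_identity, 4))         # 0.15   -> risk difference
print(round(math.exp(b_log), 4))    # 2.0    -> risk ratio
print(round(math.exp(b_logit), 4))  # 2.4286 -> odds ratio
```

The data and the family never change; only the link does, and with it the effect measure that β represents.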
Summary
The link function transforms the outcome so that reality becomes linear and can be written as a linear predictor. The family specifies the probability distribution that governs how observed outcomes vary around that linear mean. Together, link + family define the full statistical model.





