Choosing the Right Generalized Linear Models (GLMs) in Stata: A DEPTh-Based Guide

Mayta
Jun 11
Updated: Aug 19

Outcome Type (Y)	GLM Family	Link Function	Common X Type	Effect Estimate	Assumption About Normality in Y
Continuous	gaussian	identity	Continuous / Categorical	Mean difference	• Normality applies to residuals (errors of Y given X), not Y itself • Residuals ≈ normal distribution • Residuals ≈ constant variance (homoskedasticity)
Binary (0/1)	binomial	logit	Continuous / Categorical	Odds ratio	• No normality assumption • Assumes linearity of X in the logit of Y
Binary (0/1)	binomial	log	Continuous / Categorical	Risk ratio	• No normality assumption • Assumes correct binomial variance
Binary (0/1)	poisson	log	Continuous / Categorical	Risk ratio (robust)	• No normality assumption • Assumes Poisson mean–variance, but robust SEs relax this

✅ What must be (approximately) normal?

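Residuals (ε = Y − Ŷ), and only for the Gaussian family
- They should follow a normal distribution with mean 0 and constant variance (homoskedasticity).
- This is what makes confidence intervals and p-values reliable in small samples; in large samples, mild non-normality matters much less.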

❌ What does not need to be normal?

Predictors (X): Can be skewed, categorical, or binary. No normality assumption.
Outcome (Y) itself: Does not need to be normal, only its residuals after fitting the model.

📌 Example in Stata (Linear GLM)

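glm sbp age bmi, family(gaussian) link(identity)
predict resid, response          // raw residuals: observed minus fitted
predict fitted, mu               // fitted mean of sbp
histogram resid, normal          // check normality
scatter resid fitted, yline(0)   // check homoskedasticity (rvfplot works only after regress)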
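
As a cross-check, the same Gaussian/identity model can be fit with regress, which unlocks Stata's built-in residual diagnostics. This is a minimal sketch, assuming the nhanes2 example dataset available through webuse; bpsystol, age, and bmi stand in for sbp, age, and bmi above and are not the data behind this post.

```stata
* Sketch only: nhanes2 is a Stata example dataset used as a stand-in
webuse nhanes2, clear

* OLS gives the same point estimates as glm with family(gaussian) link(identity)
regress bpsystol age bmi

rvfplot, yline(0)      // residuals vs fitted: look for a fan or funnel shape
estat hettest          // Breusch-Pagan test for non-constant variance

predict double res, residuals
qnorm res              // normal quantile plot of the residuals
```

With a sample this large, formal tests can flag trivial departures, so the plots are usually the more informative check.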

📌 A simple comparison

Y (the actual outcome) → the raw data we measured, e.g., blood pressure, weight, died/survived → may be skewed, contain outliers, or be non-normal → not a problem
Residuals (Y − Ŷ) → what is left over after the model has explained what it can → should be ≈ normal (Gaussian/linear regression only) → used to check the model's assumptions

✳️ Why This Matters

If you’ve typed or seen something like this in Stata:

stata: glm y x, fam(bin) link(log)

...and felt unsure what it really means—you’re not alone.

This tiny line holds powerful logic for clinical research. It tells Stata:

“Model the chance of an outcome (y) depending on exposure (x), assuming the outcome behaves like a binomial (yes/no) event, and relate them through a logarithmic scale.”

🧠 The Big Idea

The command structure is:

stata: glm <outcome> <explanatory variables>, fam(<distribution>) link(<scale>)

Each part has meaning:

Part	Stata Syntax Example	What It Says in Plain English
glm	glm	Use a generalized linear model
<outcome>	dead	The outcome variable (e.g., died or survived)
<explanatory variables>	treatment	The predictor/exposure (e.g., Drug A vs B)
fam(bin)	fam(bin)	Outcome is binary (0/1)
link(log)	link(log)	Use a logarithmic scale for modeling the risk

🔍 The “Family”: What Is fam()?

The fam() option tells Stata which distribution to assume for your outcome variable, which in practice follows from what type of data it is:

Family (fam)	Use for…	Clinical Examples
binomial	Yes/No outcomes	Survived/Died, Cured/Not, HIV+/–
gaussian	Continuous outcomes	BP, Weight, Lab values
poisson	Count outcomes	ER visits, Infections, Seizures
gamma	Skewed positive continuous	Hospital cost, Length of stay

📌 Think: "What does my outcome variable look like?"
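
The gamma row is the only family in this table that never appears in a command later in the post, so here is a minimal sketch of it. Everything in it is an assumption for illustration: the data are simulated, the variable names cost and treat are hypothetical, and the log link is one common choice for skewed positive outcomes rather than something this guide prescribes.

```stata
* Sketch only: simulated, right-skewed "cost" data (not real hospital costs)
clear
set seed 12345
set obs 500
generate treat = runiform() < 0.5               // hypothetical 0/1 treatment flag
generate cost  = rgamma(2, 500) * exp(0.3*treat)

* Gamma family with a log link; eform shows the ratio of mean costs
glm cost treat, family(gamma) link(log) eform
```

With eform, the reported value for treat reads as "mean cost in the treated group divided by mean cost in the untreated group".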

🔗 The “Link”: What Is link()?

The link() option tells Stata how to mathematically connect your predictors (x) to the expected value of your outcome (y), that is, the scale on which effects add up:

Link Function	What It Models	Use When You Want…
logit	Log-odds	Odds Ratio (OR)
log	Log-risk (or log-rate)	Risk Ratio (RR), Incidence Rate Ratio (IRR)
identity	The mean or risk itself (no transformation)	Risk Difference (RD), mean difference

📌 Think: "What do I want to report to clinicians or policymakers?"
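
The three links in the table can be written out as equations. In this sketch, μ denotes the expected value of y given x (the risk, for a binary outcome), and β0 and β1 are the coefficients Stata estimates; the symbols are notation introduced here for illustration, not Stata output.

```latex
\begin{aligned}
\text{logit:}    &\quad \log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 x, &\quad e^{\beta_1} &= \text{odds ratio} \\
\text{log:}      &\quad \log \mu = \beta_0 + \beta_1 x,              &\quad e^{\beta_1} &= \text{risk (or rate) ratio} \\
\text{identity:} &\quad \mu = \beta_0 + \beta_1 x,                   &\quad \beta_1 &= \text{risk or mean difference}
\end{aligned}
```

This is why the ratio measures come from exponentiating a coefficient, while the identity link's coefficient is read directly as a difference.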

🧪 Common Stata GLM Combos for Clinical Research

Research Goal	Use This GLM Syntax	Interprets Output As...
Estimate Odds Ratio	glm y x, fam(bin) link(logit)	Odds ratio (good for case-control)
Estimate Risk Ratio	glm y x, fam(bin) link(log)	Risk ratio (cohort/RCTs)
Estimate Risk Difference	glm y x, fam(bin) link(identity)	Absolute % difference
Compare Means	glm y x, fam(gaussian) link(identity)	Mean difference (like regression)
Estimate IRR (rate ratio)	glm y x, fam(poisson) link(log)	Incidence rate ratio

Note: for the ratio models, add the eform option to display exponentiated coefficients (OR, RR, IRR); without it, Stata reports coefficients on the log scale.
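
To see these combinations side by side, here is a minimal sketch using the lbw example dataset from webuse, where low (low birth weight) and smoke (mother smoked) stand in for y and x; it is not data from this post. The ratio models use eform so the output is already exponentiated. The last command is the robust Poisson alternative from the first table, often used when the log-binomial model fails to converge.

```stata
* Sketch only: lbw is a Stata example dataset standing in for y and x
webuse lbw, clear

* Odds ratio
glm low smoke, family(binomial) link(logit) eform

* Risk ratio (log-binomial)
glm low smoke, family(binomial) link(log) eform

* Risk difference (coefficient is already on the risk scale, so no eform)
glm low smoke, family(binomial) link(identity)

* Risk ratio via Poisson with robust standard errors
glm low smoke, family(poisson) link(log) vce(robust) eform
```

With a single binary predictor, the log-binomial and robust Poisson models give the same risk ratio; they differ only in their standard errors.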

🧭 Mnemonic: "FAMILY is the nature of Y. LINK is how X affects Y."

Family = “What kind of variable is the outcome?” → Binary? Count? Continuous?
Link = “How do we relate exposure to the outcome?” → Ratio? Difference?

🔁 Combine them based on your study question, data structure, and clinical meaning.

🧠 Examples in Words

“I want to know if Drug A reduces mortality compared to Drug B in ICU patients.”
- Outcome: Death (yes/no) → Binary → fam(bin)
- Measure: Risk ratio preferred (not odds) → link(log)
- Use: glm dead drug, fam(bin) link(log)
“How many ER visits do asthma patients have after new inhaler vs old one?”
- Outcome: ER visit count → Count → fam(poisson)
- Compare rates → link(log)
- Use: glm visits inhaler, fam(poisson) link(log)
“Does the new diet change average HbA1c levels?”
- Outcome: HbA1c (numeric) → Continuous → fam(gaussian)
- Want mean difference → link(identity)
- Use: glm a1c diet, fam(gaussian) link(identity)

✅ If the Outcome Is Binary (e.g., died/survived, sick/not sick)

A GLM with a logit or log link fits a "line" or "equation" that describes:

As the value of the independent variable (X) increases → the chance that the outcome equals 1 (the event occurs) goes up or down, depending on the sign of the coefficient

🔸 Example:

glm died age bmi, family(binomial) link(logit)

If age has a positive coefficient → older age = higher chance of death
This model estimates the relationship between X and the chance that Y=1 (not the value of Y directly)
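
To make the "chance, not the value" point concrete, predicted probabilities can be read off with margins after the model is fit. This is a minimal sketch, assuming the lbw example dataset from webuse, with low, age, and lwt standing in for died, age, and bmi; the ages in at() and the 10-year contrast are arbitrary illustration values.

```stata
* Sketch only: lbw variables stand in for died, age, and bmi
webuse lbw, clear

glm low age lwt, family(binomial) link(logit)

* Coefficients are log-odds; this gives the odds ratio for a 10-year age difference
lincom 10*age, eform

* Average predicted probability that low == 1 at selected ages
margins, at(age=(20 30 40))
```

The margins output is on the probability scale (between 0 and 1), which is usually easier to explain to clinicians than a log-odds coefficient.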

✅ If the Outcome Is Continuous (e.g., blood pressure, weight)

A GLM with an identity link fits a linear equation:

As X increases → the mean of the outcome (Y) goes up or down according to beta

🔸 Example:

glm sbp age bmi, family(gaussian) link(identity)

If bmi has a coefficient of 2.1 → each 1-unit increase in BMI = SBP higher by 2.1 mmHg on average, holding age constant
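
Because the identity link keeps everything on the outcome's own scale, a larger change in X is simply a multiple of the coefficient: with the 2.1 above, a 5-unit BMI increase corresponds to 5 × 2.1 = 10.5 mmHg. In Stata, lincom does this arithmetic and also supplies a confidence interval. A sketch, assuming the nhanes2 example dataset from webuse, where bpsystol stands in for sbp (its estimated coefficient will differ from the illustrative 2.1):

```stata
* Sketch only: nhanes2 is a Stata example dataset used as a stand-in
webuse nhanes2, clear

glm bpsystol age bmi, family(gaussian) link(identity)

* Mean SBP difference for a 5-unit BMI difference, with 95% CI
lincom 5*bmi
```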

🔁 Side-by-side comparison:

Outcome Type	Model	Relationship
Binary	binomial + logit/log	Higher X → higher "chance" that Y=1
Continuous	gaussian + identity	Higher X → higher mean of Y

📌 Summary (as a mental picture)

🔹 Binary: "More X → higher chance of being 1"
🔸 Continuous: "More X → higher value of Y"

✅ Key Takeaways for Clinicians

fam() = What kind of outcome? (binary, continuous, count, skewed)
link() = What measure do you want? (OR, RR, RD, IRR, mean diff)
Don’t default to odds ratio unless that’s your actual goal.
Use log link for intuitive risk ratios in cohort studies and RCTs.
Use identity link for absolute differences, which are useful for policy decisions.

🧪 Practice Challenge

Q: You run an RCT and want to estimate the risk ratio for infection in patients treated with Antibiotic A vs B. Infection is a yes/no outcome.

A: Your syntax is:

stata: glm infected drug, fam(bin) link(log)

This tells Stata:

Outcome = binary
Link = log (for risk ratio)
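
One way to convince yourself that the command returns a risk ratio is to feed it a 2×2 table whose answer you can compute by hand. The counts below are made up for illustration: 30 of 100 patients infected on Antibiotic A (drug = 1) and 60 of 100 on Antibiotic B (drug = 0), so the risk ratio is 0.30 / 0.60 = 0.5. The variable n is a hypothetical frequency column.

```stata
* Hypothetical counts, not trial data
clear
input infected drug n
1 1 30
0 1 70
1 0 60
0 0 40
end

* Frequency weights expand each row to n patients; eform shows the risk ratio
glm infected drug [fweight=n], family(binomial) link(log) eform

* Same risk ratio from the epidemiologic table command:
* exposed cases, unexposed cases, exposed noncases, unexposed noncases
csi 30 60 70 40
```

Both commands should report a risk ratio of 0.5 for these counts, which matches the hand calculation.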