Step-by-Step Guide to Continuous Outcomes and Effect Measures in Network Meta-Analysis (NMA)
- Mayta

0) Frame the question & define the continuous endpoint
What it is: Specify PICO/PICOT and the exact continuous measure (units/scale, timing/visit window, endpoint vs change‑from‑baseline).
Why we do it: Continuous outcomes are scale‑ and time‑sensitive. Clear definitions prevent mixing incompatible measures (e.g., different instruments or visits) and ensure clinical interpretability.
Core focus
Which construct (e.g., FEV₁, pain, HbA1c)?
Units and direction of benefit (higher‑is‑better vs lower‑is‑better).
Timepoint(s) (e.g., 12, 24, 52 weeks) and whether you analyze endpoint or change.
Typical outputs
A protocolized outcome dictionary (construct, unit, visit).
Pre‑declared direction of benefit and primary analysis scale.
1) Choose the effect measure: MD vs SMD (or alternatives)
What it is: Pick one primary effect measure for synthesis:
Mean Difference (MD) when all studies use the same scale (best for interpretability).
Standardized Mean Difference (SMD; Hedges g) when studies use different instruments for the same construct.
Less common: ratio of means (RoM) or MD on the log scale (for skewed measures).
Why we do it: MD keeps clinical units; SMD allows pooling across different scales; the choice determines interpretability and comparability.
Core focus
Prefer MD whenever possible; only use SMD when scales truly differ.
If SMD is used, plan back‑translation to clinical units (e.g., multiply by a representative SD or relate it to the minimal important difference, MID); a computational sketch follows this step.
Typical outputs
Declared primary effect (MD or SMD) with rationale.
A back‑translation plan if using SMD.
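As a concrete illustration of the SMD and its back‑translation, here is a minimal Python sketch; the arm summaries, the 0–100 scale, and the representative SD of 20 points are hypothetical, and a real analysis would take these from the included trials:

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g and its approximate SE from two-arm summary data."""
    # Pooled SD across the two arms
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled                    # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)          # small-sample correction factor
    g = j * d
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return g, j * math.sqrt(var_d)

# Hypothetical arm summaries (active vs control, different pain instruments)
g, se = hedges_g(m1=42.0, sd1=18.0, n1=60, m2=50.0, sd2=20.0, n2=58)
print(f"SMD (Hedges g) = {g:.2f}, SE = {se:.2f}")

# Back-translation: multiply by a representative SD (e.g., 20 points on a 0-100 scale)
representative_sd = 20.0
print(f"Approximate MD on the 0-100 scale = {g * representative_sd:.1f} points")
```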
2) Build analyzable contrasts from arm‑level data
What it is: From each trial arm you have a mean (ȳ), SD, and n. Convert these to study contrasts (e.g., MD = mean₁ − mean₂ with its SE) or use reported least‑squares (LS) mean differences consistently.
Why we do it: Contrast‑based inputs are the common currency for meta‑analysis and NMA; they also handle multi‑arm trials correctly.
Core focus
Endpoint vs change scores: if using change, you may need the pre–post correlation (often imputed) to derive the SD of the change, as in the sketch at the end of this step.
Derive missing SDs from CIs, SEs, or p‑values when needed (document assumptions).
Identify multi‑arm trials and preserve within‑trial correlations (avoid double‑counting shared controls).
Check for skew; consider transformations for heavily skewed measures.
Typical outputs
A “contrast sheet” per comparison: effect (MD or SMD), SE, treatment labels, study ID, timepoint.
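The arithmetic behind a contrast sheet is simple enough to script. The sketch below (hypothetical numbers; the imputed pre–post correlation of 0.5 is an assumption that should be documented and varied in sensitivity analyses) shows the MD and its SE from arm summaries, the SD of a change score, and recovery of an SD from a reported 95% CI:

```python
import math

Z_95 = 1.96  # normal quantile for a 95% CI

def md_from_arms(m1, sd1, n1, m2, sd2, n2):
    """Mean difference and its SE from arm-level means, SDs, and sample sizes."""
    md = m1 - m2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return md, se

def sd_change(sd_pre, sd_post, corr):
    """SD of the change score from baseline/endpoint SDs and a pre-post correlation."""
    return math.sqrt(sd_pre**2 + sd_post**2 - 2 * corr * sd_pre * sd_post)

def sd_from_ci(lower, upper, n):
    """Recover an arm SD from a reported 95% CI for the mean."""
    return (upper - lower) / (2 * Z_95) * math.sqrt(n)

# Hypothetical change-from-baseline trial with an imputed correlation of 0.5
sdc = sd_change(sd_pre=1.1, sd_post=1.3, corr=0.5)
md, se_md = md_from_arms(m1=-0.8, sd1=sdc, n1=120, m2=-0.3, sd2=sdc, n2=118)
print(f"MD = {md:.2f}, SE = {se_md:.2f}")
```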
3) Fit the synthesis model (start with random‑effects)
What it is: Pool contrasts using a random‑effects model (pairwise) or fit a random‑effects NMA for multi‑treatment settings.
Why we do it: Between‑study differences are common for continuous outcomes; random‑effects models estimate τ², the variance of the true effects across studies.
Core focus
τ² estimator (REML or Paule–Mandel preferred over classic DerSimonian–Laird); a Paule–Mandel sketch follows this step.
Choose a reference treatment for presentation (doesn’t change the underlying network).
Check model plausibility and convergence.
Typical outputs
Pooled effects (MD/SMD) vs reference with 95% CIs.
τ² estimate and a forest plot vs reference.
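For intuition, here is a minimal random‑effects pooling sketch using the Paule–Mandel τ² estimator mentioned above; the MDs and SEs are hypothetical, and a real NMA would use an established package and handle multi‑arm correlations rather than this pairwise simplification:

```python
import numpy as np
from scipy.optimize import brentq

def paule_mandel_tau2(y, se):
    """Paule-Mandel tau^2: choose tau^2 so the generalized Q equals k - 1."""
    y, v = np.asarray(y, float), np.asarray(se, float) ** 2
    k = len(y)

    def q_minus_df(tau2):
        w = 1.0 / (v + tau2)
        mu = np.sum(w * y) / np.sum(w)
        return np.sum(w * (y - mu) ** 2) - (k - 1)

    if q_minus_df(0.0) <= 0:          # even tau^2 = 0 over-explains the spread
        return 0.0
    return brentq(q_minus_df, 0.0, 100.0 * np.var(y) + 1.0)

def random_effects_pool(y, se):
    """Inverse-variance random-effects pooled estimate with a 95% CI."""
    tau2 = paule_mandel_tau2(y, se)
    w = 1.0 / (np.asarray(se, float) ** 2 + tau2)
    mu = float(np.sum(w * np.asarray(y, float)) / np.sum(w))
    se_mu = float(np.sqrt(1.0 / np.sum(w)))
    return mu, (mu - 1.96 * se_mu, mu + 1.96 * se_mu), tau2

# Hypothetical MDs and SEs for one pairwise comparison
mds = [-4.2, -2.8, -5.1, -3.0, -1.9]
ses = [ 1.1,  0.9,  1.6,  1.2,  0.8]
mu, ci, tau2 = random_effects_pool(mds, ses)
print(f"Pooled MD = {mu:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}), tau^2 = {tau2:.3f}")
```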
4) Assess heterogeneity (within‑comparison variability)
What it is: Quantify how much effects vary across studies assessing the same contrast.
Why we do it: High heterogeneity weakens a single pooled estimate and suggests effect modifiers (e.g., baseline severity, dosing, visit timing).
Core focus
Cochran’s Q (p‑value), I² (%) (≈25/50/75 = low/moderate/high), and τ² on the analysis scale.
Visual study‑level forest plots for key comparisons.
Pre‑planned exploration (population, co‑therapies, visit windows, risk of bias).
Typical outputs
Q, I², τ²; a narrative on likely drivers; a plan for subgroups or meta‑regression if warranted.
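A small sketch of the heterogeneity statistics, reusing the hypothetical contrasts from the previous step; Q, its p‑value, and I² follow directly from the fixed‑effect weights:

```python
import numpy as np
from scipy.stats import chi2

def heterogeneity_stats(y, se):
    """Cochran's Q with its p-value, and I^2, for one pairwise comparison."""
    y = np.asarray(y, float)
    w = 1.0 / np.asarray(se, float) ** 2
    mu_fixed = np.sum(w * y) / np.sum(w)        # fixed-effect pooled mean
    q = float(np.sum(w * (y - mu_fixed) ** 2))  # Cochran's Q
    df = len(y) - 1
    p = float(chi2.sf(q, df))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, p, i2

q, p, i2 = heterogeneity_stats([-4.2, -2.8, -5.1, -3.0, -1.9],
                               [ 1.1,  0.9,  1.6,  1.2,  0.8])
print(f"Q = {q:.2f} (p = {p:.3f}), I^2 = {i2:.0f}%")
```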
5) Check transitivity & consistency (network validity)
What it is
Transitivity: comparable distributions of effect modifiers across comparisons (e.g., baseline value of the outcome, biomarker status, visit time, co‑interventions).
Consistency: direct and indirect evidence agree.
Why we do it: NMA’s credibility rests on these assumptions; otherwise all‑pairs conclusions and ranks are unreliable.
Core focus
Summarize effect‑modifier distributions across the network (tables/plots).
Global consistency (e.g., the design‑by‑treatment interaction test for incoherence).
Local consistency (node‑splitting: direct vs indirect for specific pairs).
If violated: consider stratified networks (e.g., by timepoint), meta‑regression, or cautious interpretation.
Typical outputs
Transitivity evidence (descriptive balance of modifiers).
Global test results; node‑split results with p‑values and direction of discrepancy.
Documented adjustments or stratified analyses if needed.
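Node‑splitting can be illustrated with a Bucher‑style back‑calculation for a single triangle; the sketch below uses hypothetical estimates and ignores the multivariate structure a full NMA model would bring:

```python
import math
from scipy.stats import norm

def node_split(direct_md, direct_se, md_ac, se_ac, md_bc, se_bc):
    """Compare the direct A-vs-B estimate with the indirect estimate formed
    through a shared comparator C (simple back-calculation for one triangle)."""
    indirect_md = md_ac - md_bc                      # A vs B via C
    indirect_se = math.sqrt(se_ac**2 + se_bc**2)
    diff = direct_md - indirect_md
    se_diff = math.sqrt(direct_se**2 + indirect_se**2)
    p = 2 * norm.sf(abs(diff / se_diff))             # two-sided inconsistency p-value
    return indirect_md, diff, p

# Hypothetical triangle: A vs B measured directly, plus A vs C and B vs C
ind, diff, p = node_split(direct_md=-3.0, direct_se=1.0,
                          md_ac=-4.0, se_ac=0.9, md_bc=-0.5, se_bc=0.8)
print(f"Indirect MD = {ind:.2f}, direct minus indirect = {diff:.2f}, p = {p:.3f}")
```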
6) Rank treatments: SUCRA / P‑score & rankograms (for continuous outcomes)
What it is: Translate the network’s estimates and their uncertainty into a hierarchy:
Rank probabilities (probability of being 1st, 2nd, …, kth).
SUCRA (0–1) or P‑score summarize “how high” a treatment ranks overall.
Rankograms visualize each treatment’s rank probability distribution.
Why we do it: Clinicians need a hierarchy, but we must also show uncertainty, not just a single rank.
Core focus
Set the ranking direction correctly (see the sketch at the end of this step):
If higher is better (e.g., FEV₁ ↑), smaller values are undesirable and should rank lower.
If lower is better (e.g., pain ↓), smaller values are desirable and should rank higher.
Never replace effect sizes and CIs with ranks; small SUCRA gaps are rarely meaningful on their own.
Typical outputs
Table of SUCRA/P‑scores.
Rankograms (and cumulative rankograms) per treatment to show uncertainty.
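A minimal simulation‑based sketch of rank probabilities and SUCRA, assuming hypothetical treatment effects vs placebo on a lower‑is‑better outcome; it resamples from normal approximations and ignores the between‑treatment covariance a full NMA would supply:

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical network estimates vs placebo (MD; lower is better, e.g. pain)
treatments = ["A", "B", "C"]
md_vs_ref = np.array([-4.0, -2.5, -3.5])
se_vs_ref = np.array([ 0.9,  0.7,  1.2])

# Resample effects from their normal approximations (illustration only)
draws = rng.normal(md_vs_ref, se_vs_ref, size=(10_000, len(treatments)))

# Lower is better here, so rank 1 goes to the most negative draw
ranks = draws.argsort(axis=1).argsort(axis=1) + 1
k = len(treatments)
rank_probs = np.stack([(ranks == r).mean(axis=0) for r in range(1, k + 1)], axis=0)

# SUCRA = average of the cumulative rank probabilities over ranks 1..k-1
cum = np.cumsum(rank_probs, axis=0)
sucra = cum[:-1].mean(axis=0)

for t, s in zip(treatments, sucra):
    print(f"{t}: SUCRA = {s:.2f}")
print("Rank-1 probabilities:", dict(zip(treatments, rank_probs[0].round(2))))
```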
7) Comparative displays: league table, network plot, forest vs reference
What it is: Turn the model into decision‑ready visuals:
League table: all pairwise MD/SMD with 95% CIs.
Network plot: nodes (treatments) and edges (direct trials), node size ∝ total n, edge width ∝ evidence amount.
Forest vs reference: simple clinical read of effects.
Why we do it: Stakeholders must quickly answer “A or B?” and see where evidence is direct vs indirect.
Core focus
League directionality (know whether entries are “column vs row”).
Order rows/columns by SUCRA/P‑score only as a presentation aid (ranks ≠ certainty).
Explain network balance (star‑shaped vs richly connected).
Typical outputs
A league matrix (MD when units match; SMD otherwise).
A weighted network graph.
A forest plot vs reference in clinical units (MD) or unit‑free (SMD).
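For orientation, the league‑table arithmetic can be sketched as below; treatments, estimates, and SEs are hypothetical, entries read “column vs row”, and the covariances between network estimates are ignored for simplicity:

```python
import math

# Hypothetical network estimates vs a common reference (placebo), MD in points
md_vs_ref = {"Placebo": 0.0, "A": -4.0, "B": -2.5, "C": -3.5}
se_vs_ref = {"Placebo": 0.0, "A":  0.9, "B":  0.7, "C":  1.2}
treatments = list(md_vs_ref)

# League-table entries read "column treatment vs row treatment".
for row in treatments:
    cells = []
    for col in treatments:
        if row == col:
            cells.append(col.center(22))
            continue
        md = md_vs_ref[col] - md_vs_ref[row]
        se = math.sqrt(se_vs_ref[col] ** 2 + se_vs_ref[row] ** 2)
        lo, hi = md - 1.96 * se, md + 1.96 * se
        cells.append(f"{md:5.1f} ({lo:5.1f} to {hi:4.1f})".center(22))
    print(" | ".join(cells))
```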
8) Small‑study effects / publication bias
What it is: Assess funnel‑plot asymmetry for continuous outcomes (pairwise funnels, or a comparison‑adjusted funnel in NMA) and consider Egger‑type tests when the number of studies is adequate (k > 10); an illustrative test is sketched at the end of this step.
Why we do it: Selective reporting and small studies can bias continuous outcomes (especially when SD imputation or different instruments are involved).
Core focus
Visual funnel symmetry; keep in mind low power with few studies.
Consider substantive reasons for asymmetry (measurement protocols, visit windows, imputed SDs).
Typical outputs
(Comparison‑adjusted) funnel plot for NMA; a brief test result if appropriate.
A reasoned narrative on plausibility and impact.
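An illustrative Egger‑type regression (standardized effect on precision, intercept tested for asymmetry) on hypothetical data; with few studies the test has low power and should be read alongside the funnel plot:

```python
import numpy as np
from scipy import stats

def egger_test(effects, ses):
    """Egger-style regression: regress standardized effect (effect / SE) on
    precision (1 / SE); a non-zero intercept suggests small-study effects."""
    y = np.asarray(effects, float) / np.asarray(ses, float)
    x = 1.0 / np.asarray(ses, float)
    res = stats.linregress(x, y)
    # Test the intercept with its own t-test
    n = len(y)
    s_xx = np.sum((x - x.mean()) ** 2)
    resid = y - (res.intercept + res.slope * x)
    s2 = np.sum(resid ** 2) / (n - 2)
    se_intercept = np.sqrt(s2 * (1 / n + x.mean() ** 2 / s_xx))
    t = res.intercept / se_intercept
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return res.intercept, p

# Hypothetical pairwise comparison with 11 trials
effects = [-4.5, -3.8, -4.1, -2.9, -3.5, -5.2, -2.4, -4.8, -3.1, -2.2, -6.0]
ses     = [ 0.6,  0.8,  0.7,  1.1,  0.9,  1.4,  1.2,  1.5,  1.0,  1.3,  1.8]
intercept, p = egger_test(effects, ses)
print(f"Egger intercept = {intercept:.2f}, p = {p:.3f}")
```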
9) Contributions, sensitivity, and certainty of evidence
What it is: Make results auditable and robust:
Contribution matrix: which direct comparisons drive each network estimate.
Sensitivity ladders: remove high‑risk‑of‑bias trials; re‑estimate with an alternative τ² estimator; exclude studies with imputed SDs; stratify by timepoint, baseline level, or instrument family (for SMD). A leave‑one‑out sketch follows this step.
Certainty: overall confidence considering risk of bias, imprecision, inconsistency, indirectness, and small‑study effects.
Why we do it: Transparent evidence flow plus robustness checks lead to credible decisions.
Core focus
Target sensitivities to the dominant contributors.
If using SMD, provide back‑translation (e.g., to a representative instrument or to an MID) to regain clinical meaning.
Summarize certainty (e.g., high/moderate/low/very low) with reasons.
Typical outputs
Contribution heatmap/table identifying key drivers.
Sensitivity tables/plots and narrative on stability.
A certainty/credibility summary that accompanies the main findings.
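A leave‑one‑out ladder is the simplest sensitivity check to script; the sketch below re‑pools hypothetical contrasts after dropping each trial in turn (fixed‑effect pooling is used only to keep the example short; the main analysis would rerun the random‑effects model from step 3):

```python
import numpy as np

def pool_fixed(y, se):
    """Inverse-variance fixed-effect pooled estimate (kept simple for the example)."""
    w = 1.0 / np.asarray(se, float) ** 2
    return float(np.sum(w * np.asarray(y, float)) / np.sum(w))

# Hypothetical contrasts for one comparison, with study labels
studies = ["Trial 1", "Trial 2", "Trial 3", "Trial 4", "Trial 5"]
mds = np.array([-4.2, -2.8, -5.1, -3.0, -1.9])
ses = np.array([ 1.1,  0.9,  1.6,  1.2,  0.8])

print(f"All studies: pooled MD = {pool_fixed(mds, ses):.2f}")

# Leave-one-out ladder: re-pool after dropping each study in turn
for i, label in enumerate(studies):
    keep = np.arange(len(studies)) != i
    print(f"Without {label}: pooled MD = {pool_fixed(mds[keep], ses[keep]):.2f}")
```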
Quick reference: decisions you must lock down early (and keep consistent)
Endpoint vs change (and, if change, how you handle the pre–post correlation).
Effect scale: MD (preferred) vs SMD (only if instruments differ).
Direction of benefit (“higher is better” vs “lower is better”) — affects ranking settings and plot labels.
Primary timepoint (avoid mixing visits unless you stratify or model time).
Random‑effects as default; specify τ² estimator.
Plan for transitivity/consistency diagnostics, ranks with uncertainty, and contribution‑guided sensitivity.