How to Critically Appraise a Systematic Review (SR) and Meta-Analysis (MA): Using AMSTAR 2, GRADE, and CINeMA

Mayta
Jun 3, 2025

Introduction

Systematic reviews and meta-analyses are central to evidence-based clinical decision-making. They synthesize findings from multiple studies to answer focused questions about interventions, diagnosis, or prognosis. However, not all reviews are methodologically sound. A poorly conducted or misinterpreted review may mislead clinicians and harm patients. Critical appraisal is therefore essential: it is the clinician’s or researcher’s responsibility to verify both the credibility of the review's methods and the confidence in its conclusions.

This guide explains how to evaluate systematic reviews and meta-analyses using internationally accepted tools and principles, focusing on the AMSTAR 2 framework for conduct quality and the GRADE or CINeMA approach for assessing certainty in the findings.

Step 1: Assessing Credibility – How Well Was the Review Conducted?

AMSTAR 2: The Conduct Checklist

AMSTAR 2 (A MeaSurement Tool to Assess systematic Reviews) is a widely adopted instrument for appraising systematic reviews that include randomized or non-randomized studies of healthcare interventions. It comprises 16 items covering the review process, seven of which are designated critical domains because they most strongly influence the trustworthiness of a review.

The Seven Critical Domains of AMSTAR 2:

Pre-registered Protocol: Was the review protocol registered before starting, and were deviations explained?
Comprehensive Search: Did the authors use a detailed, exhaustive search strategy?
Justification of Exclusions: Was a list of excluded studies provided with justifications?
Risk of Bias (RoB) Assessment: Did the review evaluate risk of bias in included studies using appropriate tools?
Appropriate Meta-analytic Methods: Were the methods for data synthesis statistically sound?
RoB Consideration in Interpretation: Was the impact of study bias reflected in the authors’ conclusions?
Publication Bias Assessment: Did the review examine and discuss possible publication bias?

Each item is rated, but AMSTAR 2 does not produce an overall numeric score. Instead, the pattern of flaws determines the confidence level in the review’s conduct.

Grading Confidence in the Review's Conduct

Appraisal Result	Confidence Rating	Interpretation
No critical flaws and at most 1 non-critical weakness	High	Review is valid and results are reliable
No critical flaws and more than 1 non-critical weakness	Moderate	May provide valid results
Exactly 1 critical flaw (± non-critical weaknesses)	Low	May not provide valid conclusions
More than 1 critical flaw (± non-critical weaknesses)	Critically low	Results should not be trusted

If the review is rated “low” or “critically low,” there is little justification for appraising its effect estimates further, because the review's credibility falls below the threshold for interpretation.
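The rating rules in the table above can be sketched as a small function. This is only an illustrative sketch: the function name `amstar2_confidence` and its two count parameters are hypothetical, and it reads "Low" as exactly one critical flaw and "Critically low" as more than one, so that the four categories do not overlap. Deciding whether an item counts as a flaw remains a matter of appraiser judgment.

```python
def amstar2_confidence(critical_flaws: int, noncritical_weaknesses: int) -> str:
    """Map counts of flaws to an AMSTAR 2 overall confidence rating.

    Hypothetical helper: the appraiser has already decided which of the
    seven critical domains are flawed and which other items are weak.
    """
    if critical_flaws < 0 or noncritical_weaknesses < 0:
        raise ValueError("counts must be non-negative")
    if critical_flaws > 1:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    # No critical flaws from here on.
    if noncritical_weaknesses > 1:
        return "moderate"
    return "high"


# Example: a review with no registered protocol (one critical flaw)
# and two minor reporting weaknesses.
rating = amstar2_confidence(critical_flaws=1, noncritical_weaknesses=2)
```

Note that the function returns a category, not a number, which mirrors the point that AMSTAR 2 does not produce an overall numeric score.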

Step 2: Assessing Confidence – How Much Trust in the Effect Estimates?

Even a well-conducted review may yield uncertain or misleading estimates if the primary studies are weak or the evidence base is inconsistent. Confidence in effect estimates must be appraised separately, using GRADE (for pairwise meta-analyses) or CINeMA (for network meta-analyses).

GRADE Framework (for Pairwise Meta-Analysis)

GRADE categorizes evidence certainty into four levels—high, moderate, low, very low—based on five domains:

Risk of Bias: Are the included studies prone to methodological flaws (e.g., poor randomization, missing data)?
Inconsistency: Are the results homogeneous, or do studies point in different directions?
Indirectness: Does the evidence apply to the PICO question (population, intervention, comparator, outcome)?
Imprecision: Are the confidence intervals so wide that they are compatible with both a meaningful effect and no effect (or harm)?
Publication Bias: Is there a possibility that relevant studies were unpublished?

Upgrading of certainty is rare but possible for evidence from observational studies that show large effects, dose-response relationships, or situations in which plausible residual confounding would be expected to diminish the observed effect rather than create it.
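As a rough illustration of how the five domains combine, the sketch below treats the four certainty levels as a ladder. It relies on conventions that are part of GRADE but are not spelled out in this article: evidence from randomized trials starts at high, evidence from observational studies starts at low, and each domain can lower the rating by one level (serious concern) or two (very serious concern). The function name, parameter names, and domain keys are hypothetical. In practice GRADE ratings are structured judgments, not arithmetic, so treat this as a bookkeeping aid only.

```python
LEVELS = ("very low", "low", "moderate", "high")
DOMAINS = {
    "risk_of_bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication_bias",
}


def grade_certainty(design: str, downgrades: dict, upgrades: int = 0) -> str:
    """Return a GRADE-style certainty level.

    design:     "rct" or "observational" (sets the starting level)
    downgrades: domain name -> 0 (no concern), 1 (serious), 2 (very serious)
    upgrades:   levels added for large effect, dose-response, etc.
    """
    if design not in ("rct", "observational"):
        raise ValueError("design must be 'rct' or 'observational'")
    unknown = set(downgrades) - DOMAINS
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    if any(v not in (0, 1, 2) for v in downgrades.values()):
        raise ValueError("each downgrade must be 0, 1, or 2")
    if upgrades < 0:
        raise ValueError("upgrades must be non-negative")

    start = 3 if design == "rct" else 1  # index into LEVELS
    index = start - sum(downgrades.values()) + upgrades
    # Clamp to the ladder: certainty cannot go below "very low" or above "high".
    return LEVELS[max(0, min(len(LEVELS) - 1, index))]


# Example: randomized trials with serious risk of bias and serious imprecision.
certainty = grade_certainty("rct", {"risk_of_bias": 1, "imprecision": 1})
```

Keeping the per-domain downgrades in a dictionary makes the reasoning behind the final level easy to report alongside it.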

CINeMA (for Network Meta-Analysis)

CINeMA (Confidence in Network Meta-Analysis) extends GRADE to handle the complexity of indirect and mixed comparisons in network meta-analysis (NMA). It evaluates six domains:

Within-study bias
Reporting bias
Indirectness
Imprecision
Heterogeneity
Incoherence (inconsistency between direct and indirect evidence)

Both GRADE and CINeMA yield final ratings in four levels and should be reported transparently, ideally using Summary of Findings tables.

Five Core Questions to Guide Confidence Assessment

When appraising confidence in effect estimates, ask:

How serious is the risk of bias across the body of evidence?
Are the results consistent across studies, or is there unexplained heterogeneity?
How precise are the results—do confidence intervals suggest a clear benefit or harm?
Do the results apply directly to my clinical population (same PICO)?
Is there evidence of reporting or publication bias?

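Favorable answers (low risk of bias, consistent results, precise estimates, direct applicability, and no sign of reporting or publication bias) strengthen confidence. Unfavorable answers reduce it.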

Practical Challenges in Critical Appraisal

Even with robust tools, appraisal is not always straightforward:

Subjectivity: Some domains (e.g., indirectness) rely on contextual judgment.
Heterogeneity: High variability may obscure real effects.
Conflicting Results: High-quality and low-quality studies may yield contradictory estimates.
Duplicate Reviews: Multiple systematic reviews on the same question may disagree, creating confusion.
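Heterogeneity, listed above, is often quantified rather than judged by eye. One common choice, which this article does not name, is the I² statistic derived from Cochran's Q. The sketch below computes it from per-study effect estimates and standard errors using fixed-effect inverse-variance weights. The function name and inputs are hypothetical, and the statistic is only a rough guide, especially when few studies are available.

```python
def i_squared(effects, std_errors):
    """Percentage of variability across studies beyond what chance would explain.

    effects:    per-study effect estimates on a common scale (e.g. log odds ratios)
    std_errors: matching standard errors, all positive
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Inverse-variance weights and the fixed-effect pooled estimate.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I^2 = (Q - df) / Q, truncated at zero, expressed as a percentage.
    if q <= df:
        return 0.0
    return 100.0 * (q - df) / q


# Example with made-up numbers: two studies that disagree sharply.
example = i_squared([0.0, 1.0], [0.1, 0.1])
```

A high value signals that the pooled estimate averages over studies that disagree, which is one reason a downgrade for inconsistency may be considered.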

Clinicians must weigh not just the overall rating but also the direction, strength, and consistency of the evidence, and relate it to the clinical context and patient values.

Conclusion

Critically appraising a systematic review is a two-stage process:

Credibility: Check whether the review methods are rigorous (AMSTAR 2).
Confidence: Evaluate the certainty of effect estimates (GRADE or CINeMA).

Together, these steps determine whether a review should influence patient care or health policy. By using structured tools and asking focused questions, clinicians and researchers can distinguish trustworthy reviews from flawed or misleading ones—and anchor clinical decisions in the best available evidence.

Key Takeaways

AMSTAR 2 assesses how well a review was conducted; seven domains are critical.
GRADE and CINeMA evaluate how much trust to place in the conclusions.
Reviews with critical flaws should not be used for decision-making.
Confidence depends on factors like risk of bias, precision, and applicability.
Appraisal tools guide judgment, but clinical reasoning must integrate values and context.