Normality Test: Shapiro–Wilk, Lilliefors, SD vs. Mean Rule, Pearson r, and MAD
- Mayta
- Jul 22
- 9 min read
Comparison Table of Normality Assessment Methods
🔍 M1 – Shapiro–Wilk Test
📍 When to Use It (Indications)
Small to moderate samples (n < 5000):
Especially powerful for small datasets (n < 50).
Ideal during preliminary checks before running t-tests, ANOVA, or parametric regression models.
When assumptions about normality matter:
Used when modeling continuous variables as outcomes or predictors.
Assessing lab values, scores, or clinical endpoints:
Common in clinical trials and biomarker studies where means or SDs drive decisions.
⚙️ How It Works – Behind the Scenes
The Shapiro–Wilk test compares:
Your actual data (sorted) to
What normally distributed data should look like at each rank.
It builds a test statistic, W, which summarizes:
Numerator: How well your sorted data aligns with ideal normal values.
Denominator: How much variation (spread) there is in your sample.
A W close to 1 means your data is tightly aligned with the expected pattern of a normal distribution.
A W much less than 1 means the data deviates — due to skew, outliers, or fat tails.
The null distribution of W has no simple closed form; in practice, p-values come from calibrated approximations (e.g., Royston's algorithm) or simulation. The p-value tells you how likely a W this low would be by chance if the data were truly normal.
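Formally, the statistic is

$$W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $x_{(i)}$ are the sorted values and the weights $a_i$ come from the expected order statistics of a normal sample. A minimal sketch in base R (the sample sizes and distributions below are illustrative assumptions, not from the article):

```r
# Shapiro-Wilk via base R's shapiro.test(): W near 1 supports normality
set.seed(42)
x_normal <- rnorm(40, mean = 8, sd = 2)           # 40 draws from a true normal
x_skewed <- rlnorm(40, meanlog = 2, sdlog = 0.8)  # right-skewed lognormal draws

shapiro.test(x_normal)  # W close to 1, large p-value: no evidence against normality
shapiro.test(x_skewed)  # W noticeably below 1, small p-value: non-normality flagged
```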
🧠 Now in Plain Language
Picture lining up your values from lowest to highest.
A “normal” distribution has a predictable pattern: the middle value near the mean, and symmetry on both sides.
Shapiro–Wilk checks how closely your data follows this ideal shape.
If your curve “bends” due to skew or heavy tails, it shows up as a drop in the W score, and the test flags non-normality.
✅ Pros
| Feature | Benefit |
| --- | --- |
| Very powerful in small samples | Best-in-class for detecting non-normality with n < 50 |
| Captures subtle deviations | Sensitive to skewness and kurtosis |
| Widely available | Built into R (shapiro.test()), SPSS, Python, etc. |
| Finite-sample calibration | p-values are calibrated for small samples rather than relying on large-sample asymptotics |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Overly sensitive at large n | May reject normality due to tiny, irrelevant deviations if n > 5000 |
| Univariate only | Doesn't test multivariate normality (e.g., for MANOVA) |
| Assumes continuous data | Poor performance with many tied or rounded values |
| Doesn't diagnose the type of deviation | Flags non-normality but not whether it stems from skew, bimodality, or heavy tails |
🩺 Clinical Analogy
Think of a nurse checking a patient’s pulse pattern against a reference ECG.
If every beat falls right on the expected pattern → normal (W ~ 1).
If some beats are early, late, or erratic → abnormal rhythm (W < 1).
Shapiro–Wilk plays the same role for your data’s shape — spotting subtle irregularities in statistical “rhythm.”
🔍 M2 – Lilliefors Test (a smarter Kolmogorov–Smirnov test)
📍 When to Use It (Indications)
Moderate-to-large samples (n > 50):
You need to test normality, but the true population mean and SD are not known a priori (they are estimated from the sample).
You're about to run regression models or ANOVA and want to check that residuals or continuous predictors are normal-like.
Real-world data: clinical or biomarker data that likely doesn't follow textbook distributions.
Simulation pipelines, especially when evaluating the robustness of new normality classifiers.
⚙️ How It Works – Formula Intuition (Light)
Think of this as a shape comparison test.
First, calculate your empirical cumulative distribution function (ECDF). This is just:
“At each point, what % of the data lies below or at this value?”
Next, assume your data is normal—but you estimate the mean and SD from your sample.
Now generate the theoretical normal CDF based on those estimates (like a reference curve).
The test statistic D is the biggest vertical distance between the ECDF (your data) and the reference CDF (ideal normal):

$$D = \sup_x \left| F_n(x) - \Phi(x; \hat{\mu}, \hat{\sigma}) \right|$$

Where:
$F_n(x)$ = your ECDF at x
$\Phi(x; \hat{\mu}, \hat{\sigma})$ = CDF of a normal distribution with estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$
If this maximum difference D is too large, the test says:
“Your data deviates from normality more than we’d expect by chance.”
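A minimal sketch, assuming the nortest package (whose lillie.test() implements the Lilliefors correction); the simulated glucose-like values are an illustrative assumption:

```r
# Lilliefors test: KS statistic with mean/SD estimated from the sample
# install.packages("nortest")
library(nortest)

set.seed(7)
glucose <- rlnorm(100, meanlog = 4.6, sdlog = 0.25)  # skewed, glucose-like values

lillie.test(glucose)  # small p-value suggests a departure from normality

# The same D statistic by hand, for intuition:
z       <- sort(glucose)
ecdf_hi <- seq_along(z) / length(z)              # ECDF just after each step
ref_cdf <- pnorm(z, mean = mean(z), sd = sd(z))  # fitted normal reference CDF
max(abs(ecdf_hi - ref_cdf))  # close to lillie.test's D (ignoring the lower step edge)
```

The exact KS statistic also checks the gap at the bottom of each ECDF step; lillie.test() handles that detail for you.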
🧠 Now in Plain English (No Formulas)
You line up your data from smallest to largest and draw a step-wise curve showing how it accumulates.
You overlay a perfect bell-shaped curve, stretched to fit your data’s mean and SD.
You measure the biggest gap between your step-curve and the ideal curve.
If the gap is too big, your data isn't normally distributed.
✅ Pros
| Feature | Benefit |
| --- | --- |
| No need to know population mean/SD | Perfect for real-world data where parameters are unknown |
| Good for mid-to-large n | Robust and interpretable even with 1000+ observations |
| Sensitivity in tails | Picks up deviations at extreme high or low values |
| Graph-friendly | Pairs well with ECDF plots for visual inspection |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Less powerful than Shapiro–Wilk for small n | May fail to detect subtle skewness if n < 50 |
| Not designed for multivariate normality | Like Shapiro–Wilk, it's univariate |
| Over-sensitive with large n | Like most tests, it may flag trivial differences when n is very large |
| Less intuitive p-value logic | Simulated critical values are needed for significance thresholds |
🩺 Clinical Analogy
Suppose you have 100 patients’ fasting glucose levels.
You sort them and draw a “stair-step” curve showing how many patients fall below each glucose level.
You lay over a smooth “healthy patient” bell curve.
If your actual data steps away too far from the smooth ideal? Something’s off—maybe skew, heavy tails, or outliers.
That’s what Lilliefors is checking.
🔍 M3 – Heuristic Rule: 2 × SD < Mean
📍 When to Use It (Indications)
Quick screening in clinical datasets:
Commonly applied to positively skewed, non-negative variables like:
Length of Stay (LOS)
Hospital charges
Biomarkers (e.g., CRP, D-dimer)
Patient-reported scores (on unipolar scales)
When graphical tools or formal tests aren’t feasible:
Used in exploratory analysis, dashboards, or pre-modeling pipelines for triage.
As a flag before transformation:
Indicates when a variable may need log transformation or non-parametric handling.
⚙️ How It Works – Intuition & Calculation
This rule assumes a basic property of right-skewed distributions: their spread (SD) is large compared to their central tendency (mean).
Formula:
$$\text{If } 2 \times \text{SD} < \text{Mean} \Rightarrow \text{Likely Normal or Left-Skewed}$$
$$\text{If } 2 \times \text{SD} \geq \text{Mean} \Rightarrow \text{Right-Skewed}$$
Why “2×SD”? Because in a normal distribution:
About 95% of data lies within ±2 SDs of the mean.
For a non-negative variable, 2×SD exceeding the mean would push the lower end of that 95% range below zero, which is impossible. The extra spread must therefore come from a long right tail rather than a symmetric bell.
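A minimal sketch of the rule as a screening function (flag_right_skew() is a hypothetical helper name; the LOS vector is invented so its mean matches the ICU example that follows):

```r
# Heuristic flag: 2*SD >= mean suggests right-skew in a non-negative variable
flag_right_skew <- function(x) {
  stopifnot(all(x >= 0, na.rm = TRUE))  # the rule is meaningless for negative values
  2 * sd(x, na.rm = TRUE) >= mean(x, na.rm = TRUE)
}

los <- c(2, 3, 4, 5, 6, 7, 8, 9, 12, 24)  # hypothetical ICU stays (days), mean = 8
flag_right_skew(los)                      # TRUE: 2*SD exceeds the mean, likely right-skew
```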
🧠 Now in Plain Language
Let’s say you measure LOS in ICU:
Mean = 8 days
SD = 5 days
2 × SD = 10 → 10 > 8 → This suggests right-skew
It doesn’t test normality rigorously. Instead, it waves a yellow flag:
“Watch out — this might not be symmetric; better check the distribution.”
Think of it as a quick visual cue in numeric form.
✅ Pros
| Feature | Benefit |
| --- | --- |
| Ultra-fast to compute | No coding or stats required, just two values |
| Good for pre-modeling screening | Flags likely-skewed variables before regression or comparison tests |
| Clinically interpretable | Taps into real-world understanding of variable distributions (e.g., LOS rarely looks like a bell curve) |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Not a statistical test | Doesn't give p-values or control for sample size |
| Fails in symmetric but non-normal data | Can misclassify if tails are heavy or the distribution is bimodal |
| Breaks on near-zero or negative values | A mean close to 0, or data with negative values, yields nonsense |
| Overly sensitive to outliers | A few large values can distort the SD and mislead the rule |
🩺 Clinical Analogy
Imagine you’re estimating ICU LOS across patients:
If the average stay is 6 days and 2×SD is 14, something’s off — some patients must be staying much longer, dragging the curve to the right.
This quick rule helps you spot that asymmetry without needing a histogram.
🔄 Summary
Not a test—a flag.
Great in data-rich clinical environments to triage variables.
Useful prelude to proper tools (e.g., Shapiro–Wilk, transformations).
🔍 M4a – Correlation-Based QQ Alignment (Pearson r ≥ 0.95)
📍 When to Use It (Indications)
When you want an algorithmic version of QQ plot interpretation:
Converts the visual judgment of "normal-looking" plots into a quantitative cutoff.
Medium-to-large datasets (n ≥ 30):
Especially useful when Shapiro–Wilk becomes over-sensitive or you want to bypass p-value pitfalls.
When batch-scanning many variables:
Can be run on hundreds of variables automatically, flagging which ones are far from normal by rank-based alignment.
As a validation step for normality transformations:
Helps confirm whether log-transform, Box-Cox, or Winsorization brought a variable closer to symmetry.
⚙️ How It Works – Formula Logic
This method compares:
Quantiles of your sample (the sorted values)
With quantiles of a standard normal distribution (i.e., ideal bell curve values)
Then calculates the Pearson correlation between the two sets of quantiles:

$$r = \operatorname{corr}\left( Q_{\text{sample}}, Q_{\text{normal}} \right)$$

Where:
$Q_{\text{sample}}$: quantiles of your actual data
$Q_{\text{normal}}$: theoretical quantiles under $N(0, 1)$
A high Pearson r (≥ 0.95) means:
Your data lies almost perfectly along the expected QQ plot line → likely normal
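A minimal sketch, assuming base R's ppoints() plotting positions for the theoretical quantiles (qq_r() is a hypothetical helper name):

```r
# QQ-alignment check: Pearson correlation between sample and normal quantiles
qq_r <- function(x) {
  x <- sort(x[!is.na(x)])                # sample quantiles (sorted values)
  q_normal <- qnorm(ppoints(length(x)))  # theoretical N(0,1) quantiles
  cor(x, q_normal)                       # Pearson r along the QQ line
}

set.seed(1)
qq_r(rnorm(200))  # near 1: passes an r >= 0.95 rule
qq_r(rexp(200))   # bent QQ line: r typically lands well below the cutoff

# Batch-scanning every numeric column of a data frame (my_df is hypothetical):
# sapply(Filter(is.numeric, my_df), function(v) qq_r(v) >= 0.95)
```

Because Pearson r is location- and scale-invariant, the raw data need not be standardized first.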
🧠 Now in Plain Language
You line up your data from smallest to largest.
You ask: “Where would these values fall if they were truly from a normal distribution?”
You draw both sets as dots — actual vs. expected.
Then: how straight is that line?
If it’s perfectly straight → correlation = 1.
If it’s bent/skewed → correlation drops.
You decide:
“If r ≥ 0.95, I’ll treat this as normal.”
It’s like putting a straight-edge ruler over your QQ plot and checking the fit — but with math.
✅ Pros
| Feature | Benefit |
| --- | --- |
| No p-value overinterpretation | Focuses on pattern, not statistical significance |
| Visual + numeric hybrid | Replaces subjective QQ plot reading with an objective threshold |
| Great for automation | Can be used in pipelines screening 100+ variables |
| Not disrupted by minor outliers | Moderately robust compared with variance-sensitive tests |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Arbitrary threshold (0.95) | No universal justification; may need tuning by domain |
| Not a test, so no p-value | Can't say how unlikely the observed r is under true normality |
| Only tests linearity | Doesn't distinguish skew from kurtosis problems; both just lower r |
| Sample-size sensitivity | In very small samples, correlation may be high by chance; in large samples, tiny deviations lower r |
🩺 Clinical Analogy
Think of checking whether a patient’s BP readings follow a standard 24-hour circadian pattern:
If they follow the expected ups/downs closely → high correlation → “normal rhythm”
If the pattern is jagged, inconsistent → correlation drops → abnormal rhythm
This method checks your data’s rhythm against the ideal bell curve.
🧪 Threshold Justification
The r ≥ 0.95 threshold is empirical: it balances specificity vs sensitivity based on simulation studies.
It’s not sacred — you can adjust:
r ≥ 0.97 → stricter
r ≥ 0.93 → more lenient
🔍 M4b – MAD-Based Deviation from the QQ Line (MAD ≤ 0.15)
📍 When to Use It (Indications)
When you're analyzing data prone to outliers:
Ideal for healthcare cost, ICU stay, biomarker spikes — where extreme values exist, but you want to focus on the overall pattern.
When you want to quantify the shape mismatch in QQ plots:
Rather than asking “is the line straight?” (M4a), this asks “how far off are these points, on average?”
For visual normality assessments you want to standardize:
Use this to replace human-rater variability in training sets or simulation pipelines.
Useful in NLP/ML pipelines where robustness matters:
When building risk scores, lab data normality affects modeling choice — this method can help automate preprocessing logic.
⚙️ How It Works – Formula Logic
1. Standardize your data:

$$z_i = \frac{x_i - \bar{x}}{s}$$

Where:
$x_i$ = original data
$\bar{x}$ = sample mean
$s$ = SD
2. Generate theoretical normal quantiles:
Sort your standardized values z(i)
Compare to expected quantiles from a standard normal distribution (e.g., via qnorm() in R)
3. Compute the mean absolute deviation from the line:

$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| z_{(i)} - q_i \right|$$

where $q_i$ is the i-th theoretical standard normal quantile.
4. Decision rule:
If MAD ≤ 0.15 → distribution considered “close enough” to normal
Else → potentially non-normal
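A minimal sketch mirroring the four steps above, under the same ppoints() assumption as the M4a snippet (qq_mad() is a hypothetical helper name):

```r
# MAD-based QQ deviation: average absolute gap between standardized sample
# quantiles and theoretical standard normal quantiles
qq_mad <- function(x) {
  z <- sort((x - mean(x)) / sd(x))  # step 1: standardize, then sort
  q <- qnorm(ppoints(length(z)))    # step 2: theoretical N(0,1) quantiles
  mean(abs(z - q))                  # step 3: mean absolute deviation
}

set.seed(3)
qq_mad(rnorm(150))              # typically well under 0.15: "close enough" to normal
qq_mad(rlnorm(150, sdlog = 1))  # clearly above 0.15: skew pulls the quantiles apart

qq_mad(rnorm(150)) <= 0.15      # step 4: the decision rule as a logical flag
```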
🧠 Now in Plain Language
You scale your data to behave like a standard normal bell curve (mean 0, SD 1).
Then you ask: “If this were truly normal, where should each value fall?”
You measure how far off each data point is — take the average of those deviations.
If, on average, data points are very close (≤ 0.15) to where they should be, you say:
“This variable walks the line — it’s close enough to normal.”
Think of this as checking how much “wobble” your data has around the perfect QQ line.
✅ Pros
| Feature | Benefit |
| --- | --- |
| Outlier-resistant | Uses absolute (not squared) differences, so less distortion from extreme values |
| Good at detecting curvature | Picks up on subtle S-shapes or U-bends in QQ plots |
| Numerically stable | Doesn't depend on correlation, variance, or p-values |
| Straightforward threshold | MAD ≤ 0.15 is intuitive and consistent |
❌ Cons
| Limitation | Impact |
| --- | --- |
| No formal test | No p-value or simulation-based cutoff; it's a heuristic threshold |
| Threshold (0.15) is empirical | Based on calibration to human rating; needs justification in new domains |
| Assumes linearity is ideal | May flag perfectly symmetric but heavy-tailed data as non-normal |
| Requires standardized data | Can't be used on raw units; adds a preprocessing step |
🩺 Clinical Analogy
Think of comparing a patient’s ECG trace to a healthy standard:
You check how far off each beat is from the expected line — not whether the pattern is straight, but how wobbly it is.
If the average deviation is small, you say:
“Looks good overall — no clinical concern.”
This method measures that “average wobble” — point-by-point mismatch from normality.
🔬 Threshold Justification
The 0.15 cutoff comes from pilot visual studies aligning MAD with expert ratings.
Adjusting it:
MAD ≤ 0.10 → stricter, more false positives
MAD ≤ 0.20 → looser, fewer rejections