Normality Test: Shapiro–Wilk, Lilliefors, SD vs. Mean Rule, Pearson r, and MAD
- Mayta
- Jul 22
- 9 min read
Comparison Table of Normality Assessment Methods
🔍 M1 – Shapiro–Wilk Test
📍 When to Use It (Indications)
Small to moderate samples (n < 5000):
Especially powerful for small datasets (n < 50).
Ideal during preliminary checks before running t-tests, ANOVA, or parametric regression models.
When assumptions about normality matter:
Used when modeling continuous variables as outcomes or predictors.
Assessing lab values, scores, or clinical endpoints:
Common in clinical trials and biomarker studies where means or SDs drive decisions.
⚙️ How It Works – Behind the Scenes
The Shapiro–Wilk test compares:
Your actual data (sorted) to
What normally distributed data should look like at each rank.
It builds a test statistic, W, which summarizes:
Numerator: How well your sorted data aligns with ideal normal values.
Denominator: How much variation (spread) there is in your sample.
A W close to 1 means your data is tightly aligned with the expected pattern of a normal distribution.
A W much less than 1 means the data deviates — due to skew, outliers, or fat tails.
The null distribution of W has no simple closed form; in practice, p-values come from calibrated approximations (e.g., Royston's algorithm) or simulation. The p-value tells you how likely a W this low would be by chance if the data were truly normal.
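Formally, the statistic is

$$W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $x_{(i)}$ are the sorted values and the weights $a_i$ come from the expected order statistics of a normal sample. A minimal sketch in base R (the sample sizes and distributions below are illustrative assumptions, not from the article):

```r
# Shapiro-Wilk via base R's shapiro.test(): W near 1 supports normality
set.seed(42)
x_normal <- rnorm(40, mean = 8, sd = 2)           # 40 draws from a true normal
x_skewed <- rlnorm(40, meanlog = 2, sdlog = 0.8)  # right-skewed lognormal draws

shapiro.test(x_normal)  # W close to 1, large p-value: no evidence against normality
shapiro.test(x_skewed)  # W noticeably below 1, small p-value: non-normality flagged
```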
🧠 Now in Plain Language
Picture lining up your values from lowest to highest.
A “normal” distribution has a predictable pattern: the middle value near the mean, and symmetry on both sides.
Shapiro–Wilk checks how closely your data follows this ideal shape.
If your curve “bends” due to skew or heavy tails, it shows up as a drop in the W score, and the test flags non-normality.
✅ Pros
| Feature | Benefit |
| --- | --- |
| Very powerful in small samples | Best-in-class for detecting non-normality with n < 50 |
| Captures subtle deviations | Sensitive to skewness and kurtosis |
| Widely available | Built into R (shapiro.test()), SPSS, Python, etc. |
| Finite-sample calibration | p-values are calibrated for small samples rather than relying on large-sample asymptotics |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Overly sensitive at large n | May reject normality due to tiny, irrelevant deviations if n > 5000 |
| Univariate only | Doesn't test multivariate normality (e.g., for MANOVA) |
| Assumes continuous data | Poor performance with many tied or rounded values |
| Doesn't diagnose the type of deviation | Flags non-normality but not whether it stems from skew, bimodality, or heavy tails |
🩺 Clinical Analogy
Think of a nurse checking a patient’s pulse pattern against a reference ECG.
If every beat falls right on the expected pattern → normal (W ~ 1).
If some beats are early, late, or erratic → abnormal rhythm (W < 1).
Shapiro–Wilk plays the same role for your data’s shape — spotting subtle irregularities in statistical “rhythm.”
🔍 M2 – Lilliefors Test (a smarter Kolmogorov–Smirnov test)
📍 When to Use It (Indications)
Moderate-to-large samples (n > 50):
You need to test normality, but the true population mean and SD are not known a priori (they are estimated from the sample).
You're about to run regression models or ANOVA and want to check that residuals or continuous predictors are normal-like.
Real-world data: clinical or biomarker data that likely doesn't follow textbook distributions.
Simulation pipelines, especially when evaluating the robustness of new normality classifiers.
⚙️ How It Works – Formula Intuition (Light)
Think of this as a shape comparison test.
First, calculate your empirical cumulative distribution function (ECDF). This is just:
“At each point, what % of the data lies below or at this value?”
Next, assume your data is normal—but you estimate the mean and SD from your sample.
Now generate the theoretical normal CDF based on those estimates (like a reference curve).
The test statistic D is the biggest vertical distance between the ECDF (your data) and the reference CDF (ideal normal):

$$D = \sup_x \left| F_n(x) - \Phi(x; \hat{\mu}, \hat{\sigma}) \right|$$

Where:
$F_n(x)$ = your ECDF at x
$\Phi(x; \hat{\mu}, \hat{\sigma})$ = CDF of a normal distribution with estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$
If this maximum difference D is too large, the test says:
“Your data deviates from normality more than we’d expect by chance.”
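A minimal sketch, assuming the nortest package (whose lillie.test() implements the Lilliefors correction); the simulated glucose-like values are an illustrative assumption:

```r
# Lilliefors test: KS statistic with mean/SD estimated from the sample
# install.packages("nortest")
library(nortest)

set.seed(7)
glucose <- rlnorm(100, meanlog = 4.6, sdlog = 0.25)  # skewed, glucose-like values

lillie.test(glucose)  # small p-value suggests a departure from normality

# The same D statistic by hand, for intuition:
z       <- sort(glucose)
ecdf_hi <- seq_along(z) / length(z)              # ECDF just after each step
ref_cdf <- pnorm(z, mean = mean(z), sd = sd(z))  # fitted normal reference CDF
max(abs(ecdf_hi - ref_cdf))  # close to lillie.test's D (ignoring the lower step edge)
```

The exact KS statistic also checks the gap at the bottom of each ECDF step; lillie.test() handles that detail for you.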
🧠 Now in Plain English (No Formulas)
You line up your data from smallest to largest and draw a step-wise curve showing how it accumulates.
You overlay a perfect bell-shaped curve, stretched to fit your data’s mean and SD.
You measure the biggest gap between your step-curve and the ideal curve.
If the gap is too big, your data isn't normally distributed.
✅ Pros
| Feature | Benefit |
| --- | --- |
| No need to know population mean/SD | Perfect for real-world data where parameters are unknown |
| Good for mid-to-large n | Robust and interpretable even with 1000+ observations |
| Sensitivity in tails | Picks up deviations at extreme high or low values |
| Graph-friendly | Pairs well with ECDF plots for visual inspection |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Less powerful than Shapiro–Wilk for small n | May fail to detect subtle skewness if n < 50 |
| Not designed for multivariate normality | Like Shapiro–Wilk, it's univariate |
| Over-sensitive with large n | Like most tests, it may flag trivial differences when n is very large |
| Less intuitive p-value logic | Simulated critical values are needed for significance thresholds |
🩺 Clinical Analogy
Suppose you have 100 patients’ fasting glucose levels.
You sort them and draw a “stair-step” curve showing how many patients fall below each glucose level.
You lay over a smooth “healthy patient” bell curve.
If your actual data steps away too far from the smooth ideal? Something’s off—maybe skew, heavy tails, or outliers.
That’s what Lilliefors is checking.
🔍 M3 – Heuristic Rule: 2 × SD < Mean
📍 When to Use It (Indications)
Quick screening in clinical datasets:
Commonly applied to positively skewed, non-negative variables like:
Length of Stay (LOS)
Hospital charges
Biomarkers (e.g., CRP, D-dimer)
Patient-reported scores (on unipolar scales)
When graphical tools or formal tests aren’t feasible:
Used in exploratory analysis, dashboards, or pre-modeling pipelines for triage.
As a flag before transformation:
Indicates when a variable may need log transformation or non-parametric handling.
⚙️ How It Works – Intuition & Calculation
This rule assumes a basic property of right-skewed distributions: their spread (SD) is large compared to their central tendency (mean).
Formula:
$$\text{If } 2 \times \text{SD} < \text{Mean} \Rightarrow \text{Likely Normal or Left-Skewed}$$
$$\text{If } 2 \times \text{SD} \geq \text{Mean} \Rightarrow \text{Right-Skewed}$$
Why “2×SD”? Because in a normal distribution:
About 95% of data lies within ±2 SDs of the mean.
For a non-negative variable, 2×SD exceeding the mean would push the lower end of that 95% range below zero, which is impossible. The extra spread must therefore come from a long right tail rather than a symmetric bell.
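A minimal sketch of the rule as a screening function (flag_right_skew() is a hypothetical helper name; the LOS vector is invented so its mean matches the ICU example that follows):

```r
# Heuristic flag: 2*SD >= mean suggests right-skew in a non-negative variable
flag_right_skew <- function(x) {
  stopifnot(all(x >= 0, na.rm = TRUE))  # the rule is meaningless for negative values
  2 * sd(x, na.rm = TRUE) >= mean(x, na.rm = TRUE)
}

los <- c(2, 3, 4, 5, 6, 7, 8, 9, 12, 24)  # hypothetical ICU stays (days), mean = 8
flag_right_skew(los)                      # TRUE: 2*SD exceeds the mean, likely right-skew
```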
🧠 Now in Plain Language
Let’s say you measure LOS in ICU:
Mean = 8 days
SD = 5 days
2 × SD = 10 → 10 > 8 → This suggests right-skew
It doesn’t test normality rigorously. Instead, it waves a yellow flag:
“Watch out — this might not be symmetric; better check the distribution.”
Think of it as a quick visual cue in numeric form.
✅ Pros
| Feature | Benefit |
| --- | --- |
| Ultra-fast to compute | No coding or stats required, just two values |
| Good for pre-modeling screening | Flags likely-skewed variables before regression or comparison tests |
| Clinically interpretable | Taps into real-world understanding of variable distributions (e.g., LOS rarely looks like a bell curve) |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Not a statistical test | Doesn't give p-values or control for sample size |
| Fails in symmetric but non-normal data | Can misclassify if tails are heavy or the distribution is bimodal |
| Breaks on near-zero or negative values | A mean close to 0, or data with negative values, yields nonsense |
| Overly sensitive to outliers | A few large values can distort the SD and mislead the rule |
🩺 Clinical Analogy
Imagine you’re estimating ICU LOS across patients:
If the average stay is 6 days and 2×SD is 14, something’s off — some patients must be staying much longer, dragging the curve to the right.
This quick rule helps you spot that asymmetry without needing a histogram.
🔄 Summary
Not a test—a flag.
Great in data-rich clinical environments to triage variables.
Useful prelude to proper tools (e.g., Shapiro–Wilk, transformations).
🔍 M4a – Correlation-Based QQ Alignment (Pearson r ≥ 0.95)
📍 When to Use It (Indications)
When you want an algorithmic version of QQ plot interpretation:
Converts the visual judgment of "normal-looking" plots into a quantitative cutoff.
Medium-to-large datasets (n ≥ 30):
Especially useful when Shapiro–Wilk becomes over-sensitive or you want to bypass p-value pitfalls.
When batch-scanning many variables:
Can be run on hundreds of variables automatically, flagging which ones are far from normal by rank-based alignment.
As a validation step for normality transformations:
Helps confirm whether log-transform, Box-Cox, or Winsorization brought a variable closer to symmetry.
⚙️ How It Works – Formula Logic
This method compares:
Quantiles of your sample (the sorted values)
With quantiles of a standard normal distribution (i.e., ideal bell curve values)
Then calculates the Pearson correlation between the two sets of quantiles:

$$r = \operatorname{corr}\left( Q_{\text{sample}}, Q_{\text{normal}} \right)$$

Where:
$Q_{\text{sample}}$: quantiles of your actual data
$Q_{\text{normal}}$: theoretical quantiles under $N(0, 1)$
A high Pearson r (≥ 0.95) means:
Your data lies almost perfectly along the expected QQ plot line → likely normal
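A minimal sketch, assuming base R's ppoints() plotting positions for the theoretical quantiles (qq_r() is a hypothetical helper name):

```r
# QQ-alignment check: Pearson correlation between sample and normal quantiles
qq_r <- function(x) {
  x <- sort(x[!is.na(x)])                # sample quantiles (sorted values)
  q_normal <- qnorm(ppoints(length(x)))  # theoretical N(0,1) quantiles
  cor(x, q_normal)                       # Pearson r along the QQ line
}

set.seed(1)
qq_r(rnorm(200))  # near 1: passes an r >= 0.95 rule
qq_r(rexp(200))   # bent QQ line: r typically lands well below the cutoff

# Batch-scanning every numeric column of a data frame (my_df is hypothetical):
# sapply(Filter(is.numeric, my_df), function(v) qq_r(v) >= 0.95)
```

Because Pearson r is location- and scale-invariant, the raw data need not be standardized first.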
🧠 Now in Plain Language
You line up your data from smallest to largest.
You ask: “Where would these values fall if they were truly from a normal distribution?”
You draw both sets as dots — actual vs. expected.
Then: how straight is that line?
If it’s perfectly straight → correlation = 1.
If it’s bent/skewed → correlation drops.
You decide:
“If r ≥ 0.95, I’ll treat this as normal.”
It’s like putting a straight-edge ruler over your QQ plot and checking the fit — but with math.
✅ Pros
| Feature | Benefit |
| --- | --- |
| No p-value overinterpretation | Focuses on pattern, not statistical significance |
| Visual + numeric hybrid | Replaces subjective QQ plot reading with an objective threshold |
| Great for automation | Can be used in pipelines screening 100+ variables |
| Not disrupted by minor outliers | Moderately robust compared with variance-sensitive tests |
❌ Cons
| Limitation | Impact |
| --- | --- |
| Arbitrary threshold (0.95) | No universal justification; may need tuning by domain |
| Not a test, so no p-value | Can't say how unlikely the observed r is under true normality |
| Only tests linearity | Doesn't distinguish skew from kurtosis problems; both just lower r |
| Sample-size sensitivity | In very small samples, correlation may be high by chance; in large samples, tiny deviations lower r |
🩺 Clinical Analogy
Think of checking whether a patient’s BP readings follow a standard 24-hour circadian pattern:
If they follow the expected ups/downs closely → high correlation → “normal rhythm”
If the pattern is jagged, inconsistent → correlation drops → abnormal rhythm
This method checks your data’s rhythm against the ideal bell curve.
🧪 Threshold Justification
The r ≥ 0.95 threshold is empirical: it balances specificity vs sensitivity based on simulation studies.
It’s not sacred — you can adjust:
r ≥ 0.97 → stricter
r ≥ 0.93 → more lenient
🔍 M4b – MAD-Based Deviation from the QQ Line (MAD ≤ 0.15)
📍 When to Use It (Indications)
When you're analyzing data prone to outliers:
Ideal for healthcare cost, ICU stay, biomarker spikes — where extreme values exist, but you want to focus on the overall pattern.
When you want to quantify the shape mismatch in QQ plots:
Rather than asking “is the line straight?” (M4a), this asks “how far off are these points, on average?”
For visual normality assessments you want to standardize:
Use this to replace human-rater variability in training sets or simulation pipelines.
Useful in NLP/ML pipelines where robustness matters:
When building risk scores, lab data normality affects modeling choice — this method can help automate preprocessing logic.
⚙️ How It Works – Formula Logic
1. Standardize your data:

$$z_i = \frac{x_i - \bar{x}}{s}$$

Where:
$x_i$ = original data
$\bar{x}$ = sample mean
$s$ = SD
2. Generate theoretical normal quantiles:
Sort your standardized values z(i)
Compare to expected quantiles from a standard normal distribution (e.g., via qnorm() in R)
3. Compute the mean absolute deviation from the line:

$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| z_{(i)} - q_i \right|$$

where $q_i$ is the i-th theoretical standard normal quantile.
4. Decision rule:
If MAD ≤ 0.15 → distribution considered “close enough” to normal
Else → potentially non-normal
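A minimal sketch mirroring the four steps above, under the same ppoints() assumption as the M4a snippet (qq_mad() is a hypothetical helper name):

```r
# MAD-based QQ deviation: average absolute gap between standardized sample
# quantiles and theoretical standard normal quantiles
qq_mad <- function(x) {
  z <- sort((x - mean(x)) / sd(x))  # step 1: standardize, then sort
  q <- qnorm(ppoints(length(z)))    # step 2: theoretical N(0,1) quantiles
  mean(abs(z - q))                  # step 3: mean absolute deviation
}

set.seed(3)
qq_mad(rnorm(150))              # typically well under 0.15: "close enough" to normal
qq_mad(rlnorm(150, sdlog = 1))  # clearly above 0.15: skew pulls the quantiles apart

qq_mad(rnorm(150)) <= 0.15      # step 4: the decision rule as a logical flag
```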
🧠 Now in Plain Language
You scale your data to behave like a standard normal bell curve (mean 0, SD 1).
Then you ask: “If this were truly normal, where should each value fall?”
You measure how far off each data point is — take the average of those deviations.
If, on average, data points are very close (≤ 0.15) to where they should be, you say:
“This variable walks the line — it’s close enough to normal.”
Think of this as checking how much “wobble” your data has around the perfect QQ line.
✅ Pros
| Feature | Benefit |
| --- | --- |
| Outlier-resistant | Uses absolute (not squared) differences, so less distortion from extreme values |
| Good at detecting curvature | Picks up on subtle S-shapes or U-bends in QQ plots |
| Numerically stable | Doesn't depend on correlation, variance, or p-values |
| Straightforward threshold | MAD ≤ 0.15 is intuitive and consistent |
❌ Cons
| Limitation | Impact |
| --- | --- |
| No formal test | No p-value or simulation-based cutoff; it's a heuristic threshold |
| Threshold (0.15) is empirical | Based on calibration to human rating; needs justification in new domains |
| Assumes linearity is ideal | May flag perfectly symmetric but heavy-tailed data as non-normal |
| Requires standardized data | Can't be used on raw units; adds a preprocessing step |
🩺 Clinical Analogy
Think of comparing a patient’s ECG trace to a healthy standard:
You check how far off each beat is from the expected line — not whether the pattern is straight, but how wobbly it is.
If the average deviation is small, you say:
“Looks good overall — no clinical concern.”
This method measures that “average wobble” — point-by-point mismatch from normality.
🔬 Threshold Justification
The 0.15 cutoff comes from pilot visual studies aligning MAD with expert ratings.
Adjusting it:
MAD ≤ 0.10 → stricter, more false positives
MAD ≤ 0.20 → looser, fewer rejections