SOP: Resolving Split Hit Patterns in Microbial Identification with Statistical and Biological Confirmation

Mayta
Oct 21, 2025

Updated: Oct 24, 2025

Goal: make a safe, defensible call (strain vs species) when top hits and lower hits disagree.

Inputs you need (from your search output)

For BLAST/aligners: E-value, bit score, query coverage (often qcovs), % identity, alignment length
For read mapping/WGS: breadth (% of gene covered), depth (× coverage) per marker gene
For MALDI-TOF: instrument “score category” (use vendor “high-confidence” tier as species-level; treat “low/borderline” as screening only)

Step 0 — Filter out noise (mandatory)

Keep a hit only if both are true:

E-value ≤ 1e-20
- If your query is short (<400 bp DNA or short peptide), make it stricter (≤ 1e-30).
Query coverage ≥ 70–80% (for organism-level interpretation)
- For marker genes, require breadth ≥ 80% and depth ≥ 10×.

Quick BLAST tip: export outfmt 6 with qcovs and bitscore so you can filter quickly: -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"
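
The Step 0 filter can be sketched in a few lines of Python. This is a minimal illustration, not a validated pipeline: it assumes the tab-separated column order from the tip above, uses 75% as one value inside the 70–80% coverage range, and the function name `filter_hits` and the three example rows are made up for the demonstration.

```python
import csv
import io

# Column order matches the -outfmt string in the tip above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qcovs"]
NUMERIC = ("pident", "evalue", "bitscore", "qcovs")


def filter_hits(handle, max_evalue=1e-20, min_qcovs=75.0):
    """Keep hits passing BOTH Step 0 thresholds, sorted best-first by bit score."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in NUMERIC:
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue and hit["qcovs"] >= min_qcovs:
            kept.append(hit)
    kept.sort(key=lambda h: h["bitscore"], reverse=True)
    return kept


# Invented example rows: one good hit, one with low coverage, one with a weak E-value.
example = (
    "q1\tO157_a\t99.1\t900\t8\t0\t1\t900\t1\t900\t1e-95\t350\t92\n"
    "q1\tK12_a\t93.0\t600\t42\t0\t1\t600\t1\t600\t1e-60\t230\t62\n"
    "q1\tweak\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-5\t50\t90\n"
)
hits = filter_hits(io.StringIO(example))
print([h["sseqid"] for h in hits])  # only O157_a passes both filters
```

For short queries, pass `max_evalue=1e-30` to apply the stricter cutoff.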

Step 1 — Identify the “top cluster”

Look at ranks 1–5 (up to 1–10) after filtering.
A top cluster = several consecutive hits with similar, high bit scores/E-values and similar coverage.
Find the boundary where scores clearly drop (commonly between rank 10 and 11).

Compute two numbers at the boundary:

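Separation Factor (SF)

SF = E-value of the first hit below the boundary / E-value of the last hit in the top cluster

Example: if rank 10 E = 1e-88 and rank 11 E = 1e-60 → SF = 1e-60 / 1e-88 = 1e28.

Bit-score gap (ΔS)

ΔS = bit score of the last hit in the top cluster − bit score of the first hit below the boundary

Example: if bit scores are 210 (rank 10) and 188 (rank 11) → ΔS = 22.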

Heuristics: SF ≥ 100 and/or ΔS ≥ 20 bits ⇒ top cluster is meaningfully stronger.

Excel tip (E-values in cells E10 and E11, bit scores in F10, F11):

SF: =E11/E10
ΔS: =F10 - F11
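
Outside Excel, the same two numbers can be computed in Python. The sketch below works with log10(SF) rather than SF itself, so that very small E-values stay easy to compare; SF ≥ 100 then becomes log10(SF) ≥ 2. The helper names are hypothetical, "largest single drop in bit score" is only one possible way to locate the boundary, and "and/or" is read here as "either criterion suffices", which is an assumption.

```python
import math


def log10_sf(e_top_last, e_below_first):
    """log10 of SF = E(first hit below boundary) / E(last hit in top cluster).

    BLAST prints extremely small E-values as 0.0; those are not handled here
    (math.log10 raises ValueError for 0).
    """
    return math.log10(e_below_first) - math.log10(e_top_last)


def find_boundary(bitscores):
    """Return (i, drop): the largest drop lies between bitscores[i] and bitscores[i + 1].

    Expects bit scores sorted best-first, after Step 0 filtering.
    """
    drops = [bitscores[i] - bitscores[i + 1] for i in range(len(bitscores) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return i, drops[i]


def is_separated(log_sf, delta_s, min_log_sf=2.0, min_delta_s=20.0):
    """Heuristic from the text: SF >= 100 (log10 >= 2) or delta S >= 20 bits."""
    return log_sf >= min_log_sf or delta_s >= min_delta_s


# Numbers from the two examples above.
log_sf = log10_sf(1e-88, 1e-60)
delta_s = 210 - 188
print(round(log_sf, 6), delta_s, is_separated(log_sf, delta_s))

# Invented score list with a clear drop after the fourth hit.
print(find_boundary([215, 214, 212, 210, 188, 186]))
```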

Step 2 — If the top cluster is consistent & separated

Criteria: SF ≥ 100 and/or ΔS ≥ 20 bits, coverage ≥ 80%, and %ID reasonable for the locus (DNA ≥95–97% for species-level; strain claims demand more).
Outcome: treat the top cluster as statistically trustworthy, but only tentatively.
Do not name the strain yet—go to Step 3 (biology).

Step 3 — Demand biological confirmation (markers)

Look for strain-specific markers for the candidate strain. Example for E. coli O157:H7:

rfbE (O157), fliC-H7 (H7), stx1 / stx2, eae

Call a marker “present” only if all are true:

Identity ≥ 95–97%
Breadth ≥ 80% of the gene
Depth ≥ 10× (≥20× preferred for clinical decisions)
No conflicting best hit to a non-target allele

Interpretation:

Markers present (≥2 key markers) → you can name the strain (with a note of marker support).
Markers absent/weak → do not call strain. Report species-level and recommend confirmatory tests (PCR/serotype/phenotype).
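
The marker rules above can be written as a small check. This is a sketch under stated assumptions: "depth" is taken as the mean per-base depth across the gene, "breadth" as the percentage of positions with at least one read, and the 95% end of the identity range is used. The class, function names, and per-base depth lists are invented for illustration; in practice the depths would come from your read-mapping output.

```python
from dataclasses import dataclass, field


@dataclass
class MarkerResult:
    name: str
    identity: float                               # percent identity to the target allele
    depths: list = field(default_factory=list)    # per-base read depth across the gene
    conflicting_best_hit: bool = False            # best hit is a non-target allele


def breadth_and_depth(depths):
    """Return (breadth %, mean depth) from per-base depths (assumed definitions)."""
    covered = sum(1 for d in depths if d >= 1)
    return 100.0 * covered / len(depths), sum(depths) / len(depths)


def marker_present(m, min_identity=95.0, min_breadth=80.0, min_depth=10.0):
    """A marker counts as present only if all four conditions hold."""
    breadth, depth = breadth_and_depth(m.depths)
    return (m.identity >= min_identity and breadth >= min_breadth
            and depth >= min_depth and not m.conflicting_best_hit)


def can_name_strain(markers, min_markers=2):
    """Strain call needs at least two key markers present."""
    return sum(marker_present(m) for m in markers) >= min_markers


# Invented data: two well-covered markers, one covered over only half the gene.
panel = [
    MarkerResult("rfbE", 99.0, [25] * 1000),
    MarkerResult("fliC-H7", 98.0, [22] * 900 + [0] * 100),
    MarkerResult("stx2", 99.0, [3] * 500 + [0] * 500),
]
print([(m.name, marker_present(m)) for m in panel], can_name_strain(panel))
```

For clinical decisions, calling `marker_present(m, min_depth=20.0)` applies the preferred ≥20× depth.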

Step 4 — If the top cluster is not clearly separated

Use Lowest Common Ancestor (LCA): report at species (e.g., Escherichia coli).
Avoid strain claims.
Consider more data:
- Longer reads, second locus, or WGS
- Re-isolation to rule out mixed sample
- Curated/non-redundant reference set (e.g., RefSeq, or references clustered at 99% identity)
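
As an illustration of the LCA idea, the sketch below takes one lineage per retained hit, ordered from broadest to narrowest rank, and keeps only the prefix that all hits share. The lineages are hard-coded and simplified for the example; a real workflow would look them up in a taxonomy database, which is outside the scope of this sketch.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several lineages (broadest rank first)."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):          # compare rank by rank across all hits
        if all(r == ranks[0] for r in ranks):
            shared.append(ranks[0])
        else:
            break                         # first disagreement ends the shared prefix
    return shared


# Simplified, hand-written lineages for two competing hits.
hits = [
    ["Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli", "O157:H7"],
    ["Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli", "K-12"],
]
lca = lowest_common_ancestor(hits)
print(lca[-1])  # the level to report
```

Here the two hits agree down to the species and disagree only at the strain, so the reportable level is the species.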

Quick Action Cards (what to do in everyday situations)

A) Strong separation, markers present

Do: Call the strain.
Report line: “E. coli O157:H7 confirmed. Strong score separation (SF ≥ 100 and/or ΔS ≥ 20), coverage ≥ 85%. Markers rfbE, fliC-H7, stx2 detected (≥95% id; breadth ≥90%; depth ≥20×). Correlate clinically.”

B) Strong separation, markers absent

Do: Report species-level; advise targeted testing.
Report line: “E. coli identified; strain undetermined. Despite strong top-cluster statistics, O157/H7/toxin markers were not detected at the required thresholds. Recommend PCR/serotyping if the strain affects care.”

C) Weak separation (SF < 100 and ΔS < 20)

Do: LCA to species; get more data (extra loci/WGS).
Report line: “E. coli complex; current locus is non-discriminatory. Additional loci or WGS recommended.”

D) Two strong clusters (possible mixture)

Do: Suspect mixed sample. Re-isolate & re-sequence; or perform read binning.
Report line: “Mixed signal compatible with ≥2 Escherichia strains. Recommend subculture and repeat testing.”
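
The four cards reduce to a short decision function. In this sketch the inputs are assumed to have been computed already (number of strong clusters, whether the top cluster is separated by the SF/ΔS heuristic, and whether at least two markers are present), and checking for a mixture first is an assumed order of precedence, not something the cards prescribe.

```python
def action_card(n_strong_clusters, separated, markers_present):
    """Map the three findings to a Quick Action Card letter."""
    if n_strong_clusters >= 2:
        return "D"          # possible mixture: re-isolate before anything else
    if not separated:
        return "C"          # weak separation: LCA to species, get more data
    return "A" if markers_present else "B"


ACTIONS = {
    "A": "Call the strain.",
    "B": "Report species-level; advise targeted testing.",
    "C": "LCA to species; get more data (extra loci/WGS).",
    "D": "Suspect mixed sample; re-isolate and re-sequence, or bin reads.",
}

card = action_card(n_strong_clusters=1, separated=True, markers_present=False)
print(card, ACTIONS[card])
```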

Worked Mini-Examples

Example 1 — Likely true top cluster → strain call with markers

Top 1–10: E. coli O157:H7, E = 1e-95 to 1e-88, qcovs 88–92%
11–50: E. coli K-12, E = 1e-60 to 1e-50, qcovs 60–70%
SF = 1e-60 / 1e-88 = 1e28; ΔS = 30 bits
Markers: rfbE, fliC-H7, stx2 present (≥95% id; breadth ≥90%; depth ≥20×)
Call: O157:H7 (marker-supported)

Example 2 — Split but not separated → species-level only

Top 1–8: O157:H7, E = 1e-70–1e-65, cov 72–76%
9–50: K-12, E = 1e-63–1e-60, cov 70–75%
SF ≈ 1e-63/1e-65 = 1e2 (formally ≥100) but ΔS < 10 and coverage modest; no markers
Call: E. coli species, strain undetermined; recommend PCR/serotyping

Example 3 — Conserved region (generic)

Many Escherichia hits, E = 1e-40–1e-38, coverage <50%
Call: Escherichia (or E. coli complex); get longer region/WGS

Troubleshooting Matrix

Symptom	Likely cause	Fix
Long tail of K-12 at low ranks	DB redundancy	Use RefSeq/non-redundant, cluster at 99%
Many similar E-values, low coverage	Conserved/short region	Sequence a longer locus; add a second gene
Great stats but no markers	Not that strain	Report species; run PCR/serotype
Two distinct high clusters	Mixed sample	Subculture; repeat; or bin reads
Lots of weak hits	Loose filters	Enforce E ≤ 1e-20 and coverage ≥ 70–80%

One-Page Checklist (print/pin)

Filter
☐ E ≤ 1e-20 (short fragments ≤ 1e-30)
☐ Query coverage ≥ 70–80% (organism calls)
☐ For markers: breadth ≥ 80%; depth ≥ 10× (≥20× ideal)

Top cluster
☐ Define boundary (rank 10 vs 11 or clear drop)
☐ SF ≥ 100?
☐ ΔS ≥ 20 bits?
☐ Coverage of top hits ≥ 80%?

Biology
☐ Marker panel selected for target strain
☐ Each marker: id ≥ 95–97%, breadth ≥ 80%, depth ≥ 10×
☐ No conflicting best-hit allele

Decision
☐ Strain call (stats + markers)
☐ Species only (stats OK, markers absent/weak)
☐ Mixed sample suspected (two strong clusters)
☐ More data needed (add loci/WGS)

Report (paste)
☐ Included stats (E-value range, SF, ΔS, coverage)
☐ Marker results with thresholds
☐ Recommendation (PCR/serotype/re-isolate/WGS)

Optional: Minimal BLAST settings (safe defaults)

BLASTn DNA: E ≤ 1e-20; require qcovs ≥ 75%; leave gaps/matrix default
BLASTx/tBLASTn (translated, protein-level comparisons): E ≤ 1e-10 for discovery, then ≤ 1e-20 for confirmation
Export outfmt 6 with evalue, bitscore, qcovs for quick SF/ΔS checks
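
One way to keep these settings reproducible is to build the BLASTn command in a script. The sketch below only assembles and prints the command; it does not run BLAST. The file and database names (`query.fasta`, `refseq_subset`, `hits.tsv`) are placeholders, and the qcovs ≥ 75% requirement is assumed to be applied afterwards when filtering the exported table rather than on the command line.

```python
import shlex

# Same column set as the export tip in Step 0.
OUTFMT = ("6 qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore qcovs")


def blastn_command(query_fasta, database, out_tsv, max_evalue=1e-20):
    """Assemble a blastn argument list; gap and scoring options stay at defaults."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(max_evalue),
        "-outfmt", OUTFMT,
        "-out", out_tsv,
    ]


cmd = blastn_command("query.fasta", "refseq_subset", "hits.tsv")
print(shlex.join(cmd))  # shell-quoted command line (Python 3.8+)
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.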

Final take-home

Numbers first (SF, ΔS, coverage), biology next (markers).
Strain names require markers; otherwise report species-level.
When the list “splits,” trust the top cluster only if it’s separated and biologically supported.