SOP: Resolving Split Hit Patterns in Microbial Identification with Statistical and Biological Confirmation
- Mayta
- 9 hours ago
Goal: make a safe, defensible call (strain vs species) when top hits and lower hits disagree.
Inputs you need (from your search output)
For BLAST/aligners: E-value, bit score, query coverage (often qcovs), % identity, alignment length
For read mapping/WGS: breadth (% of gene covered), depth (× coverage) per marker gene
For MALDI-TOF: instrument “score category” (use vendor “high-confidence” tier as species-level; treat “low/borderline” as screening only)
Step 0 — Filter out noise (mandatory)
Keep a hit only if both are true:
E-value ≤ 1e-20
If your query is short (<400 bp DNA or short peptide), make it stricter (≤ 1e-30).
Query coverage ≥ 70–80% (for organism-level interpretation)
For marker genes, require breadth ≥ 80% and depth ≥ 10×.
Quick BLAST tip: export outfmt 6 with qcovs and bitscore so you can filter quickly: -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"
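The Step 0 filter can be sketched in Python as below. This is a minimal illustration, not a validated pipeline: it assumes tab-separated outfmt 6 text with exactly the column order quoted in the tip above, and the function name `filter_hits` and the 75% coverage default (the midpoint of the 70–80% range) are choices made for this example.

```python
import csv
import io

# Column order matches the -outfmt string quoted in the tip above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qcovs"]


def filter_hits(tsv_text, max_evalue=1e-20, min_qcovs=75.0):
    """Keep hits that pass BOTH Step 0 filters, best bit score first.

    max_evalue: use 1e-30 for short queries (<400 bp DNA or short peptides).
    min_qcovs:  illustrative default inside the 70-80% range from the SOP.
    """
    reader = csv.DictReader(io.StringIO(tsv_text),
                            fieldnames=COLUMNS, delimiter="\t")
    kept = []
    for row in reader:
        evalue = float(row["evalue"])
        qcovs = float(row["qcovs"])
        if evalue <= max_evalue and qcovs >= min_qcovs:
            kept.append({
                "sseqid": row["sseqid"],
                "pident": float(row["pident"]),
                "evalue": evalue,
                "bitscore": float(row["bitscore"]),
                "qcovs": qcovs,
            })
    # Rank by bit score so the list is ready for the top-cluster step.
    kept.sort(key=lambda hit: hit["bitscore"], reverse=True)
    return kept
```

Passing `max_evalue=1e-30` applies the stricter cutoff for short queries.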
Step 1 — Identify the “top cluster”
Look at ranks 1–5 (up to 1–10) after filtering.
A top cluster = several consecutive hits with similar, high bit scores/E-values and similar coverage.
Find the boundary where scores clearly drop (commonly between rank 10 and 11).
Compute two numbers at the boundary:
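Separation Factor (SF): \[\mathrm{SF} = \frac{E\text{-value at first hit of lower tier}}{E\text{-value at last hit of top tier}}\]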
Example: if rank 10 E = 1e-88 and rank 11 E = 1e-60 → SF = 1e-60 / 1e-88 = 1e28.
Bit-score gap (ΔS): \[\Delta S = \text{bit score at last hit of top tier} - \text{bit score at first hit of lower tier}\]
Example: if bit scores are 210 (rank 10) and 188 (rank 11) → ΔS = 22.
Heuristics: SF ≥ 100 and/or ΔS ≥ 20 bits ⇒ top cluster is meaningfully stronger.
Excel tip (E-values in cells E10 and E11, bit scores in F10, F11):
SF: =E11/E10
ΔS: =F10 - F11
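For those working outside Excel, the same two numbers can be computed with a short Python sketch. The function names are hypothetical, and `find_boundary` uses one possible rule (the largest consecutive bit-score drop) as a stand-in for "where scores clearly drop"; the SOP itself does not prescribe how to locate the boundary, so check the suggested boundary by eye.

```python
def find_boundary(hits):
    """Suggest a boundary: number of hits before the largest bit-score drop.

    hits: list of (evalue, bitscore) tuples, already filtered and ranked
    best first. Assumption: the biggest consecutive drop marks the boundary.
    """
    if len(hits) < 2:
        raise ValueError("need at least two hits to find a boundary")
    drops = [hits[i][1] - hits[i + 1][1] for i in range(len(hits) - 1)]
    return drops.index(max(drops)) + 1


def separation_stats(hits, boundary):
    """Return (SF, delta_S) for a top cluster of `boundary` hits."""
    if not 0 < boundary < len(hits):
        raise ValueError("boundary must leave at least one hit on each side")
    last_top_e, last_top_bits = hits[boundary - 1]
    first_low_e, first_low_bits = hits[boundary]
    # BLAST can report an E-value of 0.0; treat SF as infinite in that case.
    sf = float("inf") if last_top_e == 0 else first_low_e / last_top_e
    delta_s = last_top_bits - first_low_bits
    return sf, delta_s


def is_separated(sf, delta_s, min_sf=100.0, min_delta_s=20.0):
    """Heuristic from the SOP: SF >= 100 and/or delta_S >= 20 bits."""
    return sf >= min_sf or delta_s >= min_delta_s
```

With rank 10 at E = 1e-88 / 210 bits and rank 11 at E = 1e-60 / 188 bits, `separation_stats(hits, 10)` gives an SF of about 1e28 and a ΔS of 22, matching the examples above.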
Step 2 — If the top cluster is consistent & separated
Criteria: SF ≥ 100 and/or ΔS ≥ 20 bits, coverage ≥ 80%, and %ID reasonable for the locus (DNA ≥95–97% for species-level; strain claims demand more).
Result: the top cluster earns tentative statistical trust.
Do not name the strain yet—go to Step 3 (biology).
Step 3 — Demand biological confirmation (markers)
Look for strain-specific markers for the candidate strain. Example for E. coli O157:H7:
rfbE (O157), fliC-H7 (H7), stx1 / stx2, eae
Call a marker “present” only if all are true:
Identity ≥ 95–97%
Breadth ≥ 80% of the gene
Depth ≥ 10× (≥20× preferred for clinical decisions)
No conflicting best hit to a non-target allele
Interpretation:
Markers present (≥2 key markers) → you can name the strain (with a note of marker support).
Markers absent/weak → do not call strain. Report species-level and recommend confirmatory tests (PCR/serotype/phenotype).
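The marker rules above can be expressed as a small Python sketch. All function names are hypothetical, and two details are assumptions rather than part of the SOP: breadth is taken as the share of positions with depth above zero, and depth is the mean over all positions of the gene.

```python
def breadth_and_depth(per_base_depth):
    """Return (breadth %, mean depth) from per-position read depths.

    Assumptions: a position counts as covered if depth > 0, and mean depth
    is averaged over the whole gene, including uncovered positions.
    """
    if not per_base_depth:
        return 0.0, 0.0
    covered = sum(1 for depth in per_base_depth if depth > 0)
    breadth = 100.0 * covered / len(per_base_depth)
    mean_depth = sum(per_base_depth) / len(per_base_depth)
    return breadth, mean_depth


def marker_present(identity, breadth, depth, best_hit_is_target,
                   min_identity=95.0, min_breadth=80.0, min_depth=10.0):
    """A marker is 'present' only if all four Step 3 conditions hold."""
    return (identity >= min_identity
            and breadth >= min_breadth
            and depth >= min_depth
            and best_hit_is_target)


def strain_supported(marker_calls, min_markers=2):
    """marker_calls: dict of marker name -> bool from marker_present()."""
    return sum(1 for present in marker_calls.values() if present) >= min_markers
```

For clinical decisions, `min_depth=20.0` applies the preferred depth, and `min_identity=97.0` applies the stricter end of the identity range.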
Step 4 — If the top cluster is not clearly separated
Use Lowest Common Ancestor (LCA): report at species (e.g., Escherichia coli).
Avoid strain claims.
Consider more data:
Longer reads, second locus, or WGS
Re-isolation to rule out mixed sample
Curated/non-redundant reference set (RefSeq; clustered at 99%)
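A lowest-common-ancestor call can be sketched as the longest shared prefix of the hits' lineages. This is a simplified illustration: the function name is hypothetical, and it assumes each lineage is already available as a list of names ordered from broadest to most specific rank, which a real workflow would take from a taxonomy database.

```python
def lowest_common_ancestor(lineages):
    """Return the shared lineage prefix across all hits.

    lineages: list of lists of taxon names, broadest rank first.
    The last element of the result is the level that is safe to report.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ragged inputs are handled.
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break
        shared.append(names_at_rank[0])
    return shared
```

For one lineage ending in O157:H7 and another ending in K-12, the shared prefix ends at Escherichia coli, which is the level to report.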
Quick Action Cards
(what to do in everyday situations)
A) Strong separation, markers present
Do: Call the strain.
Report line: “E. coli O157:H7 confirmed. Strong score separation (SF ≥ 100 and/or ΔS ≥ 20), coverage ≥ 85%. Markers rfbE, fliC-H7, stx2 detected (≥95% id; breadth ≥90%; depth ≥20×). Correlate clinically.”
B) Strong separation, markers absent
Do: Report species-level; advise targeted testing.
Report line: “E. coli identified; strain undetermined. Despite strong top-cluster statistics, O157/H7/toxin markers were not detected at the required thresholds. Recommend PCR/serotyping if the strain affects care.”
C) Weak separation (SF < 100 and ΔS < 20)
Do: LCA to species; get more data (extra loci/WGS).
Report line: “E. coli complex; current locus is non-discriminatory. Additional loci or WGS recommended.”
D) Two strong clusters (possible mixture)
Do: Suspect mixed sample. Re-isolate & re-sequence; or perform read binning.
Report line: “Mixed signal compatible with ≥2 Escherichia strains. Recommend subculture and repeat testing.”
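The four cards can be condensed into one decision function, sketched below. The name `action_card` and its parameters are illustrative only; the sketch assumes the coverage and % identity criteria from Step 2 were already checked upstream, and that the number of strong clusters has been judged separately.

```python
def action_card(sf, delta_s, markers_present, strong_clusters=1,
                min_sf=100.0, min_delta_s=20.0, min_markers=2):
    """Map separation statistics and marker count to card A, B, C, or D.

    markers_present: number of key markers that met the Step 3 thresholds.
    strong_clusters: number of distinct high-scoring clusters observed.
    """
    if strong_clusters >= 2:
        return "D", "Suspect mixed sample; re-isolate and re-sequence"
    if sf < min_sf and delta_s < min_delta_s:
        return "C", "LCA to species; get more data"
    if markers_present >= min_markers:
        return "A", "Call the strain"
    return "B", "Report species-level; advise targeted testing"
```

The mixture check comes first on purpose: two strong clusters undermine any strain call regardless of how good the separation statistics look.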
Worked Mini-Examples
Example 1 — Likely true top cluster → strain call with markers
Top 1–10: E. coli O157:H7, E = 1e-95 to 1e-88, qcovs 88–92%
11–50: E. coli K-12, E = 1e-60 to 1e-50, qcovs 60–70%
SF = 1e-60 / 1e-88 = 1e28; ΔS = 30 bits
Markers: rfbE, fliC-H7, stx2 present (≥95% id; breadth ≥90%; depth ≥20×)
Call: O157:H7 (marker-supported)
Example 2 — Split but not separated → species-level only
Top 1–8: O157:H7, E = 1e-70–1e-65, cov 72–76%
9–50: K-12, E = 1e-63–1e-60, cov 70–75%
SF = 1e-63 / 1e-65 = 1e2 at the rank 8/9 boundary (formally ≥100), but ΔS < 10 and coverage is modest; no markers detected
Call: E. coli species, strain undetermined; recommend PCR/serotyping
Example 3 — Conserved region (generic)
Many Escherichia hits, E = 1e-40–1e-38, coverage <50%
Call: Escherichia (or E. coli complex); get longer region/WGS
Troubleshooting Matrix
| Symptom | Likely cause | Fix |
| Long tail of K-12 at low ranks | DB redundancy | Use RefSeq/non-redundant, cluster at 99% |
| Many similar E-values, low coverage | Conserved/short region | Sequence a longer locus; add a second gene |
| Great stats but no markers | Not that strain | Report species; run PCR/serotype |
| Two distinct high clusters | Mixed sample | Subculture; repeat; or bin reads |
| Lots of weak hits | Loose filters | Enforce E ≤ 1e-20 and coverage ≥ 70–80% |
One-Page Checklist (print/pin)
Filter
☐ E ≤ 1e-20 (short fragments ≤ 1e-30)
☐ Query coverage ≥ 70–80% (organism calls)
☐ For markers: breadth ≥ 80%; depth ≥ 10× (≥20× ideal)
Top cluster
☐ Define boundary (rank 10 vs 11 or clear drop)
☐ SF ≥ 100?
☐ ΔS ≥ 20 bits?
☐ Coverage of top hits ≥ 80%?
Biology
☐ Marker panel selected for target strain
☐ Each marker: id ≥ 95–97%, breadth ≥ 80%, depth ≥ 10×
☐ No conflicting best-hit allele
Decision
☐ Strain call (stats + markers)
☐ Species only (stats OK, markers absent/weak)
☐ Mixed sample suspected (two strong clusters)
☐ More data needed (add loci/WGS)
Report (paste)
☐ Included stats (E-value range, SF, ΔS, coverage)
☐ Marker results with thresholds
☐ Recommendation (PCR/serotype/re-isolate/WGS)
Optional: Minimal BLAST settings (safe defaults)
BLASTn DNA: E ≤ 1e-20; require qcovs ≥ 75%; leave gaps/matrix default
BLASTx/tBLASTn: for proteins, E ≤ 1e-10 (discovery) then ≤ 1e-20 (confirmation)
Export outfmt 6 with evalue, bitscore, qcovs for quick SF/ΔS checks
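As one way to keep these settings reproducible, the command line can be assembled in Python. This sketch only builds the argument list; the function name and the file and database names in the usage note are placeholders. Note that the E-value cutoff goes on the command line, while the qcovs ≥ 75% requirement is assumed to be applied afterwards on the exported table.

```python
# Same column set as the outfmt 6 tip in Step 0.
OUTFMT = ("6 qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore qcovs")


def blastn_command(query_fasta, database, out_path, max_evalue="1e-20"):
    """Build a blastn argument list with the SOP's E-value cutoff.

    Gap penalties and scoring are left at their defaults, as the SOP advises.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", max_evalue,
        "-outfmt", OUTFMT,
        "-out", out_path,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`, for example with `cmd = blastn_command("isolate.fasta", "refseq_db", "hits.tsv")`, where all three names are hypothetical.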
Final take-home
Numbers first (SF, ΔS, coverage), biology next (markers).
Strain names require markers; otherwise report species-level.
When the list “splits,” trust the top cluster only if it’s separated and biologically supported.