SOP: Resolving Split Hit Patterns in Microbial Identification with Statistical and Biological Confirmation

Goal: make a safe, defensible call (strain vs species) when top hits and lower hits disagree.


Inputs you need (from your search output)


Step 0 — Filter out noise (mandatory)

Keep a hit only if both are true:

  1. E-value ≤ 1e-20
    • If your query is short (<400 bp DNA or short peptide), make it stricter (≤ 1e-30).
  2. Query coverage ≥ 70–80% (for organism-level interpretation)
    • For marker genes, require breadth ≥ 80% and depth ≥ 10×.

Quick BLAST tip: export outfmt 6 with qcovs and bitscore so you can filter quickly: -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"
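The Step 0 filter can be applied directly to that tabular export. The following is a minimal sketch, assuming the columns arrive in exactly the order of the -outfmt string above; the function name filter_hits, the sample rows, and the subject IDs are made up for illustration, and the 70% default is simply the lower end of the 70–80% range.

```python
import io

# Column order assumed to match the -outfmt "6 ..." string quoted above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qcovs"]


def filter_hits(lines, max_evalue=1e-20, min_qcov=70.0):
    """Yield hits (as dicts) that pass BOTH Step 0 thresholds.

    lines      -- any iterable of tab-separated outfmt 6 lines
    max_evalue -- keep hits with E-value <= this (use 1e-30 for short queries)
    min_qcov   -- keep hits with query coverage (qcovs, percent) >= this
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])      # BLAST may print 0.0
        hit["bitscore"] = float(hit["bitscore"])
        hit["qcovs"] = float(hit["qcovs"])
        if hit["evalue"] <= max_evalue and hit["qcovs"] >= min_qcov:
            yield hit


# Illustrative rows only (hypothetical IDs and values, not real results).
sample = (
    "q1\tsubjA\t99.8\t1450\t3\t0\t1\t1450\t1\t1450\t0.0\t2660\t100\n"
    "q1\tsubjB\t92.0\t300\t24\t0\t1\t300\t1\t300\t1e-25\t120\t21\n"
    "q1\tsubjC\t88.0\t1200\t144\t0\t1\t1200\t1\t1200\t3e-12\t80\t83\n"
)
kept = list(filter_hits(io.StringIO(sample)))
for hit in kept:
    print(hit["sseqid"], hit["evalue"], hit["qcovs"])
```

For short queries, pass max_evalue=1e-30; for a real export, pass an open file handle instead of the in-memory sample.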


Step 1 — Identify the “top cluster”

Compute two numbers at the boundary:

  1. Separation Factor (SF)
     SF = (E-value at start of lower tier) / (E-value at end of top tier)
  2. Bit-score gap (ΔS)
     ΔS = (bit score at last hit of top tier) - (bit score at first hit of lower tier)

Heuristics: SF ≥ 100 and/or ΔS ≥ 20 bits ⇒ top cluster is meaningfully stronger.
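The two boundary numbers can also be computed in a few lines. This is a sketch under stated assumptions: SF is read as the lower-tier E-value divided by the top-tier E-value, ΔS as the top-tier bit score minus the lower-tier bit score, and "and/or" is treated as OR (consistent with weak separation being defined as SF < 100 and ΔS < 20). The function names and the example numbers are illustrative, not taken from a real search.

```python
import math


def separation(e_top_last, e_lower_first, s_top_last, s_lower_first):
    """Return (SF, delta_S) at the boundary between top tier and lower tier."""
    delta_s = s_top_last - s_lower_first
    if e_top_last == 0.0:
        # BLAST prints 0.0 when an E-value underflows, so the ratio is
        # undefined. Treat it as infinite if the lower tier is non-zero,
        # otherwise as "not a number" and rely on delta_S alone.
        sf = math.inf if e_lower_first > 0.0 else math.nan
    else:
        sf = e_lower_first / e_top_last
    return sf, delta_s


def is_separated(sf, delta_s, min_sf=100.0, min_delta_s=20.0):
    """Heuristic from the text: SF >= 100 or delta_S >= 20 bits."""
    # A NaN SF compares False, so the decision falls back to delta_S.
    return sf >= min_sf or delta_s >= min_delta_s


# Illustrative boundary values (hypothetical).
sf, delta_s = separation(e_top_last=1e-150, e_lower_first=1e-140,
                         s_top_last=275.0, s_lower_first=250.0)
print(f"SF = {sf:.3g}, delta_S = {delta_s:.1f} bits, "
      f"separated = {is_separated(sf, delta_s)}")
```

When both boundary E-values are reported as 0.0, the ratio carries no information, which is why the sketch leans on ΔS in that case.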

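Excel tip (E-values in cells E10 and E11, bit scores in F10, F11, with row 10 the last top-tier hit and row 11 the first lower-tier hit): SF is =E11/E10 and ΔS is =F10-F11.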


Step 2 — If the top cluster is consistent & separated


Step 3 — Demand biological confirmation (markers)

Look for strain-specific markers for the candidate strain. Example for E. coli O157:H7:

Call a marker “present” only if all are true:

Interpretation:


Step 4 — If the top cluster is not clearly separated


Quick Action Cards (what to do in everyday situations)

A) Strong separation, markers present

B) Strong separation, markers absent

C) Weak separation (SF < 100 and ΔS < 20)

D) Two strong clusters (possible mixture)


Worked Mini-Examples

Example 1 — Likely true top cluster → strain call with markers

Example 2 — Split but not separated → species-level only

Example 3 — Conserved region (generic)


Troubleshooting Matrix

| Symptom | Likely cause | Fix |
|---|---|---|
| Long tail of K-12 at low ranks | DB redundancy | Use RefSeq/non-redundant, cluster at 99% |
| Many similar E-values, low coverage | Conserved/short region | Sequence a longer locus; add a second gene |
| Great stats but no markers | Not that strain | Report species; run PCR/serotype |
| Two distinct high clusters | Mixed sample | Subculture; repeat; or bin reads |
| Lots of weak hits | Loose filters | Enforce E ≤ 1e-20 and coverage ≥ 70–80% |

One-Page Checklist (print/pin)

Filter
  ☐ E ≤ 1e-20 (short fragments ≤ 1e-30)
  ☐ Query coverage ≥ 70–80% (organism calls)
  ☐ For markers: breadth ≥ 80%; depth ≥ 10× (≥ 20× ideal)

Top cluster
  ☐ Define boundary (rank 10 vs 11 or clear drop)
  ☐ SF ≥ 100?
  ☐ ΔS ≥ 20 bits?
  ☐ Coverage of top hits ≥ 80%?

Biology
  ☐ Marker panel selected for target strain
  ☐ Each marker: identity ≥ 95–97%, breadth ≥ 80%, depth ≥ 10×
  ☐ No conflicting best-hit allele

Decision
  ☐ Strain call (stats + markers)
  ☐ Species only (stats OK, markers absent/weak)
  ☐ Mixed sample suspected (two strong clusters)
  ☐ More data needed (add loci/WGS)

Report (paste)
  ☐ Included stats (E-value range, SF, ΔS, coverage)
  ☐ Marker results with thresholds
  ☐ Recommendation (PCR/serotype/re-isolate/WGS)
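The Biology and Decision rows of the checklist can be expressed as a small decision helper. This is a sketch, not the SOP's authoritative logic. It assumes that every marker in the panel must pass for a strain call, that the lower identity bound (95%) is used, and that weak separation maps to a species-level report with more data requested. The "no conflicting best-hit allele" check is not modeled. Marker names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    name: str
    identity: float  # percent identity to the reference marker
    breadth: float   # percent of the marker length covered
    depth: float     # mean read depth over the marker


def marker_present(m, min_identity=95.0, min_breadth=80.0, min_depth=10.0):
    """A marker counts as present only if all three thresholds are met."""
    return (m.identity >= min_identity
            and m.breadth >= min_breadth
            and m.depth >= min_depth)


def decide(separated, markers, two_strong_clusters=False):
    """Map checklist results to one of the four Decision outcomes."""
    if two_strong_clusters:
        return "mixed sample suspected"
    if not separated:
        # Assumption: weak separation means no strain call yet.
        return "species only; more data needed"
    if markers and all(marker_present(m) for m in markers):
        return "strain call"
    return "species only"


# Hypothetical marker panel results for illustration.
panel = [
    Marker("marker_A", identity=99.1, breadth=100.0, depth=35.0),
    Marker("marker_B", identity=98.4, breadth=92.0, depth=14.0),
]
print(decide(separated=True, markers=panel))
```

Keeping marker_present separate from decide makes it straightforward to tighten a single threshold (for example, min_depth=20.0 for the "ideal" depth) without touching the decision order.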


Optional: Minimal BLAST settings (safe defaults)


Final take-home