
SOP: Resolving Split Hit Patterns in Microbial Identification with Statistical and Biological Confirmation

  • Writer: Mayta
  • 9 hours ago
  • 4 min read

Goal: make a safe, defensible call (strain vs species) when top hits and lower hits disagree.


Inputs you need (from your search output)

  • For BLAST/aligners: E-value, bit score, query coverage (often qcovs), % identity, alignment length

  • For read mapping/WGS: breadth (% of gene covered), depth (× coverage) per marker gene

  • For MALDI-TOF: instrument “score category” (use vendor “high-confidence” tier as species-level; treat “low/borderline” as screening only)


Step 0 — Filter out noise (mandatory)

Keep a hit only if both are true:

  1. E-value ≤ 1e-20

    • If your query is short (<400 bp DNA or short peptide), make it stricter (≤ 1e-30).

  2. Query coverage ≥ 70–80% (for organism-level interpretation)

    • For marker genes, require breadth ≥ 80% and depth ≥ 10×.

Quick BLAST tip: export outfmt 6 with qcovs and bitscore so you can filter quickly: -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"
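The Step 0 filter can be sketched in Python as below. This is a minimal illustration, not a validated pipeline: it assumes the tab-separated column order from the `-outfmt` string quoted above, the function name `filter_hits` is hypothetical, and the default coverage cutoff of 80% is simply the stricter end of the 70–80% range. The two demo rows are made up.

```python
import csv
import io

# Column order matches the -outfmt string quoted in the tip above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qcovs"]
NUMERIC = ("pident", "evalue", "bitscore", "qcovs")


def filter_hits(handle, max_evalue=1e-20, min_qcovs=80.0):
    """Keep only hits that pass BOTH Step 0 filters (E-value and coverage)."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for field in NUMERIC:
            hit[field] = float(hit[field])
        if hit["evalue"] <= max_evalue and hit["qcovs"] >= min_qcovs:
            kept.append(hit)
    return kept


# Made-up rows for illustration only: one strong hit, one noisy hit.
demo = io.StringIO(
    "q1\tsubjA\t99.1\t500\t4\t0\t1\t500\t1\t500\t1e-95\t410\t92\n"
    "q1\tsubjB\t88.0\t200\t24\t0\t1\t200\t1\t200\t1e-12\t95\t40\n"
)
hits = filter_hits(demo)
print([h["sseqid"] for h in hits])
```

For short queries, pass `max_evalue=1e-30` to apply the stricter cutoff.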

Step 1 — Identify the “top cluster”

  • Look at ranks 1–5 (up to 1–10) after filtering.

  • A top cluster = several consecutive hits with similar, high bit scores/E-values and similar coverage.

  • Find the boundary where scores clearly drop (commonly between rank 10 and 11).

Compute two numbers at the boundary:

  1. Separation Factor (SF)

\[ \mathrm{SF} = \frac{\text{E-value at first hit of lower tier}}{\text{E-value at last hit of top tier}} \]

    • Example: if rank 10 E = 1e-88 and rank 11 E = 1e-60 → SF = 1e-60 / 1e-88 = 1e28.

  2. Bit-score gap (ΔS)

\[ \Delta S = \text{bit score at last hit of top tier} - \text{bit score at first hit of lower tier} \]

    • Example: if bit scores are 210 (rank 10) and 188 (rank 11) → ΔS = 22.

Heuristics: SF ≥ 100 and/or ΔS ≥ 20 bits ⇒ top cluster is meaningfully stronger.

Excel tip (E-values in cells E10 and E11, bit scores in F10, F11):

  • SF: =E11/E10

  • ΔS: =F10 - F11
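Outside Excel, the same two numbers can be computed with a few lines of Python. The sketch below is one possible approach and makes two assumptions of its own: it works with log10(SF) rather than SF, and it treats a reported E-value of 0.0 (which BLAST prints when the value is too small to represent) as infinite separation. The function names are hypothetical.

```python
import math


def separation(top_last_evalue, lower_first_evalue, top_last_bits, lower_first_bits):
    """Return (log10 of SF, bit-score gap) at the cluster boundary.

    SF = lower-tier E-value / top-tier E-value, as in the Excel formula =E11/E10.
    """
    if top_last_evalue == 0.0:
        # Top hit reported as 0.0: separation is unbounded unless the lower
        # tier is also 0.0, in which case no separation can be measured.
        log10_sf = math.inf if lower_first_evalue > 0.0 else 0.0
    else:
        log10_sf = math.log10(lower_first_evalue) - math.log10(top_last_evalue)
    delta_s = top_last_bits - lower_first_bits
    return log10_sf, delta_s


def is_separated(log10_sf, delta_s, min_sf=100.0, min_delta_s=20.0):
    """Apply the heuristic: SF >= 100 and/or delta S >= 20 bits."""
    return log10_sf >= math.log10(min_sf) or delta_s >= min_delta_s


# Numbers from the examples in this step (rank 10 vs rank 11).
log10_sf, delta_s = separation(1e-88, 1e-60, 210, 188)
print(round(log10_sf), delta_s, is_separated(log10_sf, delta_s))
```

Working in log space keeps the numbers readable (28 instead of 1e28) and avoids dividing by zero.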

Step 2 — If the top cluster is consistent & separated

  • Criteria: SF ≥ 100 and/or ΔS ≥ 20 bits, coverage ≥ 80%, and %ID reasonable for the locus (DNA ≥95–97% for species-level; strain claims demand more).

  • Result: treat the top cluster as statistically trustworthy, but only tentatively.

  • Do not name the strain yet—go to Step 3 (biology).

Step 3 — Demand biological confirmation (markers)

Look for strain-specific markers for the candidate strain. Example for E. coli O157:H7:

  • rfbE (O157), fliC-H7 (H7), stx1 / stx2, eae

Call a marker “present” only if all are true:

  • Identity ≥ 95–97%

  • Breadth ≥ 80% of the gene

  • Depth ≥ 10× (≥20× preferred for clinical decisions)

  • No conflicting best hit to a non-target allele

Interpretation:

  • Markers present (≥ 2 key markers) → you can name the strain (with a note of marker support).

  • Markers absent/weak → do not call strain. Report species-level and recommend confirmatory tests (PCR/serotype/phenotype).
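The marker rules above can be written as a small check. This is a sketch under stated assumptions: the `MarkerResult` class and both function names are hypothetical, the defaults use the lower end of each threshold (95% identity, 80% breadth, 10× depth), and the three marker measurements are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MarkerResult:
    name: str
    identity: float   # percent identity to the reference allele
    breadth: float    # percent of the gene covered
    depth: float      # mean fold coverage
    conflicting_best_hit: bool = False  # best hit is a non-target allele


def marker_present(m, min_identity=95.0, min_breadth=80.0, min_depth=10.0):
    """A marker counts as present only if ALL four conditions hold."""
    return (m.identity >= min_identity
            and m.breadth >= min_breadth
            and m.depth >= min_depth
            and not m.conflicting_best_hit)


def can_name_strain(markers, min_markers=2):
    """Return (decision, names of markers that passed)."""
    present = [m.name for m in markers if marker_present(m)]
    return len(present) >= min_markers, present


# Made-up measurements: stx2 fails on breadth and depth.
panel = [
    MarkerResult("rfbE", identity=99.2, breadth=96.0, depth=25.0),
    MarkerResult("fliC-H7", identity=98.5, breadth=91.0, depth=22.0),
    MarkerResult("stx2", identity=97.0, breadth=55.0, depth=8.0),
]
decision, supported_by = can_name_strain(panel)
print(decision, supported_by)
```

For clinical decisions, `min_depth=20.0` matches the preferred depth stated above.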

Step 4 — If the top cluster is not clearly separated

  • Use the Lowest Common Ancestor (LCA) of the competing hits: report at species level (e.g., Escherichia coli).

  • Avoid strain claims.

  • Consider more data:

    • Longer reads, second locus, or WGS

    • Re-isolation to rule out mixed sample

    • Curated/non-redundant reference set (RefSeq; clustered at 99%)
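To make the LCA idea concrete, here is a minimal sketch that takes the lineage of each competing hit and returns the deepest rank they all share. It assumes lineages are already available as root-to-leaf lists; the abbreviated lineages below are hand-written for illustration, and a real pipeline would typically look them up from a taxonomy database instead.

```python
def lowest_common_ancestor(lineages):
    """Return the longest root-to-leaf prefix shared by every lineage."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ragged inputs are safe.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hand-written, abbreviated lineages (illustrative only).
o157 = ["Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli", "O157:H7"]
k12 = ["Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli", "K-12"]

lca = lowest_common_ancestor([o157, k12])
print(lca[-1])
```

The last element of the result is the level to report, here the species rather than either strain.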

Quick Action Cards (what to do in everyday situations)

A) Strong separation, markers present

  • Do: Call the strain.

  • Report line: “E. coli O157:H7 confirmed. Strong score separation (SF ≥ 100 and/or ΔS ≥ 20), coverage ≥ 85%. Markers rfbE, fliC-H7, stx2 detected (≥95% id; breadth ≥90%; depth ≥20×). Correlate clinically.”

B) Strong separation, markers absent

  • Do: Report species-level; advise targeted testing.

  • Report line: “E. coli identified; strain undetermined. Despite strong top-cluster statistics, O157/H7/toxin markers were not detected at the required thresholds. Recommend PCR/serotyping if the strain affects care.”

C) Weak separation (SF < 100 and ΔS < 20)

  • Do: LCA to species; get more data (extra loci/WGS).

  • Report line: “E. coli complex; current locus is non-discriminatory. Additional loci or WGS recommended.”

D) Two strong clusters (possible mixture)

  • Do: Suspect mixed sample. Re-isolate & re-sequence; or perform read binning.

  • Report line: “Mixed signal compatible with ≥2 Escherichia strains. Recommend subculture and repeat testing.”
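The four cards reduce to a short decision function. In this sketch the function name is hypothetical, and checking for a mixed sample first is an assumption about priority rather than something the cards state.

```python
def choose_card(separated, markers_ok, two_strong_clusters=False):
    """Map the Step 1-3 outcomes to action card A, B, C, or D."""
    if two_strong_clusters:
        return "D"  # possible mixture: re-isolate before anything else
    if not separated:
        return "C"  # weak separation: LCA to species, get more data
    return "A" if markers_ok else "B"


print(choose_card(separated=True, markers_ok=True))
print(choose_card(separated=True, markers_ok=False))
```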

Worked Mini-Examples

Example 1 — Likely true top cluster → strain call with markers

  • Top 1–10: E. coli O157:H7, E = 1e-95 to 1e-88, qcovs 88–92%

  • 11–50: E. coli K-12, E = 1e-60 to 1e-50, qcovs 60–70%

  • SF = 1e-60 / 1e-88 = 1e28; ΔS = 30 bits

  • Markers: rfbE, fliC-H7, stx2 present (≥95% id; breadth ≥90%; depth ≥20×)

  • Call: O157:H7 (marker-supported)

Example 2 — Split but not separated → species-level only

  • Top 1–8: O157:H7, E = 1e-70–1e-65, cov 72–76%

  • 9–50: K-12, E = 1e-63–1e-60, cov 70–75%

  • SF ≈ 1e-63 / 1e-65 = 1e2 at the rank 8/9 boundary (formally meets the ≥ 100 threshold, but only just), while ΔS < 10, coverage is modest, and no markers were detected

  • Call: E. coli species, strain undetermined; recommend PCR/serotyping

Example 3 — Conserved region (generic)

  • Many Escherichia hits, E = 1e-40–1e-38, coverage <50%

  • Call: Escherichia (or E. coli complex); get longer region/WGS


Troubleshooting Matrix

| Symptom | Likely cause | Fix |
|---|---|---|
| Long tail of K-12 at low ranks | DB redundancy | Use RefSeq/non-redundant, cluster at 99% |
| Many similar E-values, low coverage | Conserved/short region | Sequence a longer locus; add a second gene |
| Great stats but no markers | Not that strain | Report species; run PCR/serotype |
| Two distinct high clusters | Mixed sample | Subculture; repeat; or bin reads |
| Lots of weak hits | Loose filters | Enforce E ≤ 1e-20 and coverage ≥ 70–80% |

One-Page Checklist (print/pin)

Filter
  ☐ E ≤ 1e-20 (short fragments ≤ 1e-30)
  ☐ Query coverage ≥ 70–80% (organism calls)
  ☐ For markers: breadth ≥ 80%; depth ≥ 10× (≥20× ideal)

Top cluster
  ☐ Define boundary (rank 10 vs 11 or clear drop)
  ☐ SF ≥ 100?
  ☐ ΔS ≥ 20 bits?
  ☐ Coverage of top hits ≥ 80%?

Biology
  ☐ Marker panel selected for target strain
  ☐ Each marker: id ≥ 95–97%, breadth ≥ 80%, depth ≥ 10×
  ☐ No conflicting best-hit allele

Decision
  ☐ Strain call (stats + markers)
  ☐ Species only (stats OK, markers absent/weak)
  ☐ Mixed sample suspected (two strong clusters)
  ☐ More data needed (add loci/WGS)

Report (paste)
  ☐ Included stats (E-value range, SF, ΔS, coverage)
  ☐ Marker results with thresholds
  ☐ Recommendation (PCR/serotype/re-isolate/WGS)

Optional: Minimal BLAST settings (safe defaults)

  • BLASTn DNA: E ≤ 1e-20; require qcovs ≥ 75%; leave gaps/matrix default

  • BLASTx/tBLASTn: for proteins, E ≤ 1e-10 (discovery) then ≤ 1e-20 (confirmation)

  • Export outfmt 6 with evalue, bitscore, qcovs for quick SF/ΔS checks
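If you launch BLAST from a script, the BLASTn defaults above can be assembled into a command line like this. It is a sketch only: the file and database names are placeholders, the function name is hypothetical, and the command is printed rather than executed. The qcovs ≥ 75% requirement is not passed to BLAST here; the assumption is that you filter on the exported qcovs column afterwards.

```python
import shlex

# Same tabular columns as the export tip earlier in this post.
OUTFMT = ("6 qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore qcovs")


def blastn_command(query, db, out, evalue=1e-20):
    """Build a blastn argument list; gap and scoring options stay at defaults."""
    return ["blastn",
            "-query", query,
            "-db", db,
            "-evalue", str(evalue),
            "-outfmt", OUTFMT,
            "-out", out]


# Placeholder file and database names.
cmd = blastn_command("query.fasta", "my_reference_db", "hits.tsv")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided BLAST+ is installed and the database exists.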


Final take-home

  • Numbers first (SF, ΔS, coverage), biology next (markers).

  • Strain names require markers; otherwise report species-level.

  • When the list “splits,” trust the top cluster only if it’s separated and biologically supported.

