A Compass For Biosynthetic Gene Cluster Prioritization
High‑throughput genome mining produces vast catalogs of predicted BGCs. Only a fraction can be explored experimentally. novelBGC provides a quantitative, transparent system to prioritize clusters using integrated similarity metrics and novelty cues drawn from multiple tools, guiding effort toward clusters most likely to yield new chemistry.
Reference Similarity (RS): 0–1 (higher = closer to characterized families)
Novelty (N): 0–1 (higher = more distinct)
Introduction
novelBGC is a computational framework that helps researchers navigate the growing landscape of microbial biosynthetic gene clusters (BGCs). Genome mining projects can produce hundreds or even thousands of predicted clusters, so deciding which ones deserve experimental attention is increasingly difficult. novelBGC addresses this with a quantitative scoring system that prioritizes BGCs by their similarity to known references and their potential for novelty.
Core Concept
Instead of rigid labels ("known" vs "orphan"), each cluster is positioned on two continuous axes:
- Reference Similarity (RS): proximity to well-characterized biosynthetic families and known chemistry.
- Novelty (N): distinctiveness suggesting potential for new scaffold or metabolite classes.
Workflow Pipeline
Genome QC (QUAST): Assembly stats (N50, GC, contig lengths) captured for quality flags and completeness cues.
BGC Identification (antiSMASH): Predicts cluster boundaries, core biosynthetic domains, tailoring enzymes, and product hints.
Ref Hits (MIBiG): Per-cluster similarity alignments condensed into coverage / density similarity features.
AA Export (Custom): All CDS translations from BGC proteins gathered into a FASTA bundle for downstream scans.
AMR Scan (CARD RGI): Resistance determinants flagged; optional influence on confidence / prioritization context.
Closest GCF (BiG-SLiCE): Distances to the nearest GCF in BiG-FAM anchor clusters within global chemical space.
GCF Confidence (GCF↔MIBiG): Nearest GCF inspected for curated MIBiG members to gauge exploration maturity.
Feature Fuse (Scoring): Similarity, novelty, quality, and context features normalized into RS and N scores plus interpretive flags.
- Genome Intake & QC — QUAST profiles assembly size, N50, GC, and completeness proxies; summary metrics surface directly in the UI.
- BGC Prediction (antiSMASH) — Core/secondary metabolite clusters delineated; domain calls, tailoring enzymes, and product class hints archived.
- Reference Anchoring — antiSMASH-internal MIBiG similarity hits recorded; per‑cluster alignment density captured as similarity features.
- Protein Extraction — Custom parser pulls all predicted BGC CDS translations into a unified FASTA bundle (per cluster provenance retained).
- AMR Context (CARD RGI) — Extracted proteins scanned for resistance determinants; presence/strength becomes optional caution / interest flags.
- Global Relational Placement (BiG-SLiCE) — All predicted clusters queried against the BiG-FAM database; nearest Gene Cluster Families (GCFs) and distance metrics retrieved.
- GCF Lineage Check — For each nearest GCF, detect whether at least one member is a curated MIBiG BGC; a curated member indicates the family is already partly explored, which tempers the novelty signal.
- Feature Synthesis — Normalized similarity, novelty, quality, and context features assembled; penalties (edge / fragmentation) applied; RS & N scores emitted with interpretive flags.
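The GCF lineage check in the steps above can be sketched as follows. This is a minimal illustration, not novelBGC's actual code: the function name `gcf_has_mibig_member` and the input shape (a list of member accession strings for the nearest GCF) are assumptions. The only detail taken from MIBiG itself is the accession format, `BGC` followed by seven digits.

```python
import re

# MIBiG accessions look like "BGC0000001", optionally with a ".N" version suffix.
MIBIG_ACCESSION = re.compile(r"^BGC\d{7}(\.\d+)?$")


def gcf_has_mibig_member(member_ids):
    """Return True if at least one GCF member ID is a curated MIBiG accession.

    `member_ids` is a hypothetical list of accession strings for the
    members of the nearest GCF; how that list is obtained is out of scope.
    """
    return any(MIBIG_ACCESSION.match(m.strip()) for m in member_ids)
```

A family with at least one curated member would then feed the GCF confidence feature, while a family with none leaves the novelty signal unchallenged.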
Score Interpretation
- Novel: High N, low RS; little resemblance to known GCFs.
- Likely Novel: Elevated N with moderate RS; warrants manual review.
- Uncertain: Conflicting signals or low confidence (fragmented / edge).
- Known / Similar: High RS; close to characterized families.
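One way to turn the four categories above into a rule is sketched below. The thresholds (`high=0.7`, `low=0.3`), the function name `interpret`, and the `low_confidence` flag are all assumptions chosen for illustration; the source does not state the actual cutoffs.

```python
def interpret(rs, n, low_confidence=False, high=0.7, low=0.3):
    """Map RS and N scores (both in [0, 1]) to an interpretation label.

    Thresholds are illustrative placeholders, not novelBGC's real cutoffs.
    """
    if low_confidence:
        # Fragmented or contig-edge clusters are not trusted either way.
        return "Uncertain"
    if n >= high and rs <= low:
        return "Novel"
    if n >= high and rs < high:
        # Elevated novelty with moderate reference similarity.
        return "Likely Novel"
    if rs >= high and n < high:
        return "Known / Similar"
    # Remaining cases: both scores high (conflicting) or neither decisive.
    return "Uncertain"
```

For example, under these placeholder thresholds a cluster with RS = 0.5 and N = 0.8 falls into the "Likely Novel" band, which the list above marks for manual review.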
Why Prioritization Matters
Experimental validation is slow and resource intensive. Redundancy in predicted clusters obscures the handful with genuine novelty potential.
novelBGC surfaces clusters that are both sufficiently dissimilar to reference space and technically reliable enough to warrant follow-up.
User Customization
- Adjust weight matrices (similarity vs novelty emphasis)
- Tune GCF distance cutoffs
- Apply contig edge / fragmentation penalties
- Skip the antiSMASH step for faster analysis
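The options above could be grouped into a single settings object, sketched here. Every field name, default value, and the `PrioritizationConfig` class itself are hypothetical; they only mirror the four customization points listed and do not describe novelBGC's real configuration interface.

```python
from dataclasses import dataclass, field


def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


@dataclass
class PrioritizationConfig:
    # Hypothetical feature names; weights are normalized after construction.
    similarity_weights: dict = field(
        default_factory=lambda: {"mibig_hit_density": 1.0, "gcf_confidence": 1.0}
    )
    novelty_weights: dict = field(default_factory=lambda: {"gcf_distance": 1.0})
    gcf_distance_cutoff: float = 1.0    # placeholder value, not a real default
    edge_penalty: float = 0.1           # placeholder value
    fragmentation_penalty: float = 0.1  # placeholder value
    run_antismash: bool = True          # False would reuse existing predictions

    def __post_init__(self):
        self.similarity_weights = normalize_weights(self.similarity_weights)
        self.novelty_weights = normalize_weights(self.novelty_weights)
```

Normalizing the weights at construction time keeps user-supplied emphasis values (for example 3:1) comparable across runs without requiring them to sum to 1 by hand.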
Use Cases
- Novel RiPPs: Find distinctive lanthipeptide / RiPP clusters with weak reference similarity.
- Genome Re-analysis: Uncover overlooked partial / divergent clusters in familiar organisms.
- Screening Pipelines: Integrate RS–N scores alongside metabolomics or expression layers.
- Comparative Panels: Rank clusters across related strains for innovation potential.
How Scoring Works
Features are normalized to the 0–1 range and combined via adjustable weights into the RS and N scores. Penalties apply for low completeness or conflicting similarity signals. The contributing features are:
- MIBiG hit density
- Distance to GCF
- GCF confidence
- Resistance gene presence
- Edge proximity
# features f1..fn
RS = Σ w_i * similarity_feature_i
N = Σ v_j * novelty_feature_j - penalties
# both clamped to [0,1]
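The formula above can be written as a short runnable sketch. The helper names (`clamp01`, `weighted_sum`, `score`) and the assignment of particular features to the similarity or novelty side are assumptions made for illustration; only the weighted-sum, penalty, and clamping structure comes from the formula itself.

```python
def clamp01(x):
    """Restrict a value to the [0, 1] interval."""
    return max(0.0, min(1.0, x))


def weighted_sum(features, weights):
    """Sum weight * feature over every feature named in `weights`."""
    return sum(weights[name] * features[name] for name in weights)


def score(similarity, novelty, w, v, penalties=0.0):
    """Return (RS, N) from normalized features.

    similarity, novelty: dicts of feature name -> value in [0, 1]
    w, v: dicts of feature name -> weight (hypothetical feature names)
    penalties: total penalty subtracted from N, as in the formula above
    """
    rs = clamp01(weighted_sum(similarity, w))
    n = clamp01(weighted_sum(novelty, v) - penalties)
    return rs, n


rs, n = score(
    similarity={"mibig_hit_density": 0.2, "gcf_confidence": 0.5},
    novelty={"gcf_distance": 0.8},
    w={"mibig_hit_density": 0.5, "gcf_confidence": 0.5},
    v={"gcf_distance": 1.0},
    penalties=0.1,
)
```

Clamping after the penalty is subtracted means a heavily penalized cluster bottoms out at N = 0 rather than going negative.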