
Reproducibility

Run the same scan twice on the same commit — you get the same findings, same severities, same gate verdict. Bit-identical output. This page explains how.

Most scanning platforms don’t do this. LLMs are stochastic. Scanner rules update silently. Version drift between CI runs produces different results. We’ve engineered against all three.

The Three Sources of Non-Determinism (and how we kill each)

1. Stochastic LLM output → temperature=0 + cache

Every AI triage call to GPT-4o, Gemini, and Claude uses temperature=0 (greedy decoding, no sampling). For a fixed set of model weights, the output is deterministic.
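As a minimal sketch, a pinned triage request looks like this (the model name, prompt text, and request shape are illustrative assumptions, not the production prompt):

```python
# Every triage request pins temperature=0 so decoding is greedy:
# the highest-probability token is always chosen, never sampled.
triage_request = {
    "model": "gpt-4o",
    "temperature": 0,  # greedy decoding: no sampling
    "messages": [
        {"role": "system", "content": "You triage static-analysis findings."},
        {"role": "user", "content": "Rule: ... Code: ... True positive?"},
    ],
}

assert triage_request["temperature"] == 0
```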

But model providers silently roll weights. To pin output against provider drift, every LLM response is cached in Postgres:

CREATE TABLE ai_triage_cache (
    cache_key      VARCHAR(128) PRIMARY KEY,
    rule_id        VARCHAR(128),
    code_hash      VARCHAR(64),
    model_version  VARCHAR(64),
    prompt_version VARCHAR(32),
    result_json    TEXT,
    hit_count      INT DEFAULT 0
);

cache_key = sha256(rule_id || code_hash || model_version || prompt_version): identical input → identical cache hit → identical output, forever. Until you explicitly invalidate (by bumping prompt_version).
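The key derivation can be sketched like this (field names come from the table above; the concatenation separator is an assumption for illustration):

```python
import hashlib

def triage_cache_key(rule_id: str, code_hash: str,
                     model_version: str, prompt_version: str) -> str:
    """Derive the deterministic cache key for one triage call.

    The '|' separator is an illustrative choice; any unambiguous join
    of the four fields gives the same stability property.
    """
    payload = "|".join([rule_id, code_hash, model_version, prompt_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical inputs always map to the same key (a stable cache hit)...
k1 = triage_cache_key("py.sqli.audit", "ab12cd", "gpt-4o-2024-08-06", "v3")
k2 = triage_cache_key("py.sqli.audit", "ab12cd", "gpt-4o-2024-08-06", "v3")
assert k1 == k2

# ...and bumping prompt_version yields a new key, invalidating the entry.
k3 = triage_cache_key("py.sqli.audit", "ab12cd", "gpt-4o-2024-08-06", "v4")
assert k3 != k1
```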

2. Drifting scanner rules → pinned tool versions

Scanner tools are pinned in the Dockerfile via ARG:

ARG SEMGREP_VERSION=1.95.0
ARG CHECKOV_VERSION=3.2.334
ARG TRIVY_VERSION=0.70.0
ARG GITLEAKS_VERSION=8.21.2
ARG HADOLINT_VERSION=2.12.0
ARG JSCPD_VERSION=4.0.4

Every scan records the tool versions that ran in ScanResult.tool_versions_json. You can prove which rules were active for any historical scan.

3. Drifting rule catalogs → rule version tagging

Every finding stores rule_version — the version of the rule catalog active when the scan ran. If we ever update Semgrep rules, old scans keep their rule_version stamp. Re-running with the new catalog produces a different rule_version + different cache key — fresh results, audit trail intact.

Verifying Reproducibility

Run two scans on the same commit:

SCAN_A=$(curl -sf -X POST -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/trigger/$REPO_ID" | jq -r .id)
# wait for completion...

SCAN_B=$(curl -sf -X POST -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/trigger/$REPO_ID" | jq -r .id)
# wait for completion...

# Compare
diff <(curl -sf -H "Authorization: Bearer $JWT" \
        "https://codestax.co/api/scans/$SCAN_A/issues" \
        | jq -S '.[] | {fingerprint, severity, rule_id}') \
     <(curl -sf -H "Authorization: Bearer $JWT" \
        "https://codestax.co/api/scans/$SCAN_B/issues" \
        | jq -S '.[] | {fingerprint, severity, rule_id}')
# → empty diff

Should produce zero differences.

When Reproducibility Is Expected to Break

We’re honest about when results should change:

| Input change | Output change | Intentional? |
| --- | --- | --- |
| Code change | New fingerprints, new findings | ✓ |
| Tool version bump | New rule IDs, possibly different severity mapping | ✓ |
| Prompt-template change | Cache invalidates on prompt_version bump | ✓ |
| LLM provider changes model weights | None (cache absorbs; output stays stable) | ✗ |
| Scanner container restart | None (cache persists in Postgres) | ✗ |
| Cache DB corruption or flush | Next scan re-calls LLM; results identical if model weights unchanged | ✗ |

Transparency Commitments

  • Every scan emits a manifest: ScanResult.tool_versions_json lists every tool + version used.
  • Rule catalog is versioned — rule changes produce a new rule_version, not a silent update.
  • LLM responses are cached with (model, prompt) in the key — we can reconstruct which model said what, even months later.
  • Fingerprints are whitespace-invariant — reformatting churn doesn’t break baselines.
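
A whitespace-invariant fingerprint can be sketched like this (the normalization rule and the fields hashed are illustrative assumptions, not the production algorithm):

```python
import hashlib
import re

def finding_fingerprint(rule_id: str, file_path: str, snippet: str) -> str:
    """Hash a finding so that pure reformatting does not change it.

    Collapsing every whitespace run to a single space before hashing
    means indentation and line-wrapping churn cannot mint a new
    fingerprint, so baselines survive a code-formatter sweep.
    """
    normalized = re.sub(r"\s+", " ", snippet).strip()
    payload = f"{rule_id}:{file_path}:{normalized}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

original    = "if user.is_admin:\n    grant_all(user)"
reformatted = "if user.is_admin:\n        grant_all(user)\n"

# Reindenting the same code yields the same fingerprint.
assert finding_fingerprint("py.authz", "app.py", original) == \
       finding_fingerprint("py.authz", "app.py", reformatted)
```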

Roadmap

  • Public FP-rate dashboard — per-rule FP rate from user Mark-FP actions, tracked over time. Surfaces which rules are too noisy. Expected Q2.
  • Manifest download — single-click export of scan manifest + every input hash for audit. Expected Q2.
  • Scan reproducibility badge on every scan detail page: ✓ deterministic vs ⚠ drift.