Reproducibility
Run the same scan twice on the same commit — you get the same findings, same severities, same gate verdict. Bit-identical output. This page explains how.
Most scanning platforms don’t do this. LLMs are stochastic. Scanner rules update silently. Version drift between CI runs produces different results. We’ve engineered against all three.
The Three Sources of Non-Determinism (and how we kill each)
1. Stochastic LLM output → temperature=0 + cache
Every AI triage call to GPT-4o, Gemini, and Claude uses temperature=0 (greedy decoding — no sampling). With no sampling, the output is deterministic for a given set of model weights.
But model providers silently roll weights. To pin output against provider drift, every LLM response is cached in Postgres:
```sql
CREATE TABLE ai_triage_cache (
    cache_key      VARCHAR(128) PRIMARY KEY,
    rule_id        VARCHAR(128),
    code_hash      VARCHAR(64),
    model_version  VARCHAR(64),
    prompt_version VARCHAR(32),
    result_json    TEXT,
    hit_count      INT DEFAULT 0
);
```

`cache_key = sha256(rule_id || code_hash || model || prompt_version)` — identical input → identical cache hit → identical output forever, until you explicitly invalidate (by bumping `prompt_version`).
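The key scheme above can be sketched in a few lines. The exact concatenation and encoding are assumptions; only the four inputs come from the docs:

```python
import hashlib

# Sketch of the cache-key scheme: sha256 over the four inputs. The "||"
# separator and UTF-8 encoding are assumptions, not the documented format.
def cache_key(rule_id: str, code_hash: str, model: str, prompt_version: str) -> str:
    material = "||".join([rule_id, code_hash, model, prompt_version])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

k1 = cache_key("py.sql-injection", "a1b2c3", "gpt-4o", "v3")
k2 = cache_key("py.sql-injection", "a1b2c3", "gpt-4o", "v3")
k3 = cache_key("py.sql-injection", "a1b2c3", "gpt-4o", "v4")  # prompt bump
assert k1 == k2  # identical input -> identical key -> cache hit
assert k1 != k3  # bumping prompt_version invalidates the entry
```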
2. Drifting scanner rules → pinned tool versions
Scanner tools are pinned in the Dockerfile via `ARG`:

```dockerfile
ARG SEMGREP_VERSION=1.95.0
ARG CHECKOV_VERSION=3.2.334
ARG TRIVY_VERSION=0.70.0
ARG GITLEAKS_VERSION=8.21.2
ARG HADOLINT_VERSION=2.12.0
ARG JSCPD_VERSION=4.0.4
```

Every scan records the tool versions that ran in `ScanResult.tool_versions_json`. You can prove which rules were active for any historical scan.
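A minimal sketch of how the pinned versions above could be captured into a manifest like `ScanResult.tool_versions_json` (the dict structure and field layout are assumptions, not the actual schema):

```python
import json

# Hedged sketch: assemble a byte-stable scan manifest from the pinned
# tool versions. Versions mirror the Dockerfile ARGs above.
PINNED_TOOLS = {
    "semgrep": "1.95.0",
    "checkov": "3.2.334",
    "trivy": "0.70.0",
    "gitleaks": "8.21.2",
    "hadolint": "2.12.0",
    "jscpd": "4.0.4",
}

# sort_keys makes the serialized manifest itself byte-identical across runs
tool_versions_json = json.dumps(PINNED_TOOLS, sort_keys=True)
assert json.loads(tool_versions_json)["semgrep"] == "1.95.0"
```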
3. Drifting rule catalogs → rule version tagging
Every finding stores rule_version — the version of the rule catalog active when the scan ran. If we ever update Semgrep rules, old scans keep their rule_version stamp. Re-running with the new catalog produces a different rule_version + different cache key — fresh results, audit trail intact.
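The stamping behavior can be sketched as follows; the field names are illustrative assumptions, not the actual schema:

```python
# Hedged sketch of rule_version stamping: each finding freezes the catalog
# version active at scan time and is never rewritten by later updates.
ACTIVE_RULE_VERSION = "2024-11"  # illustrative catalog version string

def stamp_finding(rule_id: str, fingerprint: str, severity: str) -> dict:
    return {
        "rule_id": rule_id,
        "fingerprint": fingerprint,
        "severity": severity,
        "rule_version": ACTIVE_RULE_VERSION,  # frozen at scan time
    }

finding = stamp_finding("py.sql-injection", "a1b2c3", "high")
assert finding["rule_version"] == "2024-11"  # survives later catalog bumps
```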
Verifying Reproducibility
Run two scans on the same commit:
```shell
SCAN_A=$(curl -sf -X POST -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/trigger/$REPO_ID" | jq -r .id)
# wait for completion...
SCAN_B=$(curl -sf -X POST -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/trigger/$REPO_ID" | jq -r .id)
# wait for completion...

# Compare
diff <(curl -sf -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/$SCAN_A/issues" | jq -S '.[] | {fingerprint, severity, rule_id}') \
     <(curl -sf -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/$SCAN_B/issues" | jq -S '.[] | {fingerprint, severity, rule_id}')
# → empty diff
```

This should produce zero differences.
When Reproducibility Is Expected to Break
We’re honest about when results should change:
| Input change | Output change | Intentional? |
|---|---|---|
| Code change | ✓ new fingerprints, new findings | ✓ |
| Tool version bump | ✓ new rule IDs, possibly different severity mapping | ✓ |
| Prompt-template change | ✓ cache invalidates on prompt_version bump | ✓ |
| LLM provider changes model weights | ✗ cache absorbs — stable | ✓ |
| Scanner container restart | ✗ cache persists in Postgres | ✓ |
| Cache DB corruption or flush | ✗ next scan re-calls LLM; results identical if model weights unchanged | ✓ |
Transparency Commitments
- Every scan emits a manifest — `ScanResult.tool_versions_json` lists every tool + version used.
- Rule catalog is versioned — rule changes produce a new `rule_version`, not a silent update.
- LLM responses are cached with (model, prompt) in the key — we can reconstruct which model said what, even months later.
- Fingerprints are whitespace-invariant — reformatting churn doesn’t break baselines.
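The whitespace-invariance property can be sketched like this; the real fingerprint algorithm is not documented here, and the normalization choice is an assumption:

```python
import hashlib
import re

# Hedged sketch of a whitespace-invariant fingerprint: strip all whitespace
# before hashing, so reformatting churn never changes the digest.
def fingerprint(rule_id: str, snippet: str) -> str:
    normalized = re.sub(r"\s+", "", snippet)  # drop every whitespace run
    return hashlib.sha256(f"{rule_id}:{normalized}".encode()).hexdigest()

a = fingerprint("py.sql-injection", "cursor.execute(query)")
b = fingerprint("py.sql-injection", "cursor.execute( query )\n")
assert a == b  # reformatting does not break the baseline match
```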
Roadmap
- Public FP-rate dashboard — per-rule FP rate from user Mark-FP actions, tracked over time. Surfaces which rules are too noisy. Expected Q2.
- Manifest download — single-click export of scan manifest + every input hash for audit. Expected Q2.
- Scan reproducibility badge on every scan detail page: ✓ deterministic vs ⚠ drift.
Related
- Quality Ratings — what the ratings are + how they’re computed
- Scanner Details — tool list + pins
- Quality Gate API — `tool_versions_json` field in ratings response