
Reproducibility

Run the same scan twice on the same commit — you get the same findings, same severities, same gate verdict. Bit-identical output. This page explains how.

Most scanning platforms don’t do this. LLMs are stochastic. Scanner rules update silently. Version drift between CI runs produces different results. We’ve engineered against all three.

The Three Sources of Non-Determinism (and how we kill each)

1. Stochastic LLM output → temperature=0 + cache

Every AI triage call to GPT-4o, Gemini, and Claude uses temperature=0 (greedy decoding, no sampling). For a fixed set of model weights, the output is deterministic.
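As a minimal sketch, a pinned triage request looks like this (the model name, prompt text, and request shape are illustrative assumptions, not the production prompt):

```python
# Every triage request pins temperature=0 so decoding is greedy:
# the highest-probability token is always chosen, never sampled.
triage_request = {
    "model": "gpt-4o",
    "temperature": 0,  # greedy decoding: no sampling
    "messages": [
        {"role": "system", "content": "You triage static-analysis findings."},
        {"role": "user", "content": "Rule: ... Code: ... True positive?"},
    ],
}

assert triage_request["temperature"] == 0
```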

But model providers silently roll weights. To pin output against provider drift, every LLM response is cached in Postgres:

CREATE TABLE ai_triage_cache (
    cache_key      VARCHAR(128) PRIMARY KEY,
    rule_id        VARCHAR(128),
    code_hash      VARCHAR(64),
    model_version  VARCHAR(64),
    prompt_version VARCHAR(32),
    result_json    TEXT,
    hit_count      INT DEFAULT 0
);

cache_key = sha256(rule_id || code_hash || model_version || prompt_version): identical input → identical cache hit → identical output, forever. Until you explicitly invalidate (by bumping prompt_version).
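The key derivation can be sketched like this (field names come from the table above; the concatenation separator is an assumption for illustration):

```python
import hashlib

def triage_cache_key(rule_id: str, code_hash: str,
                     model_version: str, prompt_version: str) -> str:
    """Derive the deterministic cache key for one triage call.

    The '|' separator is an illustrative choice; any unambiguous join
    of the four fields gives the same stability property.
    """
    payload = "|".join([rule_id, code_hash, model_version, prompt_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical inputs always map to the same key (a stable cache hit)...
k1 = triage_cache_key("py.sqli.audit", "ab12cd", "gpt-4o-2024-08-06", "v3")
k2 = triage_cache_key("py.sqli.audit", "ab12cd", "gpt-4o-2024-08-06", "v3")
assert k1 == k2

# ...and bumping prompt_version yields a new key, invalidating the entry.
k3 = triage_cache_key("py.sqli.audit", "ab12cd", "gpt-4o-2024-08-06", "v4")
assert k3 != k1
```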

2. Drifting scanner rules → pinned tool versions

Scanner tools are pinned in the Dockerfile via ARG:

ARG SEMGREP_VERSION=1.95.0
ARG CHECKOV_VERSION=3.2.334
ARG TRIVY_VERSION=0.70.0
ARG GITLEAKS_VERSION=8.21.2
ARG HADOLINT_VERSION=2.12.0
ARG JSCPD_VERSION=4.0.4

Every scan records the tool versions that ran in ScanResult.tool_versions_json. You can prove which rules were active for any historical scan.

3. Drifting rule catalogs → rule version tagging

Every finding stores rule_version — the version of the rule catalog active when the scan ran. If we ever update Semgrep rules, old scans keep their rule_version stamp. Re-running with the new catalog produces a different rule_version + different cache key — fresh results, audit trail intact.

Verifying Reproducibility

Run two scans on the same commit:

SCAN_A=$(curl -sf -X POST -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/trigger/$REPO_ID" | jq -r .id)
# wait for completion...

SCAN_B=$(curl -sf -X POST -H "Authorization: Bearer $JWT" \
  "https://codestax.co/api/scans/trigger/$REPO_ID" | jq -r .id)
# wait for completion...

# Compare
diff <(curl -sf -H "Authorization: Bearer $JWT" \
        "https://codestax.co/api/scans/$SCAN_A/issues" \
        | jq -S '.[] | {fingerprint, severity, rule_id}') \
     <(curl -sf -H "Authorization: Bearer $JWT" \
        "https://codestax.co/api/scans/$SCAN_B/issues" \
        | jq -S '.[] | {fingerprint, severity, rule_id}')
# → empty diff

Should produce zero differences.

When Reproducibility Is Expected to Break

We’re honest about when results should change:

| Input change | Output change | Intentional? |
| --- | --- | --- |
| Code change | New fingerprints, new findings | ✓ |
| Tool version bump | New rule IDs, possibly different severity mapping | ✓ |
| Prompt-template change | Cache invalidates on prompt_version bump | ✓ |
| LLM provider changes model weights | None (cache absorbs; output stays stable) | ✗ |
| Scanner container restart | None (cache persists in Postgres) | ✗ |
| Cache DB corruption or flush | Next scan re-calls LLM; results identical if model weights unchanged | ✗ |

Transparency Commitments

  • Every scan emits a manifest: ScanResult.tool_versions_json lists every tool + version used.
  • Rule catalog is versioned — rule changes produce a new rule_version, not a silent update.
  • LLM responses are cached with (model, prompt) in the key — we can reconstruct which model said what, even months later.
  • Fingerprints are whitespace-invariant — reformatting churn doesn’t break baselines.
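
A whitespace-invariant fingerprint can be sketched like this (the normalization rule and the fields hashed are illustrative assumptions, not the production algorithm):

```python
import hashlib
import re

def finding_fingerprint(rule_id: str, file_path: str, snippet: str) -> str:
    """Hash a finding so that pure reformatting does not change it.

    Collapsing every whitespace run to a single space before hashing
    means indentation and line-wrapping churn cannot mint a new
    fingerprint, so baselines survive a code-formatter sweep.
    """
    normalized = re.sub(r"\s+", " ", snippet).strip()
    payload = f"{rule_id}:{file_path}:{normalized}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

original    = "if user.is_admin:\n    grant_all(user)"
reformatted = "if user.is_admin:\n        grant_all(user)\n"

# Reindenting the same code yields the same fingerprint.
assert finding_fingerprint("py.authz", "app.py", original) == \
       finding_fingerprint("py.authz", "app.py", reformatted)
```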

Roadmap

  • Public FP-rate dashboard — per-rule FP rate from user Mark-FP actions, tracked over time. Surfaces which rules are too noisy. Expected Q2.
  • Manifest download — single-click export of scan manifest + every input hash for audit. Expected Q2.
  • Scan reproducibility badge on every scan detail page: ✓ deterministic vs ⚠ drift.