SSI investigation evidence (screenshots, DOM snapshots, PDF reports, session logs) is stored using a UUID-prefix sharding scheme that spreads artifacts evenly across directories or object prefixes in each storage backend.
Sharding Scheme
Evidence artifacts are organized under a two-level hex prefix derived from the scan UUID. The prefix spans four hex characters in total, which gives $16^4 = 65{,}536$ possible shards for balanced distribution across filesystem directories or GCS prefixes.
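The sketch below shows one way such a prefix could be derived. It is illustrative only: the source does not state which hex digits of the UUID are used, so the choice of the first four digits (two per level), the helper name `shard_prefix`, and the default root are assumptions. Only the name `evidence_path(scan_id)` appears elsewhere in this document, and its real signature may differ.

```python
import uuid


def shard_prefix(scan_id: str) -> str:
    """Return a two-level shard prefix such as '3f/2a' for a scan UUID."""
    # Parsing through uuid.UUID rejects malformed IDs and normalises case.
    hex_digits = uuid.UUID(scan_id).hex  # 32 lowercase hex characters, no dashes
    # ASSUMPTION: the first four hex digits are used, two per directory level.
    return f"{hex_digits[0:2]}/{hex_digits[2:4]}"


def evidence_path(scan_id: str, root: str = "data/evidence") -> str:
    """Build a sharded evidence location under `root` (hypothetical layout)."""
    canonical = str(uuid.UUID(scan_id))  # lowercase, dashed form
    return f"{root}/{shard_prefix(scan_id)}/{canonical}"
```

Because version 4 UUIDs are random in their leading digits, a prefix taken from those digits should spread scans roughly uniformly over the 65,536 shards.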
Local Filesystem (Development)
Evidence is stored under data/evidence/ relative to the project root.
The site_scans.evidence_path column stores the absolute local path.
Google Cloud Storage (Production)
Evidence is stored in the GCS bucket configured by storage.ssi_evidence_bucket, under the prefix from storage.ssi_evidence_prefix.
The site_scans.evidence_path column stores the full gs:// URI.
Evidence Manifest
Each evidence directory contains a metadata.json manifest for integrity verification.
Manifests are generated automatically for new investigations and can be backfilled for existing scans using scripts/generate_evidence_manifests.py.
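As a rough illustration of how a manifest can support integrity verification, the sketch below records a SHA-256 digest and size for every file in an evidence directory, then later recomputes and compares them. The manifest schema (`files`, `path`, `size`, `sha256`) and the function names are assumptions made for this example. The format produced by scripts/generate_evidence_manifests.py is not shown in this document and may differ.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = "metadata.json"


def build_manifest(evidence_dir: Path) -> dict:
    """Describe every artifact in the directory (hypothetical schema)."""
    files = []
    for path in sorted(evidence_dir.rglob("*")):
        # Skip subdirectories and the manifest itself.
        if not path.is_file() or path.name == MANIFEST_NAME:
            continue
        files.append({
            "path": path.relative_to(evidence_dir).as_posix(),
            "size": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return {"files": files}


def write_manifest(evidence_dir: Path) -> Path:
    """Write metadata.json next to the artifacts and return its path."""
    manifest_path = evidence_dir / MANIFEST_NAME
    manifest_path.write_text(json.dumps(build_manifest(evidence_dir), indent=2))
    return manifest_path


def verify_manifest(evidence_dir: Path) -> bool:
    """Return True if the artifacts on disk still match the stored manifest."""
    recorded = json.loads((evidence_dir / MANIFEST_NAME).read_text())
    return recorded == build_manifest(evidence_dir)
```

A backfill over existing scans would amount to calling `write_manifest` once per evidence directory.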
Evidence Resolution
The Core API resolves evidence locations using the following strategy (see core/src/i4g/api/ssi_evidence.py):
1. Explicit gs:// URI: if evidence_path starts with gs://, parse it and use the bucket and key directly.
2. Sharded fallback: construct the GCS location from settings (ssi_evidence_bucket / ssi_evidence_prefix) using the sharded path returned by evidence_path(scan_id).
3. Flat fallback: try the legacy flat path {prefix}/{scan_id}/ for pre-migration scans.
4. Local filesystem: for local development, serve files directly from the filesystem path.
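The resolution order above can be sketched as a pure function that returns the candidate locations to try, in priority order. This is not the code in core/src/i4g/api/ssi_evidence.py. The `Candidate` type, the function names, and the assumption that the shard uses the first four hex digits of the scan UUID are all hypothetical, and a real resolver would also check each candidate for existence before serving it.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    kind: str               # "gcs" or "local"
    bucket: Optional[str]   # None for local paths
    key: str                # object key prefix or filesystem path


def parse_gs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/key' into ('bucket', 'some/key')."""
    bucket, _, key = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in {uri!r}")
    return bucket, key


def resolution_candidates(scan_id: str, evidence_path: Optional[str],
                          bucket: Optional[str], prefix: str) -> List[Candidate]:
    # 1. An explicit gs:// URI wins outright.
    if evidence_path and evidence_path.startswith("gs://"):
        b, k = parse_gs_uri(evidence_path)
        return [Candidate("gcs", b, k)]
    candidates: List[Candidate] = []
    if bucket:
        h = uuid.UUID(scan_id).hex
        # 2. Sharded location (ASSUMPTION: first four hex digits, two per level).
        candidates.append(Candidate("gcs", bucket, f"{prefix}/{h[:2]}/{h[2:4]}/{scan_id}/"))
        # 3. Legacy flat location for pre-migration scans.
        candidates.append(Candidate("gcs", bucket, f"{prefix}/{scan_id}/"))
    # 4. Local filesystem path, used in local development.
    if evidence_path:
        candidates.append(Candidate("local", None, evidence_path))
    return candidates
```

Returning candidates rather than performing I/O keeps the ordering logic easy to test. The caller decides how to probe GCS or the filesystem.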
Migration
Evidence created before the sharding scheme was introduced uses flat paths ({prefix}/{scan_id}/). The migration script moves artifacts to the sharded layout.
The migration is idempotent: scans already at sharded paths are skipped.
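To make the idempotency property concrete, here is a minimal sketch of a flat-to-sharded move on a local filesystem. It is not the project's migration script. The function names, the returned status strings, and the shard derivation (first four hex digits of the UUID) are assumptions, and a real migration would presumably also update site_scans.evidence_path and handle GCS objects.

```python
import shutil
import uuid
from pathlib import Path


def sharded_dir(root: Path, scan_id: str) -> Path:
    """Target directory under the sharded layout (hypothetical derivation)."""
    h = uuid.UUID(scan_id).hex
    return root / h[:2] / h[2:4] / str(uuid.UUID(scan_id))


def migrate_scan(root: Path, scan_id: str) -> str:
    """Move one scan from root/{scan_id} to the sharded layout.

    Returns "moved", "skipped" (already sharded), or "missing" (nothing to move).
    """
    flat = root / str(uuid.UUID(scan_id))
    target = sharded_dir(root, scan_id)
    # Idempotency: a scan already at its sharded path is left untouched.
    if target.exists():
        return "skipped"
    if not flat.exists():
        return "missing"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(flat), str(target))
    return "moved"
```

Running the function twice on the same scan returns "moved" and then "skipped", so an interrupted migration can be restarted safely.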
Lifecycle Management
Evidence retention follows these policies:
Active investigations: Evidence is retained indefinitely while the associated case is open.
Resolved cases: Evidence follows the same retention policy as case data (configurable per environment).
Purged cases: Evidence artifacts are deleted when the case is purged.
The sharded layout enables efficient lifecycle operations because purge scripts can target specific prefix shards without listing the entire bucket.
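As an illustration of shard-targeted purging, the sketch below groups the scans to purge by their shard prefix, so that a purge script only needs to list objects under the affected shards. The function name and the shard derivation are assumptions. In a GCS-backed script, each prefix could be passed as the `prefix` argument of a listing call such as `Client.list_blobs` in the google-cloud-storage library.

```python
import uuid
from collections import defaultdict
from typing import Dict, Iterable, List


def purge_prefixes(scan_ids: Iterable[str], prefix: str) -> Dict[str, List[str]]:
    """Map each affected shard prefix to the scan-level prefixes to delete."""
    by_shard: Dict[str, List[str]] = defaultdict(list)
    for scan_id in scan_ids:
        h = uuid.UUID(scan_id).hex
        # ASSUMPTION: the shard is the first four hex digits, two per level.
        shard = f"{prefix}/{h[:2]}/{h[2:4]}/"
        by_shard[shard].append(f"{shard}{scan_id}/")
    return dict(by_shard)
```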