# SSI Architecture

The Scam Site Investigator (SSI) is an always-on Cloud Run service that automates browser-based reconnaissance, OSINT collection, and cryptocurrency wallet extraction for suspected scam websites.

## Investigation Flow
## Key Components

| Component | Location | Role |
| --- | --- | --- |
| Orchestrator | `src/ssi/investigator/orchestrator.py` | Entry point; coordinates browser, OSINT, and wallet extraction steps |
| Browser Module | `src/ssi/browser/` | Playwright-based page navigation, screenshot capture, DOM extraction |
| OSINT Modules | `src/ssi/osint/` | Passive DNS, WHOIS, TLS certificate, and hosting analysis |
| Wallet Extraction | Embedded in orchestrator | Regex + heuristic scanning for cryptocurrency addresses (BTC, ETH, USDT) |
| Identity Vault | `src/ssi/identity/vault.py` | Generates synthetic PII for safe interaction with suspicious sites |
| ScanStore | `src/ssi/store/scan_store.py` | Persists scan results and creates cases via direct DB writes |
| Wallet Allowlist | `config/wallet_allowlist.json` | Filters known-good exchange/service wallets out of indicator submissions |
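The wallet-extraction step above can be sketched with simple regexes. The patterns, chain labels, and `extract_wallets` helper below are illustrative assumptions, not the production heuristics:

```python
import re

# Illustrative patterns only; the real extractor layers additional heuristics
# (checksum validation, allowlist filtering) on top of the raw matches.
WALLET_PATTERNS = {
    # Legacy Base58 (1/3-prefixed) and bech32 (bc1) Bitcoin addresses.
    "BTC": re.compile(r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[a-z0-9]{39,59})\b"),
    # Ethereum-style addresses; ERC-20 USDT uses the same format.
    "ETH": re.compile(r"\b0x[a-fA-F0-9]{40}\b"),
}

def extract_wallets(text: str) -> dict[str, set[str]]:
    """Scan page text for candidate cryptocurrency addresses, keyed by chain."""
    return {
        chain: set(pattern.findall(text))
        for chain, pattern in WALLET_PATTERNS.items()
        if pattern.search(text)
    }

hits = extract_wallets("Send BTC to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa today")
```

Matches would then be filtered through `config/wallet_allowlist.json` before becoming indicators.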
## Integration with Core

SSI writes directly to the shared database; it does not go through the Core API for data persistence. This design keeps investigation latency low and avoids circular API dependencies.
| Integration | Direction | Description |
| --- | --- | --- |
| Case creation | SSI → DB | `ScanStore.create_case_record()` writes cases with wallet indicators and OSINT entities |
| Evidence storage | SSI → GCS | Screenshots, DOM snapshots, and session logs stored in the shared evidence bucket |
| Investigation trigger | Core API → SSI | Analysts call core-svc at `POST /investigations/ssi`; core dispatches to ssi-svc at `POST /trigger/investigate` |
| Case ↔ investigation link | Core DB | `case_investigations` join table; one case can have many investigations and vice versa |
| Auto-investigation | Core → SSI | `auto_investigate` job finds case URLs, deduplicates, and triggers SSI via HTTP |
| URL deduplication | Core DB | `site_scans.normalized_url` with a staleness window prevents redundant scans |
| eCrimeX sharing | DB → ECX | Extracted indicators are shared via the eCrimeX integration pipeline |
| Analytics refresh | DB → Analytics | Entity stats from SSI-created cases feed into the aggregation pipeline |
## Case ↔ Investigation Linking

The `case_investigations` join table provides a many-to-many relationship between cases and SSI investigations. A single investigation (identified by `scan_id`) can be linked to multiple cases that share the same URL, and a single case can have multiple investigations for different URLs.
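A minimal in-memory sketch of the join table's many-to-many shape, using sqlite3. The column names beyond `case_id`, `scan_id`, and `trigger_type` (and the constraints) are assumptions; the real schema lives in the core migrations:

```python
import sqlite3

# Hypothetical DDL illustrating the many-to-many link between cases and scans.
DDL = """
CREATE TABLE case_investigations (
    case_id      TEXT NOT NULL,
    scan_id      TEXT NOT NULL,
    trigger_type TEXT NOT NULL
        CHECK (trigger_type IN ('case_created', 'manual', 'auto')),
    PRIMARY KEY (case_id, scan_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# One investigation (scan) linked to two cases that share the same URL.
conn.executemany(
    "INSERT INTO case_investigations VALUES (?, ?, ?)",
    [("CASE-1", "SCAN-9", "auto"), ("CASE-2", "SCAN-9", "auto")],
)
rows = conn.execute(
    "SELECT case_id FROM case_investigations WHERE scan_id = 'SCAN-9' ORDER BY case_id"
).fetchall()
```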
Linking happens at three points:
- **Case-created scans** — `ScanStore.create_case_record()` writes to both `site_scans.case_id` and `case_investigations` with `trigger_type='case_created'`.
- **Manual triggers** — `POST /cases/{id}/investigate` creates a `case_investigations` row with `trigger_type='manual'`.
- **Auto-investigation** — The `auto_investigate` worker job queries URL indicators, deduplicates by `normalized_url`, and links results with `trigger_type='auto'`.
## Auto-Investigation

The `auto_investigate` job (`src/i4g/worker/jobs/auto_investigate.py`) runs as a Cloud Run job or via the CLI:
1. Query `indicators` for URLs not yet linked to any investigation.
2. Normalize and group URLs to avoid duplicate triggers.
3. Filter through the domain blocklist (configurable via `auto_investigate.domain_blocklist`).
4. Check each URL against `site_scans.normalized_url` with a staleness window.
5. Trigger SSI investigations for qualifying URLs.
6. Link results back to all originating cases via `case_investigations`.
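The URL normalization and staleness check above might look like the sketch below. The `STALENESS_WINDOW` value and the exact normalization rules are assumptions; the real values are configurable:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, urlunsplit

STALENESS_WINDOW = timedelta(days=7)  # assumed default; configurable in practice

def normalize_url(url: str) -> str:
    """Lowercase scheme/host and drop fragments so duplicate URLs group together."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )

def needs_scan(url: str, last_scans: dict[str, datetime]) -> bool:
    """True when no scan of the normalized URL exists within the staleness window."""
    last = last_scans.get(normalize_url(url))
    return last is None or datetime.now(timezone.utc) - last > STALENESS_WINDOW

# Recently scanned URL is skipped even with cosmetic differences; new URLs qualify.
recent = {"https://scam.example/": datetime.now(timezone.utc)}
skip = needs_scan("HTTPS://SCAM.example/#promo", recent)   # → False (fresh scan exists)
run = needs_scan("https://new-scam.example/pay", recent)   # → True (never scanned)
```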
Configuration lives under the `auto_investigate` settings section; see Configuration.
## Evidence Storage

Evidence artifacts (screenshots, DOM snapshots, reports) use UUID-prefix sharding for even distribution across storage backends.
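As a sketch of UUID-prefix sharding, assuming a two-hex-character shard and an `evidence/` prefix (both are illustrative; the real layout is specified in the Evidence Storage design):

```python
import uuid

def evidence_path(artifact: str, scan_id: uuid.UUID) -> str:
    """Build an object path sharded by the first two hex chars of the scan UUID.

    Because UUIDs are uniformly distributed, the 256 possible two-char prefixes
    spread objects evenly across storage prefixes, avoiding hot spots.
    """
    shard = scan_id.hex[:2]
    return f"evidence/{shard}/{scan_id}/{artifact}"

path = evidence_path(
    "screenshot.png", uuid.UUID("3f9b5d2e-0000-4000-8000-000000000000")
)
# → "evidence/3f/3f9b5d2e-0000-4000-8000-000000000000/screenshot.png"
```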
See Evidence Storage for the full design.
## Deployment

- **Runtime:** always-on Cloud Run service (min instances = 1 for warm start)
- **Docker image:** `ssi-svc` (includes Playwright + Chromium)
- **Conda environment:** `i4g-ssi` (separate from core `i4g`)
- **Config:** `SSI_*` environment variables (see `ssi/config/settings.default.toml`)
For the full SSI user guide, see the Scam Site Investigator section.