SPEC-STV-07-RAG-System

SPEC-STV-07 · Spec header

  • Spec ID: SPEC-STV-07
  • Title: RAG Knowledge Base
  • Version: 1.0.0
  • Status: Planned
  • Authority: Specification
  • Priority: P0
  • Owner role: RAG engineer
  • Reviewers: Backend architect, Security lead
  • Last reviewed: 2026-05-11
  • Sync targets: app/Services/Rag/**, docs/RAG_SYSTEM.md
  • Depends on: SPEC-STV-HUB, SPEC-STV-02
  • Consumed by: SPEC-STV-08, SPEC-STV-09
  • Conflict rule: Hub wins.
  • Change policy: RAG engineer + Backend architect; Registry bump on backend/embed-model change.

1 · Goal

A workspace-scoped retrieval system that lets AI answer with citations to the workspace's own pages, blocks, comments, files, templates, and database rows. ON/OFF per workspace.

2 · Sources indexed

| source_type | What is chunked | Excluded |
| --- | --- | --- |
| page | Title + concatenated visible text from all blocks (excluding archived). | Private pages from other users (respect permissions at query time). |
| block | Block-level chunks for long pages (chunked individually so citations point at a specific block). | Code blocks where metadata.secret = true. |
| comment | Comment body. | Resolved comments older than 30 days (configurable). |
| file | Extracted text from PDFs / Markdown / plain text. Images get an OCR pass (P5+). | Files > 50 MiB. Encrypted files. |
| template | Template name + description + flattened payload text. | |
| db_row | Concatenation of denormalized value_text per row. | Rows in archived databases. |

3 · Chunking

  • Strategy: heading-aware splitter; max 1200 tokens / chunk, 150 token overlap.
  • Boundary preference: paragraph > sentence > token.
  • Each chunk stores { workspace_id, source_type, source_id, chunk_index, text, embedding (JSON), content_hash = sha256(text) }.
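
The windowing and hashing rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the service's implementation: it treats whitespace-separated words as a stand-in for model tokens, and it omits the heading-aware splitting and the paragraph > sentence > token boundary preference. The names `Chunk` and `chunk_text` are hypothetical, and the `embedding` field is left out because it would be filled in by a later step.

```python
import hashlib
from dataclasses import dataclass

MAX_TOKENS = 1200  # per-chunk cap from the spec
OVERLAP = 150      # tokens shared by consecutive chunks


@dataclass
class Chunk:
    workspace_id: int
    source_type: str
    source_id: int
    chunk_index: int
    text: str
    content_hash: str


def chunk_text(workspace_id, source_type, source_id, text,
               max_tokens=MAX_TOKENS, overlap=OVERLAP):
    """Split text into overlapping windows (assumption: words ~ tokens)."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        body = " ".join(tokens[start:start + max_tokens])
        chunks.append(Chunk(
            workspace_id=workspace_id,
            source_type=source_type,
            source_id=source_id,
            chunk_index=len(chunks),
            text=body,
            content_hash=hashlib.sha256(body.encode("utf-8")).hexdigest(),
        ))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the default settings, each new window starts 1050 tokens after the previous one, so the last 150 tokens of a chunk reappear at the start of the next.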

4 · Embedding model

  • Default: text-embedding-3-large (3072 dims). Configurable in rag_settings.embedding_model.
  • Provider: same OpenAI key surface as AI; a workspace may flip to a self-hosted provider (localai, ollama) via rag_settings.provider.
  • Reembedding on model change requires a full reindex job (admin-triggered).

5 · Retrieval

  • Top-K = 8 default, capped at 20.
  • Always filter by workspace_id AND by the caller's effective page permissions.
  • Re-rank: cosine similarity → optional cross-encoder (post-v1) for top-50 → final top-K.
  • Hits are returned with { page_uuid, block_id?, score, excerpt }.
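
A minimal sketch of the retrieval rules above, in illustrative Python. It assumes chunks are in-memory dictionaries and that the caller's effective permissions have already been resolved into a set of readable page UUIDs; `retrieve`, `readable_pages`, and the 200-character excerpt length are hypothetical. The optional cross-encoder stage is omitted.

```python
import math

DEFAULT_TOP_K = 8
MAX_TOP_K = 20


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, workspace_id, readable_pages,
             top_k=DEFAULT_TOP_K):
    k = max(1, min(top_k, MAX_TOP_K))  # enforce the cap of 20
    # Filter first: workspace scope AND caller permissions, before any scoring.
    candidates = [
        c for c in chunks
        if c["workspace_id"] == workspace_id and c["page_uuid"] in readable_pages
    ]
    scored = [(cosine(query_vec, c["embedding"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "page_uuid": c["page_uuid"],
            "block_id": c.get("block_id"),  # optional, None for page-level hits
            "score": score,
            "excerpt": c["text"][:200],     # assumed excerpt length
        }
        for score, c in scored[:k]
    ]
```

Filtering before scoring means an unreadable chunk can never occupy one of the top-K slots, which is one way to satisfy the "never returns a hit the caller cannot read" requirement.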

6 · Citations format (AI side)

When the AI uses RAG hits, the response includes a citations array. The web client renders inline chips [1], [2], etc., each resolving to the source page/block. A claim without a citation is not allowed (enforced by AiAnswerService).
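
One way such a check could work is sketched below in illustrative Python. The spec does not define how AiAnswerService enforces the rule, so everything here is an assumption: the answer is taken to be pre-split into sentences, markers are taken to be 1-based indexes into the citations array, and `uncited_claims` is a hypothetical helper.

```python
import re

CITATION_MARK = re.compile(r"\[(\d+)\]")


def uncited_claims(sentences, citations):
    """Return sentences with no marker, or with a marker that resolves to nothing."""
    valid = set(range(1, len(citations) + 1))  # assumed 1-based chip numbers
    rejected = []
    for sentence in sentences:
        refs = {int(n) for n in CITATION_MARK.findall(sentence)}
        if not refs or not refs <= valid:
            rejected.append(sentence)
    return rejected
```

A caller could refuse to return the answer, or strip the offending sentences, whenever the returned list is non-empty.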

7 · Indexing queue

  • Trigger: every page/block/comment/file/template/db_row write fires an event; the listener queues IndexRagSourceJob keyed by (source_type, source_id).
  • Idempotency: the job computes content_hash; if unchanged, no-op.
  • Concurrency: per-workspace Cache::lock('workspace:{id}:rag-index') prevents thundering herds on bulk import. Per-source jobs do not block each other.
  • Backpressure: jobs run on the low-priority rag-index queue; the Horizon supervisor limits it to 4 workers.
  • Stale detection: a nightly job compares rag_chunks against each source's updated_at; mismatches are re-enqueued.
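
The idempotency rule above can be illustrated with a small Python sketch. It is not the IndexRagSourceJob implementation: the `store` dictionary stands in for rag_chunks, `embed` stands in for the embedding provider call, and `index_source` is a hypothetical name. Locking and queueing are left out.

```python
import hashlib


def index_source(store, source_key, text, embed):
    """Re-embed a source only when its content hash has changed.

    Returns True if the source was (re)indexed, False on a no-op.
    """
    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    existing = store.get(source_key)
    if existing is not None and existing["content_hash"] == new_hash:
        return False  # unchanged content: skip the embedding call entirely
    store[source_key] = {"content_hash": new_hash, "embedding": embed(text)}
    return True
```

Because the hash is checked before `embed` is called, duplicate events for the same (source_type, source_id) cost one hash computation rather than one provider request.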

8 · Reindex endpoints

POST /rag/reindex (admin): { scope: "workspace|page|database", id }. GET /rag/status returns { chunks, stale, last_indexed_at, queue_depth }.
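
A request-body check for POST /rag/reindex might look like the following illustrative Python sketch. The spec gives only the payload shape, so the error messages, the `validate_reindex_request` name, and the rule that `id` is required for every scope (including workspace) are assumptions.

```python
VALID_SCOPES = {"workspace", "page", "database"}


def validate_reindex_request(body):
    """Return a list of validation errors; an empty list means the body is valid."""
    errors = []
    if body.get("scope") not in VALID_SCOPES:
        errors.append("scope must be one of: " + ", ".join(sorted(VALID_SCOPES)))
    if body.get("id") in (None, ""):
        errors.append("id is required")  # assumption: required for all scopes
    return errors
```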

9 · Secret exclusion

  • Patterns scanned on chunk text before embedding: sk-[A-Za-z0-9]{20,}, ghp_[A-Za-z0-9]{20,}, AKIA[A-Z0-9]{16}, JWTs, xox[bp]-, lines containing BEGIN RSA. Matches → drop the chunk and emit an activity_logs warning.
  • Files marked metadata.is_secret = true are never embedded.
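
The pattern scan above can be sketched in illustrative Python as follows. The first five patterns are copied from the list; the JWT pattern is an assumption, since the spec names JWTs without giving a regex. The `log` list stands in for activity_logs, and `filter_chunks` is a hypothetical name.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"ghp_[A-Za-z0-9]{20,}"),
    re.compile(r"AKIA[A-Z0-9]{16}"),
    re.compile(r"xox[bp]-"),
    re.compile(r"BEGIN RSA"),
    # Assumed JWT shape: three base64url segments, the first two starting "eyJ".
    re.compile(r"eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
]


def filter_chunks(chunks, log):
    """Drop any chunk whose text matches a secret pattern; record a warning."""
    kept = []
    for index, text in enumerate(chunks):
        if any(pattern.search(text) for pattern in SECRET_PATTERNS):
            # Log only the position and reason, never the matched text.
            log.append({"level": "warning", "chunk_index": index,
                        "reason": "secret_pattern"})
            continue
        kept.append(text)
    return kept
```

The warning deliberately omits the matched text so that the log does not become a second place where the secret is stored.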

10 · Per-workspace ON/OFF

A settings row holds rag.enabled = true|false. When disabled, RagService::query() returns 200 with { disabled: true, hits: [] }, and the AI Q&A surface tells the user to enable it. Indexing also pauses.
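
The short-circuit can be sketched in illustrative Python. Here `settings` is a plain dictionary, `search` stands in for the real retrieval call, and treating a missing rag.enabled row as disabled is an assumption. The `disabled: False` field on the enabled path is also assumed, since the spec only defines the disabled response.

```python
def query(settings, question, search):
    """Return the disabled payload without touching the index when RAG is off."""
    if not settings.get("rag.enabled", False):  # assumption: missing means off
        return {"disabled": True, "hits": []}
    return {"disabled": False, "hits": search(question)}
```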

11 · pgvector opt-in (advanced)

When rag_settings.vector_backend = "pgvector" is set on a workspace, a sidecar Postgres schema is provisioned with a table rag_vectors (rag_chunk_id bigint primary key, embedding vector(3072)). The MySQL embedding JSON column is kept for replay and migration; reads use the Postgres ANN index. The default backend stays MySQL JSON; pgvector is a per-workspace opt-in, never the system default.

12 · Acceptance criteria

  1. Toggling rag.enabled ON for a workspace queues a full reindex; OFF stops indexing and short-circuits queries.
  1. A page edit updates rag_chunks within 60 s on default Horizon settings.
  1. RagService::query() never returns a hit the caller cannot read.
  1. Secret patterns in source text are dropped before embedding and logged.
  1. GET /rag/status reports accurate chunks, stale, last_indexed_at, queue_depth.