Luci
Ticket MC-4585. Vault embedding portable cache + cheaper API backend. Code work. Explore first, then dev-loop before edits.
Luci
Read vec_embed.py — embedding backend.
Luci
Now check scheduled vault rebuild task + index.py + gitignore.
Luci
The vault-index task runs `index.py --rebuild` but does NOT run `embed_memories.py` afterwards, so the embed step must be appended. The cache must be git-tracked, because vault.db is gitignored and per-machine. Plan is clear. Tier 2: invoke dev-loop.
Luci
Tier 2, code-only (non-UI). No Tessa. Plan:
1. New `scripts/embed_cache.py` — content-hash cache (chunk_hash+model+dim → blob), git-tracked portable.
2. `vec_embed.py` — backend abstraction: local (default) / opt-in OpenAI text-embedding-3-small @768, expose `model_id()`.
3. `embed_memories.py` — cache-aware reembed, `--sync-cache` backfill, cache stats in dry-run.
4. `tasks/vault-index.md` — run embed step after rebuild.
5. Smoke test + verification.
Check git state first.
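A minimal sketch of the content-hash cache from step 1, to make the key shape concrete. The table layout, the SHA-256 choice, and the method names are assumptions for illustration, not the actual contents of `scripts/embed_cache.py`:

```python
import hashlib
import sqlite3
from typing import Optional


def chunk_hash(text: str) -> str:
    """Stable content hash of a chunk's text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Illustrative cache keyed by (chunk_hash, model, dim) -> vector blob."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            " chunk_hash TEXT NOT NULL,"
            " model TEXT NOT NULL,"
            " dim INTEGER NOT NULL,"
            " blob BLOB NOT NULL,"
            " PRIMARY KEY (chunk_hash, model, dim))"
        )

    def get(self, text: str, model: str, dim: int) -> Optional[bytes]:
        """Return the cached vector blob, or None on a miss."""
        row = self.conn.execute(
            "SELECT blob FROM cache WHERE chunk_hash = ? AND model = ? AND dim = ?",
            (chunk_hash(text), model, dim),
        ).fetchone()
        return row[0] if row else None

    def put(self, text: str, model: str, dim: int, blob: bytes) -> None:
        """Store a vector; identical chunks collapse onto one row."""
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (chunk_hash(text), model, dim, blob),
        )
        self.conn.commit()

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
```

Because the key includes the model and dimension, vectors from a different backend never collide with local ones, and duplicate chunk text is stored once.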
Luci
On master. Dirty files are unrelated auto-sync churn — I'll stage only my files. Check baseline counts + confirm cache path won't be gitignored.
Luci
sqlite3 CLI lacks vec0. Count via python with extension loaded.
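A hedged sketch of counting from Python rather than the CLI. `table_counts` and its `extension_path` parameter are hypothetical names; the path to the vec0 extension library is an assumption and is only needed when a vec0 virtual table is among the tables counted:

```python
import sqlite3
from typing import Dict, Iterable, Optional


def table_counts(
    db_path: str,
    tables: Iterable[str],
    extension_path: Optional[str] = None,
) -> Dict[str, int]:
    """Return row counts per table, optionally loading a SQLite extension first.

    Table names are interpolated into SQL, so pass trusted names only.
    """
    conn = sqlite3.connect(db_path)
    try:
        if extension_path is not None:
            # Requires a Python build with loadable-extension support.
            conn.enable_load_extension(True)
            conn.load_extension(extension_path)
            conn.enable_load_extension(False)
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
```

Usage would look like `table_counts("vault.db", ["files", "search_fts", "vec_meta", "vec_memories"], extension_path=...)`, with the extension path supplied for the local install.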
Luci
Baseline confirmed: files=4343, search_fts=4343, vec_meta=848, vec_memories=1120. Matches ticket. Now build.
Luci
Now refactor `vec_embed.py` for backend abstraction + `model_id()`:
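Roughly the shape of that abstraction, as a sketch only. The environment variable name, the local model string, and the `model_id` format are all assumptions; only the local default, the opt-in OpenAI `text-embedding-3-small` at 768 dimensions, and the existence of `model_id()` come from the plan:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Backend:
    name: str
    model: str
    dim: int


# Hypothetical registry; the local model name is a placeholder.
BACKENDS = {
    "local": Backend("local", "local-embedding-model", 768),
    "openai": Backend("openai", "text-embedding-3-small", 768),
}


def active_backend(env: Optional[Mapping[str, str]] = None) -> Backend:
    """Pick the backend; local unless explicitly opted in via an env var."""
    env = os.environ if env is None else env
    name = env.get("VAULT_EMBED_BACKEND", "local")  # assumed variable name
    if name not in BACKENDS:
        raise ValueError(f"unknown embedding backend: {name!r}")
    return BACKENDS[name]


def model_id(backend: Backend) -> str:
    """Identifier stored alongside vectors so backends never mix."""
    return f"{backend.name}:{backend.model}@{backend.dim}"
```

The point of exposing an identifier is that the cache and `vec_meta` can record which backend produced each vector.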
Luci
Now wire the cache into `embed_memories.py`. Edit imports, reembed, add sync-cache + dry-run stats.
Luci
Now make `reembed` cache-aware and add a cache-embed helper:
Luci
Now update `main()` — add `--cache` / `--sync-cache` args, open cache, thread it through, report cache stats. Re-read main region:
Luci
Update the completion prints to report cache hits/misses:
Luci
Now wire the embed step into the nightly vault-index task:
Luci
vault-index now runs embed after rebuild (with `--max-rss-mb 2000` as a memory safety cap on Luci). Now write the smoke test.
Luci
Existing tests use `reembed` without cache — my optional `cache=None` + tuple return keeps them working. Append cache tests.
Luci
Run the test suite:
Luci
Import path issue — pre-existing tests need `scripts` importable. Check setup.
Luci
All 9 pass. Now verify against live vault.db: backfill cache, then dry-run should show 0 pending.
Luci
Backfilled 1120 rows → 1108 unique cache entries (12 duplicate chunks deduped by content hash — a feature). Dry-run: **0 pending**. Now live proof of cross-machine reuse: drop one file's vec_meta so it's "pending", re-run bounded embed — expect 100% cache hits, 0 misses (no fastembed call).
Luci
**Decisive proof**: 102 cache hits, 0 misses, RSS 49MB — fastembed never loaded (would be ~950MB). A machine with the cache rebuilds vectors at zero compute cost. Final counts:
Luci
Counts healthy & restored; cache 1108 vectors (4.5M); **data/ vec_meta = 0** (no raw data expansion). Pre-landing review on the diff:
Luci
No PR — this is a local working-tree change. Review diff directly:
Luci
Self-review found one real footgun: `sync_cache` derives chunk text from current `search_fts`, but if a file changed since its vectors were stored (pending/stale), it'd map a new-text hash to an old vector — poisoning the cache. Guard: only backfill files whose `files.content_hash` matches `vec_meta.content_hash`.
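The guard can be sketched as a join that keeps only files whose stored vectors still correspond to the current text. The `path` join column is an assumption about the schema; `files.content_hash` and `vec_meta.content_hash` are the columns named above:

```python
import sqlite3
from typing import List


def backfillable_paths(conn: sqlite3.Connection) -> List[str]:
    """Paths whose vectors were computed from the file's current content.

    Stale files (hash mismatch) are skipped so a new-text hash is never
    paired with an old vector in the cache.
    """
    rows = conn.execute(
        "SELECT DISTINCT f.path "
        "FROM files f JOIN vec_meta m ON m.path = f.path "
        "WHERE f.content_hash = m.content_hash "
        "ORDER BY f.path"
    ).fetchall()
    return [row[0] for row in rows]
```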
Luci
Guard added. Re-run tests + re-verify live sync-cache still backfills cleanly:
Luci
All green: 9 tests pass, sync-cache backfills cleanly, dry-run 0 pending. Now council — per dev-loop, worker spawns a critical Claude subagent reviewer on the final diff.
Luci
Council verdict: 2 blockers. Let me assess against Rule 16 / practical risk:
- **CRITICAL #1 (cache never closed)**: WAL/journal sidecars are already gitignored (`.gitignore` lines 2-4: `*.db-journal/-wal/-shm`), so the git-leak concern is moot — but clean close is good hygiene. **Fix (cheap).**
- **IMPORTANT #4 (backend-switch footgun)**: This is correctness for the OpenAI backend the ticket asked for. Cheap to close properly via a `model` column + model-aware selection + migration backfill (so existing local vectors aren't spuriously re-embedded). **Fix.**
- **#8 (assert blobs non-None)**: cheap guardrail. **Fix.**
- #3/#5/#6/#10: low practical risk (migration-only tool, single-writer flock, pre-existing code) — defer. #9: reviewer couldn't see the task file because it lives at `~/workspace/tasks/` outside the PKA repo; I edited the correct file.
Applying the three fixes.
Luci
Now make selection model-aware + write `model` in reembed:
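A sketch of what model-aware selection could look like: a file is pending if it has no vectors, its content changed, or its vectors came from a different model than the active one. The `path` column and the function name are assumptions; SQLite's null-safe `IS NOT` is used so rows with a NULL `model` also count as pending:

```python
import sqlite3
from typing import List


def pending_paths(conn: sqlite3.Connection, active_model: str) -> List[str]:
    """Files that need (re-)embedding under the active model."""
    rows = conn.execute(
        "SELECT DISTINCT f.path "
        "FROM files f LEFT JOIN vec_meta m ON m.path = f.path "
        "WHERE m.path IS NULL "                      # never embedded
        "   OR m.content_hash IS NOT f.content_hash "  # content changed
        "   OR m.model IS NOT ? "                    # different or unknown model
        "ORDER BY f.path",
        (active_model,),
    ).fetchall()
    return [row[0] for row in rows]
```

Switching backends then marks every existing vector as pending, instead of silently mixing vectors from two models.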
Luci
Now read the current `reembed` region, then update it to write `model` and add the non-None assert (#8):
Luci
Now CRITICAL #1 — ensure cache always closes. Add context-manager to `EmbeddingCache` and wrap main. First the cache class:
Luci
Now read the main body to wrap cache lifecycle in try/finally:
Luci
Refactor the post-setup body into `_run()` so `main` can guarantee cache close via try/finally:
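The close guarantee, sketched under assumptions: the class body and `_run` contents here are placeholders, and only the pattern (context-manager methods on `EmbeddingCache` plus try/finally in `main`) reflects the change described:

```python
import sqlite3


class EmbeddingCache:
    """Placeholder cache showing only the lifecycle methods."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.closed = False

    def close(self) -> None:
        """Commit and close once; safe to call repeatedly."""
        if not self.closed:
            self.conn.commit()
            self.conn.close()
            self.closed = True

    def __enter__(self) -> "EmbeddingCache":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # never swallow the caller's exception


def _run(cache: EmbeddingCache) -> int:
    """Stand-in for the post-setup body; returns a process exit code."""
    cache.conn.execute("SELECT 1")
    return 0


def main(cache_path: str = ":memory:") -> int:
    cache = EmbeddingCache(cache_path)
    try:
        return _run(cache)
    finally:
        cache.close()  # runs on success, error, or early return
```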
Luci
Verify structure compiles + tests pass:
Luci
Model-aware selection broke 3 pre-existing tests — their hand-rolled `vec_meta` lacks the new `model` column. Update the test fixture to match real schema:
Luci
Add a test for the new backend-swap detection (council #4 behavior):
Luci
10 pass. Critical live check — the live `vec_meta` lacks the `model` column. Run sync-cache (triggers migration+backfill), then dry-run must STILL be 0 (backfill = local model = active model, so no spurious full re-embed):
Luci
Bug found: migration backfill didn't persist — `ensure_schema`'s UPDATE is DML needing commit, but the sync-cache path never commits the vault connection (only the cache db). The in-memory backfill made dry-run read 0, but disk stayed NULL. Fix: commit the migration in `ensure_schema`.
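A sketch of the fix, assuming the migration adds a nullable `model` column to `vec_meta` and backfills it. The function signature and return value are illustrative; the essential line is the commit, since Python's `sqlite3` opens an implicit transaction for the UPDATE and discards it if the connection closes uncommitted:

```python
import sqlite3


def ensure_schema(conn: sqlite3.Connection, default_model: str) -> int:
    """Add vec_meta.model if missing and backfill NULLs; returns rows backfilled."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(vec_meta)")}
    if "model" not in columns:
        conn.execute("ALTER TABLE vec_meta ADD COLUMN model TEXT")
    cursor = conn.execute(
        "UPDATE vec_meta SET model = ? WHERE model IS NULL",
        (default_model,),
    )
    # Without this commit the backfill is visible only on this connection
    # and is rolled back when it closes.
    conn.commit()
    return cursor.rowcount
```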
Luci
Migration persisted (848 rows = local nomic), dry-run 0 pending, 10 tests pass. That migration-commit bug was a real catch from the review loop. Final verification snapshot:
Luci
All green: counts healthy, data/ excluded (0), cache 1108 portable vectors. Commit PKA files (vault.db stays gitignored; cache db is the portable artifact, git-tracked):
Luci
Committed (61cc88ea) — the Xref noise is the graphify post-commit hook scanning PDFs, harmless. Now commit the task file (separate workspace repo) and push both:
Luci
Both pushed — PKA `61cc88ea`, workspace `9a9e60ca`. Log activity + report done via mc-coord:
Luci
activity_log MCP is read-only (query). Commits + ticket DONE are the durable record. Report done: