Locamem — Benchmarks

Retrieval quality

Session recall on LongMemEval-S

Did the store surface a gold-evidence session in the top-k? Measured across all 500 questions, embeddings off (the air-gapped path).

k	Session recall@k	Notes
@10	99.4% (497/500)	Only 3 misses across the whole set.
@20	99.6% (498/500)	Default reader window.
@50	100% (500/500)	Every gold session is in the candidate set.
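
For readers who want to check how a figure such as 497/500 is computed, here is a minimal sketch of session recall@k. It assumes "a gold-evidence session in the top-k" means at least one gold session appears among the first k ranked sessions. The function name and the toy data are illustrative, not taken from the Locamem benchmark scripts.

```python
from typing import Iterable, Sequence


def session_recall_at_k(
    ranked: Sequence[Sequence[str]],
    gold: Sequence[Iterable[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one gold session in the top-k."""
    hits = 0
    for ranked_ids, gold_ids in zip(ranked, gold):
        gold_set = set(gold_ids)
        # A question counts as a hit if any gold session is ranked in the top k.
        if any(session_id in gold_set for session_id in ranked_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data (hypothetical): three questions, each with a ranked session list.
ranked = [["s3", "s1", "s7"], ["s2", "s9", "s4"], ["s5", "s6", "s8"]]
gold = [{"s1"}, {"s4"}, {"s0"}]

recall_at_1 = session_recall_at_k(ranked, gold, 1)  # no gold session is ranked first
recall_at_3 = session_recall_at_k(ranked, gold, 3)  # two of three questions hit
```

On the real set the same ratio gives 497/500 = 99.4% at k=10 and 498/500 = 99.6% at k=20.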

Speed & cost

Sub-10 ms recall, $0 per call

Metric	Locamem (local)	Cloud memory (typical)
Recall latency	~9 ms (SimHash band + FTS5; hybrid)	Network round trip: tens–hundreds of ms + tail latency
Write / ingest	~3.8 ms per memory	API write + async indexing
API calls / recall	0	≥1 (vector search), often + embedding call
Cost / 1,000 recalls	$0	Metered API + embedding/model cost
Footprint	One SQLite file; runs on a laptop	Managed datastore + vector index, server-side

How it's measured

Methodology & reproducibility

Dataset

LongMemEval-S

500 long-horizon questions over ~48-session haystacks, a widely used long-term-memory benchmark. We report session-level retrieval recall (the metric that decides whether the reader even sees the evidence).

Engine

SimHash + FTS5

64-bit SimHash (LSH) ∪ FTS5 full-text ∪ optional on-device embeddings, fused and scored with a per-facet breakdown. Embeddings off for the air-gapped numbers above.
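
To make the SimHash half of the engine concrete, the following is a minimal sketch of a 64-bit SimHash fingerprint with a banded lookup. It is not Locamem's implementation: the tokenizer, the BLAKE2b token hash, the unweighted votes, and the 4 x 16-bit band layout are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 64
BANDS = 4                   # assumed band count; the real layout may differ
BAND_BITS = BITS // BANDS   # 16 bits per band


def simhash64(text: str) -> int:
    """64-bit SimHash: each token votes +1/-1 on every bit position."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(BITS):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << bit for bit, votes in enumerate(counts) if votes > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def bands(fingerprint: int) -> list[int]:
    """Split a fingerprint into BANDS equal slices used as LSH bucket keys."""
    mask = (1 << BAND_BITS) - 1
    return [(fingerprint >> (i * BAND_BITS)) & mask for i in range(BANDS)]


def shares_band(a: int, b: int) -> bool:
    """Candidate test: two fingerprints collide if any band matches exactly."""
    return any(x == y for x, y in zip(bands(a), bands(b)))
```

With four bands, any two fingerprints that differ in at most three bits must agree on at least one whole band (pigeonhole), so an exact-match index on the band values can find near-duplicate candidates without scanning every row. Candidates from this path would then be unioned with FTS5 matches before scoring.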

No overfitting

Firewalled

The QA path never reads the answer or answer_session_ids fields; a runtime assert enforces it. Numbers on the public set are reported as results, not used as a tuning signal.
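
A firewall of this kind could look roughly like the sketch below. Only the field names answer and answer_session_ids come from the text above. The function names, the sample record, and the choice to raise AssertionError explicitly are assumptions, not the repository's actual code.

```python
# Label fields the QA path must never see (names from the dataset).
FORBIDDEN_FIELDS = frozenset({"answer", "answer_session_ids"})


def assert_no_labels(record: dict) -> None:
    """Fail loudly if any label field is visible to the QA path."""
    leaked = FORBIDDEN_FIELDS & record.keys()
    if leaked:
        # Raised explicitly so the check survives `python -O`,
        # which strips plain `assert` statements.
        raise AssertionError(f"label leak into QA path: {sorted(leaked)}")


def firewall(record: dict) -> dict:
    """Return a copy of the record with label fields removed."""
    visible = {k: v for k, v in record.items() if k not in FORBIDDEN_FIELDS}
    assert_no_labels(visible)
    return visible


# Hypothetical benchmark record, for illustration only.
record = {
    "question_id": "q1",
    "question": "Which city did I book the flight to?",
    "answer": "Lisbon",
    "answer_session_ids": ["s7"],
}
visible = firewall(record)  # safe to hand to retrieval and the reader
```

Calling assert_no_labels at the entry point of the QA function, as well as inside firewall, means a raw record passed by mistake stops the run instead of silently inflating the score.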

# reproduce, end to end
git clone https://github.com/TeamWilcoe/locamem && cd locamem
python benchmarks/build_failure_dossier.py   # session recall@10/20/50, CPU-only
python benchmarks/bench_longmemeval_qa.py --use-solvers --model claude  # end-to-end

Where we're honest

Retrieval ≠ end-to-end accuracy

We separate the two on purpose, and we don't claim to beat anyone on answer accuracy.

Retrieval (our headline)

~99% recall@10

The store reliably surfaces the right evidence. This is what Locamem owns, on-device, at $0 — and it's genuinely strong.

End-to-end QA (reader-limited)

~58%, and we say so

End-to-end accuracy is limited by the reader model as well as by retrieval. Even an oracle GPT-4o handed perfect evidence tops out near 82%. We report ~58% with the current reader and, with session recall at ~99%, treat the gap as a reader problem: not a retrieval one, and not a claim of accuracy superiority over cloud products.

Roadmap: RRF rank-fusion + a local cross-encoder rerank are the next retrieval upgrades; reader-side anti-hedge + aggregation work targets the end-to-end gap. Both are tracked openly in the repo.
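
For context on the first roadmap item, reciprocal rank fusion (RRF) combines several ranked lists by summing 1 / (k + rank) for each item. The sketch below is a generic version of that formula, not the planned Locamem code. The constant k=60 is the value commonly used in the literature, and the two input lists are made-up examples.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]], k: int = 60
) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, session_id in enumerate(ranking, start=1):
            scores[session_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))


# Hypothetical per-channel rankings for one query.
simhash_ranking = ["s1", "s2", "s3"]
fts5_ranking = ["s3", "s1", "s4"]

fused = reciprocal_rank_fusion([simhash_ranking, fts5_ranking])
# fused == ["s1", "s3", "s2", "s4"]
```

Because RRF uses only rank positions, it does not need the SimHash and FTS5 scores to be on a comparable scale, which is the usual reason to prefer it over a weighted sum of raw scores.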

See it for yourself

Run the live recall demo, then install in one line.

No account. No keys. One SQLite file and an MCP server, on your machine.

$ curl -fsSL https://locamem.com/install | bash
