Evaluating Knowledge RAG Quality

A pragmatic workflow for measuring and improving answer quality — retrieval, faithfulness, coverage — without a research lab.

A Knowledge RAG system has three places it can fail:

  1. Retrieval failure: the relevant chunk isn't in the top-k results.
  2. Generation failure: the chunk is there, but the model summarises it incorrectly or adds detail that isn't in the chunk.
  3. Coverage failure: the answer exists somewhere in your organisation's documents, but the right ones were never uploaded, or they were uploaded and split into useless chunks.

Evaluating answer quality means catching all three. This page is a pragmatic workflow — no PhD required — that fits the data and tooling AACSearch already exposes.

What "good" looks like

Pick a small set (10–30) of representative questions for each Knowledge space. For each question, write down:

  • The expected answer (short, factual).
  • The expected source document(s) — the file(s) that contain the answer.
  • The expected behavior on a miss — should the system say "I don't know" or surface a related but partial answer?

This is your evaluation set. It is the most valuable artifact in any RAG project and the only honest way to detect drift. Re-run it after every ingest config change.
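
One way to keep the evaluation set honest is to store it as typed data and validate it before each run. The sketch below is only an illustration: the `EvalCase` shape, its field names, the `validateEvalSet` helper, and the sample question are all hypothetical and are not part of the AACSearch API.

```typescript
// Hypothetical shape for one evaluation case; field names are illustrative.
type MissBehavior = "refuse" | "partial_answer";

interface EvalCase {
  id: string;
  question: string;
  expectedAnswer: string;        // short, factual
  expectedDocumentIds: string[]; // file(s) that contain the answer
  onMiss: MissBehavior;          // expected behavior when the answer is not found
}

// Made-up sample entry, for illustration only.
const evalSet: EvalCase[] = [
  {
    id: "refund-window",
    question: "How long do customers have to request a refund?",
    expectedAnswer: "30 days from the purchase date.",
    expectedDocumentIds: ["doc_refund_policy"],
    onMiss: "refuse",
  },
];

// Returns a list of problems; an empty list means the set is usable.
function validateEvalSet(cases: EvalCase[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) problems.push(`${c.id}: duplicate id`);
    seen.add(c.id);
    if (c.question.trim() === "") problems.push(`${c.id}: empty question`);
    if (c.expectedAnswer.trim() === "") problems.push(`${c.id}: empty expected answer`);
    if (c.expectedDocumentIds.length === 0) problems.push(`${c.id}: no expected source`);
  }
  return problems;
}
```

Keeping the set in a plain file like this makes it easy to diff and to re-run unchanged after each ingest config change.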

Three metrics you can run today

1. Source recall

For each evaluation question, run knowledge.ask and check whether the expected source document appears in result.sources.

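// `evalCase` is one entry from the evaluation set.
// The name `eval` is avoided because strict-mode code cannot bind it.
const result = await orpc.knowledge.ask.call({
  spaceId,
  question: evalCase.question,
  topK: 5,
});

// True when the expected source document is among the returned sources.
const recallHit = result.sources.some((src) => src.id === evalCase.expectedDocumentId);

  • Source recall @ 5: fraction of the eval set where the right doc is in the top 5 sources.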
  • Target: 0.85+ for a usable system. Below 0.7 → retrieval is the bottleneck, not generation.

If recall is low, do not tune the answer prompt — the answer prompt can only work with what it's given. Fix retrieval first.
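
The per-question check can be rolled up into a single number for the whole evaluation set. This is a minimal sketch that assumes you have already collected one `knowledge.ask` result per question, in the same order as the questions. The `RecallCase` and `AskSources` interfaces and the `sourceRecallAtK` helper are hypothetical; only `sources[].id` is taken from the snippet above.

```typescript
// Assumed minimal shapes; only `sources[].id` mirrors the ask result used above.
interface RecallCase {
  question: string;
  expectedDocumentId: string;
}

interface AskSources {
  sources: { id: string }[];
}

// Fraction of cases whose expected document appears in the first k sources.
// `results[i]` must be the result for `cases[i]`.
function sourceRecallAtK(cases: RecallCase[], results: AskSources[], k: number): number {
  if (cases.length !== results.length) {
    throw new Error("cases and results must have the same length");
  }
  if (cases.length === 0) return 0;

  let hits = 0;
  cases.forEach((c, i) => {
    const topK = (results[i]?.sources ?? []).slice(0, k);
    if (topK.some((s) => s.id === c.expectedDocumentId)) hits += 1;
  });
  return hits / cases.length;
}
```

Record the value alongside the config that produced it (chunk size, top-k, embedding model) so that later runs can be compared like for like.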

2. Faithfulness

Faithfulness asks: "Is the answer only using information present in the cited chunks?"

A cheap manual check: for each eval question, copy the chunks (result.chunks) into your notes, then read the answer paragraph and underline anything not supported by the chunks. Anything underlined is a hallucination.

A semi-automated check: re-prompt the LLM with the chunks and the answer and ask it to flag unsupported sentences. This is what evaluation frameworks like RAGAS automate; the principle is the same.
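
The semi-automated check comes down to two small pieces: a prompt that pairs the chunks with the answer, and a parser for the judge's reply. The sketch below assumes each chunk is available as plain text and that the judge model is asked to reply with a JSON array. Both helper names and the prompt wording are illustrative, not an AACSearch or RAGAS API, and the call to the judge model itself is left to whichever client you already use.

```typescript
// Builds a judge prompt from the retrieved chunk texts and the generated answer.
function buildFaithfulnessPrompt(chunks: string[], answer: string): string {
  const numbered = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "You are checking an answer against source passages.",
    "List every sentence in the ANSWER that is not supported by the PASSAGES.",
    "Reply with a JSON array of strings. Reply with [] if every sentence is supported.",
    "",
    "PASSAGES:",
    numbered,
    "",
    "ANSWER:",
    answer,
  ].join("\n");
}

// Extracts the flagged sentences from the judge's reply.
// Returns null when the reply cannot be parsed, so the case can be reviewed by hand
// instead of being silently counted as faithful.
function parseUnsupported(reply: string): string[] | null {
  const start = reply.indexOf("[");
  const end = reply.lastIndexOf("]");
  if (start === -1 || end < start) return null;
  try {
    const parsed: unknown = JSON.parse(reply.slice(start, end + 1));
    if (!Array.isArray(parsed)) return null;
    return parsed.filter((x): x is string => typeof x === "string");
  } catch {
    return null;
  }
}
```

An LLM judge can itself be wrong, so it is worth spot-checking a sample of its verdicts against the manual underline method.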

3. Answer accuracy

The hardest one and the only one that maps to user satisfaction. For each eval question, score the answer:

Score | Meaning
------|--------
2     | Correct and complete.
1     | Partially correct: covers the main point but misses nuance or context.
0     | Incorrect, irrelevant, or a confident hallucination.
-1    | Refused appropriately ("I don't know") when the answer is not in the docs.

Average over the eval set. Track over time after every config change (chunk size, top-k, model swap).
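
A small helper can do the averaging. The page does not say how the -1 "refused appropriately" score should enter the average, so this sketch makes an explicit assumption: it reports appropriate refusals as a separate count and averages only the 0 to 2 scores, which keeps the mean on a 0 to 2 scale. The `summarizeScores` name and the summary shape are hypothetical; change the handling if your team counts refusals differently.

```typescript
type Score = 2 | 1 | 0 | -1;

interface AccuracySummary {
  meanAccuracy: number | null; // mean of the 0..2 scores; null when nothing was graded
  gradedCount: number;         // number of answers scored 0, 1, or 2
  appropriateRefusals: number; // number of answers scored -1
}

// ASSUMPTION: -1 (appropriate refusal) is excluded from the mean and counted separately.
function summarizeScores(scores: Score[]): AccuracySummary {
  const graded = scores.filter((s) => s >= 0);
  const sum = graded.reduce<number>((acc, s) => acc + s, 0);
  return {
    meanAccuracy: graded.length > 0 ? sum / graded.length : null,
    gradedCount: graded.length,
    appropriateRefusals: scores.length - graded.length,
  };
}
```

Store each summary with the date and the config under test so the trend across chunk size, top-k, and model changes stays visible.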

Diagnosing common failure modes

Symptom | Likely cause | Fix
--------|--------------|----
Answer cites the right doc but states the wrong fact. | Chunk too small: the surrounding context was split off. | Increase chunkSize and overlap. Re-ingest the source.
Answer cites an unrelated doc. | Retrieval pulled a chunk on a tangential keyword. | Use topK ≥ 8 and rely on the model to ignore noise; or add a metadata filter (Beta).
Answer is empty / refuses everything. | The LLM cap kicked in (max_tokens too low) or chunks are empty. | Check IngestionJob.processedItems; if it is zero, the parser failed silently. Inspect the parser logs.
Answers are right in dev but wrong in prod. | Different embedding model between envs. | Pin KnowledgeSpace.ragConfig.embeddingModel and re-ingest under the prod model.
Fluctuating answers for the same question. | Temperature too high. | Default is 0.3 for the public AI answer; for ask you can lower it via space config.
Slow ask (> 5 s). | Embedding the question or a cold LLM client. | Check result.timings (Beta); if embedding time dominates, switch to text-embedding-3-small.

Telemetry to watch

The Knowledge module emits events into SearchUsageEvent (the same table as keyword search) with eventType = "knowledge_ask". From the Analytics dashboard:

  • Volume: knowledge_ask events per day, broken down by space.
  • Token spend: sum of tokensInput / tokensOutput per period.
  • Refusal rate: fraction of answers that include the literal "I don't know" / "the documents don't" template. A spike usually means a recent ingest failed.
  • Latency: p50 / p95 / p99 of total time on knowledge_ask.

Combine these with your manual eval scores to detect drift early.
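
If you export the raw events and want to reproduce two of these numbers yourself, the arithmetic is short. The sketch below assumes you have the answer texts and the total latencies as plain arrays; how those are exported is not covered here, and the helper names are hypothetical. The refusal markers are the two template fragments quoted above, and the percentile uses the nearest-rank method, which may differ slightly from what the Analytics dashboard computes.

```typescript
// Template fragments quoted in the telemetry list above, lowercased for matching.
const REFUSAL_MARKERS = ["i don't know", "the documents don't"];

function isRefusal(answer: string): boolean {
  // Normalise curly apostrophes so both spellings of "don't" match.
  const text = answer.toLowerCase().replace(/\u2019/g, "'");
  return REFUSAL_MARKERS.some((m) => text.includes(m));
}

// Fraction of answers that contain a refusal template.
function refusalRate(answers: string[]): number {
  if (answers.length === 0) return 0;
  return answers.filter(isRefusal).length / answers.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p / 100) * sorted.length)));
  return sorted[rank - 1] as number;
}
```

Comparing these values week over week is usually more informative than any single reading.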

Improving quality, in order

When your evaluation says quality is bad, work the checklist in this order. The steps run roughly from cheapest to most expensive, so exhaust the early ones before moving on:

  1. Audit the eval set. Are the expected answers actually in the uploaded documents? Half the time the gap is content, not RAG.
  2. Re-ingest with a sensible chunking strategy. Markdown for Markdown sources, semantic for prose, code for code. Defaults are not always best.
  3. Tune chunk size and overlap. Try chunkSize: 400, overlap: 50 as a first iteration; benchmark recall before/after.
  4. Increase topK from 5 to 8 or 10. Cheap; usually improves recall.
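  5. Switch embedding model. From text-embedding-3-small to text-embedding-3-large. Doubles the vector dimensions (1536 to 3072) and costs several times more per token at OpenAI's published prices; re-ingest required.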
  6. Use GraphRAG for multi-document questions. See GraphRAG use cases.
  7. Add metadata filters (Beta) — narrow retrieval to a subset of documents by tag / source / date.
  8. Per-tenant fine-tuning (Roadmap, Enterprise).
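
To make step 3 concrete, here is what chunk size and overlap mean mechanically. This is a generic sliding-window sketch, not AACSearch's chunker. It assumes the input has already been split into tokens and that chunkSize and overlap are counted in those tokens; the unit the platform actually uses may differ.

```typescript
// Splits a token sequence into windows of `chunkSize` items,
// where each window repeats the last `overlap` items of the previous one.
function chunkWithOverlap<T>(tokens: T[], chunkSize: number, overlap: number): T[][] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be at least 0 and smaller than chunkSize");
  }

  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // this window reached the end
  }
  return chunks;
}
```

With chunkSize 400 and overlap 50 the window advances 350 tokens at a time, so a fact that straddles a boundary has a better chance of appearing whole in at least one chunk. Larger overlap also means more chunks to embed and store, which is why recall should be benchmarked before and after.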

Privacy and PII in evaluation sets

Eval sets often contain PII because real user questions contain PII. Keep your eval store separate from the Knowledge space being evaluated, and avoid re-ingesting the eval questions themselves — they shouldn't show up in answers.

If you can't avoid PII in the eval set, redact it before checking the set into version control. Even after redaction, treat the eval store as in scope for personal data in any DPA or SOC 2 evidence.

When evaluation says "good enough"

A 30-question eval set with consistent source-recall ≥ 0.85 and accuracy ≥ 1.5 (out of 2) is enough to ship to a controlled audience. Below that, the cost of bad answers — wasted user time, support escalations, eroded trust — usually outweighs the value of the AI layer.
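
These thresholds are easy to encode as a release gate. The sketch below is one reading of them: it treats "consistent" as "every recent run passes" and "30-question" as a minimum size. Both are interpretations, and the `EvalRun` shape and function names are hypothetical.

```typescript
interface EvalRun {
  questionCount: number;
  sourceRecall: number; // 0..1
  meanAccuracy: number; // 0..2
}

// Thresholds taken from the paragraph above.
const THRESHOLDS = { minQuestions: 30, minRecall: 0.85, minAccuracy: 1.5 };

// Lists which criteria a single run fails; an empty list means the run passes.
function failedCriteria(run: EvalRun): string[] {
  const failures: string[] = [];
  if (run.questionCount < THRESHOLDS.minQuestions) failures.push("eval set too small");
  if (run.sourceRecall < THRESHOLDS.minRecall) failures.push("source recall below 0.85");
  if (run.meanAccuracy < THRESHOLDS.minAccuracy) failures.push("accuracy below 1.5");
  return failures;
}

// ASSUMPTION: "consistent" means every recent run passes, not just the latest one.
function readyForControlledRollout(recentRuns: EvalRun[]): boolean {
  return recentRuns.length > 0 && recentRuns.every((r) => failedCriteria(r).length === 0);
}
```

Returning the failed criteria, not just a boolean, points you at the matching step in the improvement checklist.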

If you can't get above that threshold, the cleanest fallback is to render retrieved passages as a list (without an LLM summary) and let the user read. That's effectively semantic search; see AI Search → Semantic search.
