AI Answers

Generate a one-paragraph AI answer over the top search hits, with citations back to the original documents.

The AI answer endpoint runs a search, takes the top hits, and asks an LLM to produce a short answer grounded in those hits. The result is a paragraph plus a list of cited source documents from the same index — never a free-form generation.

Endpoint

POST /api/search/ai/answer
Authorization: Bearer ss_search_<your-key>
Content-Type: application/json

The endpoint is mounted in packages/api/modules/search/ai-search-public.ts and goes through the same public-search gate as /api/search (rate limiting, scoped token verification, tenant isolation).
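
As a rough sketch, a client could assemble the request like this. Only the path, method, and header names come from this page; the helper name buildAnswerHttpRequest, the AnswerHttpRequest shape, and the base URL are assumptions made for illustration.

```typescript
// Hypothetical helper: assembles the HTTP request for the AI answer endpoint.
// The base URL is supplied by the caller; it is not defined on this page.
interface AnswerHttpRequest {
  url: string;
  init: {
    method: "POST";
    headers: { Authorization: string; "Content-Type": string };
    body: string;
  };
}

function buildAnswerHttpRequest(
  baseUrl: string,
  apiKey: string,
  payload: { query: string; indexSlug?: string; perPage?: number },
): AnswerHttpRequest {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: root + "/api/search/ai/answer",
    init: {
      method: "POST",
      headers: {
        Authorization: "Bearer " + apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    },
  };
}
```

The returned object can be handed to any fetch-compatible client, for example fetch(req.url, req.init).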

Request

{
  query: string;              // ≤ 2000 chars, trimmed
  indexSlug?: string;         // default: "products"
  queryBy?: string;           // default: "title,description"
  filterBy?: string;          // standard filter expression
  perPage?: number;           // hits used for grounding; 1–20, default 5
}

If the caller uses a scoped search token that pins indexSlug, any mismatching indexSlug returns 403 invalid_index.
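
The documented limits can be mirrored on the client before the call is made. The sketch below is one possible approach, not the server's validation: normalizeAnswerRequest is a hypothetical helper, and rejecting empty or over-long queries (rather than truncating them) is an assumption.

```typescript
// Client-side mirror of the documented request constraints.
// Hypothetical helper; the server remains the source of truth.
interface AnswerRequest {
  query: string;
  indexSlug?: string;
  queryBy?: string;
  filterBy?: string;
  perPage?: number;
}

const MAX_QUERY_CHARS = 2000;
const MIN_PER_PAGE = 1;
const MAX_PER_PAGE = 20;

function normalizeAnswerRequest(input: AnswerRequest): AnswerRequest {
  const query = input.query.trim();
  if (query.length === 0) {
    // Assumption: an empty query is not worth sending.
    throw new Error("query must not be empty");
  }
  if (query.length > MAX_QUERY_CHARS) {
    // Assumption: fail early instead of silently truncating.
    throw new Error("query exceeds " + MAX_QUERY_CHARS + " characters");
  }
  const normalized: AnswerRequest = { ...input, query };
  if (input.perPage !== undefined) {
    // Clamp into the documented 1–20 range; omit perPage to get the default of 5.
    const rounded = Math.round(input.perPage);
    normalized.perPage = Math.min(MAX_PER_PAGE, Math.max(MIN_PER_PAGE, rounded));
  }
  return normalized;
}
```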

Response

{
  answer: string;                  // 1–3 sentence answer (≤ 300 tokens)
  sources: Array<{
    id: string;
    title: string;
    url?: string;
    imageUrl?: string;
    price?: number | string;
  }>;
  searchTimeMs: number;            // time spent in keyword search
  totalTimeMs: number;             // wall-clock time across the whole call
}

sources follows the same order as the hit list and is capped at perPage. Use sources to render citation chips below the answer. Documents that fall outside the top hits are not cited.
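
Because sources preserves hit order, the 1-based position can double as a citation number. The following is a minimal sketch of turning sources into chips; CitationChip and toCitationChips are hypothetical names, and numbering the chips is a display choice rather than something the endpoint prescribes.

```typescript
interface AnswerSource {
  id: string;
  title: string;
  url?: string;
  imageUrl?: string;
  price?: number | string;
}

// Hypothetical view model for one citation chip.
interface CitationChip {
  label: string; // e.g. "[1] Bose QuietComfort 45"
  href?: string; // set only when the source carries a url
}

// Maps sources to chips, keeping the order returned by the endpoint.
function toCitationChips(sources: AnswerSource[]): CitationChip[] {
  return sources.map((source, index) => {
    const chip: CitationChip = { label: "[" + (index + 1) + "] " + source.title };
    if (source.url !== undefined) {
      chip.href = source.url;
    }
    return chip;
  });
}
```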

Citations and grounding

The prompt sent to the model contains the top hits (5 by default, per perPage), each rendered as a numbered context line with the title and the first 300 characters of the description:

[1] Bose QuietComfort 45: Noise-cancelling over-ear headphones with…
[2] Sony WH-1000XM5: Industry-leading active noise cancelling…
…

The model is instructed to answer in 1–3 sentences using only the provided context. The temperature is fixed at 0.3, which keeps answers to the same query largely consistent (though not strictly deterministic), and max_tokens is 300 so latency stays predictable.
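
The context format described above can be approximated as follows. This is a reconstruction from the description on this page, not the server's code: buildContextLines and the Hit shape are hypothetical, and appending "…" to a truncated description is an assumption based on the sample lines.

```typescript
// Hypothetical hit shape: only the fields the context lines need.
interface Hit {
  title: string;
  description?: string;
}

const MAX_DESCRIPTION_CHARS = 300;

// Builds "[n] title: description" lines for the first maxHits hits.
function buildContextLines(hits: Hit[], maxHits: number = 5): string[] {
  return hits.slice(0, maxHits).map((hit, index) => {
    const description = hit.description === undefined ? "" : hit.description;
    // Assumption: truncated descriptions end with an ellipsis.
    const snippet =
      description.length > MAX_DESCRIPTION_CHARS
        ? description.slice(0, MAX_DESCRIPTION_CHARS) + "…"
        : description;
    return "[" + (index + 1) + "] " + hit.title + ": " + snippet;
  });
}
```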

The model returns plain text — the system does not require the model to emit [1] markers itself; the sources list is built from the search hits regardless of what the model wrote. If the model fabricates a brand or product that isn't in sources, that's a grounding failure: the answer paragraph cannot be trusted standalone.

Always render the sources chips next to the answer. A confident-sounding paragraph with no visible citations is a common failure mode of RAG systems.
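
One way to catch the fabricated-brand case on the client is a naive string check. The sketch below is an illustration only: findUngroundedBrands is a hypothetical helper, knownBrands is a caller-supplied list (for example from catalog facets), and plain substring matching will miss paraphrases and can misfire on short brand names.

```typescript
// Naive grounding check: returns the known brands that the answer mentions
// but that appear in none of the cited source titles.
function findUngroundedBrands(
  answer: string,
  sourceTitles: string[],
  knownBrands: string[],
): string[] {
  const answerText = answer.toLowerCase();
  const titlesText = sourceTitles.join("\n").toLowerCase();
  return knownBrands.filter((brand) => {
    const needle = brand.toLowerCase();
    const mentioned = answerText.indexOf(needle) !== -1;
    const cited = titlesText.indexOf(needle) !== -1;
    return mentioned && !cited;
  });
}
```

A non-empty result is a reasonable signal to hide the answer panel and show only the hits.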

Confidence

The current endpoint does not return an explicit confidence score. Treat the number of hits as a proxy:

Hits returned (`found`)	Treatment
0	Skip the AI call entirely — render the no-results state.
1–2	Render the answer but expect low coverage; show citations prominently.
3+	Standard rendering; the prompt sees at least 3 distinct context lines.

If the answer comes back as an empty string (answer === ""), the OpenAI call failed; the endpoint returns the search hits anyway and releases the AI reservation. Render results without the answer panel.
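
The table and the empty-answer rule can be folded into one decision function. This is a minimal sketch; chooseTreatment and the AnswerTreatment labels are hypothetical names, not part of the API.

```typescript
// Hypothetical rendering modes derived from the table above.
type AnswerTreatment = "no-results" | "hits-only" | "low-coverage" | "standard";

// found: number of hits returned; answer: the answer string, if the AI call was made.
function chooseTreatment(found: number, answer: string | undefined): AnswerTreatment {
  if (found <= 0) {
    return "no-results"; // skip the AI call entirely
  }
  if (answer === undefined || answer === "") {
    return "hits-only"; // LLM call failed or was skipped: render results without the panel
  }
  if (found <= 2) {
    return "low-coverage"; // render the answer, show citations prominently
  }
  return "standard";
}
```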

Image-to-vector search

There is also an image companion endpoint:

POST /api/search/ai/image

It runs gpt-4o-mini against an uploaded image, takes the textual description, embeds it, and runs a vector search against the index. Useful for "find products similar to this photo" widgets.

Inputs:

{
  imageUrl?: string;   // OR
  imageBase64?: string;
  indexSlug?: string;
  perPage?: number;
}

Output mirrors the standard search response, with vector_distance on each hit. See ai-search-public.ts for the exact shape. The response shape is stable, but image search has not yet been promoted out of Beta.
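
Since the inputs list imageUrl and imageBase64 as alternatives, a client can make the choice explicit in its types. The sketch below assumes exactly one of the two should be sent, which this page implies but does not state; ImageInput and buildImageSearchRequest are hypothetical names.

```typescript
interface ImageSearchRequest {
  imageUrl?: string;
  imageBase64?: string;
  indexSlug?: string;
  perPage?: number;
}

// Hypothetical tagged input so callers cannot pass both image forms at once.
type ImageInput =
  | { kind: "url"; value: string }
  | { kind: "base64"; value: string };

function buildImageSearchRequest(
  image: ImageInput,
  indexSlug?: string,
  perPage?: number,
): ImageSearchRequest {
  const request: ImageSearchRequest =
    image.kind === "url" ? { imageUrl: image.value } : { imageBase64: image.value };
  // Optional fields are only set when provided, so server defaults still apply.
  if (indexSlug !== undefined) {
    request.indexSlug = indexSlug;
  }
  if (perPage !== undefined) {
    request.perPage = perPage;
  }
  return request;
}
```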

Billing

The AI answer is metered through the AI Wallet (Invariant 8). The lifecycle:

Reserve — before the search call, the system reserves CREDIT_RATES.ai_answer kopecks against the org wallet. Insufficient balance → 402 Payment Required; the endpoint never runs.
Commit / release — on success, the reservation is committed as actual usage. On any error (bad input, LLM failure, mismatched scoped token), the reservation is released and the org is not charged.

Rates are published in packages/api/modules/entitlements/credit-rates.ts. Display the per-call cost in your widget so operators can size the budget.
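
The reserve, commit, and release steps can be illustrated with an in-memory model. This is a simplified sketch of the lifecycle as described here, not the AI Wallet implementation: Wallet, runMetered, and the outcome labels are hypothetical, and the cost passed in stands in for CREDIT_RATES.ai_answer.

```typescript
// In-memory illustration of reserve → commit / release.
class Wallet {
  private reserved = 0;
  constructor(private balance: number) {}

  // Returns false when the free balance cannot cover the hold (the 402 case).
  reserve(amount: number): boolean {
    if (amount > this.balance - this.reserved) {
      return false;
    }
    this.reserved += amount;
    return true;
  }

  // Success path: the held amount becomes actual usage.
  commit(amount: number): void {
    this.reserved -= amount;
    this.balance -= amount;
  }

  // Error path: the hold is dropped and the balance is untouched.
  release(amount: number): void {
    this.reserved -= amount;
  }

  available(): number {
    return this.balance - this.reserved;
  }
}

type MeteredResult<T> =
  | { outcome: "charged"; value: T }
  | { outcome: "insufficient_balance" }
  | { outcome: "released" };

// Runs work() only if the reservation succeeds, then commits or releases.
function runMetered<T>(wallet: Wallet, cost: number, work: () => T): MeteredResult<T> {
  if (!wallet.reserve(cost)) {
    return { outcome: "insufficient_balance" };
  }
  try {
    const value = work();
    wallet.commit(cost);
    return { outcome: "charged", value };
  } catch (error) {
    wallet.release(cost);
    return { outcome: "released" };
  }
}
```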

Rate limiting

The endpoint reuses the public search rate limit (SearchRateLimitBucket). A 429 response means the org has exhausted its plan quota for the current window, not that the AI service is overloaded. The 429 response body is the standard { error: "rate_limit_exceeded", retryAfter: number } shape — never the raw upstream error (Invariant 6).
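
A client can use the 429 body to schedule a retry. The sketch below assumes retryAfter is expressed in seconds, which this page does not state; isRateLimitBody and retryDelayMs are hypothetical helpers.

```typescript
interface RateLimitBody {
  error: "rate_limit_exceeded";
  retryAfter: number;
}

// Type guard for the documented 429 body shape.
function isRateLimitBody(body: unknown): body is RateLimitBody {
  if (typeof body !== "object" || body === null) {
    return false;
  }
  const candidate = body as { error?: unknown; retryAfter?: unknown };
  return (
    candidate.error === "rate_limit_exceeded" &&
    typeof candidate.retryAfter === "number"
  );
}

// Returns the delay before retrying, or null when the response is not a 429.
// Assumption: retryAfter is in seconds. fallbackMs is used for unexpected bodies.
function retryDelayMs(status: number, body: unknown, fallbackMs: number): number | null {
  if (status !== 429) {
    return null;
  }
  if (!isRateLimitBody(body)) {
    return fallbackMs;
  }
  return Math.max(0, body.retryAfter) * 1000;
}
```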

Limitations

Latency. Expect 500–2000 ms on top of the search call. Show the search hits first, then fill in the answer panel when the response arrives.
Language. The default prompt is English. For non-English catalogs, configure queryBy and filterBy (for example, filterBy: "locale:=ru") to keep hits monolingual; the model will follow the language of the context.
Hallucination floor. Even with grounding, the model can over-generalise ("This is a great product…"). Mitigation: limit the answer panel to product-category questions, not free-form recommendations.
No streaming. The current /api/search/ai/answer returns the full response in one shot. For long answers over your own documents, use Knowledge RAG streaming (knowledge.askStream) instead.
Single index. The endpoint searches one index. For federated answers, run multi-search and stitch results yourself.

AI Search overview
Semantic search — vector / hybrid mode used by the same endpoint
Public search endpoint — keyword search beneath AI answers
Knowledge RAG — Q&A over your own documents, streaming
Plans and limits — AI wallet rates and overage policy
