Monitoring

The signals to watch when you run AACsearch in production — and how to wire them into your observability stack.

This page is about your monitoring of AACsearch, not ours. We watch our cluster ourselves — that's what the status page reports. What you need to watch is the relationship between your application and AACsearch: which keys are 429-ing, which indexes are drifting, how much your reindexes cost.

For a deeper observability-stack integration (request IDs, Sentry correlation, PostHog events), see Observability.

The four signals worth watching

If you only watch four numbers, watch these:

| Signal | What it tells you | Where to find it |
| --- | --- | --- |
| Doc count drift | Typesense doc count vs the count your buffer expects. | Dashboard → Indexes → Health badge. |
| Ingest lag | Seconds since the oldest unprocessed row in the buffer. | Dashboard → Diagnostics. |
| Error rate | Ratio of failed / (success + failed) ingest rows in the last 24 hours. | Dashboard → Diagnostics. |
| Rate-limit 429s | Counts of rate_limited responses, per key. | Dashboard → API keys → Usage. |

Each of these is also available as an oRPC procedure if you want to pull it into your own dashboards. The procedures and their thresholds:

| Signal | Procedure | Healthy threshold |
| --- | --- | --- |
| Doc count drift | searchIndex.getHealth.driftPercent | Absolute value < 5 % |
| Ingest lag | searchIndex.getHealth.ingestLagSeconds | < 300 (5 minutes) |
| Error rate | searchIndex.getHealth.errorRate | < 0.01 (1 %) |
| Rate-limit 429s | searchApiKey.getUsage | Sustained < 80 % of rateLimitPerMinute |

These thresholds match the points at which the dashboard flags a signal (yellow for ingest lag and error rate, red for doc count drift), so alerting at these values means alerting when the dashboard does. Tighter is fine.
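
As a sketch of how the index-level thresholds could be applied in your own alerting, the function below takes a health snapshot and returns the signals that are outside the healthy range. The field names follow the procedure table earlier on this page; the `HealthSnapshot` type, the `unhealthySignals` name, and the return shape are illustrative assumptions, not part of the AACsearch API.

```typescript
// Assumed shape: field names are taken from the procedure table,
// but the real getHealth response type may contain more (or differ).
interface HealthSnapshot {
	driftPercent: number; // signed percentage, e.g. -7 means 7 % fewer docs than expected
	ingestLagSeconds: number;
	errorRate: number; // ratio between 0 and 1
}

type Signal = "docCountDrift" | "ingestLag" | "errorRate";

// Healthy means strictly below these values, as listed on this page.
const THRESHOLDS = {
	driftPercent: 5,
	ingestLagSeconds: 300,
	errorRate: 0.01,
} as const;

// Returns the signals that are at or beyond their threshold, in a fixed order.
function unhealthySignals(health: HealthSnapshot): Signal[] {
	const out: Signal[] = [];
	if (Math.abs(health.driftPercent) >= THRESHOLDS.driftPercent) out.push("docCountDrift");
	if (health.ingestLagSeconds >= THRESHOLDS.ingestLagSeconds) out.push("ingestLag");
	if (health.errorRate >= THRESHOLDS.errorRate) out.push("errorRate");
	return out;
}
```

An empty array means the index is in the healthy range for all three signals; lower the constants if you want to alert earlier than the dashboard does.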

Health badge → action mapping

The Indexes list shows a colored badge per index. Here is what each badge means and the first thing to do.

Green

Everything in normal range. No action.

Yellow — ingest lag

The buffer is taking longer than 5 minutes to flush. Look at:

  1. Dashboard → Diagnostics. The most recent rows show error messages if any.
  2. The status page for your region. A degraded Typesense cluster will inflate lag.
  3. Your own writers. Did you start a backfill that pushed 10× your normal write volume? If so, the buffer is doing its job by back-pressuring you.

Yellow — error rate

More than 1 % of ingest rows failed in the last 24 hours. Causes in rough order of frequency:

  1. Document validation. A new field shape isn't accepted. Open Diagnostics; the error message points at the offending document.
  2. Typesense memory pressure. Rare on the shared cluster; common when an index is far over its plan limits.
  3. Network blip. Look for an ETIMEDOUT pattern. If the cluster status is green and the rate has cleared on its own, it was a blip and no action is needed.

Red — doc count drift

Typesense reports a document count that is more than 5 % off the count your buffer expects. This is always worth investigating; it usually means one of:

  • A reindex was interrupted and didn't clean up.
  • A large delete-by-filter ran that the dashboard counter hasn't caught up with yet.
  • A bug in your application that writes duplicate IDs or skips documents entirely.

Open a ticket if you don't recognize any of the three.

Wiring metrics into your stack

The simplest path: schedule a small worker that polls searchIndex.getHealth for each index every minute and emits to your metrics pipeline.

import { client } from "@repo/api/client";

// allIndexIds and emit are placeholders: supply your own list of index IDs
// and your metrics client's emit function.
for (const indexId of allIndexIds) {
	// One getHealth call returns all three index-level signals.
	const health = await client.searchIndex.getHealth.call({ indexId });
	emit("aacsearch.index.docCountDrift", health.driftPercent, { indexId });
	emit("aacsearch.index.ingestLagSeconds", health.ingestLagSeconds, { indexId });
	emit("aacsearch.index.errorRate", health.errorRate, { indexId });
}

Or use the browser SDK's getHealth from a Node process. Either way, the data is the same; just don't poll faster than once per 30 seconds — the values don't change faster than that.
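
A slightly fuller sketch of such a worker follows, with the health call and the metrics emitter passed in so it can be exercised without a network. `GetHealth`, `Emit`, `pollOnce`, and `clampIntervalMs` are hypothetical names; only the health field names (taken from the procedure table earlier on this page) and the 30-second floor come from this documentation.

```typescript
// Assumed response shape; field names follow the procedure table.
interface IndexHealth {
	driftPercent: number;
	ingestLagSeconds: number;
	errorRate: number;
}

// Hypothetical adapter types: wrap your API client and metrics client in these.
type GetHealth = (indexId: string) => Promise<IndexHealth>;
type Emit = (metric: string, value: number, tags: { indexId: string }) => void;

// The values do not change faster than every 30 seconds, so never poll faster.
const MIN_INTERVAL_MS = 30_000;

function clampIntervalMs(requestedMs: number): number {
	return Math.max(requestedMs, MIN_INTERVAL_MS);
}

// Polls every index once. A failing index does not stop the loop;
// the IDs that failed are returned so the caller can count or log them.
async function pollOnce(indexIds: string[], getHealth: GetHealth, emit: Emit): Promise<string[]> {
	const failed: string[] = [];
	for (const indexId of indexIds) {
		try {
			const health = await getHealth(indexId);
			emit("aacsearch.index.docCountDrift", health.driftPercent, { indexId });
			emit("aacsearch.index.ingestLagSeconds", health.ingestLagSeconds, { indexId });
			emit("aacsearch.index.errorRate", health.errorRate, { indexId });
		} catch {
			failed.push(indexId);
		}
	}
	return failed;
}
```

In a real worker you would schedule `pollOnce` with `setInterval`, passing the delay through `clampIntervalMs` so that a misconfigured value cannot push polling below the 30-second floor.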

Webhook delivery health

If you've configured webhooks (events on document changes, reindex completion, etc.), the Dashboard → Webhooks → Deliveries view shows:

  • Last 1 000 attempts.
  • 2xx vs 4xx vs 5xx breakdown.
  • p50 / p95 / p99 latency.
  • Retry queue depth (events waiting to be re-attempted).

A growing retry queue means your endpoint is slow or returning 5xx. Webhooks retry with exponential backoff for 24 hours, then go to a dead-letter list that you can manually replay from.
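
To get a feel for how many redeliveries fit into a 24-hour window, here is a sketch of an exponential backoff schedule. Only the 24-hour window comes from this page; the first delay and the growth factor are inputs you choose for illustration, because the real values are not stated here, and `backoffSchedule` is a hypothetical helper.

```typescript
// From this page: retries continue for 24 hours before dead-lettering.
const RETRY_WINDOW_SECONDS = 24 * 60 * 60;

// Returns the delays (in seconds) between consecutive attempts that fit
// inside the window. firstDelaySeconds and factor are illustrative inputs,
// not documented AACsearch values.
function backoffSchedule(
	firstDelaySeconds: number,
	factor: number,
	windowSeconds: number = RETRY_WINDOW_SECONDS,
): number[] {
	if (firstDelaySeconds <= 0 || factor <= 1) {
		throw new RangeError("firstDelaySeconds must be > 0 and factor must be > 1");
	}
	const delays: number[] = [];
	let elapsed = 0;
	let delay = firstDelaySeconds;
	// Stop as soon as the next wait would end after the window closes.
	while (elapsed + delay <= windowSeconds) {
		delays.push(delay);
		elapsed += delay;
		delay *= factor;
	}
	return delays;
}
```

With an assumed first delay of 60 seconds and a factor of 2, `backoffSchedule(60, 2)` yields 10 retries, the last about 17 hours after the first failure. Whatever the real parameters are, the practical point is the same: an endpoint that stays down for a day will see its events land in the dead-letter list.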

Rate-limit usage

Every API key has a rateLimitPerMinute budget. The dashboard's API keys → Usage chart shows per-minute usage over the last 24 hours.

Two questions to answer:

  1. Is any key sustaining > 80 % of its budget? That key needs either a higher limit (talk to support) or a load reduction.
  2. Is any 429 happening in normal traffic? A 429 in normal traffic means the limit is too low; a 429 only during peaks may be acceptable if your client retries with backoff.
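
The first question can be answered mechanically from a per-minute usage series. The sketch below treats "sustained" as a run of consecutive minutes above 80 % of the budget; the five-minute run length is an assumption, as is the plain `number[]` input (the actual shape returned by searchApiKey.getUsage is not described on this page).

```typescript
// Returns true if usage stayed above `fraction` of the budget for at least
// `windowMinutes` consecutive minutes.
// perMinuteCounts: request counts per minute, oldest first (assumed shape).
function sustainedOverBudget(
	perMinuteCounts: number[],
	rateLimitPerMinute: number,
	windowMinutes = 5, // assumption: what counts as "sustained"
	fraction = 0.8, // from this page: 80 % of rateLimitPerMinute
): boolean {
	const threshold = rateLimitPerMinute * fraction;
	let run = 0;
	for (const count of perMinuteCounts) {
		// Any minute at or below the threshold resets the run.
		run = count > threshold ? run + 1 : 0;
		if (run >= windowMinutes) return true;
	}
	return false;
}
```

Requiring a run of minutes rather than a single sample keeps a short peak from paging anyone, which matches the distinction above between 429s in normal traffic and 429s only during peaks.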

See Rate limits and quotas for the full model.

What you should NOT use for monitoring

  • Search response latency from end-user devices without a synthetic baseline. Mobile networks dominate the variance; you'll be alerting on Wi-Fi quality at coffee shops.
  • Sentry alert on every 4xx. 4xx is the API doing its job (e.g. validation, rate limit). Alert on 5xx; investigate sustained 4xx in a dashboard.
  • /api/v1/indexes HEAD pings. This is cheap for us but consumes your rate limit. Use the unauthenticated /api/v1/health endpoint if you need a synthetic probe.

Synthetic probes

The unauthenticated, rate-limit-exempt health endpoint is GET /api/v1/health. It returns 200 { "status": "ok" } when the API gateway is up. It does not verify search; for that, run a real search against a known-good index from your application's region with a search-scope key, and emit the latency.
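
A minimal gateway probe might look like the sketch below. The endpoint path and the expected body come from this page; `probeGateway`, the `FetchLike` type, and the base URL you pass in are illustrative assumptions. The fetch implementation is injected so the probe can be tested offline.

```typescript
// Minimal subset of the Fetch API that the probe relies on.
interface ProbeResponse {
	status: number;
	json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<ProbeResponse>;

interface ProbeResult {
	ok: boolean;
	latencyMs: number;
}

// Calls GET {baseUrl}/api/v1/health and reports whether the gateway answered
// 200 with { "status": "ok" }, plus how long the round trip took.
// Network errors and unparsable bodies are reported as ok: false.
async function probeGateway(baseUrl: string, fetchImpl: FetchLike): Promise<ProbeResult> {
	const started = Date.now();
	try {
		const res = await fetchImpl(`${baseUrl}/api/v1/health`);
		const body = (await res.json()) as { status?: unknown } | null;
		const ok =
			res.status === 200 &&
			typeof body === "object" &&
			body !== null &&
			body.status === "ok";
		return { ok, latencyMs: Date.now() - started };
	} catch {
		return { ok: false, latencyMs: Date.now() - started };
	}
}
```

In Node 18 or later you can pass the global `fetch` as `fetchImpl`. This only tells you the gateway is up; the real-search probe described above still needs its own check and its own latency metric.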

Common mistakes

  • Reading the badge once a day. Yellow → red can happen in 10 minutes. Polling every 30–60 seconds is the right cadence; the dashboard caches values for 30 seconds anyway.
  • Alerting on absolute doc count. Absolute counts move every time someone uploads. Alert on drift (the signal that says "they should be equal but aren't").
  • Ignoring the audit log. Half of "why did the index suddenly change shape?" questions are answered by audit_log.action = update_index_settings from yesterday. See Audit logs.
