Monitoring
The signals to watch when you run AACsearch in production — and how to wire them into your observability stack.
This page is about your monitoring of AACsearch, not ours. We watch our cluster ourselves — that's what the status page reports. What you need to watch is the relationship between your application and AACsearch: which keys are 429-ing, which indexes are drifting, how much your reindexes cost.
For a deeper observability-stack integration (request IDs, Sentry correlation, PostHog events), see Observability.
The four signals worth watching
If you only watch four numbers, watch these:
| Signal | What it tells you | Where to find it |
|---|---|---|
| Doc count drift | Typesense doc count vs the count your buffer expects. | Dashboard → Indexes → Health badge. |
| Ingest lag | Seconds since the oldest unprocessed row in the buffer. | Dashboard → Diagnostics. |
| Error rate | Ratio of failed / (success + failed) ingest rows in the last 24 hours. | Dashboard → Diagnostics. |
| Rate-limit 429s | Counts of rate_limited responses, per key. | Dashboard → API keys → Usage. |
Each of these is also available as an oRPC procedure if you want to pull it into your own dashboards. The procedures and their thresholds:
| Signal | Procedure | Healthy threshold |
|---|---|---|
| Doc count drift | searchIndex.getHealth.driftPercent | Absolute value < 5 % |
| Ingest lag | searchIndex.getHealth.ingestLagSeconds | < 300 (5 minutes) |
| Error rate | searchIndex.getHealth.errorRate | < 0.01 (1 %) |
| Rate-limit 429s | searchApiKey.getUsage | Sustained < 80 % of rateLimitPerMinute |
These thresholds are the ones at which the dashboard badge changes color, so an alert set at them fires when the badge does. Tighter is fine.
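If you pull these values into your own alerting, the threshold table can be expressed as a small check. This is a minimal sketch: the field names mirror the procedure table, but the HealthSnapshot type and the findBreaches helper are illustrative assumptions, not part of the AACsearch API.

```typescript
// Assumed shape of a health reading; field names follow the procedure table,
// the type itself is hypothetical.
interface HealthSnapshot {
  driftPercent: number; // signed percentage, e.g. -7 means 7 % fewer docs than expected
  ingestLagSeconds: number;
  errorRate: number; // fraction between 0 and 1
}

type Breach = "drift" | "ingestLag" | "errorRate";

// Healthy thresholds from the table: |drift| < 5 %, lag < 300 s, error rate < 0.01.
const THRESHOLDS = { driftPercent: 5, ingestLagSeconds: 300, errorRate: 0.01 };

// Returns the signals that are outside their healthy range (empty array = healthy).
function findBreaches(h: HealthSnapshot): Breach[] {
  const breaches: Breach[] = [];
  if (Math.abs(h.driftPercent) >= THRESHOLDS.driftPercent) breaches.push("drift");
  if (h.ingestLagSeconds >= THRESHOLDS.ingestLagSeconds) breaches.push("ingestLag");
  if (h.errorRate >= THRESHOLDS.errorRate) breaches.push("errorRate");
  return breaches;
}
```

Keeping the thresholds in one object makes it easy to tighten them for your own alerts without touching the comparison logic.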
Health badge → action mapping
The Indexes list shows a colored badge per index. Here is what each badge means and the first thing to do.
Green
Everything in normal range. No action.
Yellow — ingest lag
Buffer is taking longer than 5 minutes to flush. Look at:
- Dashboard → Diagnostics. The most recent rows show error messages if any.
- The status page for your region. A degraded Typesense cluster will inflate lag.
- Your own writers. Did you start a backfill at 10× your normal write volume? The buffer is doing its job by back-pressuring you.
Yellow — error rate
More than 1 % of flushes are failing. Causes in rough order of frequency:
- Document validation. A new field shape isn't accepted. Open Diagnostics; the error message points at the offending document.
- Typesense memory pressure. Rare on the shared cluster; common when an index is far over its plan limits.
- Network blip. Look for an ETIMEDOUT pattern. If the cluster status is green and the rate has cleared on its own, it was a blip and no action is needed.
Red — doc count drift
Typesense reports a document count that is more than 5 % off the count your buffer expects. This is always worth investigating; it usually means one of:
- A reindex was interrupted and didn't clean up.
- A large delete-by-filter ran that the dashboard counter hasn't caught up with yet.
- A bug in your application that's writing duplicate IDs or missing them entirely.
Open a ticket if you don't recognize any of the three.
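To check the third cause yourself, compare the IDs your application believes it wrote with the IDs actually in the index. The sketch below assumes you can already produce both lists; how you export the indexed IDs is not covered on this page, and diffIds is a hypothetical helper, not an SDK function.

```typescript
interface IdDiff {
  duplicates: string[]; // IDs your application wrote more than once
  missing: string[]; // IDs you expected that the index does not contain
  unexpected: string[]; // IDs in the index that you did not expect
}

// Pure comparison of two ID lists; no network calls.
function diffIds(expectedIds: string[], indexedIds: string[]): IdDiff {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const id of expectedIds) {
    if (seen.has(id)) duplicates.add(id);
    seen.add(id);
  }
  const indexed = new Set<string>(indexedIds);
  const missing = Array.from(seen).filter((id) => !indexed.has(id));
  const unexpected = Array.from(indexed).filter((id) => !seen.has(id));
  return { duplicates: Array.from(duplicates), missing, unexpected };
}
```

If all three lists come back empty and the drift persists, the interrupted-reindex and delete-by-filter causes are the more likely explanations.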
Wiring metrics into your stack
The simplest path: schedule a small worker that polls searchIndex.getHealth for each index every minute and emits to your metrics pipeline.
```typescript
import { client } from "@repo/api/client";

// allIndexIds and emit are your own: the index IDs to poll and your metrics client.
for (const indexId of allIndexIds) {
  const health = await client.searchIndex.getHealth.call({ indexId });
  emit("aacsearch.index.docCountDrift", health.driftPercent, { indexId });
  emit("aacsearch.index.ingestLagSeconds", health.ingestLagSeconds, { indexId });
  emit("aacsearch.index.errorRate", health.errorRate, { indexId });
}
```

Or use the browser SDK's getHealth from a Node process. Either way, the data is the same; just don't poll faster than once every 30 seconds, because the values don't change faster than that.
Webhook delivery health
If you've configured webhooks (events on document changes, reindex completion, etc.), the Dashboard → Webhooks → Deliveries view shows:
- Last 1 000 attempts.
- 2xx vs 4xx vs 5xx breakdown.
- p50 / p95 / p99 latency.
- Retry queue depth (events waiting to be re-attempted).
A growing retry queue means your endpoint is slow or returning 5xx. Webhooks retry with exponential backoff for 24 hours, then go to a dead-letter list that you can manually replay from.
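To get a feel for how few attempts an exponential schedule allows before an event is dead-lettered, you can model it. Only "exponential backoff for 24 hours" comes from this page; the base delay and growth factor below are illustrative assumptions, not AACsearch's actual retry parameters.

```typescript
// Returns the delays (in ms) between attempts that fit inside the retry window.
// baseMs and factor are hypothetical inputs; the real values are not documented here.
function backoffSchedule(baseMs: number, factor: number, windowMs: number): number[] {
  if (baseMs <= 0 || factor < 1) {
    throw new Error("baseMs must be positive and factor must be at least 1");
  }
  const delays: number[] = [];
  let elapsed = 0;
  let delay = baseMs;
  // Keep adding retries while the next one still lands inside the window.
  while (elapsed + delay <= windowMs) {
    delays.push(delay);
    elapsed += delay;
    delay *= factor;
  }
  return delays;
}

// Example: a one-minute base delay that doubles each time, over a 24-hour window.
const exampleSchedule = backoffSchedule(60000, 2, 24 * 60 * 60 * 1000);
```

Under those assumed numbers only ten retries fit in the window, which is why an endpoint that stays down for hours ends up relying on the manual replay.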
Rate-limit usage
Every API key has a rateLimitPerMinute budget. The dashboard's API keys → Usage chart shows per-minute usage over the last 24 hours.
Two questions to answer:
- Is any key sustaining > 80 % of its budget? That key needs either a higher limit (talk to support) or a load reduction.
- Is any 429 happening in normal traffic? A 429 in normal traffic means the limit is too low; a 429 only during peaks may be acceptable if your client retries with backoff.
See Rate limits and quotas for the full model.
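The first question can be automated once you have per-minute request counts for a key. This sketch treats "sustained" as a run of consecutive minutes above 80 % of the budget; the five-minute default and the sustainedAbove helper are assumptions for illustration, and the input is whatever per-minute series you already collect.

```typescript
// True when usage stays above `fraction` of the budget for at least
// `minConsecutiveMinutes` minutes in a row.
function sustainedAbove(
  perMinuteCounts: number[],
  rateLimitPerMinute: number,
  fraction: number = 0.8,
  minConsecutiveMinutes: number = 5,
): boolean {
  const limit = rateLimitPerMinute * fraction;
  let run = 0;
  for (const count of perMinuteCounts) {
    // Extend the current run, or reset it when usage drops back under the limit.
    run = count > limit ? run + 1 : 0;
    if (run >= minConsecutiveMinutes) return true;
  }
  return false;
}
```

Requiring a consecutive run keeps short peaks from paging anyone, which matches the idea that a 429 only during peaks may be acceptable.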
What you should NOT use for monitoring
- Search response latency from end-user devices without a synthetic baseline. Mobile networks dominate the variance; you'll be alerting on Wi-Fi quality at coffee shops.
- Sentry alert on every 4xx. 4xx is the API doing its job (e.g. validation, rate limit). Alert on 5xx; investigate sustained 4xx in a dashboard.
- /api/v1/indexes HEAD pings. This is cheap for us but consumes your rate limit. Use the unauthenticated /api/v1/health endpoint if you need a synthetic probe.
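The 4xx versus 5xx rule is simple enough to encode where you record responses. A minimal sketch, with classifyStatus and its labels chosen for illustration:

```typescript
type Handling = "ok" | "dashboard" | "alert";

// 5xx pages someone; 4xx is only counted for a dashboard; everything else is fine.
function classifyStatus(status: number): Handling {
  if (status >= 500) return "alert";
  if (status >= 400) return "dashboard";
  return "ok";
}
```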
Synthetic probes
The unauthenticated, rate-limit-exempt health endpoint is GET /api/v1/health. It returns 200 { "status": "ok" } when the API gateway is up. It does not verify search; for that, run a real search against a known-good index from your application's region with a search-scope key, and emit the latency.
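A gateway probe along these lines might look like the sketch below. The endpoint path and the expected body come from this page; the base URL is yours to supply, and probeHealth, FetchLike, and the result shape are hypothetical names. The fetch function is passed in so the probe can be exercised without a network.

```typescript
// Minimal subset of a fetch response that the probe relies on.
interface ProbeResponse {
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<ProbeResponse>;

interface ProbeResult {
  healthy: boolean;
  latencyMs: number;
}

// Calls GET {baseUrl}/api/v1/health and reports health plus round-trip time.
async function probeHealth(baseUrl: string, fetchImpl: FetchLike): Promise<ProbeResult> {
  const startedAt = Date.now();
  try {
    const res = await fetchImpl(`${baseUrl}/api/v1/health`);
    const body = await res.json();
    const statusField =
      typeof body === "object" && body !== null ? (body as { status?: unknown }).status : undefined;
    const healthy = res.status === 200 && statusField === "ok";
    return { healthy, latencyMs: Date.now() - startedAt };
  } catch {
    // Network errors and unparseable bodies both count as unhealthy.
    return { healthy: false, latencyMs: Date.now() - startedAt };
  }
}
```

In Node 18 or later you can pass the global fetch as fetchImpl. Remember that a healthy result only says the gateway is up; the real-search probe described above is still what tells you search works.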
Common mistakes
- Reading the badge once a day. Yellow → red can happen in 10 minutes. Polling every 30–60 seconds is the right cadence; the dashboard caches values for 30 seconds anyway.
- Alerting on absolute doc count. Absolute counts move every time someone uploads. Alert on drift (the signal that says "they should be equal but aren't").
- Ignoring the audit log. Half of "why did the index suddenly change shape?" questions are answered by audit_log.action = update_index_settings from yesterday. See Audit logs.
See also
- Observability — request IDs, Sentry correlation, event taxonomy
- Status and incidents — our side of monitoring
- Rate limits and quotas — what a 429 means and how to recover
- Troubleshooting — decision tree for specific failures