Observability

Request IDs, structured logs, analytics events, and the way AACsearch correlates with your error tracking.

Monitoring covers the four signals to watch. Observability is the plumbing underneath: request IDs that cross system boundaries, the logging conventions we follow, the analytics event taxonomy, and how the pieces correlate.

This page is mostly relevant when you're integrating AACsearch into a larger system that already has its own request-tracing, error-tracking, and event pipelines.

Request IDs

Every response from the public API carries an x-request-id header. The value is also written to:

  • The audit log row (in details.requestId when applicable).
  • Structured server logs (the requestContext middleware sets requestId, orgId, indexId).
  • Webhook delivery records.
  • Sentry breadcrumbs (when Sentry is enabled on our side).

Propagate this ID through your own systems. Capture it from the response header, write it into your application logs, and include it in your error tracker's extra payload. When you escalate to support, that one value lets us join your logs to ours.

You can also send your own request ID by setting x-request-id on the request. We will use the value you send if it's non-empty and ≤ 128 chars, and you'll see the same value on the response. This is the easiest way to make a single ID flow through a multi-service request.
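
As a minimal sketch of that flow, the helper below attaches a caller-supplied ID to the outgoing request and returns whichever ID the response reports. The names FetchLike, isUsableRequestId, and searchWithRequestId are illustrative and not part of any AACsearch SDK. The length check mirrors the limit stated above; counting it as JavaScript string length is an assumption.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
// The global fetch in Node 18+ and in browsers satisfies FetchLike.
interface ResponseLike {
	ok: boolean;
	status: number;
	headers: { get(name: string): string | null };
}
type FetchLike = (
	url: string,
	init?: { headers?: Record<string, string> },
) => Promise<ResponseLike>;

// Mirrors the documented rule: non-empty and at most 128 characters.
// Assumption: length is counted as JavaScript string length.
function isUsableRequestId(id: string): boolean {
	return id.length > 0 && id.length <= 128;
}

// Hypothetical helper: sends our own ID when it is usable and returns
// the ID found on the response so the caller can log it.
async function searchWithRequestId(
	fetchImpl: FetchLike,
	url: string,
	requestId: string,
	headers: Record<string, string> = {},
): Promise<{ status: number; requestId: string | null }> {
	const outgoing: Record<string, string> = { ...headers };
	if (isUsableRequestId(requestId)) {
		outgoing["x-request-id"] = requestId;
	}
	const res = await fetchImpl(url, { headers: outgoing });
	return { status: res.status, requestId: res.headers.get("x-request-id") };
}
```

Passing the fetch implementation as a parameter keeps the helper easy to exercise with a stub. In an application you would pass the global fetch and write the returned requestId into your own logs.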

Logging conventions

We use pino for structured logging on the server side, via @repo/logs. Every log line has at minimum:

Field     | Meaning
----------|--------------------------------------------
level     | trace / debug / info / warn / error / fatal
time      | Unix ms.
requestId | If the call is in a request context.
orgId     | Verified organization (after auth gate).
indexId   | If applicable.
msg       | Short, parseable summary.

When a log line records an error, we additionally include err.name, err.message, and a sampled err.stack. We do not log request bodies by default, because that would leak document payloads.

If you're running self-hosted (when available; see Dedicated cluster), the same conventions apply. The log level is set by the LOG_LEVEL environment variable; the default is info.
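
Because every line carries requestId, joining logs during an incident reduces to a filter. The sketch below assumes the logs are available as newline-delimited JSON (pino's default output) and that level may appear either as a label or as pino's numeric level. LogLine and linesForRequest are illustrative names, not part of @repo/logs.

```typescript
// Shape of one log line, following the field table above.
interface LogLine {
	level: string | number; // label, or pino's numeric level (assumption)
	time: number; // Unix ms
	msg: string;
	requestId?: string;
	orgId?: string;
	indexId?: string;
}

// Returns the lines that belong to one request, oldest first.
// Lines that are not JSON objects are skipped rather than aborting the scan.
function linesForRequest(ndjson: string, requestId: string): LogLine[] {
	const matches: LogLine[] = [];
	for (const raw of ndjson.split("\n")) {
		if (raw.trim() === "") continue;
		let parsed: unknown;
		try {
			parsed = JSON.parse(raw);
		} catch {
			continue; // tolerate truncated or non-JSON lines
		}
		if (typeof parsed !== "object" || parsed === null) continue;
		const line = parsed as LogLine;
		if (line.requestId === requestId) matches.push(line);
	}
	return matches.sort((a, b) => a.time - b.time);
}
```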

Analytics event taxonomy

AACsearch records analytics events that are useful both for product analytics and for operational debugging. The taxonomy:

Event name                | When fired                                     | Notable properties
--------------------------|------------------------------------------------|-------------------------------
search.executed           | Every successful public search.                | indexId, query, hits, latencyMs
search.zero_results       | Search returned 0 hits.                        | indexId, query
search.result_clicked     | Click-through reported via the events endpoint. | indexId, documentId, position
search.rate_limited       | Request denied at the rate-limit gate.         | keyId, limit
ingest.bulk_upsert        | Buffer flush succeeded for a batch.            | indexId, count
ingest.bulk_upsert_failed | Buffer flush failed for a batch.               | indexId, count, errorClass
index.reindex_started     | Reindex kicked off.                            | indexId, fromVersion, toVersion
index.reindex_completed   | Reindex succeeded; alias swapped.              | indexId, toVersion, durationMs
index.reindex_failed      | Reindex aborted.                               | indexId, toVersion, errorClass
webhook.delivery_failed   | Webhook target returned non-2xx.               | endpoint, attempt, statusCode

Events are accessible via searchAnalytics.list (dashboard data) and searchAnalytics.export (NDJSON). Retention: 30 days for high-volume events, 365 days for index.*.
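
For operational debugging, a common first step is to tally an NDJSON export by index and event name. The sketch below assumes each exported line is a JSON object with a name field and a properties object; that record shape is hypothetical, so check it against a real searchAnalytics.export payload before relying on it. ExportedEvent and countByIndexAndName are illustrative names.

```typescript
// Hypothetical shape of one exported line; only the event names and
// property names come from the taxonomy table.
interface ExportedEvent {
	name: string;
	properties?: { indexId?: string; [key: string]: unknown };
}

// Tallies events per index and per event name.
// Events without an indexId (for example search.rate_limited) are
// grouped under "(no index)".
function countByIndexAndName(ndjson: string): Map<string, Map<string, number>> {
	const counts = new Map<string, Map<string, number>>();
	for (const raw of ndjson.split("\n")) {
		if (raw.trim() === "") continue;
		const event = JSON.parse(raw) as ExportedEvent;
		const indexId = event.properties?.indexId ?? "(no index)";
		const perIndex = counts.get(indexId) ?? new Map<string, number>();
		perIndex.set(event.name, (perIndex.get(event.name) ?? 0) + 1);
		counts.set(indexId, perIndex);
	}
	return counts;
}
```

Dividing the search.zero_results count by the search.executed count for an index gives a zero-result rate, provided a zero-hit search emits both events. The table does not state whether it does, so treat that ratio as an assumption to verify.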

For product analytics, PostHog is wired up for dashboard events. You can install your own PostHog project key in your application; the AACsearch dashboard uses ours and the two streams don't mix.

Error correlation with Sentry

If your application uses Sentry, the recommended pattern is:

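import * as Sentry from "@sentry/nextjs";

// Declared outside the try block so the catch handler can read them.
let res: Response | undefined;
let requestId: string | null = null;

try {
	res = await fetch(url, { headers });
	// Capture the ID before checking the status so failures are tagged too.
	requestId = res.headers.get("x-request-id");
	if (!res.ok) {
		throw new Error(`search failed: ${res.status} (requestId=${requestId})`);
	}
} catch (err) {
	// requestId stays null when fetch itself rejects (no response received).
	Sentry.captureException(err, {
		tags: { component: "aacsearch", requestId },
		extra: { url, status: res?.status },
	});
	throw err;
}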

Tagging by requestId means a Sentry search by request ID returns every event your application captured for that request. When you forward a Sentry link to support, we can match it to our server log on the same ID.

We do not ingest your Sentry stream on our side. The correlation is by shared request ID, not by shared infrastructure.

Health endpoints

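Endpoint                           | Auth           | Returns
-----------------------------------|----------------|----------------------------------------------------
GET /api/v1/health                 | None           | { "status": "ok" } (200) when the API gateway is up.
GET /api/v1/health/region          | None           | Per-region cluster reachability snapshot.
GET /api/orpc/searchIndex.getHealth | Bearer (admin) | Per-index drift / lag / error rate.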

The unauthenticated endpoints are rate-limit-exempt. Use them for synthetic probes; don't use authenticated endpoints for that purpose because they will consume your rate budget.
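
A synthetic probe against the unauthenticated endpoint can be as small as the sketch below. It treats the gateway as healthy only when the response is 200 and the body's status field is "ok", as listed in the table. ProbeFetch, ProbeResult, and probeHealth are illustrative names, and the base URL is whatever host your deployment uses.

```typescript
// Minimal structural types; the global fetch satisfies ProbeFetch.
interface ProbeResponse {
	status: number;
	json(): Promise<unknown>;
}
type ProbeFetch = (url: string) => Promise<ProbeResponse>;

interface ProbeResult {
	healthy: boolean;
	httpStatus: number | null; // null when no usable response was read
	latencyMs: number;
}

// Hypothetical probe: one GET to /api/v1/health, timed end to end.
async function probeHealth(fetchImpl: ProbeFetch, baseUrl: string): Promise<ProbeResult> {
	const started = Date.now();
	try {
		const res = await fetchImpl(`${baseUrl}/api/v1/health`);
		const body = await res.json();
		const statusOk =
			typeof body === "object" &&
			body !== null &&
			(body as { status?: unknown }).status === "ok";
		return {
			healthy: res.status === 200 && statusOk,
			httpStatus: res.status,
			latencyMs: Date.now() - started,
		};
	} catch {
		// Network failure, or a body that could not be parsed as JSON.
		return { healthy: false, httpStatus: null, latencyMs: Date.now() - started };
	}
}
```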

Webhook delivery as observability

When you configure outbound webhooks, the deliveries dashboard is itself an observability surface:

  • Tracks the last 1,000 attempts per endpoint.
  • Shows latency percentiles.
  • Lists the dead-letter queue (failed past 24 h of retries).

If you're using webhooks for system-of-record updates, also poll for the same data once a day. Webhooks are best-effort with retry; they are not a replacement for a reconciliation loop.
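
The reconciliation loop can be sketched as a diff between what webhooks delivered and what a daily poll returns. The sketch assumes each record carries an id and a monotonically increasing version. Both fields are hypothetical, as are the names RecordVersion and findMissedUpdates; substitute whatever change marker your data actually has.

```typescript
// Hypothetical record marker: any monotonically increasing value works
// (a version counter, an updated-at timestamp in ms, and so on).
interface RecordVersion {
	id: string;
	version: number;
}

// Returns polled records that webhooks never delivered, or delivered
// only in an older version. These are the updates to re-apply.
function findMissedUpdates(
	appliedFromWebhooks: Map<string, number>,
	polled: RecordVersion[],
): RecordVersion[] {
	return polled.filter((record) => {
		const applied = appliedFromWebhooks.get(record.id);
		return applied === undefined || applied < record.version;
	});
}
```

A non-empty result on a regular basis is itself a signal worth tracking, since it means deliveries are being lost or are exhausting their retries.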

Dashboards and alerts

We do not provide pre-built Grafana / Datadog dashboards as a self-serve artifact today. The supported integration path is:

  1. Poll searchIndex.getHealth from a worker on your side.
  2. Emit metrics into your stack with consistent tags (orgId, indexId, region).
  3. Alert on the thresholds from Monitoring.
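
The first two steps can be sketched as follows. The shape of the health payload (docCountDrift, ingestLagSeconds, errorRate) is an assumption inferred from the metric names used elsewhere on this page, and the aacsearch.index. prefix on errorRate is likewise assumed. getHealth and emit are caller-supplied stand-ins for your API client and metrics client; this is not the reference worker.

```typescript
// Assumed payload shape; confirm against a real searchIndex.getHealth response.
interface IndexHealth {
	docCountDrift: number;
	ingestLagSeconds: number;
	errorRate: number;
}
interface MetricTags {
	orgId: string;
	indexId: string;
	region: string;
}
interface MetricPoint {
	name: string;
	value: number;
	tags: MetricTags;
	timestampMs: number;
}

// Converts one health snapshot into metric points that all share the same tags.
function toMetricPoints(health: IndexHealth, tags: MetricTags, nowMs: number): MetricPoint[] {
	const values: [string, number][] = [
		["docCountDrift", health.docCountDrift],
		["ingestLagSeconds", health.ingestLagSeconds],
		["errorRate", health.errorRate],
	];
	return values.map(([field, value]) => ({
		name: `aacsearch.index.${field}`,
		value,
		tags,
		timestampMs: nowMs,
	}));
}

// One polling cycle: fetch health, then hand each point to the metrics client.
async function pollOnce(
	getHealth: () => Promise<IndexHealth>,
	emit: (point: MetricPoint) => void,
	tags: MetricTags,
): Promise<void> {
	const health = await getHealth();
	for (const point of toMetricPoints(health, tags, Date.now())) {
		emit(point);
	}
}
```

Run pollOnce on a timer in your worker. Keeping the tag set identical across all three metrics is what makes them joinable in a dashboard.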

A small reference worker is available on request — email support@aacsearch.com and we'll send a tarball.

Incident severity in your own observability

Our public severity model is described in Status and incidents. When you wire alerts into your own stack, a loose mapping is:

  • Page someone (P1-equivalent on your side) when aacsearch.index.docCountDrift > 0.05 sustained for 5 minutes, or aacsearch.index.ingestLagSeconds > 600 sustained.
  • Open a ticket (P2/P3) when errorRate > 0.01 sustained or rate_limited > 0 sustained.
  • Log only (no page) when latencyMs p95 shifts by a fixed amount; variance there is mostly your network.
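
One way to evaluate "sustained" is to require that every sample in the window exceeds the threshold and that the history reaches back to the start of the window. The sketch below applies that rule to the page and ticket thresholds listed above. Using the same 5-minute window for every metric is an assumption, since only the drift rule states a duration, and Sample, sustainedAbove, and classify are illustrative names.

```typescript
interface Sample {
	timestampMs: number;
	value: number;
}
type AlertAction = "page" | "ticket" | "none";

// True when the history reaches back to the window start and every
// sample inside the window is above the threshold.
function sustainedAbove(
	samples: Sample[],
	threshold: number,
	windowMs: number,
	nowMs: number,
): boolean {
	const windowStart = nowMs - windowMs;
	const hasFullHistory = samples.some((s) => s.timestampMs <= windowStart);
	const inWindow = samples.filter(
		(s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs,
	);
	return (
		hasFullHistory &&
		inWindow.length > 0 &&
		inWindow.every((s) => s.value > threshold)
	);
}

interface IndexSeries {
	docCountDrift: Sample[];
	ingestLagSeconds: Sample[];
	errorRate: Sample[];
	rateLimited: Sample[];
}

// Applies the loose mapping; page-level conditions win over ticket-level ones.
// Assumption: the 5-minute default window is used for every metric.
function classify(
	series: IndexSeries,
	nowMs: number,
	windowMs: number = 5 * 60 * 1000,
): AlertAction {
	const over = (samples: Sample[], threshold: number) =>
		sustainedAbove(samples, threshold, windowMs, nowMs);
	if (over(series.docCountDrift, 0.05) || over(series.ingestLagSeconds, 600)) {
		return "page";
	}
	if (over(series.errorRate, 0.01) || over(series.rateLimited, 0)) {
		return "ticket";
	}
	return "none";
}
```

Requiring full history avoids paging on a single bad sample right after a worker restart. The trade-off is that the first alert is delayed by one window.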

Common mistakes

  • No request ID propagation. Without it, joining your logs to ours during an incident is expensive (we have to brute-force on org ID + minute window).
  • Alerting on every event. The PostHog event stream is for analysis, not alerts. Page off metrics; analyze events.
  • Treating webhooks as a source of truth. Webhooks are notifications, not your system of record. They retry, but they're not transactional.
