Status and incidents

How to read the AACsearch status page and what to expect during an incident.

The single source of truth for the current state of AACsearch is status.aacsearch.com. It is hosted outside our primary infrastructure so it stays up when we don't.

Components on the status page

Component | What it covers
Search API — EU | Public search and ingest endpoints in the Frankfurt region.
Search API — US | Public search and ingest endpoints in the Virginia region.
Search API — RU | Public search and ingest endpoints in the Moscow region.
Dashboard | The web UI at app.aacsearch.com.
Connector API | The CMS-facing connector endpoints (PrestaShop, Bitrix, others).
Widget delivery | The hosted widget script + CSS at app.aacsearch.com/api/widget/*.
Webhooks | Outbound delivery of events to your endpoints.
Auth | Sign-in, magic links, OAuth, passkeys.

Each component is in one of four states:

  • Operational — green. Synthetic probes pass; error rate within SLO.
  • Degraded performance — yellow. Probes pass but latency or error rate is elevated.
  • Partial outage — orange. Some requests are failing.
  • Major outage — red. The component is broadly unavailable.

State transitions are written to the page within 60 seconds of detection.
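
If you mirror these states in your own tooling, one option is to model them as an ordered enum so you can compute the worst state across the components you depend on. This is a minimal sketch; the enum, the function name, and the snapshot are illustrative and not part of any AACsearch API.

```python
from enum import IntEnum


class ComponentState(IntEnum):
    """The four status-page states, ordered from least to most severe."""
    OPERATIONAL = 0            # green
    DEGRADED_PERFORMANCE = 1   # yellow
    PARTIAL_OUTAGE = 2         # orange
    MAJOR_OUTAGE = 3           # red


def worst_state(states):
    """Return the most severe state in a mapping of component name -> state.

    An empty mapping is treated as operational (nothing to report).
    """
    return max(states.values(), default=ComponentState.OPERATIONAL)


# Hypothetical snapshot of the components one integration depends on.
snapshot = {
    "Dashboard": ComponentState.OPERATIONAL,
    "Webhooks": ComponentState.DEGRADED_PERFORMANCE,
    "Auth": ComponentState.OPERATIONAL,
}
print(worst_state(snapshot).name)  # DEGRADED_PERFORMANCE
```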

Incident lifecycle

Investigating  →  Identified  →  Monitoring  →  Resolved  →  Post-mortem (within 14 days)

Phase | What it means
Investigating | We have detected an issue but don't yet know the cause.
Identified | We know what is broken and are working on a fix.
Monitoring | The fix is deployed; we are watching for recurrence before declaring resolved.
Resolved | The component has been healthy for at least 15 minutes.
Post-mortem | For any incident with customer impact ≥ 15 minutes, we publish a public post-incident review (PIR) within 14 days.

We will not mark an incident Resolved until probes have been green for at least 15 minutes — even when we believe the fix is good. This is intentional. If we are wrong, we want to re-escalate, not pretend the recovery is over.
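
The 15-minute rule above can be expressed as a small check over probe history. The sketch below is an illustration of the rule as described, not our actual implementation; the function name, the input format, and the example date are assumptions.

```python
from datetime import datetime, timedelta, timezone

GREEN_WINDOW = timedelta(minutes=15)  # from the text: green for at least 15 minutes


def can_resolve(probe_results, now):
    """Return True if probes have been continuously green for >= 15 minutes.

    probe_results: list of (timestamp, passed) tuples, oldest first.
    """
    green_since = None
    for ts, passed in probe_results:
        if passed:
            if green_since is None:
                green_since = ts  # start of the current green streak
        else:
            green_since = None  # any failure restarts the clock
    return green_since is not None and now - green_since >= GREEN_WINDOW


# Illustrative timeline: one failure at 14:00 UTC, then passing probes.
t0 = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
probes = [
    (t0, False),
    (t0 + timedelta(minutes=1), True),
    (t0 + timedelta(minutes=6), True),
]
print(can_resolve(probes, t0 + timedelta(minutes=10)))  # False: green for 9 minutes

recovered = probes + [(t0 + timedelta(minutes=11), True)]
print(can_resolve(recovered, t0 + timedelta(minutes=16)))  # True: green since 14:01
```

A failed probe at any point resets the streak, which matches the intent of re-escalating rather than declaring recovery early.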

Subscribing to updates

From the status page footer, Subscribe offers:

  • Email — one address. Updates within ~30 seconds of state change.
  • SMS — for major outage components only (avoids spam during yellow events).
  • Webhook — JSON POST per state change. Use this to wire into PagerDuty, Opsgenie, or your own tooling.
  • RSS / Atom — for status dashboards and aggregators.

Enterprise customers also get incident notifications via their technical account manager (TAM) for any P1 that affects their region.
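
If you use the Webhook option above, your receiver needs to decide which state changes should page someone. The sketch below shows one way to do that. The payload schema is not specified on this page, so the field names (`component`, `state`) and the state strings are assumptions; check an actual delivery before relying on them.

```python
import json

# Assumed state strings; the real payload may use different values.
PAGE_ON = {"partial_outage", "major_outage"}


def should_page(raw_body, watched_components):
    """Decide whether a status webhook delivery should page on-call.

    raw_body: the JSON body of the POST, as bytes or str.
    watched_components: names of the components this team depends on.
    """
    try:
        event = json.loads(raw_body)
    except (TypeError, ValueError):
        return False  # malformed delivery: log it, but do not page
    if not isinstance(event, dict):
        return False
    return (
        event.get("component") in watched_components
        and event.get("state") in PAGE_ON
    )


# Hypothetical delivery for a component this team watches.
body = json.dumps({"component": "Webhooks", "state": "major_outage"})
print(should_page(body, {"Webhooks", "Auth"}))  # True
```

Keeping the decision in a pure function makes it easy to test separately from whichever HTTP framework receives the POST.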

Severity levels

We classify incidents at detection time and may upgrade or downgrade as we learn more.

Level | Customer impact | Response time
P1 — Critical | Search unavailable for an entire region or component widely impacted. | 24/7, immediate.
P2 — High | Significant degradation but most requests succeed. | 24/7, ≤ 15 minutes.
P3 — Medium | Localized issue (one feature, one tenant, one cluster). | Business hours, ≤ 4 h.
P4 — Low | Cosmetic, dashboard-only, or affecting non-production behavior. | Business hours, ≤ 1 d.

These are our internal response targets. Enterprise customers can negotiate tighter contractual response times — see Dedicated cluster.
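
If you track these targets in your own runbooks, the table can be encoded as a lookup. This is a sketch under stated assumptions: the names are illustrative, "≤ 1 d" is read as one calendar day, and deadlines are only computed for the 24/7 levels because this page does not define business hours.

```python
from datetime import datetime, timedelta, timezone

# Severity level -> (coverage, response target), taken from the table above.
TARGETS = {
    "P1": ("24/7", timedelta(0)),                  # immediate
    "P2": ("24/7", timedelta(minutes=15)),
    "P3": ("business hours", timedelta(hours=4)),
    "P4": ("business hours", timedelta(days=1)),   # assumed: one calendar day
}


def response_due(level, detected_at):
    """Return the latest expected response time for a 24/7 level.

    Returns None for business-hours levels, since the business-hours
    calendar is not defined here and cannot be computed reliably.
    """
    coverage, limit = TARGETS[level]
    if coverage != "24/7":
        return None
    return detected_at + limit


detected = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)  # illustrative time
print(response_due("P2", detected).isoformat())  # 2024-01-01T14:15:00+00:00
```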

What we communicate during an incident

A status update typically includes:

  • What is affected (component, region).
  • When it started (UTC, ISO 8601).
  • What we think the cause is, with an indication of how confident we are.
  • What we are doing (rollback, scale-up, failover, etc.).
  • Next update by (a specific time, even if no new information).

We do not post root cause guesses we're not confident in. If we don't know yet, we say "Investigating — next update at HH:MM UTC." This is by design.

What you should communicate to us during an incident

If you open a P1 or P2 ticket while a public incident is in progress, the most useful information is:

  • Your organization ID.
  • The index slug if the problem is index-specific.
  • The request ID from any failed request in your application logs (header x-request-id).
  • The first time you saw the problem (UTC).
  • Whether the problem is constant or intermittent, and what proportion of requests are affected.

We will not ask you for things you've already provided. We may ask you to reproduce the problem with your browser's DevTools open and send us the resulting HAR file; a single HAR sometimes resolves an issue faster than a 200-line description.
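
The ticket details listed above can be gathered automatically from your own request logs. The sketch below assembles them into one report. Only the x-request-id header name comes from this page; the function names, the report keys, and the example values are hypothetical.

```python
from datetime import datetime, timezone


def request_id(headers):
    """Return the x-request-id value, matching the header name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    return None


def build_ticket_report(org_id, first_seen, outcomes, index_slug=None):
    """Summarize recent requests for a support ticket.

    first_seen: timezone-aware datetime of the first observed failure.
    outcomes: list of (succeeded, response_headers) tuples, one per request.
    """
    failures = [headers for ok, headers in outcomes if not ok]
    ids = [rid for rid in map(request_id, failures) if rid is not None]
    if not failures:
        pattern = "none"
    elif len(failures) == len(outcomes):
        pattern = "constant"
    else:
        pattern = "intermittent"
    return {
        "organization_id": org_id,
        "index_slug": index_slug,
        "request_ids": ids,
        "first_seen_utc": first_seen.astimezone(timezone.utc).isoformat(timespec="seconds"),
        "pattern": pattern,
        "failure_ratio": len(failures) / len(outcomes) if outcomes else 0.0,
    }


# Hypothetical sample: two of four requests failed.
outcomes = [
    (True, {}),
    (False, {"X-Request-Id": "req-123"}),
    (True, {}),
    (False, {"x-request-id": "req-456"}),
]
report = build_ticket_report(
    "org_example",
    datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc),
    outcomes,
    index_slug="products",
)
print(report["pattern"], report["failure_ratio"])  # intermittent 0.5
```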

After an incident: post-mortems

For every incident with ≥ 15 minutes of customer impact:

  1. We publish a public post-incident review at status.aacsearch.com/incidents/<id> within 14 days.
  2. The PIR includes: timeline, contributing factors, customer impact (numbers, not adjectives), and concrete corrective actions with due dates.
  3. We re-link the PIR from the original incident on the status page.

We do not name engineers responsible. We do name systems, services, and decisions.

Common mistakes

  • Assuming a green status rules out problems on your side. A green status page means the cluster is healthy; it says nothing about your own integration. Your problem might still be a misconfigured key or a 429 (rate limit) response. Use the troubleshooting decision tree before escalating.
  • Watching the dashboard during a P1. The dashboard depends on the same auth and database the API does; during a regional outage it may itself be slow. The status page is independent.
  • Re-asking for an ETA every 5 minutes. Updates are posted at the cadence we promise. If we said "next update by 14:30 UTC", we will post by 14:30 UTC.
