Status and incidents

The single source of truth for the current state of AACsearch is `status.aacsearch.com`. It is hosted outside our primary infrastructure so it stays up when we don't.

Components on the status page

Component	What it covers
Search API — EU	Public search and ingest endpoints in the Frankfurt region.
Search API — US	Public search and ingest endpoints in the Virginia region.
Search API — RU	Public search and ingest endpoints in the Moscow region.
Dashboard	The web UI at `app.aacsearch.com`.
Connector API	The CMS-facing connector endpoints (PrestaShop, Bitrix, others).
Widget delivery	The hosted widget script + CSS at `app.aacsearch.com/api/widget/*`.
Webhooks	Outbound delivery of events to your endpoints.
Auth	Sign-in, magic links, OAuth, passkeys.

Each component is in one of four states:

Operational — green. Synthetic probes pass; error rate within SLO.
Degraded performance — yellow. Probes pass but latency or error rate is elevated.
Partial outage — orange. Some requests are failing.
Major outage — red. The component is broadly unavailable.

State transitions are written to the page within 60 seconds of detection.

Incident lifecycle

Investigating  →  Identified  →  Monitoring  →  Resolved
                                              ↓
                                       Post-mortem (within 14 days)

Phase	What it means
Investigating	We have detected an issue but don't yet know the cause.
Identified	We know what is broken and are working on a fix.
Monitoring	The fix is deployed; we are watching for recurrence before declaring resolved.
Resolved	The component has been healthy for at least 15 minutes.
Post-mortem	For any incident with customer impact ≥ 15 minutes, we publish a public post-incident review (PIR) within 14 days.

We will not mark an incident Resolved until probes have been green for at least 15 minutes — even when we believe the fix is good. This is intentional. If we are wrong, we want to re-escalate, not pretend the recovery is over.
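The 15-minute rule can be sketched as a check over recent probe results. This is a minimal illustration, not our internal tooling: the probe representation (a list of timestamp and pass/fail pairs) and the function names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# From the rule above: probes must be green for at least 15 minutes.
GREEN_WINDOW = timedelta(minutes=15)


def green_streak_start(probes):
    """Return the timestamp of the first probe in the current unbroken run
    of passing probes, or None if the latest probe failed.

    `probes` is a hypothetical list of (timestamp, passed) tuples,
    sorted oldest first.
    """
    start = None
    for timestamp, passed in reversed(probes):
        if not passed:
            break
        start = timestamp
    return start


def can_mark_resolved(probes, now):
    """True only if the current green streak is at least GREEN_WINDOW long."""
    start = green_streak_start(probes)
    return start is not None and now - start >= GREEN_WINDOW
```

A single failing probe resets the streak, which mirrors the intent of re-escalating rather than declaring recovery early.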

Subscribing to updates

From the status page footer, Subscribe offers:

Email — one address. Updates within ~30 seconds of state change.
SMS — for major outage components only (avoids spam during yellow events).
Webhook — JSON POST per state change. Use this to wire into PagerDuty, Opsgenie, or your own tooling.
RSS / Atom — for status dashboards and aggregators.

Enterprise customers also get incident notifications via their TAM for any P1 that affects their region.
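If you consume the webhook in your own tooling, a small routing function is usually enough to decide whether a state change should page someone. The sketch below assumes a payload with `component` and `state` fields and state values such as `major_outage`; this page only specifies "JSON POST per state change", so treat those names as hypothetical and check them against a real delivery before relying on them.

```python
import json

# Assumed state identifiers; the real payload may spell these differently.
PAGE_STATES = {"partial_outage", "major_outage"}
NOTIFY_STATES = {"degraded_performance"}


def route_status_event(body: bytes, watched_components: set) -> str:
    """Decide what to do with one webhook delivery.

    Returns "page", "notify", "log", or "ignore".
    """
    event = json.loads(body)
    component = event.get("component")  # hypothetical field name
    state = event.get("state")          # hypothetical field name

    if component not in watched_components:
        return "ignore"
    if state in PAGE_STATES:
        return "page"
    if state in NOTIFY_STATES:
        return "notify"
    return "log"
```

For example, a team that only serves EU traffic might pass `{"Search API — EU", "Auth"}` as `watched_components` and forward only the `"page"` results to PagerDuty or Opsgenie.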

Severity levels

We classify incidents at detection time and may upgrade or downgrade as we learn more.

Level	Customer impact	Response time
P1 — Critical	Search unavailable for an entire region or component widely impacted.	24/7, immediate.
P2 — High	Significant degradation but most requests succeed.	24/7, ≤ 15 minutes.
P3 — Medium	Localized issue (one feature, one tenant, one cluster).	Business hours, ≤ 4 h.
P4 — Low	Cosmetic, dashboard-only, or affecting non-production behavior.	Business hours, ≤ 1 d.

These are our internal response targets. Enterprise customers can negotiate tighter contractual response times — see Dedicated cluster.
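If you track our response targets in your own incident tooling, the table above can be encoded as a small lookup. This is a sketch under assumptions: the names `ResponseTarget`, `TARGETS`, and `respond_by` are illustrative, and because this page does not define the business-hours calendar, the function deliberately returns no deadline for P3 and P4.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ResponseTarget:
    coverage: str      # "24/7" or "business hours"
    within: timedelta  # timedelta(0) stands for "immediate"


# Values copied from the severity table above.
TARGETS = {
    "P1": ResponseTarget("24/7", timedelta(0)),
    "P2": ResponseTarget("24/7", timedelta(minutes=15)),
    "P3": ResponseTarget("business hours", timedelta(hours=4)),
    "P4": ResponseTarget("business hours", timedelta(days=1)),
}


def respond_by(level: str, opened_at: datetime) -> Optional[datetime]:
    """Latest expected first response for a 24/7 level.

    Returns None for business-hours levels, since the business-hours
    calendar is not specified here and cannot be computed reliably.
    """
    target = TARGETS[level]
    if target.coverage != "24/7":
        return None
    return opened_at + target.within
```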

What we communicate during an incident

A status update typically includes:

What is affected (component, region).
When it started (UTC, ISO 8601).
What we think the cause is, framed against our current uncertainty.
What we are doing (rollback, scale-up, failover, etc.).
Next update by (a specific time, even if no new information).

We do not post root cause guesses we're not confident in. If we don't know yet, we say "Investigating — next update at HH:MM UTC." This is by design.
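The fields listed above map naturally onto a small record type. The following sketch is illustrative only: the class, field names, and output layout are assumptions and not the format the status page actually emits. It shows one way to keep "cause unknown" explicit instead of guessing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StatusUpdate:
    component: str
    region: str
    started_at: datetime       # must be timezone-aware
    action: str                # e.g. rollback, scale-up, failover
    next_update_by: datetime   # must be timezone-aware
    suspected_cause: Optional[str] = None  # None while still investigating


def render_update(update: StatusUpdate) -> str:
    """Render the update as plain text with UTC, ISO 8601 timestamps."""
    started = update.started_at.astimezone(timezone.utc)
    next_by = update.next_update_by.astimezone(timezone.utc)
    cause = update.suspected_cause or "under investigation"
    return "\n".join([
        f"Affected: {update.component} ({update.region})",
        f"Started: {started.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"Cause: {cause}",
        f"Action: {update.action}",
        f"Next update by: {next_by.strftime('%H:%M')} UTC",
    ])
```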

What you should communicate to us during an incident

If you open a P1 or P2 ticket while a public incident is in progress, the most useful information is:

Your organization ID.
The index slug if the problem is index-specific.
The request ID from any failed request in your application logs (header `x-request-id`).
The first time you saw the problem (UTC).
Whether the problem is constant or intermittent, and what proportion of requests are affected.

We will not ask you for things you've already provided. We may ask you to reproduce the problem with browser DevTools open and export a HAR file; a single HAR sometimes resolves an issue faster than a 200-line description.
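Collecting these details is easier if your application records them as failures happen. The sketch below assumes you already log each failed request's time and response headers; the function names and the shape of the `failures` list are hypothetical. The only detail taken from this page is the `x-request-id` header.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def extract_request_id(headers: Mapping[str, str]) -> Optional[str]:
    """Find x-request-id regardless of header-name casing."""
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    return None


def build_ticket_details(org_id, index_slug, failures, total_requests):
    """Summarize failures for a support ticket.

    `failures` is a hypothetical list of (seen_at, headers) tuples, where
    seen_at is a timezone-aware datetime. `total_requests` is the number of
    requests sent over the same period.
    """
    if not failures or total_requests <= 0:
        raise ValueError("need at least one failure and a positive request count")

    first_seen = min(seen_at for seen_at, _ in failures)
    request_ids = [
        rid for _, headers in failures
        if (rid := extract_request_id(headers)) is not None
    ]
    return {
        "organization_id": org_id,
        "index_slug": index_slug,  # None if the problem is not index-specific
        "request_ids": request_ids,
        "first_seen_utc": first_seen.astimezone(timezone.utc).isoformat(),
        "affected_ratio": len(failures) / total_requests,
    }
```

The `affected_ratio` value answers the "what proportion of requests are affected" question directly, which is usually more useful than "some requests fail".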

After an incident: post-mortems

For every incident with ≥ 15 minutes of customer impact:

We publish a public post-incident review (PIR) at `status.aacsearch.com/incidents/<id>` within 14 days.
The PIR includes: timeline, contributing factors, customer impact (numbers, not adjectives), and concrete corrective actions with due dates.
We link the PIR from the original incident on the status page.

We do not name the engineers involved. We do name systems, services, and decisions.

Common mistakes

Assuming a green status page means your integration is fine. A green status page means the cluster is healthy; your problem might still be a misconfigured key or a 429. Use the troubleshooting decision tree before escalating.
Watching the dashboard during a P1. The dashboard depends on the same auth and database the API does; during a regional outage it may itself be slow. The status page is independent.
Re-asking for an ETA every 5 minutes. Updates are posted at the cadence we promise. If we said "next update by 14:30 UTC", we will post by 14:30 UTC.
