Status and incidents
How to read the AACsearch status page and what to expect during an incident.
The single source of truth for the current state of AACsearch is status.aacsearch.com. It is hosted outside our primary infrastructure so it stays up when we don't.
Components on the status page
| Component | What it covers |
|---|---|
| Search API — EU | Public search and ingest endpoints in the Frankfurt region. |
| Search API — US | Public search and ingest endpoints in the Virginia region. |
| Search API — RU | Public search and ingest endpoints in the Moscow region. |
| Dashboard | The web UI at app.aacsearch.com. |
| Connector API | The CMS-facing connector endpoints (PrestaShop, Bitrix, others). |
| Widget delivery | The hosted widget script + CSS at app.aacsearch.com/api/widget/*. |
| Webhooks | Outbound delivery of events to your endpoints. |
| Auth | Sign-in, magic links, OAuth, passkeys. |
Each component is in one of four states:
- Operational — green. Synthetic probes pass; error rate within SLO.
- Degraded performance — yellow. Probes pass but latency or error rate is elevated.
- Partial outage — orange. Some requests are failing.
- Major outage — red. The component is broadly unavailable.
State transitions are written to the page within 60 seconds of detection.
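If you mirror component states in your own tooling, the four states can be modeled as an ordered enum. This is a minimal sketch: the identifiers and the `snapshot` data are hypothetical and are not part of any AACsearch API.

```python
from enum import IntEnum


class ComponentState(IntEnum):
    """The four status-page states, ordered from least to most severe."""
    OPERATIONAL = 0            # green
    DEGRADED_PERFORMANCE = 1   # yellow
    PARTIAL_OUTAGE = 2         # orange
    MAJOR_OUTAGE = 3           # red


def worst_state(states):
    """Return the most severe state among the given components."""
    return max(states.values(), default=ComponentState.OPERATIONAL)


# Hypothetical snapshot of three components from the table above.
snapshot = {
    "Dashboard": ComponentState.DEGRADED_PERFORMANCE,
    "Webhooks": ComponentState.OPERATIONAL,
    "Auth": ComponentState.OPERATIONAL,
}
overall = worst_state(snapshot)
```

Ordering the enum by severity lets a single `max` call answer "what is the worst thing happening right now" for any subset of components you depend on.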
Incident lifecycle
Investigating → Identified → Monitoring → Resolved → Post-mortem (within 14 days)

| Phase | What it means |
|---|---|
| Investigating | We have detected an issue but don't yet know the cause. |
| Identified | We know what is broken and are working on a fix. |
| Monitoring | The fix is deployed; we are watching for recurrence before declaring resolved. |
| Resolved | The component has been healthy for at least 15 minutes. |
| Post-mortem | For any incident with customer impact ≥ 15 minutes, we publish a public PIR within 14 days. |
We will not mark an incident Resolved until probes have been green for at least 15 minutes — even when we believe the fix is good. This is intentional. If we are wrong, we want to re-escalate, not pretend the recovery is over.
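The 15-minute rule can be expressed as a check over probe results. The sketch below is illustrative only: the probe data format and the function name are assumptions, not our internal implementation.

```python
from datetime import datetime, timedelta, timezone

GREEN_WINDOW = timedelta(minutes=15)


def can_mark_resolved(probe_results, now):
    """Return True only if probes have passed continuously for 15 minutes.

    probe_results is a list of (timestamp, passed) tuples, oldest first.
    Any failure resets the start of the green streak.
    """
    streak_start = None
    for timestamp, passed in probe_results:
        if not passed:
            streak_start = None
        elif streak_start is None:
            streak_start = timestamp
    return streak_start is not None and now - streak_start >= GREEN_WINDOW


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
probes = [
    (t0, False),                          # last failing probe
    (t0 + timedelta(minutes=1), True),    # green streak starts here
    (t0 + timedelta(minutes=10), True),
]
```

With this data, the incident cannot be resolved at 12:12 (11 minutes green) but can be at 12:16 (15 minutes green).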
Subscribing to updates
From the status page footer, Subscribe offers:
- Email — one address. Updates within ~30 seconds of state change.
- SMS — for major outage components only (avoids spam during yellow events).
- Webhook — JSON POST per state change. Use this to wire into PagerDuty, Opsgenie, or your own tooling.
- RSS / Atom — for status dashboards and aggregators.
Enterprise customers also get incident notifications via their TAM for any P1 that affects their region.
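A webhook subscriber usually needs only a small filter before forwarding to a pager. The payload schema is not described on this page, so the field names `component` and `state` and the state strings below are assumptions; check a real delivery before relying on them.

```python
import json

# Assumed state strings; the actual payload values may differ.
PAGE_ON = {"partial_outage", "major_outage"}


def should_page(raw_body, watched_components):
    """Decide whether a status webhook delivery should page someone.

    raw_body is the JSON body of the POST; watched_components is the set
    of component names your service depends on.
    """
    event = json.loads(raw_body)
    return (
        event.get("component") in watched_components
        and event.get("state") in PAGE_ON
    )


# Hypothetical delivery for the Webhooks component.
example_body = json.dumps({"component": "Webhooks", "state": "major_outage"}).encode()
```

Filtering by both component and state keeps yellow events and unrelated regions out of your on-call rotation.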
Severity levels
We classify incidents at detection time and may upgrade or downgrade as we learn more.
| Level | Customer impact | Response time |
|---|---|---|
| P1 — Critical | Search unavailable for an entire region or component widely impacted. | 24/7, immediate. |
| P2 — High | Significant degradation but most requests succeed. | 24/7, ≤ 15 minutes. |
| P3 — Medium | Localized issue (one feature, one tenant, one cluster). | Business hours, ≤ 4 h. |
| P4 — Low | Cosmetic, dashboard-only, or affecting non-production behavior. | Business hours, ≤ 1 d. |
These are our internal response targets. Enterprise customers can negotiate tighter contractual response times — see Dedicated cluster.
What we communicate during an incident
A status update typically includes:
- What is affected (component, region).
- When it started (UTC, ISO 8601).
- What we think the cause is, framed against our current uncertainty.
- What we are doing (rollback, scale-up, failover, etc.).
- Next update by (a specific time, even if no new information).
We do not post root cause guesses we're not confident in. If we don't know yet, we say "Investigating — next update at HH:MM UTC." This is by design.
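As an illustration of that structure, the sketch below renders an update from its five parts. The class and field names are hypothetical; only the content of the update follows the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StatusUpdate:
    affected: str                    # component and region
    started_at: datetime             # when the problem began
    suspected_cause: Optional[str]   # None while we are still investigating
    action: str                      # rollback, scale-up, failover, etc.
    next_update_by: datetime         # always set, even with no new information

    def render(self):
        """Format the update with UTC timestamps (ISO 8601 for the start time)."""
        started = self.started_at.astimezone(timezone.utc).isoformat(timespec="seconds")
        next_by = self.next_update_by.astimezone(timezone.utc).strftime("%H:%M")
        lines = [
            f"Affected: {self.affected}",
            f"Started: {started}",
            f"Cause: {self.suspected_cause or 'Investigating'}",
            f"Action: {self.action}",
            f"Next update by: {next_by} UTC",
        ]
        return "\n".join(lines)


update = StatusUpdate(
    affected="Dashboard",
    started_at=datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc),
    suspected_cause=None,
    action="Rolling back the latest deploy",
    next_update_by=datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc),
)
text = update.render()
```

Leaving `suspected_cause` empty produces "Investigating" rather than a guess, which mirrors the policy above.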
What you should communicate to us during an incident
If you open a P1 or P2 ticket while a public incident is in progress, the most useful information is:
- Your organization ID.
- The index slug if the problem is index-specific.
- The request ID from any failed request in your application logs (header x-request-id).
- The first time you saw the problem (UTC).
- Whether the problem is constant or intermittent, and what proportion of requests are affected.
We will not ask you for things you've already provided. We may ask whether you can reproduce the problem with browser DevTools open and export a HAR file; a single HAR sometimes resolves something faster than a 200-line description.
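Collecting these details at the moment of failure is easier than reconstructing them later. The helper below is a sketch with hypothetical names; the only detail taken from this page is the `x-request-id` response header.

```python
from datetime import datetime, timezone


def build_ticket_details(org_id, headers, first_seen, failed, total, index_slug=None):
    """Assemble the fields listed above from one failed response.

    headers are the response headers of a failed request; failed and total
    are request counts over the window in which you observed the problem.
    """
    if total <= 0:
        raise ValueError("total must be a positive request count")
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    details = {
        "organization_id": org_id,
        "request_id": lowered.get("x-request-id"),
        "first_seen_utc": first_seen.astimezone(timezone.utc).isoformat(timespec="seconds"),
        "pattern": "constant" if failed == total else "intermittent",
        "failure_ratio": round(failed / total, 3),
    }
    if index_slug:
        details["index_slug"] = index_slug
    return details


# Hypothetical values for illustration.
details = build_ticket_details(
    org_id="org_123",
    headers={"X-Request-Id": "abc-123"},
    first_seen=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    failed=5,
    total=20,
    index_slug="products",
)
```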
After an incident: post-mortems
For every incident with ≥ 15 minutes of customer impact:
- We publish a public post-incident review at status.aacsearch.com/incidents/<id> within 14 days.
- The PIR includes: timeline, contributing factors, customer impact (numbers, not adjectives), and concrete corrective actions with due dates.
- We re-link the PIR from the original incident on the status page.
We do not name engineers responsible. We do name systems, services, and decisions.
Common mistakes
- Assuming green status = your problem. A green status page means the cluster is healthy. Your problem might still be a misconfigured key or a 429. Use the troubleshooting decision tree before escalating.
- Watching the dashboard during a P1. The dashboard depends on the same auth and database the API does; during a regional outage it may itself be slow. The status page is independent.
- Re-asking for an ETA every 5 minutes. Updates are posted at the cadence we promise. If we said "next update by 14:30 UTC", we will post by 14:30 UTC.
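The first mistake can be caught with a quick check before escalating. This is a rough heuristic under stated assumptions, not the troubleshooting decision tree itself: it treats any 4xx response (for example a 429) seen while the status page is green as something to check on your side first.

```python
def check_own_side_first(status_page_green, http_status):
    """Return True when the failure is more likely client-side than platform-side.

    Heuristic only: a 4xx response while the status page is green usually
    points at the request itself (credentials, rate limits, payload).
    """
    return status_page_green and 400 <= http_status < 500


# A 429 with a green status page: review your request rate before escalating.
rate_limited = check_own_side_first(True, 429)
# A 503 with a green status page: not explained by this heuristic.
server_error = check_own_side_first(True, 503)
```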
See also
- Monitoring — your own observability, complementary to ours
- Support escalation — what to include in a ticket
- DR recovery runbook — the backstop for catastrophic incidents