Performance, SLA, caching & scaling
How AACsearch scales, where the cache layers sit, what SLA each plan tier carries, and the levers you have when latency or throughput becomes the bottleneck.
This page is the engineering reference for how fast AACsearch is, why, and what you can change. For the customer-facing plan matrix see Plans & Limits; for 429s specifically see Operations → Rate limits.
SLA by plan tier
| Plan | Read p95 target | Write ack p95 | Uptime | Support response |
|---|---|---|---|---|
| Free | best-effort | best-effort | 99.0% | Community |
| Starter | 250 ms | 500 ms | 99.5% | Email, 1 business day |
| Pro | 150 ms | 300 ms | 99.9% | Priority, 4 h |
| Business | 100 ms | 200 ms | 99.9% | Dedicated, 1 h |
| Enterprise | Custom | Custom | 99.95% | SLA contract |
Read p95 is the time from POST /search reaching the API gateway to a 200 response leaving it, measured at the org level for the trailing 30 days. Write ack is the time from a write request being accepted to the write being visible in subsequent reads (alias-swap latency for reindexes is not counted — see Reindexing).
Targets are SLOs, not contractual SLAs, except for Enterprise where the SLA contract supersedes this table.
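Both SLO columns come down to simple arithmetic. The sketch below uses the nearest-rank definition of a percentile. That is one common choice, and the dashboard may interpolate differently. It also converts an uptime percentage into allowed downtime, assuming a 30-day window like the read-p95 window. The function names are illustrative and do not come from any AACsearch package.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (!(p > 0 && p <= 100)) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // 1-based rank; multiply before dividing to limit floating-point error.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Minutes of downtime permitted by an uptime target over windowDays (assumed 30).
function allowedDowntimeMinutes(uptimePct: number, windowDays = 30): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - uptimePct / 100);
}

// Example: p95 of a small latency sample, and the 99.9% (Pro/Business) budget.
const latenciesMs = [42, 55, 61, 70, 88, 93, 101, 120, 140, 210];
console.log(percentile(latenciesMs, 95)); // 210
console.log(allowedDowntimeMinutes(99.9).toFixed(1)); // "43.2"
```

Over 30 days, 99.9% uptime allows 43,200 × 0.001 = 43.2 minutes of downtime. 99.95% allows 21.6 minutes, and 99.0% allows 432 minutes (7.2 hours).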
Read path latency budget
A POST /search request flows through five hops. The default budget per hop:
| Hop | Budget | Code path |
|---|---|---|
| API gateway + auth | < 5 ms | packages/api/modules/search/lib/access.ts token verification |
| Tenant filter compilation | < 2 ms | packages/search/lib/search.ts builds the scoped filter_by |
| Policy cache lookup | < 1 ms | packages/search/lib/policy-cache.ts (TTL 60 s, see below) |
| Typesense multi_search | 30–100 ms | Depends on collection size, vector dims, and per_page |
| Response shaping + log | < 5 ms | Includes analytics-event emission (fire-and-forget) |
Total p95 target on Pro: 150 ms for a per_page=10 query against a 100k-document collection.
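As a rough check, you can sum the per-hop budgets and compare the total with a tier target. The sketch counts each "< N ms" hop at N and the Typesense hop at the top of its 30–100 ms range. Budgets are ceilings, not measured percentiles, and per-hop p95s do not add exactly to an end-to-end p95. Treat the result as headroom guidance only. The names are illustrative.

```typescript
// Per-hop budgets from the read-path table, in milliseconds.
// "< N ms" hops are counted at N; Typesense is counted at the top of 30–100 ms.
interface HopBudget {
  hop: string;
  budgetMs: number;
}

const READ_PATH_BUDGETS: HopBudget[] = [
  { hop: "API gateway + auth", budgetMs: 5 },
  { hop: "Tenant filter compilation", budgetMs: 2 },
  { hop: "Policy cache lookup", budgetMs: 1 },
  { hop: "Typesense multi_search", budgetMs: 100 },
  { hop: "Response shaping + log", budgetMs: 5 },
];

function totalBudgetMs(hops: HopBudget[]): number {
  return hops.reduce((sum, h) => sum + h.budgetMs, 0);
}

// Slack left under a tier's p95 target for anything the table does not itemize.
function headroomMs(hops: HopBudget[], tierTargetMs: number): number {
  return tierTargetMs - totalBudgetMs(hops);
}

console.log(totalBudgetMs(READ_PATH_BUDGETS)); // 113
console.log(headroomMs(READ_PATH_BUDGETS, 150)); // 37 (Pro target)
```

The worst case of 113 ms leaves 37 ms under the Pro target. The same sum exceeds the 100 ms Business target by 13 ms, so meeting that target depends on the Typesense hop staying well below its ceiling.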
Cache layers
AACsearch has three cache layers, each with a different purpose and TTL:
1. Policy cache (server-side)
- What: Resolved org plan, feature flags, scoped-token constraints.
- Where: packages/search/lib/policy-cache.ts.
- TTL: 60 seconds, in-process LRU per API instance.
- Why: Plan/entitlement resolution requires 2-3 DB lookups; caching them avoids hitting Postgres on every request.
- Invalidation: Implicit (TTL). A plan upgrade takes effect within 60 s. For instant invalidation (e.g. quota raised on customer request), restart the API containers. The cache lives in each instance, so every replica needs the restart.
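An in-process LRU with a fixed TTL fits in a few lines. The sketch below illustrates the behaviour described above: entries expire 60 s after they are written, and the least recently used entry is evicted when the cache is full. It is not the contents of policy-cache.ts. The class name, the capacity of 10,000, and the org ID are hypothetical. The injectable clock is there only to make expiry testable.

```typescript
// In-process LRU cache with a fixed TTL per entry (hypothetical sketch).
class TtlLruCache<K, V> {
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired: drop it so the caller reloads from the source of truth.
      this.entries.delete(key);
      return undefined;
    }
    // Map iterates in insertion order; re-inserting marks the key most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
  }
}

// Usage: cache resolved org policies for 60 s.
const policyCache = new TtlLruCache<string, { plan: string }>(10_000, 60_000);
policyCache.set("org_123", { plan: "pro" });
```

Because each API instance holds its own Map, two replicas can briefly disagree about an org's plan. This is why a plan change is only guaranteed to apply everywhere after the TTL.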
2. InstantSearch adapter cache (browser-side)
- What: Identical queries within a short window return the previous response.
- Where: packages/instantsearch-adapter/src/cache.ts.
- TTL: Configurable, disabled by default (the adapter ships with it off so the customer opts in).
- Why: Useful when a user toggles a facet back to a previous value — saves a round-trip.
- Invalidation: Manual via client.clearCache().
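A browser-side response cache needs two things: a stable key for "identical query", and a TTL. The sketch below is a hypothetical stand-in for cache.ts, not its code. It builds the key by sorting object keys recursively, so parameter order does not matter. It caches the in-flight promise, so a repeat issued before the first response arrives also reuses it. A TTL of 0 disables it, mirroring the adapter's default-off setting. QueryCache, canonicalKey, and the fetcher callback are illustrative names. Only clearCache echoes the method named above.

```typescript
type SearchParams = Record<string, unknown>;

// Stable cache key: object keys sorted recursively, so { q, facet } and
// { facet, q } map to the same entry.
function canonicalKey(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalKey).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalKey(obj[k])}`);
    return `{${fields.join(",")}}`;
  }
  return String(JSON.stringify(value));
}

class QueryCache<R> {
  // No size bound: acceptable for a sketch, not for very long sessions.
  private readonly entries = new Map<string, { response: Promise<R>; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // ttlMs = 0 disables caching entirely.
  constructor(ttlMs = 0, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  search(params: SearchParams, fetcher: (p: SearchParams) => Promise<R>): Promise<R> {
    if (this.ttlMs <= 0) return fetcher(params);
    const key = canonicalKey(params);
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.response;
    const response = fetcher(params);
    this.entries.set(key, { response, expiresAt: this.now() + this.ttlMs });
    // Do not keep failed responses around.
    response.catch(() => {
      if (this.entries.get(key)?.response === response) this.entries.delete(key);
    });
    return response;
  }

  clearCache(): void {
    this.entries.clear();
  }
}
```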
3. CDN cache (edge, opt-in)
- What: GET endpoints (collection schema, public widget config) can be cached at the edge if you front AACsearch with a CDN.
- TTL: Set via the Cache-Control headers AACsearch emits on GET. POST /search is intentionally not cacheable.
- Why: Reduces hot-path traffic for read-only metadata.
Do not cache POST /search responses at the edge. The response body is tenant-scoped via the bearer token; an edge cache keyed on URL alone will leak data across tenants.
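The warning above matches standard HTTP caching rules. In a simplified reading of RFC 9111 (Sections 3 and 3.5), a shared cache such as a CDN should not store anything marked no-store or private. It should also not store a response to a request carrying an Authorization header unless the response opts in explicitly with public, s-maxage, or must-revalidate. The predicate below sketches that check. It is not a complete implementation: it ignores status codes, freshness, and Vary, and it treats only GET and HEAD as storable. RFC 9110 permits caching POST responses only in narrow cases, and most CDNs do not cache POST by default. The type and function names are hypothetical, and header names are assumed to be lower-cased.

```typescript
interface EdgeRequest {
  method: string;
  headers: Record<string, string>; // header names assumed lower-cased
}
interface EdgeResponse {
  headers: Record<string, string>;
}

// Directive names from a Cache-Control value, e.g. "public, max-age=60" -> {public, max-age}.
function directives(cacheControl: string | undefined): Set<string> {
  return new Set(
    (cacheControl ?? "")
      .split(",")
      .map((d) => d.trim().split("=")[0].toLowerCase())
      .filter((d) => d.length > 0),
  );
}

function sharedCacheMayStore(req: EdgeRequest, res: EdgeResponse): boolean {
  const method = req.method.toUpperCase();
  // POST /search (and any other non-GET/HEAD request) is never stored here.
  if (method !== "GET" && method !== "HEAD") return false;
  const cc = directives(res.headers["cache-control"]);
  if (cc.has("no-store") || cc.has("private")) return false;
  // Bearer-token (tenant-scoped) requests need an explicit shared-cache opt-in.
  if ("authorization" in req.headers) {
    return cc.has("public") || cc.has("s-maxage") || cc.has("must-revalidate");
  }
  return true;
}
```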
Throughput & scaling
Per-API-key rate limits
The rateLimitPerMinute column on SearchApiKey (default 600) is enforced per key over a sliding one-minute window in packages/api/modules/search/lib/rate-limit.ts. Plan tier raises the maximum but not the default; you set per-key limits explicitly.
| Plan | Max rateLimitPerMinute per key |
|---|---|
| Free | 60 |
| Starter | 300 |
| Pro | 1,200 |
| Business | 6,000 |
| Enterprise | Custom |
If one widget shares a key across hundreds of browsers, the per-key cap is the wrong abstraction. Issue multiple keys (one per environment, region, or major client) instead of asking for the cap to be raised — the cap exists to contain runaway clients.
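A sliding one-minute window can be implemented in several ways. The sketch below uses a sliding-window log: it keeps the timestamps of accepted requests per key and admits a new request only if fewer than the limit fall inside the trailing 60 s. It is not the code in rate-limit.ts, which may use counters or shared storage instead. The class name, method name, and key ID are hypothetical. State is in memory here, so several API replicas would need a shared store to enforce one limit per key.

```typescript
// Sliding-window-log limiter (hypothetical sketch).
class SlidingWindowLimiter {
  private readonly windowMs = 60_000;
  private readonly accepted = new Map<string, number[]>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  tryAcquire(apiKeyId: string, limitPerMinute: number): boolean {
    const t = this.now();
    const cutoff = t - this.windowMs;
    // Keep only timestamps still inside the trailing window.
    const recent = (this.accepted.get(apiKeyId) ?? []).filter((ts) => ts > cutoff);
    const allowed = recent.length < limitPerMinute;
    if (allowed) recent.push(t);
    this.accepted.set(apiKeyId, recent);
    // false is where the API would answer 429 rate_limit_exceeded.
    return allowed;
  }
}

const limiter = new SlidingWindowLimiter();
const ok = limiter.tryAcquire("key_storefront_eu", 600); // 600 = default rateLimitPerMinute
```

Unlike a fixed per-minute bucket, a sliding window does not let a client send a full minute's quota just before a minute boundary and another full quota just after it.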
Org-level monthly quota
Independent from per-minute rate limits, every org has a monthly Search Unit quota (maxSearchesPerMonth). One search OR one document write = one Search Unit. See Plans & Limits → Search Units.
The quota uses two enforcement modes:
- Soft cap (default for Free/Starter): 80% triggers a warning; at 100% the API returns 429 quota_exceeded, with a 24 h grace-read window before writes also start failing.
- Hard cap: Configurable per org. Writes fail at 100%; reads continue (so existing widgets keep working) until a fixed grace window expires.
The grace mechanics live in packages/payments/lib/entitlements.ts. The dashboard surfaces both states under Settings → Billing.
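The soft-cap thresholds can be expressed as a small state function. The sketch below counts Search Units as searches plus document writes and classifies usage against maxSearchesPerMonth: a warning from 80%, quota_exceeded from 100%. It does not model the 24 h grace-read window or hard-cap mode, and it is not the logic in entitlements.ts. The function and type names are hypothetical.

```typescript
// One search or one document write counts as one Search Unit.
function searchUnits(searches: number, documentWrites: number): number {
  return searches + documentWrites;
}

type SoftCapState = "ok" | "warning" | "quota_exceeded";

// Warning at 80%, quota_exceeded at 100%. The integer comparison avoids
// floating-point error at the 80% boundary.
function softCapState(usedUnits: number, maxSearchesPerMonth: number): SoftCapState {
  if (usedUnits >= maxSearchesPerMonth) return "quota_exceeded";
  if (usedUnits * 100 >= maxSearchesPerMonth * 80) return "warning";
  return "ok";
}

// 7,000 searches + 1,000 writes against a 10,000-unit quota is exactly 80%.
console.log(softCapState(searchUnits(7_000, 1_000), 10_000)); // "warning"
```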
Typesense cluster
Each org's collections live in a shared Typesense cluster on Free/Starter/Pro. Business and Enterprise can opt into a dedicated cluster (see Enterprise → Dedicated cluster).
Scaling characteristics, shared vs dedicated:
| Metric | Shared cluster (Pro) | Dedicated cluster (Enterprise) |
|---|---|---|
| Collections per cluster | Up to ~5,000 | Customer-tuned |
| Documents per collection | Tested to 5M; dedicated recommended above | Sharded above 10M |
| Vector dim limit | 1,536 (OpenAI ada / Cohere v3) | Up to 4,096 |
| Concurrent reindex jobs | 2 per org | Customer-tuned |
Above the shared-cluster ceiling, the alias-swap reindex pattern (see Reindexing) starts to contend noticeably with other tenants for resources. A dedicated cluster is the recommended path past 5M docs per collection or sustained > 1k QPS.
Postgres
AACsearch's source of truth is Postgres (packages/database schema). The hot search path never queries Postgres directly; it goes through Typesense and the policy cache. Postgres is hit for:
- Plan/entitlement resolution (cached, see policy cache).
- Audit log writes (fire-and-forget — never blocks the response).
- Reindex orchestration (SearchSyncOutbox).
- Quota counting (SearchUsageEvent, batched).
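"Batched" quota counting means the request path does not write one usage row per search. Below is a minimal sketch of the idea, assuming an in-memory, per-instance buffer that is flushed periodically. UsageBatcher and the persist callback are hypothetical stand-ins for whatever writes SearchUsageEvent rows, not the actual packages/api code.

```typescript
// orgId -> Search Units accumulated since the last flush.
type UsageBatch = Map<string, number>;

class UsageBatcher {
  private pending: UsageBatch = new Map();
  private readonly persist: (batch: UsageBatch) => Promise<void>;

  constructor(persist: (batch: UsageBatch) => Promise<void>) {
    this.persist = persist;
  }

  // Request path: an in-memory increment, no database round-trip.
  record(orgId: string, units = 1): void {
    this.pending.set(orgId, (this.pending.get(orgId) ?? 0) + units);
  }

  // Called on a timer; writes one aggregated count per org.
  async flush(): Promise<void> {
    if (this.pending.size === 0) return;
    const batch = this.pending;
    this.pending = new Map(); // events recorded during the write go to the next batch
    try {
      await this.persist(batch);
    } catch (err) {
      // Put the counts back so a failed write does not lose usage.
      for (const [orgId, units] of batch) this.record(orgId, units);
      throw err;
    }
  }
}
```

In a running service, flush would be called on an interval and once more during graceful shutdown. Counts still buffered when an instance crashes are lost, which is the usual trade-off of this pattern.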
Postgres connection pooling is configured per app (apps/saas, apps/marketing, packages/api). Default pool size is 20 per replica. Above ~50 API replicas, switch to PgBouncer in transaction-pooling mode to avoid pool exhaustion.
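The ~50-replica guidance is easier to see as arithmetic. At the default pool size of 20, 50 API replicas can hold 1,000 open connections, before the apps/saas and apps/marketing pools are counted. The sketch below makes the comparison explicit. The function names are hypothetical. maxConnections stands for whatever the server's max_connections is set to; stock PostgreSQL defaults to 100.

```typescript
// Worst-case connections the API tier opens when every replica fills its pool.
function apiPoolConnections(replicas: number, poolSizePerReplica = 20): number {
  return replicas * poolSizePerReplica;
}

// True when direct per-replica pools, plus other apps' pools, would exceed
// the server's connection limit; at that point a transaction-mode pooler helps.
function needsPgBouncer(
  replicas: number,
  maxConnections: number,
  otherAppConnections = 0,
  poolSizePerReplica = 20,
): boolean {
  return apiPoolConnections(replicas, poolSizePerReplica) + otherAppConnections > maxConnections;
}

console.log(apiPoolConnections(50)); // 1000
```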
Observability
What to watch when performance regresses:
| Signal | Where |
|---|---|
| p50 / p95 / p99 search latency | Operations → Observability |
| 429 rate (rate-limit + quota) | Dashboard → Analytics → Errors |
| Reindex lag (ingest → searchable) | Dashboard → Indexes → Reindex history |
| Typesense memory / CPU | Coolify / Grafana (shared cluster: ops-team only) |
| Postgres connection saturation | Coolify / Grafana |
Detailed runbooks live in Operations → Monitoring and Operations → Troubleshooting.
When to scale up
Trigger conditions and the recommended action:
| Symptom | Likely cause | Action |
|---|---|---|
| p95 search latency creeps above target for a tier | Collection size approaching shared-cluster ceiling | Plan upgrade or move to dedicated cluster |
| 429 rate_limit_exceeded from a single key | Frontend fires one search per keystroke | Debounce the client (200 ms); see Rate limits |
| 429 quota_exceeded consistently before month end | Sustained growth past tier monthly cap | Plan upgrade, or set a higher hard cap on Business+ |
| Reindex jobs queue up | Multiple reindexes triggered concurrently | Sequentialize at the application layer; alias-swap is one-at-a-time per index |
| Vector search noticeably slower than text-only | Vector dims close to cluster ceiling | Reduce dims (e.g. 1536 → 768) or move to dedicated |
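The debounce fix in the table can be a trailing-edge debounce: it sends one search after the user stops typing for 200 ms, instead of one per keystroke. This is a generic sketch written from scratch rather than an AACsearch or InstantSearch API. To use it, wrap your input handler's search call, for example debounce(yourSearchCall, 200).

```typescript
// Trailing-edge debounce: each call resets the timer, so fn runs once,
// with the latest arguments, after waitMs of inactivity.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs = 200,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, waitMs);
  };
}
```

Typing "shoes" quickly then produces one request for "shoes" instead of five. That is a 5x cut against both the per-key rate limit and the monthly Search Unit quota.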
Performance smoke tests
The repo ships a basic load harness in packages/loadtest. Run it against staging to validate latency targets before a public launch:
cd packages/loadtest && bun run smoke
Default profile: 10 concurrent virtual users × 60 s × POST /search against the _demo collection. Output is p50/p95/p99 + error rate.
Do not run load tests against app.aacsearch.com without coordinating with ops — the per-key rate limit will kick in and the run will be measuring 429 throughput, not search throughput.
Further reading
- Plans & Limits — canonical quota matrix
- Rate limits — 429 diagnosis flow
- Reindexing — alias-swap behaviour
- Observability — metrics, traces, logs
- Enterprise → Dedicated cluster — when to leave shared infra