Performance, SLA, caching & scaling

How AACsearch scales, where the cache layers sit, what SLA each plan tier carries, and the levers you have when latency or throughput becomes the bottleneck.

This page is the engineering reference for how fast AACsearch is, why, and what you can change. For the customer-facing plan matrix see Plans & Limits; for 429s specifically see Operations → Rate limits.

SLA by plan tier

Plan	Read p95 target	Write ack p95	Uptime	Support response
Free	best-effort	best-effort	99.0%	Community
Starter	250 ms	500 ms	99.5%	Email, 1 BD
Pro	150 ms	300 ms	99.9%	Priority, 4 h
Business	100 ms	200 ms	99.9%	Dedicated, 1 h
Enterprise	Custom	Custom	99.95%	SLA contract

Read p95 is the time from POST /search reaching the API gateway to a 200 response leaving it, measured at the org level for the trailing 30 days. Write ack is the time from a write request being accepted to the write being visible in subsequent reads (alias-swap latency for reindexes is not counted — see Reindexing).

Targets are SLOs, not contractual SLAs, except for Enterprise where the SLA contract supersedes this table.
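To make the p95 figures concrete, here is a minimal sketch of a nearest-rank percentile over a window of request latencies. The function name and the sample data are illustrative assumptions; this page does not state which percentile method AACsearch uses internally.

```typescript
// Nearest-rank percentile over latency samples, in milliseconds.
// Assumption: nearest-rank is chosen for illustration only.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // 1-based rank; multiply before dividing to avoid floating-point drift.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical gateway-to-gateway latencies for 20 search requests.
const latenciesMs = [42, 55, 61, 48, 90, 120, 75, 66, 58, 49, 51, 140, 63, 70, 47, 53, 88, 95, 60, 57];
const p95Ms = percentile(latenciesMs, 95);
console.log(`p95 = ${p95Ms} ms, within Pro read target: ${p95Ms <= 150}`);
```

With only 20 samples the p95 is driven by the second-slowest request, which is why the targets are evaluated over a trailing 30-day window rather than a short burst.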

Read path latency budget

A POST /search request flows through five hops. The default budget per hop:

Hop	Budget	Code path
API gateway + auth	< 5 ms	`packages/api/modules/search/lib/access.ts` token verification
Tenant filter compilation	< 2 ms	`packages/search/lib/search.ts` builds the scoped `filter_by`
Policy cache lookup	< 1 ms	`packages/search/lib/policy-cache.ts` (TTL 60 s, see below)
Typesense `multi_search`	30–100 ms	Depends on collection size, vector dims, and `per_page`
Response shaping + log	< 5 ms	Includes analytics-event emission (fire-and-forget)

Total p95 target on Pro: 150 ms for a per_page=10 query against a 100k-document collection.
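As a rough sanity check, the upper bounds of the per-hop budgets can be summed and compared with the Pro target. This is a simplification (per-hop bounds do not add up to a true end-to-end p95), and the variable names are illustrative.

```typescript
// Upper-bound budget per hop in milliseconds, copied from the hop table.
const hopBudgetMs: Record<string, number> = {
  "API gateway + auth": 5,
  "Tenant filter compilation": 2,
  "Policy cache lookup": 1,
  "Typesense multi_search": 100, // top of the 30-100 ms range
  "Response shaping + log": 5,
};

const totalBudgetMs = Object.values(hopBudgetMs).reduce((sum, ms) => sum + ms, 0);
const proTargetMs = 150;
const headroomMs = proTargetMs - totalBudgetMs;

console.log(`worst-case hop budget: ${totalBudgetMs} ms, headroom on Pro: ${headroomMs} ms`);
```

The Typesense hop dominates the budget, so it is the first place to look when the total drifts toward the target.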

Cache layers

AACsearch has three cache layers, each with a different purpose and TTL:

1. Policy cache (server-side)

What: Resolved org plan, feature flags, scoped-token constraints.
Where: packages/search/lib/policy-cache.ts.
TTL: 60 seconds, in-process LRU per API instance.
Why: Plan/entitlement resolution requires 2-3 DB lookups; caching them avoids hitting Postgres on every request.
Invalidation: Implicit (TTL). A plan upgrade takes effect within 60 s. Because the cache is in-process, instant invalidation (e.g. quota raised on customer request) requires restarting every API container.
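The behaviour of an in-process LRU with a fixed TTL can be sketched as follows. This is not the contents of policy-cache.ts; the class name, the capacity of 1,000 entries, and the injected clock are assumptions made for illustration.

```typescript
// Minimal TTL + LRU cache sketch. Hypothetical; not the real policy cache.
class TtlLruCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: caller re-resolves from the database
      return undefined;
    }
    // Re-insert to mark as most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // Evict the least recently used entry: the first key in insertion order.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Usage with a fake clock so the 60 s expiry is visible without waiting.
let fakeNowMs = 0;
const policyCache = new TtlLruCache<string, { plan: string }>(1000, 60_000, () => fakeNowMs);
policyCache.set("org_123", { plan: "pro" });
fakeNowMs = 59_000;
const beforeExpiry = policyCache.get("org_123"); // still cached
fakeNowMs = 60_000;
const afterExpiry = policyCache.get("org_123"); // expired, forces a fresh lookup
console.log(beforeExpiry, afterExpiry);
```

Because each API instance holds its own copy, two instances can disagree about an org's plan for up to one TTL after a change.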

2. InstantSearch adapter cache (browser-side)

What: Identical queries within a short window return the previous response.
Where: packages/instantsearch-adapter/src/cache.ts.
TTL: Configurable. The cache is disabled by default, so each customer opts in explicitly.
Why: Useful when a user toggles a facet back to a previous value — saves a round-trip.
Invalidation: Manual via client.clearCache().

3. CDN cache (edge, opt-in)

What: GET endpoints (collection schema, public widget config) can be cached at the edge if you front AACsearch with a CDN.
TTL: Set via Cache-Control headers AACsearch emits on GET. POST /search is intentionally not cacheable.
Why: Reduces hot-path traffic for read-only metadata.

Do not cache POST /search responses at the edge. The response body is tenant-scoped via the bearer token; an edge cache keyed on URL alone will leak data across tenants.
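If you write your own edge rule, the guard below sketches the intent: cache only GET responses that the origin has explicitly marked cacheable, and never anything else. The types and the function name are hypothetical and do not correspond to any particular CDN's API.

```typescript
// Hypothetical shapes for an edge-side caching decision.
interface EdgeRequest {
  method: string;
}
interface OriginResponse {
  status: number;
  cacheControl?: string; // raw Cache-Control header value, if present
}

function isEdgeCacheable(req: EdgeRequest, res: OriginResponse): boolean {
  // POST /search is tenant-scoped by bearer token and must never be cached.
  if (req.method.toUpperCase() !== "GET") return false;
  if (res.status !== 200) return false;

  const directives = (res.cacheControl ?? "")
    .toLowerCase()
    .split(",")
    .map((d) => d.trim());

  if (directives.includes("no-store") || directives.includes("private")) return false;
  // Require an explicit freshness lifetime from the origin.
  return directives.some((d) => d.startsWith("max-age=") || d.startsWith("s-maxage="));
}

console.log(isEdgeCacheable({ method: "GET" }, { status: 200, cacheControl: "public, max-age=300" }));
console.log(isEdgeCacheable({ method: "POST" }, { status: 200, cacheControl: "public, max-age=300" }));
```

Defaulting to "not cacheable" when the header is missing keeps a misconfigured origin from turning into a cross-tenant leak.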

Throughput & scaling

Per-API-key rate limits

The rateLimitPerMinute column on SearchApiKey (default 600) is enforced per key over a sliding one-minute window in packages/api/modules/search/lib/rate-limit.ts. Plan tier raises the maximum but not the default: you set per-key limits explicitly.

Plan	Max rateLimitPerMinute per key
Free	60
Starter	300
Pro	1,200
Business	6,000
Enterprise	Custom

If one widget shares a key across hundreds of browsers, the per-key cap is the wrong abstraction. Issue multiple keys (one per environment, region, or major client) instead of asking for the cap to be raised — the cap exists to contain runaway clients.
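A per-key sliding window can be sketched with a log of recent request timestamps. This is a minimal in-memory illustration, not the code in rate-limit.ts; the class name and the choice of a timestamp log (rather than, say, a weighted counter) are assumptions.

```typescript
// Sliding-window-log limiter: allow a request only if fewer than `limit`
// requests were accepted for the same key in the preceding window.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limitPerMinute = 600, windowMs = 60_000) {
    this.limit = limitPerMinute;
    this.windowMs = windowMs;
  }

  allow(keyId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(keyId) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs); // rejected requests do not consume budget
    this.hits.set(keyId, recent);
    return allowed;
  }
}

// Usage: a key limited to 3 requests per minute.
const limiter = new SlidingWindowLimiter(3);
const decisions = [0, 10, 20, 30, 60_001].map((t) => limiter.allow("key_demo", t));
console.log(decisions);
```

The fourth call is rejected because three requests are already inside the window; the fifth succeeds because the request at t = 0 has aged out. A shared key multiplies this effect across every browser using it, which is why splitting keys is preferred over raising the cap.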

Org-level monthly quota

Independently of the per-minute rate limits, every org has a monthly Search Unit quota (maxSearchesPerMonth). One search or one document write counts as one Search Unit. See Plans & Limits → Search Units.

The quota uses two enforcement modes:

Soft cap (default for Free/Starter): a warning is triggered at 80% of the quota; at 100%, requests return 429 quota_exceeded, with a 24 h grace-read window before writes also start failing.
Hard cap: Configurable per org. Writes fail at 100%; reads continue (so existing widgets keep working) until a fixed grace window expires.

The grace mechanics live in packages/payments/lib/entitlements.ts. The dashboard surfaces both states under Settings → Billing.
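The Search Unit accounting and the 80% / 100% thresholds can be sketched as below. This deliberately leaves out the grace windows and the soft/hard distinction, and it is not the logic in entitlements.ts; the function and type names are hypothetical.

```typescript
type QuotaState = "ok" | "warning" | "exceeded";

// One search or one document write consumes one Search Unit.
function quotaState(searches: number, documentWrites: number, maxSearchesPerMonth: number): QuotaState {
  const usedUnits = searches + documentWrites;
  if (usedUnits >= maxSearchesPerMonth) return "exceeded";
  // Integer comparison for the 80% threshold avoids floating-point rounding.
  if (usedUnits * 10 >= maxSearchesPerMonth * 8) return "warning";
  return "ok";
}

console.log(quotaState(7_000, 999, 10_000));
console.log(quotaState(7_000, 1_000, 10_000));
console.log(quotaState(9_000, 1_000, 10_000));
```

Note that writes count against the same quota as searches, so a large reindex can push an org into the warning band even when search traffic is flat.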

Typesense cluster

Each org's collections live in a shared Typesense cluster on Free/Starter/Pro. Business and Enterprise can opt into a dedicated cluster (see Enterprise → Dedicated cluster).

Scaling characteristics, shared versus dedicated cluster:

Metric	Shared cluster (Pro)	Dedicated cluster (Enterprise)
Collections per cluster	Up to ~5,000	Customer-tuned
Documents per collection	Tested to 5M; harder above	Sharded above 10M
Vector dim limit	1,536 (covers OpenAI ada-002 at 1,536 and Cohere v3 at 1,024)	Up to 4,096
Concurrent reindex jobs	2 per org	Customer-tuned

Above the shared-cluster ceiling, the alias-swap reindex pattern (see Reindexing) starts noticeably contending for resources with other tenants. A dedicated cluster is the recommended path past 5M documents per collection or sustained load above 1k QPS.

Postgres

AACsearch's source of truth is Postgres (packages/database schema). The hot search path is served from Typesense and the policy cache rather than from Postgres. Postgres is hit for:

Plan/entitlement resolution (cached, see policy cache).
Audit log writes (fire-and-forget — never blocks the response).
Reindex orchestration (SearchSyncOutbox).
Quota counting (SearchUsageEvent, batched).

Postgres connection pooling is configured per app (apps/saas, apps/marketing, packages/api). Default pool size is 20 per replica, so 50 API replicas can open up to 50 × 20 = 1,000 connections. Above ~50 API replicas, switch to PgBouncer in transaction-pooling mode to avoid pool exhaustion.

Observability

What to watch when performance regresses:

Signal	Where
p50 / p95 / p99 search latency	Operations → Observability
429 rate (rate-limit + quota)	Dashboard → Analytics → Errors
Reindex lag (ingest → searchable)	Dashboard → Indexes → Reindex history
Typesense memory / CPU	Coolify / Grafana (shared cluster: ops-team only)
Postgres connection saturation	Coolify / Grafana

Detailed runbooks live in Operations → Monitoring and Operations → Troubleshooting.

When to scale up

Trigger conditions and the recommended action:

Symptom	Likely cause	Action
p95 search latency creeps above target for a tier	Collection size approaching shared-cluster ceiling	Plan upgrade or move to dedicated cluster
429 `rate_limit_exceeded` from a single key	Frontend fires one search per keystroke	Debounce the client (200 ms); see Rate limits
429 `quota_exceeded` consistently before month end	Sustained growth past tier monthly cap	Plan upgrade, or set a higher hard cap on Business+
Reindex jobs queue up	Multiple reindexes triggered concurrently	Serialize reindexes at the application layer; alias-swap is one-at-a-time per index
Vector search noticeably slower than text-only	Vector dims close to cluster ceiling	Reduce dims (e.g. 1536 → 768) or move to dedicated
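For the one-search-per-keystroke case, a 200 ms debounce on the client can be sketched as follows. The helper is generic; the callback that records queries stands in for whatever function actually issues POST /search in your widget and is an assumption of this example.

```typescript
// Trailing-edge debounce: only the last call within `waitMs` is executed.
function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs = 200): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer); // cancel the pending call
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Hypothetical stand-in for the function that sends POST /search.
const issuedQueries: string[] = [];
const debouncedSearch = debounce((query: string) => {
  issuedQueries.push(query);
}, 200);

// Three keystrokes in quick succession result in a single search for "abc".
debouncedSearch("a");
debouncedSearch("ab");
debouncedSearch("abc");
```

A user typing a ten-character query then spends one or two requests from the per-key budget instead of ten.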

Performance smoke tests

The repo ships a basic load harness in packages/loadtest. Run it against staging to validate latency targets before a public launch:

cd packages/loadtest && bun run smoke

Default profile: 10 concurrent virtual users issuing POST /search against the _demo collection for 60 s. Output is p50/p95/p99 latency plus error rate.

Do not run load tests against app.aacsearch.com without coordinating with ops — the per-key rate limit will kick in and the run will be measuring 429 throughput, not search throughput.