Performance, SLA, caching & scaling

How AACsearch scales, where the cache layers sit, what SLA each plan tier carries, and the levers you have when latency or throughput becomes the bottleneck.

This page is the engineering reference for how fast AACsearch is, why, and what you can change. For the customer-facing plan matrix see Plans & Limits; for 429s specifically see Operations → Rate limits.

SLA by plan tier

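| Plan | Read p95 target | Write ack p95 | Uptime | Support response |
|---|---|---|---|---|
| Free | best-effort | best-effort | 99.0% | Community |
| Starter | 250 ms | 500 ms | 99.5% | Email, 1 business day |
| Pro | 150 ms | 300 ms | 99.9% | Priority, 4 h |
| Business | 100 ms | 200 ms | 99.9% | Dedicated, 1 h |
| Enterprise | Custom | Custom | 99.95% | SLA contract |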

Read p95 is the time from POST /search reaching the API gateway to a 200 response leaving it, measured at the org level for the trailing 30 days. Write ack is the time from a write request being accepted to the write being visible in subsequent reads (alias-swap latency for reindexes is not counted — see Reindexing).

Targets are SLOs, not contractual SLAs, except for Enterprise where the SLA contract supersedes this table.
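
To make the p95 definition concrete, here is a minimal sketch of how a p95 could be computed from a window of latency samples and compared with a tier target. It uses the nearest-rank method, which is an assumption; this page does not say which percentile estimator AACsearch uses, and `percentile` and `meetsReadTarget` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Multiply before dividing so integer inputs stay exact.
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1]!;
}

// Hypothetical helper: does a window of read latencies meet the tier target?
function meetsReadTarget(samplesMs: number[], targetMs: number): boolean {
  return percentile(samplesMs, 95) <= targetMs;
}
```

With 20 samples the p95 is the 19th smallest value, so a single slow outlier does not move it, but two do.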

Read path latency budget

A POST /search request flows through five hops. The default budget per hop:

| Hop | Budget | Code path / notes |
|---|---|---|
| API gateway + auth | < 5 ms | packages/api/modules/search/lib/access.ts token verification |
| Tenant filter compilation | < 2 ms | packages/search/lib/search.ts builds the scoped filter_by |
| Policy cache lookup | < 1 ms | packages/search/lib/policy-cache.ts (TTL 60 s, see below) |
| Typesense multi_search | 30–100 ms | Depends on collection size, vector dims, and per_page |
| Response shaping + log | < 5 ms | Includes analytics-event emission (fire-and-forget) |

Total p95 target on Pro: 150 ms for a per_page=10 query against a 100k-document collection.
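
As a quick sanity check, the upper bounds of the per-hop budgets can be summed and compared with the Pro target. The sketch below only does that arithmetic; the constant and function names are illustrative and not taken from the codebase.

```typescript
// Upper bound of each hop's budget from the table above, in milliseconds.
const hopBudgetMs: Record<string, number> = {
  "API gateway + auth": 5,
  "Tenant filter compilation": 2,
  "Policy cache lookup": 1,
  "Typesense multi_search": 100,
  "Response shaping + log": 5,
};

const PRO_READ_P95_TARGET_MS = 150;

function totalBudgetMs(budget: Record<string, number>): number {
  return Object.values(budget).reduce((sum, ms) => sum + ms, 0);
}

const worstCaseMs = totalBudgetMs(hopBudgetMs); // 5 + 2 + 1 + 100 + 5
const headroomMs = PRO_READ_P95_TARGET_MS - worstCaseMs;
```

Even with every hop at its ceiling, the sum stays below the 150 ms target, and the Typesense hop accounts for most of it. That is why the scaling levers later on this page concentrate on collection size and vector dimensions.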

Cache layers

AACsearch has three cache layers, each with a different purpose and TTL:

1. Policy cache (server-side)

  • What: Resolved org plan, feature flags, scoped-token constraints.
  • Where: packages/search/lib/policy-cache.ts.
  • TTL: 60 seconds, in-process LRU per API instance.
  • Why: Plan/entitlement resolution requires 2-3 DB lookups; caching them avoids hitting Postgres on every request.
  • Invalidation: Implicit (TTL). A plan upgrade takes effect within 60 s. For instant invalidation (e.g. quota raised on customer request), restart the API container.
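
A minimal sketch of an in-process LRU cache with a fixed TTL, to illustrate the behavior described above. This is not the contents of policy-cache.ts: the class name, the constructor parameters, and the injectable clock are assumptions made for the example.

```typescript
// In-process LRU with a fixed TTL. A Map iterates in insertion order, so the
// first key is always the least recently used one.
class TtlLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: the caller re-resolves from the database
      return undefined;
    }
    // Re-insert to mark the key as most recently used; the TTL is not extended.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

Because a read does not extend the expiry, a stale plan is served for at most one TTL after it changes. This matches the "takes effect within 60 s" behavior. Restarting the process discards the map, which is why a restart acts as an instant invalidation.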

2. InstantSearch adapter cache (browser-side)

  • What: Identical queries within a short window return the previous response.
  • Where: packages/instantsearch-adapter/src/cache.ts.
  • TTL: Configurable; disabled by default (the adapter ships with caching off so that the customer opts in).
  • Why: Useful when a user toggles a facet back to a previous value — saves a round-trip.
  • Invalidation: Manual via client.clearCache().
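
The facet-toggle case only produces a cache hit if two logically identical requests map to the same key. The sketch below shows one way to do that with an order-independent key and a short TTL. It is an illustration and not the adapter's cache.ts: `stableKey` and `QueryCache` are hypothetical names, and treating a TTL of 0 as "disabled" is an assumption.

```typescript
// Serialize a request so that property order does not affect the key.
function stableKey(value: unknown): string {
  if (Array.isArray(value)) return "[" + value.map((v) => stableKey(v)).join(",") + "]";
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableKey(obj[k]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value) ?? "null";
}

class QueryCache<R> {
  private store = new Map<string, { response: R; storedAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs; // 0 means the cache is disabled (assumed convention)
    this.now = now;
  }

  get(request: unknown): R | undefined {
    if (this.ttlMs <= 0) return undefined;
    const key = stableKey(request);
    const hit = this.store.get(key);
    if (hit === undefined) return undefined;
    if (this.now() - hit.storedAt >= this.ttlMs) {
      this.store.delete(key);
      return undefined;
    }
    return hit.response;
  }

  set(request: unknown, response: R): void {
    if (this.ttlMs <= 0) return;
    this.store.set(stableKey(request), { response, storedAt: this.now() });
  }

  // Manual invalidation, mirroring the role of client.clearCache().
  clearCache(): void {
    this.store.clear();
  }
}
```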

3. CDN cache (edge, opt-in)

  • What: GET endpoints (collection schema, public widget config) can be cached at the edge if you front AACsearch with a CDN.
  • TTL: Set via Cache-Control headers AACsearch emits on GET. POST /search is intentionally not cacheable.
  • Why: Reduces hot-path traffic for read-only metadata.

Do not cache POST /search responses at the edge. The response body is tenant-scoped via the bearer token; an edge cache keyed on URL alone will leak data across tenants.
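
The rule above can be expressed as a small header policy: only GET requests for read-only metadata receive a cacheable Cache-Control value, and everything else is marked no-store. The route prefixes and the max-age value in this sketch are hypothetical, and the headers AACsearch actually emits may differ.

```typescript
// Hypothetical allowlist of read-only metadata routes; the real routes may differ.
const EDGE_CACHEABLE_PREFIXES = ["/collections/", "/widget-config/"];

function cacheControlFor(method: string, path: string): string {
  const cacheable =
    method.toUpperCase() === "GET" &&
    EDGE_CACHEABLE_PREFIXES.some((prefix) => path.startsWith(prefix));
  // max-age=300 is an illustrative value, not a documented default.
  return cacheable ? "public, max-age=300" : "no-store";
}
```

Defaulting to no-store for anything outside the allowlist means a newly added tenant-scoped endpoint is never cached at the edge by accident.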

Throughput & scaling

Per-API-key rate limits

The rateLimitPerMinute column on SearchApiKey (default 600) is enforced per key over a sliding one-minute window in packages/api/modules/search/lib/rate-limit.ts. The plan tier raises the maximum but not the default; you set per-key limits explicitly.

| Plan | Max rateLimitPerMinute per key |
|---|---|
| Free | 60 |
| Starter | 300 |
| Pro | 1,200 |
| Business | 6,000 |
| Enterprise | Custom |

If one widget shares a key across hundreds of browsers, the per-key cap is the wrong abstraction. Issue multiple keys (one per environment, region, or major client) instead of asking for the cap to be raised — the cap exists to contain runaway clients.
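
For readers unfamiliar with sliding-window limiting, here is a minimal sketch of the technique using a per-key log of request timestamps. It is not the contents of rate-limit.ts. The class name is hypothetical, and a production limiter would typically keep its state in a shared store, not in process memory.

```typescript
// Sliding-window-log limiter: keep request timestamps per key and count the
// ones that fall inside the trailing 60-second window.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private windowMs = 60000;
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now; // injectable clock, useful in tests
  }

  // Returns true if the request is allowed, false if it should receive a 429.
  allow(apiKey: string, limitPerMinute: number): boolean {
    const current = this.now();
    const cutoff = current - this.windowMs;
    const recent = (this.hits.get(apiKey) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < limitPerMinute;
    if (allowed) recent.push(current);
    this.hits.set(apiKey, recent);
    return allowed;
  }
}
```

Unlike a fixed one-minute bucket, the window moves with each request, so a client cannot send a full quota at the end of one minute and another full quota at the start of the next.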

Org-level monthly quota

Independent from per-minute rate limits, every org has a monthly Search Unit quota (maxSearchesPerMonth). One search OR one document write = one Search Unit. See Plans & Limits → Search Units.

The quota uses two enforcement modes:

  • Soft cap (default for Free/Starter): at 80% a warning is triggered; at 100% requests return 429 quota_exceeded, with a 24 h grace-read window before writes also start failing.
  • Hard cap: Configurable per org. Writes fail at 100%; reads continue (so existing widgets keep working) until a fixed grace window expires.

The grace mechanics live in packages/payments/lib/entitlements.ts. The dashboard surfaces both states under Settings → Billing.
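
The Search Unit accounting and the 80% / 100% thresholds can be sketched as follows. This covers only the counting and the threshold states. The grace-window behavior lives in entitlements.ts and is deliberately not modeled here, and the function names are hypothetical.

```typescript
type QuotaState = "ok" | "warning" | "exceeded";

// One search or one document write consumes one Search Unit.
function searchUnitsUsed(searches: number, documentWrites: number): number {
  return searches + documentWrites;
}

function quotaState(usedUnits: number, maxSearchesPerMonth: number): QuotaState {
  if (maxSearchesPerMonth <= 0) throw new Error("quota must be positive");
  if (usedUnits >= maxSearchesPerMonth) return "exceeded";
  // Integer comparison for the 80% threshold avoids floating-point rounding.
  if (usedUnits * 10 >= maxSearchesPerMonth * 8) return "warning";
  return "ok";
}
```

For example, an org with a 10,000-unit quota that has run 6,000 searches and written 2,000 documents has used 8,000 units and is in the warning state.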

Typesense cluster

Each org's collections live in a shared Typesense cluster on Free/Starter/Pro. Business and Enterprise can opt into a dedicated cluster (see Enterprise → Dedicated cluster).

Shared-cluster scaling characteristics:

| Metric | Shared cluster (Pro) | Dedicated cluster (Enterprise) |
|---|---|---|
| Collections per cluster | Up to ~5,000 | Customer-tuned |
| Documents per collection | Tested to 5M; harder above | Sharded above 10M |
| Vector dim limit | 1,536 (OpenAI ada / Cohere v3) | Up to 4,096 |
| Concurrent reindex jobs | 2 per org | Customer-tuned |

Above the shared-cluster ceiling, alias-swap reindexes (see Reindexing) noticeably contend for resources with other tenants. A dedicated cluster is the recommended path past 5M documents per collection or sustained load above 1k QPS.

Postgres

AACsearch's source of truth is Postgres (packages/database schema). On the hot search path the API does not query Postgres as long as the policy cache is warm; requests are served from Typesense and the policy cache. Postgres is hit for:

  • Plan/entitlement resolution (cached, see policy cache).
  • Audit log writes (fire-and-forget — never blocks the response).
  • Reindex orchestration (SearchSyncOutbox).
  • Quota counting (SearchUsageEvent, batched).

Postgres connection pooling is configured per app (apps/saas, apps/marketing, packages/api). Default pool size is 20 per replica. Above ~50 API replicas, switch to PgBouncer in transaction-pooling mode to avoid pool exhaustion.
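
The replica threshold follows from simple multiplication: each replica holds its own pool, so the number of client connections grows linearly with the replica count. A rough sketch of the check is below. `maxServerConnections` stands for whatever the Postgres server is configured to accept (its max_connections setting); that value is deployment-specific and not stated on this page.

```typescript
function totalClientConnections(replicas: number, poolSizePerReplica: number): number {
  return replicas * poolSizePerReplica;
}

// True when the replicas could open more connections than the server accepts,
// which is the point where a pooler such as PgBouncer becomes necessary.
function shouldUsePooler(
  replicas: number,
  poolSizePerReplica: number,
  maxServerConnections: number,
): boolean {
  return totalClientConnections(replicas, poolSizePerReplica) > maxServerConnections;
}
```

With the default pool size of 20, 50 replicas can open up to 1,000 connections. In transaction-pooling mode a pooler can share a much smaller set of server connections among them.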

Observability

What to watch when performance regresses:

| Signal | Where |
|---|---|
| p50 / p95 / p99 search latency | Operations → Observability |
| 429 rate (rate-limit + quota) | Dashboard → Analytics → Errors |
| Reindex lag (ingest → searchable) | Dashboard → Indexes → Reindex history |
| Typesense memory / CPU | Coolify / Grafana (shared cluster: ops-team only) |
| Postgres connection saturation | Coolify / Grafana |

Detailed runbooks live in Operations → Monitoring and Operations → Troubleshooting.

When to scale up

Trigger conditions and the recommended action:

| Symptom | Likely cause | Action |
|---|---|---|
| p95 search latency creeps above target for a tier | Collection size approaching shared-cluster ceiling | Plan upgrade or move to dedicated cluster |
| 429 rate_limit_exceeded from a single key | Frontend fires one search per keystroke | Debounce the client (200 ms); see Rate limits |
| 429 quota_exceeded consistently before month end | Sustained growth past tier monthly cap | Plan upgrade, or set a higher hard cap on Business+ |
| Reindex jobs queue up | Multiple reindexes triggered concurrently | Sequentialize at the application layer; alias-swap is one-at-a-time per index |
| Vector search noticeably slower than text-only | Vector dims close to cluster ceiling | Reduce dims (e.g. 1536 → 768) or move to dedicated |
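
For the per-keystroke case, a client-side debounce is usually a few lines. The sketch below is a generic trailing-edge debounce and does not use any AACsearch client API. The `sentQueries` array stands in for the real search call.

```typescript
// Trailing-edge debounce: fn runs once, waitMs after the most recent call.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Stand-in for the real search call: record which queries would be sent.
const sentQueries: string[] = [];
const debouncedSearch = debounce((query: string) => {
  sentQueries.push(query);
}, 200);
```

Calling `debouncedSearch` on every keystroke of "sho" sends a single request for "sho" once typing pauses for 200 ms, so one request is issued where three would have been.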

Performance smoke tests

The repo ships a basic load harness in packages/loadtest. Run it against staging to validate latency targets before a public launch:

cd packages/loadtest && bun run smoke

Default profile: 10 concurrent virtual users × 60 s × POST /search against the _demo collection. Output is p50/p95/p99 + error rate.

Do not run load tests against app.aacsearch.com without coordinating with ops — the per-key rate limit will kick in and the run will be measuring 429 throughput, not search throughput.
