Learning to Rank

The LTR pipeline in Relevance Studio — click feedback, position-bias correction, training, model versioning, A/B tests, and activation. How to interpret z-test significance and pick a winner.

Learning to Rank (LTR) is the closed feedback loop that turns real user clicks into a learned ranking model. Studio ships the four panels that map to the four stages of the pipeline: Click feedback, Training runs, Models, and A/B tests.

Pipeline at a glance

clicks (SearchUsageEvent)

   ▼   position-bias correction (debias)

   ▼   training (LightGBM-shim today, native LightGBM/XGBoost deferred)

   ▼   model artifact + metrics (NDCG, MRR, AUC)

   ▼   A/B test: split traffic, z-test significance

   ▼   activate winner → ranking on the read path

Every stage is reversible. A bad model is one click away from being de-activated — the previous active model is always retained.

1 · Click feedback

The Click feedback panel summarizes the raw signal that feeds the trainer. It surfaces three numbers per index:

  • clicks-with-context: how many click events arrived with a valid queryId and position payload. Without queryId we cannot tie a click back to a search.
  • position-bias correction: the debiasing curve currently in effect. Position bias is the well-known tendency of results in higher positions to attract clicks regardless of relevance. The corrected score is roughly observed_ctr / propensity[position], where propensity is the estimated probability that a user examines that position, fit per index.
  • usable training rows: after debiasing and zero-result filtering, how many rows are eligible to feed the trainer.

If usable training rows are below ~5k, the Training panel will refuse to launch a run and show a "not enough signal" banner. Below this floor the learned model overfits and underperforms the static ranker.
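The correction and the row floor described above can be sketched as follows. This is a minimal illustration, not Studio's implementation: the `PositionStats` shape, the propensity values, and the helper names are all assumptions made up for the example; only the `observed_ctr / propensity[position]` formula and the ~5k floor come from this page.

```typescript
// Hypothetical shape: field names are illustrative, not Studio's schema.
interface PositionStats {
  position: number;    // 1-based rank shown to the user
  impressions: number; // times a result was shown at this position
  clicks: number;      // clicks received at this position
}

// Assumed propensity curve (probability that a user examines each position).
// Real curves are fit per index; these values are invented for the example.
const propensity: Record<number, number> = { 1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.25 };

// Inverse-propensity correction: observed_ctr / propensity[position].
function debiasedCtr(stats: PositionStats, curve: Record<number, number>): number {
  const p: number | undefined = curve[stats.position];
  if (p === undefined || p <= 0) {
    throw new Error(`no propensity for position ${stats.position}`);
  }
  if (stats.impressions === 0) return 0;
  const observedCtr = stats.clicks / stats.impressions;
  return observedCtr / p;
}

// The ~5k floor below which the Training panel refuses to launch a run.
const MIN_USABLE_ROWS = 5_000;

function hasEnoughSignal(usableRows: number): boolean {
  return usableRows >= MIN_USABLE_ROWS;
}

// A result at rank 3 with half the raw CTR of rank 1 can still come out ahead
// once the lower examination probability is accounted for.
const ctrAtRank1 = debiasedCtr({ position: 1, impressions: 1000, clicks: 300 }, propensity);
const ctrAtRank3 = debiasedCtr({ position: 3, impressions: 1000, clicks: 150 }, propensity);
console.log(ctrAtRank1.toFixed(3), ctrAtRank3.toFixed(3));
```

Dividing by the propensity boosts clicks that happened in rarely examined positions, which is why a low-ranked but relevant result is not penalized for its placement.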

2 · Training runs

A training run produces one model artifact from a chosen window of click feedback. From Studio you choose:

  • Window — last 7, 30, or 90 days.
  • Algorithm — currently shim (the only option in shipping releases). Native LightGBM and XGBoost are deferred — see the note below.
  • Features — pre-defined feature set (text relevance, recency, price, popularity, category cohort). Custom features are on the roadmap.

The run shows progress, then surfaces three metrics on completion:

Metric    Meaning
NDCG@10   Normalized discounted cumulative gain at top 10 (primary metric)
MRR       Mean reciprocal rank of the first clicked result
AUC       Area under the click vs. non-click curve
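For readers unfamiliar with the first two metrics, here is a small sketch of how NDCG@k and MRR are commonly computed. It assumes the widely used 2^rel − 1 gain and log2 discount; this page does not state which gain function Studio uses, so treat the exact formula as an assumption. The function names are illustrative.

```typescript
// Discounted cumulative gain over the first k relevance labels,
// using the common (2^rel - 1) / log2(rank + 1) form.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// NDCG@k: DCG of the actual ranking divided by the DCG of the ideal ordering.
function ndcgAtK(relevances: number[], k: number): number {
  const ideal = [...relevances].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(relevances, k) / idealDcg;
}

// MRR: mean over queries of 1 / (rank of the first clicked result).
// A query with no click contributes 0.
function meanReciprocalRank(clickedByQuery: boolean[][]): number {
  if (clickedByQuery.length === 0) return 0;
  const total = clickedByQuery.reduce((sum, clicks) => {
    const firstClick = clicks.indexOf(true);
    return sum + (firstClick === -1 ? 0 : 1 / (firstClick + 1));
  }, 0);
  return total / clickedByQuery.length;
}

// One relevant result at rank 2 instead of rank 1 lowers NDCG below 1.
const perfectNdcg = ndcgAtK([1, 0, 0], 10);
const swappedNdcg = ndcgAtK([0, 1, 0], 10);
// First click at rank 2 for one query and rank 1 for another.
const exampleMrr = meanReciprocalRank([[false, true], [true]]);
```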

Current shim status (May 2026)

The shipping LTR trainer is a shim algorithm: a simplified linear combiner that fits a small number of weights over the pre-computed features. It is intentionally conservative and ships behind the ltr.shim feature flag.

The full native LightGBM and XGBoost trainers are deferred. They are implemented behind the ltr.native feature flag but are not enabled for general availability, because they require additional load-testing and a model-registry rollout. For now, treat the shim as the only supported algorithm, and expect a 5-15% NDCG@10 lift over the unranked baseline, not the 25-40% some published LTR papers report. Customers who need native LightGBM today should contact support to discuss a private preview.
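To make "linear combiner" concrete, the sketch below scores a document as a weighted sum of the pre-defined features and sorts by that score. The feature field names, the bias term, and the example weights are assumptions for illustration; the shim's actual feature encoding and fitting procedure are not described on this page.

```typescript
// Hypothetical feature vector; names mirror the pre-defined feature set.
interface Features {
  textRelevance: number;
  recency: number;
  price: number;
  popularity: number;
  categoryCohort: number;
}

// One learned weight per feature plus a bias term (the bias is an assumption).
type Weights = Features & { bias: number };

// Linear combination: bias + sum of weight_i * feature_i.
function shimScore(f: Features, w: Weights): number {
  return (
    w.bias +
    w.textRelevance * f.textRelevance +
    w.recency * f.recency +
    w.price * f.price +
    w.popularity * f.popularity +
    w.categoryCohort * f.categoryCohort
  );
}

// Returns a new array sorted by descending score; the input is not mutated.
function rankByShim<T extends { features: Features }>(docs: T[], w: Weights): T[] {
  return [...docs].sort((a, b) => shimScore(b.features, w) - shimScore(a.features, w));
}

// Example weights, invented for the illustration.
const exampleWeights: Weights = {
  bias: 0, textRelevance: 1, recency: 0.1, price: 0, popularity: 0.2, categoryCohort: 0,
};
const rankedDocs = rankByShim(
  [
    { id: "a", features: { textRelevance: 0.2, recency: 0.9, price: 0.5, popularity: 0.1, categoryCohort: 0 } },
    { id: "b", features: { textRelevance: 0.9, recency: 0.1, price: 0.5, popularity: 0.1, categoryCohort: 0 } },
  ],
  exampleWeights,
);
```

With only a handful of weights to fit, a model of this kind has little capacity to overfit, which is consistent with the modest lift range quoted above.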

3 · Models

Every successful run produces a model in the Models panel. Models are immutable artifacts and carry:

  • a versioned ID (mdl_<random>), creation timestamp, and source training run,
  • the metric tuple (NDCG@10, MRR, AUC),
  • the feature set and window it was trained on,
  • the deployment status: draft, in_ab_test, active, or archived.

A model in active state is what the read path consults on every search request for that index. There is exactly one active model per index at any time.

4 · A/B tests

You do not promote a model from draft straight to active. Instead you launch an A/B test that splits live traffic between the candidate model (arm B) and the currently active model (arm A, the control).

Launching an A/B test

From Studio → LTR → A/B tests → New test:

// What Studio does under the hood — also callable directly
import { client } from "@repo/api/client";

const test = await client.ltr.abTests.create.call({
  indexSlug: "products",
  controlModelId: "mdl_active_now",
  variantModelId: "mdl_candidate",
  trafficSplit: 0.10, // 10% of live searches see arm B
  primaryMetric: "ctr", // ctr | cvr | ndcg10
  minSampleSize: 50_000, // searches per arm before significance is computed
});

The traffic split is per search request, not per user, so the same user can land on different arms from one search to the next. The variant model is applied for arm B; arm A continues to use the existing active model.
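A per-request split can be sketched as a deterministic hash of a request identifier compared against trafficSplit. This page does not describe how Studio assigns arms, so the hashing approach, the `requestId` input, and the function names below are assumptions; a plain random draw per request would satisfy the same description.

```typescript
// FNV-1a 32-bit hash of a string, returned as an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

type Arm = "A" | "B";

// Maps the request ID to a bucket in [0, 1) and sends the lowest
// `trafficSplit` fraction of buckets to arm B (the candidate model).
function assignArm(requestId: string, trafficSplit: number): Arm {
  if (trafficSplit < 0 || trafficSplit > 1) {
    throw new Error("trafficSplit must be between 0 and 1");
  }
  const bucket = fnv1a(requestId) / 0x100000000;
  return bucket < trafficSplit ? "B" : "A";
}

// Hypothetical request IDs: each search gets its own assignment.
const armForFirstSearch = assignArm("req-0001", 0.1);
const armForSecondSearch = assignArm("req-0002", 0.1);
```

Because the key is the request rather than the user, nothing ties a user's consecutive searches to the same arm.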

Reading the results

While the test runs, the panel shows live counters for each arm:

  • n — searches assigned to the arm.
  • clicks, ctr, cvr — primary metrics, refreshed every 5 minutes.
  • z-score — standardized difference between arm B and arm A on the primary metric.
  • p-value — two-sided.
  • decision: running, significant_win, significant_loss, ±borderline, or no_effect.

How to interpret significance

The A/B panel uses a standard two-proportion z-test, treating each search as an independent Bernoulli trial for the primary metric. The decision thresholds are:

Decision          z-score range                     What it means
significant_win   z ≥ +1.96 (p ≤ 0.05)              Candidate beats control with 95% confidence.
significant_loss  z ≤ −1.96                         Candidate loses with 95% confidence; stop the test.
±borderline       1.65 ≤ |z| < 1.96                 90–95% confidence; wait for more samples or pre-decide.
no_effect         |z| < 1.65 and n ≥ minSampleSize  No detectable difference at the chosen sample size.
running           n < minSampleSize                 Not enough data yet.
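The thresholds above can be reproduced with a few lines of arithmetic. The sketch below uses the pooled-variance form of the two-proportion z-test and an Abramowitz-Stegun approximation of erf for the two-sided p-value. The `ArmCounts` shape, the plain "borderline" string (shown as ±borderline in the panel), and the choice to report running until both arms reach minSampleSize are assumptions; Studio's exact computation is not shown on this page.

```typescript
interface ArmCounts {
  n: number;         // searches assigned to the arm
  successes: number; // searches with a click (ctr) or conversion (cvr)
}

type Decision = "running" | "significant_win" | "significant_loss" | "borderline" | "no_effect";

// Pooled two-proportion z-score of variant (arm B) against control (arm A).
function twoProportionZ(control: ArmCounts, variant: ArmCounts): number {
  const pA = control.successes / control.n;
  const pB = variant.successes / variant.n;
  const pooled = (control.successes + variant.successes) / (control.n + variant.n);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / control.n + 1 / variant.n));
  return se === 0 ? 0 : (pB - pA) / se;
}

// Abramowitz-Stegun 7.1.26 approximation of erf (max abs error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Two-sided p-value for a standard normal z-score.
function twoSidedPValue(z: number): number {
  return 1 - erf(Math.abs(z) / Math.SQRT2);
}

// Applies the decision table; "running" takes precedence (an assumption).
function decide(control: ArmCounts, variant: ArmCounts, minSampleSize: number): Decision {
  if (control.n < minSampleSize || variant.n < minSampleSize) return "running";
  const z = twoProportionZ(control, variant);
  if (z >= 1.96) return "significant_win";
  if (z <= -1.96) return "significant_loss";
  if (Math.abs(z) >= 1.65) return "borderline";
  return "no_effect";
}

// 30.0% vs 31.0% CTR on 50,000 searches per arm.
const controlArm: ArmCounts = { n: 50_000, successes: 15_000 };
const variantArm: ArmCounts = { n: 50_000, successes: 15_500 };
const exampleZ = twoProportionZ(controlArm, variantArm);
const exampleDecision = decide(controlArm, variantArm, 50_000);
```

A one-point CTR lift at this sample size yields a z-score well above 1.96, while a half-point lift on the same counts falls in the borderline band.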

The ±borderline band is intentionally surfaced as a separate decision because most product teams have a pre-test rule like "ship at 90% if the delta is positive, hold at 95% if it's negative". Studio does not auto-promote on borderline: a human must click Activate winner.

Sample-size guidance

The default minSampleSize is 50,000 searches per arm. With a 10% traffic split, an index doing 50k searches/day sends about 5,000 searches/day to arm B, so it will fill that arm and reach significance in roughly 10 days for an effect size of 2 percentage points on CTR. For smaller indexes, expect 3–4 weeks. The Studio panel surfaces a live ETA.
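A back-of-the-envelope version of that estimate is the time needed for the smaller arm to collect minSampleSize searches. The function below is an illustrative sketch with a made-up name; the live ETA in Studio may account for traffic seasonality or other factors not modeled here.

```typescript
// Days until the slower-filling arm has collected `minSampleSize` searches,
// assuming constant daily traffic and a fixed split.
function daysToMinSample(
  searchesPerDay: number,
  trafficSplit: number,
  minSampleSize: number,
): number {
  const variantPerDay = searchesPerDay * trafficSplit;
  const controlPerDay = searchesPerDay * (1 - trafficSplit);
  const slowestArmPerDay = Math.min(variantPerDay, controlPerDay);
  if (slowestArmPerDay <= 0) return Infinity;
  return Math.ceil(minSampleSize / slowestArmPerDay);
}

// The scenario from the text: 50k searches/day, 10% split, 50k per arm.
const etaDays = daysToMinSample(50_000, 0.1, 50_000);
// An even split fills both arms much faster on the same traffic.
const etaDaysEvenSplit = daysToMinSample(50_000, 0.5, 50_000);
```

The estimate makes the trade-off visible: a smaller split limits exposure to a bad candidate but stretches the test, because the variant arm is the bottleneck.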

Activating a winner

When the decision is significant_win (or you've decided to ship on ±borderline), click Activate winner in the A/B panel. This:

  1. Sets the variant model's status to active.
  2. Demotes the previous active model to archived (it can be re-activated with one click if the new model regresses in production).
  3. Closes the A/B test with a final report.
  4. Routes 100% of live traffic through the new model within ~60 seconds (the policy-cache LRU TTL).

await client.ltr.abTests.activateWinner.call({
  testId: test.id,
  // optional safety check — refuses to activate if the live decision
  // diverged from `significant_win` between fetch and call
  expectedDecision: "significant_win",
});

If you need to roll back after activation, the Models panel keeps the previous active model under archived and exposes a one-click Re-activate.
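Activation and rollback are the same status swap applied in opposite directions, which the sketch below models on an in-memory list. The `ModelRecord` shape and the `activate` helper are hypothetical and exist only to illustrate the "exactly one active model per index" invariant; they are not part of the Studio API.

```typescript
type ModelStatus = "draft" | "in_ab_test" | "active" | "archived";

// Hypothetical in-memory record; Studio's real model schema is not shown here.
interface ModelRecord {
  id: string;
  indexSlug: string;
  status: ModelStatus;
}

// Returns a new list in which `modelId` is active for `indexSlug` and the
// previously active model for that index is archived. Other indexes are untouched.
function activate(models: ModelRecord[], indexSlug: string, modelId: string): ModelRecord[] {
  const target = models.find((m) => m.id === modelId && m.indexSlug === indexSlug);
  if (!target) throw new Error(`unknown model ${modelId} for index ${indexSlug}`);
  return models.map((m): ModelRecord => {
    if (m.indexSlug !== indexSlug) return m;
    if (m.id === modelId) return { ...m, status: "active" };
    if (m.status === "active") return { ...m, status: "archived" };
    return m;
  });
}

const registryBefore: ModelRecord[] = [
  { id: "mdl_a", indexSlug: "products", status: "active" },
  { id: "mdl_b", indexSlug: "products", status: "in_ab_test" },
  { id: "mdl_c", indexSlug: "articles", status: "active" },
];

// Promote the candidate, then roll back by re-activating the archived model.
const registryAfterActivation = activate(registryBefore, "products", "mdl_b");
const registryAfterRollback = activate(registryAfterActivation, "products", "mdl_a");
```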

Best practices

  • Always run an A/B test. Even a model that wins on offline NDCG can regress on live CTR — position bias and presentation effects don't show up in the offline metrics.
  • Pick the metric that maps to your business. CTR is the default; conversion-driven catalogs should set primaryMetric: "cvr".
  • Don't peek-and-stop. The z-test assumes a fixed sample size — looking at the result before minSampleSize and stopping early inflates the false-positive rate. Set the sample size, walk away, then decide.
  • Prune archived models aggressively. Keep at most 5–10 archived models per index. The model registry is cheap, but the audit log is easier to read with fewer rows.
