Learning to Rank
The LTR pipeline in Relevance Studio — click feedback, position-bias correction, training, model versioning, A/B tests, and activation. How to interpret z-test significance and pick a winner.
Learning to Rank (LTR) is the closed feedback loop that turns real user clicks into a learned ranking model. Studio ships four panels that map to the four stages of the pipeline: Click feedback, Training runs, Models, and A/B tests.
Pipeline at a glance
```
clicks (SearchUsageEvent)
   │
   ▼  position-bias correction (debias)
   │
   ▼  training (LightGBM-shim today, native LightGBM/XGBoost deferred)
   │
   ▼  model artifact + metrics (NDCG, MRR, AUC)
   │
   ▼  A/B test: split traffic, z-test significance
   │
   ▼  activate winner → ranking on the read path
```

Every stage is reversible. A bad model is one click away from being de-activated; the previous active model is always retained.
1 · Click feedback
The Click feedback panel summarizes the raw signal that feeds the trainer. It surfaces three numbers per index:
- clicks-with-context: how many `click` events arrived with a valid `queryId` and `position` payload. Without `queryId` we cannot tie a click back to a search.
- position-bias correction: the debiasing curve currently in effect. Position bias is the well-known phenomenon that position 1 gets clicked regardless of relevance. The corrected score is roughly `observed_ctr / propensity[position]`, where `propensity` is fit per index.
- usable training rows: after debiasing and zero-result filtering, how many rows are eligible to feed the trainer.
If usable training rows are below ~5k, the Training panel will refuse to launch a run and show a "not enough signal" banner. Below this floor the learned model overfits and underperforms the static ranker.
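The correction formula above can be sketched as follows. This is a minimal illustration, not Studio's implementation: the `PositionStats` shape, the `debiasedCtr` helper, and the propensity values are all hypothetical, and the propensities are assumed to have been fit elsewhere.

```typescript
// Hypothetical shapes; field names are illustrative, not Studio's schema.
interface PositionStats {
  position: number;    // 1-based rank at which results were shown
  impressions: number; // times a result was shown at this position
  clicks: number;      // clicks received at this position
}

// propensity[position]: estimated chance that a user examines that position.
// The values are assumed to be fit per index by some other process.
type Propensity = Record<number, number>;

// corrected score ≈ observed_ctr / propensity[position]
function debiasedCtr(stats: PositionStats, propensity: Propensity): number {
  const p = propensity[stats.position];
  if (p === undefined || p <= 0) {
    throw new Error(`no usable propensity for position ${stats.position}`);
  }
  if (stats.impressions === 0) return 0;
  const observedCtr = stats.clicks / stats.impressions;
  return observedCtr / p;
}

// Made-up numbers: position 3 has half the raw CTR of position 1,
// but users are assumed to examine it only a quarter as often.
const examplePropensity: Propensity = { 1: 1.0, 2: 0.5, 3: 0.25 };
const ctrAtTop = debiasedCtr({ position: 1, impressions: 1000, clicks: 200 }, examplePropensity);
const ctrAtThird = debiasedCtr({ position: 3, impressions: 1000, clicks: 100 }, examplePropensity);
console.log(ctrAtTop, ctrAtThird);
```

With these illustrative inputs the result shown at position 3 ends up with the higher corrected score, which is the effect debiasing is meant to produce: a lower raw CTR at a rarely examined position can still indicate stronger relevance.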
2 · Training runs
A training run produces one model artifact from a chosen window of click feedback. From Studio you choose:
- Window: last 7, 30, or 90 days.
- Algorithm: currently `shim` (the only option in shipping releases). Native LightGBM and XGBoost are deferred; see the note below.
- Features: pre-defined feature set (text relevance, recency, price, popularity, category cohort). Custom features are on the roadmap.
The run shows progress, then surfaces three metrics on completion:
| Metric | Meaning |
|---|---|
| NDCG@10 | Normalized discounted cumulative gain at top 10 — primary metric |
| MRR | Mean reciprocal rank of the first clicked result |
| AUC | Area under the click vs. non-click curve |
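For readers who want to sanity-check the first two metrics by hand, here is a minimal sketch of NDCG@k and MRR. It assumes the common `2^gain - 1` gain and `log2(rank + 1)` discount; the page does not say which NDCG variant Studio computes, so treat the exact formula and the function names as assumptions.

```typescript
// gains[i] is the relevance label of the result shown at rank i + 1
// (for click data: 1 = clicked, 0 = not clicked).
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, g, i) => sum + (Math.pow(2, g) - 1) / (Math.log(i + 2) / Math.LN2), 0);
}

// DCG normalized by the DCG of the ideal (descending) ordering of the same labels.
function ndcgAtK(gains: number[], k: number): number {
  const ideal = [...gains].sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(gains, k) / idcg;
}

// Reciprocal rank of the first clicked result; 0 if nothing was clicked.
function reciprocalRank(clicked: boolean[]): number {
  const first = clicked.indexOf(true);
  return first === -1 ? 0 : 1 / (first + 1);
}

// Mean of the per-query reciprocal ranks.
function meanReciprocalRank(queries: boolean[][]): number {
  if (queries.length === 0) return 0;
  return queries.reduce((sum, q) => sum + reciprocalRank(q), 0) / queries.length;
}

console.log(ndcgAtK([0, 1, 0], 10));
console.log(meanReciprocalRank([[false, true], [true]]));
```

AUC is omitted here because it depends on the model's raw scores rather than on the displayed ranking alone.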
Current shim status (May 2026)
The shipping LTR trainer is a shim algorithm: a simplified linear
combiner that fits a small number of weights over the pre-computed
features. It is intentionally conservative and ships behind the ltr.shim
feature flag.
The full native LightGBM and XGBoost trainers are deferred. They are
implemented behind the `ltr.native` feature flag but are not enabled for
general availability; they still require additional load-testing and a
model-registry rollout. For now, treat the shim as the only supported
algorithm, and expect a 5–15% NDCG@10 lift over the unranked baseline, not
the 25–40% some published LTR papers report. Customers who need native
LightGBM today should contact support to discuss a private preview.
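To make "linear combiner" concrete, the sketch below scores a document as a weighted sum of its features and reranks by that score. It only illustrates the scoring side: how the shim actually fits its weights is not described on this page, and the feature keys, weight values, and function names are hypothetical.

```typescript
// Keys mirror the pre-defined feature set; the exact names are assumptions.
type FeatureName = "textRelevance" | "recency" | "price" | "popularity" | "categoryCohort";
type FeatureVector = Record<FeatureName, number>;

// score = sum over features of weight * feature value
function linearScore(weights: FeatureVector, features: FeatureVector): number {
  return (Object.keys(weights) as FeatureName[])
    .reduce((sum, key) => sum + weights[key] * features[key], 0);
}

// Returns a new array sorted by descending score; the input is not mutated.
function rerank<T extends { features: FeatureVector }>(weights: FeatureVector, docs: T[]): T[] {
  return [...docs].sort(
    (a, b) => linearScore(weights, b.features) - linearScore(weights, a.features),
  );
}

// Illustrative weights and documents only.
const exampleWeights: FeatureVector = {
  textRelevance: 0.6, recency: 0.1, price: 0, popularity: 0.2, categoryCohort: 0.1,
};
const rankedDocs = rerank(exampleWeights, [
  { id: "a", features: { textRelevance: 0.2, recency: 1, price: 0, popularity: 0.1, categoryCohort: 0 } },
  { id: "b", features: { textRelevance: 0.9, recency: 0, price: 0, popularity: 0.5, categoryCohort: 0 } },
]);
console.log(rankedDocs.map((d) => d.id));
```

A model of this shape has only a handful of parameters, which is consistent with the conservative lift the note tells you to expect.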
3 · Models
Every successful run produces a model in the Models panel. Models are immutable artifacts and carry:
- a versioned ID (`mdl_<random>`), creation timestamp, and source training run,
- the metric tuple `(NDCG@10, MRR, AUC)`,
- the feature set and window it was trained on,
- the deployment status: `draft`, `in_ab_test`, `active`, or `archived`.
A model in active state is what the read path consults on every search
request for that index. There is exactly one active model per index at
any time.
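If you mirror model metadata in your own tooling, the four statuses and the one-active-model rule are easy to encode. The sketch below is an assumption-heavy illustration: the `ModelRecord` shape and the helper name are hypothetical and do not describe the Studio API response.

```typescript
type ModelStatus = "draft" | "in_ab_test" | "active" | "archived";

// Hypothetical local record; the real model object carries more fields.
interface ModelRecord {
  id: string;        // e.g. "mdl_<random>"
  indexSlug: string;
  status: ModelStatus;
}

// Returns the index slugs that do not have exactly one active model.
function indexesViolatingSingleActive(models: ModelRecord[]): string[] {
  const activeCounts: Record<string, number> = {};
  for (const m of models) {
    activeCounts[m.indexSlug] =
      (activeCounts[m.indexSlug] ?? 0) + (m.status === "active" ? 1 : 0);
  }
  return Object.keys(activeCounts).filter((slug) => activeCounts[slug] !== 1);
}

const violations = indexesViolatingSingleActive([
  { id: "mdl_1", indexSlug: "products", status: "active" },
  { id: "mdl_2", indexSlug: "products", status: "archived" },
  { id: "mdl_3", indexSlug: "docs", status: "active" },
  { id: "mdl_4", indexSlug: "docs", status: "active" },
  { id: "mdl_5", indexSlug: "blog", status: "draft" },
]);
console.log(violations);
```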
4 · A/B tests
You do not promote a model from draft straight to active. Instead you
launch an A/B test that splits live traffic between the candidate model
(arm B) and the currently active model (arm A, the control).
Launching an A/B test
From Studio → LTR → A/B tests → New test:
```typescript
// What Studio does under the hood (also callable directly)
import { client } from "@repo/api/client";

const test = await client.ltr.abTests.create.call({
  indexSlug: "products",
  controlModelId: "mdl_active_now", // arm A: the currently active model
  variantModelId: "mdl_candidate",  // arm B: the candidate under test
  trafficSplit: 0.10,               // 10% of live searches see arm B
  primaryMetric: "ctr",             // ctr | cvr | ndcg10
  minSampleSize: 50_000,            // searches per arm before significance is computed
});
```

The traffic split is per search request, not per user, so the same user can land on different arms across sessions. The variant model is applied for arm B; arm A continues to use the existing active model.
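Per-request splitting can be pictured as an independent draw for every search. The sketch below is one plausible way to do it, not a description of how Studio assigns arms; `assignArm` and its injectable `rng` parameter are hypothetical.

```typescript
type Arm = "A" | "B";

// Draws once per search request. No user or session key is involved, which is
// why the same user can see different arms on different searches.
// rng is injectable so the behaviour can be tested deterministically.
function assignArm(trafficSplit: number, rng: () => number = Math.random): Arm {
  return rng() < trafficSplit ? "B" : "A";
}

// With trafficSplit = 0.10, a draw below 0.10 goes to the variant arm.
const lowDrawArm = assignArm(0.10, () => 0.05);
const highDrawArm = assignArm(0.10, () => 0.50);
console.log(lowDrawArm, highDrawArm);
```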
Reading the results
While the test runs, the panel shows live counters for each arm:
- `n`: searches assigned to the arm.
- `clicks`, `ctr`, `cvr`: primary metrics, refreshed every 5 minutes.
- z-score: standardized difference between arm B and arm A on the primary metric.
- p-value: two-sided.
- decision: `running`, `significant_win`, `significant_loss`, `±borderline`, or `no_effect`.
How to interpret significance
The A/B test uses a standard two-proportion z-test, treating each search as an independent Bernoulli trial for the primary metric. The decision thresholds are:
| Decision | z-score range | What it means |
|---|---|---|
| `significant_win` | z ≥ +1.96 (p ≤ 0.05) | Candidate beats control with 95% confidence. |
| `significant_loss` | z ≤ −1.96 | Candidate loses with 95% confidence; stop the test. |
| `±borderline` | 1.65 ≤ \|z\| < 1.96 | 90–95% confidence; wait for more samples or apply a pre-agreed ship rule. |
| `no_effect` | \|z\| < 1.65 and n ≥ `minSampleSize` | No detectable difference at the chosen sample size. |
| `running` | n < `minSampleSize` | Not enough data yet. |
The `±borderline` band is intentionally surfaced as a separate decision
because most product teams have a pre-test rule like "ship at 90% if the
delta is positive, hold at 95% if it's negative". Studio does not
auto-promote on borderline; a human must click Activate winner.
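The decision logic can be reproduced offline with a few lines. This sketch assumes the pooled-variance form of the two-proportion z-test and treats the test as `running` while either arm is below `minSampleSize`; the page specifies neither detail, and the type and function names are hypothetical.

```typescript
interface ArmCounts {
  n: number;         // searches assigned to the arm
  successes: number; // e.g. searches with a click when the primary metric is CTR
}

type Decision = "running" | "significant_win" | "significant_loss" | "±borderline" | "no_effect";

// Pooled two-proportion z-score for arm B (variant) against arm A (control).
function twoProportionZ(a: ArmCounts, b: ArmCounts): number {
  const pA = a.successes / a.n;
  const pB = b.successes / b.n;
  const pooled = (a.successes + b.successes) / (a.n + b.n);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / a.n + 1 / b.n));
  return standardError === 0 ? 0 : (pB - pA) / standardError;
}

// Applies the thresholds from the decision table.
function classify(a: ArmCounts, b: ArmCounts, minSampleSize: number): Decision {
  if (a.n < minSampleSize || b.n < minSampleSize) return "running";
  const z = twoProportionZ(a, b);
  if (z >= 1.96) return "significant_win";
  if (z <= -1.96) return "significant_loss";
  if (Math.abs(z) >= 1.65) return "±borderline";
  return "no_effect";
}

// Made-up counts: control CTR 10.00%, variant CTR 10.36%.
const controlArm: ArmCounts = { n: 50_000, successes: 5_000 };
const variantArm: ArmCounts = { n: 50_000, successes: 5_180 };
console.log(twoProportionZ(controlArm, variantArm), classify(controlArm, variantArm, 50_000));
```

In this example a 0.36 percentage point CTR gap at 50,000 searches per arm falls in the borderline band, which shows why small effects need more samples than the default to reach the 1.96 threshold.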
Sample-size guidance
The default `minSampleSize` is 50,000 searches per arm. With a 10%
traffic split, an index doing 50k searches/day sends about 5,000
searches/day to arm B, so arm B needs 10 days to reach `minSampleSize`;
expect significance in roughly 11 days for an effect size of 2 percentage
points on CTR. For smaller indexes, expect 3–4 weeks. The Studio panel
surfaces a live ETA.
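A back-of-the-envelope lower bound on test duration follows from the traffic split alone: the arm that fills more slowly determines when `minSampleSize` is reached. The helper below is a hypothetical sketch that assumes steady daily traffic; the live ETA in the panel may use a different method.

```typescript
// Days until the slower-filling arm has collected minSampleSize searches,
// assuming constant daily traffic and a fixed split.
function daysToMinSample(dailySearches: number, trafficSplit: number, minSampleSize: number): number {
  const variantPerDay = dailySearches * trafficSplit;
  const controlPerDay = dailySearches * (1 - trafficSplit);
  const slowestPerDay = Math.min(variantPerDay, controlPerDay);
  if (slowestPerDay <= 0) return Infinity; // an arm with no traffic never fills
  return Math.ceil(minSampleSize / slowestPerDay);
}

// 50k searches/day at a 10% split: arm B receives 5,000 searches/day.
const exampleDays = daysToMinSample(50_000, 0.10, 50_000);
console.log(exampleDays);
```

This only tells you when significance can first be computed; whether the result is significant at that point depends on the effect size.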
Activating a winner
When the decision is significant_win (or you've decided to ship on
±borderline), click Activate winner in the A/B panel. This:
- Sets the variant model's status to `active`.
- Demotes the previous active model to `archived` (it can be re-activated with one click if the new model regresses in production).
- Closes the A/B test with a final report.
- Routes 100% of live traffic through the new model within ~60 seconds (the `policy-cache` LRU TTL).
```typescript
await client.ltr.abTests.activateWinner.call({
  testId: test.id,
  // optional safety check: refuses to activate if the live decision
  // diverged from `significant_win` between fetch and call
  expectedDecision: "significant_win",
});
```

If you need to roll back after activation, the Models panel keeps the
previous active model under `archived` and exposes a one-click Re-activate.
Best practices
- Always run an A/B test. Even a model that wins on offline NDCG can regress on live CTR — position bias and presentation effects don't show up in the offline metrics.
- Pick the metric that maps to your business. CTR is the default; conversion-driven catalogs should set `primaryMetric: "cvr"`.
- Don't peek-and-stop. The z-test assumes a fixed sample size; looking at the result before `minSampleSize` and stopping early inflates the false-positive rate. Set the sample size, walk away, then decide.
- Archive aggressively. Keep at most 5–10 archived models per index. The model registry is cheap but the audit log is easier to read with fewer rows.
Related
- LTR API reference, with the full procedure list: `ltr.training.runs.*`, `ltr.models.*`, `ltr.abTests.*`.
- Personalization, an orthogonal layer that composes with LTR on the read path.
- Architecture → Analytics feedback loop — where the click events that feed the trainer originate.
- Plans & Limits — LTR requires the Scale plan or above.