Knowledge Sources
Supported source types, ingestion lifecycle, chunking strategies, and sync behavior for the Knowledge module.
A data source describes where a Knowledge space's documents come from. The Knowledge module currently supports two kinds out of the box, file upload and URL fetch, with connector-driven sources (Confluence, Notion, Google Drive, SharePoint) on the roadmap.
Source types
| sourceType | Status | What it ingests | Sync behavior |
|---|---|---|---|
| FILE_UPLOAD | ✅ Available | PDF, DOCX, TXT, Markdown | One-shot; re-upload to refresh |
| URL_FETCH | ✅ Available | A single web page (HTML → text) | One-shot; re-ingest to refresh |
| CONFLUENCE | ⏳ Roadmap | Confluence pages (per space / per label) | Scheduled, incremental |
| NOTION | ⏳ Roadmap | Notion pages and databases | Scheduled, incremental |
| GOOGLE_DRIVE | ⏳ Roadmap | Drive folders (docs, slides, sheets) | Scheduled, incremental |
| SHAREPOINT | ⏳ Roadmap | SharePoint document libraries | Scheduled, incremental |
The full enum lives in packages/database/prisma/schema.prisma as KnowledgeSourceType. Roadmap entries are placeholders — they don't yet have a DataSource row, parser, or sync worker.
Supported file formats (FILE_UPLOAD)
| Format | Status | Notes |
|---|---|---|
| application/pdf | ✅ | Extracted with the pdf-parse adapter in packages/api/modules/knowledge/lib/parsers.ts |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX) | ✅ | mammoth adapter |
| text/plain (TXT) | ✅ | UTF-8 expected. Latin-1 and Windows-1251 fall back to a heuristic decode. |
| text/markdown | ✅ | Markdown is split chunk-aware (see strategies below). |
| application/msword (legacy .doc) | ❌ | Convert to DOCX first. |
| application/rtf | ❌ | Convert to TXT / DOCX first. |
| Scanned PDF (image-only) | ❌ | OCR is roadmap. PDFs must have a text layer. |
File-size cap depends on the org plan (MAX_KNOWLEDGE_FILE_SIZE_BYTES entitlement). Defaults are 25 MB for paid plans, 5 MB for free.
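A client-side pre-check can reject unsupported files before they reach the ingestion queue. The sketch below mirrors the format table and the default caps above. The names `checkUpload`, `SUPPORTED_MIME_TYPES`, and `Plan` are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and the effective cap comes from the org's entitlement rather than these constants.

```typescript
// Hypothetical pre-upload check. MIME types and default caps are copied from
// the tables above; the function and type names are illustrative only.
type Plan = "free" | "paid";

const SUPPORTED_MIME_TYPES: ReadonlySet<string> = new Set([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "text/plain",
  "text/markdown",
]);

const MB = 1024 * 1024; // assumption: binary megabytes
// Assumed defaults; the real value is read from the org's entitlement.
const DEFAULT_MAX_BYTES: Record<Plan, number> = { free: 5 * MB, paid: 25 * MB };

type UploadCheck = { ok: true } | { ok: false; reason: string };

function checkUpload(mimeType: string, sizeBytes: number, plan: Plan): UploadCheck {
  if (!SUPPORTED_MIME_TYPES.has(mimeType)) {
    return { ok: false, reason: `Unsupported format: ${mimeType}` };
  }
  const cap = DEFAULT_MAX_BYTES[plan];
  if (sizeBytes > cap) {
    return { ok: false, reason: `File is ${sizeBytes} bytes; cap is ${cap}` };
  }
  return { ok: true };
}
```

A MIME-type check cannot detect an image-only PDF, so a scanned document would pass this check and still fail at parse time.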
Ingestion lifecycle

    DataSource ──► IngestionJob (QUEUED)
                         │
                         ▼
                   IngestionJob (RUNNING)
                         │
               parse + chunk + embed
                         │
             ┌───────────┴───────────┐
             ▼                       ▼
     KnowledgeDocument         (graph build)
             │
             ▼
     KnowledgeChunk × N (+ embedding)
             │
             ▼
     IngestionJob (SUCCEEDED / FAILED)

Each IngestionJob row carries:
- status: QUEUED → RUNNING → SUCCEEDED or FAILED.
- totalItems: the parser's estimate of the work unit count (pages / sections).
- processedItems, failedItems: running totals visible during execution.
- errorMessage: populated on failure. Surfaced in the dashboard's Ingestion Jobs tab.
You can list jobs with knowledge.listIngestionJobs and inspect a single job's inputMeta (the original file name, mime type, bytes) to diagnose failures.
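The job fields above are enough to render a progress line in a custom UI. A minimal sketch, assuming processedItems and failedItems are disjoint counts; the `IngestionJobView` interface and `describeJob` are hypothetical and are not the generated Prisma type.

```typescript
// Field names follow the IngestionJob description above; the interface itself
// is an assumption made for this example.
type JobStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED";

interface IngestionJobView {
  status: JobStatus;
  totalItems: number;
  processedItems: number;
  failedItems: number;
  errorMessage?: string | null;
}

function describeJob(job: IngestionJobView): string {
  switch (job.status) {
    case "QUEUED":
      return "queued";
    case "FAILED":
      return `failed: ${job.errorMessage ?? "no error message"}`;
    case "SUCCEEDED":
      return `succeeded (${job.failedItems} items failed)`;
    case "RUNNING": {
      // totalItems is only the parser's estimate, so clamp to 100%.
      const done = job.processedItems + job.failedItems;
      const pct =
        job.totalItems > 0
          ? Math.min(100, Math.round((done / job.totalItems) * 100))
          : 0;
      return `running: ${pct}%`;
    }
  }
}
```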
Retry semantics
Failed jobs are not auto-retried. The recommended pattern:
- Inspect errorMessage in the dashboard.
- Fix the source (re-encode the PDF, expand the URL, swap a corrupt DOCX) or the env (LLM quota exhausted).
- Re-run by re-uploading or calling ingestFile / ingestUrl again. The second run creates a new IngestionJob and re-uses the existing DataSource.
Re-ingest replaces the previous KnowledgeDocument (matched by (knowledgeSpaceId, externalId)) and triggers a chunk + embedding refresh. Old chunks are deleted via the documentId cascade.
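The replace-on-re-ingest rule behaves like an upsert keyed by (knowledgeSpaceId, externalId). The toy model below illustrates that behaviour in memory only. `DocStore` and its methods are hypothetical, and the real logic lives in the database layer.

```typescript
// In-memory model of "re-ingest replaces the previous document".
// Not the actual persistence code; names are illustrative.
class DocStore {
  // key -> chunks of the current document version
  private docs = new Map<string, string[]>();

  // A document is identified by (knowledgeSpaceId, externalId).
  private key(spaceId: string, externalId: string): string {
    return JSON.stringify([spaceId, externalId]);
  }

  // Ingesting an existing key discards the old chunks (the cascade)
  // and stores the fresh ones.
  ingest(spaceId: string, externalId: string, chunks: string[]): void {
    const k = this.key(spaceId, externalId);
    this.docs.delete(k);
    this.docs.set(k, [...chunks]);
  }

  documentCount(): number {
    return this.docs.size;
  }

  chunkCount(): number {
    let n = 0;
    this.docs.forEach((chunks) => {
      n += chunks.length;
    });
    return n;
  }
}
```

Ingesting the same externalId twice in one space leaves one document and only the newer chunks; the same externalId in a different space is a separate document.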
Chunking strategies
Chunking is implemented in packages/api/modules/knowledge/lib/chunking.ts. The strategy decides how the parsed text is sliced before embedding:
| Strategy | Best for | How it splits |
|---|---|---|
| fixed | Default. Mixed-format documents. | Fixed word count per chunk with configurable overlap. |
| semantic | Long prose, articles. | Splits on paragraph + sentence boundaries; falls back to fixed when the segment is too long. |
| markdown | Markdown / Confluence-style docs. | Splits on heading levels. Keeps each section together when possible. |
| code | Source code documentation, API references. | Splits on function / class boundaries. Aware of triple-fenced code blocks. |
Configurable per call (or per space, Beta):

    {
      strategy: "semantic",
      chunkSize: 350,    // target words per chunk
      minChunkSize: 60,  // merge anything smaller than this
      maxChunkSize: 600,
      overlap: 50,       // overlapping words between adjacent chunks
    }

Rule of thumb: 250–400 words per chunk with 30–50 words of overlap covers most prose. Larger chunks reduce retrieval count but blunt the relevance signal; smaller chunks improve relevance but inflate embedding cost.
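To make the interaction between chunk size and overlap concrete, here is a minimal word-window chunker in the spirit of the fixed strategy. It is a sketch of the general technique, not the code in chunking.ts; `chunkFixed` is a hypothetical name, and minChunkSize / maxChunkSize handling is left out.

```typescript
// Fixed-size word chunker with overlap (illustrative only).
function chunkFixed(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be >= 0 and smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once the window has reached the end of the text, so no trailing
    // chunk is emitted that only repeats the overlap.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

With chunkSize 350 and overlap 50 the window advances 300 words at a time, so a 1,000-word document produces four chunks under this scheme, the last one holding the final 100 words.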
Embedding
Each chunk's embedding is a Json array of floats sized to the configured model:
| Model | Dimensions | Status |
|---|---|---|
| text-embedding-3-small | 1536 | ✅ Default |
| text-embedding-3-large | 3072 | 🟡 Beta (per-space selection) |
| Local hashing fallback | 128 | ✅ Available — used when no API key is configured (dev mode only) |
KnowledgeSpace.ragConfig.embeddingModel overrides the default per space. The local 128-dim fallback (embedTextLocally in chunking.ts) is dev-only — it does not produce useful retrieval; configure a real model before going to production.
Switching the embedding model invalidates all existing chunk embeddings. After changing it, trigger a full re-ingest of every source in the space.
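A model switch invalidates stored vectors because similarity is only meaningful between vectors from the same embedding space. The sketch below hard-codes the dimension table above and shows a similarity function that refuses mismatched sizes. `EMBEDDING_DIMS`, `cosineSimilarity`, `isStale`, and the `local-hash` key are hypothetical names. A dimension check catches only size mismatches; vectors from two different models of equal size are still not comparable, which is why a full re-ingest is the safe path.

```typescript
// Dimensions copied from the table above; "local-hash" is a placeholder key
// for the dev-only fallback.
const EMBEDDING_DIMS: Record<string, number> = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
  "local-hash": 128,
};

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// True when a stored vector cannot have been produced by the given model.
function isStale(embedding: number[], model: string): boolean {
  const dim = EMBEDDING_DIMS[model];
  if (dim === undefined) throw new Error(`Unknown model: ${model}`);
  return embedding.length !== dim;
}
```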
Ingesting a file

    const job = await orpc.knowledge.ingestFile.call({
      spaceId: "ks_…",
      fileName: "handbook.pdf",
      mimeType: "application/pdf",
      bytes: pdfArrayBuffer,
    });
    // → { ingestionJobId, dataSourceId }

Poll knowledge.listIngestionJobs for status, or subscribe to the dashboard SSE channel.
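Polling for a terminal status can be wrapped in a small helper. This is a sketch under assumptions: `waitForJob` and `fetchStatus` are hypothetical, and `fetchStatus` stands in for whatever lookup you build on top of knowledge.listIngestionJobs, whose response shape is not shown here. The interval and attempt limit are arbitrary defaults.

```typescript
// Generic polling helper. The caller supplies `fetchStatus`; nothing here
// calls the Knowledge API directly.
type TerminalStatus = "SUCCEEDED" | "FAILED";
type PollStatus = "QUEUED" | "RUNNING" | TerminalStatus;

interface PollOptions {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests run instantly
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function waitForJob(
  fetchStatus: () => Promise<PollStatus>,
  { intervalMs = 2000, maxAttempts = 150, sleep = defaultSleep }: PollOptions = {},
): Promise<TerminalStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status === "SUCCEEDED" || status === "FAILED") return status;
    await sleep(intervalMs);
  }
  throw new Error(`Job still not finished after ${maxAttempts} polls`);
}
```

Because failed jobs are not retried automatically, a FAILED result is returned rather than thrown, leaving the caller to inspect errorMessage and decide what to do.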
Ingesting a URL

    await orpc.knowledge.ingestUrl.call({
      spaceId: "ks_…",
      url: "https://example.com/docs/payments",
      // optional:
      cssSelector: "main article", // narrow what to extract
    });

Only public HTML pages are supported: no JS execution, no auth headers, no SPA-rendered content. Pages behind login are on the roadmap (connectors).
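Some unusable URLs can be filtered out before a job is queued. The check below is a hypothetical helper (`checkIngestUrl` is not part of the module) that inspects only the URL string. Rejecting embedded credentials and loopback hosts is an assumption derived from the "public, no auth" rule, and the check cannot tell whether a page needs JavaScript or a login.

```typescript
// Pre-flight check for a URL string, using the standard WHATWG URL class.
function checkIngestUrl(raw: string): { ok: boolean; reason?: string } {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, reason: "Not a valid absolute URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: `Unsupported scheme: ${url.protocol}` };
  }
  // Assumption: credentials in the URL count as authentication.
  if (url.username !== "" || url.password !== "") {
    return { ok: false, reason: "Embedded credentials are not supported" };
  }
  // Assumption: loopback hosts are never public.
  if (url.hostname === "localhost" || url.hostname === "127.0.0.1") {
    return { ok: false, reason: "Host is not publicly reachable" };
  }
  return { ok: true };
}
```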
Deleting sources and files
| Action | API | Effect |
|---|---|---|
| Delete a single file/document | knowledge.deleteFile({ documentId }) | Cascades to all its KnowledgeChunk rows. |
| Delete a data source | knowledge.deleteSource({ dataSourceId }) | Cascades to all KnowledgeDocument and chunks under it. |
| Delete a space | knowledge.deleteSpace({ spaceId }) | Cascades to all sources, documents, chunks, graph nodes, edges. |
Cascades are at the Postgres FK level (see onDelete: Cascade in schema.prisma) — deletion is durable and irreversible. Confirm before exposing in your UI.
Quotas
Per-org quotas come from the entitlements package:
| Quota | Field on plan |
|---|---|
| Max spaces per org | entitlements.knowledge.maxSpaces |
| Max documents per space | entitlements.knowledge.maxDocsPerSpace |
| Max bytes per file | entitlements.knowledge.maxBytesPerFile |
| Monthly embedding budget | entitlements.knowledge.monthlyEmbeddingKopecks |
Exhausting a quota returns 402 Payment Required. A dashboard quota_warning activity event is emitted earlier, when usage reaches 80% of the monthly budget.
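The two thresholds for the monthly embedding budget can be expressed as a small classifier. This is a sketch of the rule as described here, not the entitlements package's code; `quotaState` and its parameter names are hypothetical, and both amounts are assumed to be integer kopecks.

```typescript
type QuotaState = "ok" | "warning" | "exhausted";

// Classifies monthly embedding spend against the budget.
function quotaState(usedKopecks: number, budgetKopecks: number): QuotaState {
  if (budgetKopecks <= 0 || usedKopecks >= budgetKopecks) return "exhausted";
  // used / budget >= 0.8, written with integers to avoid floating-point
  // rounding exactly at the 80% boundary.
  if (usedKopecks * 5 >= budgetKopecks * 4) return "warning";
  return "ok";
}
```

A client could use the "warning" state to show a banner before requests start failing with 402.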
Related pages
- Knowledge RAG overview
- Evaluation: measuring quality after ingest
- GraphRAG entity model: what buildGraphFromChunks writes
- Plans and limits
- Knowledge module & admin: dashboard view