Knowledge Sources

Supported source types, ingestion lifecycle, chunking strategies, and sync behavior for the Knowledge module.

A data source describes where a Knowledge space's documents come from. The Knowledge module currently supports two source types out of the box: file upload and URL fetch. Connector-driven sources (Confluence, Notion, Google Drive, SharePoint) are on the roadmap.

Source types

`sourceType`	Status	What it ingests	Sync behavior
`FILE_UPLOAD`	✅ Available	PDF, DOCX, TXT, Markdown	One-shot — re-upload to refresh
`URL_FETCH`	✅ Available	A single web page (HTML → text)	One-shot — re-ingest to refresh
`CONFLUENCE`	⏳ Roadmap	Confluence pages (per space / per label)	Scheduled, incremental
`NOTION`	⏳ Roadmap	Notion pages and databases	Scheduled, incremental
`GOOGLE_DRIVE`	⏳ Roadmap	Drive folders (docs, slides, sheets)	Scheduled, incremental
`SHAREPOINT`	⏳ Roadmap	SharePoint document libraries	Scheduled, incremental

The full enum lives in packages/database/prisma/schema.prisma as KnowledgeSourceType. Roadmap entries are placeholders — they don't yet have a DataSource row, parser, or sync worker.

Supported file formats (FILE_UPLOAD)

Format	Status	Notes
`application/pdf`	✅	Extracted with the `pdf-parse` adapter in `packages/api/modules/knowledge/lib/parsers.ts`
`application/vnd.openxmlformats-officedocument.wordprocessingml.document` (DOCX)	✅	`mammoth` adapter
`text/plain` (TXT)	✅	UTF-8 expected. Latin-1 and Windows-1251 fall back to a heuristic decode.
`text/markdown`	✅	Markdown can be split along its heading structure (see the `markdown` strategy under Chunking strategies).
`application/msword` (legacy `.doc`)	❌	Convert to DOCX first.
`application/rtf`	❌	Convert to TXT / DOCX first.
Scanned PDF (image-only)	❌	OCR is on the roadmap. PDFs must have a text layer.

The file-size cap depends on the org plan (the MAX_KNOWLEDGE_FILE_SIZE_BYTES entitlement). The defaults are 25 MB for paid plans and 5 MB for the free plan.
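A client can reject obviously unsupported uploads before calling the API. The sketch below is illustrative only: `checkUpload` and `SUPPORTED_MIME_TYPES` are hypothetical names, the MIME list is copied from the table above, and the cap is passed in by the caller because the real value comes from the org's entitlement. It cannot detect image-only PDFs, which still fail at parse time.

```typescript
// MIME types marked as supported in the table above.
const SUPPORTED_MIME_TYPES = new Set<string>([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "text/plain",
  "text/markdown",
]);

type UploadCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical pre-flight check; the server remains the source of truth.
function checkUpload(
  mimeType: string,
  sizeBytes: number,
  maxBytes: number, // assumed to be resolved from the org's entitlement
): UploadCheck {
  if (!SUPPORTED_MIME_TYPES.has(mimeType)) {
    return { ok: false, reason: `Unsupported MIME type: ${mimeType}` };
  }
  if (sizeBytes > maxBytes) {
    return {
      ok: false,
      reason: `File is ${sizeBytes} bytes; the cap is ${maxBytes} bytes`,
    };
  }
  return { ok: true };
}
```

Running the check client-side only saves a round trip; it does not replace server-side validation.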

Ingestion lifecycle

DataSource ──► IngestionJob (QUEUED)
                    │
                    ▼
            IngestionJob (RUNNING)
                    │
              parse + chunk + embed
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
KnowledgeDocument         (graph build)
        │
        ▼
KnowledgeChunk × N (+ embedding)
                    │
                    ▼
            IngestionJob (SUCCEEDED / FAILED)

Each IngestionJob row carries:

status — QUEUED → RUNNING → SUCCEEDED or FAILED.
totalItems — the parser's estimate of the work unit count (pages / sections).
processedItems, failedItems — running totals visible during execution.
errorMessage — populated on failure. Surfaced in the dashboard's Ingestion Jobs tab.

You can list jobs with knowledge.listIngestionJobs and inspect a single job's inputMeta (the original file name, mime type, bytes) to diagnose failures.
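When polling, the job fields above are enough to derive a progress indicator. This is a minimal sketch, not the dashboard's logic: the `IngestionJobView` shape and `summarizeJob` helper are hypothetical, field nullability is assumed, and the percentage is clamped because `totalItems` is only an estimate.

```typescript
type IngestionJobStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED";

// Hypothetical client-side view of an IngestionJob row.
interface IngestionJobView {
  status: IngestionJobStatus;
  totalItems: number | null; // parser estimate; may be absent (assumption)
  processedItems: number;
  failedItems: number;
  errorMessage: string | null;
}

interface JobSummary {
  done: boolean; // true once the job reached a terminal status
  percent: number | null; // null when no estimate is available
  label: string;
}

function summarizeJob(job: IngestionJobView): JobSummary {
  const done = job.status === "SUCCEEDED" || job.status === "FAILED";
  // Clamp to 100 because the estimate can undershoot the real item count.
  const percent =
    job.totalItems !== null && job.totalItems > 0
      ? Math.min(100, Math.round((job.processedItems * 100) / job.totalItems))
      : null;
  let label: string = job.status;
  if (job.status === "FAILED") {
    label = `FAILED: ${job.errorMessage ?? "no error message recorded"}`;
  } else if (job.status === "RUNNING" && percent !== null) {
    label = `RUNNING (${percent}%, ${job.failedItems} failed)`;
  }
  return { done, percent, label };
}
```

A polling loop can stop as soon as `done` is true and show `label` to the user.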

Retry semantics

Failed jobs are not auto-retried. The recommended pattern:

Inspect errorMessage in the dashboard.
Fix the source (re-encode the PDF, expand the URL, swap a corrupt DOCX) or the environment (for example, an exhausted LLM quota).
Re-run by re-uploading or calling ingestFile / ingestUrl again. The second run creates a new IngestionJob and re-uses the existing DataSource.

Re-ingest replaces the previous KnowledgeDocument (matched by (knowledgeSpaceId, externalId)) and triggers a chunk + embedding refresh. Old chunks are deleted via the documentId cascade.
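The replace-on-re-ingest behavior can be pictured with a small in-memory model. This is purely illustrative: the real store is Postgres, and `InMemoryDocumentStore` and its methods are hypothetical names. It shows only the matching rule, where a document with the same (knowledgeSpaceId, externalId) pair replaces the earlier one together with its chunks.

```typescript
interface StoredDocument {
  knowledgeSpaceId: string;
  externalId: string;
  chunks: string[];
}

// Toy stand-in for the document table, keyed by the composite identity.
class InMemoryDocumentStore {
  private docs = new Map<string, StoredDocument>();

  private key(spaceId: string, externalId: string): string {
    // JSON encoding avoids collisions between ids that contain separators.
    return JSON.stringify([spaceId, externalId]);
  }

  // Insert a document, or replace the one with the same composite key.
  // Replacing drops the old chunks along with the old document.
  upsert(doc: StoredDocument): "created" | "replaced" {
    const k = this.key(doc.knowledgeSpaceId, doc.externalId);
    const existed = this.docs.has(k);
    this.docs.set(k, doc);
    return existed ? "replaced" : "created";
  }

  chunkCount(spaceId: string, externalId: string): number {
    return this.docs.get(this.key(spaceId, externalId))?.chunks.length ?? 0;
  }
}
```

Re-ingesting the same file therefore never accumulates stale chunks, while a file with a different externalId is stored as a separate document.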

Chunking strategies

Chunking is implemented in packages/api/modules/knowledge/lib/chunking.ts. The strategy decides how the parsed text is sliced before embedding:

Strategy	Best for	How it splits
`fixed`	Default. Mixed-format documents.	Fixed word count per chunk with configurable overlap.
`semantic`	Long prose, articles.	Splits on paragraph + sentence boundaries; falls back to fixed when the segment is too long.
`markdown`	Markdown / Confluence-style docs.	Splits on heading levels. Keeps each section together when possible.
`code`	Source code documentation, API references.	Splits on function / class boundaries. Aware of triple-fenced code blocks.

Chunking is configurable per call (per-space configuration is in Beta):

{
  strategy: "semantic",
  chunkSize: 350,      // target words per chunk
  minChunkSize: 60,    // merge anything smaller than this
  maxChunkSize: 600,
  overlap: 50,         // overlapping words between adjacent chunks
}

Rule of thumb: 250–400 words per chunk with 30–50 words of overlap covers most prose. Larger chunks reduce retrieval count but blunt the relevance signal; smaller chunks improve relevance but inflate embedding cost.
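To make the interaction between chunkSize and overlap concrete, here is a minimal sketch of word-based fixed chunking. It is not the code in chunking.ts: `chunkFixed` is a hypothetical helper, it splits on whitespace only, and it ignores minChunkSize and maxChunkSize.

```typescript
// Split text into chunks of `chunkSize` words, where each chunk repeats the
// last `overlap` words of the previous one.
function chunkFixed(text: string, chunkSize: number, overlap: number): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be an integer in [0, chunkSize)");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a chunk has reached the end, so no tail made up purely of
    // overlap words is emitted.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

// Ten words, 4 per chunk, 1 word of overlap:
// chunkFixed("a b c d e f g h i j", 4, 1)
// → ["a b c d", "d e f g", "g h i j"]
```

Each chunk advances by chunkSize minus overlap words, so raising the overlap increases the number of chunks (and the embedding cost) for the same document.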

Embedding

Each chunk's embedding is a Json array of floats sized to the configured model:

Model	`num_dim`	Status
`text-embedding-3-small`	1536	✅ Default
`text-embedding-3-large`	3072	🟡 Beta — per-space selection
Local hashing fallback	128	✅ Available — used when no API key is configured (dev mode only)

KnowledgeSpace.ragConfig.embeddingModel overrides the default per space. The local 128-dim fallback (embedTextLocally in chunking.ts) is dev-only — it does not produce useful retrieval; configure a real model before going to production.

Switching the embedding model invalidates all existing chunk embeddings. After changing it, trigger a full re-ingest of every source in the space.
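One cheap way to spot chunks left over from a previous model is to compare vector length against the expected dimension. The sketch below assumes the dimensions from the table above; `EMBEDDING_DIMENSIONS` and `needsReembedding` are hypothetical names. A length check cannot tell apart two different models that share a dimension, so it complements a full re-ingest rather than replacing it.

```typescript
// Dimensions taken from the model table above.
const EMBEDDING_DIMENSIONS: Record<string, number> = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
};

// Returns true when a stored vector cannot have come from `model`.
function needsReembedding(embedding: number[], model: string): boolean {
  const expected = EMBEDDING_DIMENSIONS[model];
  if (expected === undefined) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  return embedding.length !== expected;
}
```

For example, a 128-dimension vector produced by the local fallback is flagged as stale for either hosted model.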

Ingesting a file

const job = await orpc.knowledge.ingestFile.call({
  spaceId: "ks_…",
  fileName: "handbook.pdf",
  mimeType: "application/pdf",
  bytes: pdfArrayBuffer,
});
// → { ingestionJobId, dataSourceId }

Poll knowledge.listIngestionJobs for status, or subscribe to the dashboard SSE channel.

Ingesting a URL

await orpc.knowledge.ingestUrl.call({
  spaceId: "ks_…",
  url: "https://example.com/docs/payments",
  // optional:
  cssSelector: "main article",   // narrow what to extract
});

Only public, server-rendered HTML pages are supported: no JS execution, no auth headers, no SPA-rendered content. Pages behind a login are on the roadmap (via connectors).
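Some of these restrictions can be checked before calling ingestUrl. The following is a sketch under assumptions: `checkIngestUrl` is a hypothetical helper, and treating both http and https as acceptable while rejecting credentials embedded in the URL is a guess at what "public" implies. It cannot tell whether a page is SPA-rendered or sits behind a login.

```typescript
type UrlCheck = { ok: true; url: string } | { ok: false; reason: string };

// Syntactic pre-flight check using the standard URL class.
function checkIngestUrl(raw: string): UrlCheck {
  let parsed: URL;
  try {
    parsed = new URL(raw); // throws on relative or malformed input
  } catch {
    return { ok: false, reason: "Not a valid absolute URL" };
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { ok: false, reason: `Unsupported scheme: ${parsed.protocol}` };
  }
  if (parsed.username !== "" || parsed.password !== "") {
    return { ok: false, reason: "URLs with embedded credentials are not supported" };
  }
  return { ok: true, url: parsed.toString() };
}
```

A URL that passes this check can still fail ingestion, for example when the fetched HTML contains only a JavaScript bootstrap shell.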

Deleting sources and files

Action	API	Effect
Delete a single file/document	`knowledge.deleteFile({ documentId })`	Cascades to all its `KnowledgeChunk` rows.
Delete a data source	`knowledge.deleteSource({ dataSourceId })`	Cascades to all `KnowledgeDocument` and chunks under it.
Delete a space	`knowledge.deleteSpace({ spaceId })`	Cascades to all sources, documents, chunks, graph nodes, edges.

Cascades are enforced at the Postgres foreign-key level (see onDelete: Cascade in schema.prisma), so deletion is durable and irreversible. Require an explicit confirmation step before exposing these actions in your UI.

Quotas

Per-org quotas come from the entitlements package:

Quota	Field on plan
Max spaces per org	`entitlements.knowledge.maxSpaces`
Max documents per space	`entitlements.knowledge.maxDocsPerSpace`
Max bytes per file	`entitlements.knowledge.maxBytesPerFile`
Monthly embedding budget	`entitlements.knowledge.monthlyEmbeddingKopecks`

Requests that exceed a quota return 402 Payment Required. A quota_warning activity event is emitted to the dashboard when usage reaches 80% of the monthly embedding budget.
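The two thresholds can be expressed as a small classifier. This is an illustration, not the entitlements package: `budgetState` is a hypothetical helper, amounts are assumed to be integer kopecks, and treating spend equal to the budget as exhausted is an assumption.

```typescript
type BudgetState = "ok" | "warning" | "exhausted";

// Classify monthly embedding spend against the budget.
// Integer arithmetic (spent * 5 >= budget * 4) expresses the 80% threshold
// without floating-point rounding.
function budgetState(spentKopecks: number, budgetKopecks: number): BudgetState {
  if (budgetKopecks <= 0 || spentKopecks >= budgetKopecks) {
    return "exhausted"; // further embedding calls would get 402
  }
  if (spentKopecks * 5 >= budgetKopecks * 4) {
    return "warning"; // corresponds to the quota_warning event
  }
  return "ok";
}
```

Surfacing the "warning" state in your own UI gives users time to upgrade before ingestion starts failing.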

Related

Knowledge RAG overview
Evaluation — measuring quality after ingest
GraphRAG entity model — what buildGraphFromChunks writes
Plans and limits
Knowledge module & admin — dashboard view
