
Knowledge Sources

Supported source types, ingestion lifecycle, chunking strategies, and sync behavior for the Knowledge module.

A data source describes where a Knowledge space's documents come from. The Knowledge module currently supports two kinds out of the box, file upload and URL fetch, with connector-driven sources (Confluence, Notion, Google Drive, SharePoint) on the roadmap.

Source types

| sourceType | Status | What it ingests | Sync behavior |
| --- | --- | --- | --- |
| FILE_UPLOAD | ✅ Available | PDF, DOCX, TXT, Markdown | One-shot; re-upload to refresh |
| URL_FETCH | ✅ Available | A single web page (HTML → text) | One-shot; re-ingest to refresh |
| CONFLUENCE | ⏳ Roadmap | Confluence pages (per space / per label) | Scheduled, incremental |
| NOTION | ⏳ Roadmap | Notion pages and databases | Scheduled, incremental |
| GOOGLE_DRIVE | ⏳ Roadmap | Drive folders (docs, slides, sheets) | Scheduled, incremental |
| SHAREPOINT | ⏳ Roadmap | SharePoint document libraries | Scheduled, incremental |

The full enum lives in packages/database/prisma/schema.prisma as KnowledgeSourceType. Roadmap entries are placeholders — they don't yet have a DataSource row, parser, or sync worker.

Supported file formats (FILE_UPLOAD)

| Format | Status | Notes |
| --- | --- | --- |
| application/pdf | ✅ Supported | Extracted with the pdf-parse adapter in packages/api/modules/knowledge/lib/parsers.ts |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX) | ✅ Supported | mammoth adapter |
| text/plain (TXT) | ✅ Supported | UTF-8 expected. Latin-1 and Windows-1251 fall back to a heuristic decode. |
| text/markdown | ✅ Supported | Split along its heading structure (see Chunking strategies below). |
| application/msword (legacy .doc) | ❌ Not supported | Convert to DOCX first. |
| application/rtf | ❌ Not supported | Convert to TXT / DOCX first. |
| Scanned PDF (image-only) | ❌ Not supported | OCR is on the roadmap. PDFs must have a text layer. |

The file-size cap depends on the org plan (MAX_KNOWLEDGE_FILE_SIZE_BYTES entitlement). Defaults are 25 MB for paid plans and 5 MB for the free plan.
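A client can reject obviously invalid uploads before calling the API. The following is a minimal sketch, not the module's own validation: `validateUpload`, `PlanTier`, and the reading of "MB" as 1024 × 1024 bytes are assumptions made for illustration, and the real cap should come from the org's entitlement rather than these hard-coded defaults.

```typescript
// MIME types listed as supported in the formats table above.
const SUPPORTED_MIME_TYPES = new Set<string>([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "text/plain",
  "text/markdown",
]);

// Assumption: "MB" means 1024 * 1024 bytes; the source does not say.
const MB = 1024 * 1024;

// Default caps quoted in the text; a real client would read the entitlement.
const DEFAULT_CAP_BYTES = { paid: 25 * MB, free: 5 * MB };

type PlanTier = "paid" | "free";

interface UploadCheck {
  ok: boolean;
  reason?: string;
}

// Hypothetical helper, not part of the Knowledge module API.
function validateUpload(mimeType: string, sizeBytes: number, tier: PlanTier): UploadCheck {
  if (!SUPPORTED_MIME_TYPES.has(mimeType)) {
    return { ok: false, reason: `Unsupported MIME type: ${mimeType}` };
  }
  const cap = DEFAULT_CAP_BYTES[tier];
  if (sizeBytes > cap) {
    return { ok: false, reason: `File is ${sizeBytes} bytes; the cap is ${cap} bytes` };
  }
  return { ok: true };
}
```

For example, a 10 MB PDF passes on a paid plan but is rejected on the free plan, and a legacy .doc file is rejected regardless of size.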

Ingestion lifecycle

DataSource ──► IngestionJob (QUEUED)
                    │
                    ▼
            IngestionJob (RUNNING)
                    │
              parse + chunk + embed
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
KnowledgeDocument         (graph build)
        │
        ▼
KnowledgeChunk × N (+ embedding)
                    │
                    ▼
            IngestionJob (SUCCEEDED / FAILED)

Each IngestionJob row carries:

  • status: QUEUED → RUNNING → SUCCEEDED or FAILED.
  • totalItems: the parser's estimate of the work unit count (pages / sections).
  • processedItems, failedItems: running totals visible during execution.
  • errorMessage: populated on failure. Surfaced in the dashboard's Ingestion Jobs tab.

You can list jobs with knowledge.listIngestionJobs and inspect a single job's inputMeta (the original file name, mime type, bytes) to diagnose failures.
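The job fields above are enough to render a progress line in a UI. The sketch below assumes a job shape built from those field names; the `IngestionJobView` interface, the nullability of `errorMessage`, and the choice to count failed items toward progress are all assumptions, not the module's actual types.

```typescript
type IngestionStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED";

// Assumed shape: field names follow the list above, exact types are a guess.
interface IngestionJobView {
  status: IngestionStatus;
  totalItems: number;
  processedItems: number;
  failedItems: number;
  errorMessage?: string | null;
}

// A job is terminal once it can no longer change state.
function isTerminal(status: IngestionStatus): boolean {
  return status === "SUCCEEDED" || status === "FAILED";
}

// Hypothetical helper that turns a job row into a one-line summary.
function describeJob(job: IngestionJobView): string {
  if (job.status === "FAILED") {
    return `FAILED: ${job.errorMessage ?? "no error message recorded"}`;
  }
  // totalItems is only an estimate, so guard against zero and clamp at 100%.
  const handled = job.processedItems + job.failedItems;
  const pct =
    job.status === "SUCCEEDED"
      ? 100
      : job.totalItems > 0
        ? Math.min(100, Math.round((handled / job.totalItems) * 100))
        : 0;
  return `${job.status}: ${pct}% (${job.processedItems} ok, ${job.failedItems} failed)`;
}
```

Because `totalItems` is an estimate, the percentage is best treated as a hint rather than an exact measure.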

Retry semantics

Failed jobs are not auto-retried. The recommended pattern:

  1. Inspect errorMessage in the dashboard.
  2. Fix the source (re-encode the PDF, expand the URL, swap a corrupt DOCX) or the environment (for example, an exhausted LLM quota).
  3. Re-run by re-uploading or calling ingestFile / ingestUrl again. The second run creates a new IngestionJob and re-uses the existing DataSource.

Re-ingest replaces the previous KnowledgeDocument (matched by (knowledgeSpaceId, externalId)) and triggers a chunk + embedding refresh. Old chunks are deleted via the documentId cascade.
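The replace-on-re-ingest behavior can be pictured with a small in-memory model. This is an illustration of the semantics described above, not the actual persistence code: `InMemoryKnowledgeStore` is hypothetical, and whether the real implementation reuses the document id or issues a new one is not stated in the source (the sketch issues a new one).

```typescript
interface Doc {
  id: string;
  knowledgeSpaceId: string;
  externalId: string;
}

interface Chunk {
  documentId: string;
  text: string;
}

// Hypothetical in-memory stand-in for the KnowledgeDocument / KnowledgeChunk tables.
class InMemoryKnowledgeStore {
  private docs = new Map<string, Doc>(); // keyed by (knowledgeSpaceId, externalId)
  private chunks: Chunk[] = [];
  private nextId = 1;

  ingest(knowledgeSpaceId: string, externalId: string, chunkTexts: string[]): Doc {
    const key = `${knowledgeSpaceId}::${externalId}`;
    const previous = this.docs.get(key);
    if (previous) {
      // Mimics the documentId cascade: chunks of the replaced document go away.
      const staleId = previous.id;
      this.chunks = this.chunks.filter((c) => c.documentId !== staleId);
    }
    const doc: Doc = { id: `doc_${this.nextId++}`, knowledgeSpaceId, externalId };
    this.docs.set(key, doc);
    for (const text of chunkTexts) {
      this.chunks.push({ documentId: doc.id, text });
    }
    return doc;
  }

  chunkTexts(): string[] {
    return this.chunks.map((c) => c.text);
  }
}
```

Ingesting the same `(knowledgeSpaceId, externalId)` pair twice leaves only the chunks from the second run, while documents with a different `externalId` are untouched.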

Chunking strategies

Chunking is implemented in packages/api/modules/knowledge/lib/chunking.ts. The strategy decides how the parsed text is sliced before embedding:

| Strategy | Best for | How it splits |
| --- | --- | --- |
| fixed | Default. Mixed-format documents. | Fixed word count per chunk with configurable overlap. |
| semantic | Long prose, articles. | Splits on paragraph + sentence boundaries; falls back to fixed when the segment is too long. |
| markdown | Markdown / Confluence-style docs. | Splits on heading levels. Keeps each section together when possible. |
| code | Source code documentation, API references. | Splits on function / class boundaries. Aware of triple-fenced code blocks. |

Chunking options are configurable per call (or per space, in Beta):

{
  strategy: "semantic",
  chunkSize: 350,      // target words per chunk
  minChunkSize: 60,    // merge anything smaller than this
  maxChunkSize: 600,
  overlap: 50,         // overlapping words between adjacent chunks
}

Rule of thumb: 250–400 words per chunk with 30–50 words of overlap covers most prose. Larger chunks reduce retrieval count but blunt the relevance signal; smaller chunks improve relevance but inflate embedding cost.
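To make the interaction between `chunkSize` and `overlap` concrete, here is a minimal sketch of a fixed word-count splitter with overlap. It is not the code in chunking.ts: `chunkFixed` is a hypothetical name, whitespace tokenization is an assumption, and the `minChunkSize` / `maxChunkSize` merging described above is deliberately left out.

```typescript
interface FixedChunkOptions {
  chunkSize: number; // target words per chunk
  overlap: number; // words shared between adjacent chunks
}

// Hypothetical sliding-window splitter: each window starts (chunkSize - overlap)
// words after the previous one, so adjacent chunks share `overlap` words.
function chunkFixed(text: string, options: FixedChunkOptions): string[] {
  const { chunkSize, overlap } = options;
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  // Assumption: a "word" is any run of non-whitespace characters.
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end, to avoid a trailing duplicate tail.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

With ten words, `chunkSize: 4`, and `overlap: 1`, the windows start at words 0, 3, and 6, giving three chunks where each shares one word with its neighbor.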

Embedding

Each chunk's embedding is a Json array of floats sized to the configured model:

| Model | num_dim | Status |
| --- | --- | --- |
| text-embedding-3-small | 1536 | ✅ Default |
| text-embedding-3-large | 3072 | 🟡 Beta; per-space selection |
| Local hashing fallback | 128 | ✅ Available; used when no API key is configured (dev mode only) |

KnowledgeSpace.ragConfig.embeddingModel overrides the default per space. The local 128-dim fallback (embedTextLocally in chunking.ts) is dev-only — it does not produce useful retrieval; configure a real model before going to production.

Switching the embedding model invalidates all existing chunk embeddings. After changing it, trigger a full re-ingest of every source in the space.
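One cheap way to spot chunks that still carry embeddings from a previous model is to compare vector length against the expected dimension. The sketch below assumes embeddings are available as plain number arrays; `findStaleChunks` and `StoredChunk` are hypothetical names, and the dimensions come from the table above.

```typescript
// Dimensions from the embedding table above.
const MODEL_DIMENSIONS = new Map<string, number>([
  ["text-embedding-3-small", 1536],
  ["text-embedding-3-large", 3072],
]);

// Assumed minimal view of a stored chunk.
interface StoredChunk {
  id: string;
  embedding: number[];
}

// Hypothetical helper: returns ids of chunks whose vector length does not
// match the dimension of the currently configured model.
function findStaleChunks(chunks: StoredChunk[], model: string): string[] {
  const expected = MODEL_DIMENSIONS.get(model);
  if (expected === undefined) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  return chunks.filter((c) => c.embedding.length !== expected).map((c) => c.id);
}
```

A length check only catches switches between models of different dimension. Two models with the same dimension still produce incompatible vectors, so a full re-ingest remains the safe option after any model change.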

Ingesting a file

const job = await orpc.knowledge.ingestFile.call({
  spaceId: "ks_…",
  fileName: "handbook.pdf",
  mimeType: "application/pdf",
  bytes: pdfArrayBuffer,
});
// → { ingestionJobId, dataSourceId }

Poll knowledge.listIngestionJobs for status, or subscribe to the dashboard SSE channel.
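A polling loop might look like the sketch below. The response shape of `knowledge.listIngestionJobs` is not documented here, so the lookup is injected as a `fetchJob` callback that the caller implements on top of it; `waitForJob`, `JobSnapshot`, and the default interval and attempt limit are all assumptions.

```typescript
type JobStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED";

// Assumed minimal view of a job; the real row has more fields.
interface JobSnapshot {
  id: string;
  status: JobStatus;
  errorMessage?: string | null;
}

// Caller-supplied lookup, for example a wrapper that calls
// knowledge.listIngestionJobs and picks out the job with this id.
type FetchJob = (ingestionJobId: string) => Promise<JobSnapshot | undefined>;

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: polls until the job is SUCCEEDED or FAILED,
// or gives up after maxAttempts lookups.
async function waitForJob(
  fetchJob: FetchJob,
  ingestionJobId: string,
  options: { intervalMs?: number; maxAttempts?: number } = {},
): Promise<JobSnapshot> {
  const intervalMs = options.intervalMs ?? 2000;
  const maxAttempts = options.maxAttempts ?? 60;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(ingestionJobId);
    if (job !== undefined && (job.status === "SUCCEEDED" || job.status === "FAILED")) {
      return job;
    }
    await sleep(intervalMs);
  }
  throw new Error(`Job ${ingestionJobId} did not finish after ${maxAttempts} polls`);
}
```

The function resolves for both SUCCEEDED and FAILED, so the caller still has to check `status` and `errorMessage` on the returned snapshot.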

Ingesting a URL

await orpc.knowledge.ingestUrl.call({
  spaceId: "ks_…",
  url: "https://example.com/docs/payments",
  // optional:
  cssSelector: "main article",   // narrow what to extract
});

Only public, server-rendered HTML pages are supported: no JS execution, no auth headers, no SPA-rendered content. Pages behind a login are on the roadmap (via connectors).
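A client-side pre-check can catch URLs that clearly fall outside these limits before a job is queued. This is a rough sketch using the standard `URL` class; `checkIngestUrl` is a hypothetical helper, the private-host patterns are a non-exhaustive assumption, and it cannot tell whether a page needs JavaScript or a login.

```typescript
interface UrlCheck {
  ok: boolean;
  reason?: string;
}

// Hypothetical pre-flight check; the server may apply different rules.
function checkIngestUrl(raw: string): UrlCheck {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return { ok: false, reason: "Not a valid absolute URL" };
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { ok: false, reason: `Unsupported scheme: ${parsed.protocol}` };
  }
  // Auth is not supported, so credentials embedded in the URL are rejected.
  if (parsed.username !== "" || parsed.password !== "") {
    return { ok: false, reason: "Embedded credentials are not supported" };
  }
  // Assumption: a few common private-host patterns; this list is not exhaustive.
  const host = parsed.hostname;
  if (host === "localhost" || host.endsWith(".local") || /^(127\.|10\.|192\.168\.)/.test(host)) {
    return { ok: false, reason: "Host does not look publicly reachable" };
  }
  return { ok: true };
}
```

For instance, `https://example.com/docs/payments` passes, while an `ftp:` URL, a URL with `user:password@`, or `http://localhost:3000/` is rejected with a reason string.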

Deleting sources and files

| Action | API | Effect |
| --- | --- | --- |
| Delete a single file/document | knowledge.deleteFile({ documentId }) | Cascades to all its KnowledgeChunk rows. |
| Delete a data source | knowledge.deleteSource({ dataSourceId }) | Cascades to all KnowledgeDocument and chunks under it. |
| Delete a space | knowledge.deleteSpace({ spaceId }) | Cascades to all sources, documents, chunks, graph nodes, edges. |

Cascades are at the Postgres FK level (see onDelete: Cascade in schema.prisma) — deletion is durable and irreversible. Confirm before exposing in your UI.

Quotas

Per-org quotas come from the entitlements package:

| Quota | Field on plan |
| --- | --- |
| Max spaces per org | entitlements.knowledge.maxSpaces |
| Max documents per space | entitlements.knowledge.maxDocsPerSpace |
| Max bytes per file | entitlements.knowledge.maxBytesPerFile |
| Monthly embedding budget | entitlements.knowledge.monthlyEmbeddingKopecks |

Quota exhaustion returns 402 Payment Required. A quota_warning activity event is emitted to the dashboard when usage reaches 80% of the monthly budget.
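The budget rule can be sketched as a pure function over integer kopecks. This is an illustration under stated assumptions rather than the entitlements package's logic: `checkEmbeddingBudget` is hypothetical, a request is assumed to be rejected only when it would push spending past the budget, and the warning is assumed to fire once, on the request that crosses the 80% line.

```typescript
const WARNING_THRESHOLD = 0.8; // 80% of the monthly budget, per the text above

interface BudgetDecision {
  allowed: boolean;
  httpStatus: 200 | 402;
  emitQuotaWarning: boolean;
}

// Hypothetical check. All amounts are integer kopecks.
function checkEmbeddingBudget(
  spentKopecks: number,
  requestCostKopecks: number,
  monthlyBudgetKopecks: number,
): BudgetDecision {
  const after = spentKopecks + requestCostKopecks;
  // Assumption: reject only when the request would exceed the budget.
  if (after > monthlyBudgetKopecks) {
    return { allowed: false, httpStatus: 402, emitQuotaWarning: false };
  }
  const threshold = monthlyBudgetKopecks * WARNING_THRESHOLD;
  // Assumption: warn only on the request that crosses the threshold,
  // so the dashboard is not flooded with repeated events.
  const crossed = spentKopecks < threshold && after >= threshold;
  return { allowed: true, httpStatus: 200, emitQuotaWarning: crossed };
}
```

With a budget of 10,000 kopecks, a request that moves spending from 7,900 to 8,100 is allowed and triggers the warning, and a request that would move it from 9,950 to 10,050 gets a 402.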
