
AUTOMATION · TRANSMISSION

Ingestion Pipelines

Feb 14, 2026 / Lenny & Jarvis

Ingestion is the process of turning volatile inputs into queryable, versioned knowledge. It is not “uploading files”; it is building a pipeline from chaos to structure.

It helps to start with what ingestion is not:

  • Not “storage”: Storing a PDF is not ingestion. Ingestion means the content becomes searchable and retrievable.
  • Not “search”: Full-text search is not ingestion. Ingestion includes semantic understanding via embeddings.
  • Not “one-time”: A one-time import is not ingestion. Ingestion implies a refresh cadence and version tracking.

The Four Layers: Storage vs Index vs Retrieval vs Answering

Before building, understand the distinction between layers:

| Layer | Purpose | Technology Examples |
| --- | --- | --- |
| Storage | Raw files persist here | S3, local filesystem, Google Drive |
| Index | Content becomes searchable | Vector DB (Pinecone, Chroma), SQLite FTS5 |
| Retrieval | Relevant chunks are fetched | Similarity search, BM25, hybrid |
| Answering | LLM synthesizes a response | OpenAI, Anthropic, local models |

Each layer has different constraints. Storage is cheap; indexing is expensive. Retrieval is fast; answering is slow. Conflating these layers leads to brittle systems.

Minimal Reference Architecture

A “RAG for one” pipeline has seven stages. Each stage transforms the data:

Source Connectors
       |
       v
  Parse + Normalize
       |
       v
  Chunking Strategy
       |
       v
   Metadata Tagging
       |
       v
  Embedding / Index
       |
       v
    Retrieval
       |
       v
  Answering (LLM)

Stage 1: Source Connectors

Where does knowledge come from? Common sources:

  • PDFs: Whitepapers, manuals, research papers
  • Docs sites: Official documentation (Azure, AWS, framework docs)
  • Code repos: README files, inline comments, architecture docs
  • Internal wikis: Confluence, Notion, Obsidian vaults
  • Communication: Slack threads, email threads (high noise, use sparingly)

Each source requires a connector. A connector’s job is: fetch, detect changes, and queue for processing.
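The three connector duties can be sketched as follows. This is a minimal illustration, not a reference design: the `Connector` and `InMemoryConnector` classes and the `poll` method are hypothetical names, and change detection by content hash is one assumed approach (ETags or modification times would also work).

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Connector:
    """Hypothetical connector: fetch, detect changes, queue for processing."""
    seen_hashes: dict[str, str] = field(default_factory=dict)  # source id -> last content hash
    queue: deque = field(default_factory=deque)                # items awaiting Parse + Normalize

    def fetch(self, source_id: str) -> bytes:
        # Placeholder: a real connector would read a file, call an API, or crawl a URL.
        raise NotImplementedError

    def poll(self, source_id: str) -> bool:
        """Fetch a source and queue it only if its content changed. Returns True if queued."""
        content = self.fetch(source_id)
        digest = hashlib.sha256(content).hexdigest()
        if self.seen_hashes.get(source_id) == digest:
            return False  # unchanged since the last poll
        self.seen_hashes[source_id] = digest
        self.queue.append((source_id, content))
        return True


class InMemoryConnector(Connector):
    """Toy subclass backed by a dict, for demonstration only."""

    def __init__(self, documents: dict[str, bytes]):
        super().__init__()
        self.documents = documents

    def fetch(self, source_id: str) -> bytes:
        return self.documents[source_id]
```

Polling the same unchanged source twice queues it once; the second call returns False and leaves the queue alone.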

Stage 2: Parse + Normalize

Raw content is messy. Normalization ensures consistency:

  • PDFs: Extract text (pypdf, the maintained successor to PyPDF2, or pdfplumber) and handle tables separately
  • HTML: Strip navigation, ads, and boilerplate (trafilatura, readability)
  • Markdown: Preserve structure, extract code blocks as separate chunks
  • Code: Parse AST for function-level chunking (tree-sitter)

Normalization is where you strip noise. A 50-page PDF with 5 pages of actual content should yield 5 pages of signal.
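To make the HTML case concrete, here is a rough sketch using only Python's standard-library `html.parser`. It is a stand-in for what tools like trafilatura do far more thoroughly; the `SKIP_TAGS` set and the `normalize_html` helper are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers; real pages need a richer heuristic.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            # Collapse runs of whitespace so downstream chunking sees clean text.
            self.parts.append(" ".join(data.split()))


def normalize_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Given a page with a `<nav>`, a `<main>` body, and a `<footer>`, only the text of the body survives.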

Stage 3: Chunking Strategy

Chunking determines what the LLM “sees” at retrieval time. Bad chunking = bad answers.

Three rules of thumb:

  1. Chunk by semantic boundary, not character count: Prefer paragraph breaks, section headers, or function definitions over arbitrary 512-token splits. A chunk should contain one complete idea.

  2. Overlap is insurance, not a solution: 10-20% overlap helps when boundaries are imperfect. But if you need 50% overlap, your chunking strategy is wrong.

  3. Metadata travels with the chunk: Every chunk must know its source (file path, URL), position (page number, section), and version (last modified date). Without this, citations are impossible.
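A minimal sketch of rules 1 and 3 for Markdown input: split on blank lines and headings rather than character counts, and attach source, position, and version to every chunk. The `Chunk` dataclass and `chunk_markdown` function are hypothetical names; overlap (rule 2) is left out, and the heading detection is naive (a `#` comment at the start of a code block would be mistaken for a heading).

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_path: str
    section: str       # nearest heading above the chunk
    chunk_index: int   # position in the document
    doc_version: str


def chunk_markdown(text: str, source_path: str, doc_version: str) -> list[Chunk]:
    chunks = []
    section = ""
    # Blank lines separate paragraphs; each paragraph is one candidate chunk.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        lines = block.splitlines()
        heading = re.match(r"#{1,6}\s+(.*)", lines[0])
        if heading:
            # A heading starts a new section; any text under it stays a chunk.
            section = heading.group(1).strip()
            block = "\n".join(lines[1:]).strip()
            if not block:
                continue
        chunks.append(Chunk(block, source_path, section, len(chunks), doc_version))
    return chunks
```

Because each chunk carries its own section and version, a later citation step needs nothing beyond the chunk itself.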

Stage 4: Metadata Tagging

Metadata enables filtering and citation. Minimum required fields:

| Field | Purpose | Example |
| --- | --- | --- |
| source_path | Where the chunk came from | /docs/api/authentication.md |
| source_url | Original URL (if applicable) | https://docs.example.com/... |
| chunk_index | Position in the document | 3 (third chunk) |
| doc_version | Version or last-modified date | 2026-02-14 or v2.3.1 |
| section | Document section (if extractable) | Authentication > JWT |

Without metadata, retrieval returns “something about authentication.” With metadata, retrieval returns “from the JWT section of the API docs, last updated February 2026.”

Stage 5: Embedding / Index

Embeddings convert text to vectors. The index stores them for similarity search.

Vendor-agnostic options:

  • Local-first: Chroma, LanceDB, SQLite with sqlite-vss
  • Cloud: Pinecone, Weaviate, Qdrant
  • No vector DB: Use BM25 (keyword search) or hybrid approaches

For personal knowledge bases, local-first is often sufficient. A 10,000-chunk index fits in memory on a modern laptop.
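The memory claim is easy to check with back-of-envelope arithmetic. The sketch below assumes float32 vectors (4 bytes per value) and two commonly seen embedding widths, 1536 and 384 dimensions; both widths are assumptions, so substitute the size your model actually produces. Index overhead and the chunk text itself are not counted.

```python
def index_size_mb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the embedding matrix in megabytes (10^6 bytes)."""
    return num_chunks * dims * bytes_per_value / 1_000_000


print(index_size_mb(10_000, 1536))  # 61.44
print(index_size_mb(10_000, 384))   # 15.36
```

Either way the vectors take tens of megabytes, well within laptop memory.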

Stage 6: Retrieval

Retrieval is the query-time operation. Given a question, find relevant chunks.

Retrieval strategies:

  • Pure vector: Cosine similarity between query and chunk embeddings
  • Hybrid: Vector + BM25 (captures both semantic and keyword matches)
  • Filtered: Apply metadata filters before similarity search (e.g., “only docs from 2026”)
  • Re-ranking: Retrieve more chunks than needed, then re-rank with a cross-encoder

The right strategy depends on your data. Dense technical docs benefit from hybrid. Short, distinct articles may need only vector search.
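As an illustration of the pure-vector and filtered strategies combined, here is a brute-force sketch in plain Python. The chunk dictionaries, the `retrieve` function, and its keyword-argument filters are hypothetical; a real vector database would do the same work with an approximate index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, top_k=3, **filters):
    """chunks: list of dicts holding an 'embedding' plus arbitrary metadata keys."""
    # Apply metadata filters first so similarity is computed only on candidates.
    candidates = [c for c in chunks if all(c.get(k) == v for k, v in filters.items())]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:top_k]
```

Calling `retrieve(query_vec, chunks, top_k=2, year=2026)` would rank only the chunks whose `year` field equals 2026.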

Stage 7: Answering (LLM)

The LLM synthesizes retrieved chunks into an answer. This is where citations matter.

Citation requirements:

  • Every claim must reference a source chunk
  • Citations should include source path and section
  • If no chunk supports the claim, the LLM should say “I don’t have information about this”

Prompt the LLM explicitly:

Answer the question using ONLY the provided context chunks.
After each claim, cite the source in brackets: [source_path, section].
If the context doesn't contain the answer, say "I don't have information about this."
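One way to assemble that prompt from retrieved chunks is sketched below. The `build_prompt` function and the chunk dictionary keys (`source_path`, `section`, `text`) are assumptions; the instruction text is the prompt above, and the layout of the context block is just one reasonable choice.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    # Label each chunk with the exact citation string the model is asked to emit.
    context = "\n\n".join(
        f"[{c['source_path']}, {c['section']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the provided context chunks.\n"
        "After each claim, cite the source in brackets: [source_path, section].\n"
        "If the context doesn't contain the answer, say "
        "\"I don't have information about this.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Prefixing each chunk with its citation label means the model can copy the label verbatim, which makes later citation checks a simple string comparison.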

Evaluation: How to Know It Works

An ingestion pipeline without tests is a black box. When you change chunking strategy or add new sources, how do you know quality improved?

The Golden Questions Method

Create a set of 10 questions with known answers. These are your regression tests.

Example golden questions:

1. What is the maximum JWT token expiry recommended in our API docs?
   Expected answer: 24 hours
   Expected source: /docs/api/authentication.md > JWT Configuration

2. How do I reset a user's password via the CLI?
   Expected answer: `portia auth reset-password --user <email>`
   Expected source: /docs/cli/authentication.md > Password Reset

3. What is the rate limit for the free tier?
   Expected answer: 100 requests/minute
   Expected source: /docs/api/rate-limits.md > Tier Limits

Run these questions through your pipeline after every change. Track:

  • Answer accuracy: Does the answer match the expected answer?
  • Source correctness: Does retrieval find the right document?
  • Hallucination rate: Does the LLM invent information not in the chunks?
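A golden-question run can be automated with very little code. The sketch below grades each result as pass, partial, or fail, following the scoring scheme in the Golden Questions Template; the `GoldenQuestion` class, the `pipeline` callable, and the substring match used for answer accuracy are all assumptions (substring matching is a crude proxy and will miss paraphrased answers).

```python
from dataclasses import dataclass


@dataclass
class GoldenQuestion:
    question: str
    expected_answer: str
    expected_source: str


def score(golden: GoldenQuestion, answer: str, sources: list[str]) -> str:
    answer_ok = golden.expected_answer.lower() in answer.lower()
    source_ok = golden.expected_source in sources
    if answer_ok and source_ok:
        return "pass"
    if answer_ok:
        return "partial"  # right answer, wrong source
    return "fail"


def run_suite(goldens: list[GoldenQuestion], pipeline) -> dict[str, str]:
    """pipeline: callable(question) -> (answer_text, list_of_cited_source_paths)."""
    return {g.question: score(g, *pipeline(g.question)) for g in goldens}
```

Storing the resulting dictionary after each pipeline change gives a simple before/after comparison.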

Regression Checks for Source Updates

When source documents change, run targeted tests:

  1. New content detection: Did the pipeline ingest the new version?
  2. Stale content removal: Did the pipeline remove or update outdated chunks?
  3. Golden question stability: Do the golden questions still return their expected answers and sources?
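Checks 1 and 2 reduce to a set difference if every chunk has a content-derived ID. This sketch assumes chunk IDs are SHA-256 hashes of the source path plus chunk text; `chunk_id` and `diff_chunks` are hypothetical helpers, not part of any particular vector store.

```python
import hashlib


def chunk_id(source_path: str, text: str) -> str:
    # Identical text from the same source always maps to the same ID.
    return hashlib.sha256(f"{source_path}\n{text}".encode("utf-8")).hexdigest()


def diff_chunks(source_path: str, indexed_ids: set[str], new_texts: list[str]) -> dict:
    """Compare what the index holds for a document against its freshly chunked version."""
    new_ids = {chunk_id(source_path, t) for t in new_texts}
    return {
        "add": new_ids - indexed_ids,     # new content the index has not seen
        "remove": indexed_ids - new_ids,  # stale chunks no longer in the source
    }
```

An empty `add` set after a known source update suggests the new version was never ingested; a non-empty `remove` set that is never acted on means stale chunks are still being served.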

Spotting Hallucination

Hallucination occurs when the LLM generates claims not supported by retrieved chunks. Detection:

  • Missing citations: If the LLM makes a claim without a citation, flag it
  • Citation verification: Check that cited chunks actually exist and contain the claim
  • Confidence calibration: Low-confidence answers should be labeled as uncertain
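The first two checks can be partly automated. The sketch below assumes answers cite sources as `[source_path, section]` placed before the sentence's closing punctuation, and that retrieved chunks are dictionaries with `source_path` and `section` keys; `check_citations` is a hypothetical helper. It only confirms that a cited chunk exists, not that the chunk actually supports the claim, which still needs a human or a second model.

```python
import re

CITATION = re.compile(r"\[([^\[\],]+),\s*([^\[\]]+)\]")
REFUSAL = "I don't have information about this"


def check_citations(answer: str, chunks: list[dict]) -> dict:
    known = {(c["source_path"], c["section"]) for c in chunks}
    uncited, unknown = [], []
    # Naive sentence split: terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence or REFUSAL in sentence:
            continue  # an explicit refusal needs no citation
        cites = CITATION.findall(sentence)
        if not cites:
            uncited.append(sentence)
        for path, section in cites:
            key = (path.strip(), section.strip())
            if key not in known:
                unknown.append(key)  # cites a chunk that was never retrieved
    return {"uncited": uncited, "unknown": unknown}
```

Any entry in `uncited` or `unknown` is a candidate hallucination worth reviewing.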

Security & Privacy

Ingestion pipelines handle sensitive data. Three non-negotiable rules:

1. Don’t Ingest Secrets

API keys, passwords, and tokens should never enter the pipeline. Pre-filter before anything is parsed or embedded; this is a Context Hygiene requirement, not a nice-to-have:

# Patterns to exclude

- `.env` files
- Files containing `API_KEY`, `SECRET`, `PASSWORD`
- Configuration files with credentials
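A pre-filter for the first two patterns might look like the sketch below. The `should_exclude` function is a hypothetical helper, and the matching is deliberately blunt: it will also exclude harmless documents that merely mention the word `PASSWORD`. For a secrets filter, false positives are the safer failure mode. Detecting credentials in arbitrary configuration files needs a dedicated scanner and is not attempted here.

```python
import re
from pathlib import Path

# Markers taken from the exclusion list above.
SECRET_MARKERS = re.compile(r"API_KEY|SECRET|PASSWORD")


def should_exclude(path: str, content: str) -> bool:
    """Return True if the file should be kept out of the ingestion pipeline."""
    name = Path(path).name
    # Covers ".env" and variants such as ".env.local".
    if name == ".env" or name.startswith(".env."):
        return True
    return bool(SECRET_MARKERS.search(content))
```

Running this check in the connector, before anything is queued, keeps secrets out of every later stage.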

2. Separate Private Documents

If your pipeline covers both public and private docs, maintain separate indexes:

  • Public index: Can be shared, deployed to cloud
  • Private index: Local-only, access-controlled

Never mix them. A query on the public index should never surface private content. If an agent is involved, treat this as a hard boundary and enforce it with a Safety Valve (e.g., tool allowlists + approval gates).

3. Beware Proprietary PDFs

Uploading proprietary PDFs (vendor contracts, licensed research) to cloud embedding services may violate terms. Options:

  • Local embeddings: Run embedding models locally (Ollama, llama.cpp)
  • Redaction: Strip sensitive sections before ingestion
  • Contract review: Check if your license permits third-party processing

Portia Grounding: How We Use This

At Portia Labs, the ingestion pipeline concept maps directly to our workflow:

  • Shipping surface is /site: This is what we deploy. Everything else (/docs, /specs, /intel) is working context.
  • Specs as canonical decisions: /specs/*.md files are the “pinned truths” for the project (see Prompt Engineering). They should be ingested first when building context about the codebase.
  • Intel as knowledge base: /intel/ markdown files are the knowledge base you’re reading now. They’re designed to be ingested and queried.

When an AI agent works in this repo, the ingestion priority is:

  1. Read the spec for the current task
  2. Check existing patterns in /site
  3. Reference /intel for conceptual guidance

This is a manual ingestion pipeline: human-curated, version-controlled, and queryable.

Templates

Source Manifest Template

Track your sources and refresh cadence:

# source-manifest.yaml
sources:
  - name: "API Documentation"
    type: "docs_site"
    url: "https://docs.example.com"
    connector: "trafilatura"
    refresh: "weekly"
    last_ingested: "2026-02-14"

  - name: "Internal Wiki"
    type: "notion"
    database_id: "abc123"
    connector: "notion-api"
    refresh: "daily"
    last_ingested: "2026-02-14"

  - name: "Code Repository"
    type: "github"
    repo: "org/repo"
    paths: ["README.md", "docs/", "src/**/*.md"]
    connector: "github-api"
    refresh: "on_commit"
    last_ingested: "2026-02-14"
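A manifest like this can drive a scheduler. The sketch below assumes the YAML has already been loaded into a list of dictionaries with the same keys (loading is left out to stay within the standard library); the `CADENCE` mapping and `due_sources` function are hypothetical.

```python
from datetime import date, timedelta

# Scheduled cadences only; "on_commit" is event-driven and handled elsewhere.
CADENCE = {"daily": timedelta(days=1), "weekly": timedelta(weeks=1)}


def due_sources(sources: list[dict], today: date) -> list[str]:
    """Return the names of sources whose refresh interval has elapsed."""
    due = []
    for src in sources:
        interval = CADENCE.get(src["refresh"])
        if interval is None:
            continue  # not a time-based cadence
        last = date.fromisoformat(src["last_ingested"])
        if today - last >= interval:
            due.append(src["name"])
    return due
```

With the manifest above and a run date two days after the last ingest, only the daily "Internal Wiki" source would come back as due.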

Golden Questions Template

Create your evaluation set:

# Golden Questions

## Purpose

Regression tests for the ingestion pipeline. Run after every pipeline change.

## Questions

### Q1: [Question text]

- **Expected answer**: [What the answer should be]
- **Expected source**: [Which document/section]
- **Last verified**: [Date]

### Q2: [Question text]

- **Expected answer**: [What the answer should be]
- **Expected source**: [Which document/section]
- **Last verified**: [Date]

## Scoring

- Pass: Answer matches expected AND source is correct
- Partial: Answer correct but wrong source
- Fail: Answer incorrect or hallucinated

An ingestion pipeline is not a feature; it is infrastructure. Build it once, maintain it continuously, and evaluate it rigorously. The goal is not to ingest everything, but to make what you ingest reliably queryable.




Drafted by Jarvis for Portia Labs.