Ingestion Pipelines
Ingestion is the process of turning volatile inputs into queryable, versioned knowledge. It is not “uploading files”; it is building a pipeline from chaos to structure.
To be effective, we must understand what ingestion is not:
- Not “storage”: Storing a PDF is not ingestion. Ingestion means the content becomes searchable and retrievable.
- Not “search”: Full-text search alone is not ingestion. Ingestion usually adds semantic understanding via embeddings.
- Not “one-time”: A one-time import is not ingestion. Ingestion implies a refresh cadence and version tracking.
The Four Layers: Storage vs Index vs Retrieval vs Answering
Before building, understand the distinction between layers:
| Layer | Purpose | Technology Examples |
|---|---|---|
| Storage | Raw files persist here | S3, local filesystem, Google Drive |
| Index | Content becomes searchable | Vector DB (Pinecone, Chroma), SQLite FTS5 |
| Retrieval | Relevant chunks are fetched | Similarity search, BM25, hybrid |
| Answering | LLM synthesizes a response | OpenAI, Anthropic, local models |
Each layer has different constraints. Storage is cheap; index is expensive. Retrieval is fast; answering is slow. Conflating these layers leads to brittle systems.
Minimal Reference Architecture
A “RAG for one” pipeline has seven stages. Each stage transforms the data:
Source Connectors
|
v
Parse + Normalize
|
v
Chunking Strategy
|
v
Metadata Tagging
|
v
Embedding / Index
|
v
Retrieval
|
v
Answering (LLM)
Stage 1: Source Connectors
Where does knowledge come from? Common sources:
- PDFs: Whitepapers, manuals, research papers
- Docs sites: Official documentation (Azure, AWS, framework docs)
- Code repos: README files, inline comments, architecture docs
- Internal wikis: Confluence, Notion, Obsidian vaults
- Communication: Slack threads, email threads (high noise, use sparingly)
Each source requires a connector. A connector’s job is: fetch, detect changes, and queue for processing.
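The "detect changes, and queue" part of a connector can be sketched with a content hash per source. This is a minimal illustration, not a prescribed design: `ChangeDetector`, `submit`, and the in-memory `seen`/`queue` fields are hypothetical names, and a real connector would persist this state between runs.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChangeDetector:
    """Tracks a content hash per source so unchanged documents are skipped."""

    seen: dict[str, str] = field(default_factory=dict)  # source id -> sha256 hex digest
    queue: list[str] = field(default_factory=list)      # source ids awaiting processing

    def submit(self, source_id: str, content: bytes) -> bool:
        """Queue the source if it is new or changed. Returns True when queued."""
        digest = hashlib.sha256(content).hexdigest()
        if self.seen.get(source_id) == digest:
            return False  # unchanged since the last fetch
        self.seen[source_id] = digest
        self.queue.append(source_id)
        return True
```

Hashing the fetched bytes avoids relying on modification timestamps, which some sources do not expose reliably.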
Stage 2: Parse + Normalize
Raw content is messy. Normalization ensures consistency:
- PDFs: Extract text (pypdf, formerly PyPDF2, or pdfplumber), handle tables separately
- HTML: Strip navigation, ads, and boilerplate (trafilatura, readability)
- Markdown: Preserve structure, extract code blocks as separate chunks
- Code: Parse AST for function-level chunking (tree-sitter)
Normalization is where you strip noise. A 50-page PDF with 5 pages of actual content should yield 5 pages of signal.
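One common source of noise in paginated documents is a header or footer repeated on every page. The sketch below drops lines that recur across most pages. It assumes the text has already been extracted into one string per page; `strip_repeated_lines` and the `min_fraction` threshold are illustrative choices, not part of any library.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that recur on most pages (likely headers or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # A line is boilerplate if it appears on at least this many pages (never fewer than 2).
    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() and ln.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned
```

This is a heuristic: it will also remove legitimate lines that happen to repeat on most pages, so spot-check the output on a few real documents.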
Stage 3: Chunking Strategy
Chunking determines what the LLM “sees” at retrieval time. Bad chunking = bad answers.
Three rules of thumb:
- Chunk by semantic boundary, not character count: Prefer paragraph breaks, section headers, or function definitions over arbitrary 512-token splits. A chunk should contain one complete idea.
- Overlap is insurance, not a solution: 10-20% overlap helps when boundaries are imperfect. But if you need 50% overlap, your chunking strategy is wrong.
- Metadata travels with the chunk: Every chunk must know its source (file path, URL), position (page number, section), and version (last modified date). Without this, citations are impossible.
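The first two rules can be sketched as a paragraph-based chunker with a small overlap. This is one possible approach under simple assumptions: paragraphs are separated by blank lines, size is measured in characters rather than tokens, and `chunk_paragraphs` with its parameters is a hypothetical helper.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200, overlap_paragraphs: int = 1) -> list[str]:
    """Group whole paragraphs into chunks, carrying a small overlap forward."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            # Close the current chunk at a paragraph boundary.
            chunks.append("\n\n".join(current))
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph is never split, so a single oversized paragraph produces an oversized chunk. That is a deliberate trade-off here: the boundary stays semantic even when the size limit is exceeded.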
Stage 4: Metadata Tagging
Metadata enables filtering and citation. Minimum required fields:
| Field | Purpose | Example |
|---|---|---|
| source_path | Where the chunk came from | /docs/api/authentication.md |
| source_url | Original URL (if applicable) | https://docs.example.com/... |
| chunk_index | Position in the document | 3 (third chunk) |
| doc_version | Version or last-modified date | 2026-02-14 or v2.3.1 |
| section | Document section (if extractable) | Authentication > JWT |
Without metadata, retrieval returns “something about authentication.” With metadata, retrieval returns “from the JWT section of the API docs, last updated February 2026.”
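As a sketch, the fields in the table can be carried as a small record attached to each chunk. The field names come from the table; the `ChunkMetadata` class and its `citation` method are illustrative assumptions about how you might model it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    source_path: str                  # where the chunk came from
    chunk_index: int                  # position in the document
    doc_version: str                  # version or last-modified date
    source_url: Optional[str] = None  # original URL, if applicable
    section: Optional[str] = None     # document section, if extractable

    def citation(self) -> str:
        """Render a bracketed citation from the source path and section."""
        if self.section:
            return f"[{self.source_path}, {self.section}]"
        return f"[{self.source_path}]"
```

For example, `ChunkMetadata("/docs/api/authentication.md", 3, "2026-02-14", section="Authentication > JWT").citation()` yields `[/docs/api/authentication.md, Authentication > JWT]`.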
Stage 5: Embedding / Index
Embeddings convert text to vectors. The index stores them for similarity search.
Vendor-agnostic options:
- Local-first: Chroma, LanceDB, SQLite with sqlite-vss
- Cloud: Pinecone, Weaviate, Qdrant
- No vector DB: Use BM25 (keyword search) or hybrid approaches
For personal knowledge bases, local-first is often sufficient. A 10,000-chunk index fits in memory on a modern laptop.
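A quick back-of-the-envelope check of that claim: the raw vectors take roughly chunks × dimensions × bytes per value. The dimensions below (384 and 1536) are illustrative assumptions for a small and a larger embedding model, and the estimate ignores index overhead and the chunk text itself.

```python
def index_size_mb(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, in megabytes (10**6 bytes), assuming float32 values."""
    return num_chunks * dimensions * bytes_per_value / 1_000_000


print(index_size_mb(10_000, 384))   # 15.36
print(index_size_mb(10_000, 1536))  # 61.44
```

Either way the vectors are tens of megabytes, well within laptop memory.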
Stage 6: Retrieval
Retrieval is the query-time operation. Given a question, find relevant chunks.
Retrieval strategies:
- Pure vector: Cosine similarity between query and chunk embeddings
- Hybrid: Vector + BM25 (captures both semantic and keyword matches)
- Filtered: Apply metadata filters before similarity search (e.g., “only docs from 2026”)
- Re-ranking: Retrieve more chunks than needed, then re-rank with a cross-encoder
The right strategy depends on your data. Dense technical docs benefit from hybrid. Short, distinct articles may need only vector search.
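To make the "pure vector" and "filtered" strategies concrete, here is a minimal brute-force sketch. It assumes each chunk is a dict with hypothetical `vector`, `doc_version`, and `text` keys, and that `doc_version` is an ISO date string; a vector database would replace the linear scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], chunks: list[dict], top_k: int = 3, year: str = None) -> list[dict]:
    """Apply the metadata filter first, then rank the survivors by similarity."""
    candidates = [c for c in chunks if year is None or c["doc_version"].startswith(year)]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]
```

Filtering before scoring keeps stale documents out of the result set entirely, rather than hoping they rank low.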
Stage 7: Answering (LLM)
The LLM synthesizes retrieved chunks into an answer. This is where citations matter.
Citation requirements:
- Every claim must reference a source chunk
- Citations should include source path and section
- If no chunk supports the claim, the LLM should say “I don’t have information about this”
Prompt the LLM explicitly:
Answer the question using ONLY the provided context chunks.
After each claim, cite the source in brackets: [source_path, section].
If the context doesn't contain the answer, say "I don't have information about this."
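One way to assemble that prompt is to label each retrieved chunk with its citation so the model can copy it verbatim. This is a sketch: `build_prompt` and the chunk keys `source_path`, `section`, and `text` are assumed names, and the layout of the context block is a design choice rather than a requirement.

```python
INSTRUCTIONS = (
    "Answer the question using ONLY the provided context chunks.\n"
    "After each claim, cite the source in brackets: [source_path, section].\n"
    "If the context doesn't contain the answer, say \"I don't have information about this.\""
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine the instructions, labeled context chunks, and the question."""
    context = "\n\n".join(
        f"[{c['source_path']}, {c['section']}]\n{c['text']}" for c in chunks
    )
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"
```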
Evaluation: How to Know It Works
An ingestion pipeline without tests is a black box. When you change chunking strategy or add new sources, how do you know quality improved?
The Golden Questions Method
Create a set of 10 questions with known answers. These are your regression tests.
Example golden questions:
1. What is the maximum JWT token expiry recommended in our API docs?
   Expected answer: 24 hours
   Expected source: /docs/api/authentication.md > JWT Configuration
2. How do I reset a user's password via the CLI?
   Expected answer: `portia auth reset-password --user <email>`
   Expected source: /docs/cli/authentication.md > Password Reset
3. What is the rate limit for the free tier?
   Expected answer: 100 requests/minute
   Expected source: /docs/api/rate-limits.md > Tier Limits
Run these questions through your pipeline after every change. Track:
- Answer accuracy: Does the answer match expected?
- Source correctness: Does retrieval find the right document?
- Hallucination rate: Does the LLM invent information not in the chunks?
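A golden-question run can be scored automatically. The sketch below uses the Pass / Partial / Fail rules from the Golden Questions template; `score_answer` is a hypothetical helper, and the case-insensitive substring match is a deliberately crude stand-in for whatever answer comparison you trust.

```python
def score_answer(expected_answer: str, expected_source: str,
                 answer: str, cited_sources: list[str]) -> str:
    """Return 'pass', 'partial', or 'fail' for one golden question."""
    answer_ok = expected_answer.lower() in answer.lower()
    source_ok = expected_source in cited_sources
    if answer_ok and source_ok:
        return "pass"
    if answer_ok:
        return "partial"  # right answer, wrong source
    return "fail"         # incorrect or unsupported answer
```

Substring matching cannot tell a hallucinated answer from a merely wrong one, so treat the hallucination rate as a separate check.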
Regression Checks for Source Updates
When source documents change, run targeted tests:
- New content detection: Did the pipeline ingest the new version?
- Stale content removal: Did the pipeline remove or update outdated chunks?
- Golden question stability: Do existing answers still work?
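The first two checks amount to comparing what the index holds against what the sources currently contain. A minimal sketch, assuming both sides can be reduced to a mapping from `source_path` to `doc_version` (`plan_update` is a hypothetical name):

```python
def plan_update(indexed: dict[str, str], current: dict[str, str]) -> tuple[list[str], list[str], list[str]]:
    """Compare indexed versions with current source versions.

    Returns (to_add, to_refresh, to_remove) as sorted lists of source paths.
    """
    to_add = sorted(p for p in current if p not in indexed)
    to_refresh = sorted(p for p in current if p in indexed and indexed[p] != current[p])
    to_remove = sorted(p for p in indexed if p not in current)
    return to_add, to_refresh, to_remove
```

After applying the plan, an empty result on a second run is a cheap signal that new content was ingested and stale content was removed.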
Spotting Hallucination
Hallucination occurs when the LLM generates claims not supported by retrieved chunks. Detection:
- Missing citations: If the LLM makes a claim without a citation, flag it
- Citation verification: Check that cited chunks actually exist and contain the claim
- Confidence calibration: Low-confidence answers should be labeled as uncertain
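The first two checks can be partly automated. The sketch below flags sentences with no bracketed citation and citations whose source path was not among the retrieved chunks. It assumes citations follow the `[source_path, section]` format and sit inside the sentence they support; `audit_answer` and the naive sentence split are illustrative only.

```python
import re

# Matches [source_path] or [source_path, section].
CITATION = re.compile(r"\[([^\[\],]+)(?:,\s*([^\[\]]+))?\]")


def audit_answer(answer: str, retrieved_paths: set[str]) -> dict:
    """Flag uncited sentences and citations that point at sources never retrieved."""
    uncited, unknown = [], []
    # Naive split: sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cites = CITATION.findall(sentence)
        if not cites:
            uncited.append(sentence)
        for path, _section in cites:
            if path.strip() not in retrieved_paths:
                unknown.append(path.strip())
    return {"uncited": uncited, "unknown_sources": unknown}
```

This only verifies that a cited source exists in the retrieved set. Checking that the chunk actually contains the claim needs a further step, such as a second model pass or manual review.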
Security & Privacy
Ingestion pipelines handle sensitive data. Three non-negotiable rules:
1. Don’t Ingest Secrets
API keys, passwords, and tokens should never enter the pipeline. Pre-filter (a Context Hygiene requirement, not a nice-to-have):
# Patterns to exclude
- `.env` files
- Files containing `API_KEY`, `SECRET`, `PASSWORD`
- Configuration files with credentials
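A pre-filter along these lines might look like the sketch below. The markers are the ones listed above; `should_exclude` is a hypothetical helper, and plain keyword matching is an assumption that will produce false positives (documentation that merely mentions a password) and miss credentials that do not use these names.

```python
import re
from pathlib import PurePath

SECRET_MARKERS = re.compile(r"API_KEY|SECRET|PASSWORD")


def should_exclude(path: str, text: str) -> bool:
    """Return True if the file should be kept out of the pipeline."""
    name = PurePath(path).name
    # Skip .env and variants such as .env.local.
    if name == ".env" or name.startswith(".env."):
        return True
    # Skip any file whose content mentions a credential marker.
    return bool(SECRET_MARKERS.search(text))
```

Erring toward exclusion is the safer default here: a skipped document can be reviewed and re-added, but an ingested secret may already be embedded and indexed.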
2. Separate Private Documents
If your pipeline covers both public and private docs, maintain separate indexes:
- Public index: Can be shared, deployed to cloud
- Private index: Local-only, access-controlled
Never mix them. A query on the public index should never surface private content. If an agent is involved, treat this as a hard boundary and enforce it with a Safety Valve (e.g., tool allowlists + approval gates).
3. Beware Proprietary PDFs
Uploading proprietary PDFs (vendor contracts, licensed research) to cloud embedding services may violate their license or contract terms. Options:
- Local embeddings: Run embedding models locally (Ollama, llama.cpp)
- Redaction: Strip sensitive sections before ingestion
- Contract review: Check if your license permits third-party processing
Portia Grounding: How We Use This
At Portia Labs, the ingestion pipeline concept maps directly to our workflow:
- Shipping surface is /site: This is what we deploy. Everything else (/docs, /specs, /intel) is working context.
- Specs as canonical decisions: /specs/*.md files are the “pinned truths” for the project (see Prompt Engineering). They should be ingested first when building context about the codebase.
- Intel as knowledge base: /intel/ markdown files are the knowledge base you’re reading now. They’re designed to be ingested and queried.
When an AI agent works in this repo, the ingestion priority is:
- Read the spec for the current task
- Check existing patterns in /site
- Reference /intel for conceptual guidance
This is a manual ingestion pipeline: human-curated, version-controlled, and queryable.
Templates
Source Manifest Template
Track your sources and refresh cadence:
# source-manifest.yaml
sources:
  - name: "API Documentation"
    type: "docs_site"
    url: "https://docs.example.com"
    connector: "trafilatura"
    refresh: "weekly"
    last_ingested: "2026-02-14"
  - name: "Internal Wiki"
    type: "notion"
    database_id: "abc123"
    connector: "notion-api"
    refresh: "daily"
    last_ingested: "2026-02-14"
  - name: "Code Repository"
    type: "github"
    repo: "org/repo"
    paths: ["README.md", "docs/", "src/**/*.md"]
    connector: "github-api"
    refresh: "on_commit"
    last_ingested: "2026-02-14"
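A scheduler can use the `refresh` and `last_ingested` fields to decide which sources are due. The sketch below assumes the manifest has already been parsed into dicts (for example with a YAML library) and that `last_ingested` is an ISO date; `is_due` and the `CADENCE` mapping are hypothetical.

```python
from datetime import date, timedelta

# Scheduled cadences only; event-driven values such as "on_commit" are not listed.
CADENCE = {"daily": timedelta(days=1), "weekly": timedelta(days=7)}


def is_due(source: dict, today: date) -> bool:
    """Return True if a scheduled source has passed its refresh interval."""
    interval = CADENCE.get(source["refresh"])
    if interval is None:
        return False  # e.g. "on_commit" is triggered by the repository, not the clock
    last = date.fromisoformat(source["last_ingested"])
    return today - last >= interval
```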
Golden Questions Template
Create your evaluation set:
# Golden Questions
## Purpose
Regression tests for the ingestion pipeline. Run after every pipeline change.
## Questions
### Q1: [Question text]
- **Expected answer**: [What the answer should be]
- **Expected source**: [Which document/section]
- **Last verified**: [Date]
### Q2: [Question text]
- **Expected answer**: [What the answer should be]
- **Expected source**: [Which document/section]
- **Last verified**: [Date]
## Scoring
- Pass: Answer matches expected AND source is correct
- Partial: Answer correct but wrong source
- Fail: Answer incorrect or hallucinated
An ingestion pipeline is not a feature; it is infrastructure. Build it once, maintain it continuously, and evaluate it rigorously. The goal is not to ingest everything, but to make what you ingest reliably queryable.
Related Intel
- Digital Archaeology: How to Recover a Project You Don’t Understand
- Context Hygiene: How to Keep LLM Work High-Signal
- Agent-to-Agent Protocol: Formatting Knowledge for Your Fleet
- Soul Files: Persistent Memory for Serious Work
- VIB Single-Cell Analysis: Pipeline Case Study
- The Question Latch: Forcing Specification Before Ingestion
- Safety Valve: Protecting Your Pipeline and Data
Work with Portia Labs
If you want help applying this in your own environment:
- Remote Dev Latency Clinic — find the real source of jitter/lag, tune capture + encode + network, and leave with a written plan.
- Agent Workflow Audit — tighten specs/PR discipline + CI guardrails so your system stays reliable.
Drafted by Jarvis for Portia Labs.